Docker Pull Is Painfully Slow: DNS, MTU, and Proxy Fixes That Actually Work

You run docker pull and it crawls like it’s downloading over a 1999 dial-up modem that’s also on fire. Layers pause. “Waiting” sits there smirking. CI pipelines time out. Someone says “must be Docker Hub,” and you feel your soul leave your body.

Slow pulls are rarely “just slow internet.” They’re usually one of three boring villains: DNS that lies or stalls, MTU that blackholes packets, or proxies that rewrite reality. Sometimes all three, stacked like a corporate lasagna.

Fast diagnosis playbook (check this first)

This is the “stop guessing” sequence. It’s optimized for the real world: you have one terminal, no patience, and a pipeline yelling at you. A compact command sketch covering the first few checks follows the list.

1) Decide: network download or local extraction?

  • If the pull stalls at “Downloading” or “Waiting”, think DNS/proxy/MTU/CDN.
  • If it races through download then hangs at “Extracting” or your disk pegs, think storage driver / filesystem / disk IO.

2) Check DNS latency and correctness

  • Run a query against the exact registry hostname (and your chosen resolver).
  • If DNS takes >100ms consistently, or returns weird private IPs, fix DNS before you touch anything else.

3) Check path MTU (the blackhole test)

  • If TLS handshakes or large layer downloads stall, run an MTU probe with “do not fragment”.
  • MTU issues look like “some things work, big transfers freeze.” Classic.

4) Check proxy environment and daemon proxy config

  • Mismatch between user shell proxy vars and Docker daemon proxy config causes head-scratching partial failures.
  • If you have a TLS-inspecting proxy, expect extra latency and occasional broken HTTP/2 behavior.

5) Check IPv6 behavior (dual-stack timeouts)

  • Broken IPv6 often shows up as long pauses before falling back to IPv4.
  • Don’t permanently disable IPv6 unless you own the consequences. Prefer fixing the route/RA/DNS AAAA first.

6) Only then: registry throttling and mirrors

  • Rate limits and CDN edge selection can be real; don’t make them your first theory because it feels comforting.
  • A mirror can help, but also adds a new failure domain. Choose wisely.
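
If you want steps 2–4 in one pass, here is a minimal sketch (it assumes dig and iputils ping are installed; registry-1.docker.io is just an example hostname, swap in whichever registry is slow for you):

cr0x@server:~$ REG=registry-1.docker.io
cr0x@server:~$ dig "$REG" +stats +tries=1 +time=2 | grep 'Query time'    # DNS latency
cr0x@server:~$ ping -M do -s 1472 -c 2 -W 1 "$REG"                       # does a 1500-byte frame survive the path?
cr0x@server:~$ sudo systemctl show docker --property=Environment         # daemon proxy settings
cr0x@server:~$ env | grep -i _proxy                                      # shell proxy settings (often different)

If the dig line is slow, the ping probe errors out, or the two proxy views disagree, you already know which section below to read first.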

One quote to keep you honest, trimmed from Richard Feynman: “Reality must take precedence over public relations.” In ops, the logs and packet traces are reality.

What docker pull actually does on the wire

docker pull isn’t one download. It’s a pipeline of small decisions:

  1. Name resolution: resolve the registry hostname (DNS; possibly multiple A/AAAA answers).
  2. TCP connect: establish a connection to an edge node (CDN) or registry service.
  3. TLS handshake: negotiate encryption, validate certificates, maybe do OCSP/CRL checks.
  4. Auth: get a token from an auth endpoint; sometimes additional redirects.
  5. Manifest fetch: pull JSON manifest(s), choose platform/architecture, locate layers.
  6. Layer downloads: parallel HTTP range requests for blobs; retries; backoff.
  7. Decompression + extraction: CPU + disk heavy; metadata heavy; depends on storage driver.
  8. Content store bookkeeping: containerd/Docker update local metadata; garbage collection later.

Slowness hides in any step. DNS can add seconds per hostname. MTU issues can hang large TLS records. Proxies can break keep-alives so every layer pays connection setup costs again. And storage can turn “downloaded” into “still waiting” because extraction is choking.
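
You can replay the front half of that pipeline by hand. This is a sketch against Docker Hub’s public token service and v2 API (it assumes curl and jq are installed; other registries use different auth endpoints):

cr0x@server:~$ TOKEN=$(curl -fsSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" | jq -r .token)
cr0x@server:~$ curl -fsSL -o /dev/null \
    -H "Authorization: Bearer $TOKEN" \
    -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.index.v1+json" \
    -w 'manifest fetch: dns=%{time_namelookup}s tls=%{time_appconnect}s total=%{time_total}s\n' \
    "https://registry-1.docker.io/v2/library/alpine/manifests/latest"

If the token call or the manifest fetch is already slow here, Docker was never your problem; the path to the registry is.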

Joke #1: “It’s always DNS” is funny because it’s true—and because it’s the only thing keeping us from screaming.

Interesting facts and short history

  • Docker image distribution borrowed heavily from Git thinking. Layers behave like reusable commits: pull once, reuse forever (until you don’t).
  • The registry protocol evolved. Registry v1 was deprecated; v2 improved content addressing and enabled better caching and CDN integration.
  • Content addressability matters. Modern registries key blobs by digest; caching works because the same digest is the same content everywhere.
  • containerd is the workhorse. Many “Docker” pulls on modern systems actually run through containerd’s content store and snapshotters.
  • HTTP range requests are common. Large blobs may be fetched in chunks; flaky networks can cause repeated partial fetches that look like “slow.”
  • PMTUD has been a recurring pain since the 1990s. Firewalls dropping ICMP “Fragmentation Needed” still break modern workloads today.
  • Corporate proxies changed the failure profile. TLS interception, idle timeouts, and header rewriting cause problems that look like “Docker is buggy.”
  • IPv6 dual-stack can amplify latency. Happy Eyeballs helps, but broken IPv6 can still add seconds of delay per connection if misconfigured.
  • Layer extraction is metadata-heavy. Millions of small files (common in language runtimes) punish overlay filesystems and slow disks.

Practical diagnostic tasks (commands, outputs, decisions)

You want tasks that produce evidence. Here are more than a dozen. Each one includes: a runnable command, what output means, and what you do next.

Task 1: Time the pull with debug logs

cr0x@server:~$ sudo docker -D pull alpine:3.19
DEBU[0000] pulling image                               image=alpine:3.19
3.19: Pulling from library/alpine
DEBU[0001] resolved tags                               ref=alpine:3.19
DEBU[0002] received manifest                           digest=sha256:...
DEBU[0003] fetching layer                              digest=sha256:...
f56be85fc22e: Downloading [==================>                                ]  1.2MB/3.4MB
DEBU[0015] attempting next endpoint
DEBU[0035] Download complete
DEBU[0036] Extracting
DEBU[0068] Pull complete

What it means: Debug output shows where the time goes: resolving, fetching the manifest, downloading blobs, extracting. The -D flag gives you client-side debug; for daemon-side detail, set "debug": true in daemon.json and follow journalctl -u docker.

Decision: If the stall is before “fetching layer”, chase DNS/auth/proxy. If it’s “Downloading” with retries, chase MTU/proxy/CDN. If it’s “Extracting”, chase storage/CPU.

Task 2: Confirm daemon and client versions (behavior changes)

cr0x@server:~$ docker version
Client: Docker Engine - Community
 Version:           26.1.1
 API version:       1.45
 Go version:        go1.22.2
 OS/Arch:           linux/amd64

Server: Docker Engine - Community
 Engine:
  Version:          26.1.1
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.22.2
  OS/Arch:          linux/amd64
 containerd:
  Version:          1.7.19
 runc:
  Version:          1.1.12

What it means: Different Docker/containerd versions have different defaults for networking, concurrency, and registry behavior.

Decision: If you’re ancient (or wildly new) and seeing weirdness, test on a known-good version before redesigning your network.

Task 3: Identify which registry endpoints are used

cr0x@server:~$ docker pull --quiet busybox:latest && docker image inspect busybox:latest --format '{{.RepoDigests}}'
docker.io/library/busybox:latest
[docker.io/library/busybox@sha256:3f2...]

What it means: You’re pulling from Docker Hub (docker.io), which often redirects to CDN endpoints.

Decision: If only Docker Hub is slow but other registries are fine, suspect CDN routing, rate limits, or a proxy policy specific to those hostnames.

Task 4: Inspect daemon DNS settings and overall config

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "dns": ["10.10.0.53", "1.1.1.1"],
  "log-level": "info"
}

What it means: Docker daemon may use specific resolvers that differ from your host’s resolver.

Decision: If those DNS servers are slow/unreachable from the daemon’s network namespace, fix or remove them. Prefer your organization’s fast, local caching resolvers.

Task 5: Measure DNS query time against the resolver Docker uses

cr0x@server:~$ dig @10.10.0.53 registry-1.docker.io +stats +tries=1 +time=2
;; ANSWER SECTION:
registry-1.docker.io. 60 IN A 54.87.12.34
registry-1.docker.io. 60 IN A 54.87.98.21

;; Query time: 420 msec
;; SERVER: 10.10.0.53#53(10.10.0.53)
;; WHEN: Tue Jan 02 10:11:07 UTC 2026
;; MSG SIZE  rcvd: 92

What it means: 420ms per lookup is brutal when a pull triggers many lookups (auth endpoints, token services, CDN names).

Decision: Fix DNS performance first: local cache, resolver placement, or reduce upstream latency. Don’t touch MTU yet.
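
One query can be a fluke. A quick loop (sketch; point it at whichever resolver the daemon’s host actually uses) shows whether the latency is consistent and whether caching kicks in on repeats:

cr0x@server:~$ for i in 1 2 3 4 5; do dig @10.10.0.53 registry-1.docker.io +tries=1 +time=2 | grep 'Query time'; done

If the first query is slow but repeats drop to a couple of milliseconds, the upstream is far away and the cache is carrying you. If every query is slow, there is no effective caching and every pull pays full price.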

Task 6: Check for DNS search-domain disasters

cr0x@server:~$ cat /etc/resolv.conf
search corp.example internal.example
nameserver 10.10.0.53
options ndots:5 timeout:2 attempts:3

What it means: ndots:5 means many hostnames get treated as “relative” first, triggering multiple useless queries per lookup.

Decision: In environments where you pull from many external registries, consider lowering ndots (often to 1–2) or tightening search domains. Test carefully; this can break internal name expectations.
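
A cheap way to see whether the search list is biting, without editing anything: compare a lookup that goes through ndots/search expansion against the same name with a trailing dot, which marks it absolute and skips the list (sketch using glibc’s getent, so it exercises the same resolver path most programs use):

cr0x@server:~$ time getent hosts registry-1.docker.io     # subject to ndots and search-domain expansion
cr0x@server:~$ time getent hosts registry-1.docker.io.    # absolute name, no expansion

A large gap between the two is your search-domain amplification, measured instead of argued about.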

Task 7: Verify if systemd-resolved is in play (and misbehaving)

cr0x@server:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.10.0.53
       DNS Servers: 10.10.0.53 1.1.1.1

What it means: The host is using systemd-resolved’s stub. The daemon itself can query 127.0.0.53 fine (it runs in the host’s network namespace), but containers can’t reach a loopback stub, so Docker substitutes the upstream resolvers it detects; if those are slow, wrong, or unreachable, container DNS suffers while the host looks healthy.

Decision: If Docker containers or the daemon can’t reach the stub, set explicit DNS servers in daemon.json or adjust your resolver setup.

Task 8: Observe live connection attempts (IPv6 vs IPv4, retries)

cr0x@server:~$ sudo ss -tpn dst :443 | head
ESTAB 0 0 10.20.30.40:51524 54.87.12.34:443 users:(("dockerd",pid=1321,fd=62))
SYN-SENT 0 1 10.20.30.40:51526 2600:1f18:2148:...:443 users:(("dockerd",pid=1321,fd=63))

What it means: Docker is trying IPv6 and hanging in SYN-SENT while IPv4 works. That’s the “dual-stack timeout tax.”

Decision: Fix IPv6 routing/DNS AAAA correctness, or temporarily prefer IPv4 for the daemon if you need an immediate stop-the-bleeding mitigation.
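
To put a number on that tax, force each address family separately. A sketch with curl’s timing variables against the registry’s /v2/ endpoint (a 401 response is expected and fine; only the timing matters):

cr0x@server:~$ curl -4 -s -o /dev/null --max-time 15 -w 'ipv4: connect=%{time_connect}s tls=%{time_appconnect}s\n' https://registry-1.docker.io/v2/
cr0x@server:~$ curl -6 -s -o /dev/null --max-time 15 -w 'ipv6: connect=%{time_connect}s tls=%{time_appconnect}s\n' https://registry-1.docker.io/v2/

If the -6 run times out or takes seconds while -4 connects instantly, you have measured the dual-stack timeout tax directly.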

Task 9: Test MTU with a no-fragment ping probe

cr0x@server:~$ ping -M do -s 1472 registry-1.docker.io -c 2
PING registry-1.docker.io (54.87.12.34) 1472(1500) bytes of data.
ping: local error: message too long, mtu=1460
ping: local error: message too long, mtu=1460

--- registry-1.docker.io ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1024ms

What it means: Your path MTU is 1460 (typical when overhead exists: VPN, tunnels, some cloud fabrics). Trying to push 1500-byte frames blackholes or errors.

Decision: Set Docker bridge MTU to match the real path MTU (or slightly below), and/or fix the network so PMTUD works (ICMP “frag needed” must pass).
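
To find the real path MTU instead of guessing, sweep the probe size downward. Payload plus 28 bytes of IP and ICMP headers equals the frame size, so 1472 probes a 1500 MTU and 1392 probes 1420 (sketch):

cr0x@server:~$ for sz in 1472 1452 1432 1412 1392; do \
    ping -M do -s "$sz" -c 1 -W 1 registry-1.docker.io >/dev/null 2>&1 \
      && echo "payload $sz (frame $((sz+28))) OK" \
      || echo "payload $sz (frame $((sz+28))) blocked"; \
  done

The largest frame that passes is the number to design your Docker and tunnel MTUs around.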

Task 10: Confirm interface MTU and tunnel overhead

cr0x@server:~$ ip link show | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
3: tun0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 500
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default

What it means: Your VPN tunnel MTU is 1420 but Docker bridge is 1500. Containers might emit packets too big for the real path, and PMTUD might not save you.

Decision: Align Docker MTU with the smallest effective MTU in the egress path.
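
Also confirm which interface the registry traffic actually leaves through; it is easy to stare at eth0 while a split-tunnel rule quietly routes the registry through tun0. The IP below is a placeholder for one of the addresses DNS returned, and the output is illustrative:

cr0x@server:~$ ip route get 54.87.12.34
54.87.12.34 via 10.8.0.1 dev tun0 src 10.8.0.6 uid 1000
    cache

If the answer says dev tun0, the tunnel’s 1420 is your real ceiling no matter what eth0 advertises.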

Task 11: Check daemon proxy configuration (systemd drop-in)

cr0x@server:~$ sudo systemctl cat docker | sed -n '1,120p'
# /lib/systemd/system/docker.service
[Service]
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

# /etc/systemd/system/docker.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy.corp.local:3128"
Environment="HTTPS_PROXY=http://proxy.corp.local:3128"
Environment="NO_PROXY=localhost,127.0.0.1,.corp.local"

What it means: Docker daemon is configured to use a proxy. That affects registry access, TLS termination, and sometimes performance dramatically.

Decision: If proxy is required, ensure it’s stable and supports modern TLS. If it’s optional, try bypassing for registry hostnames via NO_PROXY to reduce latency and complexity.
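
To measure what the proxy itself costs, time the same request with and without it. A sketch, assuming the proxy hostname from the drop-in above and that direct egress is allowed for testing:

cr0x@server:~$ curl -s -o /dev/null -w 'direct:    tls=%{time_appconnect}s total=%{time_total}s\n' https://registry-1.docker.io/v2/
cr0x@server:~$ curl -s -o /dev/null -w 'via proxy: tls=%{time_appconnect}s total=%{time_total}s\n' \
    -x http://proxy.corp.local:3128 https://registry-1.docker.io/v2/

A consistently large gap is the evidence you bring to the proxy team, instead of vibes.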

Task 12: Verify your shell proxy vars aren’t misleading you

cr0x@server:~$ env | egrep -i 'http_proxy|https_proxy|no_proxy'
HTTP_PROXY=http://proxy.corp.local:3128
HTTPS_PROXY=http://proxy.corp.local:3128
NO_PROXY=localhost,127.0.0.1

What it means: Your shell settings differ from the daemon’s. You can curl fine in your shell while Docker daemon takes a different route—or vice versa.

Decision: Treat daemon config as authoritative for pulls. Align proxy settings across daemon, CI runners, and the host.

Task 13: Test TLS handshake time to likely endpoints

cr0x@server:~$ time bash -lc 'echo | openssl s_client -connect registry-1.docker.io:443 -servername registry-1.docker.io -brief 2>&1 | head -n 3'
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_128_GCM_SHA256

real    0m2.418s
user    0m0.032s
sys     0m0.012s

What it means: 2.4 seconds for a handshake is suspicious unless you’re on a high-latency link. Proxy TLS interception and broken IPv6 commonly cause this.

Decision: If handshake is slow, verify DNS resolution path, proxy path, and IPv6 attempts before blaming Docker.

Task 14: Observe ICMP “frag needed” or retransmits during pulls

cr0x@server:~$ sudo tcpdump -ni eth0 '(icmp and (icmp[0]=3 and icmp[1]=4)) or (tcp and port 443 and (tcp[tcpflags] & (tcp-rst) != 0))' -c 20
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.20.30.1 > 10.20.30.40: ICMP 54.87.12.34 unreachable - need to frag (mtu 1460), length 556
IP 54.87.12.34.443 > 10.20.30.40.51524: Flags [R.], seq 3321, ack 1, win 0, length 0

What it means: Seeing “need to frag” is actually good: PMTUD is working and telling you the MTU. Seeing RSTs could indicate proxy resets, idle timeouts, or middlebox behavior.

Decision: If you see “need to frag” but downloads still hang, something is still blocking ICMP on part of the path. If you see resets, focus on proxy/firewall policy and TCP idle timers.

Task 15: Check extraction bottleneck: disk and CPU

cr0x@server:~$ sudo iostat -xz 1 3
Linux 6.5.0 (server)  01/02/2026  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.00    0.00    6.00   42.00    0.00   40.00

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   w_await  aqu-sz  %util
nvme0n1          5.00    80.00     0.00   0.00   10.00    16.00  450.00  62000.00   35.00   18.00  98.00

What it means: High iowait and near-100% disk utilization means extraction is gated by storage. Pull “slowness” isn’t your network.

Decision: Address filesystem/snapshotter choice, disk performance, and image composition (too many small files). Consider moving Docker root to faster storage.
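
Before blaming hardware, confirm which storage driver is in use and where Docker’s data root actually lives (sketch; the template fields below exist on current engines, but verify on very old ones):

cr0x@server:~$ docker info --format '{{.Driver}} {{.DockerRootDir}}'
overlay2 /var/lib/docker
cr0x@server:~$ df -hT /var/lib/docker

A data root on NFS, a nearly full filesystem, or an exotic storage driver explains a lot of “slow pulls” that were never about the network.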

DNS failure modes that make pulls “slow”

DNS is the first domino. Slow DNS looks like slow downloads, because every new endpoint or redirect starts with a lookup. And registries are not shy about hostnames: token services, registry endpoints, CDN edges, sometimes geo-based answers.

Common DNS patterns that hurt pulls

  • Slow recursive resolvers: centralized DNS servers across a WAN; every query pays RTT.
  • Broken caching: resolvers that don’t cache effectively (misconfiguration, tiny cache, aggressive flush policy).
  • Search domain amplification: ndots and long search lists causing multiple futile queries per name.
  • Split-horizon surprises: internal DNS returning private or intercept IPs for public registries.
  • DNSSEC/validation latency: validation failures leading to retries and fallback behavior.

What to do (opinionated)

  • Use a nearby caching resolver for build nodes. If your “central resolver” is across continents, that’s a design bug.
  • Don’t hardcode public resolvers as a reflex. They can be fast, but they can also violate corporate routing, break split DNS, or get blocked.
  • Fix ndots/search domains intentionally. People copy/paste resolver configs like they’re harmless. They are not.

Concrete fix: set Docker daemon DNS explicitly

If containers (not the host) suffer DNS delay, set daemon DNS. The "dns" key sets the resolvers handed to containers; the daemon itself resolves registry names through the host’s resolver, so this rarely changes pull speed directly, but it prevents secondary pain once containers start.

cr0x@server:~$ sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "dns": ["10.10.0.53", "10.10.0.54"],
  "log-level": "info"
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker

Decision: If restart fixes it, you had resolver drift or the daemon was using a stub it couldn’t reliably reach.
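
To verify the change from inside a container’s network namespace, resolve something from a throwaway container (sketch; busybox’s nslookup is crude but shows which server answered):

cr0x@server:~$ docker run --rm busybox nslookup registry-1.docker.io

The Server line should show 10.10.0.53 or 10.10.0.54, not a loopback stub the container cannot reach.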

MTU and PMTUD: the silent packet murder mystery

MTU problems don’t usually fail loudly. They fail like a bad meeting: everyone is technically present, and nothing moves forward.

The pattern: small requests work, big transfers stall. You can ping. You can fetch a small manifest. Then a blob starts downloading and stops at a suspiciously consistent byte count, or it hangs forever with retries.

How MTU breaks pulls

  • Blackholed PMTUD: ICMP “Fragmentation Needed” is blocked by a firewall/middlebox, so endpoints keep sending packets that never arrive.
  • Overlay and tunnel overhead: VPNs, GRE, VXLAN, WireGuard, IPSec all reduce effective MTU.
  • Mismatched bridge MTU: Docker networks set at 1500 while the real path is smaller.

Fix: set Docker network MTU deliberately

You can set the default MTU for Docker’s bridge networks via daemon configuration. Choose a value that fits your egress path: never higher than the smallest MTU on the way out. If your tunnel is 1420, set Docker to 1420 at most, or a bit lower (say 1400) if you want headroom for extra encapsulation.

cr0x@server:~$ sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "mtu": 1420,
  "log-level": "info"
}
EOF
sudo systemctl restart docker

What it means: The default bridge (and, depending on engine version, newly created networks) will use this MTU; existing networks may need to be recreated. Remember that the daemon pulls images over the host’s network stack, so the host and tunnel MTU govern the pull itself; the bridge MTU matters once containers generate their own traffic.

Decision: If pulls stabilize immediately after aligning MTU (especially on VPN), you’ve found the culprit. Then go fix PMTUD properly so you’re not papering over the network.
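
One caveat: depending on engine version, the daemon-level mtu may only cover the default bridge. User-defined networks can carry their own MTU as a driver option at creation time (sketch; the network name is arbitrary):

cr0x@server:~$ docker network create -o com.docker.network.driver.mtu=1420 build-net
cr0x@server:~$ docker network inspect build-net --format '{{index .Options "com.docker.network.driver.mtu"}}'
1420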

Fix: allow ICMP for PMTUD

If you control firewalls, allow the specific ICMP types needed for PMTUD (IPv4 “frag needed”, IPv6 Packet Too Big). Blocking them is a legacy superstition.

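What that looks like depends on your firewall stack. As an iptables-flavored sketch (translate to nftables or your vendor’s syntax, and scope it to your actual policy rather than copying blindly):

cr0x@server:~$ sudo iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT
cr0x@server:~$ sudo ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
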
Joke #2: Some networks block ICMP “for security,” which is like removing road signs to prevent speeding.

Corporate proxies: when “helpful” becomes harmful

Proxies are either a necessary compliance tool or a performance tax collector. Sometimes both. Docker pulls stress proxies in a few ways: lots of concurrent connections, large TLS flows, redirects, and a mix of endpoints.

Proxy behaviors that slow pulls

  • Short idle timeouts: blob downloads pause, proxy drops the connection, client retries.
  • TLS interception: adds handshake overhead, breaks session reuse, and can interact badly with HTTP/2 or modern ciphers.
  • Connection concurrency limits: proxy throttles parallel downloads so each blob crawls.
  • Incorrect NO_PROXY: daemon sends registry traffic through proxy when it shouldn’t.

Fix: configure daemon proxy correctly (and completely)

Set proxy for the Docker daemon via systemd drop-in, not just your shell environment.

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/proxy.conf >/dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://proxy.corp.local:3128"
Environment="HTTPS_PROXY=http://proxy.corp.local:3128"
Environment="NO_PROXY=localhost,127.0.0.1,::1,registry.corp.local,.corp.local"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker

What it means: The daemon now consistently uses the proxy settings. This removes the “works in curl, fails in docker” split-brain.

Decision: If pulls improve, your previous state was inconsistent. If pulls worsen, your proxy is the bottleneck—bypass for registries where allowed.
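
Trust, but verify: ask the daemon what proxy settings it actually loaded rather than re-reading the drop-in. Recent engines expose them in docker info (a sketch; field names can differ on older versions):

cr0x@server:~$ docker info --format 'http={{.HTTPProxy}} https={{.HTTPSProxy}} no={{.NoProxy}}'
http=http://proxy.corp.local:3128 https=http://proxy.corp.local:3128 no=localhost,127.0.0.1,::1,registry.corp.local,.corp.local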

Fix: add a local registry mirror (carefully)

A mirror can eliminate proxy traversal and reduce WAN traffic. It can also become your new single point of failure. If you run a mirror, run it like production infrastructure: monitoring, storage capacity, TLS hygiene, and predictable eviction.
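
If you do run one, pointing the daemon at it is one key in daemon.json (the sketch below writes a minimal file; fold the key into whatever daemon.json you already maintain, and the mirror URL is a placeholder). Note that registry-mirrors only applies to Docker Hub pulls; other registries need their own caching setup:

cr0x@server:~$ sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "registry-mirrors": ["https://mirror.corp.local"],
  "log-level": "info"
}
EOF
sudo systemctl restart docker

The daemon falls back to the upstream registry if the mirror cannot serve a blob, which is exactly why a half-broken mirror shows up as “sometimes fast, sometimes slow” rather than a clean failure.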

IPv6, dual-stack, and the long timeout

Broken IPv6 is a special kind of pain: everything looks “mostly fine,” but each new connection pays a penalty as clients try IPv6 first, wait, then fall back.

Diagnose dual-stack delay

Look for SYN-SENT on v6, or long gaps before the first byte. Also check if DNS returns AAAA records that route nowhere from your hosts.

Mitigations (pick the least bad)

  • Fix IPv6 properly: correct routes, RA, firewall rules, and DNS AAAA answers.
  • Prefer IPv4 temporarily: as an emergency measure if CI is down and you need images now.

If you choose the temporary IPv4 preference, be explicit about it and track it as a debt item. “Temporary” settings have a habit of becoming archaeology.
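
On glibc systems, the least invasive IPv4-preference knob is address-selection precedence in /etc/gai.conf rather than disabling IPv6 outright. A sketch; note this affects the whole host, and Go binaries such as dockerd do their own address sorting and may ignore it, so re-run the ss check from Task 8 afterwards to confirm it changed anything:

cr0x@server:~$ echo 'precedence ::ffff:0:0/96  100' | sudo tee -a /etc/gai.conf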

Registry throttling, rate limits, and CDN oddities

Yes, sometimes it really is the registry. Rate limits, throttling, or an unlucky CDN edge can turn pulls into sludge. But diagnose first. Don’t use “the internet is slow” as your monitoring strategy.

Signs you’re being throttled

  • Pulls are fast off-hours and slow during business hours.
  • Errors mention too many requests, or you see frequent 429-like behavior in higher-level tooling.
  • Only one registry is slow; others behave normally.

Fixes that are usually worth it

  • Authenticate pulls where applicable to get higher limits.
  • Use an internal cache/mirror for CI fleets, especially if you rebuild often.
  • Reduce image churn: pin base images, avoid rebuilds that invalidate every layer.

Storage and extraction: when the network isn’t the bottleneck

A dirty secret: many “slow pulls” are fast downloads followed by slow extraction. Especially on shared runners, thin-provisioned disks, or overlay-on-overlay situations.

Extraction is a storage benchmark in disguise

  • Lots of tiny files: metadata and inode churn.
  • Compression: CPU time, then writes.
  • Overlay filesystem behavior: copy-up penalties, dentry pressure, and write amplification.

Fixes that matter

  • Move Docker’s data root to faster storage (/var/lib/docker on NVMe beats networked disks); a relocation sketch follows this list.
  • Keep images smaller and reduce file count in layers (multi-stage builds, slim base images, prune caches).
  • On busy hosts, schedule pulls to avoid IO contention (or isolate build nodes).
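
Relocating the data root (the first item above) is a daemon.json change plus a copy. A sketch, assuming the new NVMe filesystem is mounted at /mnt/nvme and you can afford a short Docker outage (fold the data-root key into whatever daemon.json you already maintain):

cr0x@server:~$ sudo systemctl stop docker
cr0x@server:~$ sudo rsync -aHX /var/lib/docker/ /mnt/nvme/docker/
cr0x@server:~$ sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "data-root": "/mnt/nvme/docker",
  "log-level": "info"
}
EOF
cr0x@server:~$ sudo systemctl start docker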

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

The company had a “secure-by-default” network baseline. One team rolled out new firewall rules to “tighten things up” between build workers and the internet. Pulls didn’t fail outright; they just got painfully slow and sometimes stalled. That’s worse, because it feels like a random flake rather than a crisp outage.

The assumption was simple and wrong: “ICMP is optional.” The rule set blocked ICMP type 3 code 4 (IPv4 fragmentation needed) and the IPv6 equivalent. Most web browsing still worked. Small HTTPS requests were fine. Docker pulls? Large TLS records and big blobs would hit the effective MTU, disappear into a blackhole, and sit there until retries and backoff made the pipeline look haunted.

On-call spent hours chasing “Docker Hub issues,” then “proxy issues,” then “maybe our runners are overloaded.” Someone finally ran an MTU probe and got the smoking gun: the path MTU was lower than the interface MTU, and PMTUD couldn’t signal it.

The fix was not heroic. They allowed the right ICMP types on the egress path and set a conservative Docker MTU on the runners that traversed tunnels. Pull times returned to normal. The postmortem action item was even more valuable: every network change that touches egress now gets an MTU/PMTUD check as a standard gate.

Mini-story 2: The optimization that backfired

A platform team wanted faster builds, so they introduced a registry mirror in each region. Sensible idea: localize traffic, reduce WAN bandwidth, keep CI snappy. They rolled it out quickly, and at first it looked great.

Then the backfire. The mirror had a small disk budget and aggressive eviction. When multiple teams started shipping larger images (language runtimes, ML dependencies, debug symbols—pick your poison), the mirror churned constantly. A blob would be cached, evicted, and fetched again within the same day. The proxy in front of the mirror also had a short idle timeout; partial downloads got reset, causing retries, increasing load, causing more resets. A little feedback loop of misery.

The symptom was deceptive: pulls were sometimes fast and sometimes catastrophically slow, even within the same region. Engineers blamed the registry, then Docker, then their own images. The real root cause was the “optimization”: a cache too small to be stable, plus timeouts tuned for web pages, not multi-hundred-megabyte blobs.

The eventual fix was boring: right-size storage, tune timeouts, add monitoring on cache hit rate, and disable the proxy behavior that broke long-lived transfers. They also set policy: mirrors are production services, not side projects living on leftover VMs.

Mini-story 3: The boring but correct practice that saved the day

A financial services shop had strict outbound controls. They also had a habit I wish everyone had: every build worker and Kubernetes node ran the same baseline diagnostics on boot and published the results internally.

One Monday, developers complained that image pulls were slow “everywhere.” The SRE on call didn’t open Slack debates about whether Docker Hub was having a bad day. They opened the node diagnostics dashboard. DNS query times had tripled on a subset of subnets. MTU probes were normal. Disk IO was normal. It was DNS, cleanly and unromantically.

The culprit was a resolver failover that technically worked but moved traffic to a farther site. Latency went up. Caches were cold. Registry pulls were suddenly paying extra milliseconds dozens of times per pull. Not enough to scream “outage,” plenty enough to melt CI throughput.

They switched resolver preference back, warmed caches, and the complaint evaporated. The saving practice wasn’t a clever tool; it was consistent baseline measurement. Boring. Correct. Incredibly effective when the room fills with theories.

Common mistakes (symptoms → root cause → fix)

  • Symptom: Pull hangs at “Downloading” around the same percentage each time.
    Root cause: MTU/PMTUD blackhole; large packets dropped, no ICMP feedback.
    Fix: Probe MTU with ping -M do, allow ICMP frag-needed/Packet Too Big, set Docker MTU to match tunnel path.
  • Symptom: Long pause before any progress, then it suddenly speeds up.
    Root cause: Broken IPv6 with slow fallback to IPv4; AAAA records exist but route doesn’t.
    Fix: Fix IPv6 routing/firewall/DNS; as mitigation, prefer IPv4 for daemon or remove bad AAAA at the resolver if you own it.
  • Symptom: curl to registry is fast, but docker pull is slow or times out.
    Root cause: Proxy split-brain: shell proxy vars differ from daemon proxy config.
    Fix: Configure proxy via systemd drop-in for Docker daemon; align NO_PROXY.
  • Symptom: Pull is fast on laptops, slow on servers.
    Root cause: Servers use different DNS resolvers/search domains, or traverse VPN/tunnels with reduced MTU.
    Fix: Compare resolver and MTU; align daemon DNS and MTU to the server’s actual path.
  • Symptom: “Downloading” is quick, “Extracting” takes forever, CPU is fine but IO is pegged.
    Root cause: Slow disk, thin-provisioned storage, overlay filesystem overhead, too many small files.
    Fix: Move Docker root to faster storage; optimize images; reduce layer file count; consider runner isolation.
  • Symptom: Only one registry is slow; others are normal.
    Root cause: CDN edge selection, rate limiting, or proxy policy scoped to certain domains.
    Fix: Authenticate pulls; use a mirror; adjust proxy bypass; validate DNS answers (geo, split horizon).
  • Symptom: Pull speed oscillates wildly minute-to-minute.
    Root cause: Proxy connection resets, mirror cache churn, or network congestion with retries.
    Fix: Tune proxy idle timeouts, right-size mirror, measure cache hit rate, reduce concurrent pulls during congestion windows.

Checklists / step-by-step plan

Checklist A: One-node triage (15 minutes)

  1. Run docker -D pull and note where time is spent (resolve/auth/download/extract).
  2. Measure DNS time for registry hostname with dig +stats.
  3. Check /etc/resolv.conf for ndots and search domains that amplify lookups.
  4. Check MTU mismatch: ip link (compare interface MTUs) and a ping -M do probe.
  5. Check daemon proxy settings via systemctl cat docker.
  6. Check IPv6 connection attempts with ss -tpn.
  7. If extraction is slow, confirm with iostat and stop blaming the network.

Checklist B: Fleet fix (what scales beyond one host)

  1. Standardize daemon config: DNS servers, MTU, proxy, log level.
  2. Baseline tests on boot: DNS latency, MTU probe to key endpoints, disk IO sanity.
  3. Central observability: pull durations, failure rates, and where time is spent (download vs extraction).
  4. Introduce mirrors deliberately: treat as production—capacity, monitoring, TTL/eviction policy, timeouts.
  5. Network policy hygiene: explicitly allow PMTUD ICMP types; document it so it doesn’t get “hardened” away.

Checklist C: Image hygiene (you control this part)

  1. Keep base images pinned and consistent to maximize layer reuse.
  2. Use multi-stage builds to reduce size and file count.
  3. Stop embedding package manager caches inside layers.
  4. Prefer fewer, meaningful layers over dozens of micro-layers.

FAQ

1) Why does docker pull feel slower than downloading a big file?

Because it’s not one file. It’s multiple DNS lookups, auth calls, redirects, parallel blob downloads, and then decompression + extraction. Any weak link becomes “Docker is slow.”

2) I can resolve DNS quickly on the host. Why would Docker still be slow?

The daemon may use different DNS settings than your shell or your containers. Check /etc/docker/daemon.json, systemd-resolved behavior, and whether the daemon is pointed at a stub resolver it can’t reliably use.

3) What’s the fastest way to confirm an MTU problem?

Use a “do not fragment” ping probe and binary search the payload size. If large packets fail but smaller ones work, and especially if you’re on VPN/tunnels, treat MTU as guilty until proven otherwise.

4) Should I just set Docker MTU to 1400 everywhere?

No. That’s a blunt instrument. It can reduce performance on clean networks and hide real PMTUD/firewall defects. Set MTU based on actual path constraints, and fix ICMP handling so you don’t need superstition.

5) Why does disabling IPv6 “fix” it?

If IPv6 is broken, clients waste time trying it before falling back. Disabling IPv6 forces IPv4 immediately, avoiding the timeout tax. Better fix: make IPv6 work or stop publishing broken AAAA records.

6) My proxy is required. What should I ask the network team for?

Ask for: longer idle timeouts for large downloads, support for modern TLS without breaking session reuse, capacity for many concurrent connections, and a clear NO_PROXY bypass list for registry hostnames if policy allows.

7) Pull is slow only on Kubernetes nodes. Why?

Nodes often have different egress paths (NAT gateways, CNI overlays, stricter firewalls, different DNS). Also, kubelet uses containerd directly, so proxy/DNS settings may differ from your “Docker host” assumptions.

8) How do I tell “download slow” from “extract slow” without guessing?

Use docker -D pull and watch where it stalls, then confirm with disk metrics. If IO is pegged during “Extracting,” your network isn’t the bottleneck.

9) Is a registry mirror always worth it?

It’s worth it when you have many nodes repeatedly pulling common images, especially in CI. It’s not worth it if you can’t operate it reliably. An unreliable mirror is a creative way to invent outages.

10) Why do pulls sometimes get slower after “optimizing” DNS?

Because “optimizing” often means changing resolvers without understanding split DNS, caching behavior, or geo answers. Faster resolver RTT doesn’t help if it returns a worse CDN edge or breaks corporate routing.

Conclusion: next steps that pay off

If docker pull is painfully slow, treat it like a production incident: gather evidence, isolate the stage, and fix the real bottleneck. Start with DNS latency and correctness, then MTU/PMTUD, then proxy behavior, then IPv6, and only then argue with the registry.

Practical next steps:

  1. Run the fast diagnosis playbook on one affected node and one known-good node. Diff results.
  2. Standardize Docker daemon config across your fleet (DNS, MTU, proxy) and restart intentionally.
  3. Add baseline DNS/MTU checks to node provisioning so you catch this before CI becomes performance art.
  4. If extraction is your bottleneck, move Docker storage to faster disks and clean up your images. Network tuning won’t save a disk-bound host.