You deploy a new container version on Friday. Nothing exotic. Same image, same compose file, just a tag bump.
Then your app starts logging: “could not resolve db” or “Name or service not known.” Five minutes later it “fixes itself.”
Or it doesn’t. Everyone agrees it’s “DNS,” which is corporate-speak for “nobody knows yet.”
Docker’s service discovery is simple when you stay on the paved road, and surprisingly weird when you step one foot off it.
The good news: most failures are deterministic. The bad news: the determinism is hiding behind defaults like ndots,
network scoping, and the embedded DNS server you never asked for.
A mental model that matches production
Here’s the model to keep in your head: Docker service discovery is network-scoped DNS.
Names resolve only inside the same Docker network, using Docker’s embedded DNS server (usually 127.0.0.11 inside the container),
and the name-to-IP mapping is built from container endpoints and network aliases. That’s it. Everything else is fallout.
If you remember one sentence, make it this: if two containers are not on the same user-defined network, they don’t “see” each other by name.
“But they’re on the same host!” is not a networking model; it’s a cry for help.
What you can expect to work
- On a user-defined bridge network created by Docker/Compose: service names and network aliases resolve to container IPs.
- On the Compose default network: each service name becomes a DNS name (db, redis, api).
- In Swarm: service names resolve either to a VIP (load-balanced) or to multiple task IPs (DNSRR mode).
What you should assume is broken until proven otherwise
- Name resolution across different Docker networks without explicit connections.
- Using container IPs as stable identities (they’re cattle, not pets).
- Relying on search domains and bare hostnames when ndots and corporate DNS policies are in play.
- “It works on my laptop” Compose files that implicitly depend on default networks and friendly defaults.
One more thought: Docker DNS is not a general DNS server. It’s a name registry for container endpoints with a forwarding layer.
Treat it like a control-plane feature with a data-plane dependency. If you run it like “just DNS,” it will remind you who’s in charge.
Paraphrased idea (attributed): Werner Vogels often pushes the idea that everything fails; design for failure rather than assuming reliability.
That’s the right posture for Docker service discovery too.
Interesting facts and a bit of history
These are not trivia-night facts. They explain why Docker behaves the way it does, and why your “simple DNS question” turns into
a two-hour incident bridge.
- Docker originally relied heavily on /etc/hosts entries for container name lookups; DNS-based service discovery matured later as networking evolved.
- The embedded DNS server (127.0.0.11) is per-container in the sense that it’s a stub listener inside each container’s network namespace, backed by the Docker engine.
- User-defined bridge networks brought real service discovery: the default bridge network historically didn’t provide automatic name resolution the way user-defined networks do.
- Compose popularized “service name = DNS name”, which made microservice dev easy and also made people forget DNS is scoped per network.
- Swarm introduced VIP-based service discovery: resolving a service name often returns a virtual IP, not a task IP, shifting load balancing into the platform.
- DNS TTL behavior is not a promise in containerized environments; Docker’s DNS responses and client-side caches can make changes appear “sticky.”
- ndots became a silent troublemaker as Kubernetes and container runtimes increased use of search domains; Docker inherits the same resolver behavior from libc.
- Corporate DNS split-horizon patterns collide with containers: what your host can resolve via VPN may not be reachable or resolvable inside a container without plumbing.
How Docker DNS actually works (bridge, compose, swarm)
The embedded DNS server: why you keep seeing 127.0.0.11
Inside most containers on user-defined networks, /etc/resolv.conf points to nameserver 127.0.0.11.
That address isn’t your corporate DNS. It’s Docker’s embedded DNS stub. The stub does two jobs:
- Answers queries for container names and aliases on the same network.
- Forwards other queries to the upstream resolvers configured for the Docker daemon (often copied from the host).
This matters because when upstream DNS fails, you’ll see failures even for internal names if the resolver library gets confused
(timeouts, retransmits, search-domain expansions). It also matters because a container can have different upstream DNS servers than the host.
User-defined bridge networks: the sane default
Create a network, attach containers to it, and Docker will register their names in that network’s DNS namespace.
Compose does this for you by creating a project-scoped network. That’s why db resolves in Compose without you doing anything clever.
The default bridge network is a legacy convenience. It has improved, but it’s still a foot-gun compared to user-defined networks.
If you care about predictable service discovery, stop using the default bridge for multi-container apps.
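A minimal sketch of the difference, with hypothetical names (demo_net, c1, and the images are mine; the IPs in the output are illustrative): on a user-defined network the embedded DNS answers for container names, on the default bridge it doesn’t.
cr0x@server:~$ docker network create demo_net
cr0x@server:~$ docker run -d --name c1 --network demo_net nginx:alpine
cr0x@server:~$ docker run --rm --network demo_net alpine getent hosts c1
172.21.0.2        c1
cr0x@server:~$ docker run --rm alpine sh -c 'getent hosts c1 || echo "not found"'   # default bridge: no name registration
not found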
Overlay networks (Swarm): the “it depends” zone
In Swarm, service discovery is baked into the orchestrator. You typically get one of two modes:
- VIP mode: service name resolves to a virtual IP; the swarm routes to tasks. Great for simplicity; sometimes confusing for debugging.
- DNSRR mode: service name resolves to multiple A records (task IPs). Great for client-side load balancing; easy to misuse.
Overlay networks add moving parts: gossip/control-plane traffic, ingress routing mesh, and per-node DNS handling. When service discovery fails here,
“DNS” may actually be an overlay control-plane problem. Your diagnostic approach should reflect that.
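If you want to see the two modes side by side, here’s a sketch, assuming an already-initialized Swarm; the network and service names are hypothetical and the IPs are illustrative. The --endpoint-mode flag is the real switch.
cr0x@server:~$ docker network create -d overlay --attachable app_ov
cr0x@server:~$ docker service create --name api --network app_ov --replicas 3 nginx:alpine
cr0x@server:~$ docker service create --name api-rr --network app_ov --replicas 3 --endpoint-mode dnsrr nginx:alpine
cr0x@server:~$ docker run --rm --network app_ov alpine sh -c 'apk add -q bind-tools; dig +short api; echo ---; dig +short api-rr'
10.0.1.2
---
10.0.1.8
10.0.1.9
10.0.1.10
The VIP-mode service resolves to a single virtual IP; the DNSRR service returns one A record per task. The special tasks.<service> name also lists task IPs even in VIP mode, which is handy for debugging.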
What Docker DNS is not
- It is not a full authoritative DNS server for your enterprise domain.
- It is not a service mesh.
- It is not a promise that your app’s resolver behavior is sane.
Joke #1: DNS stands for “Did Not Sleep,” and Docker will make sure you earn the acronym if you ignore resolver settings.
Service names, container names, hostnames, and network aliases
People mix these terms like they’re synonyms. They aren’t, and the difference is the difference between “works” and “mysteriously fails in prod.”
Service name (Compose)
In Docker Compose, the service name (the key under services:) becomes a DNS name on the Compose network.
That’s why this works:
api can connect to db:5432 if both are on the same Compose network.
Compose also adds project scoping behind the scenes, but DNS names are generally the service name, not the full container name.
Unless you override things. And people do.
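For reference, a minimal Compose sketch of “service name = DNS name”; the image names and the resolved IP are hypothetical, but the behavior is the default.
cr0x@server:~$ cat docker-compose.yml
services:
  api:
    image: my-api:1.4.2        # hypothetical image
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
cr0x@server:~$ docker compose up -d
cr0x@server:~$ docker compose exec api getent hosts db
172.20.0.10       db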
Container name
The container name is what you see in docker ps. Compose often generates it like project-service-1.
You can set container_name, but that tends to create more problems than it solves:
it breaks scaling, and it encourages brittle dependencies on a specific identity.
Hostname
A container’s hostname is what hostname returns inside the container. It can influence how some software identifies itself,
but it’s not the same as a DNS record. Setting hostname: in Compose does not automatically create a stable DNS name across networks.
Network alias (the tool you actually want)
A network alias is a DNS name associated with a container endpoint on a specific network.
Aliases are network-scoped. That’s the point. You can give the same container different aliases on different networks.
Use aliases when:
- You want a stable “well-known” name like db while swapping implementations (postgres vs cockroach).
- You need multiple names for the same service for migration purposes (db and postgres-primary).
- You’re connecting one container to multiple networks and want to control the names visible in each.
Avoid aliases when:
- You’re using them to paper over bad network design (“just alias it so it resolves”).
- You’re building a pseudo-global namespace. Docker networks are meant to be boundaries.
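In Compose, aliases live under a service’s network attachment. A sketch of the migration case above, assuming a network named app_net and a hypothetical api image; the service name db is registered automatically, and postgres-primary rides along as a temporary alias.
cr0x@server:~$ cat docker-compose.yml
services:
  db:
    image: postgres:16
    networks:
      app_net:
        aliases:
          - postgres-primary     # temporary second name kept during the migration
  api:
    image: my-api:1.4.2          # hypothetical image
    networks:
      - app_net
networks:
  app_net: {}
cr0x@server:~$ docker compose exec api getent hosts postgres-primary
172.20.0.10       postgres-primary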
A word about “localhost”
Inside a container, localhost is the container itself. Not the host. Not the other container. Not your feelings.
If your app uses localhost to reach a dependency, it will fail unless that dependency is in the same container.
Split the process, and you must change the address.
Joke #2: “It worked when I used localhost” is the container equivalent of “the check is in the mail.”
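The fix is configuration, not networking: point the client at the dependency’s service name or alias. A sketch with a hypothetical DATABASE_URL and image name:
cr0x@server:~$ cat docker-compose.yml
services:
  api:
    image: my-api:1.4.2           # hypothetical image
    environment:
      # Wrong inside a container: localhost is the api container itself.
      # DATABASE_URL: postgres://app:secret@localhost:5432/app
      # Right: the db service name on the shared Compose network.
      DATABASE_URL: postgres://app:secret@db:5432/app
  db:
    image: postgres:16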
Fast diagnosis playbook
When service discovery fails, you don’t have time for interpretive dance. You need a sequence that finds the bottleneck fast.
This playbook assumes you’re debugging from the Docker host with CLI access.
First: confirm it’s a name problem, not a TCP problem
- If resolving the name fails: you’re in DNS territory.
- If resolving works but connection fails: you’re in routing/firewall/listening territory.
Second: verify both containers share a user-defined network
- If they don’t share a network, no amount of alias tweaking will help.
- If they do, inspect the network and endpoints for aliases and IPs.
Third: inspect the client container’s resolver configuration
- Check /etc/resolv.conf for 127.0.0.11, search domains, and options ndots.
- Check whether the container can reach upstream DNS (if the name is external).
Fourth: reproduce the lookup with deterministic tools
- Use getent hosts to match libc behavior.
- Use dig/nslookup to see raw DNS answers (if installed).
Fifth: check Docker engine state and network objects
- Look for weirdness: stale endpoints, orphaned networks, daemon restarts, or “helpful” DNS overrides.
Practical tasks: commands, outputs, and decisions
These are the field tasks I actually run. Each includes what the output means and what decision you make next.
Run them in order when you’re under pressure.
Task 1: Confirm which networks the client container is on
cr0x@server:~$ docker inspect -f '{{json .NetworkSettings.Networks}}' api | jq
{
"app_net": {
"IPAMConfig": null,
"Links": null,
"Aliases": [
"api",
"3d2c9a8c4c1a"
],
"MacAddress": "02:42:ac:14:00:05",
"DriverOpts": null,
"NetworkID": "c5f2e4b8d0f1...",
"EndpointID": "4c0f2a7c2e0b...",
"Gateway": "172.20.0.1",
"IPAddress": "172.20.0.5",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"DNSNames": [
"api",
"3d2c9a8c4c1a"
]
}
}
Meaning: The container api is attached to app_net and has DNS names api plus its container ID.
Decision: Inspect whether the target container is on app_net too. If not, fix the network attachment, not DNS.
Task 2: Confirm the target container shares the same network
cr0x@server:~$ docker inspect -f '{{json .NetworkSettings.Networks}}' db | jq
{
"app_net": {
"IPAMConfig": null,
"Links": null,
"Aliases": [
"db",
"postgres",
"a9b1c2d3e4f5"
],
"NetworkID": "c5f2e4b8d0f1...",
"EndpointID": "d1a6b9aa0f2c...",
"Gateway": "172.20.0.1",
"IPAddress": "172.20.0.10",
"IPPrefixLen": 16
}
}
Meaning: Both containers share app_net. DNS should work for db and alias postgres.
Decision: Move to resolver checks inside the client container.
Task 3: Check resolver config inside the client container
cr0x@server:~$ docker exec -it api cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0
search corp.example
Meaning: The container uses Docker’s embedded DNS. ndots:0 means even single-label names like db are treated as “absolute” first.
Decision: If you see ndots:5 and a long search list, expect delays and weird external lookups; consider tuning or using FQDNs for external names.
Task 4: Reproduce the lookup using libc behavior
cr0x@server:~$ docker exec -it api getent hosts db
172.20.0.10 db
Meaning: libc name resolution succeeds. If your app still claims “cannot resolve,” suspect app-level DNS caching, a different resolver library, or wrong hostname.
Decision: If getent fails, it’s a real resolver/DNS path issue—continue with raw DNS tools.
Task 5: Ask Docker DNS directly (raw query)
cr0x@server:~$ docker exec -it api sh -lc 'apk add --no-cache bind-tools >/dev/null 2>&1; dig @127.0.0.11 db +short'
172.20.0.10
Meaning: Docker DNS returns the expected A record.
Decision: If this works but the app fails, you’re dealing with application DNS caching, race conditions at startup, or connecting to the wrong network name.
Task 6: Check whether the app is failing due to IPv6 preference
cr0x@server:~$ docker exec -it api sh -lc 'getent ahosts db | head -n 5'
172.20.0.10 STREAM db
172.20.0.10 DGRAM
172.20.0.10 RAW
Meaning: Only IPv4 answers. If your app tries IPv6 first and your DNS returns AAAA records elsewhere, you can get “hang then fallback.”
Decision: If you see AAAA records but no IPv6 routing, fix IPv6 configuration or force IPv4 in the client.
Task 7: Confirm the network has the right endpoints and aliases
cr0x@server:~$ docker network inspect app_net | jq '.[0] | {Name, Driver, Containers}'
{
"Name": "app_net",
"Driver": "bridge",
"Containers": {
"3d2c9a8c4c1a...": {
"Name": "api",
"IPv4Address": "172.20.0.5/16",
"IPv6Address": ""
},
"a9b1c2d3e4f5...": {
"Name": "db",
"IPv4Address": "172.20.0.10/16",
"IPv6Address": ""
}
}
}
Meaning: The network sees both endpoints. If the target container isn’t listed, DNS can’t resolve it on that network.
Decision: Attach the container to the network or fix Compose networking configuration.
Task 8: Verify Compose created the network you think it did
cr0x@server:~$ docker compose ls
NAME STATUS
payments running(6)
cr0x@server:~$ docker network ls | grep payments
c5f2e4b8d0f1 payments_default bridge local
Meaning: Compose created payments_default. If your containers are on some other network, you may be running multiple projects or mixing manual runs.
Decision: Standardize: run everything through Compose (or through your orchestrator), not a half-and-half situation.
Task 9: Prove whether a container is accidentally on the default bridge
cr0x@server:~$ docker inspect -f '{{range $k,$v := .NetworkSettings.Networks}}{{$k}} {{end}}' legacy_worker
bridge
Meaning: This container is only on the default bridge network. It will not resolve names from your user-defined Compose network.
Decision: Move it to a user-defined network; stop expecting cross-network name discovery.
Task 10: Attach a running container to the correct network (hot fix)
cr0x@server:~$ docker network connect payments_default legacy_worker
cr0x@server:~$ docker inspect -f '{{range $k,$v := .NetworkSettings.Networks}}{{$k}} {{end}}' legacy_worker
bridge payments_default
Meaning: Now the container shares the Compose network and should resolve service names on it.
Decision: Treat this as a stopgap. Fix the Compose file or deployment so it starts attached correctly next time.
Task 11: Validate the name you’re using is actually registered as an alias
cr0x@server:~$ docker inspect -f '{{json (index .NetworkSettings.Networks "payments_default").Aliases}}' db | jq
[
"db",
"postgres",
"a9b1c2d3e4f5"
]
Meaning: The alias postgres is real on that network.
Decision: If your app connects to postgresql and that alias isn’t here, either add the alias or update the app config. Don’t guess.
Task 12: Detect search-domain expansion causing slow lookups
cr0x@server:~$ docker exec -it api sh -lc 'cat /etc/resolv.conf; echo; time getent hosts db >/dev/null'
nameserver 127.0.0.11
options ndots:5
search corp.example svc.corp.example cloud.corp.example
real 0m2.013s
user 0m0.000s
sys 0m0.003s
Meaning: A two-second lookup for a local name is a smell. With ndots:5, the resolver tries appending search domains first. That can cause timeouts before it asks for db as-is.
Decision: For internal Docker names, prefer single-label names with ndots:0 where appropriate, or set an explicit trailing dot (db.) in clients that support it. Alternatively, reduce search domains for that workload.
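Resolver options can be set per container. A sketch using the real --dns-search and --dns-opt flags (dns_search and dns_opt in Compose); the domain is hypothetical and the output is trimmed.
cr0x@server:~$ docker run --rm --network app_net --dns-search corp.example --dns-opt ndots:1 alpine cat /etc/resolv.conf
nameserver 127.0.0.11
search corp.example
options ndots:1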
Task 13: Confirm the daemon’s DNS settings (host-side)
cr0x@server:~$ docker info | sed -n '/DNS:/,/Registry Mirrors:/p'
DNS: 10.10.0.53
10.10.0.54
Registry Mirrors:
Meaning: Docker daemon forwards non-container queries to these upstream servers.
Decision: If external name lookups fail in containers but work on the host, compare upstream resolvers; you may need to configure /etc/docker/daemon.json DNS explicitly.
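A sketch of pinning the daemon’s upstream resolvers: the dns and dns-search keys are real daemon.json options, the addresses and domain are placeholders, and the daemon needs a restart to pick them up.
cr0x@server:~$ cat /etc/docker/daemon.json
{
  "dns": ["10.10.0.53", "10.10.0.54"],
  "dns-search": ["corp.example"]
}
cr0x@server:~$ sudo systemctl restart docker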
Task 14: Verify connectivity to the target service after DNS succeeds
cr0x@server:~$ docker exec -it api sh -lc 'apk add --no-cache busybox-extras >/dev/null 2>&1; nc -vz db 5432'
db (172.20.0.10:5432) open
Meaning: DNS and TCP connectivity are both good. If the app still errors, you’re looking at TLS, credentials, protocol mismatch, or app config.
Decision: Stop blaming DNS. Move up the stack.
Task 15: Swarm mode sanity check (VIP vs DNSRR)
cr0x@server:~$ docker service inspect payments_api --format '{{json .Endpoint.Spec.Mode}}'
"vip"
Meaning: In Swarm, the service resolves to a VIP. You’ll see one IP for the service name, not per-task IPs.
Decision: If your client expects multiple A records for load balancing, switch to DNSRR or fix the client to connect to the VIP.
Three corporate mini-stories from the DNS trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized company ran a payments API on a single Docker host for a legacy product line. It was “temporary” for a year,
which is the standard unit of time in enterprise architecture.
They used Compose, with an api service and a db service, plus a one-off “migration” container run manually during deploys.
One day, the migration job started failing with “could not translate host name ‘db’ to address.”
The on-call engineer checked the Compose stack: db was up, healthy, logs looked fine.
They restarted the database anyway, because tradition. The failure persisted.
The wrong assumption was subtle: they believed “containers on the same host can resolve each other by name.”
The migration container was started with docker run and landed on the default bridge.
The Compose services were on project_default. Two separate universes. Same host, different networks, no DNS relationship.
The fix was boring: run migrations as a Compose service attached to the same network, or attach the one-off container to the Compose network explicitly.
After that, the team wrote a small deploy wrapper that refused to run ad-hoc containers without --network.
It was mildly annoying for developers, which is how you know it worked.
Mini-story 2: The optimization that backfired
Another org had a latency-sensitive internal API. Someone noticed occasional 1–2 second delays during startup when services tried to reach dependencies.
A well-meaning engineer concluded, correctly, that DNS retries and search-domain expansion were involved.
Their “optimization” was to hardcode IP addresses for dependencies in environment variables.
For a week, it looked great. Startup was faster, and graphs were calmer. Then the incident hit: a routine redeploy shuffled container IPs.
One service kept trying to connect to the old IP, and because the IP now belonged to something else, the failure mode was not “connection refused.”
It was “connected to the wrong service,” followed by authentication errors that looked like credential drift.
The outage wasn’t catastrophic, but it was ugly: partial failures, confusing logs, and a rollback that didn’t restore sanity because
the “optimized” configuration lived in a shared CI template.
The postmortem was polite and deeply uncomfortable.
The real fix was to address DNS behavior, not bypass it: reduce search-domain noise, use network aliases, and implement retries with jitter
at the application layer for dependency readiness. IPs went back to being ephemeral, as they should be.
Mini-story 3: The boring but correct practice that saved the day
A larger enterprise ran multiple Compose projects on shared hosts during a migration phase.
It was messy, but the SRE team enforced one rule: every project defines explicit networks with explicit names, and every cross-project dependency
uses a dedicated “shared” network with controlled aliases.
They also enforced a naming convention: service-to-service connections must use a DNS name that is either the Compose service name
on the local network, or a network alias on the shared network. No container names. No IPs. No “whatever resolves.”
It felt pedantic and slowed down a few quick hacks.
Then a vendor container arrived with a hardcoded expectation to reach license-server.
Without the convention, teams would have renamed services, added random extra_hosts, or started editing images.
Instead, they attached the vendor container to the shared network and gave the actual license service a network alias license-server.
The vendor integration worked on the first try in staging and production. Nobody celebrated because it was boring.
That’s the highest compliment you can pay to operations work.
Common mistakes: symptom → root cause → fix
1) “Name or service not known” from one container, but works from another
Symptom: api can resolve db, but worker cannot.
Root cause: Containers are on different networks (often one on default bridge).
Fix: Attach both to the same user-defined network, or connect worker to the Compose network. Prefer explicit networks in Compose.
2) “Temporary failure in name resolution” that disappears after a restart
Symptom: Initial boot fails; restart “fixes it.”
Root cause: Startup ordering plus resolver retries plus dependency not ready; sometimes also slow DNS due to search domains.
Fix: Add app-level retries with backoff; use healthchecks and wait-for patterns; reduce search-domain noise or set ndots appropriately.
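A sketch of the healthcheck-plus-condition pattern in Compose; the pg_isready check, timings, and image names are assumptions, and your app should still retry on its own.
cr0x@server:~$ cat docker-compose.yml
services:
  db:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    image: my-api:1.4.2              # hypothetical image
    depends_on:
      db:
        condition: service_healthy   # waits for the healthcheck, not just "container started"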
3) Service name resolves to an IP, but connections hang or fail
Symptom: getent hosts db works; nc -vz db 5432 fails.
Root cause: Service not listening, wrong port, firewall rules, or the container isn’t reachable due to policy routing/iptables conflicts.
Fix: Check listening sockets (ss -lntp in the target), container logs, and Docker host iptables/nft rules. DNS isn’t your problem anymore.
4) External domains resolve on the host but not in containers
Symptom: Host can resolve corp-internal; container cannot.
Root cause: Docker daemon uses different upstream DNS servers than the host’s current VPN resolver; or the VPN routes aren’t available to containers.
Fix: Configure daemon DNS explicitly; ensure VPN DNS and routes are accessible from containers (sometimes requires running VPN on the host in a way containers can use).
5) “It worked until we scaled to 2 replicas”
Symptom: After scaling, connections go to the wrong backend or become inconsistent.
Root cause: Misuse of container_name or reliance on a single container identity; in Swarm, misunderstanding VIP vs DNSRR.
Fix: Remove container_name in scalable services; use service names; choose VIP or DNSRR intentionally.
6) Name collision between projects
Symptom: db resolves, but to the “other” DB.
Root cause: Shared network with overlapping aliases or multiple stacks attached to the same network with casual naming.
Fix: Use project-specific aliases (payments-db), or isolate networks and only share through a dedicated, curated network.
7) Slow lookups for single-label names
Symptom: Every internal lookup costs ~1–5 seconds.
Root cause: ndots too high plus long search list causes multiple failed queries before trying the bare name.
Fix: Reduce ndots for the workload, shorten search domains, or use a trailing dot where supported.
8) “We added extra_hosts and now everything is weird”
Symptom: A service intermittently connects to old endpoints after redeploys.
Root cause: extra_hosts pins names to fixed IPs, bypassing Docker DNS updates.
Fix: Remove extra_hosts for internal services; use network aliases and proper networks. Use extra_hosts only for special cases you’re willing to own forever.
Checklists / step-by-step plan
Checklist: designing Docker service discovery that won’t page you
- Use user-defined networks everywhere. In Compose, define them explicitly; don’t rely on default bridge behavior.
- Keep service discovery network-scoped. Treat networks as trust boundaries and failure domains.
- Use network aliases for stable “interface” names. Especially during migrations and vendor integrations.
- Do not hardcode container IP addresses. If you think you need to, you actually need a stable name or a different architecture.
- Avoid container_name for scalable services. It breaks scaling and encourages brittle dependencies.
- Make startup resilient. DNS being “up” doesn’t mean the dependency is ready. Build retries with jitter and sensible timeouts (see the sketch after this list).
- Control resolver behavior. Watch ndots and search domains; align them with your naming strategy.
- Separate internal and external naming concerns. Internal services: short names/aliases. External dependencies: FQDNs.
- Decide on Swarm mode consciously. VIP for simplicity; DNSRR if your clients can handle multiple A records correctly.
- Write runbooks with commands. If your diagnosis steps live only in someone’s head, they don’t exist.
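For the startup-resilience item above, a minimal entrypoint sketch. It only checks TCP reachability of a hypothetical db:5432 with capped, jittered backoff, then hands off to the real command; application readiness still needs app-level retries.
cr0x@server:~$ cat wait-for-db.sh
#!/bin/sh
# Wait for a dependency (default db:5432) with capped, jittered backoff, then exec the real command.
host="${DB_HOST:-db}"
port="${DB_PORT:-5432}"
attempt=1
while ! nc -z "$host" "$port" 2>/dev/null; do
  if [ "$attempt" -ge 30 ]; then
    echo "giving up on $host:$port after $attempt attempts" >&2
    exit 1
  fi
  base=$attempt
  [ "$base" -gt 5 ] && base=5                             # cap the backoff at 5 seconds
  jitter=$(( $(od -An -N1 -tu1 /dev/urandom) % 3 ))       # 0-2 seconds of jitter, POSIX-sh friendly
  echo "waiting for $host:$port (attempt $attempt, sleeping $((base + jitter))s)"
  sleep $((base + jitter))
  attempt=$((attempt + 1))
done
exec "$@"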
Step-by-step: migrating from “random names” to sane aliases
- Pick a canonical dependency name per service (db, cache, queue).
- Implement network aliases on the internal app network for those names.
- Update applications to use those names only (no container names, no IPs).
- Roll out one service at a time; keep temporary dual aliases for compatibility.
- Remove deprecated aliases after a full deploy cycle and a rollback window.
Step-by-step: a disciplined debug flow during an incident
- Run getent hosts inside the failing container for the dependency name.
- If it fails, inspect container networks and shared networks.
- Inspect /etc/resolv.conf for ndots and search domains.
- Use dig @127.0.0.11 (if possible) to validate embedded DNS behavior.
- If DNS works, test TCP connectivity with nc -vz.
- If TCP works, stop. It’s not DNS. Move to TLS/app config/auth.
FAQ
1) Why do my containers use nameserver 127.0.0.11?
That’s Docker’s embedded DNS stub inside the container namespace. It resolves container names/aliases on Docker networks and forwards other queries upstream.
2) Why does service discovery work in Compose but not with docker run?
Compose attaches services to a user-defined project network where DNS-based discovery is enabled. A bare docker run often lands on default bridge
unless you specify --network. Different network means different DNS namespace.
3) Is container_name a good way to get stable DNS names?
No. It makes scaling painful and encourages coupling. Use service names and network aliases instead; they express intent without pinning identity to a single container.
4) How do network aliases differ from hostnames?
Hostname is local identity inside the container. A network alias is a DNS name registered on a specific Docker network for that endpoint.
Aliases are what other containers can resolve.
5) Why do lookups take seconds sometimes?
Common cause: ndots combined with search domains. A single-label name like db may be tried as db.corp.example,
db.svc.corp.example, etc., with timeouts, before trying db directly.
6) Should I use FQDNs for internal Docker services?
Usually no. Use short service names and aliases within the network. Use FQDNs for external services, especially across VPNs and enterprise DNS.
Mixing the two increases resolver complexity and failure modes.
7) In Swarm, why does my service name resolve to one IP even with many replicas?
You’re likely in VIP mode. The service name resolves to a virtual IP; Swarm handles load balancing. If you want multiple A records, use DNSRR mode
and ensure your client can handle it.
8) Is it safe to use extra_hosts to “fix DNS”?
Only if you’re comfortable owning that mapping long-term. extra_hosts pins names to IPs and bypasses dynamic service discovery.
It’s acceptable for a genuinely static endpoint; it’s a trap for internal services.
9) Why does my app fail DNS but getent hosts works?
Your app may use a different resolver library, cache aggressively, prefer IPv6, or have its own DNS client with different timeouts.
Confirm what resolver stack the runtime uses (glibc vs musl vs custom) and test with equivalent tooling.
10) Can two networks share the same alias safely?
Yes, because aliases are network-scoped. It becomes unsafe when you attach a container to multiple networks and then assume the name resolves the same way everywhere.
Be explicit about which network a client is on.
Conclusion: next steps you can ship
Docker service discovery fails for predictable reasons: containers aren’t on the same network, aliases aren’t what you think they are,
or the resolver is doing exactly what you configured (or inherited) and you just didn’t read it.
Fixing it is mostly about choosing a model and enforcing it.
Practical next steps
- Audit networks: list containers that still use the default bridge and migrate them to user-defined networks.
- Standardize names: pick service names and add network aliases for stable dependency identities.
- Kill IP configs: remove any internal dependency IP addresses and extra_hosts hacks unless they’re truly static.
- Instrument startup: add retries with jitter and bounded timeouts; treat DNS success as necessary but not sufficient.
- Write the runbook: copy the fast diagnosis playbook and the tasks section into your on-call docs, then keep it current.
If you do those five things, most “Docker DNS incidents” stop being incidents. They become a five-minute diagnosis and a one-line fix.
Which is what DNS deserves: quiet competence, not drama.