Docker “Cannot connect to the Docker daemon”: fixes that actually work

The error always arrives at the worst time: you’re mid-deploy, CI just went red, or you’re trying to run a quick one-liner before a meeting.
You type docker ps and Docker replies with the equivalent of a shrug:
Cannot connect to the Docker daemon.

This isn’t one problem. It’s a symptom. Sometimes the daemon is down. Sometimes it’s running but you’re talking to the wrong socket.
Sometimes you’re not allowed to talk to it. And sometimes you’re not even on the machine you think you are. Let’s fix the right thing, quickly,
without cargo-cult “sudo everything” rituals.

What the error actually means (and why it lies)

The Docker CLI is a client. The Docker daemon (dockerd) is a server. The “Cannot connect” error is the client saying:
“I tried to reach the server at some endpoint and failed.”

That endpoint is usually a Unix socket on Linux: /var/run/docker.sock. On Docker Desktop, it’s a socket proxied through
the Desktop app and a VM. In remote setups, it might be TCP with TLS, or SSH transport. If the client can’t open the socket, can’t
authenticate, or connects to the wrong place, you get the same generic message.
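
If you are not sure which transport you’re on, the same CLI can be pointed at each kind of endpoint explicitly. A minimal sketch; the host name and port here are placeholders, not recommendations:

cr0x@server:~$ docker -H unix:///var/run/docker.sock ps        # local Unix socket
cr0x@server:~$ docker -H ssh://deploy@prod-host ps             # SSH transport to a remote daemon
cr0x@server:~$ DOCKER_HOST=tcp://127.0.0.1:2376 DOCKER_TLS_VERIFY=1 docker ps    # TCP with TLS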

There are four big categories:

  • The daemon isn’t running (service stopped, crash loop, failed update).
  • The daemon is running but unreachable (wrong socket path, wrong context, wrong DOCKER_HOST, DNS/SSH/TLS issues).
  • Permission denied (user not in docker group, rootless mismatch, socket perms).
  • Daemon can’t start (disk full, corrupted state, storage driver issues, cgroup/iptables failures, incompatible config).

Interesting facts and short history (because the past keeps breaking your present)

  1. Docker’s earliest builds listened on a local TCP port by default; the default moved to a Unix socket early on because the remote API over TCP was, and remains, a security footgun.
  2. systemd changed the game for Linux service management; “Docker is down” became “systemd says it’s up, but it’s not responsive” (different problem).
  3. Docker Desktop uses a VM on macOS and Windows; for Linux containers, you are never talking to a “native” dockerd on the host OS.
  4. Rootless mode exists to avoid a root-owned daemon, but it changes sockets, paths, and expectations—great idea, frequent confusion.
  5. The docker group is effectively root on most systems. It can mount the host filesystem and escape containers. Treat it like sudo.
  6. OverlayFS became the mainstream storage driver on Linux, replacing older drivers like AUFS; upgrades can surface latent filesystem quirks.
  7. cgroups v2 adoption (now common) shifted resource control assumptions; old daemon configs can break in odd ways.
  8. iptables/nftables transitions caused years of container networking surprises; a daemon might “start” but fail to set up NAT rules.
  9. Docker contexts were introduced to manage multiple endpoints cleanly; they also made it easier to accidentally talk to the wrong daemon.

A useful mental model: Docker CLI doesn’t “start Docker.” It just asks Docker to do things. If the phone line is cut, you can yell louder
(sudo) but you’re still yelling into a dead receiver.

One quote worth keeping on a sticky note, because it prevents hours of superstition-driven debugging:
Paraphrased idea from Sidney Dekker: reliability lives in how systems respond to surprises, not in the absence of surprises.

Fast diagnosis playbook: first/second/third checks

When you’re on-call, you don’t “explore.” You narrow. Here’s the shortest path to the truth.

First: What endpoint is the CLI trying to use?

  • Check docker context and environment variables (DOCKER_HOST, DOCKER_CONTEXT).
  • Decision: if it’s pointing remote, Desktop, or rootless, troubleshoot that path—not systemctl on the host.

Second: Is the daemon alive and listening?

  • On Linux: systemctl status docker and journalctl -u docker.
  • Decision: if the service is down or crash-looping, fix startup errors before touching permissions.

Third: If it’s alive, is it permission or socket path?

  • Inspect /var/run/docker.sock ownership/mode, confirm your user groups.
  • Decision: if it’s permission denied, fix group membership (or use rootless correctly). Avoid “chmod 666” like it’s malware.

Fourth (only if needed): Is the daemon failing due to resources or configuration?

  • Disk space/inodes. Storage driver. Cgroups/iptables. Proxy env. TLS certs.
  • Decision: choose the narrow fix that addresses the error in logs, not the one you remember from last year.

Joke #1: Docker troubleshooting is like making espresso—most people blame the machine, and it’s usually the grind setting (your endpoint).

Practical tasks: commands, outputs, decisions (12+)

Each task below has three parts: a command you can run, what the output usually means, and what decision you should make.
This is the “stop guessing” section.

Task 1: See exactly where Docker CLI is trying to connect

cr0x@server:~$ docker context ls
NAME        DESCRIPTION                               DOCKER ENDPOINT               ERROR
default *   Current DOCKER_HOST based configuration   unix:///var/run/docker.sock
prod        Production daemon over SSH                ssh://deploy@prod-host

Meaning: The asterisk marks the active context. The endpoint tells you the transport and socket/host.

Decision: If the active context isn’t what you intended, switch contexts before doing anything else.

Task 2: Show environment variables that override the endpoint

cr0x@server:~$ env | egrep '^(DOCKER_HOST|DOCKER_CONTEXT|DOCKER_TLS_VERIFY|DOCKER_CERT_PATH)='
DOCKER_HOST=tcp://127.0.0.1:2375

Meaning: Your shell is forcing Docker to use TCP on 2375 (often from an old script or profile).

Decision: Unset it (unset DOCKER_HOST) or fix it to the intended endpoint. Don’t debug /var/run/docker.sock until you do.
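
If the variable keeps coming back after you unset it, something in your shell startup is re-exporting it. A quick way to hunt it down; the file list below is a typical set, adjust for your shell:

cr0x@server:~$ unset DOCKER_HOST
cr0x@server:~$ grep -n 'DOCKER_HOST' ~/.bashrc ~/.profile ~/.zshrc /etc/profile.d/*.sh 2>/dev/null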

Task 3: Quick health check—does the CLI reach the daemon at all?

cr0x@server:~$ docker version
Client: Docker Engine - Community
 Version:           26.1.3
 API version:       1.45
 Go version:        go1.22.3
 Git commit:        9e34c2a
 OS/Arch:           linux/amd64

Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Meaning: Client is installed; server isn’t reachable.

Decision: Proceed to service/socket checks. Don’t reinstall the client; it’s already there.

Task 4: On Linux, check if systemd thinks Docker is running

cr0x@server:~$ systemctl status docker --no-pager
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; preset: enabled)
     Active: active (running) since Tue 2026-01-02 09:14:03 UTC; 7min ago
TriggeredBy: ● docker.socket
       Docs: man:dockerd(8)
   Main PID: 1423 (dockerd)
      Tasks: 19
     Memory: 88.2M
        CPU: 1.268s
     CGroup: /system.slice/docker.service
             └─1423 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Meaning: The daemon is running, so “Cannot connect” is likely endpoint mismatch or permissions.

Decision: Jump to socket permissions and context/env checks, not restarts.

Task 5: If Docker is down, start it and see if it stays up

cr0x@server:~$ sudo systemctl start docker
cr0x@server:~$ systemctl is-active docker
active

Meaning: Service started successfully (for now).

Decision: Immediately check logs if it flaps. “Active” now doesn’t mean it will be active in 30 seconds.

Task 6: Read the last 100 log lines; stop guessing

cr0x@server:~$ sudo journalctl -u docker --no-pager -n 100
Jan 02 09:12:41 server dockerd[1399]: failed to start daemon: error initializing graphdriver: no space left on device
Jan 02 09:12:41 server systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jan 02 09:12:41 server systemd[1]: docker.service: Failed with result 'exit-code'.

Meaning: Not a permissions problem. Not a socket problem. The daemon can’t initialize storage due to disk exhaustion.

Decision: Fix disk space/inodes first. Restarting won’t help.

Task 7: Confirm the socket exists and who owns it

cr0x@server:~$ ls -l /var/run/docker.sock
srw-rw---- 1 root docker 0 Jan  2 09:14 /var/run/docker.sock

Meaning: Socket exists, owned by root:docker, accessible to members of docker group.

Decision: If you’re not in the docker group, either add yourself (with eyes open) or use sudo/rootless properly.

Task 8: Confirm your user is in the docker group

cr0x@server:~$ id
uid=1001(cr0x) gid=1001(cr0x) groups=1001(cr0x),27(sudo)

Meaning: You’re not in docker.

Decision: Add user to group (or decide you do not want that privilege and stick to sudo).

Task 9: Add user to docker group safely, then re-login

cr0x@server:~$ sudo usermod -aG docker cr0x
cr0x@server:~$ newgrp docker
cr0x@server:~$ id
uid=1001(cr0x) gid=999(docker) groups=999(docker),1001(cr0x),27(sudo)

Meaning: This shell now has the docker group (newgrp makes it the primary group for the session); fresh logins pick it up as a supplementary group.

Decision: Retry docker ps. If your organization treats docker group as privileged (it should), document and restrict membership.

Task 10: Detect rootless vs rootful mismatch

cr0x@server:~$ docker info 2>/dev/null | egrep -i 'rootless|Docker Root Dir|Server Version'
 Server Version: 26.1.3
 Docker Root Dir: /var/lib/docker

Meaning: No “rootless” entry in the output and a root dir of /var/lib/docker: you’re talking to a rootful, system-wide daemon. A rootless daemon shows rootless under Security Options and keeps its root dir under your home (typically ~/.local/share/docker).

Decision: If you intended rootless, you’re on the wrong context or wrong socket. If you intended rootful, keep troubleshooting permissions and service health.

Task 11: If disk is full, measure it like you mean it (space and inodes)

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  120G  120G  120M 100% /

cr0x@server:~$ df -hi /var/lib/docker
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p3  7.6M    7.6M    1.2K  100% /

Meaning: You’re out of both bytes and inodes. Containers and layers love inodes.

Decision: Free space (prune images/containers) or expand storage. Don’t expect the daemon to start until both numbers are sane.

Task 12: Find what in /var/lib/docker is eating the machine

cr0x@server:~$ sudo du -xhd1 /var/lib/docker | sort -h
1.2G    /var/lib/docker/containers
6.8G    /var/lib/docker/overlay2
8.1G    /var/lib/docker

Meaning: overlay layers are the main consumer. That’s normal; the question is whether it’s out of control.

Decision: Consider docker system df and pruning, but only after confirming you won’t delete needed images in production.

Task 13: Prune safely (and understand what you’re deleting)

cr0x@server:~$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          42        6         18.3GB    12.7GB (69%)
Containers      9         2         1.1GB     700MB (63%)
Local Volumes   16        8         9.4GB     2.2GB (23%)
Build Cache     88        0         6.5GB     6.5GB

cr0x@server:~$ docker system prune -f
Deleted Containers:
...
Total reclaimed space: 8.1GB

Meaning: You reclaimed space, mostly from unused images/cache.

Decision: If this is production, get explicit sign-off or use targeted pruning (docker image prune) so you don’t delete cached layers that keep deployments fast.

Task 14: Verify the daemon is listening where you think it is

cr0x@server:~$ sudo ss -xlpn | egrep 'docker\.sock|containerd\.sock'
u_str LISTEN 0      4096   /run/docker.sock  22788            * 0    users:(("dockerd",pid=1423,fd=4))
u_str LISTEN 0      4096   /run/containerd/containerd.sock  22787 * 0 users:(("containerd",pid=961,fd=7))

Meaning: dockerd is listening on /run/docker.sock, not necessarily /var/run/docker.sock (which is often a symlink).

Decision: If your client points at /var/run/docker.sock and that path is broken, fix the symlink or correct the endpoint.

Task 15: Spot a broken /var/run symlink (yes, it happens)

cr0x@server:~$ ls -l /var/run/docker.sock
ls: cannot access '/var/run/docker.sock': No such file or directory

cr0x@server:~$ ls -ld /var/run
lrwxrwxrwx 1 root root 4 Jan  2 08:59 /var/run -> /run

Meaning: The canonical runtime dir is /run; /var/run is a symlink. If /run/docker.sock exists but the other doesn’t, something is inconsistent.

Decision: Prefer unix:///run/docker.sock if your system uses it. Fix broken paths rather than chmod’ing random files.
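
If you’d rather not depend on the symlink at all, you can create a context that points straight at /run/docker.sock. A small sketch; the context name is arbitrary:

cr0x@server:~$ docker context create local-run --docker "host=unix:///run/docker.sock"
cr0x@server:~$ docker context use local-run
local-run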

Linux (systemd): when the service is down or sick

On Linux servers, this error is often simple: dockerd isn’t running. But “isn’t running” has layers, like every good storage incident.
A service may be stopped, crash-looping, blocked on dependencies, or alive but unable to answer API requests because it’s stuck in startup work.

Start with systemd’s view, then confirm reality

If systemctl status says “active (running),” verify the socket exists and that you can actually query the daemon.
If it says “failed,” don’t touch permissions. Logs first.

Common startup blockers you’ll see in logs

  • Disk full / inode exhaustion: graphdriver init fails; Docker can’t mount overlay layers.
  • Invalid daemon.json: a typo stops the daemon cold.
  • iptables issues: Docker can start but can’t create networks; sometimes it fails startup depending on distro and config.
  • cgroup mismatch: older configs on newer kernels can break container runtime assumptions.
  • containerd down: dockerd depends on containerd; if it’s unhealthy, Docker may fail to start or behave oddly.

Validate daemon.json instead of arguing with it

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" },
  "data-root": "/var/lib/docker",
  "iptables": true
}

cr0x@server:~$ sudo jq . /etc/docker/daemon.json >/dev/null
cr0x@server:~$ echo $?
0

Meaning: JSON parses cleanly.

Decision: If jq fails, fix JSON before restarting Docker. systemd will happily restart a broken config forever.
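
If jq isn’t installed, Python’s built-in JSON tool does the same syntax check. Recent Engine versions can also sanity-check the config themselves; the --validate flag is an assumption about your dockerd build, so skip it if your version rejects it:

cr0x@server:~$ python3 -m json.tool /etc/docker/daemon.json >/dev/null && echo OK
OK
cr0x@server:~$ sudo dockerd --validate --config-file=/etc/docker/daemon.json    # newer Engine versions only

Syntax-valid JSON can still contain an unsupported key; the daemon’s own validation (where available) catches more than jq does.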

Restart with intent, not panic

cr0x@server:~$ sudo systemctl restart docker
cr0x@server:~$ systemctl --no-pager -l status docker
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; preset: enabled)
     Active: active (running) since Tue 2026-01-02 10:02:19 UTC; 3s ago
       Docs: man:dockerd(8)

Meaning: Docker restarted and is currently alive.

Decision: Immediately test docker ps. If it works, you’re done. If it fails, the problem is not “service down.”

If docker.service is “active” but the CLI still can’t connect

This is where professionals waste time if they don’t slow down. If the daemon is running, “Cannot connect” is almost always one of:
(a) endpoint mismatch, (b) permissions, or (c) you’re in a container/namespace where the socket path doesn’t exist.

Check the process arguments to see what it’s listening on:

cr0x@server:~$ ps -ef | grep -E '[d]ockerd'
root      1423     1  0 09:14 ?        00:00:03 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Meaning: -H fd:// means systemd socket activation is in play; Docker’s socket unit matters.

Decision: Inspect docker.socket if the socket file isn’t appearing.

cr0x@server:~$ systemctl status docker.socket --no-pager
● docker.socket - Docker Socket for the API
     Loaded: loaded (/lib/systemd/system/docker.socket; enabled; preset: enabled)
     Active: active (listening) since Tue 2026-01-02 09:13:58 UTC; 48min ago
   Triggers: ● docker.service
     Listen: /run/docker.sock (Stream)

Meaning: Socket unit is listening on /run/docker.sock.

Decision: Ensure your client endpoint matches (unix:///run/docker.sock) or that /var/run symlink is intact.

Permissions and groups: docker.sock and the “sudo habit”

The most common “Cannot connect” variant is really:
permission denied while trying to connect to the Docker daemon socket.
Docker often prints both messages, but people only read the first line and go restart services like it’s a fire drill.

Understand what you’re granting

Adding a user to the docker group is convenient. It’s also granting near-root control, because Docker can mount the host filesystem
or run privileged containers. In corporate environments, this should be treated like sudo access.
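
A minimal demonstration of why, assuming you already have Docker access (do not run this on shared systems without authorization): any docker-group member can bind-mount the host root filesystem into a container and read files that normally require root.

cr0x@server:~$ docker run --rm -v /:/host alpine head -n 1 /host/etc/shadow
root:*:19731:0:99999:7:::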

Diagnose permission problems cleanly

cr0x@server:~$ docker ps
permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.45/containers/json": dial unix /var/run/docker.sock: connect: permission denied

Meaning: The daemon is likely running; your user can’t open the socket.

Decision: Fix group membership or use sudo docker for that host (with policy approval). Do not chmod the socket to world-writable.

Do not “chmod 666 /var/run/docker.sock”

It “fixes” the error by making the Docker control socket writable by everyone. That is not a fix. That’s leaving your server keys under the doormat
and then being surprised you got robbed. Also: systemd will recreate the socket on restart with proper permissions, so your “fix” is unstable anyway.

When sudo is acceptable

On personal dev boxes, fine. On shared systems, CI runners, and anything with customer data, you should strongly prefer explicit access patterns:
controlled group membership, rootless Docker, or a dedicated build agent with minimal privileges.

Verify what the socket permissions should look like

cr0x@server:~$ stat -c '%A %U:%G %n' /run/docker.sock
srw-rw---- root:docker /run/docker.sock

Meaning: Only root and docker group can read/write the socket.

Decision: If group isn’t docker or the mode is unusual, check systemd unit overrides or distro packaging changes.
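
To see where those permissions come from on a systemd-packaged install, read the socket unit (and any drop-in overrides) directly. The relevant directives are ListenStream, SocketMode, SocketUser, and SocketGroup:

cr0x@server:~$ systemctl cat docker.socket --no-pager | grep -E 'ListenStream|SocketMode|SocketUser|SocketGroup'
ListenStream=/run/docker.sock
SocketMode=0660
SocketUser=root
SocketGroup=docker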

Wrong endpoint: Docker contexts, DOCKER_HOST, SSH, TLS

If Docker is running and permissions are fine, the next most likely culprit is that you’re connecting to the wrong place.
Contexts made multi-environment work sane. They also made it easy to quietly talk to a dead daemon because your shell remembered a choice.

Know the precedence rules

  • DOCKER_HOST overrides nearly everything.
  • DOCKER_CONTEXT selects a context (and overrides the “current” context).
  • The “current context” is what docker context use sets.
  • Docker Desktop often sets up a context like desktop-linux.

Switch context explicitly and verify

cr0x@server:~$ docker context use default
default
cr0x@server:~$ docker context inspect --format '{{.Name}} -> {{.Endpoints.docker.Host}}' default
default -> unix:///var/run/docker.sock

Meaning: You’re back to the local socket.

Decision: Retry the Docker command. If it works now, you just fixed a human-memory bug, not a daemon bug.

Remote over SSH: diagnose the SSH leg first

cr0x@server:~$ docker --context prod ps
Cannot connect to the Docker daemon at ssh://deploy@prod-host. Is the docker daemon running?

cr0x@server:~$ ssh -o BatchMode=yes -o ConnectTimeout=5 deploy@prod-host 'systemctl is-active docker'
active

Meaning: Docker is active on remote, and SSH works. The remaining issue could be remote socket permissions or an SSH config mismatch.

Decision: Try running a remote docker command via SSH directly as a control test, or check whether remote user can access docker socket.

cr0x@server:~$ ssh deploy@prod-host 'docker ps'
permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: connect: permission denied

Meaning: Remote user lacks permissions.

Decision: Fix the remote user’s access (docker group or sudo policy). Don’t keep poking contexts locally.

TCP endpoints: if you must, do it like an adult

If you see tcp:// in DOCKER_HOST, be suspicious. Port 2375 is typically plain HTTP (no TLS). Port 2376 is commonly TLS.
Plain TCP Docker API is a “remote root shell” if exposed. Great for attackers. Bad for sleep.

cr0x@server:~$ echo "$DOCKER_HOST"
tcp://127.0.0.1:2375

cr0x@server:~$ curl -sS 127.0.0.1:2375/_ping
OK

Meaning: Something is listening on 2375 and responds like Docker.

Decision: If this is unexpected, remove that configuration. If it’s expected, ensure it’s bound to loopback only and preferably protected with TLS.
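
If you genuinely need a TCP endpoint, the safer shape in /etc/docker/daemon.json looks roughly like this sketch; the certificate paths are placeholders, and note that on distros whose systemd unit passes -H fd://, defining "hosts" here conflicts with the unit and requires an ExecStart override:

{
  "hosts": ["unix:///var/run/docker.sock", "tcp://127.0.0.1:2376"],
  "tlsverify": true,
  "tlscacert": "/etc/docker/certs/ca.pem",
  "tlscert": "/etc/docker/certs/server-cert.pem",
  "tlskey": "/etc/docker/certs/server-key.pem"
}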

Docker Desktop: macOS and Windows failure modes

Desktop environments create a special kind of confusion: the Docker CLI is on your host, but the daemon is in a VM.
When Desktop breaks, you can chase phantom Linux services that don’t exist.

macOS: the daemon is not a launchd service you can “systemctl”

On macOS, Docker Desktop manages its own backend. If you get “Cannot connect,” you usually have one of:

  • Desktop app not running or stuck starting
  • context switched away from desktop-linux
  • corrupted Desktop state (rare, but real)
  • VPN/proxy/network filters interfering with VM plumbing

cr0x@server:~$ docker context ls
NAME            DESCRIPTION                               DOCKER ENDPOINT               ERROR
default *       Current DOCKER_HOST based configuration   unix:///var/run/docker.sock
desktop-linux   Docker Desktop                            unix:///Users/cr0x/.docker/run/docker.sock

Meaning: You’re on default but Desktop expects desktop-linux.

Decision: Switch to desktop-linux and retry. If the socket path under your home directory doesn’t exist, Desktop backend isn’t up.

Windows: WSL2 and context confusion

On Windows with WSL2, you can have at least three different realities:
Docker CLI in PowerShell talking to Docker Desktop,
Docker CLI inside a WSL distro talking through Desktop integration,
or a dockerd you installed inside WSL (which you probably shouldn’t).

If you’re inside WSL and you installed Docker Engine there, you might be missing systemd (depending on distro/settings) and the service never starts.
If you’re relying on Desktop integration, your WSL distro should not run its own dockerd at all.

cr0x@server:~$ docker info 2>/dev/null | egrep 'Operating System|Docker Root Dir|Server Version'
 Server Version: 26.1.3
 Operating System: Docker Desktop
 Docker Root Dir: /var/lib/docker

Meaning: You’re talking to Docker Desktop’s backend.

Decision: If it can’t connect, fix Desktop (restart, reset, resolve WSL integration). Don’t debug Linux systemd inside WSL unless you run your own engine.

Joke #2: If Docker Desktop says it’s “Starting…” for 20 minutes, it’s not starting—it’s contemplating your life choices.

Rootless Docker, CI runners, and “it works on my laptop”

Rootless Docker changes the location of the socket and the ownership model. That’s the point. It also breaks assumptions baked into scripts,
CI jobs, and that one README nobody has updated since the last reorg.

Recognize rootless sockets

Rootless Docker usually listens on a user-owned socket under something like /run/user/1001/docker.sock.
Your CLI might still be pointing to /var/run/docker.sock, which will fail (or connect to a different daemon).

cr0x@server:~$ ls -l /run/user/$(id -u)/docker.sock
srw-rw---- 1 cr0x cr0x 0 Jan  2 10:21 /run/user/1001/docker.sock

Meaning: Rootless socket exists for your user.

Decision: Point your client to it (via context or DOCKER_HOST=unix:///run/user/1001/docker.sock) and stop trying to “fix” /var/run.
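
Two ways to point the client at the rootless socket, plus the health check for the rootless daemon itself (it runs as a per-user systemd service, so systemctl needs --user). The context name is arbitrary, and the official rootless setup script may already have created one for you:

cr0x@server:~$ export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock     # session-only override
cr0x@server:~$ docker context create rootless-local --docker "host=unix:///run/user/1001/docker.sock"
cr0x@server:~$ systemctl --user status docker --no-pager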

CI runners: ephemeral, restricted, and sometimes intentionally daemonless

In CI, “Cannot connect to the Docker daemon” can be correct behavior: you’re on a runner that doesn’t provide Docker-in-Docker, or the service container
wasn’t started. The right fix is to align the pipeline architecture, not to add privileged mode everywhere.

cr0x@server:~$ ls -l /var/run/docker.sock
ls: cannot access '/var/run/docker.sock': No such file or directory

Meaning: There is no local daemon socket. In CI, this often means the job is not supposed to use Docker directly.

Decision: Either mount the socket (if your security model allows it), use a remote builder, or use a rootless build toolchain. Don’t “install docker” blindly.
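
If your security model does allow socket mounting, the shape looks like this; the CLI image tag is an assumption, use whatever version you standardize on. Remember that the job then controls the host daemon, which is exactly why many CI platforms forbid it:

cr0x@server:~$ docker run --rm -v /var/run/docker.sock:/var/run/docker.sock docker:26.1-cli docker ps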

Docker-in-Docker (DinD): understand the trade

DinD typically runs a separate daemon inside a container. If you forget to start it, or if it lacks required privileges/storage, you’ll get the same error.
It’s a valid pattern for some CI workloads. It’s also a performance and security trade.

cr0x@server:~$ docker run --rm docker:26.1-dind dockerd --version
Docker version 26.1.3, build 9e34c2a

Meaning: The dind image contains dockerd, but that doesn’t mean your CI environment will allow it to run.

Decision: If you need DinD, ensure the runner supports privileged containers and provides enough disk I/O. Otherwise use a remote daemon or BuildKit-based approach.
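
A minimal local sketch of the pattern, assuming your environment permits privileged containers; the container name is arbitrary, and the inner daemon needs a few seconds before it answers:

cr0x@server:~$ docker run -d --privileged --name ci-dind docker:26.1-dind
cr0x@server:~$ sleep 10 && docker exec ci-dind docker ps    # the inner CLI talks to the inner daemon’s own socket
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES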

Storage and disk: when the daemon can’t breathe

Storage failures are the sneaky cousin of the “Cannot connect” error. The CLI can’t connect because the daemon never made it to the point where it can accept connections.
Or it accepts connections but is unusably slow because it’s thrashing disk.

Disk full is not just disk full

Docker cares about:

  • Bytes (df -h)
  • Inodes (df -i)
  • Filesystem features (OverlayFS expectations)
  • Write amplification (overlay layers plus heavy logging equals sadness)

Identify the storage driver and whether it’s appropriate

cr0x@server:~$ docker info 2>/dev/null | egrep 'Storage Driver|Backing Filesystem|Supports d_type'
 Storage Driver: overlay2
 Backing Filesystem: ext4
 Supports d_type: true

Meaning: overlay2 on ext4 with d_type support is the good path.

Decision: If you see Supports d_type: false or an unexpected driver, expect weird layer behavior and startup issues; fix filesystem/config before blaming Docker.

When performance problems look like connectivity problems

Sometimes the daemon is “running” but so wedged on I/O that the client times out and reports a connect failure.
You’ll see long delays, hung docker ps, and logs with timeouts talking to containerd.

cr0x@server:~$ sudo iostat -xz 1 3
Linux 6.8.0 (server)  01/02/2026  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.23    0.00    6.12   58.90    0.00   24.75

Device            r/s     rkB/s   rrqm/s  %rrqm  r_await  w/s     wkB/s   w_await  aqu-sz  %util
nvme0n1         20.0    1200.0     0.0    0.0    2.10   900.0  64000.0   45.80   35.2   99.5

Meaning: Disk is pegged; write latency is high. Docker metadata operations will stall.

Decision: Treat this as a resource incident: reduce write load (logs!), free space, move Docker data-root to faster disk, or scale out workload.
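
Moving data-root is simple but must happen with the daemon stopped; a sketch assuming the faster disk is mounted at /mnt/fast (the path is a placeholder):

cr0x@server:~$ sudo systemctl stop docker docker.socket
cr0x@server:~$ sudo rsync -aHAX /var/lib/docker/ /mnt/fast/docker/
cr0x@server:~$ sudoedit /etc/docker/daemon.json    # add: "data-root": "/mnt/fast/docker"
cr0x@server:~$ sudo systemctl start docker && docker info --format '{{.DockerRootDir}}'
/mnt/fast/docker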

Log growth: the silent /var killer

Docker’s default json-file logging can eat disks when applications spam logs. Your daemon doesn’t care that it’s “just logs”;
it cares that it can’t write metadata and layers.

cr0x@server:~$ sudo find /var/lib/docker/containers -name '*-json.log' -printf '%s %p\n' | sort -n | tail -n 3
8246331120 /var/lib/docker/containers/1d9c.../1d9c...-json.log
9123341120 /var/lib/docker/containers/88af.../88af...-json.log
10322354112 /var/lib/docker/containers/a3b1.../a3b1...-json.log

Meaning: Individual container logs are multi-GB. This can fill the disk and prevent dockerd from starting.

Decision: Implement log rotation via daemon.json (max-size, max-file) and/or move logs to a centralized system. Then truncate the worst offenders carefully.
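
For triage, you can truncate a runaway json-file log in place without stopping the container; the json-file driver appends, so writes continue cleanly after truncation. The IDs below are placeholders, and remember that daemon.json rotation settings only apply to containers created after the daemon restart:

cr0x@server:~$ sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log
cr0x@server:~$ docker inspect --format '{{.HostConfig.LogConfig}}' <container-name>    # confirm per-container log options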

Three corporate-world mini-stories

1) The incident caused by a wrong assumption: “default context means local”

A mid-size company had a tidy setup: developers used Docker Desktop locally, production ran on Linux VMs, and ops used Docker contexts to manage multiple environments.
The team had a “prod” context configured over SSH for maintenance tasks. It was convenient. Too convenient.

One morning, an engineer got “Cannot connect to the Docker daemon” on their laptop. They assumed Docker Desktop had died. So they did what everyone does:
restarted Docker Desktop, rebooted their Mac, and still got the error. Frustration rose. Slack messages flew. Nobody looked at docker context ls.

The actual problem: their shell profile exported DOCKER_CONTEXT=prod from a previous week’s hotfix session. So the laptop’s Docker CLI wasn’t trying to connect locally at all.
It was trying to reach the production SSH endpoint. Except production had rotated host keys and the local SSH config now rejected the connection.
The CLI’s error was technically accurate but emotionally unhelpful.

The fix took two minutes once someone asked the boring question: “Where are you connecting?”
They removed the environment export, switched context back to Desktop, and documented a team policy: never export persistent Docker context variables in shell profiles.
Use explicit commands and short-lived shells for prod work. Convenience is a tax you pay later.

2) The optimization that backfired: “Let’s put Docker on the big shared filesystem”

Another organization tried to optimize disk usage. Their build servers ran out of space, and storage was expensive.
Someone proposed moving Docker’s data-root to a shared network filesystem mounted on every builder.
One copy of layers, shared across machines. In PowerPoint, it looked like victory.

They implemented it, updated /etc/docker/daemon.json, restarted Docker, and for a few days it “worked.”
Then the first real incident: intermittent “Cannot connect to the Docker daemon” on random builders. Not all at once; one here, one there.
systemd showed docker.service “active,” but the CLI hung or failed. Logs had containerd timeouts and overlay errors.

The backfire came from a few realities that don’t show up in a cost spreadsheet:
overlay2 and container metadata are extremely chatty. Latency matters. Locking semantics matter. Network hiccups matter.
Under load, the shared filesystem introduced long write waits; dockerd became unresponsive, and sometimes crashed during startup scans.

The final fix was not heroic. It was architectural: keep Docker’s writable layers on local SSD, use a registry for sharing images,
and let caching happen at the right layers (build cache, registry mirrors) rather than the filesystem pretending to be a single big disk.
The “optimization” was trying to share the wrong thing.

3) The boring but correct practice that saved the day: log rotation and disk headroom

A finance-adjacent platform had strict uptime requirements and a pragmatic SRE team. They weren’t exciting people at parties.
They did something unfashionable: they enforced disk headroom policies and configured Docker log rotation everywhere, even in dev.

One Friday, a new service version shipped with a bug that spammed logs during a retry loop. In other organizations, this is where disks fill,
Docker stops, and the weekend disappears. Here, the container logs rotated automatically. Disk usage rose, then flattened. Alerts fired at “unusual logging,”
not “system is dead.”

Engineers rolled back, fixed the bug, and went home. No daemon crash. No cascading “Cannot connect” errors. No midnight filesystem triage.
The root cause still mattered, but the platform had enough guardrails that the failure didn’t spread.

The lesson is annoyingly predictable: the “boring defaults” aren’t boring when they prevent you from debugging storage at 3 a.m.
Put limits on logs. Keep headroom. Assume humans will ship noisy software.

Common mistakes: symptom → root cause → fix

1) “Cannot connect…” after switching between environments

  • Symptom: Works yesterday, fails today; systemctl status docker looks fine.
  • Root cause: Wrong Docker context or DOCKER_HOST/DOCKER_CONTEXT set in shell.
  • Fix: docker context ls, unset env overrides, docker context use default or the correct Desktop context.

2) “permission denied … /var/run/docker.sock”

  • Symptom: Error mentions permission denied; Docker service is active.
  • Root cause: User not in docker group, or socket has unexpected permissions due to unit override.
  • Fix: Add user to docker group (with policy), re-login/newgrp, verify socket ownership root:docker and mode 660.

3) “No such file or directory” for docker.sock

  • Symptom: Socket path missing.
  • Root cause: Daemon not started, docker.socket disabled, or you’re in a namespace/container without the socket bind mount.
  • Fix: Start/enable docker service and socket; in containers, mount /var/run/docker.sock intentionally or use a remote daemon.

4) Docker service won’t start after config change

  • Symptom: systemctl start docker fails; logs show parse errors.
  • Root cause: Invalid JSON or unsupported key in /etc/docker/daemon.json.
  • Fix: Validate with jq, revert last change, restart. Keep changes minimal and reviewed.

5) Docker “active” but CLI hangs or times out

  • Symptom: docker ps hangs; eventually “Cannot connect” or timeouts; high load.
  • Root cause: Disk I/O saturation, containerd stuck, or huge layer/metadata operations during startup.
  • Fix: Check iowait, free space, reduce log spam, consider moving data-root to faster disk, restart containerd if needed.

6) Docker Desktop: CLI can’t connect but Linux troubleshooting shows nothing

  • Symptom: On macOS/Windows, you try systemctl and it’s nonsense.
  • Root cause: Desktop backend not running, corrupted state, or wrong context.
  • Fix: Switch to Desktop context, restart Desktop, verify WSL integration settings (Windows), reset state only if necessary.

7) Rootless confusion: socket exists, but Docker CLI points elsewhere

  • Symptom: You have /run/user/UID/docker.sock but CLI tries /var/run/docker.sock.
  • Root cause: Context/env still points to rootful endpoint.
  • Fix: Set the correct context or export DOCKER_HOST to the rootless socket for that session.

8) Corporate proxies: Docker commands fail in ways that look like daemon issues

  • Symptom: “Cannot connect” appears during pulls/builds, especially with remote contexts.
  • Root cause: Proxy env vars applied inconsistently; NO_PROXY missing for local socket/hosts; corporate MITM intercepts TLS.
  • Fix: Confirm proxy variables and NO_PROXY; keep daemon proxy config explicit rather than inheriting random shell env.
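
The daemon reads proxy settings from its own environment, not from your interactive shell; the conventional way to make that explicit is a systemd drop-in. The proxy host and NO_PROXY entries below are placeholders:

cr0x@server:~$ sudo mkdir -p /etc/systemd/system/docker.service.d
cr0x@server:~$ sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1,.internal.example.com"
EOF
cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart docker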

Checklists / step-by-step plan

Checklist A: You’re on Linux and you just need Docker working now

  1. Confirm endpoint: docker context ls; env | egrep '^DOCKER_'.
  2. Check service: systemctl is-active docker.
  3. If inactive: sudo systemctl start docker, then sudo journalctl -u docker -n 100 if it fails.
  4. If active but failing: check socket exists and perms: ls -l /run/docker.sock and id.
  5. If permission denied: decide policy (docker group vs sudo). Add group only if acceptable.
  6. If disk-related: run df -h and df -i; reclaim space deliberately.

Checklist B: You suspect wrong context/remote endpoint

  1. docker context ls: identify active context and endpoint.
  2. env | egrep '^(DOCKER_HOST|DOCKER_CONTEXT)=': remove overrides.
  3. docker context use default (or your intended one).
  4. Retest with docker version to see client/server sections.
  5. If using SSH: test SSH separately; then run ssh host docker ps to isolate remote permissions.

Checklist C: Docker service fails to start (don’t brute-force it)

  1. sudo journalctl -u docker -n 200: capture the real error.
  2. Validate /etc/docker/daemon.json with jq.
  3. Check disk bytes and inodes: df -h, df -i.
  4. Check storage driver and backing FS expectations (docker info if it starts; otherwise consult logs).
  5. Only then restart: sudo systemctl restart docker.

Checklist D: Docker Desktop (macOS/Windows)

  1. Confirm context is Desktop: docker context ls.
  2. If socket path under your home directory is missing, Desktop backend is down.
  3. Restart Desktop; if WSL2, confirm integration is enabled for the correct distro.
  4. Avoid running a second dockerd inside WSL unless you intentionally own that complexity.

FAQ

1) Why does Docker say “Is the docker daemon running?” when it is running?

Because the CLI can’t reach the endpoint it’s configured to use. The daemon might be running on a different socket, different context, or different machine.
Check contexts and DOCKER_HOST first.

2) Should I just run everything with sudo?

For quick debugging, sudo docker ps can confirm a permissions issue. As a long-term habit, it’s messy and hides real problems.
Either manage docker group membership explicitly or use rootless Docker where it fits.

3) Is adding myself to the docker group safe?

“Safe” depends on your threat model. Practically, docker group membership is close to root on that host.
Treat it like administrative access and restrict it accordingly.

4) Why does /var/run/docker.sock not exist?

Either Docker isn’t started, the socket unit is disabled, or you’re in an environment (like a container or minimal CI runner) where the socket is not mounted.
Also note that many distros use /run/docker.sock with /var/run as a symlink.

5) I fixed group membership but it still says permission denied. What now?

Confirm your current shell has updated groups (id). If not, re-login or use newgrp docker.
Then verify the socket group is actually docker and mode is srw-rw----.

6) Docker Desktop is running, but the CLI still can’t connect. What’s the most common cause?

Wrong context. People bounce between “default” and “desktop-linux” (or similar) and forget.
Run docker context ls and switch to the Desktop context.

7) Can a full disk cause “Cannot connect to the Docker daemon”?

Yes. A full disk (or full inodes) can prevent dockerd from starting, or wedge it during storage initialization.
Check journalctl -u docker, df -h, and df -i.

8) What’s the fastest way to tell if I’m hitting the wrong daemon (local vs remote)?

Use docker context inspect to see the endpoint, then run docker info and look at the “Operating System” / “Name” fields.
Desktop and remote engines often identify themselves clearly.

9) How do I avoid this class of outage in production?

Keep disk headroom and log rotation configured, monitor docker.service and containerd health, avoid exotic storage placements for data-root,
and make context usage explicit in operational scripts (no hidden env exports).

10) Is it ever a Docker client/server version mismatch?

Rarely for pure “cannot connect,” but it can show up as API errors after connection.
If you get connection but commands fail oddly, compare API versions in docker version output.

Conclusion: practical next steps

“Cannot connect to the Docker daemon” isn’t a mystery; it’s a routing problem, a permissions problem, or a daemon health problem.
The trick is refusing to treat it like a single bug.

  1. Pin down the endpoint: contexts and DOCKER_HOST before anything else.
  2. Check service health and logs: if dockerd can’t start, logs will tell you why.
  3. Fix permissions the right way: group membership or rootless, not world-writable sockets.
  4. Respect storage: disk headroom, inode monitoring, and log rotation prevent “connectivity” incidents that are really filesystem incidents.

Do those four things and this error goes back to being what it should have been all along: a minor inconvenience, not a personality test.
