Docker Daemon Won’t Start: Read This Log First (Then Fix It)

When the Docker daemon won’t start, your host becomes a museum exhibit: containers frozen in time, CI jobs stuck, deploys rolling back, and someone asking if “we can just reboot it again.” You can reboot, sure. You can also microwave a wet laptop. Neither is a strategy.

The fastest path out is not a random sequence of restarts. It’s one clean read of the right log, followed by a small number of deliberate commands that tell you what broke: config, storage, kernel features, networking rules, permissions, or containerd.

Fast diagnosis playbook (what to check first)

If you only have five minutes and a pager vibrating your molars, do this in order. The aim is to identify the failure class quickly: config parse failure, runtime dependency failure, storage corruption/capacity, kernel feature mismatch, or networking rules failure.

First: systemd says why it refused to keep Docker alive

Docker is usually managed by systemd. systemd has the first opinion that matters: exit code and immediate stderr.

cr0x@server:~$ systemctl status docker --no-pager -l
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2026-01-02 10:12:54 UTC; 17s ago
    Process: 1842 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
   Main PID: 1842 (code=exited, status=1/FAILURE)
        CPU: 230ms

Jan 02 10:12:54 server dockerd[1842]: failed to start daemon: error initializing graphdriver: overlay2: failed to mount /var/lib/docker/overlay2: invalid argument
Jan 02 10:12:54 server systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jan 02 10:12:54 server systemd[1]: docker.service: Failed with result 'exit-code'.
Jan 02 10:12:54 server systemd[1]: Failed to start Docker Application Container Engine.

Decision: Take the first "failed to start daemon:" line seriously; it usually names the root-cause class. Here it screams "overlay2 mount invalid argument" → kernel/filesystem/overlayfs mismatch, not "Docker bug."

Second: journalctl for Docker gives the full stack, not just the headline

cr0x@server:~$ journalctl -u docker -b --no-pager -n 200
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54.118922635Z" level=info msg="Starting up"
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54.152001115Z" level=error msg="failed to mount overlay: invalid argument" storage-driver=overlay2
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54.152114935Z" level=fatal msg="Error starting daemon: error initializing graphdriver: overlay2: failed to mount /var/lib/docker/overlay2: invalid argument"

Decision: If you see level=fatal followed by a concrete subsystem (graphdriver, iptables, daemon.json), stop guessing. Pivot into that subsystem’s checks.

Third: check capacity and the filesystem under /var/lib/docker

Disk full and inode exhaustion don’t always announce themselves politely. They just make daemons behave like they forgot how to write.

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4   80G   79G  300M 100% /

cr0x@server:~$ df -hi /var/lib/docker
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p4   5.0M   5.0M       0  100% /

Decision: If either blocks or inodes are at 100%, your “Docker won’t start” is a storage incident. Free space first; do not change drivers, reinstall packages, or “reset Docker” until the host can write.

Fourth: validate the daemon config before you chase ghosts

One trailing comma in JSON can take out your entire container platform. I wish that were a joke. (It isn’t.)

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", },
  "iptables": true
}

Decision: That trailing comma after "10m" will prevent dockerd from starting. Fix the JSON, then restart. Don’t touch anything else.
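
For reference, here is the same file with the comma removed; this minimal shape is all the host above needs, and anything fancier belongs in a reviewed template, not a hand edit mid-incident:

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m" },
  "iptables": true
}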

Fifth: verify containerd is alive (or confirm it isn’t)

cr0x@server:~$ systemctl status containerd --no-pager -l
● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2026-01-02 10:08:11 UTC; 6min ago
       Docs: man:containerd(8)
   Main PID: 1210 (containerd)

Decision: If containerd is down, Docker may fail with a socket or runtime error. Fix containerd first. If containerd is healthy, move on.

The one log to read first (and why)

Read the systemd journal for the docker unit before you read anything else. Not because it’s fancy. Because it’s authoritative. It captures:

  • Why systemd stopped restarting the service (start-limit hits, crash loops).
  • Exactly what dockerd printed to stderr/stdout.
  • The timing relative to other services (containerd, networking, mounts).

On most modern distros, this is the money command:

cr0x@server:~$ journalctl -u docker -b --no-pager -o cat
time="2026-01-02T10:12:54.118922635Z" level=info msg="Starting up"
time="2026-01-02T10:12:54.152114935Z" level=fatal msg="Error starting daemon: failed to load listeners: can't create unix socket /var/run/docker.sock: permission denied"

Decision: That error is not a “Docker can’t talk to Docker” issue. It’s a filesystem permission/ownership/SELinux/AppArmor issue on the socket path (or its parent). You now know what class of failure you’re in.
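
If you want to confirm the class before touching anything, look at the socket path and at the security layer. These are stock commands; the -Z listing only means something on SELinux hosts, and aa-status needs the AppArmor utilities installed. A missing socket and a stale one left behind by a previous daemon are different clues:

cr0x@server:~$ ls -ld /var/run /var/run/docker.sock
cr0x@server:~$ sudo ls -lZ /var/run/docker.sock
cr0x@server:~$ sudo aa-status | grep -i docker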

Don’t start with /var/log/docker.log unless you’re on a system that explicitly logs there. Many installations don’t. Don’t start with random Stack Overflow fixes. Your system has already told you what’s wrong; you just haven’t listened yet.

Interesting facts and history (so the errors make sense)

  • Docker originally used LXC (Linux Containers) for isolation before moving to libcontainer, which changed how low-level kernel features were consumed.
  • containerd was split out of Docker so the core runtime could evolve independently; that’s why “Docker is down” can actually mean “containerd is down.”
  • overlay2 became the default storage driver on many distros because it’s fast and space-efficient, but it’s picky about filesystem features (especially on older kernels).
  • iptables integration is not optional for classic Docker networking; when firewalld/nftables/iptables disagree, Docker can fail at startup, not just at container run time.
  • cgroups v2 adoption changed resource control plumbing; older Docker versions on newer distros can fail early with cgroup driver mismatches.
  • Docker’s logging defaults (json-file) can fill disks quietly; the daemon failing to start after a disk-full event is often self-inflicted log growth.
  • Start-limit behavior is a feature of systemd: after repeated failures, it stops trying. Operators often misread this as “Docker froze.”
  • /var/lib/docker is not sacred; it’s just state. It contains images, layers, metadata, and volumes (depending on config). It can be migrated, but doing it casually is how you earn weekend work.
  • Rootless Docker exists to reduce daemon privileges, but it adds a separate class of failures around user services, XDG_RUNTIME_DIR, and cgroup delegation.

Practical tasks: commands, outputs, and the decision you make

You don’t fix Docker by chanting “restart.” You fix Docker by collecting a small set of facts and making a decision after each one. Below are tasks I’ve used in real incidents, with realistic outputs and what they mean.

Task 1: Confirm the unit state and the last failure reason

cr0x@server:~$ systemctl is-enabled docker; systemctl is-active docker; systemctl status docker --no-pager -l
enabled
failed
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2026-01-02 10:12:54 UTC; 2min 11s ago
    Process: 1842 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=1/FAILURE)
Jan 02 10:12:54 server dockerd[1842]: failed to start daemon: Error initializing network controller: failed to create NAT chain DOCKER: iptables failed

Decision: If the failure points to network controller / iptables, don’t waste time on storage checks first. Jump to the iptables/nftables section.

Task 2: Pull the full boot-scoped logs for docker

cr0x@server:~$ journalctl -u docker -b --no-pager -n 300
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=info msg="Starting up"
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=info msg="libcontainerd: started new containerd process" pid=1901
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=error msg="iptables failed: iptables -t nat -N DOCKER: iptables v1.8.7 (nf_tables): Chain already exists."
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=fatal msg="Error starting daemon: Error initializing network controller: iptables failed"

Decision: “Chain already exists” hints at stale rules from a previous run or a conflict with nftables backend. You’re in the networking rules failure mode.

Task 3: Check whether systemd is start-limiting you

cr0x@server:~$ systemctl status docker --no-pager -l | sed -n '1,25p'
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: failed (Result: start-limit-hit) since Tue 2026-01-02 10:13:09 UTC; 18s ago
Jan 02 10:13:09 server systemd[1]: docker.service: Start request repeated too quickly.
Jan 02 10:13:09 server systemd[1]: docker.service: Failed with result 'start-limit-hit'.

Decision: Clear the start-limit after you fix the underlying problem; otherwise systemd keeps refusing start requests and your fix looks like it didn't work.

cr0x@server:~$ sudo systemctl reset-failed docker

Task 4: Validate /etc/docker/daemon.json without trusting your eyes

cr0x@server:~$ sudo python3 -m json.tool /etc/docker/daemon.json
Expecting property name enclosed in double quotes: line 3 column 36 (char 61)

Decision: Fix JSON syntax first. If this fails, Docker will not start. Period. After correcting the file, rerun the validator until it prints formatted JSON and exits 0.

Task 5: Extract Docker’s effective command line (drop-in overrides matter)

cr0x@server:~$ systemctl cat docker --no-pager
# /lib/systemd/system/docker.service
[Service]
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --data-root /mnt/docker-data --storage-driver=overlay2

Decision: If you see overrides, treat them as suspect until proven otherwise. Many “Docker broke after update” incidents are actually “old override meets new defaults.”
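
If the drop-in turns out to be the culprit, fix it through systemd so the unit definition gets reloaded cleanly; a sketch, assuming the override path shown above:

cr0x@server:~$ sudo systemctl edit docker
cr0x@server:~$ sudo systemctl daemon-reload
cr0x@server:~$ sudo systemctl restart docker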

Task 6: Check the Docker data-root mount and filesystem type

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/docker
/dev/nvme0n1p4 ext4 rw,relatime

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /mnt/docker-data
/dev/sdb1 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k

Decision: Overlay2 on XFS generally requires ftype=1. If you migrated Docker data to an older XFS formatted with ftype=0, overlay2 will fail.

Task 7: Verify XFS ftype (critical for overlay2 on XFS)

cr0x@server:~$ sudo xfs_info /dev/sdb1 | grep ftype
naming   =version 2              bsize=4096   ascii-ci=0, ftype=0

Decision: ftype=0 is a hard stop for overlay2. Your fix is to reformat with ftype=1 (data migration required) or switch storage driver (usually a bad day). Do not keep retrying.
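
If you take the reformat path, it destroys everything on that device, so treat it as a planned migration rather than a hotfix. A sketch, assuming the device and mountpoint from the example above and an existing fstab entry:

cr0x@server:~$ sudo systemctl stop docker containerd
cr0x@server:~$ sudo umount /mnt/docker-data
cr0x@server:~$ sudo mkfs.xfs -f -n ftype=1 /dev/sdb1
cr0x@server:~$ sudo mount /mnt/docker-data
cr0x@server:~$ sudo systemctl start containerd docker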

Task 8: Check kernel support for overlayfs (and spot “invalid argument” causes)

cr0x@server:~$ uname -r
4.15.0-213-generic

cr0x@server:~$ lsmod | grep overlay
overlay               102400  0

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Mon Jan  2 10:12:54 2026] overlayfs: filesystem on '/var/lib/docker/overlay2' not supported as upperdir

Decision: That dmesg line tells you the kernel rejected the underlying filesystem as an overlay upperdir (common with certain network filesystems, mis-mounted paths, or unsupported options). Fix the mount/filesystem choice; Docker can’t paper over it.

Task 9: Confirm containerd socket and health

cr0x@server:~$ ls -l /run/containerd/containerd.sock
srw-rw---- 1 root root 0 Jan  2 10:08 /run/containerd/containerd.sock

cr0x@server:~$ systemctl status containerd --no-pager -l | sed -n '1,15p'
● containerd.service - containerd container runtime
     Active: active (running) since Tue 2026-01-02 10:08:11 UTC; 6min ago

Decision: If the socket is missing or containerd is failing, fix containerd before Docker. If containerd is fine, Docker’s error is elsewhere.

Task 10: Look for obvious permission denials (SELinux/AppArmor show up here)

cr0x@server:~$ sudo journalctl -b --no-pager | grep -E 'DENIED|apparmor="DENIED"|avc:'
Jan 02 10:12:54 server kernel: audit: type=1400 apparmor="DENIED" operation="create" profile="docker-default" name="/var/run/docker.sock" pid=1842 comm="dockerd"

Decision: If you see explicit denials, stop treating it as a Docker config issue. Fix the policy/profile or the file context. Starting Docker with “just disable security” is how incidents graduate into breaches.

Task 11: Inspect iptables backend mismatch (iptables vs nft)

cr0x@server:~$ sudo iptables --version
iptables v1.8.7 (nf_tables)

cr0x@server:~$ sudo iptables -t nat -S | sed -n '1,25p'
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N DOCKER
-N DOCKER-ISOLATION-STAGE-1
-N DOCKER-ISOLATION-STAGE-2

Decision: If Docker complains about chains existing, you may have conflicting rule managers (firewalld, kube-proxy, custom scripts). Decide who owns the rules. In a pinch, flush only Docker-managed chains carefully—after you understand the blast radius.
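
If the decision lands on "align the backend," Debian/Ubuntu-family hosts switch it through alternatives; a sketch, and only after you've confirmed legacy is the right target for everything else on the box (kube-proxy, firewalld, security agents):

cr0x@server:~$ sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
cr0x@server:~$ sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
cr0x@server:~$ sudo systemctl restart docker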

Task 12: Confirm cgroup mode and driver mismatch

cr0x@server:~$ mount | grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

cr0x@server:~$ journalctl -u docker -b --no-pager | grep -i cgroup | tail -n 5
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=fatal msg="Error starting daemon: Devices cgroup isn't mounted"

Decision: This often indicates an older Docker build or wrong configuration for cgroups v2. The fix is version alignment (upgrade Docker) or configuring the correct cgroup driver/mode for your distro. Don’t hack around it by disabling resource controls unless you enjoy performance roulette.
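
A quick way to confirm which cgroup mode the host is actually in, independent of what Docker thinks: stat the cgroup mount. It prints cgroup2fs on a unified (v2) host and tmpfs on a v1 or hybrid host:

cr0x@server:~$ stat -fc %T /sys/fs/cgroup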

Task 13: Check for corruption or partial writes after power events

cr0x@server:~$ journalctl -u docker -b --no-pager | tail -n 20
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=error msg="failed to load container metadata" error="unexpected end of JSON input"
Jan 02 10:12:54 server dockerd[1842]: time="2026-01-02T10:12:54Z" level=fatal msg="Error starting daemon: error while opening volume store metadata database"

Decision: Metadata corruption is real. You now decide: recover from backup/restore, surgically remove the corrupted object, or rebuild Docker state. Your choice depends on whether you can lose local images/volumes.

Task 14: Dry-run dockerd directly (useful when systemd hides stderr)

cr0x@server:~$ sudo dockerd --debug --validate --config-file=/etc/docker/daemon.json
unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: hosts

Decision: You’ve got conflicting config sources (systemd ExecStart flags vs daemon.json). Remove one source of truth. In production, I prefer leaving hosts to systemd and keeping daemon.json for daemon settings, not listeners.
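
A sketch of that split, following the convention I just described: listener flags live only in the systemd drop-in, and daemon.json carries daemon behavior with no "hosts" key at all.

# /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

# /etc/docker/daemon.json  (no "hosts" key here)
{
  "log-driver": "json-file",
  "iptables": true
}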

Task 15: Check for port conflicts (common with old dockerd flags)

cr0x@server:~$ sudo ss -ltnp | grep -E ':2375|:2376' || true
LISTEN 0      4096         0.0.0.0:2375      0.0.0.0:*    users:(("dockerd",pid=902,fd=7))

Decision: If an old dockerd process is still bound (or another service), your new daemon can’t bind. Kill the stray process cleanly, then fix the unit so you don’t spawn multiple daemons.

Task 16: Free space safely without destroying everything

cr0x@server:~$ sudo du -sh /var/lib/docker/* 2>/dev/null | sort -h | tail -n 10
2.1G	/var/lib/docker/containers
12G	/var/lib/docker/overlay2
18G	/var/lib/docker/volumes

cr0x@server:~$ sudo find /var/lib/docker/containers -name '*-json.log' -size +200M -printf '%p %s\n' | head
/var/lib/docker/containers/2f3.../2f3...-json.log 987654321

Decision: If container logs are the culprit, truncate logs rather than deleting container directories.

cr0x@server:~$ sudo truncate -s 0 /var/lib/docker/containers/2f3.../2f3...-json.log

Decision: Get the daemon starting first, then implement log rotation properly. Disk full is an outage; perfect hygiene can wait an hour.
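
When you do circle back to hygiene, rotation for the default json-file driver is two keys in daemon.json; the sizes here are illustrative, so pick ones that match your disk budget. Note that log-opts only apply to containers created after the change; existing containers keep their old settings:

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}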

The big failure modes (what they look like in logs)

Docker daemon startup failures cluster into a handful of buckets. Recognize the bucket, and you’ve cut the incident in half.

1) Config parse and configuration conflicts

Typical log lines:

  • unable to configure the Docker daemon with file ...: invalid character
  • directives are specified both as a flag and in the configuration file
  • unknown log opt after a version change

What’s really happening: Docker is strict about JSON syntax and about duplicate settings. systemd flags are “settings,” too.

Fix philosophy: One source of truth. Keep daemon.json minimal; put listeners (-H) in systemd or vice versa, but don’t split them across both.

2) Storage driver failures (overlay2, devicemapper, btrfs, zfs)

Typical log lines:

  • error initializing graphdriver
  • overlay2: failed to mount
  • failed to register layer

What’s really happening: The kernel and filesystem are negotiating features. If they disagree, overlayfs returns “invalid argument” and Docker translates that into despair.

Fix philosophy: Confirm filesystem type/options and kernel support. Don’t change the storage driver mid-incident unless you’re prepared to lose local images and potentially volumes.

3) Capacity and filesystem health

Typical log lines:

  • no space left on device
  • read-only file system after an I/O error
  • database is locked or metadata partial reads after abrupt shutdown

What’s really happening: Docker is state-heavy. If the host storage can’t write reliably, startup becomes the first victim.

Fix philosophy: Restore write capability (space, fsck, remount, fix underlying disk). Only then consider Docker-specific remediation.

4) containerd/runtime dependency breakage

Typical log lines:

  • failed to dial "/run/containerd/containerd.sock"
  • containerd: connect: no such file or directory

What’s really happening: Docker delegates runtime responsibilities. If containerd is missing, incompatible, or not running, Docker can’t proceed.

Fix philosophy: Treat containerd as a prerequisite service. Align versions via your package manager and keep the unit healthy.

5) Networking initialization failures (iptables/nftables/firewalld)

Typical log lines:

  • failed to create NAT chain DOCKER
  • iptables: No chain/target/match by that name
  • Chain already exists with nf_tables backend

What’s really happening: Docker tries to program NAT and filter rules. If another actor is managing those tables differently (or the iptables backend changed), Docker can’t create what it expects.

Fix philosophy: Decide rule ownership. If you must intervene, do so surgically and document it. Random flushes are how you cut off SSH to your own host.

6) Permissions, LSMs, and socket creation

Typical log lines:

  • can't create unix socket /var/run/docker.sock: permission denied
  • apparmor="DENIED" or SELinux AVC denials

What’s really happening: Docker needs to create a privileged socket and mount namespaces. Security modules and filesystem permissions can stop it cold.

Fix philosophy: Read the denial. Fix the policy/context/ownership. Disabling SELinux “because Docker” is like removing smoke alarms because you burnt toast.

One quote, because it's still the job: "Hope is not a strategy."

Common mistakes: symptom → root cause → fix

These show up constantly because they’re easy to create and annoying to diagnose under pressure.

1) Docker stuck in start-limit-hit

Symptom: systemctl shows start-limit-hit and refuses new starts.

Root cause: Docker crashed repeatedly; systemd stopped retrying.

Fix: Fix the underlying error first, then:

cr0x@server:~$ sudo systemctl reset-failed docker
cr0x@server:~$ sudo systemctl start docker

2) “invalid character” or JSON errors on startup

Symptom: Docker fails instantly after a config tweak.

Root cause: Invalid JSON, comments in JSON, trailing commas, or wrong types.

Fix: Validate the file, then correct:

cr0x@server:~$ sudo python3 -m json.tool /etc/docker/daemon.json

3) overlay2 “failed to mount” / “invalid argument”

Symptom: graphdriver init fails, overlay2 mount errors.

Root cause: Unsupported filesystem for upperdir (common with NFS), XFS ftype=0, or kernel/filesystem mismatch.

Fix: Put Docker data-root on a supported local filesystem (ext4, XFS with ftype=1). Verify with xfs_info and dmesg. Migrate state if needed.

4) iptables chain errors after firewall changes

Symptom: Docker fails with NAT chain creation errors.

Root cause: Conflicting firewall managers or iptables backend switch (legacy vs nf_tables).

Fix: Align iptables tooling and rule ownership. In controlled environments, set the system to a consistent backend and make firewalld/kube/Docker coexist intentionally.

5) “permission denied” on /var/run/docker.sock

Symptom: Docker can’t create its socket.

Root cause: Wrong permissions on /var/run subpaths, stale socket owned by the wrong user, or LSM denial.

Fix: Remove stale socket (if Docker is stopped), correct ownership, address AppArmor/SELinux denials. Don’t chmod 777 your way into regret.
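
A minimal sketch of that sequence; the stale socket is only safe to remove while the daemon is down, and it gets recreated on start (if your install uses socket activation, stop docker.socket as well):

cr0x@server:~$ sudo systemctl stop docker docker.socket
cr0x@server:~$ sudo rm -f /var/run/docker.sock
cr0x@server:~$ sudo systemctl start docker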

6) Docker fails after moving data-root to “fast storage”

Symptom: Docker used to work; after moving --data-root it dies at startup.

Root cause: New mount is a network filesystem or formatted without required features.

Fix: Re-evaluate the storage. Docker’s hot path wants low-latency local disk. If you must use network storage, keep it for volumes, not overlay layers.

7) Disk full and Docker won’t start after cleanup attempts

Symptom: You deleted “some stuff,” Docker still won’t start, and now metadata looks corrupted.

Root cause: Deleting random files under /var/lib/docker breaks metadata consistency.

Fix: Stop the daemon. Restore from backup if available. If not, accept that a rebuild (wipe data-root) may be safer than continued partial surgery.

8) cgroup errors on newer distros

Symptom: Errors about devices cgroup not mounted, or cgroup driver mismatch.

Root cause: Docker version not compatible with cgroups v2 defaults, or misconfigured cgroup driver.

Fix: Upgrade Docker to a version that supports your distro’s cgroup mode, and keep config consistent across fleet.

Joke #1: If your “fix” is restarting Docker until it works, you’re basically running a slot machine with root privileges.

Three corporate-world mini-stories from real life

Mini-story 1: The outage caused by a wrong assumption

The team had a clean plan: move Docker’s data-root off the OS disk and onto a bigger shared mount. The shared storage team promised it was “just like a disk.” It was mounted via NFS, but nobody said NFS out loud in the change request. Everyone assumed “storage is storage.”

The migration happened during a quiet window. They rsynced /var/lib/docker to /mnt/docker-data, added a systemd override with --data-root, and restarted. Docker didn’t come back. systemd said “failed to mount overlay.” The on-call spent 40 minutes chasing overlay2 docs and kernel modules.

The breakthrough came from dmesg, not Docker logs: overlayfs rejected the upperdir because the filesystem wasn’t supported. NFS was fine for bulk storage. It was not fine for Docker’s union filesystem layers. The wrong assumption wasn’t “overlay2 is flaky.” The wrong assumption was “NFS behaves like ext4.”

They rolled back to local XFS with the correct features and kept NFS for application volumes only, with clear boundaries. The outage ended. The postmortem produced a new rule: storage types must be named explicitly in change requests. “Shared mount” isn’t a type.

Mini-story 2: The optimization that backfired

A performance-minded engineer wanted faster container startup. They noticed overlay2 directory churn and decided to tune mounts and reduce write amplification. The “optimization” included moving Docker to a filesystem with aggressive options and turning off some metadata features they thought were overhead.

It benchmarked well in a dev environment with a handful of containers. Then it hit production: hundreds of containers, frequent image pulls, lots of small file operations. Within days, the node started throwing intermittent I/O errors. After a rough reboot, Docker wouldn’t start. The logs showed layer registration failures and metadata database errors.

What happened wasn’t mysterious. The optimization had reduced safety margins: the system was now more sensitive to power loss and to filesystem edge cases. The cost of a few percent performance was a much higher chance of corruption and a harder recovery path.

They backed out the mount tweaks, rebuilt the node, and adopted a boring rule: if you change the storage substrate under Docker, you do it with a documented compatibility matrix and a rollback plan. Performance wins are real, but “fast and fragile” is a bad trade in production.

Mini-story 3: The boring but correct practice that saved the day

A different org had a habit nobody bragged about: they collected boot-scoped logs for critical services and shipped them to a central place. Not just application logs—systemd unit logs and kernel messages too. It was dull. It also worked.

One morning, a subset of nodes stopped starting Docker after a routine OS update. The initial human instinct was “Docker regression.” But the log timeline showed something sharper: Docker failed right after a firewall reload on boot. The iptables backend had shifted, and the firewall manager started pre-creating chains in a way Docker didn’t like.

Because they had the logs, they didn’t argue for hours. They identified the change point, reproduced it on a canary, and rolled out a consistent iptables backend configuration. Docker came back without touching storage, images, or container workloads.

That practice didn’t prevent the failure. It prevented the prolonged outage. In production, time-to-understanding is half the incident.

Checklists / step-by-step plan

Checklist A: Get Docker starting again (minimum safe steps)

  1. Stop thrashing. If the service is crash-looping, stop it while you investigate.
    cr0x@server:~$ sudo systemctl stop docker
    
  2. Read the journal once, properly.
    cr0x@server:~$ journalctl -u docker -b --no-pager -n 200
    

    Decision: Identify the bucket: config, storage, capacity, networking, permissions, dependency.

  3. Check disk and inodes on the data-root filesystem.
    cr0x@server:~$ df -h /var/lib/docker; df -i /var/lib/docker
    

    Decision: If full, free space without deleting random state. Truncate logs; remove known large artifacts carefully.

  4. Validate daemon.json if it exists.
    cr0x@server:~$ test -f /etc/docker/daemon.json && sudo python3 -m json.tool /etc/docker/daemon.json
    

    Decision: Fix parse errors; remove duplicates with systemd flags.

  5. Confirm containerd is healthy.
    cr0x@server:~$ systemctl status containerd --no-pager -l
    

    Decision: Fix containerd first if needed.

  6. Clear systemd start limits after the fix.
    cr0x@server:~$ sudo systemctl reset-failed docker
    
  7. Start Docker and watch logs live for 30 seconds.
    cr0x@server:~$ sudo systemctl start docker
    cr0x@server:~$ journalctl -u docker -f --no-pager
    

    Decision: If it fails again, you now capture the exact transition point without scrolling archaeology.

Checklist B: If storage corruption is suspected (be careful)

  1. Snapshot or copy evidence first (if you can). At minimum, capture logs and the current state of /var/lib/docker size breakdown.
  2. Do not delete random directories under /var/lib/docker while dockerd is running. Stop it.
  3. Identify whether volumes matter on this host. Some environments store persistent data in Docker volumes; others don’t. Your recovery plan depends on that.
  4. Prefer node rebuild over artisanal repair when the host is cattle. Prefer careful recovery when the host is unfortunately a pet.

Checklist C: If iptables/networking is suspected

  1. Confirm Docker’s error mentions iptables/nftables.
  2. Check iptables backend and current chains.
  3. Identify the other rule manager (firewalld, kube-proxy, custom scripts).
  4. Make rule ownership explicit, then restart Docker.

Joke #2: Docker networking is easy until it isn’t, at which point it becomes interpretive dance performed by iptables.

FAQ

1) Should I run dockerd manually to debug?

Yes, briefly, if systemd is obscuring stderr or you suspect config conflicts. Use --debug and stop systemd’s unit first to avoid two daemons fighting over the same socket.
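
Roughly like this; run it in the foreground, read what it prints, then Ctrl-C and hand control back to systemd:

cr0x@server:~$ sudo systemctl stop docker docker.socket
cr0x@server:~$ sudo dockerd --debug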

2) Is it safe to delete /var/lib/docker to “fix it”?

It’s safe only if you’re intentionally wiping local images, containers, and possibly volumes (depending on your setup). It’s a last resort for fast recovery on stateless nodes, not a default fix.

3) Docker says overlay2 failed to mount. Is Docker broken?

Usually no. It’s commonly a filesystem/kernel compatibility issue (XFS ftype, unsupported upperdir filesystem, or mount options). Check dmesg for the kernel’s real complaint.

4) Why does Docker fail at startup due to iptables? Can’t it just run without NAT?

Classic Docker bridge networking depends on iptables rules. If Docker can’t program NAT/filter rules, it refuses to start networking properly and may fail the daemon start to avoid half-working behavior.

5) I changed daemon.json and now Docker won’t start. What’s the fastest sanity check?

Validate JSON with a real parser:

cr0x@server:~$ sudo python3 -m json.tool /etc/docker/daemon.json

If it errors, fix syntax. If it succeeds, look for directive conflicts with systemd flags.
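
If jq happens to be installed, it does the same job as a syntax check:

cr0x@server:~$ sudo jq . /etc/docker/daemon.json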

6) Docker won’t start after an OS upgrade. What’s the likely culprit?

Common culprits: cgroups v2 default changes, iptables backend switches, and older overrides in systemd drop-ins. Start with systemctl cat docker and the journal.

7) Can containerd be running but Docker still fails to start?

Absolutely. containerd health only removes one failure class. Docker can still fail on storage driver init, network rules, socket permissions, or daemon.json parsing.

8) How do I know if the disk is “full enough” to break Docker?

If free space is down to hundreds of MB, or inodes are exhausted, Docker can fail to start or behave erratically. Check both df -h and df -i. Also watch for read-only remounts after I/O errors.

9) What if docker info hangs instead of failing?

If the CLI hangs, it’s usually waiting on the socket. Confirm whether the daemon is running (systemctl is-active docker) and whether the socket exists (ls -l /var/run/docker.sock). Hanging is often “daemon not responding,” not “CLI broken.”

10) Is rootless Docker less likely to fail?

It fails differently. You reduce privilege-related blast radius, but you add dependency on user services, cgroup delegation, and per-user runtime directories. Great when designed for it; confusing when bolted on midstream.

Next steps that actually reduce future incidents

Getting Docker to start is the immediate win. Preventing the next “daemon down” is the real job. Here’s what I’d do after the incident dust settles:

  1. Codify the fast diagnosis playbook into your runbook, with your distro-specific paths and decisions. The commands above are a baseline; tailor them.
  2. Standardize Docker’s config source: choose daemon.json for daemon settings and systemd drop-ins for ExecStart flags, or the other way around—just don’t mix casually.
  3. Put Docker state on the right storage: local ext4 or XFS with the right features, monitored for capacity and inodes. Keep overlay layers off network mounts.
  4. Control log growth: configure container log rotation and monitor /var/lib/docker/containers/*-json.log sizes. Disk-full outages are optional.
  5. Make firewall ownership explicit in your platform design: Docker, firewalld, Kubernetes, and security agents can coexist, but only if you decide who writes which rules.
  6. Practice recovery on a non-production node: simulate a broken daemon.json, a full disk, and an iptables conflict. Your future self will be grateful and slightly less caffeinated.