Docker Container Keeps Restarting: Find the Real Reason in 5 Minutes

The pager doesn’t say “root cause unknown.” It says “service down.” And somewhere, a container is doing that embarrassing thing where it starts, dies, starts again, dies again—like it’s trying to negotiate with reality.

Restart loops waste time because people chase symptoms: “Docker is unstable,” “the image is broken,” “maybe the node is cursed.” It’s almost always something boring: exit code, healthcheck, OOM kill, dependency timing, or a restart policy doing exactly what you told it to do. Here’s how to find the real reason fast, without turning the incident into a lifestyle.

Fast diagnosis playbook (5 minutes)

The goal isn’t “collect data.” The goal is: identify the restart trigger and the failing component before the loop overwrites evidence.
You are trying to answer three questions:

  1. Who restarts it? Docker restart policy, orchestrator (Compose, Swarm), systemd, or you?
  2. Why does it exit? App crash, misconfig, signal, OOM kill, failed healthcheck, dependency missing.
  3. What changed? Image tag, env var, secret, volume, kernel/memory pressure, DNS, firewall.

Minute 1: Identify the container and restart driver

  • Get RestartCount, ExitCode, OOMKilled, restart policy.
  • Confirm whether Compose/Swarm/systemd is involved.

Minute 2: Pull the last failure evidence (before it scrolls away)

  • Check logs from the previous run (--since / tail).
  • Inspect State.Error and timestamps.

Minute 3: Classify the failure mode

  • Exit code 1/2/126/127: app/config/exec issues.
  • Exit code 137 or OOMKilled=true: memory pressure.
  • Healthcheck “unhealthy”: app might still run, but orchestrator kills it.
  • Instant exits: entrypoint script, missing file, wrong user/perm.

Minute 4: Validate dependencies and runtime environment

  • DNS, network, ports, mounted files, permissions, secrets.
  • Backend availability (DB, queue, auth) and timeouts.

Minute 5: Make a decision, not a report

Decide which of these you’re doing next: fix configuration, add resources, rollback image, disable a broken healthcheck, or pin dependencies.
If you can’t decide after five minutes, you’re missing one of: exit code, OOM evidence, healthcheck status, or who’s doing the restarting.
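
If you want all four of those answers in one shot, a single docker inspect format string covers them. This is a sketch using the example container name api from the tasks below; the output shown mirrors that example:

cr0x@server:~$ docker inspect api --format 'Exit={{.State.ExitCode}} OOM={{.State.OOMKilled}} Health={{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}} Restart={{.HostConfig.RestartPolicy.Name}}'
Exit=1 OOM=false Health=none Restart=always

“Who restarts it” still needs a second look (Compose labels, systemd units), but this one line answers the other three in seconds.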

One line that belongs in every on-call brain: “Hope is not a strategy.”

What “keeps restarting” really means

A container restart loop isn’t a single bug. It’s a contract between your process, Docker, and whatever is supervising Docker.
The container process exits. Something notices. Something restarts it. That “something” might be Docker itself (restart policy), Docker Compose, Swarm, Kubernetes (if Docker is just the runtime), or even systemd managing a docker run.

So the first anti-pattern: staring at the container name like it owes you answers. Containers don’t restart; supervisors restart containers.
The best debugging move is to identify the supervisor, then read the evidence it leaves behind.

Typical restart drivers

  • Docker restart policy: no, on-failure, always, unless-stopped.
  • Compose: restart: in docker-compose.yml, plus dependency ordering issues.
  • systemd: unit file with Restart=always running Docker.
  • External automation: cron, watchdog scripts, CI/CD “ensure running” jobs.
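
A quick way to tell whether Compose is the supervisor: Compose labels every container it starts. This sketch runs against the example api container; an empty result usually means plain docker run, systemd, or some other automation started it. In this article’s example stack the output would look something like the project name myco (matching the myco_default network seen later):

cr0x@server:~$ docker inspect api --format '{{index .Config.Labels "com.docker.compose.project"}} {{index .Config.Labels "com.docker.compose.service"}}'
myco api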

Two kinds of loops that feel identical (but aren’t)

Crash loop: the process dies quickly due to app/config/resource issues.
Kill loop: the process runs, but a healthcheck fails or a supervisor kills it (OOM, watchdog, orchestrator policy).

Your entire job is to separate those two. Logs and exit codes do that in minutes—if you grab them correctly.

Interesting facts and small history (so your intuition improves)

  • Docker’s restart policies landed early because users treated containers like lightweight daemons and needed something like init behavior without an init system inside the container.
  • Exit code 137 usually means SIGKILL (128 + 9). In containers, SIGKILL is commonly the kernel OOM killer or a hard kill by a supervisor.
  • Healthchecks were added after people shipped “running but dead” services—process still alive, but it can’t accept traffic. Without healthchecks, those failures quietly rot.
  • Docker logs are not “the app logs”; they’re whatever the process writes to stdout/stderr, captured by a logging driver. If your app logs to files, docker logs can look empty even while the app is screaming into /var/log inside the container.
  • Overlay filesystems made containers practical by enabling copy-on-write layers, but they can amplify IO overhead for write-heavy workloads—leading to timeouts that look like “random restarts.”
  • Restart loops often mask dependency failures: the app exits because it can’t reach a DB, but the real root cause is DNS, firewall, TLS mismatch, or a rotated password.
  • Compose “depends_on” does not mean “ready” in classic Compose; it primarily orders start, not readiness. That single misunderstanding has burned more teams than obscure kernel bugs (a readiness-aware sketch follows this list).
  • OOM kills can happen with “free memory” showing because cgroup limits and memory+swap accounting are what matter for the container, not the host’s global free RAM.
  • PID 1 behavior matters: signals, zombies, and exit handling can differ if you run a shell as PID 1 vs. an init-like wrapper, changing how restarts look.
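
On that depends_on point: newer Compose versions (the Compose Spec) can gate startup on a dependency’s healthcheck. A minimal sketch, with illustrative image tags and healthcheck values:

services:
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
  api:
    image: myco/api:1.24.7
    depends_on:
      postgres:
        condition: service_healthy   # wait for the healthcheck, not just "started"
    restart: always

Even with this, the app should still retry its database connection on startup; the gate helps at deploy time, not when the database restarts mid-flight.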

12+ practical tasks: commands, what the output means, and the decision you make

These are the moves you can run under pressure. Each task includes: a command, what the output tells you, and the decision it enables.
Run them in order until you hit the smoking gun. And yes, you can do most of this without “exec’ing into” a container that’s dying every four seconds.

Task 1: Confirm restart loop and get the container ID

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.RunningFor}}'
NAMES        IMAGE                 STATUS                        RUNNING FOR
api          myco/api:1.24.7       Restarting (1) 8 seconds ago   2 minutes
redis        redis:7               Up 3 hours                     3 hours

Meaning: Restarting (1) indicates Docker sees the container repeatedly exiting and applying a restart policy.
The number in parentheses is the exit code Docker observed on the most recent exit; treat it as a hint and confirm with docker inspect.

Decision: Identify the failing service name (api) and move immediately to inspect state and exit details. Don’t start with network theories yet.

Task 2: Inspect restart policy, exit code, OOM, and timestamps

cr0x@server:~$ docker inspect api --format '{{json .State}}'
{"Status":"restarting","Running":true,"Paused":false,"Restarting":true,"OOMKilled":false,"Dead":false,"Pid":24711,"ExitCode":1,"Error":"","StartedAt":"2026-01-02T09:14:44.129885633Z","FinishedAt":"2026-01-02T09:14:51.402183576Z","Health":null}
cr0x@server:~$ docker inspect api --format 'RestartPolicy={{.HostConfig.RestartPolicy.Name}} MaxRetry={{.HostConfig.RestartPolicy.MaximumRetryCount}}'
RestartPolicy=always MaxRetry=0

Meaning: ExitCode=1 is an application-level failure. OOMKilled=false reduces the likelihood of memory kill (not perfect, but a strong signal).
Restart policy always means Docker will keep trying forever. That’s great until it isn’t.

Decision: Focus on application startup/config errors and fetch logs for the last run. If it were OOMKilled=true or exit 137, you’d pivot to memory.

Task 3: Get the last logs from the previous attempt (not from the beginning of time)

cr0x@server:~$ docker logs --timestamps --tail 200 api
2026-01-02T09:14:49.903214817Z level=error msg="config parse failed" err="missing ENV DATABASE_URL"
2026-01-02T09:14:49.903955113Z level=error msg="fatal: cannot start without database"

Meaning: The container is not “mysteriously restarting.” It’s failing deterministically: missing env var DATABASE_URL.

Decision: Fix config and redeploy. Stop. Don’t add memory. Don’t “rebuild the image.” Don’t blame Docker.

Task 4: Confirm the container env and what it thinks it has

cr0x@server:~$ docker inspect api --format '{{range .Config.Env}}{{println .}}{{end}}' | sed -n '1,12p'
APP_ENV=prod
LOG_LEVEL=info
PORT=8080

Meaning: Environment is missing what logs claim is missing. That’s consistent. Good.

Decision: Determine where env should come from: Compose file, --env-file, secret injection, or platform tooling. Fix at the source, not with a one-off docker exec.
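
If Compose is the source of truth, a minimal sketch of fixing it at the source looks like this; the env file name is illustrative, and the actual DATABASE_URL value stays wherever you keep secrets:

services:
  api:
    image: myco/api:1.24.7
    env_file:
      - ./api.env            # contains DATABASE_URL=... ; keep it out of version control
    environment:
      APP_ENV: prod
      LOG_LEVEL: info
      PORT: "8080"
    restart: always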

Task 5: If it’s Compose, verify the rendered configuration

cr0x@server:~$ docker compose config | sed -n '/services:/,$p' | sed -n '1,120p'
services:
  api:
    environment:
      APP_ENV: prod
      LOG_LEVEL: info
      PORT: "8080"
    image: myco/api:1.24.7
    restart: always

Meaning: Compose config lacks DATABASE_URL. Maybe the env file wasn’t loaded, or the variable name changed.

Decision: Fix docker-compose.yml or the env file path. Then redeploy with a clean recreate so old config doesn’t linger.
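
Assuming the standard Compose CLI, verifying and then forcing a clean recreate looks like this; if the grep prints nothing, the variable still isn’t reaching the service:

cr0x@server:~$ docker compose config | grep -n DATABASE_URL
cr0x@server:~$ docker compose up -d --force-recreate api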

Task 6: Look for healthcheck-driven restarts (it’s sneakier than you think)

cr0x@server:~$ docker inspect api --format '{{json .State.Health}}'
{"Status":"unhealthy","FailingStreak":5,"Log":[{"Start":"2026-01-02T09:20:10.001712312Z","End":"2026-01-02T09:20:10.045221991Z","ExitCode":7,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused\n"}]}

Meaning: The process might be running, but the healthcheck can’t reach the service. Exit code 7 from curl is “failed to connect.”
Some setups (especially with Compose wrappers or external supervisors) will restart unhealthy containers.

Decision: Decide whether the healthcheck is wrong (checking wrong port/interface), too aggressive (interval/timeout), or accurately detecting a dead app. Then fix either the healthcheck or the service binding.
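
To see exactly what the check runs and how aggressive it is, read the configured healthcheck. The output here is illustrative, assuming a curl-based check against localhost:8080 like the failing log above; Interval and Timeout are reported in nanoseconds:

cr0x@server:~$ docker inspect api --format '{{json .Config.Healthcheck}}'
{"Test":["CMD-SHELL","curl -f http://localhost:8080/ || exit 1"],"Interval":5000000000,"Timeout":2000000000,"Retries":3}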

Task 7: Determine whether the kernel OOM killer is involved

cr0x@server:~$ docker inspect api --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}}'
ExitCode=137 OOMKilled=true Error=
cr0x@server:~$ dmesg -T | tail -n 20
[Thu Jan  2 09:25:13 2026] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker.service,mems_allowed=0,oom_memcg=/docker/3b2e...,task_memcg=/docker/3b2e...,task=myco-api,pid=31244,uid=1000
[Thu Jan  2 09:25:13 2026] Killed process 31244 (myco-api) total-vm:2147488kB, anon-rss:612344kB, file-rss:1420kB, shmem-rss:0kB

Meaning: Now it’s a different class of problem. The container didn’t “crash”; it got shot.
OOMKilled=true plus dmesg confirms the kernel killed the process under cgroup memory pressure.

Decision: Increase memory limit, reduce memory usage, or fix a leak/regression. Also check for node-level memory contention and noisy neighbors.

Task 8: Check container memory limits and current usage

cr0x@server:~$ docker inspect api --format 'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}}'
Memory=536870912 MemorySwap=536870912
cr0x@server:~$ docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}'
NAME    MEM USAGE / LIMIT     MEM %     CPU %
api     510MiB / 512MiB       99.6%     140.3%
redis   58MiB / 7.7GiB        0.7%      0.4%

Meaning: A 512MiB limit with MemorySwap equal to Memory means zero swap allowance, and usage at 99.6% leaves no breathing room. CPU at 140% suggests heavy parallel work (multiple threads).

Decision: If this limit was intentional, tune the app (heap sizes, caches) and verify memory profile. If it was accidental, raise it and move on.
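
If the limit was accidental, raising it in place is one command (the 1g value is an example, not a recommendation); remember to mirror the change in the Compose file or unit so the next recreate doesn’t undo it:

cr0x@server:~$ docker update --memory=1g --memory-swap=1g api
api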

Task 9: Identify fast-exit issues: wrong entrypoint, missing binary, permissions

cr0x@server:~$ docker inspect api --format 'Entrypoint={{json .Config.Entrypoint}} Cmd={{json .Config.Cmd}} User={{json .Config.User}}'
Entrypoint=["/docker-entrypoint.sh"] Cmd=["/app/server"] User="10001"
cr0x@server:~$ docker logs --tail 50 api
/docker-entrypoint.sh: line 8: /app/server: Permission denied

Meaning: The binary exists but is not executable for the configured user, or the filesystem is mounted noexec, or the image build lost executable bits.

Decision: Fix image permissions (chmod +x at build time), or run as a user that can execute, or remove noexec from mount options for that volume. Don’t “fix” this with chmod inside a running container; it won’t survive rebuilds.
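
A Dockerfile sketch of the build-time fix, assuming BuildKit (COPY --chmod needs it); the older two-step form works everywhere:

# Set the executable bit when the file enters the image, not at runtime
COPY --chmod=0755 server /app/server

# Pre-BuildKit equivalent:
# COPY server /app/server
# RUN chmod +x /app/server

USER 10001
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["/app/server"]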

Task 10: Validate mounts and whether a volume hides your shipped files

cr0x@server:~$ docker inspect api --format '{{range .Mounts}}{{println .Destination "->" .Source "type=" .Type}}{{end}}'
/app -> /var/lib/docker/volumes/api_app/_data type= volume
/config -> /etc/myco/api type= bind

Meaning: Mounting a volume on /app can shadow the application binary shipped in the image. If the volume is empty or stale, the container boots into an empty directory and dies.

Decision: Don’t mount over your application path unless you mean it. Move writable data to /data or similar. If you need hot-reload in dev, keep it in dev only.
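
A Compose sketch of the non-shadowing layout; the named volume and the /data path are illustrative, and the /config bind mount matches the one already in this example:

services:
  api:
    image: myco/api:1.24.7
    volumes:
      - api_data:/data               # writable state lives here, not on top of /app
      - /etc/myco/api:/config:ro     # config stays a read-only bind mount
volumes:
  api_data: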

Task 11: Check event stream to see who is killing/restarting it

cr0x@server:~$ docker events --since 10m --filter container=api | tail -n 20
2026-01-02T09:31:10.002345678Z container die 3b2e... (exitCode=137, image=myco/api:1.24.7, name=api)
2026-01-02T09:31:10.120456789Z container start 3b2e... (image=myco/api:1.24.7, name=api)

Meaning: Events show explicit die/start cycles and exit codes. If you see kill events with a user/daemon attribution (sometimes visible in audit logs), that’s your external supervisor or operator action.

Decision: If restarts are policy-driven, fix the underlying exit cause. If restarts are manual/automation-driven, find the automation and stop it from fighting you.

Task 12: Check systemd if Docker is being supervised from the outside

cr0x@server:~$ systemctl status myco-api.service --no-pager
● myco-api.service - MyCo API container
     Loaded: loaded (/etc/systemd/system/myco-api.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2026-01-02 09:33:12 UTC; 4s ago
    Process: 32511 ExecStart=/usr/bin/docker run --rm --name api myco/api:1.24.7 (code=exited, status=1/FAILURE)
   Main PID: 32511 (code=exited, status=1/FAILURE)

Meaning: systemd is restarting the docker run process, so the loop is driven from outside Docker; a Docker restart policy may not even be involved.

Decision: Fix the unit file (environment, mounts, restart backoff). Also decide whether this should be managed by Compose instead, to avoid two supervisors playing tug-of-war.

Task 13: Reproduce without restart to preserve evidence

cr0x@server:~$ docker inspect api --format 'RestartPolicy={{.HostConfig.RestartPolicy.Name}}'
RestartPolicy=always
cr0x@server:~$ docker update --restart=no api
api
cr0x@server:~$ docker start -a api
2026-01-02T09:35:01.110Z level=error msg="fatal: cannot open /config/app.yaml" err="permission denied"

Meaning: Disabling restart stops the loop and lets you attach to the failure. That’s often the fastest way to stop losing logs.

Decision: Use this when debugging. Then restore your desired restart policy after the fix. Don’t leave production services with restart disabled unless you enjoy late-night surprises.
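
The restore step, once you have the fatal line and the fix; always is the policy the inspect output above showed:

cr0x@server:~$ docker update --restart=always api
api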

Task 14: Validate permissions and user mapping on bind mounts

cr0x@server:~$ ls -l /etc/myco/api/app.yaml
-rw------- 1 root root 2180 Jan  2 09:00 /etc/myco/api/app.yaml
cr0x@server:~$ docker inspect api --format 'User={{.Config.User}}'
10001

Meaning: The container runs as UID 10001, but the bind-mounted config is readable only by root. That’s a clean, boring failure mode.

Decision: Fix ownership/permissions on the host file, or run the container with a user that can read it, or use secrets/config injection designed for this purpose.
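
The quickest host-side fix is to give the file to the UID the container runs as (10001 and the path come from the inspect output above); a secrets mechanism with sane default permissions is the better long-term answer:

cr0x@server:~$ sudo chown 10001 /etc/myco/api/app.yaml
cr0x@server:~$ sudo chmod 400 /etc/myco/api/app.yaml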

Task 15: Check DNS/network dependency quickly from a debug container on the same network

cr0x@server:~$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
6c0f1b1e2c0a   myco_default      bridge    local
cr0x@server:~$ docker run --rm --network myco_default busybox:1.36 nslookup postgres
Server:    127.0.0.11
Address 1: 127.0.0.11

Name:      postgres
Address 1: 172.19.0.3 postgres.myco_default

Meaning: DNS inside the Docker network resolves postgres. If your app still says “host not found,” suspect the app’s own configuration or a container attached to a different network.

Decision: If DNS fails here, fix the Docker network or service name. If DNS works, pivot to credentials, TLS, firewall, or readiness timing.

Short joke #1: A container in a restart loop is just DevOps’ way of teaching patience, repeatedly, with conviction.

Exit codes you should actually memorize

Exit codes are the closest thing you’ll get to a confession. Docker surfaces the exit code, but you still have to interpret it with Unix conventions and container-specific realities.

The useful ones

  • 0: clean exit. If it’s restarting anyway, someone told it to.
  • 1: generic error. Look at logs; it’s usually configuration or a thrown exception.
  • 2: misuse of shell builtins/CLI usage errors; often incorrect flags or entrypoint scripting mistakes.
  • 125: Docker couldn’t run the container (daemon error, invalid options). This is not your app.
  • 126: command invoked cannot execute (permissions, wrong architecture, noexec mount).
  • 127: command not found (bad entrypoint/CMD path, missing shell, wrong PATH).
  • 128 + N: process died from signal N. Common: 137 (SIGKILL=9), 143 (SIGTERM=15).
  • 137: SIGKILL. Often OOM killer, sometimes forced kill by a watchdog.
  • 139: SIGSEGV. Native crash; can be bad libc, bad binary, or memory corruption.

How exit codes combine with restart policies

Restart policy on-failure triggers on non-zero exit codes. That means it won’t restart on exit 0.
Restart policy always doesn’t care; it restarts no matter what, which can hide an app that intentionally exits after doing its job.

If you’re running a job-like container (migrations, cron, batch), always is usually wrong. If you’re running a service, always is fine—until you deploy something that exits instantly and you lose log context in the spin cycle.
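
A sketch of matching policy to workload with plain docker run; the migrate command is hypothetical, the flags are real:

cr0x@server:~$ docker run -d --restart=always --name api myco/api:1.24.7
cr0x@server:~$ docker run --restart=on-failure:5 --name api-migrate myco/api:1.24.7 /app/server migrate

on-failure:5 gives a job a bounded number of retries instead of an infinite spin cycle.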

Healthchecks: when “healthy” becomes a lie detector

Healthchecks are good. Bad healthchecks are chaos generators.
They’re also frequently misunderstood: Docker’s built-in healthcheck does not automatically restart containers by itself. But many supervisors and deployment patterns treat “unhealthy” as “kill and restart.”

How healthchecks fail in production

  • Wrong interface/port: the app listens on one interface or port while the healthcheck probes another (for example, the app binds only to the container IP while the check hits localhost, or vice versa).
  • Startup time: healthcheck starts before the app is ready, causing a failing streak and restarts.
  • Dependency coupling: healthcheck calls downstream services. When downstream is down, your container gets killed even though it could serve partial traffic.
  • Resource spikes: healthcheck is too frequent; on a stressed node, it tips the service over.
  • Using curl in minimal images: healthcheck command not found yields exit 127, which reads like “app is dead” when it’s just “curl isn’t installed.”

What to do

Healthchecks should test your service’s ability to serve, not the entire universe.
If a DB is down, it’s valid for an app to report unhealthy—if the app cannot function without it. But don’t bake every external dependency into the health endpoint unless you’re sure restarting helps.

Tune start_period (if available in your setup), intervals, and timeouts. More importantly: keep healthchecks deterministic and cheap.
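
A Dockerfile sketch of a cheap, deterministic check with a startup grace period; the endpoint path and timings are illustrative, and it assumes curl exists in the image (see the exit-127 trap above):

HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -fsS http://127.0.0.1:8080/healthz || exit 1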

Storage & performance traps that look like crashes

As a storage person, I’ll say the quiet part out loud: a lot of “container keeps restarting” incidents are really “IO got slow, timeouts happened, process exited.”
Docker doesn’t care why your process exited. It just sees an exit. Your app might exit on a failed DB migration, a lock timeout, or “disk full.”

Disk full: the classic that never dies

Containers write logs, temp files, database files (sometimes accidentally), and layer diffs. If the Docker root filesystem fills, containers start failing in delightful ways:
write() failures, corrupted temp files, databases refusing to start, or your app crashing because it can’t write a PID file.

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4   80G   79G  320M 100% /var/lib/docker

Meaning: You are out of runway. Expect random failures.

Decision: Free space (prune images/volumes carefully), move Docker root to a larger filesystem, or stop writing big files into container layers. Then fix logging/retention so it doesn’t recur.
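
To see where the space went and reclaim the safe parts, assuming the standard Docker CLI; read what each prune will remove before confirming:

cr0x@server:~$ docker system df
cr0x@server:~$ docker image prune -a      # removes images not used by any container
cr0x@server:~$ docker builder prune       # removes build cache
cr0x@server:~$ docker volume prune        # deletes data in unused volumes; be certain first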

Overlay write amplification and “it only restarts under load”

Overlay2 is fine for most things, but if your workload writes heavily to the container filesystem layer (not volumes), performance can collapse.
When latency spikes, timeouts cascade: the app fails readiness, healthchecks fail, orchestrator kills, restarts happen.

Practical advice: writable data goes to volumes. Logs go to stdout/stderr (and are collected by a sane logging driver), not to a file inside the image layer. Databases should not write to overlay unless you enjoy learning about fsync latency at 3 a.m.
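
For the logging/retention part, a minimal Compose sketch with the default json-file driver; the size and file count are illustrative:

services:
  api:
    image: myco/api:1.24.7
    logging:
      driver: json-file
      options:
        max-size: "10m"    # rotate each log file at 10 MiB
        max-file: "3"      # keep at most three rotated files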

Permissions and UID mismatches on bind mounts

Security best practice is “run as non-root.” Great. Then you mount host files owned by root and wonder why it restarts.
This is not Docker being mean. This is Linux being Linux.

Clock and TLS failures: the non-obvious dependency

If the host clock drifts, TLS can fail. Apps that treat “cannot establish TLS” as fatal can exit immediately. That looks like a restart loop with exit code 1.
If you see sudden widespread restarts across services that talk TLS, check NTP/chrony and certificate validity windows.

Three corporate mini-stories from real life

1) Incident caused by a wrong assumption: “depends_on means ready”

A mid-sized company ran a Compose stack: API, worker, Postgres, Redis. It had been stable for months, until a routine host reboot during maintenance.
After the reboot, the API container started flapping. Restarting every few seconds. Engineers blamed a “bad image” because the deploy had happened a day earlier.

The wrong assumption was subtle and common: they believed depends_on meant Postgres was ready to accept connections. In reality, Compose started Postgres first, but Postgres still needed time for crash recovery and initialization.
The API attempted to run migrations on startup, failed to connect, and exited with code 1. Docker dutifully restarted it. Over and over.

The logs were there, but buried—because the restart loop was fast and the log output was mixed with unrelated lines. The team initially chased network issues: “Is Docker DNS broken after reboot?” No. They ran a debug container and verified DNS and connectivity were fine.

The fix was boring and correct: add retry logic with exponential backoff for DB connection on startup, and separate migrations into a one-shot job that runs with clear failure output.
They also added a healthcheck to Postgres and had the API wait for readiness (or at least handle “not ready” without exiting).

The incident ended not with heroics, but with an admission: orchestration ordering is not readiness, and reliability comes from designing startup to tolerate the real world.

2) Optimization that backfired: “We can shrink memory limits for efficiency”

Another org wanted to reduce infrastructure cost and “tighten resource usage.” They lowered container memory limits across several services.
It looked fine in staging, where traffic was low and caches were cold. Production was not staging. Production never is.

Within a day, a key API started restarting intermittently. Not constantly—just enough to make monitoring noisy and customers annoyed.
The exit code was 137. OOM kills. The service used a managed runtime with a heap that adapted to pressure, plus a bursty JSON workload that caused transient allocations.

Engineers first tried tuning the runtime GC and capping the heap. That helped, but they missed a second-order effect: a new “optimization” in the same change set increased compression level for responses to save bandwidth.
CPU went up, latency went up, and because requests piled up, memory pressure got worse. The OOM kills increased.

The real fix was to undo the compression change for that endpoint and restore a realistic memory limit with headroom. Then they profiled allocations under production-like load and applied targeted reductions.
The lesson stuck: “efficient” limits that trigger OOM kills are not efficient; they are a tax paid in incidents.

3) Boring but correct practice that saved the day: stop the loop, preserve evidence

A finance company had a containerized service that started restarting after a certificate rotation.
Their on-call runbook included a simple line: “If a container is flapping, disable restart and run it attached once to capture the fatal error.”
It wasn’t glamorous. It didn’t look like wizardry. It worked.

They ran docker update --restart=no and then docker start -a. The app immediately printed a clear TLS error: it couldn’t read the new private key file.
The key had been deployed with restrictive permissions on a bind mount, readable only by root, while the container ran as a non-root UID.

Without stopping the loop, logs would have been partial and overwritten by repeated restarts. With the loop stopped, the failure message was unmissable.
They fixed the file ownership, restarted the service, and moved on with their lives.

The next improvement was even more boring: they changed certificate deployment to use a secrets mechanism with correct permissions by default, and added a startup check that reports a clear error before trying to serve traffic.

Common mistakes: symptom → root cause → fix

This section is the “I’ve seen this movie” catalog. If you’re on call, scan the symptoms, pick the likely root cause, and test it with one of the tasks above.

1) Restarts every 2–10 seconds, exit code 1

  • Symptom: Restarting (1), logs show config errors or missing env.
  • Root cause: missing env var, bad flag, secrets not mounted, config parse failure.
  • Fix: correct Compose env injection; validate with docker compose config; redeploy with recreate.

2) Restarts with exit code 127 or “command not found”

  • Symptom: logs: exec: "foo": executable file not found in $PATH.
  • Root cause: wrong CMD/ENTRYPOINT, missing binary, using Alpine image without bash but entrypoint uses /bin/bash.
  • Fix: fix Dockerfile entrypoint; prefer exec-form; ensure required shell exists or remove shell dependency.

3) Restarts with exit code 126 or “permission denied”

  • Symptom: binary exists but not executable, or entrypoint script not executable.
  • Root cause: wrong file mode, bind mount with noexec, wrong ownership running as non-root.
  • Fix: set executable bit at build time; adjust mount options; align UID/GID or permissions.

4) Exit code 137, sporadic under load

  • Symptom: OOMKilled=true in inspect; dmesg shows oom-kill.
  • Root cause: container memory limit too low; memory leak; load spike; too little swap allowance.
  • Fix: raise limit, tune heap/caches, investigate memory profile; reduce concurrency; add backpressure.

5) “Up” but constantly flips between healthy/unhealthy and then restarts

  • Symptom: healthcheck failing streak; restarts if orchestrator reacts to unhealthy.
  • Root cause: aggressive healthcheck, wrong endpoint, checking downstream dependencies, app binds to different interface.
  • Fix: make healthcheck cheap and correct; tune interval/timeout/start period; ensure app listens as expected.

6) Works on one host, flaps on another

  • Symptom: same image, different behavior.
  • Root cause: kernel/cgroup differences, filesystem full, different mount permissions, DNS config, CPU architecture mismatch.
  • Fix: compare docker info, host disk space, mount options; verify image arch; standardize runtime.

7) Container exits “successfully” (code 0) but restarts forever

  • Symptom: exit code 0; restart policy always.
  • Root cause: you’re running a job (migrations, init, CLI) with a service restart policy.
  • Fix: use restart: "no" or on-failure; split job from long-running service.

8) After deploying a “tiny change,” everything starts flapping

  • Symptom: multiple containers restart around the same time.
  • Root cause: shared dependency: DNS outage, cert rotation, clock drift, registry pull throttling, disk full, host memory pressure.
  • Fix: check host signals (disk, dmesg, time sync); verify secret/cert permissions; roll back the shared change.

Short joke #2: If your healthcheck depends on five other services, it’s not a healthcheck—it’s a group project.

Checklists / step-by-step plan

Checklist A: Stop the bleeding (production-safe)

  1. Confirm scope: is this one container or many? If many, suspect host-level issues (disk, memory, DNS, time).
  2. Capture evidence: grab docker inspect State, last 200 log lines, and docker events for 10 minutes.
  3. Stabilize: if the loop is too fast, disable restart temporarily (docker update --restart=no) to preserve logs and reduce churn.
  4. Choose action: rollback image tag, fix env/secrets, raise memory, or correct healthcheck.
  5. Communicate clearly: “Exit code 137, OOM kill confirmed in dmesg. Raising limit and rolling back memory change.” Not “Docker seems weird.”

Checklist B: Find the trigger in a clean, repeatable way

  1. Identify restart policy and supervisor (Docker vs Compose vs systemd).
  2. Read exit code and OOM flag.
  3. Read logs from the last attempt (--tail, with timestamps).
  4. Check health status logs if healthchecks are configured.
  5. Verify mounts and permissions (especially when running non-root).
  6. Verify dependency connectivity from the same Docker network.
  7. Check host disk space and host OOM logs.
  8. Re-run container attached once with restart disabled to reproduce cleanly.

Checklist C: Prevent recurrence (the part people skip)

  1. Make startup resilient: retry dependencies with backoff; don’t exit instantly on transient failures.
  2. Separate one-shot jobs: migrations and schema changes should be explicit jobs, not hidden inside the main service boot.
  3. Right-size healthchecks: cheap, deterministic, not dependent on the entire world.
  4. Set sane limits: memory headroom, CPU constraints appropriate to workload, and avoid limits chosen by optimism.
  5. Log to stdout/stderr: keep container logs accessible and centralizable.
  6. Document the invariants: required env vars, required mounts, required permissions, expected exit behavior.

What to avoid when you’re debugging a restart loop

  • Don’t keep rebuilding images until you can state the exit code and the last fatal log line.
  • Don’t exec into the container first; it may die before you learn anything. Start with inspect/logs/events.
  • Don’t add “restart: always” everywhere as a band-aid. It hides job containers and can amplify failure storms.
  • Don’t blame Docker until you’ve checked disk full and OOM. Docker mostly just reports what Linux did.

FAQ

1) Why does docker ps show “Restarting (1)”?

It means the container is exiting and Docker is applying a restart policy. The number is the last exit code observed.
Confirm with docker inspect and read .State.ExitCode plus timestamps.

2) How do I know if it’s Docker restarting it or something else?

Check restart policy in docker inspect (.HostConfig.RestartPolicy), then check for external supervisors:
systemctl status for unit files, and docker events for start/stop patterns. Compose also adds its own lifecycle behavior.

3) What’s the fastest way to catch the real error message?

Disable restart temporarily (docker update --restart=no) and run attached once (docker start -a).
It stops the churn and prints the fatal line clearly.

4) Exit code 137: is it always OOM?

No. It means SIGKILL. OOM is the most common cause in containers, but a supervisor can also SIGKILL a process.
Confirm with docker inspect (OOMKilled=true) and host logs (dmesg).

5) Why are docker logs empty even though the app is failing?

Because Docker only captures stdout/stderr. If your app logs to files inside the container, docker logs can be silent.
Either reconfigure the app to log to stdout/stderr or inspect the files (ideally via a volume, not the container layer).

6) The container is “unhealthy” but the process is running. Why restart?

Docker health status doesn’t inherently restart the container, but many deployment patterns do: external supervisors, scripts, or orchestrators interpret unhealthy as “replace.”
Fix the healthcheck endpoint, timing, or sensitivity, or adjust the supervisor behavior.

7) Why does it work when I run it manually but not under Compose?

Compose changes networks, environment injection, volume mounts, and sometimes the working directory. Compare:
docker compose config vs docker inspect for the running container. Differences in mounts and env vars are the usual culprits.

8) How do I debug a container that exits too fast to exec into?

Use docker logs and docker inspect first. If needed, disable restart and run attached once.
You can also override entrypoint temporarily to get a shell and inspect the filesystem, but treat that as a controlled experiment, not the fix.

9) Can disk issues really cause restart loops?

Absolutely. Disk full, slow IO, or permission problems on volumes can cause apps to fail startup checks, crash, or time out.
Check df -h for Docker root, inspect mounts, and look for “no space left on device” in logs.

10) What should I alert on to catch this early?

Alert on rising restart counts, health status flips, OOM kill events, disk usage on Docker root, and elevated container exit rates.
Restarts are not inherently bad; unexpected restarts are. Baseline your normal.

Conclusion: next steps that prevent the sequel

A container restart loop feels chaotic, but it’s usually deterministic. Stop guessing.
In five minutes, you can know: who restarts it, what exit code it returns, whether it was OOM-killed, and what the last fatal log line said.
After that, the fix is typically mundane: correct env/secrets, fix permissions, adjust healthcheck behavior, or give the process enough memory to live.

Do these next:

  1. Standardize a runbook: inspect → logs → events → host signals (disk/OOM) → reproduce attached once.
  2. Make startup tolerant: retries with backoff, timeouts, and clear fatal messages.
  3. Separate jobs from services; don’t run migrations as a side effect of service boot unless you truly mean it.
  4. Audit restart policies: use always (or unless-stopped) for services, on-failure for retryable jobs, and no for true one-shots.
  5. Move writable data to volumes and keep logs accessible on stdout/stderr.

You don’t need heroics. You need evidence, fast. Then you need the discipline to change the thing that actually broke.
