Docker: Why Your Container Restarts Forever (And the One Log You Need)

You deploy a container. It looks fine for a second. Then it dies. Then Docker helpfully brings it back to life… so it can die again. The cycle continues until your monitoring looks like a heart monitor in a bad movie and your on-call phone starts negotiating with you.

This is the part where people waste hours staring at docker ps and guessing. Don’t. When a container is restarting forever, there’s one log that reliably tells you what happened: the logs from the previous attempt, not the current one that’s still booting.

The one log you need (and why it’s not the one you’re watching)

When a container restarts, it often dies quickly—sometimes before it emits anything useful, sometimes after it emits the useful line and immediately exits. If you tail logs during the restart loop, you tend to catch the wrong moment: the new instance starting up (again), not the one that just crashed.

The log you need is already there: because a restart policy restarts the same container, docker logs keeps the output of every attempt, crashes included. Tail it instead of following it:

cr0x@server:~$ docker logs --tail 50 myservice
error: cannot open config file /etc/myapp/config.yaml: permission denied

If you remember only one thing from this article, remember this: the crash line is already in the container’s log history, just behind the current boot banner. Scrolling back is the difference between “I think it’s networking?” and “it’s a file permission error, fix the mount.”

Why it works: Docker’s log stream is tied to a container’s lifecycle, and a restart policy restarts the same container (same ID). With the default json-file log driver, the log file persists across those restarts, so plain docker logs — with --tail and --timestamps to find the boundary between attempts — shows you the last run’s stdout/stderr. (Docker has no --previous flag; that’s kubectl logs --previous, Kubernetes’ equivalent of the same idea.)

And yes, there are caveats. If you’re using Compose and containers are being recreated (new container IDs) rather than restarted, the old log file disappears with the old container — grab logs by the old container ID while it still exists, or use docker compose logs or your centralized logging. But the principle remains: stop watching the current boot attempt and inspect the last crash.

What “restarting forever” actually means in Docker

A restart loop is not one thing. It’s a family of behaviors that look the same from across the room: container shows as “Restarting (x) …” or it keeps reappearing in docker ps.

Restart policies: the fine print people skip

Restart loops are usually driven by a restart policy. Docker supports:

  • no (default): it exits and stays dead.
  • on-failure[:max-retries]: restarts when exit code is non-zero.
  • always: restarts regardless of exit code; a manually stopped container comes back when the daemon restarts.
  • unless-stopped: like always, but a manually stopped container stays stopped, even across daemon restarts.

In production, you typically want unless-stopped for long-running services. But “good defaults” become “bad noise” when the process is crashing instantly. The policy faithfully restarts a broken thing. Like any diligent employee, it does exactly what you asked, not what you meant.

Exit codes are your first real clue

Docker doesn’t restart a container because it’s bored. It restarts it because the main process ends. That process ends for one of three broad reasons:

  • The app decided to exit (config error, migration failure, missing dependency).
  • The OS killed it (OOM killer, SIGKILL, cgroup constraints).
  • You orchestrated it (healthcheck failure, watchdog, systemd unit).

The exit code and the “OOMKilled” flag tell you which branch you’re on. You’re not diagnosing “Docker.” You’re diagnosing why PID 1 in that container cannot stay alive.
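The arithmetic behind those branches can be sketched as a tiny helper — decode_exit is a hypothetical name, and the mapping only covers the signals you actually meet in restart loops:

```shell
#!/bin/sh
# Sketch: decode a container exit code into the likely cause.
# Codes above 128 mean "killed by signal (code minus 128)".
decode_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "SIGKILL (often the OOM killer; check OOMKilled and dmesg)" ;;
      15) echo "SIGTERM (something asked it to stop: docker stop, a watchdog)" ;;
      11) echo "SIGSEGV (the process crashed; check core dumps and app logs)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  elif [ "$code" -eq 0 ]; then
    echo "clean exit (suspect restart policy 'always' on a one-shot job)"
  else
    echo "app exited with code $code (config/dependency error; read the logs)"
  fi
}

decode_exit 137
```

Feed it the number from docker inspect and you know which branch of the diagnosis tree you are on before you type anything else.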

One quote worth keeping in your head while you debug: “Hope is not a strategy.” — General Gordon R. Sullivan. It’s not strictly an SRE quote, but it applies brutally well to restart loops.

Fast diagnosis playbook (first/second/third)

This is the playbook I use when a container is flapping and I want the bottleneck fast, without turning the incident into a research project.

First: capture the last failure (don’t watch the current boot)

  1. Get the last crash output: docker logs --tail 100 --timestamps <name> (or container ID).
  2. Get exit code and reason: docker inspect for State.ExitCode, State.OOMKilled, State.Error.

If the previous logs show a clear config error, stop. Fix that. Don’t “add more memory” to a YAML typo.

Second: determine if it’s a crash, a kill, or a deliberate restart

  1. Check dmesg / journal for OOM kills.
  2. Check container healthcheck state (unhealthy can trigger orchestrator restarts even if the process stays up).
  3. Check who is restarting it: Docker restart policy, systemd, Compose, Swarm.

Third: validate the runtime environment (storage, network, dependencies)

  1. Mounts and permissions: bind mounts, secrets, config files.
  2. Ports and DNS: is it failing to bind, failing to resolve, failing TLS?
  3. Resource limits: memory, pids, ulimits, disk space, inode exhaustion.

Joke #1: Containers are like houseplants—ignore the basics (water, light, soil) and they’ll die on schedule.

Practical tasks: commands, outputs, and decisions (12+)

Below are the tasks that actually move you forward. Each one includes a realistic command, an example output, what it means, and what decision to make next. Run these on the host unless otherwise noted.

Task 1: See the restart loop and grab the container ID

cr0x@server:~$ docker ps --no-trunc
CONTAINER ID                                                       IMAGE               COMMAND                  CREATED          STATUS                         PORTS                    NAMES
b5a1c0b7fd4b1f7b4b5d5c6a9c8d2d9c7a1c2e3f4a5b6c7d8e9f0a1b2c3d4e5   myapp:1.4.2         "/entrypoint.sh"         2 minutes ago    Restarting (1) 5 seconds ago                            myservice

Meaning: Docker reports “Restarting” with an exit code in parentheses. That number is usually the last exit code.

Decision: Move immediately to previous logs and inspect state.

Task 2: Pull the one log you need

cr0x@server:~$ docker logs --tail 20 myservice
[2026-02-04T10:15:02Z] FATAL: DB_URL is not set
[2026-02-04T10:15:02Z] exiting with code 2

Meaning: The app is exiting deliberately, with a clear failure code, because of a missing env var.

Decision: Fix configuration at deployment layer (Compose env_file, secrets, CI). Don’t touch Docker daemon settings.

Task 3: Inspect state, exit code, and OOMKilled

cr0x@server:~$ docker inspect -f 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}' myservice
ExitCode=137 OOMKilled=true Error= FinishedAt=2026-02-04T10:15:19.120401234Z

Meaning: Exit code 137 plus OOMKilled=true is a classic memory kill (SIGKILL).

Decision: Go check kernel logs and container memory limits. This is not an “app bug” until proven otherwise.

Task 4: Confirm OOM kill in kernel logs

cr0x@server:~$ sudo dmesg -T | tail -n 20
[Sun Feb  4 10:15:19 2026] Memory cgroup out of memory: Killed process 23184 (myapp) total-vm:812340kB, anon-rss:512120kB, file-rss:1200kB, shmem-rss:0kB
[Sun Feb  4 10:15:19 2026] oom_reaper: reaped process 23184 (myapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: The kernel killed the process due to cgroup memory pressure.

Decision: Increase container memory limit, reduce memory use, or fix a leak. Also ensure the host has headroom; moving the limit without capacity is just moving the crash.

Task 5: Check container resource limits applied

cr0x@server:~$ docker inspect -f 'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}} PidsLimit={{.HostConfig.PidsLimit}}' myservice
Memory=268435456 MemorySwap=268435456 PidsLimit=100

Meaning: 256 MiB memory limit with no swap headroom; tight for many runtimes. PID limit might also bite fork-heavy apps.

Decision: If the service is expected to be bigger, raise limits. If it’s expected to be small, profile memory and remove spikes (JIT warmup, caches, migrations).

Task 6: Identify if restart policy is forcing the loop

cr0x@server:~$ docker inspect -f 'Name={{.Name}} RestartPolicy={{.HostConfig.RestartPolicy.Name}} MaximumRetryCount={{.HostConfig.RestartPolicy.MaximumRetryCount}}' myservice
Name=/myservice RestartPolicy=always MaximumRetryCount=0

Meaning: “always” means it will restart even if the app exits 0. MaximumRetryCount=0 means unlimited.

Decision: During debugging, consider temporarily setting on-failure:5 or disabling restarts so you can inspect the dead container state without it immediately respawning.

Task 7: Stop the loop long enough to inspect safely

cr0x@server:~$ docker update --restart=no myservice
myservice

Meaning: You changed the restart policy for this container instance.

Decision: Now stop it, then start manually when ready. Also go fix the source (Compose file, systemd unit) or it will come back next deployment.

Task 8: Inspect container events to see the rhythm and cause

cr0x@server:~$ docker events --since 10m --filter container=myservice
2026-02-04T10:15:18.992345678Z container die b5a1c0b7fd4b (exitCode=137, image=myapp:1.4.2, name=myservice)
2026-02-04T10:15:19.101234567Z container start b5a1c0b7fd4b (image=myapp:1.4.2, name=myservice)
2026-02-04T10:15:24.220987654Z container die b5a1c0b7fd4b (exitCode=137, image=myapp:1.4.2, name=myservice)

Meaning: Clear restart cadence. ExitCode repeated.

Decision: Repeated identical exit codes usually mean deterministic startup failure (config, permissions, port bind) or deterministic kill (OOM on warmup). Focus there, not on random network flakiness.

Task 9: Check health status (healthchecks can create “soft restart loops”)

cr0x@server:~$ docker inspect -f 'Health={{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}} FailingStreak={{if .State.Health}}{{.State.Health.FailingStreak}}{{else}}0{{end}}' myservice
Health=unhealthy FailingStreak=12

Meaning: The container process may be running, but healthcheck is failing repeatedly. Some setups (Compose with dependent services, external watchdogs) respond by restarting.

Decision: Inspect the healthcheck command and its output next. Treat it like production code, because it is.

Task 10: Retrieve healthcheck logs

cr0x@server:~$ docker inspect -f '{{range .State.Health.Log}}{{.End}} {{.ExitCode}} {{.Output}}{{end}}' myservice
2026-02-04T10:16:02.000000000Z 1 curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
2026-02-04T10:16:12.000000000Z 1 curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused

Meaning: Your app isn’t listening on the port the healthcheck expects (or it’s bound to the wrong interface, or it hasn’t started yet).

Decision: Fix either the app listen address/port or the healthcheck. If startup is slow, adjust start_period to avoid premature failure.

Task 11: Detect port binding conflicts on the host

cr0x@server:~$ sudo ss -ltnp | grep ':8080 '
LISTEN 0      4096         0.0.0.0:8080       0.0.0.0:*    users:(("old-nginx",pid=1187,fd=7))

Meaning: Something else is already listening on that host port.

Decision: Change the published port mapping, or stop the conflicting service. If this is a “works on my laptop” issue, it’s usually because your laptop didn’t have the conflicting daemon.

Task 12: Validate mounts and permissions (the quiet killer)

cr0x@server:~$ docker inspect -f '{{range .Mounts}}{{.Type}} {{.Source}} -> {{.Destination}} (RW={{.RW}}){{"\n"}}{{end}}' myservice
bind /srv/myservice/config.yaml -> /etc/myapp/config.yaml (RW=false)
volume myservice-data -> /var/lib/myapp (RW=true)

Meaning: Config is a bind mount and it’s read-only. That’s good. But if the app tries to write to it, it will crash.

Decision: Ensure the app writes only to writable paths. If it needs to generate config, mount a directory and write into it, or change app behavior.

Task 13: Enter a debugging shell (without changing the image)

cr0x@server:~$ docker run --rm -it --network container:myservice --pid container:myservice --entrypoint /bin/sh myapp:1.4.2
/ # ps aux
PID   USER     TIME  COMMAND
1     root      0:00 myapp --config /etc/myapp/config.yaml
/ # netstat -ltn
Active Internet connections (only servers)
tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN

Meaning: You can observe process and listening ports in the container’s namespaces. Here it listens on 9090, not 8080.

Decision: Fix healthcheck / port mapping. Namespaces remove guesswork.

Task 14: Check filesystem pressure and inode exhaustion

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  100G   98G  2.0G  99% /var/lib/docker
cr0x@server:~$ df -i /var/lib/docker
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
/dev/nvme0n1p3  6553600  6551000     2600  100% /var/lib/docker

Meaning: Disk is nearly full and inodes are exhausted. Containers can fail in bizarre ways: can’t write PID files, can’t extract layers, can’t append logs.

Decision: Clean up images/containers/volumes, expand storage, move Docker root. Then re-test. If you don’t fix inodes, “adding 10GB” won’t help.
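If you want that check scripted into a preflight instead of remembering two df invocations under pressure, a minimal sketch (assumes GNU df with --output; check_fs and the 90% threshold are arbitrary choices for illustration):

```shell
#!/bin/sh
# Sketch: warn when a filesystem is close to space or inode exhaustion.
# Assumes GNU df (--output); the 90% threshold is an arbitrary default.
check_fs() {
  path="$1"; threshold="${2:-90}"
  used=$(df --output=pcent "$path" 2>/dev/null | tail -1 | tr -dc '0-9')
  iused=$(df --output=ipcent "$path" 2>/dev/null | tail -1 | tr -dc '0-9')
  echo "$path space=${used:-?}% inodes=${iused:-?}%"
  [ "${used:-0}" -lt "$threshold" ] && [ "${iused:-0}" -lt "$threshold" ]
}

# Run it against the Docker root before you start moving memory limits around:
check_fs /var/lib/docker || echo "pressure detected: prune or expand first"
```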

Task 15: Check Docker daemon logs (sometimes the daemon is the villain)

cr0x@server:~$ sudo journalctl -u docker --since "10 minutes ago" -n 50
Feb 04 10:15:19 server dockerd[1023]: containerd: time="2026-02-04T10:15:19Z" level=warning msg="failed to shim reaping" id=b5a1c0b7fd4b
Feb 04 10:15:19 server dockerd[1023]: Error response from daemon: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/srv/myservice/config.yaml" to rootfs at "/etc/myapp/config.yaml": permission denied: unknown

Meaning: OCI/runtime errors can prevent the container from even starting. This is different from “app started and crashed.”

Decision: Fix mount permissions, SELinux/AppArmor profiles, or path existence on the host. Application engineers can’t fix what never starts.

Failure modes that cause restart loops

Most restart loops are one of these categories. Learn the pattern, and you’ll stop treating every outage like a unique snowflake.

1) Application exits because configuration is wrong

Signature: ExitCode is a small non-zero integer (1, 2, 64), logs show “missing env var”, “invalid config”, “failed to parse”.

Typical causes: missing env var after secret rotation, wrong config file path, templating bug, JSON/YAML syntax.

Fix: Validate config at build/deploy time. Add “configtest” entrypoint mode. Fail fast, but fail once (limit retries during rollout).
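A “fail fast, fail loud” preflight can be as small as this sketch — require_env and the variable names are hypothetical, and in a real entrypoint the call would run before anything else:

```shell
#!/bin/sh
# Sketch: validate required env vars before starting the app, producing
# one clear error instead of a stack trace. Names here are hypothetical.
require_env() {
  for var in "$@"; do
    eval "val=\${$var:-}"   # indirect lookup, POSIX-sh compatible
    if [ -z "$val" ]; then
      echo "FATAL: required env var $var is not set" >&2
      return 2
    fi
  done
}

# In a real entrypoint you would call this before exec'ing the app:
#   require_env DB_URL CONFIG_FILE || exit 2
DEMO_VAR=ok require_env DEMO_VAR && echo "preflight OK"
```

Paired with a bounded restart policy, this turns an infinite flap into one readable log line.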

2) PID 1 behavior and signal handling (the “it works locally” classic)

Inside a container, the main process is PID 1. PID 1 has special semantics in Linux: it ignores some signals by default and is responsible for reaping zombies. If you wrap your app in a naive shell script, you can get odd shutdown behavior, children that never die, or “exits immediately” because the script ends.

Signature: Container exits 0 quickly; logs show script completed; no long-running process. Or container won’t stop cleanly and gets SIGKILL, then restarts.

Fix: Use exec in entrypoint scripts. Consider a minimal init (like tini) when you spawn subprocesses.
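The difference exec makes is easy to demonstrate outside a container: the exec’d command inherits the wrapper’s PID instead of becoming a child. This is a sketch of the mechanics, not a real entrypoint:

```shell
#!/bin/sh
# Sketch: what exec changes. Inside a container the wrapper shell is
# PID 1; exec hands that PID (and signal delivery) to the real service.

# Without exec: the wrapper stays alive and the app is a child with a
# different PID -- the shell, as PID 1, absorbs SIGTERM from docker stop.
sh -c 'echo "wrapper=$$"; sh -c "echo app=\$\$"; :'

# With exec: the app replaces the wrapper in place, same PID.
sh -c 'echo "wrapper=$$"; exec sh -c "echo app=\$\$"'

# Real entrypoint shape: do one-time setup, then hand over, e.g.:
#   exec /usr/local/bin/myapp --config /etc/myapp/config.yaml
```

In the first run the two PIDs differ; in the second they match, which is exactly what you want PID 1 in a container to look like.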

3) OOM kills and memory limits

OOM kills create the cleanest, meanest restart loops: everything starts, allocates a bunch of memory (JVM warmup, Python import storm, Node build step, in-memory cache), then the kernel kills it. Docker restarts it. Repeat.

Signature: ExitCode 137, OOMKilled=true, kernel logs show cgroup OOM.

Fix: Raise limits based on measurement, not hope; cap caches; reduce concurrency; avoid doing heavy migrations on every boot.

4) Healthchecks that are too aggressive (or just wrong)

Healthchecks are great until they’re written like a unit test: brittle, timing-dependent, and convinced your service is dead because a TCP connect failed once.

Signature: Service runs but becomes “unhealthy”, orchestrator restarts or dependent services refuse to start.

Fix: Add start_period, tune interval/retries, and make the check reflect user-facing health (not internal perfection).
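One way to make a check reflect user-facing health is to demand a positive signal in the response body, not just any 200. A sketch — the /healthz endpoint and the "status":"ok" contract are assumptions, not a standard:

```shell
#!/bin/sh
# Sketch: the matching half of a healthcheck that requires a positive
# signal in the response, not just a successful TCP connect.
healthy() {
  case "$1" in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Inside the container's healthcheck you would pair it with a fetch
# (hypothetical endpoint; --max-time keeps the check from hanging):
#   body=$(curl -fsS --max-time 3 http://127.0.0.1:8080/healthz) || exit 1
#   healthy "$body" || exit 1
healthy '{"status":"ok","uptime":42}' && echo "would report healthy"
```

Wire it up in the Dockerfile with HEALTHCHECK --interval=10s --timeout=3s --start-period=30s --retries=3 CMD …, where --start-period (Compose spells it start_period) gives slow starters room before failures count.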

5) Storage and filesystems: full disk, wrong permissions, broken mounts

Storage issues don’t always scream. Sometimes they whisper: “read-only file system,” “no space left on device,” “permission denied.” Then the app exits. Then Docker restarts it. Infinite, polite suffering.

Signature: Logs mention write failures; daemon logs show mount errors; disk/inodes near 100%.

Fix: Fix mounts, ownership (UID/GID), SELinux labels, and capacity. Also: stop writing logs inside the container filesystem like it’s 2014.

6) Dependency failures: DNS, TLS, databases, and startup ordering

Apps often assume dependencies are available immediately. In distributed systems, that assumption is adorable and wrong.

Signature: Logs show connection refused/timeouts to DB; exit code non-zero; restarts happen immediately at startup.

Fix: Backoff and retry in the application. Or use an init container pattern (outside pure Docker) or a startup script that waits with timeouts. Avoid infinite “wait-for-it” loops with no deadline.
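A bounded wait looks like this sketch: a probe retried with crude backoff until a hard deadline, instead of the classic infinite wait-for-it loop. wait_for and the example probe are hypothetical:

```shell
#!/bin/sh
# Sketch: wait for a dependency with a hard deadline, not forever.
# wait_for runs a probe command until it succeeds or the deadline passes.
wait_for() {
  deadline="$1"; shift
  start=$(date +%s)
  delay=1
  while ! "$@" 2>/dev/null; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "dependency not ready after ${deadline}s: $*" >&2
      return 1
    fi
    sleep "$delay"
    # crude backoff, capped so we keep probing near the deadline
    [ "$delay" -lt 5 ] && delay=$((delay + 1))
  done
}

# Example probe (hypothetical host/port); nc -z just tests connectability:
#   wait_for 30 nc -z db.internal 5432 || exit 1
wait_for 5 true && echo "dependency ready"
```

The key property is the non-zero exit at the deadline: the container dies with a clear message, the restart policy gets a real exit code, and nobody tails a silent loop for an hour.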

7) “Optimizations” that change timing and break everything

When you squeeze startup time or reduce image size, you change the timing and the runtime environment. That can surface races: healthcheck runs earlier, dependencies not ready, files not created yet.

Signature: Flapping starts after an image “cleanup” or base image change; same code, different behavior.

Fix: Treat base image and entrypoint changes as production changes. Test cold-start. Test with realistic resource limits.

Joke #2: The container is “restarting to apply updates,” which is exactly what it says right before it doesn’t.

Three corporate mini-stories from production

Mini-story 1: The incident caused by a wrong assumption

The team had a small internal API running in Docker Compose on a couple of VMs. Straightforward setup: app container, Postgres container, and a reverse proxy. It had been fine for months—until a minor OS patch and a redeploy.

After the redeploy, the API container entered a tight restart loop. The first responder did what many of us do under pressure: blamed “Docker networking.” The logs on the current run were sparse: just startup banners. Nothing obvious.

Someone else scrolled back through the full docker logs output and immediately saw a line about failing to open a certificate file. The assumption had been: “The cert is baked into the image.” It wasn’t. It was a bind mount from the host, and the OS patch had changed the permissions on the directory where the certificate lived.

The API process ran as a non-root user. It couldn’t read the cert, so it exited. Docker restarted it. Infinite loop.

The fix was boring: correct ownership and permissions on the host path, then redeploy. The lasting fix was even more boring: stop assuming host files are stable; explicitly manage them (config management or a secret store) and add a startup check that prints a clear error before doing anything else.

Mini-story 2: The optimization that backfired

A platform team wanted faster deployments and smaller images. They moved several services from a Debian-based image to a slimmer Alpine-based one. Build times improved. CVE scan output looked nicer. Everyone got to feel like they had “reduced waste.”

A week later, one service started flapping after a routine release. It wasn’t consistent across all nodes. On some nodes it ran for hours; on others it restarted every minute. The restart policy was unless-stopped, so it kept trying.

The root cause turned out to be a native dependency. The service used a library that behaved differently under musl (Alpine) than glibc (Debian). Under load, memory usage spiked, crossing the cgroup limit. The kernel killed it (exit 137), it restarted, and the cycle repeated. Because load distribution differed per node, the issue looked “random.”

They rolled back the base image for that service, then did the hard work: set realistic memory limits, built a proper load test that measured RSS under steady-state and warmup, and documented which services were safe to slim.

The lesson wasn’t “Alpine is bad.” The lesson was: optimizations change physics. If you don’t measure, you’re just rearranging outages.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent team ran a containerized batch job that produced files consumed by another system. It wasn’t glamorous. It ran once per hour, wrote to a mounted volume, and exited. The restart policy was on-failure:3, not always. That choice looked conservative, maybe even timid.

One morning, the job began failing immediately. The failure wasn’t in the application logs; it was in the Docker daemon logs: a bind mount path didn’t exist on one of the hosts after a filesystem reorganization. The container didn’t even start.

Because the restart policy was bounded, the job stopped after three attempts instead of flapping for hours and consuming resources. Their alerting fired on “job did not run” rather than “node CPU is on fire.” The on-call engineer could diagnose without the system constantly changing under them.

They fixed the mount path and added a preflight check in the deployment pipeline that verified host paths exist and have correct permissions. The job went back to being boring, which is the correct state for finance-adjacent systems.

Common mistakes: symptom → root cause → fix

This is the section you’ll wish you had during an incident. Symptoms are what you see. Root causes are what’s actually happening. Fixes are the shortest safe path to stable.

1) “Restarting (0)” forever

Symptom: Container restarts, exit code appears as 0.

Root cause: Restart policy always with a process that exits successfully (script ends; job container not meant to be long-running), or a supervisor that exits after spawning a child incorrectly.

Fix: Use on-failure for one-shot jobs. Ensure entrypoint script uses exec so the real service becomes PID 1.

2) Exit code 137 and “OOMKilled=true”

Symptom: Restarts at a consistent point in startup; sometimes works with lower load; logs cut off.

Root cause: Cgroup memory OOM kill.

Fix: Measure memory; raise limits appropriately; fix leaks; tune runtimes (JVM heap, Node memory flags). Confirm with kernel logs.

3) “permission denied” on mounts

Symptom: Daemon logs show OCI mount permission errors; or app logs show file open failures.

Root cause: Host filesystem permissions, SELinux labels, AppArmor profiles, or rootless Docker user mapping.

Fix: Correct ownership/permissions; apply correct SELinux context; for rootless, ensure paths are accessible to the user running dockerd.

4) Healthcheck fails while app is actually fine

Symptom: App responds on one port/path, but healthcheck marks unhealthy; orchestrator restarts or keeps service out of rotation.

Root cause: Wrong port, wrong path, TLS mismatch, or startup slower than healthcheck start time.

Fix: Fix healthcheck command; add start_period; check that health is tied to user-visible readiness.

5) Container never gets to app logs at all

Symptom: docker logs is empty; container dies instantly; daemon shows runtime errors.

Root cause: Image/entrypoint missing, exec format error (wrong architecture), mount failure, missing binary, invalid user.

Fix: Inspect daemon logs; verify image architecture; test docker run --entrypoint with a shell; validate mounts exist.

6) Restart loop after “cleanup” or “hardening”

Symptom: Works in dev; fails in prod after enabling read-only root FS, dropping privileges, or removing packages.

Root cause: App writes to root filesystem, needs CA certs, needs time zone data, or expects /tmp to be writable.

Fix: Provide writable volumes for needed paths; install required runtime data; document required filesystem locations and permissions.

7) Random restarts correlated with load

Symptom: Container stable at night, flaps during peak.

Root cause: Memory spikes causing OOM, file descriptor exhaustion, thread/process exhaustion, or upstream timeouts causing crash-on-start behavior.

Fix: Track resource usage; set ulimits; add backpressure; avoid crash-on-transient dependency failures.

Checklists / step-by-step plan

Checklist A: “My container is restarting right now” (10-minute plan)

  1. Identify the container: docker ps --no-trunc.
  2. Grab the last crash: docker logs --tail 100 --timestamps <name>.
  3. Inspect exit reason: docker inspect for ExitCode and OOMKilled.
  4. If ExitCode=137 or OOMKilled=true: check dmesg and memory limits.
  5. If logs show config/env: compare expected env vars vs actual container env and deployment config.
  6. If mount/permissions: check docker inspect mounts and daemon logs.
  7. If healthcheck: inspect .State.Health logs; verify port/path.
  8. Stop the loop if it’s harming the host: docker update --restart=no then docker stop.
  9. Fix at the source (Compose/systemd/CI) so the next deploy doesn’t reintroduce the loop.
  10. Restart once, watch events and logs, confirm stable.

Checklist B: “Make restart loops less painful” (design-time controls)

  1. Use bounded retries where appropriate: on-failure:5 for batchy services.
  2. Add clear startup checks: verify required env vars, files, and connectivity with crisp errors.
  3. Make entrypoint scripts use exec and exit non-zero on fatal startup failures.
  4. Define healthchecks that reflect real readiness, with a startup grace period.
  5. Set realistic resource limits and monitor them; “unlimited” is not a strategy, it’s a confession.
  6. Move persistent state to volumes; treat container FS as ephemeral.
  7. Centralize logs; don’t rely on “docker logs” as your only record during an incident.
  8. Document dependencies and failure behavior (what happens if DB is down at boot?).

Interesting facts and historical context

  • Fact 1: Docker’s early popularity (circa 2013–2014) was driven by packaging and distribution, not orchestration; restart loops became a more visible problem once people started treating containers like pets.
  • Fact 2: The OCI runtime standard exists because the ecosystem needed consistent container behavior across tools; many “Docker issues” are actually runtime (runc/containerd) errors surfaced by Docker.
  • Fact 3: Exit code 137 typically indicates SIGKILL (128 + 9). In container land, that often maps to OOM kills, but can also be an external kill.
  • Fact 4: PID 1 semantics are older than containers; containers just make more apps accidentally become PID 1 without being designed for it.
  • Fact 5: Healthchecks were added to Docker long after “docker run” existed; plenty of images still ship without them, and many teams bolt them on without tuning startup timing.
  • Fact 6: Log drivers (json-file, journald, syslog, fluentd, etc.) affect what “docker logs” can show; restart diagnosis changes if logs aren’t stored locally.
  • Fact 7: Overlay filesystems (overlay2) changed container storage performance and semantics compared to older drivers; some “random startup failures” in the past were storage-driver edge cases.
  • Fact 8: Restart policies predate modern orchestrators; they’re a local reliability mechanism, not a full scheduling strategy. That’s why they can amplify badness on a single host.
  • Fact 9: Compose recreates containers (new IDs) on most changes rather than restarting them in place; recreation discards the old container’s log history, which is why “check the previous attempt’s logs” sometimes turns up nothing by name.

FAQ

1) What is “the one log I need” when a container restarts forever?

The output of the previous run attempt. Because Docker restarts the same container, docker logs --tail 100 --timestamps <container> captures the crash you missed while staring at the new startup.

2) Why does docker logs show nothing useful during a restart loop?

Because you’re watching the wrong lifecycle moment. The container may exit before producing output, or the useful line scrolled past with the prior attempt. Tail the log with timestamps, find the last crash, and inspect the exit state.

3) Does Docker recreate the container or restart the same one?

Docker restart policy restarts the same container (same ID). Some higher-level tools (Compose on updates, Swarm rescheduling) may create a new container/task, which changes how you fetch “previous” logs.

4) What does exit code 137 mean in Docker?

It commonly means the process received SIGKILL. In containers, that’s frequently the kernel OOM killer. Confirm with docker inspect (OOMKilled) and dmesg.

5) My container exits with code 0 but still restarts. How?

Restart policy always will restart even on successful exit. That’s fine for daemons, wrong for jobs. Switch to on-failure or redesign the container to stay running if it’s a service.

6) Can a failing healthcheck cause restarts?

Docker itself doesn’t automatically restart on unhealthy status, but external systems often do: Compose dependencies, scripts, systemd units, or load balancer controllers. Diagnose the healthcheck output anyway; it’s usually pointing at the real readiness problem.

7) How do I stop a restart loop without deleting everything?

Temporarily disable restart policy: docker update --restart=no <name>, then stop the container. Fix the root cause, then re-enable the intended policy via your deployment configuration.

8) What if I can’t use docker logs because logs are shipped elsewhere?

Then the “one log” is the equivalent in your logging pipeline, filtered by container ID and timestamp around the crash. Still, docker inspect exit codes and daemon logs remain local truth.

9) How do I debug a container that dies too fast to exec into?

Disable restarts, run the image with an override entrypoint (shell), or run a debug container in the same namespaces. The goal is to observe the filesystem, env vars, and network from the same viewpoint as the app.

10) Is this the same as Kubernetes CrashLoopBackOff?

It’s the same basic failure mode—process exits and the system retries—but Kubernetes adds backoff, events, probes, and replica management. The diagnostic primitives still apply; there, kubectl logs --previous literally fetches the prior attempt’s output.

Next steps you should actually do

If you have a container restarting forever, do this in order:

  1. Run docker logs --tail 100 --timestamps and read the last crash like you mean it.
  2. Run docker inspect for ExitCode and OOMKilled; decide whether you’re in “crash” or “kill” territory.
  3. Check daemon logs for OCI/mount issues if the container never really starts.
  4. If it’s OOM: confirm with dmesg, then fix limits/capacity or the app’s memory profile.
  5. Fix the source of truth (Compose file, systemd unit, CI config), not the live container, unless you’re doing an emergency stopgap.

Then do the boring improvements: bounded retries where appropriate, tuned healthchecks, proper PID 1 behavior, and a preflight config check that fails loudly once instead of quietly forever.
