Docker Restart Policies: Stop Creating Infinite Crash Loops


There’s a special kind of outage where nothing is “down” because everything is constantly “starting.” Your dashboards show CPU spikes, log volume goes vertical, and the container’s status is a blur of restarts. You try to docker exec in, but the process dies before your shell prompt arrives. Congratulations: you’ve built an infinite crash loop.

Docker restart policies are meant to make services resilient. In production, they can also turn a small fault into a self-sustaining incident: noisy, expensive, and hard to debug. This is how you stop doing that.

Restart policies: what they really do (not what you hope)

Docker restart policies are simple on paper. In practice, they’re a contract between your container’s lifecycle and the daemon’s opinionated behavior. They don’t make your app healthier. They make it persistent. Those are different properties, and confusing them is how you get infinite crash loops with “self-healing” written in the postmortem.

The four policies you actually use

  • no (default): Docker won’t restart the container when it exits. This is not “unsafe.” It’s often the sanest choice for batch jobs and one-off tasks.
  • on-failure[:max-retries]: Restart only if the container exits with a non-zero code. Optionally stop after N retries. This is the closest thing Docker has to “try a bit, then stop being weird.”
  • always: Restart regardless of exit code. If the daemon restarts, the container comes back too. This is the policy that turns “graceful shutdown” into “surprise resurrection.”
  • unless-stopped: Like always, except a manual stop survives daemon restarts. It’s “always, but with respect for human intervention.” See the flag examples right after this list.
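
For reference, setting each policy is a one-liner at container creation time, and you can change it on an existing container without recreating it. The image names below are placeholders:

# batch job or one-off task: run once, never resurrect (the default, shown explicitly)
docker run --restart=no myco/report:latest

# long-running service: bounded retries, only on non-zero exit
docker run -d --restart=on-failure:5 myco/api:1.42.0

# must come back after a host reboot, but respect a manual docker stop
docker run -d --restart=unless-stopped myco/agent:latest

# change the policy on a container that already exists
docker update --restart=on-failure:5 api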

What Docker considers a “restart” (and why it matters)

Docker restarts a container when the container’s main process exits. That’s PID 1 inside the container. If your PID 1 is a shell script that forks your actual service and then exits, Docker will interpret that as “service is dead” and dutifully restart it… forever. The restart policy is not broken; your init strategy is.

Also: Docker has a built-in restart delay/backoff. It is not a configurable circuit breaker in classic Docker Engine. It keeps the loop from spinning many times per second, but it won’t stop a persistent loop. It just makes your incident longer and more confusing.
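
A minimal Dockerfile illustration of the PID 1 point, assuming a hypothetical /app/server binary and /app/start.sh wrapper:

# Fragile: shell form. PID 1 is /bin/sh -c, which may exit early or swallow signals.
ENTRYPOINT /app/start.sh

# Better: exec form. Your binary is PID 1 and receives SIGTERM directly.
ENTRYPOINT ["/app/server"]

# If you must keep a wrapper script, end it with exec so the app replaces the shell:
#   exec /app/server "$@"

# If the process doesn't reap children or handle signals, add a tiny init at run time:
#   docker run -d --init myco/api:1.42.0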

One quote, because it’s still true in 2026: “Hope is not a strategy.” — General Gordon R. Sullivan.

How to choose a policy in production (the opinionated version)

If you run production services on single hosts (or small fleets) with Docker Engine or Compose, treat restart policies as a last-mile guardrail, not your primary reliability mechanism.

  • Use on-failure:5 by default for most long-running services you expect to crash rarely. If it can’t start after 5 tries, something is wrong. Stop and page. (A Compose sketch of these defaults follows this list.)
  • Use unless-stopped when you have a strong reason (e.g., simple infra sidecars, local dev, or a host that must come up cleanly after reboot). Still: instrument it.
  • Avoid always for anything that can fail fast (bad config, missing secret, schema mismatch, migrations). “Always” is how you burn CPU while doing nothing useful.
  • Use no for batch jobs. If your nightly report job fails, you probably want it to fail loudly, not re-run forever and email finance 400 copies.
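
A minimal Compose sketch of those defaults, with placeholder service and image names. Classic docker compose accepts the on-failure[:max-retries] form, but verify against your Compose version:

services:
  api:
    image: myco/api:1.42.0
    restart: on-failure:5      # long-running service: bounded retries
  metrics-agent:
    image: myco/agent:latest
    restart: unless-stopped    # infra sidecar that must return after reboot
  nightly-report:
    image: myco/report:latest
    restart: "no"              # batch job: fail loudly, never loop (quote "no" in YAML)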

First short joke: Containers don’t heal themselves; they just get really good at reincarnation.

Facts and history that change how you think about restarts

Some context makes Docker’s restart behavior feel less arbitrary and more like a set of trade-offs that leaked into your pager.

  1. Docker’s restart policies predate the mainstream adoption of Kubernetes, when single-host container management was the common case and “keep it running” was the main ask.
  2. The “PID 1 problem” is ancient Unix history: signals, zombies, and process reaping. Containers didn’t create it; they made it impossible to ignore.
  3. Exit code semantics are a contract: Docker uses them for on-failure. If your app exits 0 on failure (“it’s fine!”) you have chosen chaos.
  4. Restart loops existed long before containers: systemd units with Restart=always can do the same damage. Docker just made it easy to do from a one-liner.
  5. Healthchecks arrived later than many people assume. For a long time, “container up” meant “process exists,” not “service works.” That legacy still shapes common patterns.
  6. Log drivers matter historically: the default json-file driver made it easy to fill disks during restarts. That’s not theoretical; it’s a repeat offender.
  7. OOM-kill behavior is kernel-level reality: the container didn’t “crash,” the kernel killed it. Docker reports the symptom; you still need to read the autopsy.
  8. Docker’s backoff is not a full circuit breaker. It slows restart frequency, but it doesn’t decide to stop. That decision is yours via policy and automation.

Why crash loops happen: failure modes, not moral failings

A crash loop is usually one of these:

  • Bad config or missing dependency: wrong env var, missing file, bad DNS, DB not reachable, secret not mounted.
  • App exits intentionally: migrations required, license check failed, feature flag invalid, “run once” image misused as a service.
  • Resource pressure: OOM kills, CPU throttling causing timeouts, disk full, inode exhaustion, file descriptor limits.
  • Broken startup ordering: app starts before DB/queue is ready; without retry logic it exits immediately.
  • Bad PID 1: shell scripts that exit early; no init; signals not handled; zombie processes accumulate then crash.
  • Corrupt state: volumes contain partial upgrades, lock files, or schema versions that don’t match the binary.
  • External throttling: rate-limited by upstream; app treats it as fatal and exits; restarts just amplify the thundering herd.

Second short joke: “We set restart: always for reliability” is the container equivalent of taping over the check-engine light.

Fast diagnosis playbook

When you’re in the incident and the container is flapping, you don’t have time for philosophy. Here’s the order that finds the bottleneck fast.

First: establish whether this is an app failure or a platform failure

  1. Check restart count and last exit code. If you see exit code 1/2/78 or similar, it’s likely app/config. If you see 137, think OOM/kill. If you see 0 with restarts, your policy is always or the daemon restarted.
  2. Look at the last 50 log lines from the previous run. You’re hunting for an explicit error, not “starting…” repeated forever.
  3. Check host dmesg/journal for OOM or disk errors. Containers can’t tell you the kernel killed them unless you ask the kernel.

Second: stop the bleeding (without losing evidence)

  1. Disable restarts temporarily so you can inspect state and logs. Don’t delete the container unless you already captured what you need.
  2. Snapshot config and inspect mounts. Most “mystery loops” are “wrong file path” with extra steps.

Third: decide whether you fix app, host, or policy

  1. If it’s config/dependency, fix the configuration or implement retries/backoff in the app. Restart policy is not a retry algorithm.
  2. If it’s resource pressure, set memory limits properly, tune logging, increase disk, or move workloads. Restart policies don’t create RAM.
  3. If it’s bad policy, switch to on-failure:5 or unless-stopped with alerting on restarts.

Practical tasks (commands + output + decisions)

You wanted commands. You get commands. Each task includes: what to run, what the output means, and what decision you make next.

Task 1: Identify the flapping container fast

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.RunningFor}}'
NAMES              IMAGE                 STATUS                          RUNNING FOR
api                myco/api:1.42.0        Restarting (1) 6 seconds ago    2 minutes ago
postgres           postgres:16            Up 3 hours (healthy)            3 hours ago
nginx              nginx:1.25             Up 3 hours                      3 hours ago

Meaning: api is restarting; Docker shows the last exit code in parentheses. It’s failing quickly (every few seconds).

Decision: Focus on api. Don’t touch the healthy dependencies yet.

Task 2: Inspect restart policy and restart count

cr0x@server:~$ docker inspect -f 'Name={{.Name}} Policy={{.HostConfig.RestartPolicy.Name}} Max={{.HostConfig.RestartPolicy.MaximumRetryCount}} Restarts={{.RestartCount}}' api
Name=/api Policy=always Max=0 Restarts=37

Meaning: Policy is always; max retries is unlimited; 37 restarts already.

Decision: You are in “infinite loop by design.” Plan to disable restarts to debug.

Task 3: Get the last exit code and termination reason

cr0x@server:~$ docker inspect -f 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}' api
ExitCode=1 OOMKilled=false Error= FinishedAt=2026-01-03T09:14:22.948601234Z

Meaning: Exit code 1, not OOM, no runtime error. Likely app/config/dependency logic.

Decision: Read logs from the failing start attempt.

Task 4: Pull logs from the previous run (not just the current flicker)

cr0x@server:~$ docker logs --tail 80 api
2026-01-03T09:14:21Z INFO starting api version=1.42.0
2026-01-03T09:14:22Z ERROR config: missing required env var: DATABASE_URL
2026-01-03T09:14:22Z INFO exiting with code=1

Meaning: The container is behaving correctly: it refuses to start without config.

Decision: Fix the environment/secret injection. Also: this is exactly why always is a bad default—this error will never self-heal.

Task 5: Confirm the container’s environment (what Docker thinks it is)

cr0x@server:~$ docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' api | sed -n '1,12p'
NODE_ENV=production
LOG_LEVEL=info
PORT=8080

Meaning: No DATABASE_URL. This isn’t a runtime DNS issue; it’s missing configuration at container creation time.

Decision: Update Compose, systemd unit, or deployment tooling to provide the variable (or mount a file and point to it).

Task 6: Stop the restart loop without deleting the container

cr0x@server:~$ docker update --restart=no api
api

Meaning: Restart policy updated. Docker will not restart it after the next stop/exit event.

Decision: Stop it cleanly, then fix configuration, then start it once with known-good settings.

Task 7: Stop it now (so you can inspect state and host pressure)

cr0x@server:~$ docker stop -t 10 api
api

Meaning: Container is stopped and should remain stopped because you set restart=no.

Decision: If it still comes back, something else (systemd, watchdog) is respawning it.

Task 8: Check whether systemd is restarting Docker itself (daemon restarts can look like container restarts)

cr0x@server:~$ systemctl status docker --no-pager
● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2026-01-03 09:00:05 UTC; 20min ago
TriggeredBy: ● docker.socket
       Docs: man:docker(1)
   Main PID: 1123 (dockerd)
      Tasks: 23
     Memory: 312.4M
        CPU: 1min 54.931s

Meaning: Docker daemon is stable right now.

Decision: Treat this as a container/app issue, not a daemon flapping issue.

Task 9: Check for OOM kills at the host level (even when Docker says OOMKilled=false)

cr0x@server:~$ sudo journalctl -k --since "10 min ago" | tail -n 12
Jan 03 09:12:01 server kernel: Memory cgroup out of memory: Killed process 24081 (node) total-vm:1820040kB, anon-rss:612340kB, file-rss:2140kB, shmem-rss:0kB
Jan 03 09:12:01 server kernel: oom_reaper: reaped process 24081 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Meaning: The kernel killed a process in a memory cgroup. Depending on timing and how the container died, Docker’s state flags aren’t always the whole story.

Decision: If this matches your container PID, you need memory limits and/or an app fix, not more restarts.

Task 10: Confirm memory limits and whether they’re sane

cr0x@server:~$ docker inspect -f 'MemLimit={{.HostConfig.Memory}} MemSwap={{.HostConfig.MemorySwap}} PidsLimit={{.HostConfig.PidsLimit}}' api
MemLimit=268435456 MemSwap=268435456 PidsLimit=0

Meaning: 256 MiB memory limit with no swap headroom. That’s fine for a tiny Go service; it’s a trap for a Node app with large heaps.

Decision: Either raise the limit or configure the runtime heap; then switch to on-failure:5 so a regression doesn’t become a denial-of-service against your own host.

Task 11: Check disk pressure and log growth (restart loops love filling disks)

cr0x@server:~$ df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p3  200G  189G  1.2G  100% /var/lib/docker

Meaning: Docker’s storage path is full. This can cause bizarre secondary failures (image pulls fail, container writes fail, metadata corruption risk).

Decision: Stop flapping containers, prune safely, and cap logs. Do not keep restarting into a full disk.

Task 12: Identify which containers are producing huge JSON logs

cr0x@server:~$ sudo du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -n 5
2.1G /var/lib/docker/containers/8f2c.../8f2c...-json.log
3.8G /var/lib/docker/containers/31ab.../31ab...-json.log
5.4G /var/lib/docker/containers/aa90.../aa90...-json.log
6.0G /var/lib/docker/containers/3c11.../3c11...-json.log
7.2G /var/lib/docker/containers/1d77.../1d77...-json.log

Meaning: Some containers are writing multi-gigabyte logs. Crash loops multiply this quickly because every startup logs the same banners and stack traces.

Decision: Enable log rotation in Docker daemon config and fix the noisy app. Meanwhile, free space carefully.

Task 13: Verify the container’s last start attempt timestamps

cr0x@server:~$ docker inspect -f 'StartedAt={{.State.StartedAt}} FinishedAt={{.State.FinishedAt}}' api
StartedAt=2026-01-03T09:14:21.115312345Z FinishedAt=2026-01-03T09:14:22.948601234Z

Meaning: It lives for ~1.8 seconds. That’s not a “transient” error; that’s a deterministic startup failure.

Decision: Stop using always. Fix config, then start once, then re-enable a bounded restart policy.

Task 14: Get the exact command/entrypoint Docker is running

cr0x@server:~$ docker inspect -f 'Entrypoint={{json .Config.Entrypoint}} Cmd={{json .Config.Cmd}}' api
Entrypoint=["/bin/sh","-c"] Cmd=["/app/start.sh"]

Meaning: PID 1 is /bin/sh -c, running a script. This is a classic source of signal-handling problems and “script exits early” restarts.

Decision: Inspect the script. Prefer exec-form entrypoints and an init if needed.

Task 15: Reproduce the failure interactively (without the restart policy)

cr0x@server:~$ docker run --rm -it --entrypoint /bin/sh myco/api:1.42.0 -lc '/app/start.sh; echo exit=$?'
config: missing required env var: DATABASE_URL
exit=1

Meaning: You reproduced the issue outside the flapping container. That’s progress: it’s deterministic.

Decision: Fix the environment injection, not Docker.

Task 16: Apply a sane policy after fixing config

cr0x@server:~$ docker update --restart=on-failure:5 api
api

Meaning: If it fails repeatedly, it stops after five failures.

Decision: Pair this with alerting on restart count so “stopped after five” becomes a page, not silent downtime.

Task 17: Validate healthcheck behavior (healthchecks don’t restart containers by themselves)

cr0x@server:~$ docker inspect -f 'Health={{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' api
Health=unhealthy

Meaning: Docker marks it unhealthy, but it will not automatically restart purely because it’s unhealthy (classic Docker Engine behavior).

Decision: If you need “unhealthy triggers restart,” use an external controller (or a different orchestrator), or have the app terminate itself on irrecoverable health failure (with care).
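
If you do want a crude “unhealthy eventually restarts” behavior on plain Docker Engine, one possible sketch is a cron-driven watchdog like the one below. The container name, log path, and schedule are placeholders; it only acts after Docker has already marked the container unhealthy, i.e., after the healthcheck’s retries are exhausted:

#!/bin/sh
# restart-unhealthy.sh: restart one container if Docker currently reports it unhealthy
CONTAINER="api"
STATUS=$(docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$CONTAINER")
if [ "$STATUS" = "unhealthy" ]; then
  echo "$(date -u) restarting $CONTAINER (health=$STATUS)" >> /var/log/docker-watchdog.log
  docker restart -t 10 "$CONTAINER"
fi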

Task 18: Watch restart events in real time

cr0x@server:~$ docker events --since 5m --filter container=api
2026-01-03T09:10:21.115Z container start 8b4f... (name=api)
2026-01-03T09:10:22.948Z container die   8b4f... (exitCode=1, name=api)
2026-01-03T09:10:24.013Z container start 8b4f... (name=api)

Meaning: You can see the loop cadence and exit codes. Helpful when logs are noisy or rotated away.

Decision: If restarts correlate with host events (daemon restart, network flap), broaden scope. If it’s stable cadence, it’s app/config.

Healthchecks, dependency readiness, and the myth of “restart fixes it”

Most restart loops are dependency readiness problems masquerading as “Docker weirdness.” Your app starts, tries the database once, fails, exits. Docker restarts. Repeat until the heat death of the universe or until the DB returns and you get lucky.

Do this in the app: retry with backoff, and distinguish fatal vs transient

If the database is down for 30 seconds during maintenance, exiting immediately is not “clean.” It’s fragile. Implement connection retries with exponential backoff and a maximum time budget. If the error is fatal (bad password, wrong host), exit once and let bounded on-failure retries catch a transient deployment ordering issue.

Do this in Docker: use healthchecks for observability and gating, not magical healing

Healthchecks are valuable because they give you a machine-readable signal: healthy/unhealthy. In classic Docker, they don’t automatically restart the container, but they:

  • help you see “process is running but service is dead,”
  • integrate with Compose depends_on conditions (in newer Compose implementations; see the snippet after this list),
  • give your external monitoring something better than “container exists.”
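
A sketch of both pieces, assuming the image ships curl and the service exposes a /healthz endpoint on port 8080 (all three are assumptions about your app):

# Dockerfile: mark the container unhealthy when the service stops answering
HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
  CMD curl -fsS http://localhost:8080/healthz || exit 1

# Compose: gate startup on dependency readiness, not mere container existence
services:
  api:
    image: myco/api:1.42.0
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5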

Dependency checks: avoid “wait-for-it” scripts that never end

There are two failure styles:

  • Fail-fast loops: app exits immediately, Docker restarts. Noisy, but obvious.
  • Hang-forever starts: entrypoint waits for dependency forever. Docker thinks it’s “Up” but it’s not serving. Quiet, but deadly.

Prefer bounded waits with explicit timeouts. If the dependency doesn’t appear, exit non-zero and let on-failure try a few times, then stop.
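
A bounded-wait entrypoint sketch, assuming a TCP-reachable database and a BusyBox-style nc in the image (both assumptions), plus a hypothetical /app/server binary. It backs off, gives up after a fixed budget, and ends with exec so the service becomes PID 1; the same backoff-with-a-budget pattern belongs inside the app’s own connection logic:

#!/bin/sh
set -eu
host="${DB_HOST:-postgres}"; port="${DB_PORT:-5432}"
deadline=$(( $(date +%s) + 60 ))    # total wait budget: 60 seconds, not forever
delay=1
until nc -z "$host" "$port" 2>/dev/null; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "dependency $host:$port not ready within budget" >&2
    exit 1                          # non-zero: let on-failure retry a few times, then stop
  fi
  sleep "$delay"
  if [ "$delay" -lt 8 ]; then delay=$((delay * 2)); fi   # exponential backoff, capped
done
exec /app/server "$@"               # exec: the service replaces the shell as PID 1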

Docker Compose, Swarm, and why your policy may not be applied

Restart policy behavior depends on how you deploy.

Compose: restart: is easy to set and easy to forget you set

Compose makes it trivial to sprinkle restart: always across a file. Teams do this because it “reduces tickets.” It also reduces learning, until the day it turns a simple misconfiguration into a fleet-wide log storm.

Also: Compose version differences matter. Some fields under deploy: are ignored unless you’re using Swarm. People copy/paste configs and assume they’re active. They aren’t.
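
The distinction in YAML, roughly: restart: is honored by plain docker compose, while deploy.restart_policy is the Swarm model and has historically been ignored outside docker stack deploy. Check what your Compose version actually honors before relying on either:

services:
  api:
    image: myco/api:1.42.0
    restart: on-failure:5        # plain docker compose reads this
    deploy:
      restart_policy:            # Swarm (docker stack deploy) reads this
        condition: on-failure
        max_attempts: 3
        delay: 5s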

Swarm services: restart behavior is a different model

Swarm has its own reconciliation loop. Restart policies there are part of service scheduling, not just local daemon behavior. If you’re on Swarm, you want to use service-level restart conditions and delays. If you’re not on Swarm, don’t pretend you are by using deploy: keys in Compose and expecting them to work.
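
On Swarm, restart behavior is a property of the service, set with flags like these on docker service create/update:

docker service update \
  --restart-condition on-failure \
  --restart-max-attempts 3 \
  --restart-delay 5s \
  api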

systemd wrapping Docker: double restart loops are a real thing

A common pattern: systemd unit runs docker run and systemd has Restart=always. Docker container has --restart=always. When something goes wrong, both layers “help.” Now you have a restart loop that persists across daemon restarts and survives your attempts to stop the container because systemd immediately recreates it.

If you must use systemd, let systemd own the restart behavior and set Docker container restart policy to no. Or vice versa. Pick one adult in the room.
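
A sketch of the “systemd owns restarts” variant, with placeholder paths and names: the container keeps the default restart policy of no, and systemd supplies bounded restarts plus a start-rate limit.

# /etc/systemd/system/api.service
[Unit]
Description=api container
After=docker.service
Requires=docker.service
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStartPre=-/usr/bin/docker rm -f api
ExecStart=/usr/bin/docker run --rm --name api myco/api:1.42.0
ExecStop=/usr/bin/docker stop -t 10 api
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target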

Observability and guardrails: rate limits for your own chaos

A restart policy without alerting is just silent failure with extra steps. Your goal isn’t “restart forever.” Your goal is “recover quickly from transient faults and surface persistent faults.” That means guardrails.

Guardrail 1: alert on restarts and on restart rate

Restart count alone is not enough. A container that restarts once a day might be fine. A container that restarts 50 times in 5 minutes is an incident. Track both absolute count and rate. If you don’t have a metrics pipeline, you can still do this with cron + docker inspect and a simple state file. It’s not glamorous, but neither is explaining to leadership why your logging bill doubled overnight.
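
A minimal sketch of the cron-plus-inspect approach; the container name, state file path, and mail-based alerting are placeholders for whatever you actually page with:

#!/bin/sh
# restart-rate-check.sh: run from cron; alert if a container restarted >= THRESHOLD times since last run
CONTAINER="api"; THRESHOLD=5; STATE="/var/tmp/${CONTAINER}.restarts"
now=$(docker inspect -f '{{.RestartCount}}' "$CONTAINER")
prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$now" > "$STATE"
if [ $(( now - prev )) -ge "$THRESHOLD" ]; then
  echo "ALERT: $CONTAINER restarted $(( now - prev )) times since last check" \
    | mail -s "docker restart rate: $CONTAINER" oncall@example.com
fi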

Guardrail 2: log rotation at the daemon level

If you are using json-file (many are), set rotation. Crash loops + unbounded logs + small disks is a predictable outage generator.
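
The daemon-level setting for hosts on the json-file driver. Restart dockerd after changing it, and note that existing containers keep their old log options until they are recreated:

# /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}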

Guardrail 3: bounded retries at the policy level

on-failure:5 is not perfect, but it creates a clear state: “this is persistently broken.” That state is actionable. “It’s restarting forever” is not.

Guardrail 4: resource limits that match reality

Unbounded memory makes a container capable of taking down the host. Overly tight memory makes it restart forever. Both are bad. Set reasonable limits and monitor actual usage. Treat limits as SLO tools, not punishment.
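
Setting the limit is one line; picking the number is the real work, so measure first:

# observe actual usage before choosing a limit
docker stats --no-stream api

# then set a limit with headroom (swap headroom disabled here, matching Task 10)
docker update --memory=512m --memory-swap=512m api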

Three corporate-world mini-stories (anonymized, plausible, technically accurate)

Mini-story 1: The incident caused by a wrong assumption

At a mid-size company, a team migrated a legacy API from VMs to Docker on a pair of beefy hosts. They were proud: fewer moving parts, easier deploys, consistent environments. They added --restart=always “so the service stays up.” No orchestration, no external supervisor. Just Docker Engine and confidence.

During a routine secret rotation, the database password changed. The new secret made it into the secrets store, but the deployment job that rebuilt containers failed halfway through, leaving one host with the old environment variable and the new image. The API booted, failed authentication, and exited with code 1. Docker restarted it. Again. And again.

The logs were full of authentication failures, written to json-file on a shared disk. Within an hour, the disk holding /var/lib/docker was close to full. Then other unrelated containers started failing to write state. Monitoring began to flake because its own container couldn’t write to disk. The on-call saw “everything is restarting” and initially suspected a kernel issue.

The wrong assumption was subtle: they assumed a restart policy was a reliability feature. It’s not. It’s a persistence feature. Persistent failure is still failure; it just gets louder over time.

The fix was boring: switch services to on-failure:5, rotate logs, and—most important—treat missing/invalid secrets as a page-worthy deployment fault with a clear rollback path.

Mini-story 2: The optimization that backfired

A different organization ran a fleet of Docker hosts for internal tools. Someone noticed that service restarts during deploys were slow because images were large and startup scripts did “helpful” checks. They optimized: trimmed the image, removed a bunch of checks, and changed the entrypoint to a small shell wrapper that set env vars and launched the service. Deploys got faster. Everyone applauded.

Weeks later, an upstream dependency started returning intermittent TLS errors due to a certificate chain issue. The application had previously retried for a minute before exiting; one of the removed “helpful checks” included a network readiness loop. Now it failed fast and exited immediately. Because the service had restart: always, the fleet hammered the failing upstream dependency harder, creating a feedback loop. The upstream rate-limited them, which made failures more frequent, which increased restarts, which increased the rate limit triggers. A nice circular economy of pain.

It got worse: the shell wrapper was PID 1 and didn’t forward signals properly. During mitigation, operators tried to stop the containers, but shutdowns were inconsistent and sometimes hung, leaving ports bound. That made subsequent restarts fail differently (“address already in use”), which added confusion and extended the incident.

The optimization wasn’t inherently wrong—smaller images are good—but the changes removed resilience logic from the application and replaced it with “Docker will restart it.” Docker did exactly that, and the resulting behavior was technically correct and operationally disastrous.

The eventual fix: restore proper retry logic with jitter, use exec-form entrypoints (and an init where needed), and change restart policy to bounded retries. They also implemented upstream failure budgets in the client to avoid stampeding dependencies during partial outages.

Mini-story 3: The boring but correct practice that saved the day

A fintech team ran a payment-adjacent service on Docker Compose across a small cluster of hosts. Nothing fancy. What they did have was discipline: every service used on-failure:3 unless there was a written exception, and every container had a healthcheck. Restart counts were shipped as metrics and paged on a rate threshold.

One morning, a new build hit production with a subtle config parsing bug. The service exited with code 78 (config error) right after logging a single, clear line. The first three restarts happened quickly, then the container stopped. The on-call got a page: “service stopped after retries.” The logs were short and readable because rotation was set globally. The host stayed healthy because the loop ended by policy, not luck.

They rolled back within minutes. No cascading disk-full failures, no “why is the CPU pegged,” no noisy neighbor effects on unrelated services. Postmortem was almost boring, which is the highest compliment you can give operations.

The practice that saved them wasn’t exotic tooling. It was two defaults: bounded restarts, and alerting when the bound is hit. The container didn’t “self-heal.” The system self-reported.

Common mistakes: symptoms → root cause → fix

1) Symptom: Container restarts forever with the same log line

Root cause: restart: always (or unless-stopped) + deterministic startup failure (missing env var, missing file, bad flag).

Fix: Switch to on-failure:5, fix config injection, and make your app print one high-signal error line before exit.

2) Symptom: Restarts show exit code 137

Root cause: OOM kill or forced termination. Often a too-tight memory limit or memory leak.

Fix: Confirm kernel OOM logs, increase memory limit or tune runtime heap, and add memory monitoring. Bounded restarts prevent host thrash.

3) Symptom: docker stop works, but container comes back

Root cause: Policy is always (or another supervisor recreates it: systemd, cron, CI agent).

Fix: docker update --restart=no and check for external supervisors. Make a single layer responsible for restarts.

4) Symptom: Container is “Up” but service is dead

Root cause: No healthcheck and the process is alive but wedged (deadlock, dependency stall). Restart policy doesn’t help because nothing exits.

Fix: Add healthcheck and external alerting; consider a watchdog that restarts on persistent unhealthy state (carefully), or fix the deadlock cause.

5) Symptom: After host reboot, containers you “stopped” are running again

Root cause: restart: always ignores previous manual stops after daemon restart; unless-stopped respects them.

Fix: Use unless-stopped when manual stops must persist across daemon restarts, or move control to a higher-level orchestrator.

6) Symptom: Disk fills up during an incident

Root cause: Crash loops amplify logging; default json-file logging without rotation is unbounded.

Fix: Configure daemon log rotation, reduce startup log spam, and bound restarts so failure doesn’t generate infinite logs.

7) Symptom: “depends_on” didn’t prevent the crash loop

Root cause: Startup order is not readiness. Dependency container can be “up” but not ready to accept connections.

Fix: Add readiness checks and retry logic; use healthchecks and readiness gates where supported.

8) Symptom: Graceful shutdown doesn’t happen; data corruption risk

Root cause: PID 1 is a shell wrapper that doesn’t forward signals; app doesn’t handle SIGTERM; stop timeout too short.

Fix: Use exec-form entrypoint, add an init (e.g., --init), handle signals, and set reasonable stop timeouts.

Checklists / step-by-step plan

Step-by-step plan: how to fix restart policies without breaking production

  1. Inventory current policies. List containers and their restart policies (one-liner after this list). Flag anything with always and no clear justification.
  2. Classify services. Batch jobs, stateless services, stateful services, infra agents. Each gets a default.
  3. Pick a default policy: usually on-failure:5 for services, no for jobs, unless-stopped for a small set of “must come back after reboot” agents.
  4. Add alerting on restart rate. If you can’t, at least create a daily report and a paging threshold for “stopped after N retries.”
  5. Add log rotation. At daemon level. Don’t rely on every app team to do it right.
  6. Review entrypoints. Shell wrappers get extra scrutiny. Add --init where it helps.
  7. Test failure modes. Pull the DB cable (figuratively). Remove a secret. Ensure the system fails loudly and predictably.
  8. Roll out changes gradually. One host or one service group at a time. Watch for “hidden dependencies” on infinite restarts (yes, that happens).
  9. Document exceptions. If a service truly needs always, write why and what alert catches its crash loop.
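
For step 1, a single command covers the inventory, including stopped containers:

docker ps -aq | xargs docker inspect -f '{{.Name}} policy={{.HostConfig.RestartPolicy.Name}}:{{.HostConfig.RestartPolicy.MaximumRetryCount}} restarts={{.RestartCount}}'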

Checklist: what to capture during a crash loop incident

  • Restart policy and restart count (docker inspect output).
  • Last exit code and OOMKilled flag.
  • Last 100 log lines (before rotating or pruning).
  • Host kernel logs for OOM/disk/network errors.
  • Disk usage for /var/lib/docker and volume mountpoints.
  • Any external supervisor configuration (systemd units, cron jobs, CI runners).

FAQ

1) Should I use restart: always in production?

Rarely. Use it only when you understand the failure modes and have alerting on restart rate. Default to on-failure:5 for services.

2) What’s the practical difference between always and unless-stopped?

unless-stopped respects a manual stop across daemon restarts. always brings the container back after a daemon restart even if you previously stopped it.

3) Does Docker restart a container when it becomes unhealthy?

Not by default in classic Docker Engine. Healthchecks mark status; they don’t trigger restarts automatically. You need an external controller if you want that behavior.

4) If my app exits 0 on error, will on-failure restart it?

No. on-failure restarts only on non-zero exit codes. Fix your app’s exit codes; they’re part of the operational contract.

5) Why can’t I docker exec into a flapping container?

Because it’s not running long enough. Disable restarts (docker update --restart=no), stop it, then run the image interactively with a shell to reproduce the failure.

6) What exit codes should I watch for?

Common signals: 1 = generic failure (read the logs), 137 = SIGKILL (often the OOM killer), 143 = SIGTERM, 0 = successful exit (but if it keeps restarting, you likely set always or the daemon restarted).

7) Can restart policies hide outages?

Yes. They can convert “service down” into “service flapping,” which looks alive to shallow monitoring. Alert on restarts and service-level health, not just container existence.

8) Should I set a maximum retry count?

Yes, for most services. It creates a stable end-state for persistent failures and prevents infinite resource consumption. Pair it with alerting so “stopped after retries” is actionable.

9) What’s the best way to prevent restart loops from filling disks?

Bound restarts, rotate logs at the daemon level, and reduce noisy startup logging. Also monitor /var/lib/docker usage explicitly.

10) Isn’t Kubernetes better at this?

Kubernetes gives you stronger controllers and primitives, but it can also create crash loops if you misconfigure probes and backoff. The principle remains: restarts are not a fix for deterministic failure.

Conclusion: next steps you can ship this week

Restart policies are a scalpel, not duct tape. Use them to recover from transient faults, not to keep a broken binary spinning forever while your logs eat the disk.

Practical next steps:

  1. Audit all containers for restart: always and justify each one.
  2. Change the default to on-failure:5 for services and no for jobs.
  3. Enable daemon-level log rotation if you use json-file.
  4. Add alerting on restart rate and on “stopped after retries.”
  5. Fix PID 1 and signal handling in images that use shell entrypoints; use exec form and an init when appropriate.

Then the next time something fails at 2 a.m., it will fail like an adult: once, clearly, and with enough evidence to fix it.
