Docker: ‘It Works on My Machine’ — The Image Tag Rule That Ends the Drama


Some production incidents start with a fire. Most start with a shrug: “That’s weird… it worked on my laptop.” You ship a container, it passes CI, it runs fine in staging, and then production behaves like it woke up in a different universe.

Usually it did. Not because Docker is unreliable. Because humans are. Specifically: humans treating image tags like they’re version numbers when they’re actually nicknames. Nicknames lie.

The one rule: never deploy floating tags

Here’s the rule that ends most “works on my machine” container drama:

Deploy by digest, not by mutable tag.

Not “latest”. Not “main”. Not even “1.2”. In production (and ideally in staging, too), you deploy immutable identifiers: image@sha256:….
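In shell terms, the distinction is a single pattern match. A minimal sketch of the rule as a predicate (the function name is illustrative):

```shell
# is_pinned: succeed only when an image reference carries an immutable digest.
# A sketch; real policy enforcement should live in CI, not in one function.
is_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;  # immutable: content-addressed reference
    *)          return 1 ;;  # floating: tag-only reference, free to move
  esac
}
```

`is_pinned 'registry.local/auth@sha256:2f9e…'` succeeds; `is_pinned 'registry.local/auth:1.8'` fails. Everything that fails this check is a nickname.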

What counts as a “floating” tag?

Any tag that can move without you noticing. Which is… most tags.

  • latest is floating by design.
  • main, stable, prod, candidate are floating because humans move them.
  • 1.2 floats when you rebuild the same tag after patching a base image or “just fixing the Dockerfile.”
  • Even 1.2.3 floats if your registry allows tag overwrite and your process doesn’t prevent it.

What’s the practical effect of deploying by digest?

It makes “what’s running” answerable in one line. It makes rollbacks deterministic. It kills the class of incidents where the same Kubernetes YAML deploys different bits on different days.

There’s a cultural side too: it forces you to treat “build” and “deploy” as separate events. Build produces an artifact. Deploy selects that artifact. If your “deploy” step triggers a rebuild, you are not deploying artifacts; you are gambling with a compiler.

The “latest” tag is like milk in the office fridge: technically labeled, emotionally dangerous, and never what you think it is.

When is it okay to use tags?

Tags are fine for humans and workflows:

  • In CI: “this build produced tag pr-1847.”
  • In promotion pipelines: “move candidate to point at digest X.”
  • In dev: “run :latest locally if you enjoy surprises.”

But production should run digests. If you absolutely must keep tags in manifests for ergonomics, then enforce tag immutability at the registry and still record the digest you deployed. Otherwise you’re back to trusting a nickname.

Tags vs digests: what you think you’re deploying vs what you are deploying

Tags are pointers, not identities

A Docker image tag is a reference in a registry: repo:tag points to some image manifest. That pointer can be updated. Sometimes intentionally (“promote to prod”). Sometimes accidentally (“re-push with the same tag”). Sometimes because tooling did it for you.

An image digest is the content address: repo@sha256:…. It identifies a specific manifest. The digest changes if the content changes. You can’t “update” a digest to point somewhere else; it is the somewhere.
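You can feel this property with nothing but a hash tool: the identifier is derived from the bytes, so different bytes mean a different identifier. Here sha256sum stands in for the registry's manifest digest:

```shell
# Content addressing in one step: hash the content, get the identity.
# Change one byte and the identity changes; there is no "repointing" it.
printf 'layer-v1' | sha256sum | cut -d' ' -f1
printf 'layer-v2' | sha256sum | cut -d' ' -f1
```

Two different inputs, two different 64-hex-character digests. A tag, by contrast, is just a row in a registry database that happens to reference one of these.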

Two common confusions that cause outages

  • “But I tagged it v1.4.2, so it’s immutable.” No. It’s a string in a registry. Unless you enforce immutability, it can be overwritten.
  • “But my Dockerfile didn’t change.” Your base image did. Your apt repository did. The time of day did. If you build without pinning inputs, you get different results with the same instructions.
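The same rule applies one layer down: pin base images by digest in the Dockerfile (FROM image@sha256:…) so rebuilds don't silently absorb a new base. A quick audit for unpinned FROM lines, as a sketch (adjust the search root and filename pattern to your repo layout; assumes GNU grep):

```shell
# find_unpinned_base_images: list FROM lines under a directory that reference
# a base image without an @sha256 digest. Non-empty output means your build
# inputs can drift between rebuilds.
find_unpinned_base_images() {
  grep -RHn --include='Dockerfile*' '^FROM ' "$1" | grep -v '@sha256:'
}
```

Run it against your repo root; each reported line is a build input that can change underneath the same instructions.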

Why digests aren’t enough by themselves

Pinning digests prevents drift at deploy time. It does not automatically guarantee that the digest corresponds to what you think it is (provenance), nor that the build was reproducible (build inputs). Those are separate controls. But the digest pin is the best first move because it turns a slippery problem into one you can reason about.

Also: multi-arch images exist. A tag might resolve to different platform-specific images depending on the node (amd64 vs arm64). Digests can be per-manifest-list (index) or per-platform manifest. You need to know which one you pinned.
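If you save the index JSON locally (e.g. docker manifest inspect repo:tag > index.json), you can extract the per-platform digest. A text-scraping sketch that assumes the pretty-printed layout that manifest inspection produces; a real JSON parser is safer in production tooling:

```shell
# platform_digest: print the manifest digest for a given architecture from a
# saved image index JSON. Relies on "digest" appearing before "architecture"
# within each manifest entry, as manifest inspection prints it. Fragile by
# design; this is a sketch, not a parser.
platform_digest() {
  awk -v arch="\"$2\"" '
    /"digest":/ { gsub(/[",]/, "", $2); d = $2 }        # remember latest digest
    /"architecture":/ && index($0, arch) { print d }    # emit it on arch match
  ' "$1"
}
```

`platform_digest index.json amd64` gives you the amd64 manifest digest; the index's own digest is what the registry reports for the tag as a whole.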

Why drift happens (even with “good” processes)

Pull behavior is not a moral judgment

Docker and Kubernetes aren’t trying to trick you. They’re trying to be efficient. If an image with the same name exists locally, the runtime might not pull again. If you have multiple nodes, each node has its own local cache. If your policy is “if-not-present”, drift is basically a feature.

CI rebuilds mean you’re shipping a moving target

A classic anti-pattern looks like this:

  • Developers merge to main.
  • CI builds myapp:latest and pushes it.
  • CD deploys myapp:latest.
  • A hotfix triggers another build that also pushes myapp:latest.
  • Some nodes pull, some don’t. Or they pull at different times.

Now you’ve got an unplanned canary deployment distributed by cache behavior. It will not be the canary you wanted.

Base images mutate and you often want them to

Security patching is real. Many teams rebuild images to pick up patched base layers. That’s good. But if you rebuild and overwrite the same tag, you’ve taken a good security hygiene action and turned it into a deployment ambiguity problem.

Registries, proxies, and “helpful” mirrors get involved

If you have a registry cache, a pull-through proxy, or a mirror, now you have another layer that can return stale results if configured poorly. Your laptop might be talking to Docker Hub. Production might be talking to a corporate mirror with its own refresh logic. Same tag. Different bytes. Enjoy your debugging afternoon.

Facts and history that explain why this keeps happening

  1. Docker Hub popularized “latest” as a default UX choice. Early Docker workflows leaned on convenience, not strict provenance.
  2. Content-addressable storage pre-dates containers. Git’s model (hash identifies content) is older and conceptually similar to image digests; tags are like branch names.
  3. OCI standardized image formats. What you pull today is largely aligned with an open spec, which is why multiple runtimes and registries interoperate.
  4. Multi-arch images changed the meaning of “the image”. A tag can point to an index that picks different platform manifests per node architecture.
  5. Kubernetes defaulted ImagePullPolicy based on tags. If you use :latest, it defaults to Always; otherwise it often defaults to IfNotPresent—subtle, and frequently misunderstood.
  6. Layer caching is an optimization with consequences. It reduces network and speeds deploys, but it also hides tag updates unless you force pulls.
  7. Supply chain security pushed digest pinning into the mainstream. Provenance, signatures, SBOMs—these practices treat the digest as the anchor.
  8. “Immutable infrastructure” was supposed to make deployments boring. Containers made it easier to package; digests make it easier to know exactly what you packaged.

One reliability-minded paraphrased idea worth tattooing on your deployment pipeline:

Werner Vogels (paraphrased idea): treat everything as disposable and design for failure; you’ll spend less time praying and more time restoring service.

Three corporate mini-stories from the trenches

Mini-story #1: The incident caused by a wrong assumption

A mid-sized SaaS company had a tidy setup: Kubernetes in production, GitOps for manifests, a container registry, and a CD controller. They were proud of their discipline. Then they had a login outage that only hit about a third of users. Not everyone. Not every region. Just enough to ruin the on-call’s dinner.

The first clue was that the error signature varied by pod. Some pods threw an exception about a missing CA bundle path. Others were fine. The deployment manifest said myorg/auth-service:1.8. Everyone assumed 1.8 meant one thing. “It’s a version tag. It’s stable.” That was the wrong assumption.

They had rebuilt :1.8 to pick up base image CVEs a week earlier. The registry allowed tag overwrite. Some nodes still had the old layers cached; others pulled the rebuilt image during routine churn. Kubernetes was set to IfNotPresent. The end state was a split-brain deployment: two different images under one tag, running simultaneously.

The fix was quick once the diagnosis landed: pin by digest in the deployment, force a rollout, then make tags immutable for anything resembling a release. The postmortem was blunt: “We treated a tag as an identity.” The next week, they audited every production manifest for floating tags and replaced them with digests, while still keeping a human-friendly tag in CI output.

Mini-story #2: The optimization that backfired

A different org ran a high-throughput API and got tired of slow node scaling. Pulling images during autoscaling caused cold starts. So they introduced a registry mirror inside the VPC and aggressive caching on nodes. It worked: scale-outs became much faster, and network egress dropped.

Then they rolled out a critical bug fix and it didn’t “take” everywhere. Some nodes kept serving the old behavior for hours. The mirror respected cache headers and had a refresh interval; the nodes were set to prefer local cached images to avoid pulls. The team had built speed by increasing staleness.

The backfire wasn’t that caching existed. The backfire was that caching was allowed to decide correctness. Their deployment referenced api-gateway:stable, updated by CI. The tag moved, but the cache didn’t care. The deployment controller thought it deployed. It had, just not the bits they intended.

They kept the mirror, because fast scale-outs are worth having. But they changed the contract: promotion moved a tag to a digest, deployments used the digest, and caches were now safe because the identifier stopped moving. They also added a simple gate: “if prod manifest contains a colon tag without @sha256, fail the PR.” Boring. Effective.

Mini-story #3: The boring but correct practice that saved the day

A finance-adjacent company (the kind that gets audited for fun) had an unglamorous release process. Every build produced an image, signed it, recorded the digest, and stored that digest alongside the change ticket. Environments promoted by digest only. No exceptions. Developers complained it was slow and “enterprisey.”

One Friday, a runtime vulnerability dropped with a wave of scary headlines. Security asked for immediate patching. The team rebuilt base images, rebuilt services, and pushed new images. Then, in the middle of the rush, a different change accidentally slipped into one service’s build context—an unreviewed config tweak. In many shops, that’s how you get a weekend outage.

Here’s what saved them: promotion required an explicit digest selection. The unreviewed build produced a digest that didn’t match the approved change ticket. The pipeline refused to promote it. The team patched the vulnerability using the correct digests and kept the accidental config out of prod.

Nobody got applause. There was no heroic debugging. The system simply declined to be clever. That’s the goal.

If your deployment process needs a hero, your process is basically a reality show with worse lighting.

Practical tasks: commands, outputs, and decisions

This section is the “I have a terminal and a problem” part. Each task includes a command, a realistic output snippet, what it means, and what decision you make from it.

Task 1: See what image a running container is actually using (Docker)

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.ID}}'
NAMES                IMAGE                         CONTAINER ID
auth-service         registry.local/auth:1.8       7c1d2e9a0b53
billing-worker       registry.local/billing:prod   3a9b6d1c4f21

What it means: You see tags, not digests. This is a hint, not proof. A tag may have moved since the container started.

Decision: Inspect the container to find the immutable image ID / digest mapping.

Task 2: Inspect container image ID and repo digests (Docker)

cr0x@server:~$ docker inspect auth-service --format '{{.Image}}'
sha256:0e3d2b4f1e1b6d0f1b6b8c9a3f2a1d0c9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4
cr0x@server:~$ docker image inspect 0e3d2b4f1e1b --format '{{json .RepoDigests}}'
["registry.local/auth@sha256:8b4c6a3d2e1f0a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5"]

What it means: The running container references a specific digest. That digest is what you should record and compare across environments.

Decision: If the manifest deploys by tag, change it to deploy by this digest (or the approved one).

Task 3: Check whether a tag currently points to a different digest (remote truth)

cr0x@server:~$ docker pull registry.local/auth:1.8
1.8: Pulling from auth
Digest: sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e
Status: Image is up to date for registry.local/auth:1.8

What it means: The registry’s current digest for :1.8 is shown. If it doesn’t match the digest from Task 2, the tag moved.

Decision: Treat it as a release hygiene failure. Stop deploying by this tag; pin digest and investigate why the tag changed.

Task 4: List local images and see repo digests (catch “same tag, different digests”)

cr0x@server:~$ docker images --digests --format 'table {{.Repository}}\t{{.Tag}}\t{{.Digest}}\t{{.ID}}\t{{.CreatedSince}}'
REPOSITORY              TAG     DIGEST                                                                    IMAGE ID       CREATED SINCE
registry.local/auth     1.8     sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e   0e3d2b4f1e1b   9 days ago
registry.local/auth     <none>  <none>                                                                    9a1b2c3d4e5f   15 days ago

What it means: You can have multiple local images that once related to the same tag. The older one may be untagged now.

Decision: Don’t rely on tag presence in cache; use digests in deployments, and clean up old images if disk pressure is a factor.

Task 5: Force a clean pull to eliminate cache ambiguity (Docker)

cr0x@server:~$ docker rmi registry.local/auth:1.8
Untagged: registry.local/auth:1.8
Deleted: sha256:0e3d2b4f1e1b6d0f1b6b8c9a3f2a1d0c9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4
cr0x@server:~$ docker pull registry.local/auth:1.8
1.8: Pulling from auth
Digest: sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e
Status: Downloaded newer image for registry.local/auth:1.8

What it means: Now the local tag resolves to the registry’s current content.

Decision: If this changes behavior, you just proved cache-driven drift. Fix the process, not the node.

Task 6: In Kubernetes, see what image the Deployment claims vs what pods actually run

cr0x@server:~$ kubectl -n prod get deploy auth-service -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
registry.local/auth:1.8
cr0x@server:~$ kubectl -n prod get pods -l app=auth-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}'
auth-service-6d9df7c6d9-7h5qv	docker-pullable://registry.local/auth@sha256:8b4c6a3d2e1f0a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5
auth-service-6d9df7c6d9-km2nw	docker-pullable://registry.local/auth@sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e

What it means: Same Deployment tag, different pod image digests. That’s drift.

Decision: Pin the Deployment to a single digest and roll out. Then fix the pipeline to prevent tag overwrite.

Task 7: Check ImagePullPolicy (Kubernetes) to understand cache behavior

cr0x@server:~$ kubectl -n prod get deploy auth-service -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}{"\n"}'
IfNotPresent

What it means: Nodes may not pull even if the tag changed, depending on cache state.

Decision: Don’t “fix” this by setting Always everywhere as your primary control. Fix the identifier (digest). Use Always selectively when you truly want tag tracking in dev.

Task 8: Confirm the rollout revision and correlate it to an image change

cr0x@server:~$ kubectl -n prod rollout history deploy/auth-service
deployment.apps/auth-service
REVISION  CHANGE-CAUSE
12        image updated to registry.local/auth:1.8
13        configmap reload

What it means: The history tells you “image updated” but not the digest unless you record it explicitly.

Decision: Start annotating deployments with the digest (or commit SHA + digest) at deploy time.

Task 9: Inspect a registry tag’s digest via manifest inspection (no guessing)

cr0x@server:~$ docker manifest inspect registry.local/auth:1.8 | head -n 12
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.oci.image.index.v1+json",
   "manifests": [
      {
         "mediaType": "application/vnd.oci.image.manifest.v1+json",
         "digest": "sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e",
         "platform": {
            "architecture": "amd64",

What it means: The tag points to an index (multi-arch). You can see the digest(s) per platform.

Decision: Pin the index digest if you want a single reference across architectures, or pin platform-specific digests if you need deterministic per-arch control.

Task 10: Find which node is running which digest (targeted remediation)

cr0x@server:~$ kubectl -n prod get pods -l app=auth-service -o wide
NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE
auth-service-6d9df7c6d9-7h5qv   1/1     Running   0          2h    10.20.1.14    ip-10-0-4-21
auth-service-6d9df7c6d9-km2nw   1/1     Running   0          2h    10.20.2.33    ip-10-0-6-18
cr0x@server:~$ kubectl -n prod get pod auth-service-6d9df7c6d9-7h5qv -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}'
docker-pullable://registry.local/auth@sha256:8b4c6a3d2e1f0a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5

What it means: You can pinpoint drift to nodes, not just pods.

Decision: If you need an immediate containment action, cordon/drain the offending node after you’ve fixed the manifest; otherwise it will just happen again.

Task 11: Verify what containerd thinks is present (runtime-level truth)

cr0x@server:~$ sudo ctr -n k8s.io images ls | grep registry.local/auth
registry.local/auth:1.8                                      application/vnd.oci.image.index.v1+json    sha256:1111aaaabbbb2222cccc3333dddd4444eeee5555ffff6666777788889999aaaa   245.3 MiB  linux/amd64,linux/arm64
registry.local/auth@sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e   application/vnd.oci.image.manifest.v1+json sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e   84.1 MiB   linux/amd64

What it means: The runtime may store both tag references and digest references. The listing shows what’s cached and for which platforms.

Decision: If pods are unexpectedly not pulling, confirm whether the digest is already present and whether your pull policy and runtime behavior match your expectations.

Task 12: Confirm a rollout actually replaced pods (not just updated a spec)

cr0x@server:~$ kubectl -n prod rollout status deploy/auth-service
deployment "auth-service" successfully rolled out
cr0x@server:~$ kubectl -n prod get rs -l app=auth-service --sort-by=.metadata.creationTimestamp
NAME                       DESIRED   CURRENT   READY   AGE
auth-service-6d9df7c6d9     6         6         6       2h
auth-service-5c7bbf99b8     0         0         0       7d

What it means: You have a single active ReplicaSet. Good. If multiple are active unexpectedly, you might have partial rollouts or stuck terminations.

Decision: If drift persists, the problem is likely the image reference itself (tag moving) or node caching combined with mutable identifiers.

Task 13: Detect if your manifest contains a floating tag (cheap guardrail)

cr0x@server:~$ grep -RIn 'image:' k8s/prod | grep -v '@sha256:' | head
k8s/prod/auth/deploy.yaml:27:image: registry.local/auth:1.8
k8s/prod/api/deploy.yaml:19:image: registry.local/api:latest

What it means: Lines show images that are tag-only. You’re one registry overwrite away from ambiguity.

Decision: Replace with digest form or enforce tag immutability; preferably both.
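Task 13's detection turns into a PR gate by converting the match into an exit code. A sketch (function name and path are illustrative; this variant flags any image line lacking a digest):

```shell
# check_pinned_images: fail when any manifest under the given directory
# references an image without an @sha256 digest. Intended as a CI/PR gate:
# a nonzero return blocks the merge.
check_pinned_images() {
  if grep -RIn 'image:' "$1" | grep -v '@sha256:'; then
    echo "ERROR: floating image tags found in manifests" >&2
    return 1
  fi
  echo "OK: all images pinned by digest"
}
```

Wire `check_pinned_images k8s/prod` into CI and the "one registry overwrite away" failure mode stops reaching production.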

Task 14: Pin an image by digest in Kubernetes (the actual fix)

cr0x@server:~$ kubectl -n prod set image deploy/auth-service auth-service=registry.local/auth@sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e
deployment.apps/auth-service image updated
cr0x@server:~$ kubectl -n prod get deploy auth-service -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
registry.local/auth@sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e

What it means: Your desired state is now immutable.

Decision: Commit this change back to Git if you use GitOps; don’t let kubectl become the only source of truth.

Task 15: Record the digest as a deployment annotation (make audits and rollbacks sane)

cr0x@server:~$ kubectl -n prod annotate deploy/auth-service deployed-image-digest=sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e --overwrite
deployment.apps/auth-service annotated
cr0x@server:~$ kubectl -n prod get deploy/auth-service -o jsonpath='{.metadata.annotations.deployed-image-digest}{"\n"}'
sha256:2f9e1d7c6b5a4f3e2d1c0b9a8f7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b1a0f9e

What it means: You can now answer “what exactly did we deploy?” without spelunking in logs.

Decision: Make the pipeline do this automatically on promotion.

Fast diagnosis playbook

If production is acting weird and you suspect image/version drift, don’t start with Dockerfiles or dependency graphs. Start with identity.

First: confirm whether pods are running the same digest

  • Check pod imageID across replicas.
  • If multiple digests exist, you have drift. Stop and fix the deploy reference first.
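If you dump the pod-to-imageID listing to a file (as in the task section), counting distinct digests is a one-liner. A sketch:

```shell
# count_digests: given a file of "pod<TAB>imageID" lines, print the number of
# distinct image identities. Anything greater than 1 means drift.
count_digests() {
  awk '{print $2}' "$1" | sort -u | wc -l
}
```

One digest across all replicas is the healthy answer; two or more means stop and fix the deploy reference before touching anything else.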

Second: verify what the deployment references (tag or digest)

  • If it’s a tag: assume it can move. Treat as unsafe until proven otherwise.
  • If it’s a digest: drift should not happen unless you’re dealing with multi-arch confusion, multiple containers, or multiple deployments.

Third: compare registry state vs node cache behavior

  • Confirm what digest the tag points to right now using manifest inspection or a pull that prints the digest.
  • Check ImagePullPolicy and node runtime cache.
  • If you’re using a mirror/proxy, confirm it’s not serving stale tag mappings.

Fourth: decide the containment move

  • Best: pin digest and roll out.
  • If you need emergency uniformity: drain nodes and force new pulls after pinning.
  • If it’s multi-arch: confirm which platform digest each node is pulling; pin the correct level (index vs manifest).

Fifth: write the permanent fix

  • Registry tag immutability for release tags.
  • Promotion pipeline that moves environments by digest.
  • Policy checks in CI that block floating tags in prod manifests.

Common mistakes: symptom → root cause → fix

Mistake 1: “Some pods behave differently, but they’re on the same Deployment”

Symptom: Different error messages, different behavior, same YAML.

Root cause: Mutable tag + IfNotPresent + mixed node caches (or mirror staleness).

Fix: Deploy by digest. Then roll out to replace all pods. Enforce tag immutability and stop overwriting release tags.

Mistake 2: “We updated the tag, but nothing changed”

Symptom: CI says it pushed :stable, CD says it deployed, but runtime behavior is unchanged.

Root cause: Runtime didn’t pull because image is already present; controller didn’t see a spec change that triggers rollout.

Fix: Pin digest (forces a spec change). If you must use tags, set a policy that forces new pulls and triggers rollouts—then accept the operational cost.

Mistake 3: “We pinned digests and still see differences between nodes”

Symptom: Same digest in manifest, but behavior differs on arm64 vs amd64 nodes.

Root cause: You pinned an index digest vs a platform manifest digest (or vice versa) and the underlying platform image differs in subtle ways (glibc vs musl, native deps, etc.).

Fix: Decide whether you want a single multi-arch index digest or explicit per-arch pinning. Test both architectures. Don’t assume parity.

Mistake 4: “Security rebuilt base images and now prod is inconsistent”

Symptom: After a rebuild, some pods start failing TLS, DNS, or timezone logic.

Root cause: Rebuilt same tag; base image updates changed cert bundles, libc behavior, or package versions.

Fix: Rebuild under a new tag (or unique build tag), pin digest for deployment, and run a promotion step with tests. Keep “patching base images” but stop overwriting tags used by running clusters.

Mistake 5: “Rollback didn’t roll back”

Symptom: You revert the manifest from :prod to :previous, and you still see new behavior.

Root cause: Both tags point to the same digest (tag moved), or the “previous” tag was overwritten during rebuilds.

Fix: Roll back by digest. Maintain an immutable record of last-known-good digests per environment.

Mistake 6: “We set ImagePullPolicy=Always, so we’re safe”

Symptom: Frequent deploy slowness, rate limiting from registries, and occasional failures during registry hiccups.

Root cause: You used pulling as a substitute for immutable identity. Now correctness depends on registry availability.

Fix: Use digests. Keep Always for development workflows where tag tracking is intentional, and ensure registry/mirror resiliency if you depend on it.

Checklists / step-by-step plan

Step-by-step: implement the Image Tag Rule in a real organization

  1. Define the policy: “Prod and staging manifests must reference images by digest.” Write it down. Make it reviewable.
  2. Decide your naming scheme: Keep tags for humans (commit SHA, build number, PR number). Digests for machines.
  3. Make promotion explicit: A promotion moves a digest from one environment to the next. Not “rebuild and redeploy.”
  4. Enforce tag immutability where possible: At least for release tags (e.g., v1.2.3). If your registry can reject overwrites, turn it on.
  5. Add a CI check: Block prod manifests containing image: repo:tag without @sha256.
  6. Record deployed digests: Add annotations or release metadata in Git so incident response doesn’t depend on tribal memory.
  7. Decide multi-arch strategy: Are you pinning index digests or per-arch digests? Document it and test it.
  8. Train the on-call muscle: Teach “check imageID first” as a default move for weirdness.
  9. Keep a last-known-good digest list: One per service per environment. This makes rollback a single-line change.
  10. Audit periodically: Search manifests for floating tags; treat it like expired TLS certs—boring, predictable work.
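Step 9's last-known-good list doesn't need tooling to start; an append-only file and two helpers will do. A minimal sketch (the file name is illustrative; in practice this record belongs in Git alongside the manifests):

```shell
# A minimal last-known-good digest ledger: one line per promotion,
# "service environment digest", newest last.
record_digest() {
  echo "$1 $2 $3" >> digests.log
}

# last_good: print the most recently recorded digest for a service/environment.
last_good() {
  awk -v s="$1" -v e="$2" '$1 == s && $2 == e { d = $3 } END { if (d) print d }' digests.log
}
```

Rollback then really is a single-line change: `last_good auth prod` tells you exactly which digest to put back.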

Operational checklist: before you declare an incident “fixed”

  • All replicas run the same digest (or the expected per-arch digest set).
  • The desired state references that digest, not just a tag.
  • Registry tag for the human-friendly label points to the same digest you deployed (optional, but sanity-preserving).
  • Rollout completed and old ReplicaSets are scaled to zero (unless you intentionally keep them).
  • Rollback plan is “change digest back,” not “hope the tag still means what it meant yesterday.”

Release checklist: how to avoid creating drift in the first place

  • Every build creates a unique tag (commit SHA, build ID) and pushes it once.
  • Promotion selects a digest from that build output.
  • No rebuilds happen during deploy; deploy consumes artifacts.
  • Base image updates trigger new tags/digests, never overwriting release tags used in running clusters.
  • Incident response includes capturing the digest from running pods into the ticket.

FAQ

1) Why is deploying by :latest so bad?

Because it’s not a version. It’s a moving pointer. You can’t reliably answer “what is running?” and you can’t reliably roll back.

2) Isn’t a semantic version tag like 1.2.3 safe?

Only if your registry and process make it immutable. Otherwise it’s just a tag with better manners. Pinning by digest is the technical guarantee.

3) Doesn’t pinning digests make workflows harder for humans?

A bit, until you separate “human labels” from “machine identity.” Keep tags for navigation and dashboards, but deploy the digest. Your future self will thank you quietly.

4) What about Kubernetes ImagePullPolicy—should I set it to Always?

Not as your main safety mechanism. Always increases pull traffic and couples correctness to registry availability. Use digests to get immutability, then choose pull policy for performance and freshness needs.

5) If I pin by digest, can I stop worrying about reproducible builds?

No. Digest pinning ensures you deploy the same artifact repeatedly. Reproducible builds ensure the artifact corresponds to the source and inputs you intended. They solve different failure modes.

6) How do I roll back cleanly?

Keep a last-known-good digest per service. Roll back by changing the deployment image to that digest. Avoid rollbacks that reference tags like :previous unless you enforce immutability.

7) Why do two nodes pull different images for the same tag?

Because caches are local and pull policy varies. Also, multi-arch tags can resolve differently depending on node architecture. Tags are not stable identities; they’re lookup keys.

8) What’s the difference between pinning an index digest and a platform digest?

An index digest identifies a manifest list (multi-arch). A platform digest identifies the specific image manifest for amd64, arm64, etc. Pin the level that matches your fleet strategy.

9) We use a registry mirror. Does that change the advice?

It strengthens it. Mirrors make pull performance better but can add staleness. Digests make mirrors safe because cached content is keyed by immutable identifiers.

10) Is this only a Kubernetes problem?

No. Any system that deploys containers can suffer from tag drift: Docker Compose, Nomad, ECS, plain Docker on VMs. The underlying issue is the same: mutable references.

Conclusion: next steps you can actually do this week

If you do nothing else, do this: pick one production service and change its deployment from repo:tag to repo@sha256:digest. Record the digest in the deploy metadata. Validate that every replica runs the same digest. You just eliminated a whole category of ambiguity.

Then make it systemic:

  • Add a CI gate that rejects floating tags in production manifests.
  • Stop overwriting release tags. If your registry supports immutability, enable it for release namespaces.
  • Separate build and deploy: build produces digests; deploy selects digests. Promotion is a pointer change in Git, not a rebuild.
  • Teach on-call to check imageID first. It’s the fastest truth you have.
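Promotion as a pointer change can be as small as swapping the digest in a pinned manifest and committing the result. A sketch, assuming GNU sed and a manifest that already pins by @sha256:

```shell
# promote: a deploy as a pointer change, not a rebuild. Replaces the existing
# 64-hex digest in a manifest with the newly approved one; commit the diff.
promote() {
  manifest="$1"
  new_digest="$2"  # bare 64-hex digest, without the "sha256:" prefix
  sed -i "s|@sha256:[0-9a-f]\{64\}|@sha256:${new_digest}|" "$manifest"
}
```

The Git diff of that one line is your audit trail: what changed, when, and to which exact bytes.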

“It works on my machine” doesn’t mean your teammates are careless. It means your system allowed ambiguity. Remove the ambiguity. Make the bytes boring.
