Docker Multi-Host Without Kubernetes: Real Options and Hard Limits

You have more than one server. You have containers. You don’t want Kubernetes—maybe because your team is small,
your workloads are boring (good), or you already have enough moving parts that wake you at 03:00.
But you still want “multi-host”: scheduling, service discovery, failover, and updates that don’t involve
SSH roulette.

The catch: “Docker across hosts” is not one feature. It’s a pile of decisions about networking, identity,
storage, and failure semantics. You can absolutely do it without Kubernetes. You just need to accept
the limits upfront—especially around stateful storage and networking—and pick an approach that matches your
operational maturity.

What “multi-host Docker” really means

When people say “multi-host Docker,” they usually mean one or more of these capabilities:

  • Scheduling: pick a host for a container and re-place it after failure.
  • Service discovery: find “the thing” by name, not by IP.
  • Networking: containers talk across hosts with stable names and reasonable policies.
  • Updates: roll forward, roll back, don’t melt production.
  • State handling: persistent data that doesn’t get “recreated” into oblivion.
  • Health model: what “healthy” means and who decides to restart what.

Kubernetes bundles those choices into a cohesive (if complex) system. Without Kubernetes you’re assembling your own.
That can be an advantage: fewer abstractions, fewer “magic” controllers, easier mental model. Or it can be a trap:
you reinvent the hardest 20% (state, identity, networking) while feeling productive in the easiest 80%.

The useful question isn’t “How do I do multi-host without Kubernetes?” It’s: Which subset do I actually need?
If you’re running stateless APIs behind a load balancer, you can be happy with minimal orchestration.
If you’re running databases across hosts with “containers everywhere,” you’re either doing serious storage engineering,
or you’re cosplaying reliability.

Facts and history that still matter in 2026

  • Docker’s original multi-host story wasn’t Swarm: early “Docker clustering” leaned on external systems (Mesos, etcd-based experiments) before Swarm matured.
  • Swarm mode (2016) shipped with integrated Raft: the manager quorum is a real distributed system; treat it like one.
  • “Docker Machine” had a moment: it automated node provisioning in the pre-IaC era; most people replaced it with Terraform/Ansible and never looked back.
  • Kubernetes won partly because it standardized expectations: service discovery, rolling deploys, and declarative desired state became table stakes.
  • Overlay networks existed long before containers: VXLAN is older than most container platforms; it’s still the workhorse for multi-host L2-ish networking.
  • Container runtimes split out of Docker: containerd and runc became separate components; “Docker” is often the UX layer on top.
  • Stateful containers were always contentious: the “pets vs cattle” argument never went away; it just moved into PVs, CSI drivers, and storage classes.
  • Service discovery went through phases: from static configs, to ZooKeeper/etcd/Consul, to “the orchestrator is the source of truth.” Without K8s, you pick a phase.
  • iptables vs nftables is still a thing: container networking still trips over host firewall semantics and kernel versions.

One paraphrased idea worth keeping taped to your monitor, attributed to Werner Vogels: Everything fails eventually; design and operate assuming failure is normal.
If your multi-host plan requires perfect networks and immortal disks, it isn’t a plan. It’s a hope.

Realistic options (and who they’re for)

1) Docker Swarm mode

Swarm is the most direct answer if you want “Docker, but across multiple hosts” with minimal extra machinery.
It gives you scheduling, service discovery, rolling updates, secrets, and an overlay network. The integration is tight.
The ecosystem is quieter than Kubernetes, but quiet isn’t the same as dead. In a lot of companies, quiet is a feature.

Pick Swarm if: you want a cohesive platform, your services are mostly stateless, and you can live with fewer ecosystem integrations.
Avoid Swarm if: you need sophisticated policy controls, multi-tenant isolation, or you expect to hire people who only know Kubernetes.

2) HashiCorp Nomad (with Docker)

Nomad is a scheduler that’s easier to reason about than Kubernetes, while still being real orchestration.
It plays well with Consul and Vault, and it’s happy running Docker containers. It can also schedule non-container workloads,
which sometimes matters in brownfield environments.

Pick Nomad if: you want scheduling and health checks without Kubernetes’ breadth, and you’re comfortable with HashiCorp’s ecosystem.
Avoid Nomad if: you need the “Kubernetes marketplace” of controllers and operators for every niche system.

3) systemd + Docker Compose + a load balancer

This is the “grown-up bash script” approach. You run Compose per host, manage deployment with CI pushing artifacts, and front it with a proper load balancer.
It’s not glamorous. It works. It also puts the burden on you to solve discovery, rollout, and failure response.

Pick it if: you have a handful of nodes, stable workloads, and you value transparency over features.
Avoid it if: you want automated rescheduling after host loss, or you have frequent deploys and scaling events.

4) DIY scheduling (please don’t) + service discovery

People still try: a homegrown “scheduler” that picks a host based on CPU, then runs docker run over SSH, then registers in Consul.
This can work… until the second failure mode shows up. The third failure mode is where it becomes a career-limiting move.

Joke #1: Homegrown orchestration is like writing your own database: it’s educational, expensive, and you’ll do it twice.

Docker Swarm: the “enough orchestration” option

Swarm mode is built into the Docker Engine. You initialize a manager, join workers, and define services.
You get a control plane with Raft consensus. That means: managers keep state, and you need quorum.
Lose quorum and your cluster becomes a frozen museum exhibit.
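
If you’ve never stood one up, the whole bootstrap is shorter than most runbooks. A minimal sketch, assuming placeholder addresses (10.20.0.11 stands in for your first manager’s IP) and reusing the example image from the tasks below:

# On the first manager
docker swarm init --advertise-addr 10.20.0.11

# Print the join command (with token) for workers; run its output on each worker
docker swarm join-token worker

# Back on a manager: define a replicated service with a published port
docker service create --name api --replicas 3 \
  --publish published=80,target=8080 \
  registry/app:1.9.2

From there, docker service ls and docker service ps api show desired versus actual state, which is the mental model for everything that follows.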

Swarm’s strengths

  • Operational simplicity: one binary, one mental model, minimal dependencies.
  • Service model: replicas, rolling updates, health checks, and routing mesh.
  • Secrets: built-in distribution to tasks, encrypted in transit and at rest (in the Raft log).
  • Decent defaults: a lot of teams succeed because Swarm has fewer knobs to mis-set.

Swarm’s limits (the ones that bite)

  • Stateful services are your problem: the scheduler can move a task; your data can’t teleport.
  • Networking is “good enough” until it isn’t: overlay issues become multi-layer hunts across kernel, MTU, conntrack, iptables, and VXLAN.
  • Ecosystem: fewer third-party integrations, fewer “operators,” fewer prebuilt patterns.
  • Day-2 scaling: you will eventually want policy controls and RBAC that Swarm doesn’t emphasize.

Practical tasks (with commands, output, and decisions)

Task 1: Validate swarm status and node roles

cr0x@server:~$ docker info | sed -n '/Swarm:/,/Runtimes:/p'
Swarm: active
 NodeID: 8q2m0h9v6q6f0m8c7xqkqz0vv
 Is Manager: true
 ClusterID: k1l2m3n4o5p6q7r8s9t0u1v2w
 Managers: 3
 Nodes: 9
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Data Path Port: 4789
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Autolock Managers: false

What it means: Swarm is active, this node is a manager, and you have 3 managers (quorum requires 2).
Decision: If you have 1 manager, fix that before you trust Swarm with anything beyond a lab. For 3 managers, ensure they’re failure-domain separated.

Task 2: Check node health and availability

cr0x@server:~$ docker node ls
ID                            HOSTNAME   STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
8q2m0h9v6q6f*                 mgr-a      Ready   Active        Leader          25.0.3
3k4l5m6n7o8p9q0r1s2t3u4v5w6x   mgr-b      Ready   Active        Reachable      25.0.3
7y8z9a0b1c2d3e4f5g6h7i8j9k0l   mgr-c      Ready   Active        Reachable      25.0.3
1a2b3c4d5e6f7g8h9i0j1k2l3m4n   wrk-a      Ready   Active                        25.0.3

What it means: Managers are reachable; workers are ready.
Decision: If any manager is “Unreachable,” treat it as a quorum risk and investigate before doing deploys.

Task 3: Inspect service rollout health

cr0x@server:~$ docker service ps api --no-trunc
ID                          NAME         IMAGE                   NODE   DESIRED STATE  CURRENT STATE           ERROR  PORTS
u1v2w3x4y5z6                api.1        registry/app:1.9.2      wrk-a  Running        Running 2 hours ago
a7b8c9d0e1f2                api.2        registry/app:1.9.2      wrk-b  Running        Running 2 hours ago
g3h4i5j6k7l8                api.3        registry/app:1.9.2      wrk-c  Running        Running 2 hours ago

What it means: All replicas are running; no restart loop.
Decision: If you see “Rejected” or “Failed,” stop the rollout and inspect logs/events before you chase ghosts in the load balancer.

Task 4: Confirm published ports and routing mesh behavior

cr0x@server:~$ docker service inspect api --format '{{json .Endpoint.Ports}}'
[{"Protocol":"tcp","TargetPort":8080,"PublishedPort":80,"PublishMode":"ingress"}]

What it means: Port 80 is published via ingress (routing mesh). Any node can accept connections and forward to tasks.
Decision: If you’re debugging intermittent timeouts, consider switching to host publish mode and using an external load balancer for clearer traffic paths.
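
A minimal sketch of that switch for the api service above (--publish-rm references the old mapping by its target port; with mode=host each task binds the node’s port directly, so run at most one replica per node on that port or make the service global):

docker service update \
  --publish-rm 8080 \
  --publish-add mode=host,published=80,target=8080,protocol=tcp \
  api

Your external load balancer then health-checks node IPs on port 80 instead of relying on the routing mesh.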

Task 5: Check overlay networks and peers

cr0x@server:~$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
1c2d3e4f5g6h   ingress           overlay   swarm
7h8i9j0k1l2m   backend           overlay   swarm
a1b2c3d4e5f6   bridge            bridge    local
f6e5d4c3b2a1   host              host      local

What it means: You have the default ingress network and a custom overlay.
Decision: If “ingress” is missing or corrupted, service networking will behave like a haunted house. Fix the cluster network before blaming the app.
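
For reference, a custom overlay like “backend” is created once from a manager; a minimal sketch (the subnet is an example, and --opt encrypted enables IPsec on the data plane, which costs a little MTU headroom):

docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.10.0.0/24 \
  backend

The --attachable flag lets standalone containers (not just services) join the network, which is handy for debugging from a throwaway container.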

Task 6: Verify gossip/control-plane ports are reachable

cr0x@server:~$ ss -lntup | egrep ':(2377|7946|4789)\b'
tcp   LISTEN 0      4096          0.0.0.0:2377      0.0.0.0:*    users:(("dockerd",pid=1123,fd=41))
tcp   LISTEN 0      4096          0.0.0.0:7946      0.0.0.0:*    users:(("dockerd",pid=1123,fd=54))
udp   UNCONN 0      0             0.0.0.0:7946      0.0.0.0:*    users:(("dockerd",pid=1123,fd=55))
udp   UNCONN 0      0             0.0.0.0:4789      0.0.0.0:*    users:(("dockerd",pid=1123,fd=56))

What it means: Swarm manager port (2377), gossip (7946 tcp/udp), and VXLAN (4789/udp) are listening.
Decision: If these aren’t present or are blocked by host firewalls, overlay networking and node membership will fail in non-obvious ways.
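
A minimal firewall sketch for those ports, assuming ufw and a node subnet of 10.20.0.0/24 (both are placeholders; translate to iptables/nftables or security groups as needed):

# Manager API (Raft)
sudo ufw allow proto tcp from 10.20.0.0/24 to any port 2377
# Gossip (node membership)
sudo ufw allow proto tcp from 10.20.0.0/24 to any port 7946
sudo ufw allow proto udp from 10.20.0.0/24 to any port 7946
# VXLAN data path
sudo ufw allow proto udp from 10.20.0.0/24 to any port 4789

Scope the rules to the node subnet; none of these ports belong on the open internet.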

Nomad: sane scheduling without the full Kubernetes tax

Nomad’s pitch is straightforward: a single scheduler, easy clustering, clear jobspecs, and fewer moving parts.
In practice, the “fewer moving parts” claim holds until you add Consul for service discovery and Vault for secrets,
at which point you’re still simpler than Kubernetes but not exactly camping with a flint knife.
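
The day-to-day workflow is refreshingly small. A minimal sketch, assuming a job named api (the filename is an example; nomad job init generates a starter jobspec using the docker driver, which you then edit):

nomad job init -short api.nomad   # starter jobspec using the docker driver
nomad job plan api.nomad          # dry run: what would the scheduler change?
nomad job run api.nomad           # submit; Nomad places allocations on clients
nomad job status api              # watch allocations converge

The plan step is the underrated part: it shows placement and version changes before anything touches production.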

Where Nomad fits best

  • Mixed workloads (VMs, raw exec, Docker containers) in one scheduler.
  • Teams that want explicit jobs and allocations rather than a universe of controllers.
  • Environments already using Consul/Vault.

Nomad’s limits (practical, not ideological)

  • Storage integrations: you still need a volume story that survives node failure.
  • Networking: you can do it cleanly, but you must choose among host networking, bridge, CNI, or a service mesh.
  • App ecosystem expectations: many vendors assume Kubernetes objects, not Nomad jobs.

Practical tasks

Task 7: Check Nomad cluster health quickly

cr0x@server:~$ nomad server members
Name     Address          Port  Status  Leader  Raft Version  Build  Datacenter  Region
nomad-1  10.20.0.11       4648  alive   true    3             1.7.5  dc1         global
nomad-2  10.20.0.12       4648  alive   false   3             1.7.5  dc1         global
nomad-3  10.20.0.13       4648  alive   false   3             1.7.5  dc1         global

What it means: 3 servers, one leader, raft is healthy.
Decision: If there’s no leader, stop deploying; fix quorum/network first.

Task 8: Inspect allocations for a job and read failure clues

cr0x@server:~$ nomad job status api
ID            = api
Name          = api
Type          = service
Priority      = 50
Status        = running
Datacenters   = dc1
Task Groups   = web (3 running)

Latest Deployment
ID          = 3c1b2a9d
Status      = successful
Description = Deployment completed successfully

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
a1b2c3d4  n-11      web         17       run      running  2h ago    2h ago
e5f6g7h8  n-12      web         17       run      running  2h ago    2h ago
i9j0k1l2  n-13      web         17       run      running  2h ago    2h ago

What it means: Deployment succeeded; allocations are stable.
Decision: If allocations are “failed,” run nomad alloc status and check driver errors (image pulls, port conflicts, CPU/mem constraints).

Compose + systemd + SSH: sharp tools, sharp edges

This approach is common in small-to-medium shops and in regulated environments where “new platform” is a dirty phrase.
It’s also common in enterprises where platform teams exist but you don’t get to use them.

The model: each host runs a known set of containers. You deploy via CI (rsync, artifacts, images),
restart via systemd, and use a load balancer or DNS to spread traffic. Failover is either handled by the LB (stateless)
or by humans (stateful). Yes, it’s less “cloud-native.” It’s also understandable at 02:00.

Practical tasks

Task 9: Confirm the real state of containers on a host

cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
NAMES        IMAGE                    STATUS                  PORTS
api          registry/app:1.9.2       Up 2 hours              0.0.0.0:8080->8080/tcp
worker       registry/worker:1.9.2    Up 2 hours
postgres     postgres:16              Up 14 days              0.0.0.0:5432->5432/tcp

What it means: This host is carrying a database and app containers; the DB has long uptime.
Decision: If “postgres” lives on every host “for convenience,” stop and design proper state placement. Convenience is how outages are born.

Task 10: Use systemd to make Compose behave like a service

cr0x@server:~$ systemctl status app-stack
● app-stack.service - Docker Compose App Stack
     Loaded: loaded (/etc/systemd/system/app-stack.service; enabled; preset: enabled)
     Active: active (running) since Sat 2026-01-03 09:12:44 UTC; 2h 3min ago
   Main PID: 1459 (docker)
      Tasks: 18 (limit: 18952)
     Memory: 612.4M
        CPU: 9min 12.180s
     CGroup: /system.slice/app-stack.service
             ├─1467 /usr/bin/docker compose up
             └─... containers ...

What it means: Your stack is anchored to init; reboots won’t silently “forget” to start it.
Decision: If you don’t have this kind of boring scaffolding, you will eventually debug a “random outage” that is actually “host rebooted.”
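
A minimal sketch of the unit behind it, assuming the Compose project lives in /opt/app-stack (path and names are examples):

sudo tee /etc/systemd/system/app-stack.service <<'EOF'
[Unit]
Description=Docker Compose App Stack
Requires=docker.service
After=docker.service network-online.target
Wants=network-online.target

[Service]
WorkingDirectory=/opt/app-stack
ExecStart=/usr/bin/docker compose up
ExecStop=/usr/bin/docker compose down
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now app-stack

Running docker compose up in the foreground (no -d) lets systemd own the process tree, so restarts and reboots behave predictably.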

Multi-host networking: where confidence goes to die

Multi-host networking is never just “open some ports.” It’s MTU, encapsulation, conntrack tables, asymmetric routing,
firewall rules you forgot existed, and that one kernel upgrade that changed nftables behavior.

If you use Swarm overlays (VXLAN), you’re creating an encapsulated network on top of your network.
That can work beautifully—until your underlay has a smaller MTU than you assumed, or your security team
blocks UDP 4789 because “we don’t use that.” Spoiler: you do now.

Practical tasks

Task 11: Detect MTU mismatch symptoms quickly

cr0x@server:~$ ip link show dev eth0 | sed -n '1,2p'
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff

What it means: MTU is 1450 (common on cloud networks with VXLAN/GRE already in play).
Decision: If your overlay assumes 1500 and your underlay is 1450, you’ll see weird timeouts and partial responses. Align MTUs or configure overlay MTU appropriately.
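
A quick way to measure the real path MTU between nodes, assuming 10.20.0.12 is a peer (1422 bytes of ICMP payload plus 28 bytes of headers equals 1450 on the wire):

# -M do forbids fragmentation; shrink -s until replies come back
ping -M do -s 1422 -c 3 10.20.0.12

If the probe that matches your overlay’s assumed MTU fails, lower the overlay MTU below the underlay’s (for example via the com.docker.network.driver.mtu option when creating the network) rather than hoping fragmentation sorts itself out.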

Task 12: Check conntrack exhaustion (classic “it works until it doesn’t”)

cr0x@server:~$ sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 248901
net.netfilter.nf_conntrack_max = 262144

What it means: You’re close to conntrack max.
Decision: If count approaches max during traffic spikes, you’ll get dropped connections and “random” failures. Increase the max (with memory awareness) and/or fix traffic patterns.
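
A minimal sketch for raising the ceiling persistently (the value is an example; each conntrack entry costs a few hundred bytes of kernel memory, so size it against RAM):

echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system

Raising the limit buys headroom; it doesn’t fix connection churn. Keepalives and connection pooling shrink the table for free.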

Task 13: Confirm overlay VXLAN traffic is flowing

cr0x@server:~$ sudo tcpdump -ni eth0 udp port 4789 -c 5
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:12:01.123456 IP 10.20.0.11.53422 > 10.20.0.12.4789: UDP, length 98
10:12:01.223455 IP 10.20.0.12.4789 > 10.20.0.11.53422: UDP, length 98
10:12:01.323441 IP 10.20.0.13.48722 > 10.20.0.11.4789: UDP, length 98
5 packets captured

What it means: VXLAN packets are present on the wire.
Decision: If you see nothing while services are trying to communicate, you likely have firewall/security group blocks or wrong routing between nodes.

Joke #2: Overlay networking is a great way to learn packet analysis—mostly because you won’t have a choice.

Storage for multi-host containers: reality, not vibes

Stateless workloads are easy to spread across hosts. Stateful workloads are where platforms earn their keep.
Without Kubernetes, you don’t get CSI abstractions or a standard PV lifecycle. You can still do stateful systems well.
But you must choose a storage model deliberately.

Three sane storage patterns

Pattern A: Local storage, pinned placement (simple, honest)

You run stateful services on specific hosts with local disks (LVM, ZFS, ext4). You pin placement (Swarm constraints, Nomad constraints, or “this host runs the DB”).
Failover is a procedure, not an illusion.

Pros: Fast, simple, fewer dependencies. Cons: Host failure means manual recovery unless you add replication at the application layer.
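
In Swarm terms, Pattern A is node labels plus constraints; a minimal sketch (the label name is an example, node and volume names reuse earlier tasks):

# Label the host that owns the disks
docker node update --label-add storage=nvme-a wrk-a

# Pin the stateful service to that label
docker service create --name postgres --replicas 1 \
  --constraint 'node.labels.storage == nvme-a' \
  --mount type=volume,source=pgdata,target=/var/lib/postgresql/data \
  postgres:16

Nomad expresses the same idea with a constraint block on node metadata. Either way, the point is identical: tell the scheduler the truth about where the data lives.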

Pattern B: Network file storage (NFS)

NFS is the cockroach of infrastructure: it survives everything, including strong opinions.
For many workloads—shared uploads, artifacts, read-heavy content—it’s fine. For write-heavy databases, it’s often misery.

Pros: Straightforward, widely supported. Cons: Latency, locking semantics, and noisy-neighbor problems; “it’s slow” becomes a lifestyle.
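
If you go this route, Docker’s built-in local driver can mount the export as a named volume; a minimal sketch reusing the server and export from Task 17:

docker volume create \
  --driver local \
  --opt type=nfs \
  --opt o=addr=10.30.0.50,rw,nfsvers=4.2 \
  --opt device=:/exports/uploads \
  uploads

Local volumes are per host, so this definition has to exist on every node that might run the container (or be declared with driver_opts in your stack/Compose file).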

Pattern C: Distributed block (Ceph RBD) or clustered filesystem

Ceph can give you replicated block devices across hosts. It’s powerful, and it comes with operational weight.
If you don’t have the people to run it, you’ll learn storage the hard way. If you do, it can be excellent.

Pros: Real failover possibilities, consistent block semantics. Cons: Operational complexity, performance tuning, and a long list of “depends.”

Practical tasks

Task 14: Inspect Docker volume usage and driver choice

cr0x@server:~$ docker volume ls
DRIVER    VOLUME NAME
local     pgdata
local     uploads

What it means: These are local volumes. They live on the host.
Decision: If you expect a container to reschedule to another host and keep its data, local volumes won’t do that. Either pin placement or use shared/distributed storage.

Task 15: Find where local Docker volumes actually reside

cr0x@server:~$ docker volume inspect pgdata --format '{{.Mountpoint}}'
/var/lib/docker/volumes/pgdata/_data

What it means: Data is under Docker’s data directory on this host.
Decision: Back it up as host data, not “container data.” If you reinstall the host or move Docker’s root dir, plan migrations.
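
A minimal backup sketch using a throwaway container (the backup path is an example; for databases, prefer the database’s own tooling, since a file-level copy of a running Postgres isn’t crash-consistent):

docker run --rm \
  -v pgdata:/data:ro \
  -v /srv/backups:/backup \
  alpine tar czf /backup/pgdata-$(date +%F).tar.gz -C /data .

The same pattern in reverse (tar xzf into a fresh volume) is your restore path; test it before you need it.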

Task 16: Validate disk health and saturation (host-level truth)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server)  01/03/2026  _x86_64_  (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.31    0.00    4.22    8.77    0.00   74.70

Device            r/s     w/s   rkB/s   wkB/s  await  aqu-sz  %util
nvme0n1         45.2   210.1  1824.3  9312.7  18.40    2.31  96.8

What it means: The disk is near saturation (%util ~97%) with meaningful await times.
Decision: If your “cluster issue” is actually a single hot disk, fix storage first: faster media, better caching, less write amplification, or move noisy workloads off the node.

Task 17: If using NFS, confirm mount options and latency risk

cr0x@server:~$ mount | grep nfs
10.30.0.50:/exports/uploads on /mnt/uploads type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,sec=sys,clientaddr=10.20.0.21)

What it means: NFSv4.2 with large rsize/wsize and hard mounts (good for correctness, not always for tail latency).
Decision: If apps hang during NFS server hiccups, that’s expected with hard. Decide if you prefer correctness (usually) or responsiveness (rarely).

Task 18: Check container-level filesystem pressure

cr0x@server:~$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          42        18        27.4GB    10.2GB (37%)
Containers      51        9         1.8GB     1.1GB (61%)
Local Volumes   12        7         814.3GB   120.5GB (14%)
Build Cache     0         0         0B        0B

What it means: Volumes dominate disk use; pruning images won’t save you.
Decision: If disks are filling, you need volume lifecycle management and backups, not a docker image prune ritual.

Security, secrets, and identity across hosts

Multi-host means more trust boundaries. Single-host Docker can get away with “eh, it’s on the same box.”
Once you have multiple nodes, you’re shipping secrets, opening ports, and letting schedulers act on your behalf.

What to do (and what to avoid)

  • Do: use Swarm secrets or an external secret manager (a minimal sketch follows this list). Don’t bake secrets into images.
  • Do: lock down the Docker API socket; it’s root with better marketing.
  • Do: rotate cluster certificates (Swarm does this; still monitor expiry).
  • Avoid: exposing Docker Engine TCP without mTLS and strict network policies.
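
A minimal sketch of the Swarm-native flow, assuming the db_password secret and api service from other tasks ($DB_PASSWORD is a placeholder for wherever the value actually comes from):

# Create the secret from stdin, then attach it to the service
printf '%s' "$DB_PASSWORD" | docker secret create db_password -
docker service update --secret-add db_password api

# Optional: require an unlock key after manager restarts, protecting Raft-stored secrets at rest
docker swarm update --autolock=true

The task sees the secret as a file under /run/secrets/db_password, which lines up with what Task 20 below shows in the service spec.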

Practical tasks

Task 19: Check Docker daemon exposure risk

cr0x@server:~$ ss -lntp | grep dockerd
LISTEN 0      4096       127.0.0.1:2375      0.0.0.0:*    users:(("dockerd",pid=1123,fd=7))

What it means: Docker API is on localhost:2375 (unencrypted). That’s still risky but not instantly catastrophic.
Decision: If this is bound to 0.0.0.0, fix it immediately. If you need remote control, use SSH tunneling or mTLS on 2376 with firewall allowlists.

Task 20: Validate Swarm secret usage in a service

cr0x@server:~$ docker service inspect api --format '{{json .Spec.TaskTemplate.ContainerSpec.Secrets}}'
[{"File":{"Name":"db_password","UID":"0","GID":"0","Mode":292},"SecretID":"p4s5w0rds3cr3t","SecretName":"db_password"}]

What it means: The service consumes a Swarm secret as a file with mode 0444 (292).
Decision: If secrets are passed as environment variables, assume they’ll leak into logs and crash dumps eventually. Prefer file-based secrets where possible.

Observability and operations: logging, metrics, traces

Multi-host without Kubernetes doesn’t mean “no observability.” It means you can’t rely on Kubernetes-native tooling to paper over gaps.
You need to standardize logs, metrics, and basic host telemetry—because you will debug cross-host issues, and you’ll want facts.

Practical tasks

Task 21: Check container restart storms and correlate with host pressure

cr0x@server:~$ docker inspect api --format 'RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}'
RestartCount=7 OOMKilled=true ExitCode=137

What it means: The container was OOM-killed and restarted; exit code 137 confirms it.
Decision: Increase memory limit, fix memory leak, or reduce concurrency. Don’t “solve” it by adding more replicas if every replica OOMs under load.
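
A minimal sketch for putting an explicit memory budget on the service (sizes are examples; base them on measured usage, not hope):

docker service update --limit-memory 512M --reserve-memory 256M api

Reservations help the scheduler avoid overpacking nodes; limits cap each task so one leaky replica gets killed instead of dragging the whole host down.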

Task 22: Read node-level memory pressure (Linux doesn’t lie)

cr0x@server:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        27Gi       1.2Gi       512Mi       3.0Gi       1.8Gi
Swap:          4.0Gi       3.8Gi       256Mi

What it means: Low available memory and swap nearly full: you’re in the danger zone.
Decision: If you’re swapping heavily, latency will spike. Either add RAM, tighten container limits, or move workloads. “It’s fine” is not a memory management strategy.

Task 23: Spot Docker daemon errors around networking and iptables

cr0x@server:~$ journalctl -u docker --since "1 hour ago" | tail -n 10
Jan 03 10:44:11 server dockerd[1123]: time="2026-01-03T10:44:11.112233Z" level=warning msg="could not delete iptables rule" error="iptables: No chain/target/match by that name."
Jan 03 10:44:15 server dockerd[1123]: time="2026-01-03T10:44:15.334455Z" level=error msg="failed to allocate gateway (10.0.2.1): Address already in use"
Jan 03 10:44:15 server dockerd[1123]: time="2026-01-03T10:44:15.334499Z" level=error msg="Error initializing network controller"

What it means: Docker’s network controller is unhappy; iptables chain mismatches and gateway conflicts.
Decision: Stop flailing at containers. Fix host networking state (possibly stale rules, conflicting bridges, or a mismatched firewall backend), then restart Docker cleanly.

Fast diagnosis playbook

Multi-host failures are usually one of four things: control plane, network underlay/overlay, storage latency, or resource pressure.
The trick is to find which one in minutes, not hours; a consolidated triage sketch follows the four passes below.

First: confirm the control plane is sane

  • Swarm: docker node ls shows managers reachable and workers ready.
  • Nomad: nomad server members shows a leader and alive peers.
  • If the control plane is degraded, stop deployments and stop “rolling restarts.” You’ll just churn state.

Second: check host resource pressure on the affected nodes

  • Memory: OOM kills, swap usage, free -h, container restart counts.
  • CPU: run queue and saturation (use top or pidstat if available).
  • Disk: iostat -xz and filesystem fullness.
  • If one node is hot, drain it (Swarm) or stop scheduling there (Nomad) before you debug application code.

Third: validate network fundamentals

  • Ports listening: 2377/7946/4789 for Swarm overlays.
  • MTU: confirm underlay MTU and overlay expectations.
  • Conntrack: ensure you’re not hitting the ceiling.
  • If overlay traffic is absent (tcpdump shows nothing), you’re looking at firewall, routing, or security group policy.

Fourth: isolate storage as the bottleneck (especially for “random” timeouts)

  • Disk util and await: iostat -xz.
  • NFS behavior: mounts, server responsiveness, and whether apps hang on hard mounts.
  • Volume growth: docker system df and filesystem usage.
  • If storage is slow, everything above it looks broken. Fix the base layer first.
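
Putting the four passes together, a minimal triage sketch to run on an affected node (swap the first command for nomad server members / nomad node status on Nomad):

docker node ls                                     # control plane (run on a manager): managers reachable?
free -h; uptime                                    # memory pressure and load
iostat -xz 1 3                                     # disk saturation and await
df -h /var/lib/docker                              # fullness where it actually hurts
ss -lntup | grep -E ':(2377|7946|4789)\b'          # Swarm ports present?
sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
docker system df                                   # image/volume growth

None of this replaces real monitoring; it’s the five-minute version that tells you which section of this article to open next.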

Common mistakes: symptom → root cause → fix

1) “Containers can’t reach each other across hosts”

Symptom: inter-service calls time out; DNS names resolve but connections hang.

Root cause: VXLAN (UDP 4789) blocked or MTU mismatch causing fragmentation drops.

Fix: allow UDP 4789 end-to-end; align MTU; validate with tcpdump and a small payload test; consider host publish mode + external LB for clarity.

2) “Swarm manager went read-only / stuck”

Symptom: deploys hang, services don’t converge, manager status flaps.

Root cause: lost quorum or unstable manager connectivity; sometimes disk latency on manager Raft logs.

Fix: run 3 or 5 managers; separate failure domains; ensure manager disks are reliable; don’t colocate managers on the noisiest storage nodes.

3) “We scaled replicas, but it got slower”

Symptom: higher latency after adding replicas; more 5xx.

Root cause: a shared bottleneck (database, NFS, conntrack, or LB limits), plus cache stampedes.

Fix: measure the bottleneck layer first; set connection pools; rate-limit; scale the dependency, not just the stateless tier.

4) “Random connection resets during traffic spikes”

Symptom: intermittent failures that disappear when you look.

Root cause: conntrack table full, ephemeral port exhaustion, or host firewall state churn.

Fix: check nf_conntrack usage; increase limits; reduce connection churn with keepalives; tune LB behavior; ensure timeouts are consistent.

5) “Stateful container rescheduled and lost data”

Symptom: service restarts on another host and comes up “fresh.”

Root cause: local volume on the old host; the scheduler did exactly what you asked, not what you meant.

Fix: pin stateful workloads to nodes; use real replication (Postgres streaming replication, etc.); or implement shared/distributed storage deliberately.

6) “Upgrades break networking after reboot”

Symptom: after kernel/firewall updates, Docker networks won’t initialize.

Root cause: iptables vs nftables mismatch, stale rules, changed defaults.

Fix: standardize OS builds; validate firewall backend; test reboot behavior; keep a rollback plan for host networking components.

Checklists / step-by-step plan

Step-by-step: choosing the right non-Kubernetes approach

  1. Inventory workload types: stateless APIs, background jobs, stateful databases, shared file storage.
  2. Decide your failure contract: auto-reschedule vs manual failover for stateful parts.
  3. Pick orchestration level:
    • Need rescheduling and service discovery: Swarm or Nomad.
    • Need simple “run the same thing on these hosts”: Compose + systemd.
  4. Design networking: overlay vs host networking; define ports; confirm MTU and firewall policies.
  5. Design storage: local pinned, NFS, or distributed block; write down RPO/RTO and test restores.
  6. Security model: secrets distribution, TLS, node join controls, least privilege.
  7. Observability baseline: centralized logs, metrics, alerting on node pressure, and a deploy audit trail.
  8. Run failure drills: kill a node, break a link, fill a disk, and confirm behavior matches your expectations.

Operational checklist: before you go multi-host

  • 3 manager nodes (if Swarm), placed across failure domains.
  • Time sync (chrony/ntpd) on all nodes; drift causes weirdness in TLS and consensus systems.
  • Firewall rules explicitly allow required ports between nodes.
  • Consistent OS/kernel versions across nodes (or at least tested combinations).
  • Documented storage topology: where data lives, how it’s backed up, and how it’s restored.
  • Load balancer behavior documented: health checks, timeouts, connection reuse.
  • Runbooks for node drain, rollback, and cluster quorum recovery.

Practical tasks: draining and rollback controls

Task 24: Drain a problematic Swarm node safely

cr0x@server:~$ docker node update --availability drain wrk-b
wrk-b

What it means: Swarm stops tasks on wrk-b and reschedules replicated tasks elsewhere; global service tasks are stopped too, since they only run on Active nodes.
Decision: Use this when a node has disk errors, kernel issues, or resource thrash. Don’t keep it “Active” and hope it behaves.

Task 25: Make a bad Swarm rollout pause itself

cr0x@server:~$ docker service update --update-failure-action pause --update-max-failure-ratio 0.1 api
api

What it means: With failure action “pause” (the default) and a small failure ratio, Swarm halts a rollout on its own when new tasks keep failing.
Decision: Set this before you need it, and don’t reach for --update-parallelism 0 to “pause” an update: 0 means “update all tasks at once,” the opposite of what you want. If a rollout is already going sideways, inspect logs and constraints, then roll back (Task 26).

Task 26: Roll back a Swarm service to the previous spec

cr0x@server:~$ docker service update --rollback api
api

What it means: Swarm reverts the service to its prior configuration/image.
Decision: If your last change correlates with errors, rollback early. Debug later. Pride is not an SLO.

Three mini-stories from corporate life

Mini-story 1: The incident caused by a wrong assumption

A mid-size SaaS company ran Docker Swarm for stateless services and decided to “containerize everything” to standardize deployments.
The database team agreed, with one condition: “We’ll keep the data on a volume.”
They created a local Docker volume and moved on.

A month later, a worker node died hard—power supply, not a graceful shutdown. Swarm rescheduled the database task onto another node.
The container started cleanly. The health check passed. The application started writing to a brand-new empty database.
Within minutes, customers saw missing data. Support escalated. Everyone had the same terrible thought: “Wait, did we just fork reality?”

The wrong assumption was subtle: they thought “volume” meant “portable.” In Docker, a local volume is local.
Swarm did its job and moved the task. Storage did its job and stayed where it was—on the dead server’s disks.

Recovery involved bringing the node back on a bench PSU long enough to extract data, then restoring into the “new” database instance.
They later rebuilt the architecture using application-level replication and pinned stateful tasks to specific nodes with explicit constraints.
The final postmortem sentence was short: “We treated local state like a cluster resource.”

Mini-story 2: The optimization that backfired

An internal platform team wanted faster deployments. Their Swarm cluster pulled images from a registry that sometimes slowed down at peak times.
Someone suggested: “Let’s pre-pull images on every node as a nightly job, so deploys never wait on the network.”
It sounded sensible and felt very DevOps.

The nightly job ran docker pull for a dozen large images on every node. It worked—deploys got quicker.
Then a new symptom arrived: at around 01:00, internal APIs started timing out, and the message queue lagged.
It wasn’t a total outage. It was worse: a slow, flaky system that made engineers argue.

The culprit wasn’t CPU. It was disk and network contention. Pulling layers hammered storage, filled page cache with mostly useless data,
and created bursty egress that collided with backups. Under pressure, the kernel started reclaiming memory aggressively.
Latency spiked. Retries multiplied. The system entered a feedback loop.

The fix was boring: stop pre-pulling on all nodes at once, stagger it, cap bandwidth, and keep “deployment optimization” from competing with production traffic.
They also measured image sizes and reduced them—because the cheapest performance win is not moving bytes you don’t need.

Mini-story 3: The boring but correct practice that saved the day

A financial services team ran Nomad with Docker for internal services. Nothing fancy: three Nomad servers, a pool of clients, Consul for discovery.
The team was unreasonably strict about two things: documented runbooks and routine restore tests for stateful components.
People teased them for it. Quietly, everyone also relied on them.

One Friday, a storage array controller started flapping. Latency didn’t explode immediately; it wobbled.
Applications began timing out “randomly.” The on-call engineer ran the fast diagnosis routine: control plane healthy, CPU fine, memory fine, then iostat.
Disk await was climbing. The logs showed retries across multiple services. It was storage, not code.

They executed the runbook: drain the worst-affected clients, fail over the stateful service to a replica that used different backing storage,
and reduce write amplification by pausing a batch job. Meanwhile, they restored a recent backup into a clean environment to validate integrity.
That last step took effort, but it shut down panic early: they knew they had a safe copy.

When the incident ended, the postmortem was almost disappointingly calm. The biggest “lesson learned” was that the boring practices worked.
Restore drills aren’t glamorous, but they turn disasters into unpleasant afternoons.

FAQ

1) Can I run Docker Compose across multiple servers?

Not as a single coherent “cluster” with scheduling. Compose is per host. You can deploy the same Compose file to multiple hosts,
but you must handle load balancing, discovery, and failover yourself.

2) Is Docker Swarm “dead”?

Swarm isn’t trendy, and that’s different from dead. It’s stable, integrated, and still used. The risk is ecosystem momentum:
fewer third-party patterns, fewer hires with direct experience, and fewer vendor integrations.

3) What’s the biggest limit of multi-host Docker without Kubernetes?

Stateful workload lifecycle and storage portability. Scheduling a process is easy. Scheduling data safely is the hard part.
If you don’t design storage explicitly, you’ll eventually lose data or availability.

4) Should I use overlay networks or host networking?

For simplicity and performance, host networking plus a real load balancer is often easier to operate.
Overlays help with service-to-service connectivity and portability, but they add failure modes (MTU, UDP blocks, conntrack).
Pick based on your team’s debugging appetite.

5) How do I do service discovery without Kubernetes?

Swarm has built-in service DNS on overlay networks. With Nomad, Consul is a common choice.
In the Compose+systemd world, you can use static upstreams in a load balancer, DNS records, or a discovery system like Consul—just keep it consistent.

6) What about secrets management?

Swarm secrets are good for many cases. For larger environments or compliance-heavy needs, use a dedicated secret manager.
Avoid environment-variable secrets when you can; they leak into too many places.

7) Can I run databases in containers across multiple hosts safely?

Yes, but you must pick a replication/failover model that matches the database and your operational maturity.
“Just put it in a container and let the scheduler move it” is how you learn regret.

8) How many manager nodes do I need in Swarm?

Use 3 or 5 managers. One manager is a single point of failure. Two managers can’t tolerate one failure without losing quorum.
Three managers tolerate one failure and still make progress.

9) What’s a good non-Kubernetes baseline for a small team?

For mostly stateless services: Swarm with 3 managers, external load balancer, and a clear storage story for the few stateful components.
For “mostly fixed placement”: Compose + systemd + a load balancer, plus documented runbooks.

Practical next steps

  1. Write down your failure contract: what happens when a node dies, and who/what moves workloads.
  2. Pick one orchestration level and commit for a year: Swarm, Nomad, or Compose+systemd. Mixing them is how complexity sneaks in wearing a fake mustache.
  3. Design storage first for stateful services: pinned local + replication, NFS for the right workloads, or distributed block if you can operate it.
  4. Harden networking: confirm MTU, conntrack capacity, and required ports. Test before production forces you to.
  5. Operationalize: node drain procedures, rollback commands, backup/restore drills, and a fast diagnosis checklist that’s actually used.

Multi-host Docker without Kubernetes is absolutely doable. The winning move is not pretending you’re avoiding complexity.
The winning move is choosing which complexity you’ll pay for—and paying it deliberately, in daylight, with monitoring.
