Everything looks fine. CPU is at 30%. Load average isn’t scary. Yet requests crawl, p99 explodes, and your on-call channel fills with the kind of “is it down?” messages that shorten lifespans.
This is the Docker performance trap: you stare at CPU because it’s measurable and comforting, while the real problem is usually somewhere messier—storage, networking, memory pressure, kernel limits, or a quiet cgroup policy that’s been throttling you like a bureaucrat with a stamp.
If it’s not CPU, what is it?
When a container “lags,” it rarely means your code suddenly became stupid. It means the container’s work is waiting: waiting on disk flushes, waiting on the network stack, waiting on DNS, waiting on page faults, waiting on a lock inside the kernel, waiting on a cgroup controller, waiting on the logging driver to stop blocking writes.
CPU graphs are seductive because they’re clean. They’re also incomplete. A system can be slow while CPU is idle if threads are blocked in uninterruptible sleep (usually I/O), stalled in runqueue contention due to throttling, or starved by memory reclaim. In containers, those problems are easier to create because you’ve introduced:
- Union/overlay filesystems (copy-up, metadata churn).
- Extra network layers (veth pairs, bridges, NAT, conntrack).
- Extra control planes (cgroups, namespaces, limits, quotas).
- “Helpful” defaults (logging driver, DNS config, storage driver).
Most performance incidents I’ve seen were “not CPU.” CPU was merely the last witness who didn’t see anything.
Exactly one quote, because it’s still true: “Hope is not a strategy.” — General Gordon R. Sullivan
Here’s the punchline: if you treat performance like a single knob, you’ll keep shipping latency. If you treat it like an elimination game—measure, isolate, change one thing, measure again—you’ll stop guessing and start fixing.
Interesting facts and context (the “why we’re here” edition)
- Containers didn’t start with Docker. Linux had namespaces and cgroups years earlier; Docker popularized the packaging and workflow, not the kernel primitives.
- Cgroups exist because “nice” wasn’t enough. Traditional Unix process priority didn’t provide reliable multi-tenant isolation for memory and I/O; cgroups were built to enforce it.
- Overlay/union filesystems were designed for convenience. They trade some filesystem simplicity for composable layers. That trade shows up under metadata-heavy workloads.
- Early Docker used AUFS widely. AUFS was fast in some cases and painful in others; the ecosystem gradually moved toward overlay2 as kernel support improved.
- Conntrack is a state table, not magic. NAT and connection tracking need memory and CPU. Under spikes, the table fills and the kernel starts dropping or timing out flows.
- DNS in containers is often different from DNS on the host. Docker’s embedded DNS and resolver behavior can amplify timeouts when upstream DNS is slow.
- “fsync storms” are an old problem with new costumes. Databases and journaling apps that call fsync frequently can collapse I/O when the storage backend is not built for consistent latency.
- Logging has always been a throughput killer. In the syslog days, synchronous logging could stall apps. Container logging drivers can recreate that pain if you’re not careful.
- Linux memory reclaim is a performance event. Even without OOM kills, direct reclaim and swapping can turn “fine” latency into a sawtooth.
Fast diagnosis playbook (first/second/third)
First: prove whether you’re waiting on I/O, memory, or network
- Check blocked tasks and I/O wait: if threads pile up in D-state or iowait rises, your “CPU is low” graph is lying by omission.
- Check memory pressure: if you’re reclaiming, swapping, or hitting memory.high, you can get latency without crashes.
- Check network symptoms: retransmits, conntrack drops, and DNS latency create application stalls that look like “random slowness.”
Second: localize it—host-wide vs one container vs one mount
- Host-wide: device saturation, filesystem journal contention, conntrack table full, node memory pressure.
- One container: log flood, tiny CPU quota causing throttling, too-low ulimit, one noisy neighbor volume.
- One mount/path: overlay2 copy-up path, bind mount to slow network storage, volume options misfit for workload.
Third: remove or bypass layers temporarily
- Run the same workload with tmpfs for scratch data to see if disk is the bottleneck.
- Switch logging to none briefly in a staging reproduction to see if logging is blocking.
- Run with host networking for a test (where safe) to isolate bridge/NAT/conntrack.
- Test the workload on a single writable layer (volume) instead of overlay2 for the hot path.
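The bypass tests above can be sketched as one-off docker run invocations. The image names and test commands (myorg/bench, run-io-test, run-load-test) are placeholders for whatever reproduction harness you have; run these in staging, not production:

```shell
# Placeholder images/commands; staging only.

# 1) Is disk the bottleneck? Give the workload tmpfs scratch space.
docker run --rm --tmpfs /scratch:rw,size=512m myorg/bench:latest run-io-test /scratch

# 2) Is logging blocking? Drop the logging driver entirely.
docker run --rm --log-driver none myorg/api:4.2.1 run-load-test

# 3) Is bridge/NAT/conntrack in the way? Use the host network stack.
docker run --rm --network host myorg/api:4.2.1 run-load-test
```

If one of these makes the problem disappear, you've found your layer. If none do, you've eliminated three suspects in three commands.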
Speed matters during incidents. You don’t need perfect truth; you need the next constraint.
Storage and filesystem bottlenecks (overlay2, fsync, and the tax you forgot)
Disk is the usual suspect because it’s the easiest way to create queueing. Modern CPUs can outrun storage by orders of magnitude. Containers don’t change physics; they just add a few creative ways to trip over it.
Overlay2: fast enough until it isn’t
Overlay2 combines a read-only image layer stack with a writable upper layer. Reads often hit cache and feel great. Writes can trigger copy-up: the first time you modify a file that exists in a lower layer, overlay has to copy it into the writable layer. That’s not just data copy; it’s metadata operations, permission checks, and sometimes a surprise amount of work for what you thought was a “small write.”
Overlay2 pain shows up as:
- Slow package installs or build steps inside containers.
- Latency spikes when an app writes to paths that were baked into the image.
- High metadata IOPS (stat, open, rename, unlink) rather than large sequential throughput.
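You can see copy-up cost directly with a throwaway container. This is a sketch, not a benchmark: the image and target file are examples, and the append mutates the file, so only do this in a container you're about to discard:

```shell
# Throwaway container; ubuntu:22.04 and /etc/services are just examples
# of an image and a file baked into its lower layer.
docker run --rm ubuntu:22.04 bash -c '
  # First write to a lower-layer file forces overlay2 to copy it up.
  time ( echo x >> /etc/services )
  # Second write hits the already-copied upper-layer file: cheaper.
  time ( echo x >> /etc/services )
'
```

The absolute numbers will be tiny for one small file; the point is the shape. Multiply that first-write penalty by a workload that touches thousands of image-baked files and you have your latency spike.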
Journaling, barriers, and why fsync is a performance contract
Applications that call fsync (databases, message queues, anything pretending to be durable) are asking the storage stack to commit work to stable media. On good NVMe, this is fine. On networked storage, consumer SSDs with shaky firmware, or overloaded volumes, it turns into a queue. The container isn’t slow; the storage is being honest.
One more fun detail: even if your app doesn’t fsync, your filesystem journal might. Metadata-heavy workloads can cause frequent journal commits, especially if the underlying device has high latency or inconsistent writeback behavior.
Bind mounts and “I didn’t know that was NFS”
Bind mounts are great. They’re also a foot-gun when the mounted path sits on a slow or inconsistent backend: NFS, SMB, FUSE layers, encryption, or a cloud block device that is quietly rate-limited. A container can be “local” and still be writing to “somewhere else.”
Volumes: better for hot write paths
If your app writes frequently, put the write-heavy directories on a Docker volume or bind mount to a dedicated filesystem. Keep the container’s writable layer as cold as possible. Overlay2 is a fine default, not a high-performance database filesystem.
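A minimal sketch of that placement, assuming hypothetical names (api-data, /var/lib/app): hot persistent writes go to a named volume, scratch data goes to tmpfs, and the overlay2 writable layer stays cold:

```shell
# Placeholder names; adapt volume name, mount target, and image.
docker run -d --name api \
  --mount type=volume,source=api-data,target=/var/lib/app \
  --tmpfs /tmp:rw,size=256m \
  myorg/api:4.2.1
```

The design choice is simple: anything that writes often gets a mount you chose on purpose, instead of the layer you got by default.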
Joke #1 (short, relevant): Containers are like moving into a tiny apartment: you can live there, but don’t try to run a sawmill in the kitchen.
Memory pressure and the “it’s not OOM but it’s dying” problem
Memory issues don’t always announce themselves with an OOM kill. In fact, the nastiest latency bugs happen before OOM. That’s when the kernel tries very hard to keep you alive by reclaiming pages, compacting memory, and occasionally swapping out something important.
What memory pressure looks like in containers
- Latency spikes during traffic bursts, then recovery when load drops.
- GC pauses get worse (because allocations trigger reclaim and page faults).
- Disk I/O rises for no obvious reason (swap or writeback).
- CPU stays moderate, but the app feels “stuck.”
Cgroup memory limits change kernel behavior
When you set memory limits, you’re not just preventing the container from using too much RAM. You’re changing when and how reclaim happens. If the container is close to its limit, it may hit direct reclaim more often, which can block application threads. If swap is enabled, it may swap inside the cgroup, which often looks like random latency.
Also: file cache matters. Starving a container of memory can reduce effective caching and shift the load to disk. The CPU stays idle while your storage stack does interpretive dance.
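With cgroup v2, the container's memory.events file counts exactly these pre-OOM pressure events. The field names below are the real cgroup v2 ones; the numbers are invented sample values, and on a live host you would cat the file from the container's cgroup directory instead:

```shell
# Sample memory.events contents; field names match cgroup v2,
# values are made up for illustration.
cat <<'EOF' > /tmp/memory.events.sample
low 0
high 3421
max 120
oom 0
oom_kill 0
EOF

# Flag the pressure signals that cause latency without crashes.
awk '
  $1 == "high"     && $2 > 0 { printf "hit memory.high %d times: reclaim stalls likely\n", $2 }
  $1 == "max"      && $2 > 0 { printf "hit memory.max %d times: hard-limit pressure\n", $2 }
  $1 == "oom_kill" && $2 > 0 { printf "oom_kill fired %d times\n", $2 }
' /tmp/memory.events.sample
```

A rising high counter with zero oom_kill is the classic "it's not OOM but it's dying" signature.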
Networking: latency, conntrack, and DNS that looks innocent
Networking performance in containers is often “good enough” until the day it isn’t. The failure mode under load is typically not bandwidth. It’s latency and drops: retransmits, queueing, and timeouts.
Bridge/NAT/conntrack overhead
The default Docker bridge setup usually involves NAT and connection tracking. Every connection becomes state in conntrack. If you have high connection churn (short-lived HTTP calls, service meshes, aggressive clients), the conntrack table can fill. When it fills, the kernel starts dropping new connections or timing them out. Your app reports “upstream timeout,” and you stare at CPU because it’s low.
DNS: death by a thousand 5-second timeouts
DNS issues are the most underdiagnosed source of container lag. A container that occasionally can’t resolve names will behave like it’s “slow,” not “broken.” The resolver waits. Your threads wait. The CPU sits there politely.
Misconfigurations include:
- Upstream DNS overloaded or rate-limited.
- Too many search domains causing repeated queries.
- glibc resolver behavior creating serial queries and slow fallback.
- Docker embedded DNS under stress (especially with many containers and frequent restarts).
Cgroups: throttling, quotas, and why “limits” are not free
Cgroups are the unsung reason containers can share a host without immediately starting a knife fight. They’re also a reliable way to create “mystery slowness.”
CPU quota throttling
If you set CPU limits, the kernel enforces them with quotas per period. Under load, a container can burn its quota early and then get throttled until the next period. The result: the process isn’t using CPU because it’s not allowed to. Your CPU graph looks calm. Your latency graph does not.
I/O controls (blkio / io controller)
Depending on cgroup version and configuration, containers can be weighted or limited for I/O. Even without explicit I/O limits, noisy neighbors can saturate a device, making everyone slow. With limits, you can accidentally kneecap your database and then blame the network.
PID limits and file descriptor limits
Not all limits are in cgroups. PIDs and ulimits can quietly cap concurrency. When you run out of file descriptors, apps can stall, fail to accept new connections, or spin on retries. It’s not CPU. It’s the kernel telling you “no” repeatedly.
Logging: the performance cliff disguised as observability
Logging is I/O. Logging is also often synchronous enough to hurt you. Docker’s default json-file logging can become a bottleneck when a container emits a high volume of logs. The daemon writes logs to disk. Disk gets busy. Everything that wants disk now waits. Sometimes the container blocks on stdout/stderr writes if buffers fill.
This is one of those problems that feels like witchcraft: “We added more debug logs and the service got slower.” Yes. You turned your production node into a typewriter.
Joke #2 (short, relevant): Debug logging in production is like adding a second steering wheel to your car—technically more control, practically more crashes.
Practical tasks: commands, outputs, and decisions (12+)
These are not academic. They are the commands you run when someone says “Docker is slow” and you want evidence before you change things.
Task 1: Check if the kernel is screaming about blocked tasks
cr0x@server:~$ dmesg -T | tail -n 20
[Mon Feb 3 10:14:52 2026] INFO: task myservice:23144 blocked for more than 120 seconds.
[Mon Feb 3 10:14:52 2026] Tainted: G W OE 5.15.0-97-generic #107-Ubuntu
[Mon Feb 3 10:14:52 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
What it means: A process is stuck, commonly in uninterruptible sleep (I/O). This is the kernel waving a flare.
Decision: Stop debating CPU. Move immediately to I/O and filesystem checks (iostat, pidstat, storage backend latency).
Task 2: Identify containers with high restart churn or weird status
cr0x@server:~$ docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}\t{{.Image}}'
NAMES STATUS RUNNING FOR IMAGE
api Up 3 hours (healthy) 3 hours myorg/api:4.2.1
worker Up 3 hours 3 hours myorg/worker:4.2.1
sidecar Restarting (1) 5s ago 2 minutes myorg/sidecar:1.9.0
What it means: Restart loops can create load (DNS queries, image layer churn, log spam) and mask the real bottleneck.
Decision: Stabilize crashing containers first. Performance tuning a flapping process is performance theater.
Task 3: Check per-container CPU throttling
cr0x@server:~$ docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}"
NAME CPU % MEM USAGE / LIMIT NET I/O BLOCK I/O
api 35.12% 1.2GiB / 2GiB 1.1GB / 980MB 18GB / 9GB
worker 12.44% 800MiB / 1GiB 200MB / 210MB 3GB / 1GB
What it means: This is a hint, not proof. High CPU% here doesn’t show throttling; it shows usage.
Decision: If you suspect limits, check cgroup throttling counters (next task). Don’t assume “35%” means headroom.
Task 4: Read cgroup CPU throttling counters (cgroup v2)
cr0x@server:~$ CID=$(docker inspect -f '{{.Id}}' api)
cr0x@server:~$ CGP=$(find /sys/fs/cgroup -name "*$CID*" -type d 2>/dev/null | head -n 1)
cr0x@server:~$ cat "$CGP/cpu.stat"
usage_usec 987654321
user_usec 700000000
system_usec 287654321
nr_periods 123456
nr_throttled 45678
throttled_usec 912345678
What it means: nr_throttled and throttled_usec rising fast indicates quota throttling. Your app is being paused by policy.
Decision: Raise CPU limit, remove quota for latency-sensitive services, or tune worker concurrency. If you need fairness, use reservations/weights carefully, not a tight quota.
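To turn those raw counters into something you can argue about in an incident channel, compute the throttle ratio. This sketch replays the sample cpu.stat values from the task above through awk; on a live host you would read the real file from the cgroup path instead:

```shell
# Replay the sample cpu.stat from Task 4; on a real host, read
# the actual file from the container's cgroup directory.
cat <<'EOF' > /tmp/cpu.stat.sample
usage_usec 987654321
user_usec 700000000
system_usec 287654321
nr_periods 123456
nr_throttled 45678
throttled_usec 912345678
EOF

awk '
  { v[$1] = $2 }
  END {
    printf "periods throttled: %.1f%%\n", 100 * v["nr_throttled"] / v["nr_periods"]
    printf "time throttled:    %.1f s\n", v["throttled_usec"] / 1e6
  }
' /tmp/cpu.stat.sample
```

With these numbers, the container was paused in 37% of scheduler periods. That's not "headroom"; that's policy-induced latency.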
Task 5: Spot host-wide I/O saturation quickly
cr0x@server:~$ iostat -xz 1 3
Linux 5.15.0-97-generic (server) 02/04/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 4.20 28.50 0.00 55.20
Device r/s w/s rkB/s wkB/s avgqu-sz await svctm %util
nvme0n1 120.0 900.0 8200.0 64000.0 35.2 28.4 0.9 98.0
What it means: %util near 100% and high await means the device is saturated and requests are queueing. iowait is elevated too.
Decision: Move write-heavy paths to faster storage, reduce fsync frequency (only if you can accept durability tradeoffs), split workloads across devices, or fix the “log flood” problem.
Task 6: Identify which processes are waiting on disk
cr0x@server:~$ pidstat -d 1 5
Linux 5.15.0-97-generic (server) 02/04/2026 _x86_64_ (16 CPU)
12:10:01 UID PID kB_rd/s kB_wr/s kB_ccwr/s iodelay Command
12:10:02 0 23144 10.00 52000.00 0.00 3200 myservice
12:10:02 0 14521 0.00 9000.00 0.00 600 dockerd
What it means: The service and even dockerd are writing a lot. iodelay indicates time spent waiting on I/O.
Decision: If dockerd is heavy, suspect logging driver or image layer churn. If the service is heavy, inspect its write paths and fsync behavior.
Task 7: Prove overlay2 is in play and where it lives
cr0x@server:~$ docker info --format '{{.Driver}} {{.DockerRootDir}}'
overlay2 /var/lib/docker
What it means: You’re using overlay2 under /var/lib/docker. If that filesystem is slow, everything Docker does is slow.
Decision: Put /var/lib/docker on fast local SSD/NVMe for serious workloads. If it’s on network storage, expect pain.
Task 8: Check filesystem type and mount options for the Docker root
cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/lib/docker
/dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro
What it means: ext4 with typical options. If you see NFS/FUSE or odd sync options, that’s a red flag.
Decision: If mount options include sync or the backend is remote, stop and redesign. Containers don’t fix slow storage.
Task 9: Check Docker log file growth (json-file driver)
cr0x@server:~$ ls -lh /var/lib/docker/containers/*/*-json.log | sort -k5 -h | tail -n 5
-rw-r----- 1 root root 6.2G Feb 4 12:09 /var/lib/docker/containers/7c.../7c...-json.log
-rw-r----- 1 root root 7.8G Feb 4 12:09 /var/lib/docker/containers/aa.../aa...-json.log
What it means: Huge logs mean huge writes, plus rotation pain if misconfigured.
Decision: Enable log rotation, reduce log verbosity, and consider a logging driver that doesn’t pin everything to the node’s hot disk.
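The rotation fix can live in /etc/docker/daemon.json. The max-size and max-file keys are real json-file driver options; the values here are a starting point, not a recommendation for every workload:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```

Note that daemon-level log options apply to containers created after the daemon restart; existing containers keep their old settings until recreated.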
Task 10: Measure DNS latency from inside the container
cr0x@server:~$ docker exec -it api bash -lc 'time getent hosts db.internal >/dev/null'
real 0m2.013s
user 0m0.000s
sys 0m0.004s
What it means: Two seconds to resolve a name is a performance incident waiting to happen.
Decision: Inspect resolver config, upstream DNS performance, and search domains. Fix DNS before tuning application threads.
Task 11: Inspect resolver configuration inside the container
cr0x@server:~$ docker exec -it api cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:5
search corp.example internal.example svc.cluster.local
What it means: Docker embedded DNS (127.0.0.11) and ndots:5 with multiple search domains can multiply lookups.
Decision: Reduce search domains, tune ndots where appropriate, and ensure upstream DNS is healthy. In some environments, bypass embedded DNS with explicit resolvers.
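The fan-out is easy to quantify. With the resolv.conf above (ndots:5, three search domains), a name with fewer than five dots gets the search list tried first, so a single miss can cost four upstream queries instead of one. A sketch of that arithmetic, with db.internal as an example name:

```shell
# Values copied from the resolv.conf above; the name is an example.
name="db.internal"
ndots=5
search_domains=3   # corp.example internal.example svc.cluster.local

# Count the dots in the name; fewer than ndots triggers search expansion.
dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
if [ "$dots" -lt "$ndots" ]; then
  echo "worst case: $((search_domains + 1)) queries to resolve $name"
else
  echo "worst case: 1 query to resolve $name"
fi
```

Four queries per lookup, times a slow upstream, times every cold connection: that's how "DNS is fine" becomes a p99 problem.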
Task 12: Check conntrack table usage
cr0x@server:~$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_count = 248932
net.netfilter.nf_conntrack_max = 262144
What it means: You’re near the ceiling. New connections will start failing under bursts.
Decision: Increase conntrack max (with memory awareness), reduce connection churn (keepalive, pooling), or reduce NAT dependency (host networking, direct routing).
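A small headroom check makes this alertable. This sketch hard-codes the sample numbers from the sysctl output above; in a real script you would read them live and pick your own warning threshold:

```shell
# Sample values from the sysctl output above; read live on a real host.
count=248932
max=262144

# Integer percent full (truncated).
pct=$(awk -v c="$count" -v m="$max" 'BEGIN { printf "%d", 100 * c / m }')
echo "conntrack table ${pct}% full"
if [ "$pct" -ge 90 ]; then
  echo "WARN: raise nf_conntrack_max or reduce connection churn"
fi
```

At 94% full, a traffic burst will hit the ceiling before your dashboard refreshes. Alert well below 100%.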
Task 13: Look for retransmits and general TCP misery
cr0x@server:~$ ss -s
Total: 2138
TCP: 1821 (estab 932, closed 756, orphaned 3, timewait 650)
Transport Total IP IPv6
RAW 0 0 0
UDP 45 40 5
TCP 1065 980 85
INET 1110 1020 90
FRAG 0 0 0
What it means: Lots of TIMEWAIT can indicate high connection churn. Not automatically wrong, but suspicious under load.
Decision: If you see churn, add keepalives, reuse connections, and check client behavior. Then validate conntrack capacity.
Task 14: Verify open file descriptor pressure for a containerized process
cr0x@server:~$ PID=$(pgrep -f myservice | head -n 1)
cr0x@server:~$ ls /proc/$PID/fd | wc -l
9823
What it means: Nearly 10k FDs. If your limits are low, you’re close to failure; if you’re already failing, you’ll see “too many open files.”
Decision: Raise ulimit for the service, audit connection leaks, and verify host-level fs.file-max and per-process limits.
Task 15: Check container’s ulimit setting
cr0x@server:~$ docker exec -it api bash -lc 'ulimit -n'
1024
What it means: 1024 is tiny for modern network services under load.
Decision: Set higher --ulimit nofile=... or configure in your orchestrator. Then verify the application actually uses connection pooling.
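A minimal sketch of the fix at run time. The value 65536 is a common starting point, not a universal answer; soft and hard limits are set together here, and your orchestrator likely has its own equivalent knob:

```shell
# Placeholder service/image names; 65536:65536 sets soft:hard nofile.
docker run -d --name api \
  --ulimit nofile=65536:65536 \
  myorg/api:4.2.1
```

Then re-run the ulimit check from this task inside the container to confirm the limit actually took effect.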
Task 16: Detect memory pressure on the host (swap/reclaim hints)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 2 524288 81234 10240 901234 5 8 120 1800 900 1400 12 4 55 29 0
What it means: Non-zero si/so indicates swapping. A non-zero b column (processes blocked, usually on I/O) combined with high wa confirms the system is waiting, not computing.
Decision: Reduce memory pressure: raise limits where safe, fix leaks, add RAM, or re-balance workloads. Swapping a latency-sensitive service is a choice; choose differently.
Three corporate mini-stories from the performance trenches
Mini-story 1: The incident caused by a wrong assumption
They had a high-traffic API and a clean migration plan: move it into containers on a new node pool. Same binaries, same config, “just Docker.” The team did the reasonable thing and watched CPU. It stayed low. They declared victory.
Then Monday arrived. p99 latency tripled, but average latency looked only mildly worse. The autoscaler added more containers, which made the problem worse in a way that felt personal. Everyone argued about thread pools and garbage collection because those are arguments engineers can have without leaving their chairs.
The wrong assumption was simple: they assumed the container’s filesystem behaved like the host’s filesystem. In reality, the app wrote temporary files into a path that lived in the image layer, not a volume. Under load, overlay2 copy-up churned and metadata ops exploded. The disk wasn’t slow in throughput; it was overloaded in small random writes and journal commits.
They proved it by moving only the temp directory to a volume on fast storage. Nothing else changed. Latency fell back to normal, CPU stayed low, and the incident chat went silent. The lesson wasn’t “overlay2 is bad.” The lesson was “write paths are architecture.”
Mini-story 2: The optimization that backfired
A different company had a logging bill they didn’t like. Someone proposed a simple optimization: crank up debug logs only during peak hours, so they could catch edge cases. They shipped a config toggle. It worked in staging. It even worked in production… for a day.
On the second day, peak traffic arrived with peak debug logging. Disk utilization hit the ceiling. Dockerd spent real time writing json logs. The application started blocking on stdout because buffers filled during bursts. Latency climbed. Retries increased. More logs. More disk writes. A feedback loop formed, like a tutorial for how to build your own outage.
The “optimization” was based on a myth: that logging is free if CPU is available. Logging is I/O, and I/O under contention is latency. They rolled back debug logging, enabled log rotation (which they should have had anyway), and moved high-volume logs to an asynchronous pipeline with backpressure.
After the postmortem, the team stopped describing logging as “observability” and started treating it as “a production workload that competes with the service.” That change in phrasing prevented more incidents than any single config tweak.
Mini-story 3: The boring but correct practice that saved the day
This one is less glamorous. A team ran stateful services in containers with strict change control. They had a habit—unfashionable, almost embarrassing—of performance baselining every node pool: disk latency, network RTT, conntrack headroom, and memory reclaim behavior. They stored the results as a simple report and reran it after kernel upgrades.
One afternoon, a new batch of nodes came online. Everything “worked,” but p99 latency was slightly worse. No alarms. Just a quiet regression. Because they had baselines, they noticed immediately that storage latency percentiles were off and that a mount option differed on the Docker root filesystem.
It turned out the provisioning pipeline had applied a different filesystem config on the new nodes. Not catastrophic, but enough to increase write latency under sync-heavy workloads. They drained the nodes, corrected the config, and moved on. No outage. No heroics. Just boring competence.
If you want reliability, you have to be willing to be boring in advance. That’s not a slogan; it’s a budget line item.
Common mistakes: symptom → root cause → fix
1) Symptom: CPU low, latency high, threads “stuck”
Root cause: I/O saturation or blocked tasks (D-state), often from overlay2 write amplification or slow backing storage.
Fix: Put hot write paths on volumes; move Docker root to fast local storage; reduce synchronous writes; confirm with iostat/pidstat.
2) Symptom: Random 1–5s stalls, especially on new connections
Root cause: DNS timeouts amplified by resolver settings (ndots/search domains) or overloaded upstream DNS.
Fix: Measure resolution time in-container; simplify resolv.conf; fix upstream DNS; consider caching resolvers close to workloads.
3) Symptom: p99 spikes during bursts, no OOM kills
Root cause: Memory pressure causing reclaim/compaction; container memory limits too tight; swap activity.
Fix: Increase memory headroom; set appropriate memory limits; reduce per-request allocations; avoid swap for latency-critical services.
4) Symptom: Service scales out but gets slower
Root cause: Shared bottleneck: single disk device saturated, shared network egress, conntrack table pressure, or centralized logging bottleneck.
Fix: Identify the shared resource. Scaling compute doesn’t scale the disk. Split devices, add nodes with independent I/O, reduce log writes.
5) Symptom: Periodic “stutter” every 100ms or 1s under load
Root cause: CPU quota throttling (cgroup periods). The container hits quota, pauses, resumes.
Fix: Loosen CPU limits or use CPU shares/requests; keep quotas for batch jobs, not tail-latency services.
6) Symptom: Connection errors, timeouts, SYN retries during peak
Root cause: Conntrack table full or NAT overhead; high connection churn creating TIMEWAIT storms.
Fix: Increase conntrack limits, reduce churn with keepalive, and consider network modes that reduce NAT/conntrack dependence.
7) Symptom: “Too many open files” or weird partial failures
Root cause: Low ulimit in container or host; FD leak in application.
Fix: Raise nofile; audit FD usage; use connection pooling; alert on FD count.
8) Symptom: Node disk fills unexpectedly, then everything degrades
Root cause: Unbounded container log files or runaway temporary data on Docker root.
Fix: Configure log rotation; cap log volume; store temp data on sized volumes; set alerts on /var/lib/docker usage.
Checklists / step-by-step plan (boring, repeatable, effective)
Incident checklist: “containers are slow right now”
- Confirm scope: one service, one node, or everything?
- Check I/O saturation: run iostat -xz; if %util is pinned and await is high, treat storage as primary.
- Check blocked tasks: scan dmesg for hung tasks; confirm with process states.
- Check memory pressure: vmstat, swap activity, cgroup memory events if available.
- Check throttling: read cpu.stat for throttling counters; compare to request latency pattern.
- Check DNS: time a resolution inside the container; look at /etc/resolv.conf.
- Check conntrack: compare count vs max; look for drops in kernel logs if present.
- Check logs: size of json logs; disk usage under /var/lib/docker.
- Isolate by bypassing layers (in staging or carefully): tmpfs for temp, volume for hot writes, host networking test.
- Make one change: pick the highest-confidence bottleneck and change a single thing.
- Measure again: did p99 improve? did iostat/conntrack/dns timings improve?
Preventative checklist: design containers that don’t lag
- Put write-heavy paths on volumes: databases, queues, caches with persistence, and temp directories.
- Keep the image layer cold: don’t write to paths baked into the image at runtime.
- Set sane ulimits: especially nofile and process limits for high-concurrency services.
- Avoid tight CPU quotas for latency-sensitive services: prefer weights and reservations; quotas are for batch fairness.
- Configure log rotation: don’t trust defaults; cap size and count.
- Baseline the node: disk latency, network RTT, DNS resolution time, conntrack headroom.
- Alert on the real bottlenecks: disk await, swap activity, conntrack count, log growth, throttling counters.
- Test with production-like I/O patterns: especially fsync behavior and metadata-heavy workloads.
Change plan: improving performance without cargo-cult tuning
- Pick one service with measurable latency pain and stable traffic patterns.
- Capture a baseline: p50/p95/p99 latency, disk await, DNS timings, throttling counters, and error rates.
- Identify the strongest constraint: disk saturation, memory reclaim, DNS, conntrack, or throttling.
- Apply the smallest viable fix: move one directory to a volume, change one limit, rotate logs, or tune DNS.
- Re-run the same load and compare to baseline; keep the change only if it moves the needle.
- Automate the check: convert the manual commands into dashboards/alerts and a runbook.
FAQ
Why does my container slow down when CPU is low?
Because the work is waiting, not computing. The usual culprits are disk latency, memory reclaim, DNS timeouts, network drops/retransmits, or CPU quota throttling.
Is overlay2 always slower than running on the host filesystem?
No. It’s often fine for read-heavy workloads and moderate writes. It gets ugly when you do lots of small writes, metadata churn, or modify files that exist in lower image layers (copy-up cost).
Should I put databases in Docker?
You can, but you must treat storage as a first-class design choice: dedicated volumes, predictable latency storage, and honest fsync testing. If you run a database on a saturated shared disk and blame Docker, the disk will laugh quietly.
Do CPU limits make performance more predictable?
They make CPU usage more predictable. Latency often becomes less predictable because throttling introduces periodic pauses. For tail-latency sensitive services, tight quotas are usually the wrong tool.
How do I know if logging is hurting performance?
Look for large json log files, high dockerd write activity, and disk saturation. Temporarily reducing log volume in a controlled test is an easy way to validate causality.
Why do DNS issues look like application slowness?
Because DNS failures are often slow failures (timeouts) rather than fast errors. Calls block while resolving. Under load, those blocked threads cascade into queueing and timeouts elsewhere.
What’s the fastest way to confirm disk is the bottleneck?
iostat -xz 1 on the host for saturation and pidstat -d to find who’s doing the writing. High await plus high utilization is a smoking gun.
Why does scaling out sometimes make it worse?
Because you’re scaling the wrong resource. More containers can increase contention on a shared disk, shared NAT/conntrack, shared DNS, or shared logging pipeline.
Are these issues “Docker-specific” or “Linux-specific”?
Mostly Linux-specific. Docker makes them easier to trigger because it adds layers and defaults. The kernel still does the real work, and the kernel is where the bottlenecks live.
Should I switch to host networking for performance?
Sometimes it helps by reducing NAT/conntrack overhead, but it changes isolation and port management. Use it as a diagnostic tool first, then decide based on risk and measurable gains.
Conclusion: next steps you can actually do this week
If your containers lag under load and CPU is low, stop staring at CPU like it owes you money. Treat performance like an investigation: find what’s waiting, not what’s busy.
- Instrument the bottlenecks you keep missing: disk await/utilization, swap activity, cgroup throttling counters, DNS resolution time, conntrack usage, and Docker log growth.
- Move hot write paths off the container writable layer: use volumes for temp, caches, and anything resembling a database.
- Fix the “silent killers”: DNS search domains/ndots, conntrack ceilings, ulimits, and log rotation.
- Re-evaluate limits: especially CPU quotas. If you want fairness, don’t accidentally buy latency.
- Baseline nodes and keep the reports: it’s boring, and it prevents outages you’ll never get credit for.
Containers aren’t slow. Unmeasured constraints are slow. Docker just makes it easier to pretend the constraints aren’t there—right up until your p99 graph starts screaming.