It starts as “the database feels kind of slow.” Then your dashboards get weird: CPU is low, disk looks fine, queries are not obviously worse, but the app is timing out anyway. You restart the container and—miracle—everything is fast again. For a day.
That’s usually not a query problem. It’s memory pressure plus container limits, the kind that doesn’t scream in logs. Docker doesn’t “throttle” memory the way it throttles CPU, but you can absolutely end up with silent performance collapse: reclaim storms, swap thrash, allocator stalls, I/O amplification, and databases politely adapting by getting slower.
What “silent throttling” really means in Docker
Docker memory limits are not like CPU limits. With CPU, you get clear throttling counters. With memory, you get a hard wall (memory.max / --memory) and then one of two things happens:
- The kernel reclaims aggressively (file cache, anonymous memory, slabs), which looks like “nothing is wrong” until latency spikes and I/O goes vertical.
- The OOM killer ends your process, which is at least honest.
In between those extremes is the miserable middle: the database is alive, but it’s fighting the kernel for memory and losing slowly. That’s your “silent throttling.”
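To make that hard wall concrete, here is a minimal sketch on a cgroup v2 host. The container name, the placeholder password, and the 2 GiB budget are illustrative, not recommendations:
cr0x@server:~$ docker run -d --name db-test -e POSTGRES_PASSWORD=changeme --memory=2g --memory-swap=2g postgres:16
cr0x@server:~$ docker exec db-test cat /sys/fs/cgroup/memory.max
2147483648
Setting --memory-swap equal to --memory leaves no swap headroom: this container either fits in 2 GiB or it gets the reclaim/OOM behavior described above.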
The three-layer trap: database cache, kernel cache, container limit
On a normal host, databases rely on two caching tiers:
- Database-managed cache (MySQL InnoDB buffer pool, PostgreSQL shared buffers).
- Kernel page cache (filesystem cache).
Put them in a container with a strict memory limit and you’ve added a third constraint: the cgroup. Now one budget has to cover everything: the database-managed cache, the kernel page cache for your data files, and “everything else” (connections, work memory, sort/hash, replication buffers, background processes, shared libraries, malloc overhead). The kernel doesn’t care that you “only changed one config”; it accounts RSS and page cache against your cgroup and reclaims based on pressure.
Why this looks like a network or storage bug
Memory pressure manifests as:
- random p99 latency spikes
- increased fsync/flush times
- more read I/O despite stable query mix
- CPU that looks “fine” because threads are blocked in the kernel
- connections piling up, then cascading timeouts
Joke #1: Databases under memory pressure are like people under caffeine withdrawal—still technically functioning, but every small request becomes personal.
Interesting facts and history that actually matter
Some context points that change how you think about “just set a container memory limit.”
- PostgreSQL has relied on the OS page cache forever. Its architecture intentionally leaves a lot of caching to the kernel; shared_buffers is not meant to be “all memory.”
- MySQL’s InnoDB buffer pool became the default workhorse as InnoDB replaced MyISAM as the standard engine (MySQL 5.5 era). That entrenched the habit of “make the buffer pool huge.”
- Linux cgroups v1 and v2 account memory differently enough to confuse dashboards. The same container can look “fine” on v1 and “mysteriously constrained” on v2 if you don’t check the right files.
- Docker added better defaults over time, but Compose files still encode bad folklore. You still see mem_limit set without any corresponding DB tuning, which is basically a performance roulette wheel.
- OOM killer behavior in containers used to surprise people more. Early container adoption taught teams that “the host has plenty of RAM” doesn’t matter if the cgroup limit is low.
- PostgreSQL introduced huge pages support long ago, but it’s rarely used in containers because it’s operationally annoying and sometimes incompatible with constrained environments.
- MySQL/InnoDB has multiple memory consumers outside the buffer pool (adaptive hash index, connection buffers, performance_schema, replication). Sizing only the buffer pool is a classic foot-gun.
- PostgreSQL’s per-query memory is often the real culprit. A too-generous work_mem multiplied by concurrency is how you “accidentally” allocate your way into a cgroup wall.
MySQL vs PostgreSQL: memory models that collide with containers
MySQL (InnoDB): one big pool, plus death by a thousand buffers
MySQL performance tuning culture is dominated by the InnoDB buffer pool. On bare metal, the rule-of-thumb is often “60–80% of RAM.” In containers, that advice becomes dangerous unless you redefine “RAM” as “container limit minus overhead.”
Memory buckets to account for:
- InnoDB buffer pool: the headline number. If you set it to 75% of the container limit, you have already lost.
- InnoDB log buffer and redo log related memory: usually small, but don’t pretend it’s zero.
- Per-connection buffers: read buffer, sort buffer, join buffer, tmp tables. Multiply by max connections. Then multiply by “peak is never average.”
- Performance Schema: can be surprisingly non-trivial if fully enabled.
- Replication: relay logs, binlog caches, network buffers.
Under memory pressure, InnoDB can keep running but with more disk reads and more background churn. It can look like storage got slower. Storage didn’t. You starved the cache and forced random reads.
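A back-of-the-envelope version of that accounting, with illustrative numbers (the ~4 MiB per-connection figure is an assumption for the example, not a MySQL default):
cr0x@server:~$ # hypothetical worst case: 1.5 GiB buffer pool + 500 connections x ~4 MiB of per-connection buffers, in MiB
cr0x@server:~$ echo $(( 1536 + 500 * 4 ))
3536
That is roughly 3.5 GiB of worst-case appetite inside a 2 GiB container, before the kernel page cache gets a single byte.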
PostgreSQL: shared buffers are only half the story
Postgres uses shared memory (shared_buffers) for caching data pages, but it also relies heavily on the kernel page cache. Then it adds a pile of other consumers:
- work_mem per sort/hash node, per query, per connection (and per parallel worker). This is where container budgets go to die.
- maintenance_work_mem for vacuum/index builds. Your slow night job can be your daytime incident.
- autovacuum workers and background writer: they don’t just use CPU; they create I/O and can amplify memory pressure indirectly.
- shared memory overhead, catalogs, connection memory contexts, extensions.
In Postgres, “the database is slow” under container limits often means the kernel is reclaiming page cache and Postgres is doing more real reads. Meanwhile, a handful of concurrent queries may be ballooning memory via work_mem. It’s a two-front war.
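The same back-of-the-envelope arithmetic for Postgres, again with illustrative values; work_mem is a per-sort/per-hash allowance, so concurrency multiplies it:
cr0x@server:~$ # hypothetical: 512 MiB shared_buffers + 50 active queries x 2 sort/hash nodes x 64 MiB work_mem, in MiB
cr0x@server:~$ echo $(( 512 + 50 * 2 * 64 ))
6912
Nearly 7 GiB of theoretical appetite in a 2 GiB container. It rarely all materializes at once, which is exactly why the failures look intermittent.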
Silent throttling patterns: MySQL vs Postgres
MySQL pattern: buffer pool too big → little headroom → page cache squeezed → checkpoint/flush behavior worsens → random read I/O spikes → latency creeps up without an obvious error.
Postgres pattern: moderate shared_buffers but generous work_mem → concurrency spike → anonymous memory growth → cgroup pressure → reclaim and/or swap → query runtimes explode, sometimes with no single “bad query.”
Operational opinion: if you’re containerizing databases, stop thinking in “percent of host RAM.” Think in “hard budget with worst-case concurrency.” You will be less popular at design reviews. You will be more popular at 3 a.m.
One quote, paraphrased, because it’s still true decades later:
— Jim Gray: “The best way to improve performance is to measure it first.”
Fast diagnosis playbook
This is the order that finds bottlenecks quickly when a DB in Docker “just got slow.” You can do it in 10–15 minutes if you keep your hands steady and your assumptions weak.
1) Confirm the container’s real memory budget (not what you think you set)
- Check cgroup memory.max / docker inspect.
- Check whether swap is allowed (memory.swap.max or docker --memory-swap).
- Check whether the orchestrator overrides Compose values.
2) Decide: are we reclaiming, swapping, or OOMing?
- Look for OOM kills in dmesg / journalctl.
- Check cgroup memory events (memory.events on v2).
- Check major page faults and swap in/out.
3) Correlate with DB memory behavior
- MySQL: buffer pool size, dirty page %, history list length, temp table usage, max connections.
- Postgres: shared_buffers, work_mem, active sorts/hashes, temp file creation, autovacuum activity, connection count.
4) Confirm whether storage is the victim or the culprit
- Measure read IOPS and latency at the host level.
- Check whether reads increased after memory pressure started.
- Check fsync-heavy behavior (checkpointing, WAL, redo flush).
5) Make one safe change
Don’t “tune everything.” You’ll never know what fixed it and you’ll probably break something else. Pick one target: either reduce DB memory appetite or increase container budget. Then verify with the same counters.
Hands-on tasks (commands, outputs, decisions)
These are real tasks you can run today. Each includes: command, sample output, what it means, and the decision you make.
Task 1: Identify the container and its configured memory limit
cr0x@server:~$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
NAMES IMAGE STATUS
db-mysql mysql:8.0 Up 3 days
db-postgres postgres:16 Up 3 days
cr0x@server:~$ docker inspect -f '{{.Name}} mem={{.HostConfig.Memory}} swap={{.HostConfig.MemorySwap}}' db-mysql
/db-mysql mem=2147483648 swap=4294967296
Meaning: the memory limit is 2 GiB. MemorySwap is the combined memory-plus-swap ceiling, so 4 GiB here means the container can push roughly 2 GiB to swap on top of its RAM budget. That is not “free.” It’s latency.
Decision: If you see swap enabled for latency-sensitive DBs, either disable swap for the container or size memory so you don’t need it.
Task 2: Verify cgroup v2 memory.max from inside the container
cr0x@server:~$ docker exec -it db-postgres bash -lc 'cat /sys/fs/cgroup/memory.max; cat /sys/fs/cgroup/memory.current'
2147483648
1967855616
Meaning: limit 2 GiB, current usage ~1.83 GiB. That’s close to the ceiling.
Decision: If memory.current sits near memory.max during normal load, you don’t have a “spike” problem. You have a sizing problem.
Task 3: Check cgroup memory pressure events (v2)
cr0x@server:~$ docker exec -it db-postgres bash -lc 'cat /sys/fs/cgroup/memory.events'
low 0
high 214
max 0
oom 0
oom_kill 0
Meaning: high counts the times usage crossed the memory.high threshold and the kernel forced reclaim (it only moves if a memory.high limit is set for the cgroup). No OOM yet, but the cgroup is under sustained pressure.
Decision: If high keeps climbing during incidents, treat it as a first-class signal. Reduce memory consumption or raise the limit.
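If your kernel exposes pressure stall information (PSI), the cgroup’s memory.pressure file is a useful companion signal; the numbers below are illustrative:
cr0x@server:~$ docker exec -it db-postgres bash -lc 'cat /sys/fs/cgroup/memory.pressure'
some avg10=1.24 avg60=0.87 avg300=0.31 total=5412345
full avg10=0.42 avg60=0.20 avg300=0.08 total=1298765
Here “some” is the share of time at least one task stalled waiting for memory and “full” is the time all tasks stalled. Sustained non-zero full during incidents is reclaim pain you can graph.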
Task 4: Find OOM kills from the host
cr0x@server:~$ sudo journalctl -k --since "2 hours ago" | grep -i oom | tail -n 5
Dec 31 09:12:44 server kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-8b1...,mems_allowed=0,oom_memcg=/docker/8b1...
Dec 31 09:12:44 server kernel: Killed process 27144 (mysqld) total-vm:3124280kB, anon-rss:1782400kB, file-rss:10240kB, shmem-rss:0kB
Meaning: mysqld got killed by the cgroup OOM killer. This is not “MySQL crashed.” This is “you ran out of budget.”
Decision: Don’t restart-loop. Fix the budget/config mismatch first.
Task 5: Confirm swap activity on the host
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 524288 31264 10240 88432 120 210 1820 2410 920 1640 12 8 61 19 0
1 2 526336 29812 10320 87020 140 180 2100 1980 980 1710 10 7 58 25 0
Meaning: swap-in (si) and swap-out (so) are non-zero during load. That’s a latency tax.
Decision: If DB latency matters, avoid swapping. Increase memory limit or reduce DB memory usage; also consider setting --memory-swap equal to --memory to prevent extra swap usage (environment-dependent).
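One way to enforce “no swap for this container” on plain Docker (if an orchestrator manages your limits, make the change there instead); the 2g value is just this example’s budget:
cr0x@server:~$ docker update --memory 2g --memory-swap 2g db-mysql
Then re-run docker inspect to confirm the runtime values actually changed.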
Task 6: See container-level memory usage and page cache via docker stats
cr0x@server:~$ docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}'
NAME MEM USAGE / LIMIT MEM % CPU %
db-mysql 1.95GiB / 2GiB 97.5% 35.2%
db-postgres 1.82GiB / 2GiB 91.0% 18.7%
Meaning: both are running near limits. Expect reclaim pressure and/or OOM.
Decision: Treat anything >90% sustained as “misconfigured,” not “busy.”
Task 7: Inspect per-process RSS inside the container (who is eating memory?)
cr0x@server:~$ docker exec -it db-mysql bash -lc 'ps -eo pid,comm,rss --sort=-rss | head'
PID COMMAND RSS
1 mysqld 1789320
112 bash 17520
98 ps 4440
Meaning: mysqld RSS is ~1.7 GiB. That’s before page cache and other overhead.
Decision: If mysqld RSS is close to the limit, reduce InnoDB buffer pool and per-connection buffers, or raise the limit.
Task 8: MySQL—confirm InnoDB buffer pool size and other big knobs
cr0x@server:~$ docker exec -it db-mysql bash -lc 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW VARIABLES WHERE Variable_name IN (\"innodb_buffer_pool_size\",\"max_connections\",\"tmp_table_size\",\"max_heap_table_size\",\"performance_schema\");"'
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 1610612736|
| max_connections | 500 |
| tmp_table_size | 67108864 |
| max_heap_table_size | 67108864 |
| performance_schema | ON |
+-------------------------+-----------+
Meaning: buffer pool is 1.5 GiB inside a 2 GiB container. Max connections is 500, so per-connection memory can blow past headroom.
Decision: Cut buffer pool to a defensible number (often 40–60% of limit for small containers) and reduce max_connections or add pooling. Containers hate “just in case” connection ceilings.
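A my.cnf fragment that matches that kind of budget for a 2 GiB container; the numbers are illustrative and should come from your own budget sheet, not from this article:
[mysqld]
innodb_buffer_pool_size = 1G
max_connections = 150
sort_buffer_size = 256K
join_buffer_size = 256K
tmp_table_size = 32M
max_heap_table_size = 32M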
Task 9: MySQL—check if you’re doing lots of temp table work (often memory→disk amplification)
cr0x@server:~$ docker exec -it db-mysql bash -lc 'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SHOW GLOBAL STATUS LIKE \"Created_tmp%\";"'
+-------------------------+--------+
| Variable_name | Value |
+-------------------------+--------+
| Created_tmp_disk_tables | 184220 |
| Created_tmp_tables | 912340 |
| Created_tmp_files | 2241 |
+-------------------------+--------+
Meaning: a significant fraction of temp tables are spilling to disk. Under memory pressure, this gets worse and looks like “storage regression.”
Decision: Reduce query spill (indexes, query plans), size temp settings carefully, and ensure enough memory headroom so temp operations aren’t forced into disk more often.
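To turn those counters into a ratio you can trend, here is a query sketch run inside the mysql client (it assumes performance_schema exposes global status, the default in MySQL 5.7+/8.0):
SELECT
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
   WHERE VARIABLE_NAME = 'Created_tmp_disk_tables')
  /
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
   WHERE VARIABLE_NAME = 'Created_tmp_tables') AS disk_tmp_ratio;
The counters are cumulative since server start, so watch the trend, not the absolute value.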
Task 10: PostgreSQL—confirm shared_buffers, work_mem, and max_connections
cr0x@server:~$ docker exec -it db-postgres bash -lc 'psql -U postgres -d postgres -c "SHOW shared_buffers; SHOW work_mem; SHOW maintenance_work_mem; SHOW max_connections;"'
shared_buffers
----------------
512MB
(1 row)
work_mem
----------
64MB
(1 row)
maintenance_work_mem
----------------------
1GB
(1 row)
max_connections
-----------------
300
(1 row)
Meaning: work_mem 64MB times concurrency is a trap. maintenance_work_mem 1GB in a 2GB container is a time bomb when vacuum/index jobs run.
Decision: Lower work_mem and maintenance_work_mem, and rely on connection pooling. If you need large work_mem, enforce concurrency limits or use resource queues at the app layer.
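An illustrative postgresql.conf fragment for a 2 GiB container, assuming pooling keeps real concurrency modest; treat the numbers as a starting budget, not gospel:
shared_buffers = 512MB
work_mem = 8MB
maintenance_work_mem = 128MB
autovacuum_work_mem = 128MB
max_connections = 100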
Task 11: PostgreSQL—see temp files (a strong indicator of memory shortfalls or bad plans)
cr0x@server:~$ docker exec -it db-postgres bash -lc 'psql -U postgres -d postgres -c "SELECT datname, temp_files, temp_bytes FROM pg_stat_database ORDER BY temp_bytes DESC LIMIT 5;"'
datname | temp_files | temp_bytes
-----------+------------+-------------
appdb | 12402 | 9876543210
postgres | 2 | 819200
(2 rows)
Meaning: large temp_bytes suggests sorts/hashes spilling. In containers, spills plus reclaim storms equal sad users.
Decision: Find the top spilling queries, tune indexes, and set work_mem based on concurrency, not wishful thinking.
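If the pg_stat_statements extension is installed (an assumption; it is not loaded by default), its temp block counters point straight at the spilling queries:
SELECT queryid, calls, temp_blks_written
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 5;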
Task 12: PostgreSQL—spot autovacuum pressure (it can look like random I/O “mystery”)
cr0x@server:~$ docker exec -it db-postgres bash -lc 'psql -U postgres -d postgres -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"'
relname | n_dead_tup | last_autovacuum
------------+------------+-------------------------------
events | 4821932 | 2025-12-31 08:44:12.12345+00
sessions | 1120044 | 2025-12-31 08:41:02.54321+00
(2 rows)
Meaning: lots of dead tuples means vacuum pressure. Vacuum creates I/O and can exacerbate cache churn, especially with tight memory limits.
Decision: Tune autovacuum thresholds per table and fix write amplification. Don’t “solve” it by giving vacuum unlimited memory in a tiny container.
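Per-table tuning looks like this; events is the hypothetical hot table from the output above, and the thresholds are examples to adapt, not defaults to copy:
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_cost_limit = 1000
);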
Task 13: Measure host I/O latency (are we forcing disk reads because cache is gone?)
cr0x@server:~$ iostat -x 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
11.22 0.00 6.11 18.33 0.00 64.34
Device r/s w/s rkB/s wkB/s await r_await w_await svctm %util
nvme0n1 820.0 210.0 50240.0 18432.0 18.2 21.4 5.6 1.2 98.0
Meaning: high utilization and high read await suggest you’re doing real reads and waiting on them. If this coincides with memory pressure, the disk is often just the messenger.
Decision: Don’t immediately buy faster disks. First confirm cache starvation and memory reclaim.
Task 14: Check major page faults (a reclaim/swap smell)
cr0x@server:~$ pid=$(docker inspect -f '{{.State.Pid}}' db-postgres); sudo cat /proc/$pid/stat | awk '{print "majflt="$12, "minflt="$10}'
majflt=48219 minflt=12984321
Meaning: major faults indicate the process had to fetch pages from disk (or swap). Under pressure, this climbs with latency.
Decision: If major faults increase during incidents, focus on memory headroom and cache behavior before blaming query plans alone.
Task 15: Verify Compose vs runtime settings (the “I set mem_limit” lie)
cr0x@server:~$ docker compose config | sed -n '/db-mysql:/,/^[^ ]/p'
db-mysql:
image: mysql:8.0
mem_limit: 2g
environment:
MYSQL_DATABASE: appdb
cr0x@server:~$ docker inspect -f 'mem={{.HostConfig.Memory}}' db-mysql
mem=2147483648
Meaning: Compose config matches runtime here. In many environments it won’t, because Swarm/Kubernetes/other tooling may override.
Decision: Always trust runtime inspection over config files. Config files are aspirations.
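If you do declare limits in Compose, declare them where your tooling actually reads them. A sketch; which field is honored depends on your Compose/orchestrator version, which is exactly why the docker inspect check above stays in the playbook:
services:
  db-mysql:
    image: mysql:8.0
    mem_limit: 2g
    deploy:
      resources:
        limits:
          memory: 2g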
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “The host has 64 GB, we’re fine”
A mid-sized SaaS team moved a MySQL primary into Docker to standardize deployment. The host had plenty of RAM. They set a 4 GB container limit because “the dataset is small,” and copied their old innodb_buffer_pool_size from a VM: 3 GB. It worked. For weeks.
Then a marketing campaign hit. Connections spiked, sort buffers got used, and suddenly latency went from “fine” to “what happened to our product.” CPU didn’t peg. Disk wasn’t saturated at first. The app servers were timing out.
The first responders looked at query logs, because that’s what you do when you’re tired. They found nothing dramatic. They restarted the container and it got better. That restart also cleared accumulated memory fragmentation and connection-level buffers. It was a placebo with side effects.
Finally someone checked the kernel logs and saw cgroup reclaim pressure and occasional OOM kills of helper processes. MySQL wasn’t always dying; it was just living on the edge and forcing the kernel to constantly steal cache. Reads went up, checkpointing got uglier, and every IO wait turned into a user-visible latency spike.
The fix was painfully boring: reduce buffer pool to leave real headroom, cap connections with a pooler, and raise the container limit to match the actual concurrency profile. The dataset wasn’t the issue. The workload was.
2) Optimization that backfired: “Let’s crank work_mem, it’s faster”
A data-heavy service ran PostgreSQL in containers. A well-meaning engineer saw sorts spilling to disk in EXPLAIN (ANALYZE), and increased work_mem from 4MB to 128MB. Benchmarks improved. Everyone high-fived and went back to Slack.
Two weeks later, an incident. Not during a deploy—worse. During a normal Tuesday. A batch job kicked off, ran several parallel queries, and each query used multiple sort/hash nodes. Multiply that by connection count. Multiply again by parallel workers. Suddenly the container hit memory pressure and started reclaiming aggressively.
The database didn’t crash. It just got slow. Really slow. The batch job slowed down, held locks longer, and blocked user-facing transactions. That drove more retries from the application layer, which increased concurrency further. Classic positive feedback loop, except nobody was positive.
They rolled back work_mem, but performance stayed weird because autovacuum had fallen behind during the chaos. Once vacuum caught up, things normalized. The real fix was to size work_mem based on worst-case concurrency, not single-query benchmarks, and to isolate batch workloads with their own resource budgets.
Joke #2: Raising work_mem in a small container is like bringing a bigger suitcase to an airline gate—you’ll feel prepared until someone measures it.
3) The boring but correct practice that saved the day: budgets, headroom, and alarms
A different team ran both MySQL and PostgreSQL in Docker across environments. They had a rule: every DB container gets a “memory budget sheet,” a short document in the repo listing the limit, expected peak connections, and computed worst-case memory allocations.
They also had alarms on cgroup memory pressure events (v2) and on swap activity, not just “container memory percent.” The alert wasn’t “you’re at 92%.” It was “memory pressure high events rising fast.” That’s the difference between a warning and a fire alarm.
One day, a feature release increased concurrent report queries. Their dashboards flagged rising memory.events high while user latency was still acceptable. They throttled report concurrency at the app layer and scheduled a container limit increase for the next maintenance window.
No outage. No drama. The team got accused of “over-engineering” exactly once, which is how you know you’re doing it right.
Common mistakes: symptom → root cause → fix
1) Symptom: p99 latency spikes, CPU looks fine
Root cause: kernel reclaim pressure inside the cgroup; threads block in I/O or allocator paths.
Fix: check /sys/fs/cgroup/memory.events and memory.current. Reduce DB memory knobs or increase limit. Confirm major faults and I/O await.
2) Symptom: random disk read spikes after “tightening” memory limits
Root cause: page cache squeezed; database cache too small or too big relative to container; reclaim evicts useful cache.
Fix: leave headroom for page cache and non-buffer memory. For MySQL, don’t set buffer pool near the limit. For Postgres, don’t assume shared_buffers replaces page cache.
3) Symptom: database restarts with no clear DB error
Root cause: cgroup OOM kill. The DB didn’t “crash,” it was executed by the kernel.
Fix: check journalctl -k. Fix memory sizing, reduce concurrency, and avoid swap reliance.
4) Symptom: “It’s slow until we restart the container”
Root cause: accumulated memory pressure, connection bloat, cache churn, autovacuum backlog, or fragmentation; restart resets the symptoms.
Fix: measure memory pressure and DB internals. Add pooling, right-size buffers, and schedule vacuum/maintenance properly.
5) Symptom: Postgres temp files explode and disks fill
Root cause: insufficient work_mem for the query shape or bad plans; under memory pressure, spills become more frequent.
Fix: identify top temp-byte queries, tune indexes, and set work_mem based on concurrency. Consider query timeouts for runaway analytics on OLTP.
6) Symptom: MySQL uses “way more memory than innodb_buffer_pool_size”
Root cause: per-connection buffers, performance_schema, and allocator overhead; also OS page cache and filesystem metadata counted in cgroup.
Fix: cap connections, tune per-thread buffers, consider disabling or trimming performance_schema if appropriate, and leave headroom.
7) Symptom: container memory usage seems capped but host swap grows
Root cause: swap is allowed and the kernel is pushing cold pages out; the container “stays alive” but gets slower.
Fix: disable swap for the container or set swap limit equal to memory limit; ensure the workload fits in RAM.
8) Symptom: “We set mem_limit in Compose, but it doesn’t apply in prod”
Root cause: orchestrator overrides; Compose fields differ by mode; runtime config diverges.
Fix: inspect runtime settings and codify them in the actual deployment mechanism (Swarm/Kubernetes configs). Trust docker inspect, not YAML folklore.
Checklists / step-by-step plan
Step-by-step: stopping silent throttling for MySQL in Docker
- Confirm the real limit: docker inspect and /sys/fs/cgroup/memory.max.
- Reserve headroom: target at least 25–40% of the limit for non-buffer usage plus page cache in small (sub-8GB) containers. Yes, it feels conservative. It is conservative.
- Set innodb_buffer_pool_size to a budget, not a vibe: for a 2GB container, 768MB–1.2GB is often sane, depending on workload and connection pooling.
- Cap connections: reduce max_connections and add pooling at the app tier if possible.
- Audit per-thread buffers: keep sort/join/read buffers modest unless you understand worst-case concurrency.
- Watch temp disk tables: a rising count of on-disk temp tables means you’re paying I/O for memory decisions.
- Alert on cgroup pressure: track memory.events high/max/oom_kill.
- Re-test under peak concurrency: benchmarks with 10 connections are nice; production has 300 because someone forgot to close sockets.
Step-by-step: stopping silent throttling for PostgreSQL in Docker
- Confirm the limit: again, don’t argue with cgroups.
- Set shared_buffers moderately: in containers, huge values can crowd out everything else. Many OLTP workloads do fine with 256MB–2GB depending on budget.
- Make work_mem a concurrency-aware number: start small (4–16MB) and increase surgically for specific roles/queries if needed.
- Don’t let maintenance eat the box: size maintenance_work_mem so that autovacuum and index builds can’t starve the rest.
- Use pooling: Postgres connections are expensive, and in containers the overhead becomes more painful.
- Track temp_bytes and autovacuum lag: spills and vacuum debt are early warnings.
- Alert on memory pressure events: treat memory.events high as a performance SLO threat.
- Separate OLTP from analytics: if you can’t, enforce query timeouts and concurrency limits.
Container-level hygiene checklist (both databases)
- Pin memory limits intentionally: avoid “tiny limits for safety” without tuning. That’s not safety; it’s deferred outages.
- Decide on swap policy: for databases, “swap as emergency buffer” usually becomes “swap as permanent lifestyle.”
- Observe from the host and inside: host swap, host I/O, and cgroup events all matter.
- Keep restarts honest: if a restart “fixes” it, you have a leak, a backlog, a memory pressure cycle, or a caching mismatch. Treat it as a clue, not a cure.
FAQ
1) Is Docker “throttling” my database memory?
Not like CPU throttling. Memory limits create pressure and hard failures (reclaim, swap, OOM). The “throttling” is your database running slower because it can’t keep hot pages resident.
2) Why does performance improve after restarting the DB container?
Restarts reset caches, free accumulated per-connection memory, clear fragmentation, and sometimes let the kernel rebuild page cache. It’s like turning the radio off to fix a flat tire: the noise changes, the problem remains.
3) For MySQL, can I set innodb_buffer_pool_size to 80% of container memory?
Usually no. In small containers, you need headroom for per-connection buffers, background threads, performance schema, and some page cache. Start lower and prove you can afford more.
4) For Postgres, should shared_buffers be huge in containers?
Not by default. Postgres benefits from OS page cache, and containers still use the kernel cache within the same memory budget. Over-allocating shared_buffers can starve work_mem, autovacuum, and the page cache.
5) What’s the fastest signal of memory pressure in cgroup v2?
/sys/fs/cgroup/memory.events. If high is climbing during latency issues, you’re reclaiming. If oom_kill increments, you’re losing outright.
6) Should I disable swap for database containers?
If you care about consistent latency, yes—most of the time. Swap can prevent crashes but often turns into slow-motion outages. If you keep swap, monitor swap I/O and set realistic limits.
7) Why does disk look slow only when the DB is “busy”?
Because “busy” might mean “cache-starved.” With less cache, the DB does more real reads and writes more temp data. The disk didn’t get worse; you forced it to do more.
8) Can I fix this purely by increasing the container memory limit?
Sometimes. But if the DB is configured to expand into whatever it gets (too many connections, too-large work_mem, too-large buffers), it will eventually hit the new ceiling too. Pair limit increases with configuration discipline.
9) Which is more prone to container memory surprises: MySQL or PostgreSQL?
Different surprises. MySQL teams often oversize buffer pools and forget per-connection overhead. Postgres teams often underestimate how work_mem multiplies with concurrency. Both can “look fine” until they aren’t.
10) What if I’m using Kubernetes instead of plain Docker?
The principles are identical: cgroups, reclaim, OOM. The mechanics change (requests/limits, eviction behavior). Your job stays the same: align DB memory knobs with the real enforced limit and observe pressure signals.
Next steps
Do these in order. They’re designed to turn a vague “Docker is weird” complaint into a controlled system.
- Measure the real container limit and whether swap is permitted. Write it down where humans can find it.
- Check cgroup memory pressure events during a slow period. If high is rising, you have your smoking gun.
- Pick one database and build a memory budget: buffer/cache + per-connection/per-query + maintenance + overhead. Make it pessimistic.
- Reduce concurrency before you chase micro-optimizations: connection pooling, queueing, rate limits for analytics.
- Retune the DB knobs to the budget, not the other way around.
- Add alerts on pressure signals (not just usage percent). Pressure is what users feel.
- Load test with realistic concurrency. If your test can’t trigger pressure, it’s a unit test wearing a performance hat.
If you do all that, you don’t just stop silent throttling—you prevent it from coming back disguised as a “random” storage or network issue. And you get to sleep through the night, which is the real SLA.