Docker: Limit Log Spam at the Source — App Logging Patterns That Save Disks

Disk-full incidents rarely announce themselves politely. They show up as “can’t create file,” “database read-only,” or “node not ready,” and then you discover the culprit is a container that treated stdout like a diary.

Docker logging is simple by design: write to stdout/stderr, and the runtime takes it from there. The runtime also dutifully writes every byte—whether it’s useful, redundant, or a single error printed ten thousand times a minute. If you want to save disks (and your on-call sanity), the biggest wins happen at the source: inside the app, at the exact line where the log gets emitted.

Why “fix it in Docker” isn’t enough

Yes, you should configure Docker’s log rotation. It’s table stakes. But it’s also damage control. If your app logs like a panicked auctioneer, rotation merely caps the local file sizes: the app still burns CPU, still saturates I/O, still drowns your signal in noise, and still racks up centralized logging ingestion costs.

Container platforms encourage a particular sin: “just print everything to stdout.” It’s the path of least resistance and the path of maximum regret. A container runtime doesn’t know your intent. It can’t tell the difference between “user checkout failed” and “debug: loop iteration 892341.” It just writes bytes.

Source-limiting means: fewer log events are generated in the first place, and the ones that are generated are more compressible, more searchable, and more actionable. This is where application engineers and SREs meet in the hallway and agree on something: fewer, better logs beat more, worse logs.

Here’s the operational truth: log volume is a performance characteristic. Treat it like latency. Measure it. Budget it. Regressions should fail builds.

One idea that belongs on every logging PR review, paraphrasing Werner Vogels:

You build it, you run it. Ownership includes what your software does in production, including its noise.

Facts and context that explain today’s logging mess

These aren’t trivia. They’re the reasons your disks end up full of perfectly preserved nonsense.

  1. Unix treated logs as files first, streams second. Syslog and text files came before “everything to stdout.” Container logging flipped the default transport to streams.
  2. Docker’s original default logging driver (json-file) writes one JSON object per line. It’s human-friendly-ish, machine-ingestable, and dangerously easy to grow without limits.
  3. “12-factor” popularized stdout logging. Great for portability. But it didn’t come with a built-in discipline for volume control; that part is on you.
  4. Log aggregation vendors charge by ingest volume. Your CFO can now be impacted by a single logger.debug in a hot loop.
  5. Early microservice culture normalized “log everything; search later.” That worked when traffic was small and systems were few. At scale, it’s like saving every keystroke “just in case.”
  6. Structured logging was revived because grep stopped scaling. JSON logs are great—until you emit 20 fields for every request and triple your bytes.
  7. Containers made log persistence ambiguous. In VMs, you rotated files. In containers, you often don’t have a writable filesystem that you trust, so people dump to stdout and hope.
  8. High-cardinality labels and fields became a silent tax. Tracing IDs are good; adding unique user input as a field in every line is how you build a data lake by accident.

Fast diagnosis playbook

If you’re on-call and the node is screaming, you don’t have time for philosophy. You need to find the bottleneck quickly and decide whether it’s a logging issue, a runtime issue, or a storage issue.

First: confirm the symptom and blast radius (disk vs I/O vs CPU)

  • Disk pressure: df shows the filesystem near 100%.
  • I/O pressure: elevated await times, high write IOPS, slow application responses.
  • CPU pressure: log serialization and JSON formatting can burn CPU, especially with stack traces and large objects.

Second: identify the top talker (container, process, or host agent)

  • Look for the largest Docker log files and the containers tied to them.
  • Check whether a log shipper (Fluent Bit, Filebeat, etc.) is amplifying the problem with retries/backpressure loops.
  • Confirm if the application is repeating the same message; if yes, rate-limit or dedupe at the app layer.

Third: decide the fastest safe mitigation

  • Emergency mitigation: stop/restart the worst offender, cap logs with rotation if missing, reduce log level via config flag, or temporarily sample logs.
  • Post-incident fix: change logging patterns so the incident can’t recur from a single code path.

Joke #1: If your disk fills up with logs, congratulations—you’ve invented a very expensive, very slow database with no indexes.

Practical tasks: commands, outputs, decisions

These are the kinds of checks you do on a real host at 02:13. Each task includes the command, what the output means, and the decision you make from it.

Task 1: Confirm disk pressure and which filesystem is affected

cr0x@server:~$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/nvme0n1p2                      220G  214G  2.9G  99% /
tmpfs                               32G     0   32G   0% /dev/shm

Meaning: Root filesystem is essentially full. Containers and their logs often live under /var/lib/docker on /.

Decision: Do not start “cleanup scripts” blindly. Identify what’s consuming space first; avoid deleting runtime state unless you accept downtime.

Task 2: Find the biggest directories under Docker storage

cr0x@server:~$ sudo du -xhd1 /var/lib/docker | sort -h
8.4G    /var/lib/docker/image
11G     /var/lib/docker/containers
12G     /var/lib/docker/overlay2
32G     /var/lib/docker

Meaning: containers is big enough to suspect log growth. overlay2 may also be large due to writable layers.

Decision: Drill into /var/lib/docker/containers to find large log files and map them to containers.

Task 3: Locate the largest container log files

cr0x@server:~$ sudo find /var/lib/docker/containers -name "*-json.log" -printf "%s %p\n" | sort -n | tail -5
2147483648 /var/lib/docker/containers/2c1c3e.../2c1c3e...-json.log
3221225472 /var/lib/docker/containers/7a8b9c.../7a8b9c...-json.log
4294967296 /var/lib/docker/containers/aa0bb1.../aa0bb1...-json.log

Meaning: You’ve got multi-gigabyte JSON log files. That’s not “some debug logs,” that’s a log firehose.

Decision: Identify the containers behind these IDs and inspect what they’re emitting.

Task 4: Map a container ID to a name and image

cr0x@server:~$ docker ps --no-trunc --format "table {{.ID}}\t{{.Names}}\t{{.Image}}" | grep aa0bb1
aa0bb1d3f0e9c1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d5e6f70819aa  payments-api  registry.local/payments-api:3.14.2

Meaning: The payments API container is generating the massive log file.

Decision: Inspect recent logs and look for repetition patterns (same line, same stack trace, same request path).

Task 5: Sample the latest logs without dumping the entire file

cr0x@server:~$ docker logs --tail 50 payments-api
{"level":"error","msg":"db timeout","tenant":"blue","path":"/charge","retry":1}
{"level":"error","msg":"db timeout","tenant":"blue","path":"/charge","retry":1}
{"level":"error","msg":"db timeout","tenant":"blue","path":"/charge","retry":1}

Meaning: Repeated identical errors. Likely a retry loop logging every attempt.

Decision: Mitigate now by reducing log level or rate-limiting that specific message. Then fix code: log once per failure window, not once per retry.

Task 6: Check the container logging driver and options

cr0x@server:~$ docker inspect -f '{{.HostConfig.LogConfig.Type}} {{json .HostConfig.LogConfig.Config}}' payments-api
json-file {}

Meaning: No max-size or max-file options are set on this container, so the json-file log can grow without bound.

Decision: Fix runtime configuration (compose/systemd/daemon.json) but don’t stop there; the app is still emitting too much.

Task 7: Verify the daemon-wide logging defaults

cr0x@server:~$ sudo cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Meaning: If this is present, new containers should rotate. Existing containers may have been created before these defaults, or overridden per-container.

Decision: Standardize creation paths. Ensure compose stacks or orchestrator definitions don’t override limits.

Task 8: Identify high write I/O caused by logging

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.31    0.00    6.44   38.27    0.00   42.98

Device            r/s     rkB/s   rrqm/s  %rrqm   r_await  w/s     wkB/s   w_await aqu-sz  %util
nvme0n1          2.1      86.3     0.0     0.0     3.21   912.4  18432.0  42.10   39.2   98.7

Meaning: Extremely high write utilization and high w_await. Logging to disk can dominate device time.

Decision: Reduce log volume now. If you keep writing at this rate, the disk becomes the bottleneck for everything.

Task 9: Confirm which processes are writing heavily

cr0x@server:~$ sudo pidstat -d 1 3
Linux 6.5.0 (server)  01/03/2026  _x86_64_  (16 CPU)

01:12:11      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
01:12:12        0      2471      0.00  18240.00      0.00  dockerd
01:12:12        0     19382      0.00   3100.00      0.00  fluent-bit

Meaning: Docker daemon is writing huge volumes (container logs). The shipper is also writing/handling a lot.

Decision: Address the source container first; then tune shipper buffering/retry to avoid feedback loops.

Task 10: Check whether the application is logging stack traces repeatedly

cr0x@server:~$ docker logs --tail 200 payments-api 2>&1 | grep -c "Traceback\|Exception\|stack"
147

Meaning: Frequent stack traces are costly in bytes and CPU. Often the same exception repeating.

Decision: Log one stack trace per unique error per time window; emit counters/metrics for the rest.

Task 11: Identify log event rate (lines per second) from the raw log file

cr0x@server:~$ cid=$(docker inspect -f '{{.Id}}' payments-api); sudo sh -c "tail -n 20000 /var/lib/docker/containers/$cid/${cid}-json.log | wc -l"
20000

Meaning: The file has at least 20,000 recent lines. To turn that into a rate, compare the json-file “time” field on the first and last of those lines: if they span only a few seconds, you’re flooding; if they span minutes, it’s still too chatty.

Decision: Set a budget: e.g., steady state < 50 lines/sec per instance; bursts allowed only with sampling and caps.

Task 12: Measure how fast the log file is growing

cr0x@server:~$ cid=$(docker inspect -f '{{.Id}}' payments-api); sudo sh -c "stat -c '%s %y' /var/lib/docker/containers/$cid/${cid}-json.log; sleep 5; stat -c '%s %y' /var/lib/docker/containers/$cid/${cid}-json.log"
4294967296 2026-01-03 01:12:41.000000000 +0000
4311744512 2026-01-03 01:12:46.000000000 +0000

Meaning: ~16 MB in 5 seconds (~3.2 MB/s). That will fill disks fast and choke I/O.

Decision: Immediate mitigation: reduce level, disable noisy component, restart with env flag. Longer term: implement throttling/dedup.

Task 13: Check if the container is restarting due to crash-loop plus verbose startup logs

cr0x@server:~$ docker inspect -f '{{.State.Status}} {{.RestartCount}}' payments-api
running 47

Meaning: Many restarts. Each restart can re-emit large banners/config dumps, multiplying noise.

Decision: Fix the crash root cause and suppress verbose startup dumps; make “one-time startup info” truly one-time.

Task 14: Verify whether log shippers are backpressured and retrying (amplifying noise)

cr0x@server:~$ docker logs --tail 50 fluent-bit
[warn] [output:es:es.0] HTTP status=429 URI=/_bulk
[warn] [engine] failed to flush chunk '1-173587...' retry in 8 seconds

Meaning: Downstream is throttling. Your logs aren’t just filling disk; they’re also causing a retry storm and memory/disk buffering.

Decision: Reduce volume at the app, then tune shipper buffering and consider dropping low-value logs under pressure.

Task 15: Inspect a suspicious message frequency (top repeated lines)

cr0x@server:~$ docker logs --tail 5000 payments-api 2>&1 | jq -r '.msg' | sort | uniq -c | sort -nr | head
   4821 db timeout
    112 cache miss
     45 payment authorized

Meaning: One message dominates. That’s a perfect target for deduplication and rate limiting.

Decision: Replace per-event error logs with: (a) a periodic summary, (b) metrics counter, (c) one sampled exemplar with context.

App logging patterns that actually reduce volume

This is the meat: coding patterns and operational contracts that prevent log spam. You can implement them in any language; the principles don’t care about your framework.

1) Stop logging per retry attempt; log per outcome window

Retries are normal. Logging every retry is not. If a dependency is down, a retry loop can create a perfect log amplifier: failure causes retries, retries cause logs, logs cause I/O pressure, I/O pressure causes more timeouts, timeouts cause more failures.

Do this: log the first failure with context; then rate-limit follow-ups; then emit a summary every N seconds: “db timeout continuing; suppressed 4,821 similar errors.”

A good pattern (a code sketch follows the list):

  • One “exemplar” error with stack trace and request metadata (but see the privacy notes below).
  • Counter metric for every failure event.
  • Periodic log summary per dependency, per instance.
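
A minimal sketch of the pattern in Python, using only the standard library. The window length, logger name, and call_database stand-in are illustrative assumptions, not prescriptions:

import logging
import time

log = logging.getLogger("payments.db")

class OutcomeWindowLogger:
    """Log the first failure in each window; count and summarize the rest."""

    def __init__(self, window_seconds=30.0):
        self.window_seconds = window_seconds
        self.window_start = float("-inf")
        self.suppressed = 0

    def failure(self, message, exc=None):
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            if self.suppressed:
                log.error("%s continuing; suppressed %d similar errors",
                          message, self.suppressed)
            self.window_start = now
            self.suppressed = 0
            log.error(message, exc_info=exc)   # one exemplar per window, with trace
        else:
            self.suppressed += 1               # counted, not written

def call_database():
    # Stand-in for the real dependency call; always times out in this sketch.
    raise TimeoutError("simulated db timeout")

db_errors = OutcomeWindowLogger(window_seconds=30.0)

def charge_with_retries(attempts=5):
    for attempt in range(attempts):
        try:
            return call_database()
        except TimeoutError as exc:
            db_errors.failure("db timeout", exc)   # logged once per window, not per retry
            time.sleep(min(2 ** attempt, 10))      # backoff still happens
    raise RuntimeError("db unavailable after retries")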

2) Pick a steady-state log budget and enforce it

Most teams argue about log formats. Better argument: log rate budgets. For example:

  • Per service instance steady-state: < 1 KB/s average log throughput.
  • Allowed burst: up to 50 KB/s for 60 seconds during incidents.
  • Above burst: sample at 1%, keep error exemplars, drop debug/info.

This gives SREs a crisp SLO-style threshold and gives app teams a target they can test. Add a CI check that runs a synthetic load test and fails if logs exceed budget.
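
Here is one way to make the budget testable, sketched in Python. The formatter, request rate, and thresholds are assumptions to replace with your own numbers; the idea is simply to measure log bytes per request and project them against the budget:

import io
import logging

def log_bytes_for(fn, fmt="%(levelname)s %(name)s %(message)s"):
    """Capture everything fn() logs through the root logger; return the byte count."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter(fmt))   # match your production format
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        fn()
    finally:
        root.removeHandler(handler)
    return len(buffer.getvalue().encode("utf-8"))

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)     # assume production runs at info

    def one_request():
        # Stand-in for a single request's code path; call your real handler here.
        logging.getLogger("payments").info("payment authorized")

    budget_bytes_per_sec = 1024    # the 1 KB/s steady-state budget from the list above
    expected_rps = 200             # assumed steady-state request rate
    per_request = log_bytes_for(one_request)
    projected = per_request * expected_rps
    print(f"{per_request} B per request -> {projected} B/s vs {budget_bytes_per_sec} B/s budget")
    if projected > budget_bytes_per_sec:
        # Even one info line per request busts a 1 KB/s budget at 200 rps: that is the point.
        raise SystemExit("log budget regression: fail the build")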

3) Default to structured logs, but don’t over-structure

JSON logs are the standard in container land. They’re also easy to abuse. Every extra field costs bytes. Some fields also cost indexing dollars.

Keep: timestamp, level, message, service name, instance ID, request ID/trace ID, latency, status code, dependency name, error class.

Avoid: full request bodies, unbounded arrays, raw SQL strings, and high-cardinality labels copied into every line.
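
A small sketch of the allowlist idea: build the event from an approved field set so new fields have to be argued for rather than silently added. The field names follow the list above; the json-lines emitter is an assumption:

import json
import sys
import time

ALLOWED_FIELDS = {
    "level", "msg", "service", "instance", "request_id", "trace_id",
    "latency_ms", "status", "dependency", "error_class",
}

def log_event(level, msg, **fields):
    # Drop anything not on the allowlist instead of letting every caller add bytes.
    event = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    event.update({"ts": time.time(), "level": level, "msg": msg})
    sys.stdout.write(json.dumps(event, separators=(",", ":")) + "\n")

log_event("info", "charge completed",
          request_id="req-123", latency_ms=42, status=200,
          raw_body="...big payload...")   # dropped: not on the allowlist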

4) Don’t log inside tight loops unless you throttle

Loops show up everywhere: polling, consuming queues, scanning directories, retrying locks, connection health checks. If you log in a loop, you have created a future incident. Not a possibility; a scheduled appointment.

Rule: any log statement that can run more than once per second in steady-state must be behind a rate limiter, a state-change gate, or both.
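
A minimal gate that enforces that rule, in Python. The 5-second interval and the hypothetical queue_client are illustrative:

import logging
import time

log = logging.getLogger("poller")

class LogGate:
    """Allow at most one log per `interval` seconds for a given call site."""

    def __init__(self, interval=5.0):
        self.interval = interval
        self.last_emit = float("-inf")

    def allow(self):
        now = time.monotonic()
        if now - self.last_emit >= self.interval:
            self.last_emit = now
            return True
        return False

poll_gate = LogGate(interval=5.0)

def poll_once(queue_client):
    item = queue_client.fetch()           # hypothetical client, polled many times per second
    if item is None and poll_gate.allow():
        log.debug("queue empty, polling again")   # at most once per 5s, not once per poll
    return item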

5) Log state changes, not state confirmations

“Still connected” every 5 seconds is a waste. “Connection restored after 42 seconds, suppressed 500 failures” is useful. Humans need transitions. Machines need counts.

Implement a simple state machine for dependency health (UP → DEGRADED → DOWN) and only emit logs on transitions and periodic summaries.
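
A sketch of that transition-only logger, simplified to two states (UP/DOWN). The failure threshold is an assumption; the point is that steady-state confirmations never reach the log:

import logging
import time

log = logging.getLogger("deps.db")

class DependencyHealth:
    """Log only when the dependency changes state, plus counts on recovery."""

    def __init__(self, name, failure_threshold=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.state = "UP"
        self.failures = 0
        self.down_since = None

    def record_failure(self):
        self.failures += 1
        if self.state == "UP" and self.failures >= self.failure_threshold:
            self.state = "DOWN"
            self.down_since = time.monotonic()
            log.error("%s DOWN after %d consecutive failures", self.name, self.failures)

    def record_success(self):
        if self.state == "DOWN":
            outage = time.monotonic() - self.down_since
            log.warning("%s restored after %.0fs, suppressed %d failures",
                        self.name, outage, self.failures)
        self.state = "UP"
        self.failures = 0
        self.down_since = None

db_health = DependencyHealth("postgres")
# Wrap each dependency call with record_failure()/record_success();
# "still connected" confirmations never hit the log.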

6) Use “dedupe keys” for repeated errors

Repeated errors often share a signature: same exception type, same dependency, same endpoint. Compute a dedupe key like:

  • dedupe_key = hash(error_class + dependency + path + error_code)

Then keep a small in-memory map per process: last-seen timestamp, suppressed count, and one exemplar payload (see the sketch after this list). Emit:

  • First occurrence: log normally.
  • Within window: increment suppressed count, maybe emit debug sampled.
  • End of window: log summary with suppressed count and one exemplar ID.
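
A sketch of that map in Python. The hash inputs mirror the dedupe_key bullet above; the window length and key truncation are assumptions:

import hashlib
import logging
import time

log = logging.getLogger("app.errors")

class DedupingErrorLogger:
    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.seen = {}   # dedupe_key -> {"first": ts, "suppressed": n}; cap/evict in real code

    @staticmethod
    def dedupe_key(error_class, dependency, path, error_code):
        raw = f"{error_class}|{dependency}|{path}|{error_code}"
        return hashlib.sha256(raw.encode()).hexdigest()[:16]

    def error(self, exc, dependency, path, error_code=""):
        key = self.dedupe_key(type(exc).__name__, dependency, path, error_code)
        now = time.monotonic()
        entry = self.seen.get(key)
        if entry is None or now - entry["first"] >= self.window_seconds:
            if entry and entry["suppressed"]:
                log.error("dedupe=%s window closed; suppressed %d similar errors",
                          key, entry["suppressed"])
            self.seen[key] = {"first": now, "suppressed": 0}
            log.error("dedupe=%s %s dependency=%s path=%s",
                      key, type(exc).__name__, dependency, path, exc_info=exc)
        else:
            entry["suppressed"] += 1   # counted, summarized when the window rolls over

errors = DedupingErrorLogger(window_seconds=60.0)
# errors.error(TimeoutError("db timeout"), dependency="postgres", path="/charge")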

7) Sample informational logs; never sample metrics

Sampling is a scalpel. Use it for high-volume, low-value event logs: per-request access logs, “cache miss,” “job started.” Keep errors mostly unsampled, but you can sample repeated identical errors after the first exemplar.

Metrics are for counting. A counter is cheap and precise. Don’t replace metrics with logs; it’s like replacing a thermometer with interpretive dance.

8) Make “debug mode” a circuit-breaker, not a level

Debug logs in production should be temporary, targeted, and reversible without redeploying. The safest approach:

  • Debug logs exist, but are off by default.
  • Enable debug for a specific request ID, user ID (hashed), or tenant for a limited time.
  • Auto-disable after TTL.

This avoids the classic mistake: “we enabled debug to investigate, forgot, and paid for it for a week.”
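
One possible shape for the toggle, sketched in Python. The file path, field names, and TTL mechanism are assumptions; use whatever dynamic-config channel you already trust:

import json
import logging
import time

log = logging.getLogger("payments")

class DebugToggle:
    """Debug only for matching request/tenant IDs, and only until the TTL expires."""

    def __init__(self, path="/etc/app/debug-toggle.json"):
        self.path = path

    def _load(self):
        try:
            with open(self.path) as fh:
                # e.g. {"tenant": "blue", "expires_at": 1767400000}
                return json.load(fh)
        except (OSError, ValueError):
            return None

    def enabled_for(self, tenant=None, request_id=None):
        # In production, cache the toggle for a few seconds instead of reading per request.
        toggle = self._load()
        if not toggle or time.time() > toggle.get("expires_at", 0):
            return False      # auto-disable after TTL: no redeploy, no forgotten debug mode
        return (toggle.get("tenant") == tenant or
                toggle.get("request_id") == request_id)

debug_toggle = DebugToggle()

def handle_charge(request_id, tenant):
    if debug_toggle.enabled_for(tenant=tenant, request_id=request_id):
        log.debug("charge debug: request=%s tenant=%s", request_id, tenant)
    # normal handling continues here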

9) Stop logging “expected errors” at ERROR

If a client cancels a request, that’s not an error; it’s a Tuesday. If a user enters a wrong password, that’s not a server error; it’s product reality. If you log these at error, you teach on-call to ignore ERROR. That’s how you miss the real outage.

Pattern:

  • Use info or warn for client-driven failures.
  • Use error for server-side failures requiring attention.
  • Use fatal rarely, and only when the process will exit.

10) Strip payloads; log pointers

Logging full request/response bodies is a disk-eater and a privacy trap. Instead (sketched in code after this list):

  • Log a payload size (bytes).
  • Log a content hash (to correlate repeats without storing content).
  • Log an object ID that can be fetched from a secure store if needed.
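
A sketch of “pointers, not payloads”. The secure_store call is a hypothetical placeholder for whichever access-controlled store you already run:

import hashlib
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

def store_payload(body: bytes) -> str:
    """Placeholder: persist the payload to a secure, access-controlled store."""
    object_id = str(uuid.uuid4())
    # secure_store.put(object_id, body)   # hypothetical call to your real store
    return object_id

def log_request(path: str, body: bytes):
    log.info("request received path=%s payload_bytes=%d payload_sha256=%s payload_ref=%s",
             path,
             len(body),                               # size, not content
             hashlib.sha256(body).hexdigest()[:16],   # correlate repeats without storing them
             store_payload(body))                     # pointer, fetchable if truly needed

log_request("/charge", b'{"card":"REDACTED","amount":1999}')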

11) Make stack traces opt-in and bounded

Stack traces can be valuable. They can also be 200 lines of noise repeated 10,000 times. Bound them (one way is sketched after this list):

  • Include stack traces for the first occurrence of a dedupe key per window.
  • Truncate stack depth where supported.
  • Prefer exception type + message + top frames for repeats.
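
A sketch using the standard traceback module; the five-frame limit is an illustrative choice:

import logging
import traceback

log = logging.getLogger("app.errors")

def log_bounded_exception(exc, first_occurrence: bool, max_frames: int = 5):
    if first_occurrence:
        # Full detail once per dedupe window, but cap the number of frames.
        frames = traceback.format_exception(type(exc), exc, exc.__traceback__,
                                            limit=max_frames)
        log.error("exemplar: %s", "".join(frames).rstrip())
    else:
        # Repeats: type + message only; a counter metric carries the volume.
        log.error("repeat: %s: %s", type(exc).__name__, exc)

try:
    {}["missing"]
except KeyError as exc:
    log_bounded_exception(exc, first_occurrence=True)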

12) Use a “once” logger for startup config

Startup logs often print full config, environment, feature flags, and dependency lists. That’s fine once. It’s chaos when the process crash-loops and prints it 50 times.

Pattern: log a compact startup summary and a config hash. Store detailed config elsewhere (or expose it via a protected endpoint), not in logs.
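
A sketch of the compact summary plus config hash; which keys go into the hash, and the version string, are up to you:

import hashlib
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("startup")

def log_startup_summary(config: dict, version: str):
    # Hash the full config so restarts can be compared without reprinting it.
    canonical = json.dumps(config, sort_keys=True, default=str).encode()
    config_hash = hashlib.sha256(canonical).hexdigest()[:12]
    log.info("starting version=%s config_hash=%s flags=%d",
             version, config_hash, len(config))

log_startup_summary({"db_pool": 20, "feature_x": True}, "3.14.2")
# One short line per start; a crash loop now costs bytes, not gigabytes.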

13) Treat logging as a dependency with backpressure

Most logging libraries pretend writes are free. They are not. When the output blocks (slow disk, blocked stdout pipe, log driver pressure), your app can stall.

Do this (a sketch follows the list):

  • Prefer async logging with bounded queues.
  • When queue is full, drop low-priority logs first.
  • Expose metrics: dropped logs, queue depth, logging time.
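
Python’s standard QueueHandler and QueueListener already provide the async part; the bounded queue, drop policy, and drop counter layered on top here are assumptions about policy, not library behavior:

import logging
import logging.handlers
import queue

class BoundedQueueHandler(logging.handlers.QueueHandler):
    """Never block the app on logging: drop low-priority records when the queue is full."""

    def __init__(self, queue_):
        super().__init__(queue_)
        self.dropped = 0          # expose this as a metric in real code

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            if record.levelno >= logging.ERROR:
                try:
                    self.queue.get_nowait()        # make room by discarding an older record
                    self.dropped += 1
                    self.queue.put_nowait(record)
                    return
                except (queue.Empty, queue.Full):
                    pass
            self.dropped += 1                      # low-priority record dropped under pressure

log_queue = queue.Queue(maxsize=10_000)            # bounded: memory cost is capped
listener = logging.handlers.QueueListener(
    log_queue, logging.StreamHandler(), respect_handler_level=True)
listener.start()                                   # call listener.stop() at shutdown

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(BoundedQueueHandler(log_queue))
root.info("async logging enabled; queue bounded, drops counted, writers never block")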

14) Make logs easier to compress

If you can’t reduce volume enough, at least make it compress well. Repetition compresses. Randomness doesn’t. Good logging (compare the two calls after this list):

  • Uses stable message templates: "db timeout" not "db timeout after 123ms on host a1b2" embedded in the message string.
  • Places variable data in fields, not in the message.
  • Avoids printing random UUIDs in every line unless they’re necessary for correlation.
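
The same event written both ways, assuming a structured formatter that serializes extra fields; the second form keeps the message template stable:

import logging

log = logging.getLogger("payments")
host, elapsed_ms = "a1b2", 123

# Poor: variable data baked into the message, so every line is unique and compresses badly.
log.error(f"db timeout after {elapsed_ms}ms on host {host}")

# Better: stable template in the message, variables carried as fields.
# (A JSON/structured formatter is assumed to serialize the extra fields.)
log.error("db timeout", extra={"elapsed_ms": elapsed_ms, "db_host": host})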

15) Add a “log fuse” for emergencies

Sometimes you need a kill switch: “If logs exceed X lines/sec for Y seconds, automatically raise sampling and suppress repetitive INFO/WARN.” This is not pretty, but it beats a disk-full outage.

Implement it with a local counter and a moving window. When it trips, emit one loud log: “log fuse engaged; sampling now 1%; suppressed N lines.”
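
A sketch of the fuse as a logging filter; the line-rate threshold, window, and 1% sampling are the X and Y knobs from the paragraph above, set to illustrative values:

import logging
import random
import time

class LogFuse(logging.Filter):
    """Trip when throughput exceeds a threshold; then sample INFO/WARN hard."""

    def __init__(self, max_lines_per_sec=200, window_seconds=5.0, sample_rate=0.01):
        super().__init__()
        self.max_lines = max_lines_per_sec * window_seconds
        self.window_seconds = window_seconds
        self.sample_rate = sample_rate
        self.window_start = time.monotonic()
        self.count = 0
        self.suppressed = 0
        self.tripped = False

    def filter(self, record):
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            suppressed, was_tripped = self.suppressed, self.tripped
            # Reset state first so the summary below passes back through cleanly.
            self.window_start, self.count, self.suppressed = now, 0, 0
            self.tripped = False
            if was_tripped and suppressed:
                logging.getLogger("logfuse").error(
                    "log fuse engaged; sampling at %.0f%%; suppressed %d lines",
                    self.sample_rate * 100, suppressed)
        self.count += 1
        if self.count > self.max_lines:
            self.tripped = True
        if self.tripped and record.levelno < logging.ERROR:
            if random.random() >= self.sample_rate:
                self.suppressed += 1
                return False          # dropped: the fuse is doing its job
        return True

handler = logging.StreamHandler()
handler.addFilter(LogFuse(max_lines_per_sec=200, window_seconds=5.0, sample_rate=0.01))
root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(handler)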

Joke #2: Logging is like coffee—small amounts improve performance, but too much turns your system into a jittery mess that won’t stop talking.

Three corporate mini-stories (anonymized, plausible, and technically accurate)

Mini-story 1: The incident caused by a wrong assumption

They assumed Docker rotated logs by default. The team had moved from a VM-based setup where logrotate was everywhere, and they treated the container runtime as a modern replacement with sane defaults.

During a partner integration test, a service started failing authentication. The service had a retry policy that was fairly normal: exponential backoff with jitter. But the developer had added an error log inside the retry loop to “make it visible.” It was visible. It was also relentless.

The first sign of trouble wasn’t an alert about disk usage. It was a database node complaining about slow queries. The host running the chatty container had its root disk near full, and I/O latency had gone sideways. The logging driver was writing JSON lines like a metronome.

On-call did what humans do: restarted the service. That temporarily reduced log volume because it bought a few seconds before the retries ramped up again. They restarted again. Same result. Meanwhile, the log shipper was retrying ingestion because downstream was throttling, which added another layer of write churn.

The fix was embarrassingly simple: add log rotation at the Docker level and change the app to log only the first failure per window, then summarize. The lesson was sharper: “assumed defaults” are not a reliability strategy.

Mini-story 2: The optimization that backfired

A different org wanted “perfect observability.” They added structured logging everywhere, which is good. Then they decided every log line should include the full request context for easier debugging: headers, query params, and a chunk of the body.

It worked beautifully in staging. In production, it became an ingestion-cost bonfire. Worse, it became a performance problem: JSON serialization of large objects for every request ate CPU, and the container runtime dutifully wrote bigger lines. Latency increased, which created more timeouts, which created more errors, which created even bigger stack traces. A classic feedback loop.

The on-call symptom looked like a capacity problem: “We need bigger nodes.” But the real bottleneck was self-inflicted I/O and CPU pressure caused by logging. When they reduced the payload logging and moved to “log pointers” (request ID, payload hash, payload size), the system stabilized without changing instance sizes.

The optimization was “reduce debugging time by logging everything.” The backfire was “increase incident rate and cost by logging everything.” The happy ending is that they kept structured logs—just not the parts that belonged in a secure trace store.

Mini-story 3: The boring but correct practice that saved the day

A finance-adjacent service handled periodic batch jobs. Nothing sexy. The team had a practice that felt almost old-fashioned: every service had a written log budget and a test that measured log throughput under load. If a change increased logs beyond budget, the build failed unless the engineer justified it.

One Friday, a dependency started returning intermittent 500s. The service retried, but the log patterns were already rate-limited and deduplicated. They emitted one exemplar error with a trace ID, then one summary every 30 seconds: “dependency errors continuing; suppressed N.” Metrics counters spiked, alerts fired, but disks stayed calm.

While other teams were fighting disk pressure and drowning in repeated stack traces, this service remained readable. On-call could see what changed (dependency behavior), quantify it (metrics), and correlate it (trace IDs). The incident was still annoying, but it didn’t metastasize into a node-level outage.

Afterward, nobody wrote a grand internal post about “the heroics.” It was boring. That’s the point. Boring reliability practices age well.

Common mistakes: symptoms → root cause → fix

1) Symptom: Docker logs grow without bound

Root cause: json-file driver without max-size/max-file, or containers created before defaults were set.

Fix: Set daemon defaults and enforce per-service config. Recreate containers to pick up limits. Still fix the app so it doesn’t generate junk.

2) Symptom: Disk full after a dependency outage

Root cause: retry loop logs every attempt (often with stack traces).

Fix: log first failure + summary; count retries as metrics; rate-limit logs per dedupe key; add circuit breakers.

3) Symptom: On-call ignores ERROR because it’s always noisy

Root cause: expected client events logged as error (timeouts due to cancels, 4xx responses, validation failures).

Fix: fix severity mapping and alerting rules; reserve ERROR for actionable server-side faults.

4) Symptom: High CPU with no obvious business load increase

Root cause: expensive log formatting (string interpolation, JSON serialization of large objects) on hot paths.

Fix: lazy logging (only format if enabled), avoid serializing full objects, precompile message templates, sample low-value logs.

5) Symptom: Log shipper shows retries, memory growth, or dropped chunks

Root cause: downstream ingestion throttling plus high upstream log volume; shipper buffering amplifies disk usage.

Fix: reduce app log volume; configure shipper backpressure and drop policies; prioritize error exemplars and summaries.

6) Symptom: “We can’t find the relevant lines” during incidents

Root cause: missing context fields (request ID, service version, dependency name) and too much repetitive noise.

Fix: add essential context fields; dedupe repetitive logs; log state transitions; keep messages consistent.

7) Symptom: Sensitive data shows up in logs

Root cause: request/response payload logging, header dumps, or exception messages containing secrets.

Fix: redact at source, stop logging payloads, add allowlists for fields, audit logs automatically, treat logs as production data.

8) Symptom: “Fix” was to increase disk size, but problem returns

Root cause: capacity band-aid; no change to emission patterns.

Fix: implement log budgets, enforce rate limiting, and add regression tests for log volume.

Checklists / step-by-step plan

Step-by-step: stop the bleeding during an active incident

  1. Confirm disk usage: df -h. If root is > 95%, treat as urgent.
  2. Find top log files: find /var/lib/docker/containers -name "*-json.log" sorted by size.
  3. Map file → container: docker ps --no-trunc and docker inspect.
  4. Identify repetition: sample recent logs; check top repeated messages.
  5. Mitigate fast: temporarily reduce log level, enable sampling, or disable noisy component. If needed, restart the container with safer settings.
  6. Restore headroom: once emission stops, remove or truncate only the worst offender’s log file if you accept losing logs. Prefer rotation and controlled restarts over manual deletion.
  7. Confirm I/O recovery: iostat and service latencies should normalize.

Step-by-step: prevent recurrence (what to do after the incident)

  1. Set Docker defaults: max-size and max-file in /etc/docker/daemon.json.
  2. Audit per-service overrides: compose files, systemd units, orchestrator specs.
  3. Instrument log volume: track lines/sec and bytes/sec per service instance.
  4. Implement dedupe + rate limits: per error signature, per dependency.
  5. Replace spam with summaries: periodic rollups, plus exemplars.
  6. Move bulk context to traces: keep logs lean; use request IDs to pivot.
  7. Add a CI guardrail: load test and fail on log budget regressions.
  8. Do a privacy review: redact, allowlist, and verify that secrets can’t leak.

Operational checklist: what “good” looks like

  • ERROR logs are rare, actionable, and not dominated by one repeated line.
  • Info logs are sampled or limited on hot paths (requests, queue consumers).
  • Every service has a log budget and a known steady-state log rate.
  • Every repeating error has a dedupe key, a suppression window, and a summary line.
  • Logs contain the context you need to correlate (trace/request ID, version), not the data you shouldn’t store.
  • When ingestion is throttled, the system degrades gracefully (drops low-value logs first).

FAQ

1) Should I just change the Docker logging driver to fix this?

No. Changing drivers can help with rotation, shipping, or performance characteristics, but it does not fix an application that emits junk. Fix emission first; then pick the driver based on operational needs.

2) Is logging to stdout always the right approach in containers?

It’s the standard approach, not automatically the right one. Stdout is fine if you treat it as a constrained channel with budgets, sampling, and rate limits. If you need durable local logs, use a volume and manage rotation—but that’s a deliberate decision, not an accident.

3) What log level should production run at?

Typically info or warn, with targeted debug toggles. If you need debug constantly to operate, you likely have missing metrics, missing traces, or missing structured context.

4) How do I convince teams to stop logging request bodies?

Tell them the truth: it’s a reliability and security risk. Offer an alternative: log request IDs, payload sizes, hashes, and store detailed payloads in a secure, access-controlled system if truly needed.

5) What’s the simplest rate-limiting approach in an app?

A per-message (or per-dedupe-key) time window: log the first occurrence, then suppress for N seconds while counting suppressions, then emit a summary.

6) Won’t sampling make debugging harder?

Sampling makes debugging possible when the alternative is drowning. Keep exemplars (first occurrence, unique signatures) and retain metrics counters for completeness. You can’t debug what you can’t read.

7) How do I detect log spam before it takes down a node?

Alert on log growth rate (bytes/sec) and on sudden changes in top repeated messages. If you only alert on “disk > 90%,” you’ll find out too late.

8) Why do repeated stack traces hurt so much?

They’re large, slow to format, and often identical. They waste CPU and disk, and they ruin search signal. Keep one exemplar per window; count the rest.

9) Can I safely delete a giant *-json.log file to recover space?

Sometimes, but it’s a sharp tool. Deleting a file that a process still has open may not reclaim space until the handle is closed. Prefer rotation, container restart, or controlled truncation during an incident—then fix the underlying emission.

10) How do I keep logs useful while reducing volume?

Make logs eventful: state transitions, summaries, exemplars. Push high-volume detail into metrics (counts) and traces (rich per-request context). Logs should explain incidents, not recreate them.

Next steps you can do this week

If you only do one thing, do this: remove logging from retry loops and replace it with deduped exemplars plus periodic summaries. That single pattern prevents an entire class of disk-full and I/O-thrash incidents.

Then:

  1. Set Docker log rotation defaults and verify every container actually inherits them.
  2. Define a log budget per service and measure lines/sec and bytes/sec under load.
  3. Implement rate limiting and dedupe keys for repeated errors and hot-path info logs.
  4. Stop logging payloads; log pointers and hashes instead.
  5. Add a “log fuse” so one bad deploy can’t take down a node by talking too much.

You don’t win reliability by writing more logs. You win it by making the logs you keep worth the bytes they occupy.
