Email Queue Keeps Growing — Find the Culprit Before Mail Stops

The queue is climbing. Not spiking—climbing. That slow, steady incline that says “something is broken, but politely.”
Meanwhile users report “emails are delayed,” marketing is about to launch a campaign, and your CEO’s calendar invites are arriving fashionably late.

A growing mail queue is rarely the problem. It’s the most visible symptom of a bottleneck: DNS, TLS handshakes, remote rate limits,
local disk latency, a resolver meltdown, a reputation issue, or a well-meant config change that turned your MTA into a single-lane bridge.
This is how you find the culprit before mail stops—and before you end up learning about your own outage via a forwarded screenshot.

How queues grow (and why they keep growing)

An SMTP server is a conveyor belt. Inbound mail arrives; outbound mail is attempted; failures are deferred; retries happen later.
When deliveries fail faster than they recover—or when you can’t attempt deliveries fast enough—the queue grows.

A queue growth curve tells you the class of problem:

  • Linear growth: you’re delivering some mail, but throughput is below incoming rate. Think remote throttling, too-low concurrency, or storage slowness.
  • Step increase then plateau: a temporary outage or DNS/TLS flake. It backs up, then drains when fixed.
  • Explosive growth: widespread failures (resolver down, outbound blocked, disk full), or a mailstorm (loop, misconfigured app, compromised account).
  • “Deferred” dominates: you are attempting deliveries, but remote systems reject or time out.
  • “Active” dominates: local processing is stuck—often disk, locks, content filtering, or a dead child process.

Most MTAs are conservative by design: they won’t hammer remote servers endlessly, and they’ll keep trying for hours or days.
That’s good for reliability. It’s also how you end up with 600k messages waiting while the underlying problem quietly persists.

Here’s the operational truth: your goal isn’t “clear the queue.” Your goal is “restore stable throughput” so the queue drains naturally.
If you focus on deleting messages first, you’ll delete evidence and keep the bottleneck.

Fast diagnosis playbook

When the queue grows, you don’t have time for a leisurely archaeology dig. You need a tight sequence that separates “remote rejects”
from “local stuck” from “network/DNS/TLS” in minutes.

First: classify the backlog (deferred vs active, oldest age, dominant reason)

  • Is the queue mostly deferred? That’s often network/DNS/TLS/remote policy.
  • Is it mostly active? That’s often local resource contention (disk, CPU, content filter), or a wedged process.
  • What’s the oldest message age? If it’s days old, you have a persistent failure mode or a retry schedule too long for your business.

Second: confirm local health (disk, I/O latency, inode exhaustion, resolver)

  • Check disk space and inode usage where the queue lives.
  • Check iowait and storage latency; mail queues are small-file write heavy.
  • Validate DNS resolution speed and correctness from the MTA host.

Third: confirm outbound path (port 25 egress, TLS handshakes, remote rate limits)

  • Test basic TCP connectivity to known-good destinations on port 25.
  • Check for TLS failures and timeouts; these can serialize delivery attempts.
  • Look for “421 Try again later”, “4.7.0 rate limited”, or “temporarily deferred” patterns.

Fourth: confirm you aren’t the problem (loops, floods, bad retries)

  • Identify top senders and recipients in the queue. One app can poison the well.
  • Look for duplicates, auto-replies, or bounce loops.
  • Check if the queue files are churning (high write rate) without delivery progress.

Fifth: adjust safely (increase concurrency only after you fix the bottleneck)

Turning up concurrency before fixing DNS or disk is like adding lanes to a highway that ends at a drawbridge.
You’ll just get more cars waiting faster.
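
If (and only if) the bottleneck is gone and you decide to raise throughput, keep the changes small and reversible. A minimal Postfix sketch, with illustrative values rather than recommendations:

# main.cf — modest global concurrency, plus a per-destination cap so one
# throttling domain can't absorb every delivery slot.
default_destination_concurrency_limit = 30
smtp_destination_concurrency_limit = 10
# Optional pause between deliveries to the same destination (0s = disabled).
smtp_destination_rate_delay = 0s

# Apply one change at a time and watch the drain slope:
#   sudo postconf -e 'smtp_destination_concurrency_limit = 10'
#   sudo postfix reload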

Interesting facts and historical context

  • SMTP predates most “enterprise email”: RFC 821 (1982) standardized SMTP when the internet was still mostly research networks.
  • “Store-and-forward” is the whole point: early networks were unreliable, so queueing and retries were designed in from day one.
  • Mail queues are file-system heavy by nature: classic MTAs store each message as one or more files; this makes disk latency a first-class factor.
  • DNS became a critical dependency later: MX records and DNS-based routing shifted delivery from static host tables to a resolver-dependent system.
  • Greylisting popularized deferrals: in the early 2000s, many servers used “try again later” tactics to reduce spam, teaching MTAs patience.
  • TLS added CPU and handshake complexity: STARTTLS improved privacy, but introduced new failure modes—timeouts, cipher mismatches, bad intermediates.
  • Anti-spam turned mail delivery into policy: modern “4xx deferrals” often reflect reputation scoring, not broken infrastructure.
  • Queue IDs became forensic artifacts: the ability to trace a single message across retries is one of the most useful operational features in MTAs.

One idea worth keeping on a sticky note near your terminal, paraphrased from Richard Cook: “Success makes systems look simple; failure reveals the messy reality of how work actually happens.”

Practical tasks: commands, outputs, decisions

The tasks below assume a Linux mail server. Most examples use Postfix because it’s common, but the thinking translates to Exim and Sendmail.
Each task includes: command, sample output, what it means, and what decision you should make next.

Task 1: Measure queue size and trend (Postfix)

cr0x@server:~$ mailq | tail -n 1
-- 5234 Kbytes in 412 Requests.
cr0x@server:~$ mailq | tail -n 1
-- 18452 Kbytes in 1923 Requests.
cr0x@server:~$ mailq | tail -n 1
-- 23686 Kbytes in 2335 Requests.

Meaning: Postfix prints a single summary line at the end of the queue listing; sampling it a few minutes apart, as above, shows the trend.
The key signal is the request count and whether it is rising over time.

Decision: If requests are rising, don’t touch concurrency yet. Move to reason analysis (Task 3) and local health (Task 5/6).

Task 2: Get a structured queue summary (Postfix)

cr0x@server:~$ postqueue -p | head -n 20
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
A1B2C3D4E5*    1487 Thu Jan  3 10:21:52  alerts@example.com
                                         user@remote.tld
F6G7H8I9J0     9210 Thu Jan  3 10:22:01  billing@example.com
                                         finance@remote.tld

Meaning: An asterisk after the queue ID means the message is in the active queue (a delivery attempt is in progress); an exclamation mark means it is on hold; no marker means it is waiting in the deferred or incoming queue.

Decision: If you see many active messages but little progress, suspect local bottleneck or stuck delivery processes. Go to Task 8 and Task 9.

Task 3: Extract dominant deferral reasons from logs

cr0x@server:~$ sudo grep -E "status=deferred|status=bounced" /var/log/mail.log | tail -n 20
Jan  3 10:23:10 server postfix/smtp[21944]: A1B2C3D4E5: to=<user@remote.tld>, relay=mx.remote.tld[203.0.113.7]:25, delay=120, delays=0.2/0.1/60/59, dsn=4.4.1, status=deferred (connect to mx.remote.tld[203.0.113.7]:25: Connection timed out)
Jan  3 10:23:12 server postfix/smtp[21947]: F6G7H8I9J0: to=<finance@remote.tld>, relay=mx.remote.tld[203.0.113.7]:25, delay=90, delays=0.2/0.1/30/59, dsn=4.7.0, status=deferred (host mx.remote.tld[203.0.113.7] said: 421 4.7.0 Try again later)

Meaning: The reason is in parentheses. Timeouts scream network/egress/DNS; 421/4.7.0 often screams rate limiting or reputation.
The delays=a/b/c/d breakdown is gold: a = time before the queue manager, b = time in the queue manager, c = connection setup (DNS, connect, TLS), d = message transmission.

Decision: If the dominant reason is connect timeout, test connectivity (Task 11). If it’s 421/4.7.0, check sending patterns and remote throttling (Task 14).

Task 4: Count deferrals by reason quickly

cr0x@server:~$ sudo grep "status=deferred" /var/log/mail.log | sed -n 's/.*status=deferred (//p' | sed 's/)$//' | sort | uniq -c | sort -nr | head
  842 connect to mx.remote.tld[203.0.113.7]:25: Connection timed out
  311 host mx.other.tld[198.51.100.9] said: 421 4.7.0 Try again later
   97 TLS is required, but our TLS engine is unavailable
   52 lost connection with mx.third.tld[192.0.2.10] while receiving the initial server greeting

Meaning: This is your heatmap. If one reason dominates, you have a bottleneck you can name.

Decision: Fix the top reason first. Queue problems are usually Pareto-distributed: one issue causes most pain.

Task 5: Check disk space where Postfix stores queues

cr0x@server:~$ df -h /var/spool/postfix
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        40G   38G  1.2G  97% /

Meaning: 97% used is not “fine.” Mail queues don’t fail gracefully when the filesystem gets tight: performance tanks, cleanup gets weird, and you risk a full stop.

Decision: Free space immediately or move spools to a larger filesystem. Then investigate why volume filled (logs, stuck mail, core dumps).

Task 6: Check inode exhaustion (the quiet killer)

cr0x@server:~$ df -i /var/spool/postfix
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/sda2      2621440 2619000   2440  100% /

Meaning: You can have free bytes and still be dead. A mail spool creates lots of small files; inode exhaustion stops file creation and breaks deliveries.

Decision: Reduce file count (clear old logs, rotate aggressively, drain queue) and plan a filesystem with more inodes (or different storage layout).

Task 7: Determine if you’re I/O bound (iowait and latency)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (server) 	01/03/2026 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.31    0.00    5.01   48.22    0.00   34.46

Device            r/s     w/s   rKB/s   wKB/s  avgrq-sz avgqu-sz   await  svctm  %util
sda              4.00  220.00   120.0  9800.0     86.2      9.8   44.10   3.20  72.00

Meaning: High iowait and high await mean your disk is not keeping up. Mail queues are sensitive to small-file fsync behavior.

Decision: Stop tuning Postfix first. Fix storage: move spool to faster disk/SSD, reduce competing I/O, or adjust VM storage policy.

Task 8: Check Postfix service health and process saturation

cr0x@server:~$ systemctl status postfix --no-pager
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled)
     Active: active (running) since Thu 2026-01-03 09:01:02 UTC; 1h 22min ago
   Main PID: 1021 (master)
      Tasks: 68 (limit: 18945)
     Memory: 212.4M
        CPU: 18min 12.301s

Meaning: “Active (running)” is necessary but not sufficient. The important part is whether child processes are stuck or respawning.

Decision: If CPU is high and queue isn’t draining, check whether a content filter or policy daemon is blocking. If CPU is low but queue is high, suspect I/O or remote deferrals.

Task 9: Look for stuck smtp processes and long runtimes

cr0x@server:~$ ps -eo pid,etime,cmd | grep -E "postfix/smtp|postfix/local" | head
21944 00:10:12 postfix/smtp -t unix -u -c
21947 00:09:58 postfix/smtp -t unix -u -c
21950 00:00:03 postfix/local -t unix -u

Meaning: Many smtp processes with long elapsed times suggest connect/TLS hangs or remote tar-pitting.

Decision: Check log delay breakdown (Task 3) and test TCP/TLS from this host (Task 11, Task 12).

Task 10: Verify DNS resolution speed and correctness

cr0x@server:~$ dig +tries=1 +time=2 mx remote.tld
;; ANSWER SECTION:
remote.tld.          300     IN      MX      10 mx.remote.tld.
remote.tld.          300     IN      MX      20 mx2.remote.tld.

;; Query time: 1789 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Thu Jan 03 10:25:55 UTC 2026

Meaning: Nearly 2 seconds for a simple MX lookup is bad. Multiply that by thousands of deliveries and you get a queue.

Decision: Fix resolver latency: check systemd-resolved/unbound, upstream DNS, packet loss, or misconfigured search domains.
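
To see whether the latency lives in the local stub or upstream, a quick comparison loop helps; the resolver IPs below are placeholders for your own:

for ns in 127.0.0.53 192.0.2.53 198.51.100.53; do
  echo "== $ns =="
  dig +tries=1 +time=2 @"$ns" mx remote.tld | grep "Query time"
done

If only the local stub is slow, look at systemd-resolved or the local cache; if every path is slow, look at the network.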

Task 11: Test outbound port 25 connectivity (don’t guess)

cr0x@server:~$ nc -vz -w 3 mx.remote.tld 25
Connection to mx.remote.tld (203.0.113.7) 25 port [tcp/smtp] succeeded!

Meaning: TCP works. If you still get timeouts in Postfix, suspect intermittent packet loss, stateful firewall issues, or concurrency exhaustion.

Decision: If TCP fails, escalate to network: egress ACLs, provider SMTP blocks, NAT table exhaustion, or upstream filtering.
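
A small loop makes this repeatable, and it can double as the synthetic check described later; the hostnames are illustrative, so use destinations you actually deliver to:

for host in mx.remote.tld mx.other.tld mx.third.tld; do
  nc -vz -w 3 "$host" 25 >/dev/null 2>&1 && echo "OK:   $host" || echo "FAIL: $host"
done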

Task 12: Test STARTTLS handshake to a remote MX

cr0x@server:~$ openssl s_client -starttls smtp -connect mx.remote.tld:25 -servername mx.remote.tld -brief < /dev/null
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN = mx.remote.tld
Verification: OK
250 SMTPUTF8

Meaning: TLS handshake is healthy. If Postfix logs show TLS failures, suspect local CA bundle issues, SNI mismatch, or older OpenSSL.

Decision: If handshake hangs, reduce DNS/TLS timeouts temporarily and investigate middleboxes that break STARTTLS.

Task 13: Identify top senders in the queue (who’s flooding you)

cr0x@server:~$ postqueue -p | awk '/^[0-9A-Z]/ {print $NF}' | sort | uniq -c | sort -nr | head
  8123 noreply@app.internal
  1432 alerts@example.com
   611 billing@example.com

Meaning: One sender generating most queued mail usually indicates an application change, a retry storm, or a compromised credential.

Decision: Rate-limit or block the top offender temporarily, and contact the owning team. Stabilize throughput before you “fix email.”
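
One way to park a flooding sender in Postfix, assuming the app submits over SMTP (the address comes from the example above; HOLD keeps the mail for review, DEFER pushes back on the client; append the restriction to whatever you already have rather than replacing it):

# /etc/postfix/sender_access
noreply@app.internal   HOLD

# main.cf
smtpd_sender_restrictions = check_sender_access hash:/etc/postfix/sender_access

# Then:
#   sudo postmap /etc/postfix/sender_access && sudo postfix reload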

Task 14: Identify top destinations (who is rejecting you)

cr0x@server:~$ postqueue -p | awk '/^[[:space:]]/ && $1 ~ /@/ {split($1, a, "@"); print a[2]}' | sort | uniq -c | sort -nr | head
  9321 remote.tld
  2210 other.tld
   809 bigmail.example

Meaning: If one domain dominates, you can focus: check their MX reachability, your reputation, their deferral messages, and your sending rate to them.

Decision: Apply per-destination concurrency limits or slow down to that domain, rather than punishing every other destination.
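
A common Postfix pattern for this is a dedicated “slow” transport for the problem domain; a minimal sketch, with illustrative limits:

# master.cf — clone the smtp transport under a new name
# (copy the column values from your existing smtp entry).
slow      unix  -       -       y       -       -       smtp

# main.cf — pace only that transport.
transport_maps = hash:/etc/postfix/transport
slow_destination_concurrency_limit = 2
slow_destination_rate_delay = 1s

# /etc/postfix/transport — route the throttling domain through it.
remote.tld    slow:

# Then: sudo postmap /etc/postfix/transport && sudo postfix reload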

Task 15: Inspect a specific queued message (Postfix)

cr0x@server:~$ sudo postcat -q A1B2C3D4E5 | sed -n '1,30p'
*** ENVELOPE RECORDS ***
message_size:            1487             1487
message_arrival_time: Thu Jan  3 10:21:52 2026
sender: alerts@example.com
*** MESSAGE CONTENTS ***
Received: from app.internal (app.internal [10.0.10.5])
	by server with ESMTP id A1B2C3D4E5
	for <user@remote.tld>; Thu, 03 Jan 2026 10:21:52 +0000 (UTC)
Subject: Alert: backend latency

Meaning: Confirms sender, recipient, and internal handoff. Useful to detect mail loops, spoofing, or a specific tenant causing load.

Decision: If you spot a loop (Received headers repeating), stop it at the source and purge affected messages.
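
For the purge step, a hedged sketch that deletes queued mail from one sender (the address is illustrative, and it assumes the default short hex queue IDs shown above; drop the final postsuper stage first and eyeball the queue IDs before deleting):

mailq | awk '/^[0-9A-Z]/ && $NF == "looping-app@example.com" {print $1}' \
  | tr -d '*!' | sudo postsuper -d -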

Task 16: Confirm queue directory size and file count

cr0x@server:~$ sudo du -sh /var/spool/postfix/deferred /var/spool/postfix/active
3.8G	/var/spool/postfix/deferred
126M	/var/spool/postfix/active

cr0x@server:~$ sudo find /var/spool/postfix/deferred -type f | wc -l
198234

Meaning: Deferred dominating means you’re mostly waiting on remote conditions (or DNS/TLS/egress). File count matters for inodes and directory traversal overhead.

Decision: If file count is massive and inodes are tight, prioritize draining or relocating spool; then reduce inflow (rate limit) to prevent a spiral.

Task 17: Watch queue drain rate in real time

cr0x@server:~$ watch -n 5 "postqueue -p | tail -n 1"
Every 5.0s: postqueue -p | tail -n 1

-- 23686 Kbytes in 2335 Requests.

Meaning: You care about slope, not a single number. If this number doesn’t drop after you apply a fix, you didn’t fix the bottleneck.

Decision: If it drops steadily, stop touching things. Let it drain. Your job is to avoid making it worse.
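
If you want the slope on record rather than on a screen, a tiny sampling loop works (the log path is arbitrary):

while true; do
  printf '%s %s\n' "$(date -Is)" "$(postqueue -p | tail -n 1)"
  sleep 60
done >> /var/tmp/queue-size.log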

Task 18: Check if you’re being throttled or tarpitted by remote servers

cr0x@server:~$ sudo grep "said: 4" /var/log/mail.log | tail -n 20
Jan  3 10:24:01 server postfix/smtp[21947]: F6G7H8I9J0: to=<finance@remote.tld>, relay=mx.remote.tld[203.0.113.7]:25, delay=90, dsn=4.7.0, status=deferred (host mx.remote.tld[203.0.113.7] said: 421 4.7.0 Try again later)

Meaning: A 4xx is a temporary deferral; the remote side is explicitly asking you to slow down or come back later.

Decision: Reduce per-destination concurrency and sending rate; if you operate bulk mail, segment that traffic to a dedicated outbound host/IP.

Joke #1: Email is the only system where “try again later” is both a protocol feature and an emotional support strategy.

Common bottlenecks and failure modes

1) Disk and filesystem bottlenecks (the classic)

Mail spools are small-file factories. You get metadata ops, fsyncs, directory lookups, and a lot of “write a little, rename, unlink.”
If your queue is on slow network storage, oversubscribed VM disks, or a nearly-full filesystem, deliveries will slow down even if the CPU is idle.

Storage-specific tells:

  • High iowait, high await, modest throughput.
  • Queue manager processes block on I/O; “active” grows and stays high.
  • Log writes lag; timestamps in logs jump.

What to do:

  • Move spool to local SSD where possible.
  • Keep /var roomy; 20% free is not luxury, it’s operational sanity.
  • Stop colocating mail spools with log-heavy systems, backup staging, or anything that does large sequential writes.

2) DNS resolver slowness (death by a thousand lookups)

Every outbound delivery involves DNS: MX lookup, A/AAAA, sometimes TLSA, sometimes SPF/DMARC checks depending on your setup.
If your resolver is slow, you effectively serialize deliveries.

Typical root causes:

  • systemd-resolved forwarding to flaky upstream DNS.
  • Firewall dropping fragments or UDP responses (EDNS issues).
  • Search domains causing repeated NXDOMAIN churn.
  • IPv6 misbehavior: AAAA lookups succeed, but actual v6 connectivity is broken (a mitigation sketch follows this list).
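
For the IPv6 case specifically, a minimal mitigation sketch while the network gets fixed; both parameters are standard Postfix, pick one:

sudo postconf -e 'smtp_address_preference = ipv4'
# or, more bluntly, disable IPv6 for Postfix entirely:
sudo postconf -e 'inet_protocols = ipv4'
sudo postfix reload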

3) Network egress blocks and provider SMTP restrictions

Many cloud networks block outbound port 25 by default. Some corporate firewalls “helpfully” intercept SMTP and do who-knows-what.
Symptoms: connect timeouts, sporadic success, or “No route to host.”

The fix is almost never “tune Postfix.” It’s: open the port, use the correct relay, or route mail via an approved smarthost.

4) TLS handshake failures and policy mismatches

STARTTLS is opportunistic in many configurations, but increasingly mandatory in some environments. TLS introduces:
certificate verification errors, handshake timeouts, cipher suite mismatches, and even bugs in older libraries.

Watch for:

  • TLS is required, but our TLS engine is unavailable
  • SSL_accept error, no shared cipher
  • Long delays during the connect/handshake phase in logs.

5) Remote rate limits and reputation deferrals

The remote side might be healthy. They just don’t like you right now.
Bulk spikes, new sending IPs, misaligned SPF/DKIM/DMARC, or unusual bounce patterns can trigger throttling.

Operational approach:

  • Separate transactional and bulk streams (different host/IP if possible).
  • Implement per-domain pacing and concurrency caps.
  • Stop retry storms from applications that treat SMTP as a message queue (it isn’t).

6) Content filtering and milters (the hidden serialization)

Spam/AV scanning is expensive. If you feed all mail through a single filter instance, you can create a bottleneck that looks like “SMTP is slow.”
Also: when the filter fails, MTAs tend to defer—queue grows—and the filter team says “works on my laptop.”

Treat filters like any other dependency: capacity plan, monitor, and fail safely (with explicit policy).
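
In Postfix terms, “fail safely with explicit policy” mostly means deciding these values on purpose; the ones shown are the stock defaults, listed here so they stop being implicit:

# main.cf
# What happens when the milter is unreachable or times out:
# "tempfail" defers mail (queue grows), "accept" lets it through unfiltered.
milter_default_action = tempfail
milter_connect_timeout = 30s
milter_command_timeout = 30s
milter_content_timeout = 300s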

7) Retry schedule and queue lifetime misconfiguration

If you retry too aggressively, you amplify outages and get rate-limited harder. If you retry too slowly, you accumulate backlog and miss business SLAs.
Queue lifetime also matters: holding junk mail for days wastes disk and attention.
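
The knobs behind this in Postfix, shown roughly at their defaults (tune them to business expectations, not vibes):

# main.cf
minimal_backoff_time   = 300s
maximal_backoff_time   = 4000s
queue_run_delay        = 300s
maximal_queue_lifetime = 5d
bounce_queue_lifetime  = 5d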

Joke #2: The mail queue is like your gym membership—if you ignore it long enough, it will still be there, quietly judging you.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Postfix on a pair of VMs behind a load balancer. They had a “simple” assumption:
outbound email isn’t critical infrastructure, because “we can always re-send.” That assumption held until payroll week.

The queue started growing slowly after a network change. Logs showed connect timeouts to a handful of domains, but not all.
The on-call engineer assumed remote domains were flaky and waited for it to resolve. By morning, the deferred queue was massive and the disk was nearly full.
Then Postfix started throwing errors related to file creation and cleanup.

The real problem was embarrassingly local: outbound port 25 was blocked for one of the NAT gateways during a firewall ruleset rollout.
Half the traffic succeeded; half timed out. The “partial success” pattern fooled everyone.
Because retries were conservative, the failures accumulated quietly.

The fix was not heroic: correct the firewall rules, drain the queue, and add a synthetic check that tests outbound TCP/25 from each node every minute.
The lesson was more valuable than the outage: never assume “some mail works” means “mail is fine.”

Mini-story 2: The optimization that backfired

Another org had legitimate throughput issues. Marketing wanted faster campaign delivery; engineering wanted fewer complaints.
Someone bumped Postfix concurrency and parallelism settings across the board. The queue drained faster—for about an hour.

Then the remote deferrals began: lots of 421 and 4.7.x “try again later” responses. Major destinations started tar-pitting connections.
The MTA responded by holding many smtp processes open, each waiting for a greeting or a slow response.
CPU rose, but worse, file descriptor usage climbed and the box started acting haunted.

Meanwhile the content filter (a milter) was now receiving bursts it was never designed to handle.
It began to time out, which triggered more deferrals, which created more retries, which increased bursts.
That’s how you build a feedback loop: take a system already under strain and “optimize” it into self-harm.

Recovery required dialing back concurrency, adding per-domain limits, and scaling the filter tier.
The long-term fix was policy: changes to mail throughput require capacity tests and a rollback plan, not a Friday afternoon “just bump it.”

Mini-story 3: The boring but correct practice that saved the day

A financial services company ran a small fleet of outbound MTAs. Nothing fancy—just disciplined.
They kept mail spools on local SSDs, rotated logs aggressively, pinned DNS resolvers to known-good instances, and monitored queue age distribution.

One day a resolver upstream began intermittently failing. Many systems across the company had issues, but the mail fleet didn’t fall over.
Their dashboards showed DNS query latency rising, but the queue stayed stable.

Why? They had a local caching resolver on each mail host with sane timeouts and a warm cache.
They also had alerting not just on queue size, but on “oldest deferred message age” and “deferral reason cardinality.”
Those alerts triggered early, while the queue was still manageable.

The fix was a controlled change to upstream resolvers and a temporary reduction in outbound concurrency to avoid amplifying DNS timeouts.
No crisis call. No executive panic. Just boring operations doing what boring operations does best: preventing drama.

Common mistakes (symptom → root cause → fix)

  • Symptom: Queue grows, mostly deferred, with “connect timed out”.
    Root cause: Outbound TCP/25 blocked, intermittent routing, or NAT table exhaustion.
    Fix: Validate TCP/25 with nc; check firewall/egress policy; confirm NAT capacity; consider a smarthost relay.
  • Symptom: Queue grows, “Try again later” (421/4.7.0) dominates.
    Root cause: Remote rate limiting or reputation throttling; bursty sending patterns.
    Fix: Pace per-domain, segment traffic, reduce concurrency, clean up authentication (SPF/DKIM/DMARC) and bounce handling.
  • Symptom: Active queue high, CPU low, iowait high.
    Root cause: Disk latency on spool volume; oversubscribed storage; near-full filesystem.
    Fix: Move spool to faster storage; free space/inodes; isolate from other I/O; verify with iostat.
  • Symptom: Sporadic failures, long delays before connect, DNS query time high.
    Root cause: Resolver latency, EDNS/fragmentation issues, misconfigured search domains.
    Fix: Fix DNS path; run a local cache; lower resolver timeouts; verify with dig timings.
  • Symptom: TLS-related deferrals suddenly appear after OS update or certificate change.
    Root cause: CA bundle mismatch, OpenSSL changes, enforced TLS policy without support.
    Fix: Re-test with openssl s_client; verify CA store; adjust TLS policy carefully, not emotionally.
  • Symptom: Queue is huge, but only one internal sender dominates.
    Root cause: App retry storm, bad queueing logic, or compromised account sending spam.
    Fix: Rate-limit the sender at MTA; block temporarily; fix app behavior; rotate credentials.
  • Symptom: Queue grows and logs show “warning: database is locked” or similar for maps.
    Root cause: Contended lookup maps (e.g., local databases), NFS-mounted map files, or slow LDAP.
    Fix: Move maps local, switch map backend, cache lookups, and isolate dependencies.
  • Symptom: Queue drains extremely slowly even after network fix.
    Root cause: Retry schedule too conservative; too few smtp worker processes; per-domain limits overly tight.
    Fix: Temporarily raise concurrency and delivery rate after the bottleneck is gone; watch error rate and remote deferrals.

Checklists / step-by-step plan

Checklist A: First 10 minutes (containment and signal)

  1. Snapshot the situation. Record queue size, oldest message age (if you track it), and top deferral reasons (Task 1–4).
  2. Confirm local disk and inodes. If either is critical, fix that first (Task 5–6). A full disk turns a delivery problem into a data-loss problem.
  3. Check DNS latency and outbound TCP/25. Validate from the MTA host, not from your laptop (Task 10–12).
  4. Identify top sender and top destination domains. Contain a flood early (Task 13–14).
  5. Communicate. Tell stakeholders: “Outbound delivery is delayed; inbound acceptance is still functioning (or not). Next update in 30 minutes.”

Checklist B: Stabilize throughput (stop the bleeding)

  1. Reduce inflow if needed. If a single app is flooding, throttle it at the MTA or firewall. Yes, you will get yelled at; do it anyway.
  2. Fix the top deferral reason. Don’t chase ten tiny issues when one dominates (Task 4).
  3. Prefer per-destination tuning over global tuning. If one domain throttles you, don’t punish all others.
  4. Ensure spool I/O is healthy. Move queues to fast storage if you must; queues are not a good place to be “cost efficient.”
  5. Verify drain rate. Watch the slope (Task 17). If the slope improves, stop fiddling.

Checklist C: Recovery and cleanup (after the queue starts draining)

  1. Confirm bounce behavior. Make sure you’re not generating backscatter or bounce loops.
  2. Validate retry policy. Ensure you won’t hold messages for days that the business no longer cares about.
  3. Post-incident controls. Add alerts on queue age and deferral reasons, not just queue length.
  4. Do a small load test. If you changed concurrency or moved spools, verify you didn’t create a new bottleneck.
  5. Document the “top 3” causes and checks. Future-you is not as smart at 03:00.

FAQ

1) Should I just flush the queue?

Only after you’ve fixed the underlying bottleneck. Flushing a broken system increases load on the broken part and can worsen deferrals or trigger remote throttles.
Flush is a tool for recovery, not diagnosis.

2) When is it okay to delete queued mail?

When it’s clearly abusive (spam flood), clearly stale (business-approved expiration), or when disk/inodes are at risk of taking the server down.
Delete with intention: identify by sender, destination, or age. Don’t randomly purge and hope.

3) Why is the queue mostly deferred and not active?

Deferred means attempts happened and failed temporarily. That’s usually remote policy, DNS/TLS/egress issues, or content filtering dependencies that refused mail.
Active queue dominating points more to local processing or I/O constraints.

4) How do I know if DNS is the bottleneck?

Measure lookup time from the mail host with dig. If MX lookups take hundreds of milliseconds to seconds, that’s enough to cap throughput.
Also correlate with log delay breakdowns where pre-connect time grows.

5) Can a nearly-full disk cause delivery delays before it’s actually full?

Yes. Many filesystems get slower when space is tight, and mail spools amplify metadata operations.
Also: “nearly full” is how you end up “suddenly full” during a backlog.

6) Why do remote servers ask us to “try again later”?

Because they’re protecting themselves: rate limits, reputation controls, greylisting, or overload protection.
Treat 4xx as a pacing signal, not a personal insult.

7) Is increasing concurrency always bad?

No. It’s just commonly misused. Increase concurrency only when local resources and network are healthy, and when remote deferrals are not the dominant failure mode.
Prefer targeted tuning (per-domain) over global “turn it up.”

8) What metrics should I alert on besides queue length?

Alert on oldest message age, deferred-to-active ratio, top deferral reason counts, DNS lookup latency from the MTA, disk/inode headroom on the spool volume,
and outbound TCP/25 success rates.
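
A rough sketch of the “oldest deferred message age” metric, suitable for a cron job or monitoring agent (the path assumes a default Postfix layout):

oldest=$(sudo find /var/spool/postfix/deferred -type f -printf '%T@\n' | sort -n | head -n 1)
if [ -n "$oldest" ]; then
  echo "oldest_deferred_age_hours $(( ( $(date +%s) - ${oldest%.*} ) / 3600 ))"
else
  echo "oldest_deferred_age_hours 0"
fi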

9) How do I know if a content filter is slowing us down?

Look for timeouts or deferrals referencing the milter/policy service, high connection counts to the filter, or rising latency in the “before queue manager” part of log delays.
If the filter is serializing work, the queue grows even though remote connectivity is fine.

10) Why does the queue keep growing even after we “fixed the network”?

Because delivery throughput is still below arrival rate, or because retry schedules mean you won’t re-attempt many deferred messages immediately.
Watch the slope; adjust concurrency cautiously once the primary failure is resolved.

Next steps you should actually take

If your email queue keeps growing, don’t treat it like a mysterious blob that needs “tuning.” Treat it like a backlog with a cause.
Name the dominant deferral reason. Verify local disk and DNS. Prove outbound connectivity. Identify the top sender and top destination.
Then make one change at a time and watch the slope.

Practical next steps for the next business day:

  • Add alerting on oldest deferred message age and top deferral reasons, not just queue size.
  • Move the mail spool onto storage you trust under small-file I/O (or provision new hosts that way). If that sounds expensive, price the outage instead.
  • Run a local caching DNS resolver on mail hosts (or use known-good resolvers) and measure query latency continuously.
  • Implement per-destination pacing and separate bulk from transactional mail if you send volume.
  • Write a one-page runbook with the Fast diagnosis playbook and the top five commands you’ll run under stress.