Your product team calls it “email being slow.” Support calls it “customers not getting reset links.”
You call it “the queue is growing teeth.” Somewhere between your MTA and a provider’s edge,
outbound SMTP starts returning polite little 4xx codes and your delivery times explode from seconds to hours.
The hard part isn’t fixing throttling. The hard part is proving where it is, so you don’t “optimize” the wrong system,
burn reputation, or invent a retry storm that turns a small incident into an all-night career review.
What SMTP throttling actually looks like (in production logs)
SMTP throttling is the provider telling you, “Not now.” It usually arrives as transient failures (4xx),
sometimes with enhanced status codes (like 4.7.0), sometimes with prose that reads like a lawyer wrote it.
Your MTA does what MTAs do: it queues and retries. If you don’t control the retry shape, you don’t control the incident.
The classic signals:
- 421 at connect time or after greeting: “Service not available,” often with “try again later.”
- 451/452 during transaction: “Temporary local problem” or “insufficient system storage” (sometimes a lie, sometimes not).
- 4.7.x enhanced statuses: “message deferred,” “rate limited,” “temporary throttling,” “too many connections.”
- Connection deferrals: “lost connection,” “timeout,” “host said: 421 …”
- Per-recipient bursts: only some destination domains slow down, others fly.
Here’s the part people miss: throttling isn’t only “too many messages.” It’s also too many simultaneous connections,
too many recipients per message, too many failures, suspicious patterns, or a reputation issue masquerading as “capacity.”
Providers rarely admit which knob you turned. They just hand you a 4xx and wait for you to behave.
Interesting facts and historical context (email has always been like this)
- SMTP is older than most of your tooling. It’s from the early 1980s, designed for cooperative networks, not adversarial spam economies.
- 4xx deferrals are a feature, not a bug. “Try again later” was meant to smooth outages. It now also smooths policy enforcement.
- Enhanced status codes (RFC 3463 lineage) were introduced because plain 4xx/5xx was too vague for modern mail operations.
- Greylisting popularized the idea of intentional temporary failure. Some receivers used 4xx on first contact to deter spambots that won’t retry.
- Big providers moved the goalposts from IP reputation to behavior. Volume patterns, complaint rates, authentication, and engagement now influence throttling.
- Per-domain policies got stricter as inbound abuse grew. Limiting connections per sending IP became normal once botnets learned parallelism.
- Bulk vs transactional separation became an ops pattern. Not for beauty—because mixing them makes both worse during throttling events.
- DNS and timeouts are part of “throttling” symptoms. When a provider is overloaded, you often see slower TLS handshakes and banner delays before explicit 4xx.
One quote that should be taped to your monitor:
“Hope is not a strategy.” — Gene Kranz
Fast diagnosis playbook (first/second/third checks)
The goal is simple: find the bottleneck in under 15 minutes. Not “understand everything.”
Just identify whether the choke point is your host, your MTA policy, your network, or the destination provider.
First: Is this global or per-domain?
- If all destinations slow down, suspect local resource limits, DNS issues, or a shared upstream (relay/smarthost).
- If only one provider/domain slows down (e.g., outlook.com or a single corporate MX), suspect receiver throttling or reputation/policy.
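If qshape is available (it ships with Postfix, though some distributions package it separately), it answers this question in one command by bucketing the deferred queue per recipient domain and message age; a minimal sketch, output omitted:
cr0x@server:~$ qshape deferred | head          # deferred queue, grouped by recipient domain
cr0x@server:~$ qshape -s deferred | head       # same queue, grouped by sender domain
One domain owning most of the old age buckets points at that receiver; every domain aging evenly points at your host or a shared upstream.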
Second: Is it “can’t connect” or “can’t deliver after connect”?
- Connect/greeting delays (timeouts, 421 at connect): likely connection limits, tarpitting, or remote overload.
- RCPT/DATA deferrals (451/452/4.7.x after MAIL FROM/RCPT TO/DATA): policy/rate limiting, reputation, content triggers, or recipient-level issues.
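A rough way to split the two from the logs, assuming Postfix logs to /var/log/mail.log: deferrals where the remote never answered carry phrases like "timed out" or "Connection refused," while deferrals after a working dialogue carry the remote's own words after "said:".
cr0x@server:~$ sudo grep -c "status=deferred.*timed out" /var/log/mail.log   # no usable reply: connect/greeting trouble
cr0x@server:~$ sudo grep -c "status=deferred.*said:" /var/log/mail.log       # explicit 4xx from the remote side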
Third: Is your retry shape making it worse?
- If your queue is growing and retry intervals are short, you may be generating a retry storm that convinces the provider you’re a bad citizen.
- If concurrency is high for a single domain, you’re tripping per-domain connection limits.
- If you’re batching too many recipients per message, one deferred recipient can penalize the whole transaction path.
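One way to see retry churn directly, assuming the default syslog format from Task 2 where the queue ID is the sixth field: count how many deferral lines each message has produced.
cr0x@server:~$ sudo grep "status=deferred" /var/log/mail.log \
    | awk '{print $6}' | sort | uniq -c | sort -nr | head   # deferral count per queue ID
A handful of IDs with dozens of deferrals each means your retry shape, not your raw volume, is what the provider is reacting to.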
Quick rule: stabilize before you optimize
Reduce concurrency, add backoff, preserve transactional mail, and stop thrashing the provider.
You can restore performance after you regain a steady-state queue.
Prove it’s the provider: evidence that holds up in a postmortem
“The provider is throttling us” is easy to say and hard to prove. People will ask:
Is it our network? Is it our TLS config? Are we overloaded? Are we on a blocklist? Are we sending garbage?
You need evidence that separates local failure from remote policy.
What proof looks like
- Remote SMTP replies show explicit throttling language or codes (421/451/4.7.x) after successful TCP/TLS establishment.
- Per-destination clustering: one provider defers, others accept at normal rate, from the same host and time window.
- Stable local resources: CPU, RAM, disk, network are fine while queue grows—pointing away from local saturation.
- Connection limit signatures: “too many connections,” frequent resets during greeting, or delayed banners only for certain MX hosts.
- Correlation with sending pattern changes: release went out, marketing blast, password reset spike, or an incident causing retries.
What does NOT count as proof
- “We didn’t change anything.” (You probably did. Or the provider did. Or your traffic did.)
- “The queue is big.” (Queues grow for dozens of reasons.)
- “The provider is always flaky.” (True in spirit, useless in a postmortem.)
Joke #1: SMTP is the only protocol that can politely tell you “come back later” while still ruining your afternoon.
Practical tasks: commands, output meaning, decisions (12+)
The tasks below assume a Linux host running Postfix. If you run Exim, Sendmail, or a managed relay,
the philosophy still applies: collect hard evidence, then shape the traffic so you’re not fighting the receiver.
Task 1: Count deferred messages and identify top destinations
cr0x@server:~$ postqueue -p | awk '
  # recipient lines in mailq output are a single indented token containing "@";
  # queue-ID lines, "(reason)" lines, and the header/footer all have more fields
  NF == 1 && $1 ~ /@/ {split($1, a, "@"); print a[2]}
' | sort | uniq -c | sort -nr | head
913 outlook.com
402 protection.outlook.com
177 gmail.com
61 yahoo.com
44 example-corp.com
What it means: Queued recipients cluster by destination domain (in a backlog, those are overwhelmingly deferred deliveries); that’s your first “provider vs local” discriminator.
If one domain dominates, suspect remote throttling or policy.
Decision: Apply per-domain concurrency/rate limits for the dominant domain(s) and protect transactional mail paths.
Task 2: Extract remote SMTP replies for deferred deliveries
cr0x@server:~$ sudo grep -E "status=deferred" /var/log/mail.log | tail -n 5
Jan 04 10:12:21 mx1 postfix/smtp[22119]: 3F2C41A2B: to=<user1@outlook.com>, relay=outlook-com.olc.protection.outlook.com[104.47.56.36]:25, delay=182, delays=0.2/0/12/170, dsn=4.7.0, status=deferred (host outlook-com.olc.protection.outlook.com[104.47.56.36] said: 451 4.7.500 Server busy. Please try again later. (S3150) (in reply to MAIL FROM command))
Jan 04 10:12:22 mx1 postfix/smtp[22121]: 7B9DF1A31: to=<user2@outlook.com>, relay=outlook-com.olc.protection.outlook.com[104.47.56.38]:25, delay=164, delays=0.1/0/10/154, dsn=4.7.0, status=deferred (host outlook-com.olc.protection.outlook.com[104.47.56.38] said: 451 4.7.500 Server busy. Please try again later. (S3150) (in reply to RCPT TO command))
Jan 04 10:12:24 mx1 postfix/smtp[22118]: 9C1C11A40: to=<user3@gmail.com>, relay=gmail-smtp-in.l.google.com[142.250.102.26]:25, delay=1.2, delays=0.1/0/0.2/0.9, dsn=2.0.0, status=sent (250 2.0.0 OK 1704363144 x7si123456qka.321 - gsmtp)
Jan 04 10:12:25 mx1 postfix/smtp[22125]: 1A7D21A55: to=<user4@example-corp.com>, relay=mx.example-corp.com[203.0.113.10]:25, delay=0.8, delays=0.1/0/0.2/0.5, dsn=2.0.0, status=sent (250 2.0.0 queued as 8F2A9C)
Jan 04 10:12:27 mx1 postfix/smtp[22130]: 5E0B71A66: to=<user5@outlook.com>, relay=outlook-com.olc.protection.outlook.com[104.47.56.37]:25, delay=211, delays=0.2/0/9/201, dsn=4.7.0, status=deferred (host outlook-com.olc.protection.outlook.com[104.47.56.37] said: 421 4.7.0 Temporary server error. Please try again later. (in reply to end of DATA command))
What it means: You have successful deliveries to other domains while a specific provider returns “Server busy” and 4.7.x codes.
That’s close to courtroom-grade evidence: the remote peer is explicitly deferring.
Decision: Treat it as provider throttling. Reduce parallelism to that provider and increase backoff; do not “speed up retries.”
Task 3: Check whether you’re saturating SMTP client processes locally
cr0x@server:~$ sudo postconf -n | egrep "default_process_limit|max_use|smtp_destination_concurrency_limit|smtp_destination_rate_delay|smtp_connect_timeout"
default_process_limit = 200
max_use = 100
smtp_destination_concurrency_limit = 50
smtp_destination_rate_delay = 0s
smtp_connect_timeout = 30s
What it means: High concurrency and zero rate delay can look like a connection flood to a provider.
You may be tripping per-IP or per-tenant connection caps.
Decision: Clamp concurrency per destination and introduce a small rate delay. Keep global limits reasonable.
Task 4: Measure active SMTP connections to the provider
cr0x@server:~$ sudo ss -tnp | awk '$5 ~ /:25$/ || $5 ~ /:587$/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
62 104.47.56.36
59 104.47.56.38
57 104.47.56.37
12 142.250.102.26
4 203.0.113.10
What it means: You’re holding a lot of concurrent connections to the throttling provider.
Even if each message is small, connection concurrency alone can trigger throttling.
Decision: Reduce per-destination concurrency. Also review whether slow remote replies are causing your client to hold sockets open longer.
Task 5: Check for greeting delays (tarpitting) during SMTP handshake
cr0x@server:~$ time bash -c 'echo | nc -w 10 outlook-com.olc.protection.outlook.com 25'
220 DM6PR10CA0001.outlook.office365.com Microsoft ESMTP MAIL Service ready at Thu, 4 Jan 2026 10:13:12 +0000
real 0m6.412s
user 0m0.002s
sys 0m0.002s
What it means: A 6-second banner is not “normal internet latency.” It’s either load, tarpitting, or deliberate pacing.
If this delay is only for one provider, you’re looking at remote-side shaping.
Decision: Lower concurrency even more; long greeting delays multiply your effective connection count and amplify queue growth.
Task 6: Validate DNS resolution speed and correctness for the target MX
cr0x@server:~$ dig +tries=1 +time=2 MX outlook.com
;; ANSWER SECTION:
outlook.com. 1800 IN MX 5 outlook-com.olc.protection.outlook.com.
;; Query time: 21 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Jan 4 10:13:20 UTC 2026
;; MSG SIZE rcvd: 98
What it means: DNS is fast and sane. If query time is hundreds/thousands of ms or timing out, you can misdiagnose throttling.
Decision: If DNS is slow, fix resolvers first. Don’t tune Postfix around a DNS outage.
Task 7: Check local host health (disk and I/O) so you don’t blame the provider for your own mess
cr0x@server:~$ df -h /var/spool/postfix
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 200G 48G 143G 26% /
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
3.20 0.00 1.10 0.30 0.00 95.40
Device r/s w/s rkB/s wkB/s aqu-sz await %util
nvme0n1 12.0 21.0 540.0 1120.0 0.02 0.7 2.1
What it means: Plenty of space; no I/O choke. Queue growth is not because your spool is slow or full.
Decision: Keep looking upstream. If disk is full or I/O is pegged, fix that first—email is brutally sensitive to fsync behavior.
Task 8: Check whether the queue manager is overwhelmed
cr0x@server:~$ mailq | head -n 20
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
3F2C41A2B 5120 Thu Jan 4 09:58:11 noreply@service.example
user1@outlook.com
(host outlook-com.olc.protection.outlook.com[104.47.56.36] said: 451 4.7.500 Server busy. Please try again later. (S3150) (in reply to MAIL FROM command))
7B9DF1A31 4891 Thu Jan 4 10:01:03 noreply@service.example
user2@outlook.com
(host outlook-com.olc.protection.outlook.com[104.47.56.38] said: 451 4.7.500 Server busy. Please try again later. (S3150) (in reply to RCPT TO command))
What it means: Deferrals are annotated with remote replies. You can show this to anyone.
Decision: If the same deferral repeats for a domain, stop hammering. Tune per-domain limits and retry intervals.
Task 9: Confirm you’re not stuck behind a smarthost or relay bottleneck
cr0x@server:~$ sudo postconf -n | egrep "^relayhost|^smtp_sasl_auth_enable|^smtp_tls_security_level"
relayhost =
smtp_sasl_auth_enable = no
smtp_tls_security_level = dane
What it means: No relayhost configured; you deliver directly to MX hosts. Good: per-domain throttling analysis is meaningful.
If relayhost is set, your bottleneck might be your relay provider, not the destination domain.
Decision: If you use a relayhost, move diagnosis to the relay boundary: their responses, your auth limits, their quotas.
Task 10: Trace a single delivery attempt end-to-end with verbose SMTP client logging
cr0x@server:~$ sudo postconf -e "debug_peer_list = outlook-com.olc.protection.outlook.com"
cr0x@server:~$ sudo systemctl reload postfix
cr0x@server:~$ sudo grep -E "outlook-com\.olc\.protection\.outlook\.com|postfix/smtp" /var/log/mail.log | tail -n 20
Jan 04 10:14:10 mx1 postfix/smtp[22301]: connect to outlook-com.olc.protection.outlook.com[104.47.56.36]:25: Connected
Jan 04 10:14:16 mx1 postfix/smtp[22301]: << 220 DM6PR10CA0001.outlook.office365.com Microsoft ESMTP MAIL Service ready
Jan 04 10:14:16 mx1 postfix/smtp[22301]: >> EHLO mx1.service.example
Jan 04 10:14:17 mx1 postfix/smtp[22301]: << 250-DM6PR10CA0001.outlook.office365.com Hello [198.51.100.10]
Jan 04 10:14:17 mx1 postfix/smtp[22301]: >> MAIL FROM:<noreply@service.example>
Jan 04 10:14:17 mx1 postfix/smtp[22301]: << 451 4.7.500 Server busy. Please try again later. (S3150)
What it means: TCP connect succeeded. Banner arrived. EHLO worked. The provider deferred at MAIL FROM.
That’s not your firewall. That’s not your DNS. That’s the receiver saying “slow down.”
Decision: Implement per-domain pacing and longer backoff; preserve evidence (logs) for provider escalation.
Task 11: Verify authentication posture (SPF/DKIM/DMARC) from your sending domain
cr0x@server:~$ dig +short TXT service.example
"v=spf1 ip4:198.51.100.10 -all"
"google-site-verification=..."
cr0x@server:~$ dig +short TXT selector1._domainkey.service.example
"v=DKIM1; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A..."
cr0x@server:~$ dig +short TXT _dmarc.service.example
"v=DMARC1; p=quarantine; rua=mailto:dmarc@service.example"
What it means: Missing or sloppy auth doesn’t always hard-bounce; it can increase throttling and filtering pressure.
Providers may accept slowly while they decide whether you’re trustworthy.
Decision: If auth is missing/broken, fix it before you negotiate “rate limits.” You can’t tune your way out of mistrust.
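If you sign with OpenDKIM, its test helper is a quick sanity check that the published selector record resolves and parses; a sketch assuming the selector1 selector shown above:
cr0x@server:~$ sudo opendkim-testkey -d service.example -s selector1 -vvv   # checks the public key record for that selector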
Task 12: Inspect message composition triggers (size, recipient count, and batching)
cr0x@server:~$ sudo postcat -qe 3F2C41A2B | egrep -i "^(message_size|sender|message_arrival_time)"
message_size: 5120
sender: noreply@service.example
message_arrival_time: Thu Jan 4 09:58:11 2026
cr0x@server:~$ sudo postcat -qe 3F2C41A2B | grep -c "^recipient:"
15
What it means: A single queued item has 15 recipients. Some providers penalize large recipient fanout,
and one deferred recipient can delay delivery for the whole envelope (depending on how your app batches).
Decision: If you batch recipients heavily, reduce recipients per message for throttled domains. More queue entries, fewer multi-recipient stalls.
Task 13: Check whether TLS negotiation is stalling or failing (often misread as throttling)
cr0x@server:~$ openssl s_client -starttls smtp -connect outlook-com.olc.protection.outlook.com:25 -servername outlook-com.olc.protection.outlook.com -brief </dev/null
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN=*.protection.outlook.com
Verification: OK
What it means: TLS is fine. If this fails or hangs, you might be dealing with middleboxes, MTU issues, or provider-side TLS load.
Decision: If TLS is the choke, adjust timeouts and investigate network path. Don’t just reduce send rate and call it solved.
Task 14: Confirm that your own firewall/NAT isn’t dropping outbound connections under load
cr0x@server:~$ sudo conntrack -S | head
entries 32768
searched 154832
found 98211
new 2211
invalid 0
ignore 0
delete 2179
delete_list 2179
What it means: If invalid is high, or conntrack table is near capacity, you’ll see random SMTP timeouts and resets
that look like throttling. The provider gets blamed; your NAT box is quietly on fire.
Decision: If conntrack is saturated, reduce outbound connection churn (lower concurrency, raise reuse) and increase conntrack capacity where appropriate.
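Checking headroom is two file reads; raising the ceiling is one sysctl (262144 is an example value, size it to your memory budget and persist it in sysctl.d):
cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_count    # entries in use right now
cr0x@server:~$ cat /proc/sys/net/netfilter/nf_conntrack_max      # hard ceiling
cr0x@server:~$ sudo sysctl -w net.netfilter.nf_conntrack_max=262144   # raise the ceiling (example value)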
Adapt cleanly: rate limits, backoff, concurrency, and queue hygiene
Once you’re confident it’s upstream throttling, you have two goals:
deliver what you can without making the provider angrier, and keep your own systems stable while the backlog drains.
This is backpressure, not a sprint.
Principle 1: Separate classes of mail (transactional gets priority)
If you mix password resets with newsletters in the same queue, you deserve the outage you’re about to have.
Transactional mail has a user waiting on the other side. Bulk mail has a marketing calendar.
Those are not equal in the eyes of your incident commander.
Do it at the application layer (separate streams) or at the MTA (separate transports), but do it.
In Postfix, that typically means dedicated transport_maps routes and different concurrency settings.
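A minimal sketch of the MTA-side split, assuming bulk mail uses a dedicated envelope sender such as news@bulk.service.example (a hypothetical address) and that a "bulkout" smtp transport is declared in master.cf the same way "slowoutlook" is declared below:
cr0x@server:~$ sudo tee /etc/postfix/sender_transport >/dev/null <<'EOF'
news@bulk.service.example   bulkout:
EOF
cr0x@server:~$ sudo postmap /etc/postfix/sender_transport
cr0x@server:~$ sudo postconf -e "sender_dependent_default_transport_maps = hash:/etc/postfix/sender_transport"
cr0x@server:~$ sudo systemctl reload postfix
Transactional mail keeps the default transport; the bulk stream gets its own concurrency class that you can squeeze without touching password resets.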
Principle 2: Per-domain limits beat global limits
Global throttles are blunt instruments: they protect you, but they punish domains that would otherwise accept mail quickly.
Providers throttle per sending IP, per tenant, per sender, per destination, and per behavioral heuristic.
Your best match is per-domain concurrency and pacing.
Principle 3: Backoff should be exponential-ish, with jitter, and long enough to matter
MTAs already retry, but defaults can be wrong for modern throttling. If you retry too fast, you look like a botnet.
If you retry too slowly, transactional mail misses its usefulness window.
The trick is to apply different retry behavior for different classes and destinations.
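Postfix itself only exposes minimal_backoff_time and maximal_backoff_time and adds no jitter. If an application layer schedules its own resubmissions, the delay curve you want looks roughly like this sketch (all numbers illustrative):
cr0x@server:~$ cat retry_delay.sh
#!/usr/bin/env bash
# Print an exponential-ish retry delay with jitter for a given attempt number.
attempt=${1:-1}      # which retry this is (1, 2, 3, ...)
base=60              # first retry after roughly a minute
cap=3600             # never back off longer than an hour
delay=$(( base * 2 ** (attempt - 1) ))
(( delay > cap )) && delay=$cap
jitter=$(( RANDOM % (delay / 2 + 1) ))   # subtract up to 50% so retries don't line up across workers
echo $(( delay - jitter ))
Jitter matters because a fleet of workers retrying on the same schedule looks exactly like the burst that got you throttled in the first place.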
Concrete Postfix tuning patterns
You can do this cleanly without turning your MTA into a hand-tuned snowflake.
Start with per-domain concurrency and a small rate delay.
cr0x@server:~$ sudo postconf -e "smtp_destination_concurrency_limit = 10"
cr0x@server:~$ sudo postconf -e "smtp_destination_rate_delay = 1s"
cr0x@server:~$ sudo postconf -e "default_destination_concurrency_limit = 20"
cr0x@server:~$ sudo systemctl reload postfix
What it means: You cap concurrent deliveries per destination and add a minimum spacing between deliveries.
This reduces connection floods and smooths traffic.
Decision: If the provider is explicitly returning “too many connections” or “server busy,” this is your first lever.
For destinations that are particularly sensitive, define a dedicated transport with stricter limits.
cr0x@server:~$ sudo tee -a /etc/postfix/master.cf >/dev/null <<'EOF'
slowoutlook unix -       -       n       -       -       smtp
  -o syslog_name=postfix-slowoutlook
  -o smtp_connect_timeout=20s
EOF
cr0x@server:~$ sudo postconf -e "slowoutlook_destination_concurrency_limit = 2"
cr0x@server:~$ sudo postconf -e "slowoutlook_destination_rate_delay = 3s"
cr0x@server:~$ sudo tee /etc/postfix/transport >/dev/null <<'EOF'
outlook.com slowoutlook:
protection.outlook.com slowoutlook:
EOF
cr0x@server:~$ sudo postmap /etc/postfix/transport
cr0x@server:~$ sudo postconf -e "transport_maps = hash:/etc/postfix/transport"
cr0x@server:~$ sudo systemctl reload postfix
What it means: Only mail to those domains uses the throttled transport; other domains keep normal throughput. The concurrency and rate-delay knobs live in main.cf with the transport name as a prefix because they are scheduler parameters read by the queue manager; -o overrides on the delivery agent in master.cf don't reach it.
Decision: Use dedicated transports when one provider is the problem and you don’t want global slowdown.
Retry control: don’t let “deferred” become “DDoS with better grammar”
Postfix retry timing is controlled by queue manager parameters. You can adjust them, but do it carefully:
tuning retries affects every message, including the ones that would have succeeded quickly.
Prefer per-domain pacing and concurrency first.
cr0x@server:~$ sudo postconf -n | egrep "minimal_backoff_time|maximal_backoff_time|maximal_queue_lifetime"
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
maximal_queue_lifetime = 5d
What it means: Your MTA won’t hammer every minute; it backs off to about an hour-ish max between attempts.
If your minimal backoff is too small, you can self-amplify throttling.
Decision: If you see rapid repeated deferrals, raise minimal_backoff_time modestly, but don’t wreck transactional latency for all domains.
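If you do raise it, it's one line (600s is an example; go much higher and every domain pays the latency):
cr0x@server:~$ sudo postconf -e "minimal_backoff_time = 600s"
cr0x@server:~$ sudo systemctl reload postfix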
Queue hygiene: protect the system from the queue
A growing queue is not just “messages waiting.” It’s disk usage, inode pressure, and CPU time spent scanning queues.
Your goal is to keep the MTA responsive even while it’s delayed.
- Keep /var/spool/postfix on fast storage and monitor inode usage.
- Avoid huge single-queue bursts by controlling application send rate upstream.
- Prefer fewer simultaneous SMTP clients to one provider rather than constantly opening new connections.
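Two quick reads for the spool points above:
cr0x@server:~$ df -i /var/spool/postfix                                  # inode headroom on the queue filesystem
cr0x@server:~$ sudo find /var/spool/postfix/deferred -type f | wc -l     # rough count of deferred queue files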
Joke #2: If you stare at mail.log long enough, it starts staring back—and it always wants more disk.
Three corporate mini-stories from the email mines
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran its own Postfix for transactional mail. They added a new region and started sending from a fresh IP block.
The rollout went fine for a week, then password resets started arriving 30–90 minutes late for users on one major provider.
The on-call engineer saw a large queue and assumed the Postfix host was underpowered. They doubled CPU and added memory.
The queue kept growing. They upgraded the instance again. Still growing. They tuned file descriptor limits, then restarted Postfix during peak,
which briefly improved nothing and permanently ruined everyone’s confidence. Meanwhile, Gmail deliveries remained near-instant,
which should have been a clue but was ignored because “the queue is big.”
The real issue was right there in the logs: repeated 451 4.7.500 Server busy on MAIL FROM, only for that provider’s MX hosts.
They were tripping connection limits because the app had recently moved from sending one email per event to batching
and firing multiple parallel workers per customer. More events meant more simultaneous SMTP clients.
Once they clamped per-domain concurrency to 2 and added a rate delay, the backlog drained in a few hours.
The postmortem had a brutal line item: “We scaled the wrong thing because we didn’t segment by destination domain.”
The lesson wasn’t “buy bigger servers.” The lesson was: prove where the bottleneck is before you touch hardware.
Mini-story 2: The optimization that backfired
A large enterprise wanted to reduce SMTP overhead. Someone proposed “connection reuse maximization” and “higher concurrency”
to get better throughput to external recipients. On paper, it looked great: fewer TLS handshakes per message, more parallel delivery,
shorter queues. The change was rolled out gradually. Metrics improved. Everyone congratulated themselves.
Then a marketing campaign collided with an unrelated incident that caused a spike in password reset traffic.
The MTA did exactly what it was tuned to do: it opened and held lots of concurrent sessions to a handful of large providers.
Those providers interpreted the behavior as aggressive and started tarpitting during the SMTP greeting
and deferring at RCPT and DATA with 4.7.x codes.
The “optimization” created a nasty feedback loop. Long greeting delays meant connections were held longer.
Held connections meant fewer slots available. Fewer slots meant messages stayed queued longer.
Queued messages got retried. Retried messages opened more connections. You get the idea.
The queue went nonlinear. Not because the server was slow, but because the remote slowed it down and the MTA didn’t adapt.
The fix wasn’t to revert all tuning. It was to add per-domain controls and a policy:
bulk mail and transactional mail must not share the same concurrency class, and no provider gets unbounded parallelism.
They also added a “campaign kill switch” in the app layer. This is the unsexy part:
sometimes the best SMTP tuning is a button that stops your own traffic.
Mini-story 3: The boring but correct practice that saved the day
An org with strict change management ran outbound mail through a dedicated relay tier.
The relay hosts were not powerful. They weren’t fancy. But they were instrumented:
per-domain deferral counters, queue depth dashboards, and log sampling were part of the standard build.
Each domain class had a documented concurrency limit and a rationale.
One afternoon, a major provider started returning transient failures and slow banners. The queue began rising.
The on-call followed the runbook: confirm domain clustering, confirm remote 4.7.x deferrals, confirm local resources stable.
The dashboard showed that only that provider had rising deferrals; other domains were fine.
They flipped the transport for the provider to “slow lane” settings already present in config management.
Concurrency dropped. Rate delay increased. Queue growth stabilized within minutes.
Transactional mail to other providers continued normally, and the backlog drained once the provider recovered.
No heroics. No random parameter changes. No panic restarts. The boring practice was simply this:
predefine throttling behavior by destination, and ship observability with the MTA.
That’s how you get to sleep.
Common mistakes: symptom → root cause → fix
These are the failure modes that keep showing up because humans love simple stories.
SMTP does not reward simple stories.
1) Symptom: “Mail is delayed for everyone” (but only one provider is actually deferred)
Root cause: You’re applying a global throttle or the queue manager is spending most cycles on a single throttled domain.
Fix: Implement per-domain transports or per-domain concurrency limits. Keep fast domains fast.
2) Symptom: Lots of timeouts and “lost connection” errors
Root cause: Either remote tarpitting/greeting delays, or local conntrack/NAT exhaustion under connection churn.
Fix: Measure greeting times, check conntrack stats, and reduce connection concurrency. Favor steadier, fewer sessions.
3) Symptom: You see 451/4.7.x and someone suggests “retry faster”
Root cause: Misunderstanding of deferrals. Faster retries can increase throttling and extend recovery time.
Fix: Backoff with jitter. Lower per-domain concurrency. Stabilize first; drain later.
4) Symptom: Only large messages get delayed or deferred
Root cause: Provider size limits, content scanning load, or your own network/TLS path struggling with larger transfers.
Fix: Confirm remote replies (they often mention size), reduce attachment sizes, and consider alternate channels for heavy content.
5) Symptom: Gmail is fine, Microsoft is slow (or vice versa)
Root cause: Different provider policies, different reputation models, different connection limits.
Fix: Tune per provider. Don’t generalize from one domain’s behavior to “SMTP” as a whole.
6) Symptom: Queue grows after a release, but logs show “server busy” from provider
Root cause: Release changed sending pattern (burstiness, parallelism, batching) more than it changed raw volume.
Fix: Add upstream rate limiting in the app; smooth bursts. Keep MTA controls as the second line of defense.
7) Symptom: Providers defer at MAIL FROM and you suspect a bug in your MTA
Root cause: Policy decision at the receiver based on sender identity/IP reputation; MAIL FROM is a convenient early choke point.
Fix: Improve authentication (SPF/DKIM/DMARC), reduce spikes, and if needed, warm up IPs rather than “go big on day one.”
Checklists / step-by-step plan
Step-by-step: Diagnose and stabilize in one on-call shift
- Segment the problem by destination: identify top deferred domains from the queue.
- Read the remote replies: capture 10–20 representative deferral lines with timestamps and remote IPs.
- Confirm local health: disk space, iowait, memory pressure, conntrack, file descriptors.
- Measure handshake behavior: banner delay and STARTTLS results for the deferred provider.
- Clamp per-domain concurrency: start conservative (2–5) for the throttled provider.
- Add per-domain rate delay: seconds matter; start with 1–3 seconds for a problematic provider.
- Protect transactional mail: split transports or queues; prioritize reset/OTP flows over bulk.
- Reduce upstream burstiness: if the app can pause bulk mail, pause it; if it can slow, slow it.
- Watch the queue slope: stabilization means “queue stops growing,” not “queue is gone.” (A one-liner for tracking the slope follows this list.)
- Preserve evidence: log snippets, deferral codes, and time series. You’ll need them for provider escalation and internal accountability.
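For the queue-slope step, the last line of postqueue -p is enough to log a trend (interval and filename are arbitrary):
cr0x@server:~$ while sleep 60; do printf '%s ' "$(date +%F_%T)"; postqueue -p | tail -n 1; done | tee -a queue-slope.log
If the Requests number stops climbing after your changes, you have stabilized; draining comes later.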
Checklist: Evidence packet for “it’s the provider”
- Top deferred domains count (from queue parsing).
- At least 5 log lines showing dsn=4.7.x or 421/451 with the provider’s SMTP text.
- One verbose SMTP trace showing connect+EHLO success followed by deferral (Task 10).
- Handshake timing data (banner delay) for that provider vs at least one other provider.
- Local resource snapshots showing no saturation: disk, iostat, load, conntrack.
Checklist: Clean adaptation (without building a monster)
- Per-domain transport for known throttle-heavy providers.
- Conservative defaults for smtp_destination_concurrency_limit and smtp_destination_rate_delay.
- Separate bulk/transactional streams.
- Reasonable backoff settings; no rapid retry loops.
- Dashboards: queue depth, deferred rate by domain, delivery latency percentiles.
- Kill switch for bulk mail in application layer.
FAQ
1) Are 4xx errors always throttling?
No. 4xx means “temporary failure,” which can include throttling, greylisting, remote maintenance, or transient routing issues.
Throttling is likely when 4xx clusters by destination and includes 4.7.x language like “rate limited,” “server busy,” or “too many connections.”
2) How do I distinguish provider throttling from my own network issues?
If TCP connect and EHLO succeed and the receiver replies with 451/421/4.7.x, it’s strongly provider-side.
If you can’t connect, see many timeouts across many domains, or conntrack is saturated, you have local network pain.
3) Why does throttling sometimes show up as slow greetings instead of explicit 4xx?
Tarpitting is a tactic: delay the banner or responses to reduce your throughput without burning their own policy budget on explicit rejects.
From your side, it looks like “SMTP is slow.” Measure banner time to confirm.
4) Should I increase Postfix concurrency to drain the queue faster?
Not when the receiver is deferring. Higher concurrency usually increases the deferral rate, holds more sockets open,
and slows recovery. When you’re throttled, you drain by being boring and steady, not by being loud.
5) Is it better to send directly to MX or through a relay service?
Direct-to-MX gives you control and immediate visibility into receiver responses. A relay can provide better IP reputation management,
but it also becomes an extra choke point with its own quotas and policies. Pick one, instrument it, and know where the boundary is.
6) Can broken SPF/DKIM/DMARC cause throttling rather than bounces?
Yes. Some providers degrade acceptance behavior: more deferrals, slower acceptance, or heavier filtering when authentication is missing or inconsistent.
You may not get a clear “auth failed” error; you get “server busy” and a longer day.
7) How do I keep password resets fast during a throttling event?
Separate transactional mail into its own stream and transport, and cap bulk traffic aggressively.
Consider sending transactional mail from a warmed, stable identity (domain/subdomain/IP) distinct from marketing.
8) What’s the cleanest way to adapt without accumulating permanent hacks?
Use small, explicit config blocks: per-domain transports with documented limits, plus dashboards and a runbook.
Avoid random global tweaks. Prefer reversible changes with clear owners and measured impact.
9) When should I escalate to the provider?
Escalate when you have an evidence packet: representative SMTP replies, timestamps, sending IPs, and proof other domains work.
Providers respond better to structured data than to “our emails are slow.”
10) Can I “prove” throttling if the provider only times out and never sends 4xx?
You can build a strong case: show that connect succeeds but banner/commands stall only for that provider,
and that the same host delivers normally to others. It’s not as neat as a 451, but it’s still actionable evidence.
Conclusion: practical next steps
SMTP throttling is normal. Pretending it isn’t is how you end up with a queue so large it becomes a business event.
Your job is to prove where the delay lives, then shape traffic so the provider can accept it and your users don’t wait forever.
- Right now: segment deferrals by destination domain, capture remote replies, and confirm local health.
- Within the hour: clamp per-domain concurrency and add per-domain rate delay for the throttled provider.
- Before the next incident: separate transactional and bulk mail streams, add dashboards for deferred rate by domain, and ship a kill switch.
- For the postmortem: keep the evidence packet. It prevents “we think” from turning into “we guessed.”