You shipped a perfectly normal release. Then the outbound mail graph turns into a sawtooth, the queue climbs, and your logs start chanting: “421 too many connections”. Support calls it “email is slow.” Product calls it “email is broken.” Your SMTP provider calls it “working as designed.”
This error is not a moral failing. It’s flow control—sometimes from the remote server, sometimes self-inflicted by your own MTA. The trick is to tune concurrency so you stop getting kicked, without turning delivery into a polite handwritten letter.
What “421 too many connections” actually means
SMTP 421 is a temporary failure: “Service not available” in RFC-speak. In practice, it’s used as a pressure valve: “I’m not accepting this right now, try later.” The string “too many connections” is the clue that you exceeded some limit—often concurrent sessions from your IP, sometimes per account, sometimes per destination domain, sometimes per inbound IP reputation bucket.
Two common realities hiding behind the same error
- Remote throttling: Your MTA opens too many simultaneous SMTP sessions to the same provider/domain, and the remote side tempfails new sessions with 421. This is common with large mailbox providers, corporate gateways, and shared outbound relays.
- Local throttling: Your own MTA (or an intermediate relay/load balancer) limits connections, for example via Postfix's smtpd_client_connection_count_limit on inbound sessions, or via policy services that return 421 on purpose.
The operational consequence is the same: retries. Retries create queue growth. Queue growth creates more concurrency pressure unless you shape it. That’s the loop you’re here to break.
One quote worth keeping on a sticky note: “Hope is not a strategy.” — Gene Kranz. If you’re “waiting for it to clear,” you’re doing uncontrolled congestion management.
Joke #1: SMTP is the only protocol where “try again later” is considered a feature, not a bug.
Fast diagnosis playbook
When the pager is hot, you don’t need philosophy. You need three questions answered in order:
1) Is the 421 coming from the remote server, or from us?
Look at the SMTP transcript in logs. If Postfix says “host mx.remote.tld said: 421 …”, that’s remote throttling. If your own smtpd emits 421 to clients, that’s local throttling.
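A quick first split, assuming the Debian-style log path used throughout this article; the exact wording of local smtpd messages varies by Postfix version and configuration, so treat the second pattern as approximate:
cr0x@server:~$ sudo grep -c 'postfix/smtp\[.*said: 421' /var/log/mail.log
cr0x@server:~$ sudo grep -cE 'postfix/smtpd\[.*(421|limit exceeded)' /var/log/mail.log
A large first count with a near-zero second count means you are being throttled remotely; the reverse means your own relay (or something in front of it) is the one saying no.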
2) What’s saturating: connections, bandwidth, CPU, disk, or DNS?
Hitting connection limits is often just the symptom. The real bottleneck might be:
- Slow DNS causing long-lived SMTP sessions.
- TLS handshake overhead and CPU spikes.
- Disk I/O latency causing queue file contention.
- Provider-specific throttling per IP/subnet.
3) Are retries amplifying the problem?
If your retry policy is too aggressive (or concurrency too high), you create a retry storm: more sessions → more 421 → more queue → more sessions. You want a controlled drip, not a fire hose aimed at a closed door.
Fast actions that are usually safe
- Reduce per-destination concurrency (not global) for the affected provider/domain.
- Increase connection reuse (keep sessions open and send more per session).
- Back off retry schedule slightly for that destination so you don’t hammer their limits.
- Confirm DNS/TLS latency and disk I/O aren’t the hidden culprits.
Interesting facts and historical context (because systems have memories)
- SMTP predates the web: SMTP was standardized in the early 1980s, when “high traffic” meant something very different.
- 421 is intentionally vague: It’s a temporary failure class designed for “come back later,” not an exact diagnosis.
- Temporary failures are the backbone of email reliability: The store-and-forward model assumes networks fail and MTAs retry.
- Connection limits became normal with spam: Providers started throttling by IP because spammers made unlimited concurrency a liability.
- Postfix’s design is explicitly modular: Multiple daemons with controlled process limits, to avoid the “one big mailer” failure mode.
- Greylisting popularized controlled tempfails: Many systems intentionally used temporary failures to filter spam; legitimate MTAs retry.
- “Per destination” tuning exists because the internet is uneven: Some domains accept high parallelism; others melt at two connections.
- TLS made connection cost non-trivial: SMTP-over-TLS adds CPU and latency; connection reuse suddenly matters.
A working mental model: concurrency, sessions, and queues
Think of your outbound email system as a shipping dock:
- The queue is your warehouse.
- SMTP sessions are trucks at the loading bays.
- Concurrency is how many bays you open at once.
- Remote connection limits are the receiving warehouse saying “stop sending trucks; we can’t unload that many.”
When you get 421 too many connections, you’re sending more trucks than the other side wants. If you respond by “opening more bays” (increasing global concurrency), you’ll only create more rejected trucks. If you respond by “closing the dock entirely” (global throttling), you’ll delay everything, including mail that could have been delivered quickly to other domains.
So you want targeted shaping:
- Limit concurrency per destination (domain, MX, or relay host).
- Keep sessions efficient (send multiple messages per connection).
- Keep the queue healthy (avoid stampedes; prioritize fresh mail if needed).
Joke #2: Email is like traffic: adding lanes (concurrency) works until everyone decides to use the same exit (one big provider).
Hands-on tasks (commands + what the output means + what you decide)
These tasks assume Postfix on Linux. Adjust paths and service names if you’re on a different distro. Each task gives you: a command, example output, what it means, and the decision you make.
Task 1: Confirm the exact 421 text and whether it’s remote
cr0x@server:~$ sudo grep -R "421" /var/log/mail.log | tail -n 5
Jan 04 09:12:19 mx1 postfix/smtp[19822]: 9A1C12E3F4: to=<user@example.com>, relay=mx.example.com[203.0.113.20]:25, delay=12, delays=0.1/0.2/5.3/6.4, dsn=4.7.0, status=deferred (host mx.example.com[203.0.113.20] said: 421 4.7.0 Too many connections, try again later)
Jan 04 09:12:20 mx1 postfix/smtp[19825]: 0B3E99F1A2: to=<user@example.com>, relay=mx.example.com[203.0.113.20]:25, delay=10, delays=0.1/0.2/4.5/5.2, dsn=4.7.0, status=deferred (host mx.example.com[203.0.113.20] said: 421 4.7.0 Too many connections, try again later)
Meaning: “host … said” indicates the remote server returned 421. Your Postfix is behaving; you’re being throttled.
Decision: Focus on per-destination concurrency and connection reuse, not inbound limits.
Task 2: Identify which destinations are causing most deferrals
cr0x@server:~$ sudo postqueue -p | awk '/^[A-F0-9]/{id=$1} /^\(/{print id, $0}' | head
9A1C12E3F4 (host mx.example.com[203.0.113.20] said: 421 4.7.0 Too many connections, try again later)
0B3E99F1A2 (host mx.example.com[203.0.113.20] said: 421 4.7.0 Too many connections, try again later)
Meaning: Your backlog contains deferred mail with 421. It’s not random; it clusters.
Decision: Create a throttle policy for that domain/provider instead of slowing all outbound.
Task 3: Count deferred reasons at scale (top N)
cr0x@server:~$ sudo sed -n 's/.*status=deferred (//p' /var/log/mail.log | sed 's/)$//' | sort | uniq -c | sort -nr | head
418 host mx.example.com[203.0.113.20] said: 421 4.7.0 Too many connections, try again later
52 connect to mx.other.tld[198.51.100.10]:25: Connection timed out
11 host mx.third.tld[192.0.2.55] said: 450 4.2.0 Mailbox busy
Meaning: 421 dominates. Timeouts exist too, but they’re not the headline.
Decision: Address the 421 first; don’t get distracted by small noise unless it’s a different incident.
Task 4: Check how many SMTP client processes you’re running
cr0x@server:~$ ps -u postfix -o comm= | grep -c '^smtp$'
74
Meaning: You have 74 active outbound delivery processes. That can be fine, or it can be a problem if 60 of them are pounding one provider.
Decision: If one destination is throttling, reduce parallelism for that destination rather than cutting the global number blindly.
Task 5: See active connections to port 25/587 and who they’re going to
cr0x@server:~$ sudo ss -ntp '( dport = :25 or dport = :587 )' | head -n 15
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 10.0.0.10:53422 203.0.113.20:25 users:(("smtp",pid=19822,fd=12))
ESTAB 0 0 10.0.0.10:53426 203.0.113.20:25 users:(("smtp",pid=19825,fd=11))
ESTAB 0 0 10.0.0.10:53430 203.0.113.20:25 users:(("smtp",pid=19826,fd=13))
Meaning: Many connections to the same IP suggest you’re tripping per-IP or per-client limits.
Decision: Cap concurrency to that relay/destination and increase per-connection message throughput.
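If you want to see how those sessions spread across destinations, a rough aggregation of the same ss output works (IPv4-style peer addresses assumed; adapt the port stripping for bracketed IPv6):
cr0x@server:~$ sudo ss -nt '( dport = :25 or dport = :587 )' | awk 'NR>1 {print $5}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -nr | head
A count heavily skewed toward one peer IP is the signature of a per-destination problem: shape that destination, don't cut global concurrency.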
Task 6: Check Postfix per-destination concurrency defaults
cr0x@server:~$ sudo postconf | egrep "default_destination|smtp_destination|initial_destination"
default_destination_concurrency_limit = 20
smtp_destination_concurrency_limit = $default_destination_concurrency_limit
initial_destination_concurrency = 5
Meaning: You can hit 20 concurrent deliveries per destination by default (depending on version/config). That’s aggressive for some providers.
Decision: If the remote limit is lower (often is), you need a lower cap for that destination.
Task 7: Inspect transport map usage (are you already routing through a relay?)
cr0x@server:~$ sudo postconf | egrep "transport_maps|relayhost"
transport_maps =
relayhost =
Meaning: You’re delivering directly to recipient MX hosts. Throttling policies must be per destination domain/MX.
Decision: Consider transport maps for problematic destinations to apply per-transport limits cleanly.
Task 8: Check whether Postfix is reusing SMTP connections
cr0x@server:~$ sudo postconf | egrep "smtp_connection_cache|smtp_destination_recipient_limit|smtp_extra_recipient_limit"
smtp_connection_cache_destinations =
smtp_destination_recipient_limit = 50
smtp_extra_recipient_limit = 1000
Meaning: No destinations are explicitly listed. Postfix can still cache connections on demand (smtp_connection_cache_on_demand, enabled by default), but only when it detects a backlog; without explicit caching for a throttled destination, bursty traffic pays a new connection per delivery, raising concurrency pressure.
Decision: Enable connection caching for the throttled destination or relay to send more messages per session.
Task 9: Check retry and backoff behavior (are you retrying too aggressively?)
cr0x@server:~$ sudo postconf | egrep "minimal_backoff_time|maximal_backoff_time|queue_run_delay"
queue_run_delay = 300s
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
Meaning: Retries start at 5 minutes and grow to ~66 minutes. If you’ve tuned these lower, you can create a retry storm.
Decision: For a destination that says “too many connections,” backoff is your friend. Don’t set minimal backoff to 10 seconds and expect respect.
Task 10: Check for DNS latency (a sneaky concurrency multiplier)
cr0x@server:~$ time dig +tries=1 +timeout=2 mx example.com @127.0.0.53
;; ANSWER SECTION:
example.com. 1800 IN MX 10 mx.example.com.
real 0m0.018s
user 0m0.004s
sys 0m0.004s
Meaning: DNS is fast here. If this takes hundreds of milliseconds or times out, your SMTP processes sit idle waiting, increasing overlap and perceived concurrency.
Decision: If DNS is slow, fix resolver performance/caching before touching concurrency knobs.
Task 11: Check TLS handshake cost and failures
cr0x@server:~$ sudo grep -E "TLS|SSL" /var/log/mail.log | tail -n 5
Jan 04 09:12:18 mx1 postfix/smtp[19822]: Trusted TLS connection established to mx.example.com[203.0.113.20]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
Jan 04 09:12:19 mx1 postfix/smtp[19825]: Trusted TLS connection established to mx.example.com[203.0.113.20]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
Meaning: TLS is working and modern. If you see repeated handshake failures or slow negotiation, sessions last longer and pile up.
Decision: If TLS is flaky, stabilize it (ciphers, SNI, cert chain, MTU issues) before assuming “too many connections” is purely a policy limit.
Task 12: Check disk latency and queue directory health
cr0x@server:~$ iostat -xz 1 3
Linux 6.2.0 (mx1) 01/04/2026 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.12 0.00 1.44 0.80 0.00 94.64
Device r/s w/s rkB/s wkB/s %util await
nvme0n1 3.00 55.00 120.0 2400.0 6.40 1.20
Meaning: Low await and util. Disk isn’t choking. If await spikes into tens/hundreds of ms, Postfix queue operations slow down and your delivery pattern becomes bursty—exactly what triggers remote throttles.
Decision: If disk is the bottleneck, fix storage (I/O isolation, faster media, filesystem tuning) before you “tune concurrency.”
Task 13: Check Postfix’s own rate limiting counters (anvil)
cr0x@server:~$ sudo postconf | egrep "smtpd_client_connection_count_limit|smtpd_client_connection_rate_limit|anvil_rate_time_unit"
smtpd_client_connection_count_limit = 50
smtpd_client_connection_rate_limit = 0
anvil_rate_time_unit = 60s
Meaning: These are inbound controls, but they matter if you’re a relay for internal apps. If your apps get 421 from your Postfix, these settings are suspects.
Decision: If internal clients hit your limits, scale horizontally or enforce client pooling; don’t just crank this to infinity.
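When your own anvil is the limiter, Postfix periodically logs the peak offenders, which saves you from guessing; log path assumed as before:
cr0x@server:~$ sudo grep 'statistics: max connection' /var/log/mail.log | tail -n 3
Lines like "statistics: max connection count ... for (smtp:10.0.0.5)" tell you which client IP hit the ceiling and when, so you can fix that client instead of raising the limit for everyone.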
Task 14: Confirm whether the affected traffic is concentrated in one sender/app
cr0x@server:~$ sudo pflogsumm -d today /var/log/mail.log | head -n 30
Grand Totals
------------
messages
12450 received
12110 delivered
310 deferred (2.4%)
...
Per-Sender Totals
-----------------
8200 noreply@service.local
2400 billing@service.local
Meaning: One sender dominates. If “noreply@service.local” is a bulk sender, its bursts can drown transactional mail.
Decision: Split traffic by transport (transactional vs bulk), give transactional higher priority, and throttle bulk harder.
Task 15: Verify queue size and age distribution
cr0x@server:~$ sudo find /var/spool/postfix/deferred -type f | wc -l
1832
Meaning: Deferred queue count is a fast proxy for trouble. The real risk is age: if mail sits long enough, you violate SLAs and confuse users.
Decision: If deferred count is rising and ages are climbing, you need throttling + prioritization now, not later.
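For the age side, postqueue -j (Postfix 3.1+) emits one JSON object per queued message with its arrival time; a rough sketch, assuming jq is installed:
cr0x@server:~$ sudo postqueue -j | jq -r 'select(.queue_name=="deferred") | .arrival_time' | awk -v now="$(date +%s)" '{h=int((now-$1)/3600); c[h]++} END {for (b in c) print b" hours old:", c[b]}' | sort -n
A fat bucket at zero hours is a burst working its way out; fat buckets at four or eight hours are the ones that turn into SLA conversations.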
Task 16: Inspect one stuck message’s headers (routing, recipient, transport)
cr0x@server:~$ sudo postcat -q 9A1C12E3F4 | sed -n '1,80p'
*** ENVELOPE RECORDS ***
message_size: 4289
recipient: user@example.com
sender: noreply@service.local
*** MESSAGE CONTENTS ***
Received: by mx1 (Postfix, from userid 1001)
Date: Sat, 04 Jan 2026 09:11:58 +0000
Subject: Your login code
Meaning: This shows sender, recipient, and basic metadata. Useful to confirm whether this is transactional mail that must not be delayed.
Decision: If this is transactional, route it through a higher-priority transport with safer concurrency and better reputation controls.
Tuning Postfix the right way (without delaying mail)
Principle 1: Do not “fix” a per-destination problem with a global throttle
Global changes like lowering default_destination_concurrency_limit can reduce the 421s, sure. They also slow delivery to everyone, including domains that were perfectly happy. That’s how you turn one provider’s policy into your universal outage.
Principle 2: Make fewer, better connections
Remote limits are often based on concurrent connections. If you reuse connections and send multiple recipients/messages per session, you reduce connection churn and still move volume.
Concrete knobs that usually matter
1) Per-destination concurrency limits (the big lever)
In Postfix, you can cap concurrency per destination. If you’re delivering directly, “destination” is typically the domain (and sometimes MX host, depending on how Postfix groups deliveries). The safest pattern in corporate setups is to use transport maps to create explicit “channels” for big providers.
Example approach: route a specific domain (or a set of domains) through a named transport, then apply limits to that transport in master.cf.
cr0x@server:~$ sudo postconf -e "transport_maps = hash:/etc/postfix/transport"
cr0x@server:~$ sudo postconf -n | egrep "transport_maps"
transport_maps = hash:/etc/postfix/transport
cr0x@server:~$ sudo tee /etc/postfix/transport > /dev/null <<'EOF'
example.com    slowmx:
example.net    slowmx:
EOF
cr0x@server:~$ sudo postmap /etc/postfix/transport
Then define the transport in /etc/postfix/master.cf; the concurrency cap itself is a per-transport parameter in main.cf, named after the transport. (This is the pattern; adjust values to your environment; a concrete sketch follows below.)
cr0x@server:~$ sudo postconf -M | grep -E "^slowmx"
slowmx     unix  -       -       n       -       -       smtp
Meaning: A dedicated transport lets you set different concurrency and behavior for a subset of mail. You’re no longer treating all destinations as equals.
Decision: If one provider is throttling, isolate it into its own transport and tune it independently.
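A minimal sketch of what that looks like end to end; the transport name slowmx matches the transport map above, and the numbers are illustrative, not recommendations. The per-transport concurrency parameters live in main.cf, named after the transport, because the queue manager reads them, not the smtp client:
cr0x@server:~$ sudo postconf -M slowmx/unix="slowmx unix - - n - - smtp"
cr0x@server:~$ sudo postconf -e "slowmx_initial_destination_concurrency = 1"
cr0x@server:~$ sudo postconf -e "slowmx_destination_concurrency_limit = 2"
cr0x@server:~$ sudo postfix reload
After the reload, mail routed to slowmx by the transport map never opens more than two concurrent sessions per destination, regardless of what the rest of the system does.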
2) Connection caching (connection reuse)
Connection caching keeps an SMTP session open for reuse. That means fewer TCP handshakes, fewer TLS handshakes, fewer chances to collide with “max connections” policies.
cr0x@server:~$ sudo postconf -e "smtp_connection_cache_destinations = example.com example.net"
Meaning: Postfix will cache connections for these destinations when possible.
Decision: Enable caching for throttled destinations, especially when your traffic is bursty (password resets, notifications, invoices).
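The cache has companions worth knowing about; the values below are the documented defaults, so verify with postconf on your own version rather than trusting this article:
cr0x@server:~$ sudo postconf smtp_connection_cache_on_demand smtp_connection_cache_time_limit smtp_connection_reuse_time_limit
smtp_connection_cache_on_demand = yes
smtp_connection_cache_time_limit = 2s
smtp_connection_reuse_time_limit = 300s
On-demand caching only kicks in when Postfix notices a backlog; listing a destination explicitly in smtp_connection_cache_destinations makes reuse happen even for steady, non-backlogged traffic.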
3) Balance recipient limits per connection
Providers often accept a single session with multiple deliveries but reject multiple parallel sessions. Make each session do more work, within reason.
cr0x@server:~$ sudo postconf -n | egrep "smtp_destination_recipient_limit"
smtp_destination_recipient_limit = 50
Meaning: Up to 50 recipients per destination per delivery request, depending on queue structure and batching. Too high can cause large transactions that fail mid-session; too low increases session count.
Decision: If you’re hitting connection limits and messages are small, modestly increasing batching can reduce sessions. Don’t crank it to 1000 and then act surprised when one bad recipient slows the whole batch.
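If you want larger batches only on the throttled channel rather than globally, the per-transport naming pattern from the concurrency sketch applies here too; the value is a placeholder to tune:
cr0x@server:~$ sudo postconf -e "slowmx_destination_recipient_limit = 20"
cr0x@server:~$ sudo postfix reload
This only changes batching for mail routed through the slowmx transport; everything else keeps the global setting.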
4) Make retries less stupid for the throttled destination
If the remote side is saying “too many connections,” retrying in 30 seconds is basically arguing with a firewall. Shape your retries so your system naturally stops stampeding.
cr0x@server:~$ sudo postconf -n | egrep "minimal_backoff_time|maximal_backoff_time"
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
Meaning: This is global. Global changes affect all destinations.
Decision: Prefer per-transport control (separate queue/transport class) if you need special retry behavior for one provider; keep global retry sane and boring.
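What Postfix does offer per transport is pacing: a delay between successive deliveries to the same destination. A sketch with an illustrative value; note that a non-zero rate delay effectively serializes deliveries to each destination, so it is a tool for genuinely strict receivers, not a default:
cr0x@server:~$ sudo postconf -e "slowmx_destination_rate_delay = 2s"
cr0x@server:~$ sudo postfix reload
Combined with a low concurrency limit, this turns a stampede into a steady drip without touching the global backoff timers.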
What not to do (unless you like recurring incidents)
- Don’t increase global concurrency to “push through.” 421 is not a throughput problem; it’s a coordination problem.
- Don’t disable TLS to speed up handshakes. You’ll trade a queue issue for a security incident and reputation damage.
- Don’t “fix” it by adding more MTAs behind the same NAT IP. Remote limits are usually per source IP. You’ll just parallelize rejection.
When the remote side is throttling you
Remote throttling is often policy-driven. The remote side can cap:
- Concurrent sessions per sending IP
- Messages per minute/hour per IP or per authenticated account
- Recipients per message
- Connections per minute (not just concurrent)
- Rate by reputation, complaint rate, bounce rate, or content category
How to tell “policy limit” from “remote overload”
Policy limits tend to be consistent and clean: you hit a ceiling and the error repeats predictably. Remote overload tends to correlate with time-of-day, remote maintenance windows, or sporadic increases in latency/timeouts.
Look for these patterns in your logs:
- Many fast deferrals with 421 immediately after connect: classic policy limit.
- Long delays then 421: remote may accept but then shed load; could be their system under strain or your TLS/DNS slowness causing long sessions.
- Mix of 421 + timeouts: could be network issues or remote rate limiting plus degradation.
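The delays= field in each deferral line records where the time went (before the queue manager, in the queue manager, connection setup including DNS and TLS, message transmission), so you can measure this instead of eyeballing it; a rough sketch against the standard Postfix log format:
cr0x@server:~$ sudo grep 'said: 421' /var/log/mail.log | grep -o 'delays=[0-9./]*' | awk -F'[=/]' '{pre+=$2+$3; conn+=$4; xfer+=$5; n++} END {if (n) printf "avg queue: %.1fs  connect/TLS: %.1fs  transfer: %.1fs  (n=%d)\n", pre/n, conn/n, xfer/n, n}'
Small connect and transfer averages with lots of 421s point at a clean policy ceiling; large connect averages point at DNS/TLS slowness or deliberate tarpitting.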
Operational play: isolate, shape, and keep transactional mail moving
In real shops, you often have two classes of mail:
- Transactional: password resets, MFA codes, receipts. Users notice seconds.
- Bulk/engagement: newsletters, reminders, campaigns. Users notice… eventually, maybe.
When you’re being throttled, your best move is to split transports and apply different concurrency and retry strategies. Transactional gets the clean lane: stable concurrency, connection reuse, conservative retry, and strong reputation hygiene. Bulk gets throttled harder and scheduled.
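One way to carve those lanes without touching application code is sender-dependent routing (Postfix 2.7+). A sketch: the bulk and transactional transports are hypothetical master.cf services, each a clone of smtp like the slowmx example above, with their own concurrency, caching, and pacing:
cr0x@server:~$ sudo postconf -e "sender_dependent_default_transport_maps = hash:/etc/postfix/sender_transport"
cr0x@server:~$ sudo tee /etc/postfix/sender_transport > /dev/null <<'EOF'
noreply@service.local        bulk:
billing@service.local        transactional:
EOF
cr0x@server:~$ sudo postmap /etc/postfix/sender_transport
cr0x@server:~$ sudo postfix reload
The sender addresses here are the ones from the pflogsumm output earlier; swap in your own, and remember that anything not listed falls back to the normal default transport.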
Failure modes that look like concurrency (but aren’t)
DNS slowness: the silent session extender
If DNS lookups for MX records or reverse DNS checks are slow, each delivery attempt takes longer, which increases the number of overlapping attempts. You didn’t increase concurrency; time did.
Disk contention: queue operations become bursty
On overloaded disks, Postfix can’t efficiently move queue files. Delivery processes get scheduled in clumps. Clumps create bursts. Bursts trigger throttling. You see 421 and blame the provider, but your storage is the one throwing elbows.
Broken connection reuse assumptions
Some NAT gateways, firewalls, or middleboxes don’t like long-lived SMTP sessions. If they reset idle connections aggressively, caching doesn’t work and you revert to “connect storm” behavior.
IPv6 vs IPv4 asymmetry
Providers can enforce different limits on IPv6 ranges. If your MTA flips between families, you can see inconsistent behavior. Sometimes the “fix” is to make the path consistent, not necessarily faster.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company had a tidy architecture: app servers submitted mail to an internal relay, which delivered directly to the internet. It worked for years, mostly because traffic was modest and predictable.
Then the business launched a “security improvement”: every login required a one-time code by email. The team assumed email would behave like their HTTP APIs—scale out, increase concurrency, watch throughput rise. They did what web people do: increased worker count. More SMTP client processes, more parallel delivery attempts.
Within an hour, the big mailbox providers started sending back 421 too many connections. The queue ballooned. Users requested more codes because they weren’t arriving, which generated more mail, which generated more retries. Classic feedback loop, now with customer anxiety layered on top.
The wrong assumption wasn’t “providers are mean.” It was treating SMTP like their HTTP APIs. SMTP is a push protocol where the receiver sets the rules; concurrency is negotiated socially, not technically.
The fix was boring: separate transports for the major providers, cap per-destination concurrency to a level that stayed under their limits, and enable connection caching. Login codes started arriving faster because the system stopped wasting effort on rejected sessions.
Mini-story 2: The optimization that backfired
A different org wanted faster delivery metrics. Someone noticed that queued mail often sat for a few minutes, so they tuned Postfix to run the queue more frequently and reduced backoff times. The graphs looked great for a week.
Then they onboarded a new customer with a corporate mail gateway that was strict about connection counts. The MTA began hitting 421. With reduced backoff, the MTA retried aggressively. The gateway interpreted that as abusive behavior and tightened throttles further.
Delivery went from “sometimes delayed” to “consistently delayed.” Worse, the aggressive retries increased connection churn, raising TLS handshakes and CPU, which slowed other deliveries. Internal monitoring screamed “SMTP errors” while product screamed “user signups down.”
The optimization failed because it treated temporary failures as something to bulldoze. In SMTP land, polite backoff is not optional; it’s how you avoid self-inflicted denial-of-service.
They rolled back the global retry tweaks, created a separate transport for that customer’s domain with strict concurrency (and slightly longer backoff), and the rest of the system recovered immediately.
Mini-story 3: The boring but correct practice that saved the day
One team had a habit: every time they added a new mail source, they tagged it with a unique envelope sender and logged submission identity. Not fancy. Just consistent. Most engineers considered it “extra process.”
During a 421 incident, that practice became gold. They quickly determined that 90% of the throttled traffic was coming from one newly deployed batch job sending account reconciliation emails. It was supposed to run hourly. It ran every five minutes because of a scheduler misconfiguration.
They didn’t touch Postfix at first. They paused the batch job, let the queue drain, and then reintroduced the traffic with a dedicated transport and strict per-destination concurrency. Transactional mail got its own lane with connection caching and stable limits.
Because they had clean attribution, they avoided the common disaster: “tune the mail server until the symptom disappears,” while the real problem keeps generating pressure. The queue cleared, the throttles stopped, and nobody had to explain to compliance why they disabled TLS in a panic.
Common mistakes: symptoms → root cause → fix
1) Symptom: “421 too many connections” everywhere
Root cause: You changed global concurrency and now multiple providers throttle you, or you’re behind one shared IP that’s reputation-limited.
Fix: Revert global changes. Split transports by major destination groups. If you use a smarthost, ensure you’re not exceeding its per-account limits.
2) Symptom: Queue grows, but CPU and network look fine
Root cause: You’re being rejected quickly (fast 421 deferrals), so you’re not consuming resources—just generating retries.
Fix: Lower per-destination concurrency and enable connection caching; adjust batching so each successful session carries more mail.
3) Symptom: 421 spikes during peak hours only
Root cause: Bursty sending (cron jobs, marketing blasts) colliding with provider time-based limits.
Fix: Rate-shape bulk mail; schedule bursts; implement per-transport limits and prioritize transactional traffic.
4) Symptom: You lowered concurrency and now mail is slow to all domains
Root cause: You applied a global throttle instead of isolating the destination that complained.
Fix: Restore global defaults. Use transport maps/master.cf to apply strict limits only to the affected domain/provider.
5) Symptom: Lots of connections, few messages delivered per connection
Root cause: Connection caching disabled, or middleboxes killing idle connections; also possible that your queue structure prevents batching.
Fix: Enable caching for those destinations, verify firewall idle timeouts, and ensure your MTA can batch recipients appropriately.
6) Symptom: 421 accompanied by timeouts and “lost connection”
Root cause: Network path issues, MTU blackholes, or remote instability; sometimes you’re being tarpitted due to reputation.
Fix: Validate network health (packet loss, retransmits), verify TLS stability, and reduce concurrency while you investigate reputation signals.
7) Symptom: Internal apps get 421 when submitting mail to your relay
Root cause: Your own inbound connection limits (anvil) are protecting you from a noisy client.
Fix: Fix the client (pooling, fewer parallel submits), or scale relays horizontally; raise limits only if you’re sure the relay can handle it.
Checklists / step-by-step plan
Step-by-step: fix 421 without delaying mail
- Classify the 421 source. Confirm remote vs local from logs. If remote, proceed. If local, review your smtpd_* limits and policy services.
- Rank destinations by pain. Count top deferral reasons and identify which domains/providers dominate.
- Protect transactional mail first. If you mix bulk and transactional, split them now. Same server is fine; same queue behavior is not.
- Create a dedicated transport for the throttled destination(s). Transport maps give you a lever you can pull without collateral damage.
- Lower concurrency on that transport. Start conservative. If the provider allows 2–5 concurrent sessions, do 2–3 and measure.
- Enable connection caching for that destination/transport. Fewer sessions, more payload per session.
- Validate DNS and TLS latency. Slow lookups and handshakes inflate session time and perceived concurrency.
- Watch queue age, not just size. You can tolerate a larger queue if it drains steadily and mail stays young.
- Adjust batching carefully. Increase recipients per session modestly if messages are small and failures are rare.
- Verify that retries calm down. The goal is fewer rejected sessions and more successful deliveries, not “fewer log lines.”
- After stabilization, document the limit. Write down the effective concurrency cap that works. Future you will forget.
- Add monitoring around deferrals by destination. Catch the next throttle before it becomes a queue mountain.
Operational checklist: what to monitor continuously
- Deferred queue size and age distribution
- Top deferral reasons (421 vs timeouts vs 5xx)
- Outbound connections by destination IP/domain
- SMTP delivery rate (delivered/min) split by transport
- DNS resolver latency and error rate
- Disk I/O latency on the MTA spool filesystem
- CPU utilization and TLS handshake rate
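Most of these start life as a cron'd one-liner long before they become dashboards. A rough sketch for "deferrals by destination", log path assumed as before:
cr0x@server:~$ sudo grep 'status=deferred' /var/log/mail.log | grep -o 'relay=[^,]*' | sort | uniq -c | sort -nr | head -n 5
Pipe that into whatever alerting you already have; the point is to see one relay climbing the ranking before the queue graph tells you the same story an hour later.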
FAQ
1) Is 421 a permanent failure?
No. It’s a temporary failure. Your MTA should defer and retry. The question is whether your retry behavior is controlled or chaotic.
2) Why do providers throttle connections instead of messages?
Connections are expensive: TCP state, TLS handshakes, and per-session scanning. Limiting concurrent sessions is an effective way to protect infrastructure from bursts and abuse.
3) Should I just lower default_destination_concurrency_limit?
Only if you enjoy slowing down mail to destinations that weren’t complaining. Use per-transport or per-destination tuning so one provider’s cap doesn’t become your global cap.
4) How low should per-destination concurrency be?
Low enough to stop 421s, high enough to meet delivery goals. Start with 2–5 for strict providers, then increase slowly while watching deferrals and queue age.
5) Will enabling connection caching always help?
Often, yes—especially with TLS. But it can be undermined by firewalls that kill idle sessions. If caching doesn’t reduce new connections, check middlebox idle timeouts.
6) Why did reducing concurrency make some emails arrive faster?
Because you stopped wasting attempts. Successful sessions deliver mail; rejected sessions generate retries and churn. Less concurrency can mean more useful throughput.
7) What if the 421 happens when apps connect to our relay (inbound), not outbound?
Then it’s your relay limiting client connections (or a policy service). Investigate Postfix anvil settings, client behavior, and whether one app is opening hundreds of parallel SMTP sessions.
8) Can I fix this by adding more MTAs?
Only if you also add more source IPs and manage reputation responsibly. If everything exits via one NAT IP, more MTAs just mean more rejected connections per second.
9) Does IPv6 change anything?
Yes. Some providers apply different policies to IPv6 ranges. Ensure your outbound path is consistent and that reverse DNS and reputation are correctly set for whichever family you use.
10) How do I prevent bulk mail from starving transactional mail?
Separate transports (or even separate relays), apply strict throttles to bulk, and keep transactional mail in a prioritized path with predictable limits and caching.
Conclusion: practical next steps
“421 too many connections” is congestion control wearing a trench coat. Your job is to stop treating it like a mystery and start shaping traffic like an adult system.
- Confirm whether the 421 is remote or local from the log wording.
- Identify the top destination(s) causing deferrals.
- Split traffic: transactional vs bulk, and isolate big providers into dedicated transports.
- Lower per-destination concurrency for the throttled group and enable connection caching.
- Validate DNS, TLS, and disk I/O so you’re not amplifying concurrency by accident.
- Measure queue age and successful deliveries, not just the absence of errors.
Do that, and you’ll stop playing whack-a-mole with 421s. Mail won’t just “eventually deliver.” It will deliver on time, with fewer connections, and with less drama—which is the highest compliment production systems can earn.