Email SMTP 4xx Temporary Failures: Top Causes and Fixes That Actually Work

You shipped the email. Or at least you thought you did. Then your on-call phone starts vibrating like it’s trying to escape: “Orders aren’t sending,” “Password resets missing,” “Invoices delayed.” You open the logs and see the same smug word repeated everywhere: deferred. Alongside it: 4xx SMTP codes.

SMTP 4xx is the email universe saying, “Not now, try again later.” Sometimes that’s reasonable. Often it’s a warning flare that something in your stack—or the receiver’s—has started to buckle. The good news: 4xx failures are usually diagnosable with discipline and a few commands. The bad news: if you guess, you will guess wrong, and your queue will happily grow until it becomes a second job.

What SMTP 4xx temporary failures actually mean

SMTP reply codes are blunt instruments with a deceptively polite veneer. A 4xx response is a transient failure: the sender is expected to retry later. That’s not a promise that later will work—just a request that your MTA stop hammering.

The practical difference between 4xx and 5xx

  • 4xx: “Try again.” Your MTA queues the message, schedules retries, and eventually gives up after some TTL (Postfix’s default maximal_queue_lifetime is 5 days, but verify your config).
  • 5xx: “Stop.” The receiver says it’s permanently rejected (bad recipient, policy rejection, hard bounce). Retrying is usually wasted effort unless the sender fixes something first.

Enhanced status codes matter more than the three digits

Modern MTAs usually include an “enhanced” status code like 4.7.0 or 4.4.1. That second code is where meaning hides. Examples:

  • 421 4.7.0: typically throttling, temporary policy blocks, or “go away.”
  • 450 4.2.0: mailbox unavailable, often temporary or greylisting.
  • 451 4.3.0: local processing error; could be resource exhaustion.
  • 452 4.3.1: insufficient system storage; or recipient says “over quota.”
  • 454 4.7.0: TLS not available due to temporary reason (or policy negotiations).

One operational rule: treat 4xx as a capacity and correctness signal. Even if it’s “the other side,” your system is the one accumulating deferred mail, alert noise, and reputational risk.

Paraphrased idea attributed to Gene Kranz: stay calm, work the problem, and act from the data. It’s not poetry, but it wins incidents.

Joke 1: SMTP 4xx is like a meeting invite that says “tentative”—you’re not rejected, you’re just stuck in calendar purgatory.

Fast diagnosis playbook (first/second/third)

When the queue grows, you don’t have time to meditate on RFCs. You need to find the bottleneck with a sequence that minimizes wandering.

First: identify the failure pattern (receiver-side vs your-side)

  1. Is it one domain or many? If it’s one big provider, think throttling/greylisting/policy. If it’s many, think DNS, network, local resources, broken config, or a shared dependency.
  2. Is it one sending host or all MTAs? If only one server is impacted, suspect that host’s IP reputation, routing, or local resource exhaustion.
  3. Is it time-correlated? Spikes at the top of the hour? Sounds like batch jobs hitting rate limits. Spikes during backups? Sounds like I/O and latency.

Second: read the exact SMTP text, not your feelings

Pull sample log lines and group by the remote response string. The free money is in noticing “4.7.0 try again later” versus “4.4.1 connection timed out” versus “451 4.3.0 internal error.” Each points to a different layer.

Third: check queue health and system health in parallel

Queue depth alone isn’t the problem; it’s a symptom. You want to answer:

  • Are we attempting deliveries at a normal rate, or are we stuck (DNS broken, network broken, process wedged)?
  • Are we getting deferred by policy (throttling/greylisting), requiring backoff or warmup?
  • Are we resource bound (disk full, inode exhaustion, I/O latency, memory pressure)?

The fastest triage matrix

  • 4.4.x timeouts → network, DNS, routing, firewall, remote outage.
  • 4.7.x policy/throttling → rate limits, reputation, authentication, content patterns.
  • 4.3.x local processing → storage, permissions, queue corruption, MTA process health.
  • 4.2.x mailbox → recipient issues or greylisting; often domain-specific.

Top causes of SMTP 4xx failures (and real fixes)

1) Receiver throttling and rate limits (421/450/451/4.7.x)

Big providers defend themselves with throttles. They don’t want you to send too fast, from too many connections, or with patterns that resemble spam. The result is often 421 4.7.0, 451 4.7.x, or vague “Try again later.”

Fixes that work:

  • Reduce concurrency to that domain: fewer parallel connections, fewer recipients per connection.
  • Warm up new IPs instead of blasting full volume on day one.
  • Separate traffic classes (transactional vs bulk). Let bulk take the throttling hit, not password resets.
  • Use sane retry/backoff. Aggressive retries can look like abuse.
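
For the concurrency and rate-delay changes above, here is a minimal Postfix sketch. Assumptions: the transport name "slowsmtp" is invented for illustration, gmail.com stands in for whichever domain is throttling you, and you should mirror the columns of your existing smtp entry in master.cf.

# /etc/postfix/master.cf: add a dedicated delivery transport (hypothetical name)
slowsmtp  unix  -       -       n       -       -       smtp

# /etc/postfix/main.cf: route the heavy domain through it, then throttle only that transport
transport_maps = hash:/etc/postfix/transport
slowsmtp_destination_concurrency_limit = 3
slowsmtp_destination_rate_delay = 1s
slowsmtp_destination_recipient_limit = 20

# /etc/postfix/transport: then run postmap /etc/postfix/transport && postfix reload
gmail.com    slowsmtp:

The point of a separate transport is blast-radius control: the slow settings apply only to the domain that is throttling you, while everything else keeps the defaults.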

2) Greylisting (450 4.2.x / 451 4.7.x)

Greylisting is a deliberate “go away and come back” tactic. It relies on the assumption that legitimate MTAs retry, while many spam bots don’t. You’ll see a temporary failure on first attempt; later attempts succeed.

Fixes that work:

  • Don’t fight it with panic. Ensure your retry schedule is correct and not disabled.
  • Keep your sending IP stable. Greylisting keys often include IP.
  • Make sure your MTA isn’t behaving like a bot: correct HELO/EHLO, valid reverse DNS, consistent TLS behavior.
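
If you want to verify the “retry schedule is correct” part, the Postfix knobs are easy to read (the values below are the stock defaults; yours may differ):

cr0x@server:~$ postconf minimal_backoff_time maximal_backoff_time queue_run_delay
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
queue_run_delay = 300s

With defaults like these, greylisted mail typically clears on the second or third attempt; if someone has cranked minimal_backoff_time up to hours, greylisting turns into real delay.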

3) DNS problems: lookup timeouts, SERVFAIL, broken resolvers (451 4.4.3, “Temporary lookup failure”)

SMTP depends heavily on DNS: MX lookups, A/AAAA records, PTR, SPF, DKIM selectors, sometimes RBLs and policy checks. If your resolver is slow, misconfigured, or blocked, you get transient failures that look like remote issues but are actually local.

Fixes that work:

  • Use reliable resolvers locally (systemd-resolved misbehavior is a recurring villain).
  • Monitor DNS latency and SERVFAIL rate, not just “is DNS up.”
  • Cache responsibly. Don’t cache broken answers forever.
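
A quick-and-dirty probe of resolver health from the mail host (127.0.0.53 assumes the systemd-resolved stub; point it at whatever your MTA actually uses, and treat the numbers below as illustrative):

cr0x@server:~$ for i in 1 2 3; do dig +tries=1 +time=2 MX gmail.com @127.0.0.53 | egrep 'status:|Query time:'; done
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23817
;; Query time: 1 msec
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40233
;; Query time: 1203 msec
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8791
;; Query time: 2 msec

Intermittent SERVFAILs or query times in the hundreds of milliseconds are enough to show up as “temporary lookup failure” and 4.4.x deferrals once the MTA is doing thousands of lookups an hour.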

4) TLS negotiation trouble (454 4.7.0, handshake failures, opportunistic TLS edge cases)

TLS is mostly boring until it isn’t. A temporary TLS failure can come from cipher mismatches, certificate validation hiccups (for enforced TLS), SNI bugs, or middleboxes that “optimize” by breaking things.

Fixes that work:

  • Log SMTP TLS details at the MTA and capture one failing transaction.
  • Prefer modern TLS, but keep compatibility if you must talk to old endpoints.
  • If you enforce TLS for certain domains, make sure your CA store and SNI handling are correct.
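
To see what the MTA itself negotiates (not just what a manual test sees), raise Postfix’s client-side TLS logging. A sketch; the log line is an example of the format, not a prediction:

cr0x@server:~$ sudo postconf -e 'smtp_tls_loglevel=1' && sudo postfix reload
cr0x@server:~$ sudo grep 'TLS connection established' /var/log/mail.log | tail -n 1
Jan  3 11:15:02 server postfix/smtp[3102]: Untrusted TLS connection established to alt1.gmail-smtp-in.l.google.com[142.250.102.26]:25: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)

“Untrusted” here just means the peer certificate wasn’t verified against a CA, which is normal for opportunistic TLS; what matters during a 454 incident is whether these lines stop appearing for a given destination.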

5) Storage and queue filesystem trouble (451 4.3.0, 452 4.3.1, “cannot write file”)

Your MTA is a storage system wearing a mail hat. The queue is effectively a database of messages, and databases hate slow disks, full disks, inode exhaustion, and sloppy permissions.

Fixes that work:

  • Check disk space, inodes, and I/O latency.
  • Put the queue on reliable storage. If you share a volume with noisy neighbors, expect pain.
  • Watch out for backup snapshots that stall I/O or fill volumes.

6) Process/resource exhaustion (451 4.3.0, “out of memory,” too many open files)

MTAs aren’t huge, but under load they can hit file descriptor limits, RAM pressure, or process table limits. Symptoms include deferrals due to local errors, slow delivery, and weird intermittent behavior.

Fixes that work:

  • Increase ulimit -n and system limits appropriately.
  • Stop pretending swap is a performance feature.
  • Right-size concurrency. Throwing more threads at a saturated disk is how you get a longer outage.

7) Reputation / policy temp blocks (421/451 4.7.0, “temporarily deferred”)

Receivers may temporarily block you for suspicious volume spikes, spam complaints, invalid recipients, or misaligned authentication. Sometimes the receiver doesn’t want to say “your mail looks bad,” so it says “try later.”

Fixes that work:

  • Align SPF, DKIM, and DMARC for the mail you send.
  • Reduce bounce rate by cleaning lists and fixing recipient validation.
  • Stabilize volume. Spiky traffic is a reputation tax.
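
A quick check that the policy you think you publish is what DNS actually serves (yourdomain.tld, the IP, and the include host are placeholders):

cr0x@server:~$ dig +short TXT yourdomain.tld | grep -i spf
"v=spf1 ip4:203.0.113.25 include:_spf.mailvendor.example ~all"

cr0x@server:~$ dig +short TXT _dmarc.yourdomain.tld
"v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@yourdomain.tld"

Also confirm the DKIM selector records for whatever actually signs your mail; a missing or stale selector is a classic way to start collecting “temporarily deferred” responses after an infrastructure change.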

8) Network path issues: timeouts, packet loss, MTU weirdness (4.4.x)

If you’re timing out to multiple domains, don’t stare at Postfix configs for hours. Look at the network. Packet loss, asymmetric routing, misconfigured firewalls, or broken IPv6 can turn SMTP into a slow-motion tragedy.

Fixes that work:

  • Test v4 and v6 separately. Dual-stack failures are famously confusing.
  • Check firewall state tables and NAT capacity.
  • Don’t ignore PMTUD/MTU issues if TLS handshakes hang.
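
Testing the two address families separately is cheap and often conclusive (flags match the common OpenBSD/nmap netcat variants; the exact failure text varies by flavor):

cr0x@server:~$ nc -4 -vz -w 5 alt1.gmail-smtp-in.l.google.com 25
Connection to alt1.gmail-smtp-in.l.google.com 25 port [tcp/smtp] succeeded!

cr0x@server:~$ nc -6 -vz -w 5 alt1.gmail-smtp-in.l.google.com 25
nc: connect to alt1.gmail-smtp-in.l.google.com port 25 (tcp) timed out

If IPv6 consistently fails while IPv4 works, either fix the v6 path or tell the MTA to prefer IPv4 (see Task 9 below).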

9) Downstream dependency failures: content scanning, milter latency, antivirus outages (451 4.3.0)

Many setups route mail through milters, content filters, DLP engines, or “security appliances.” When those degrade, your MTA waits, blocks, then defers. The SMTP error might mention a filter; it might not.

Fixes that work:

  • Measure milter latency and failures explicitly.
  • Use timeouts and fail-open/fail-closed deliberately based on mail class.
  • Capacity-plan filters like they’re production services. Because they are.
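
The Postfix knobs that decide what happens when a milter stalls are worth knowing before the incident (values shown are the stock defaults):

cr0x@server:~$ postconf milter_default_action milter_connect_timeout milter_command_timeout milter_content_timeout
milter_default_action = tempfail
milter_connect_timeout = 30s
milter_command_timeout = 30s
milter_content_timeout = 300s

milter_default_action = tempfail is fail-closed (mail defers when the filter dies); accept is fail-open. Pick per mail class, deliberately, and write the choice down.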

Joke 2: If your “email security gateway” is down, congratulations: you’ve built a very expensive packet loss generator.

Practical tasks: commands, outputs, and decisions (12+)

These are field tasks. Each includes a command, an example output, what it means, and the decision you make. Use them to turn “emails are stuck” into a concrete diagnosis.

Task 1: Count deferred vs active queue (Postfix)

cr0x@server:~$ mailq | egrep -c "^[A-F0-9]"
1842
cr0x@server:~$ mailq | egrep -c "^[A-F0-9]+\*"
37

What it means: Rough queue size (message IDs), plus how many of those are currently in the active queue (marked with “*” in mailq output). If the total is rising faster than it falls, you have a throughput problem.

Decision: If queue > normal baseline and rising, proceed to isolate by domain and error text before restarting anything.

Task 2: Identify top deferred reasons from logs (Postfix)

cr0x@server:~$ sudo awk '/status=deferred/ {print $0}' /var/log/mail.log | tail -n 2000 | sed -n 's/.*said: //p' | sort | uniq -c | sort -nr | head
  603 421 4.7.0 Try again later
  198 451 4.4.2 Timeout while waiting for server greeting
   74 450 4.2.0 Greylisted, please try again
   41 451 4.3.0 Error: queue file write error
   22 454 4.7.0 TLS not available due to temporary reason

What it means: This is your incident’s “top talkers.” It tells you which layer to investigate first.

Decision: Pick the top 1–2 messages and chase those. Don’t spread attention across five different error families.

Task 3: Find which recipient domains dominate the queue

cr0x@server:~$ mailq | awk '/@/ {print $NF}' | sed 's/.*@//' | tr -d '>,' | sort | uniq -c | sort -nr | head
  912 gmail.com
  376 outlook.com
  211 example-corp.com
   94 yahoo.com
   51 proton.me

What it means: If one domain dominates, you likely have throttling, reputation, or a provider-specific policy issue.

Decision: Apply per-domain rate limiting and verify authentication alignment for that domain’s traffic.

Task 4: Inspect a single queued message’s last error (Postfix postcat)

cr0x@server:~$ sudo postcat -vq 3F2A91C02B | sed -n '/^\*\*\* ENVELOPE RECORDS/,/^\*\*\* MESSAGE CONTENTS/p' | head -n 30
*** ENVELOPE RECORDS ***
message_size:           48213              2812               1               0            48213
message_arrival_time: Wed Jan  3 11:02:19 2026
sender: noreply@yourdomain.tld
*** RECIPIENT RECORDS ***
original_recipient: user@gmail.com
recipient: user@gmail.com
offset: 2812
dsn_orig_rcpt: rfc822;user@gmail.com
dsn_notify: failure
dsn_orcpt: rfc822;user@gmail.com
orig_to: user@gmail.com
recipient_status: 421 4.7.0 Try again later

What it means: Confirms the receiver response attached to that queue record (not just what you saw in logs).

Decision: If the error is policy/throttle, adjust concurrency and retry strategy rather than restarting Postfix.

Task 5: Show Postfix concurrency/rate settings that influence throttling

cr0x@server:~$ sudo postconf | egrep 'default_destination_concurrency_limit|smtp_destination_concurrency_limit|smtp_destination_rate_delay|smtp_destination_recipient_limit|maximal_queue_lifetime'
default_destination_concurrency_limit = 20
smtp_destination_concurrency_limit = 20
smtp_destination_rate_delay = 0s
smtp_destination_recipient_limit = 50
maximal_queue_lifetime = 5d

What it means: High concurrency + no rate delay is a classic way to get throttled by large providers.

Decision: If you see 4.7.x throttling, lower per-destination concurrency (start with 2–5) and add a small rate delay for heavy domains.

Task 6: Confirm DNS resolution for MX and A/AAAA

cr0x@server:~$ dig +time=2 +tries=1 MX gmail.com
;; ANSWER SECTION:
gmail.com.              300     IN      MX      10 alt1.gmail-smtp-in.l.google.com.
gmail.com.              300     IN      MX      20 alt2.gmail-smtp-in.l.google.com.

cr0x@server:~$ dig +time=2 +tries=1 A alt1.gmail-smtp-in.l.google.com
;; ANSWER SECTION:
alt1.gmail-smtp-in.l.google.com. 300 IN A 142.250.102.26

What it means: If this is slow, SERVFAILs, or times out, you have a DNS problem that will manifest as 4.4.x or “temporary lookup failure.”

Decision: If DNS is flaky, fix resolvers first. Mail delivery is downstream of DNS being sane.

Task 7: Check resolver health (systemd-resolved)

cr0x@server:~$ resolvectl status | sed -n '1,40p'
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Current DNS Server: 10.0.0.2
       DNS Servers: 10.0.0.2 10.0.0.3
        DNS Domain: corp.local

What it means: You’re using a local stub resolver; upstream servers are listed. If upstream is wrong or unreachable, expect intermittent failures.

Decision: If upstream servers are internal and overloaded, switch to known-good resolvers or run a local caching resolver with monitoring.

Task 8: Test outbound SMTP connectivity (TCP and banner)

cr0x@server:~$ nc -vz alt1.gmail-smtp-in.l.google.com 25
Connection to alt1.gmail-smtp-in.l.google.com 25 port [tcp/smtp] succeeded!

cr0x@server:~$ timeout 5 bash -lc 'exec 3<>/dev/tcp/alt1.gmail-smtp-in.l.google.com/25; head -n 1 <&3'
220 mx.google.com ESMTP

What it means: If TCP connect fails or banner read times out, suspect firewall, routing, packet loss, or remote outage.

Decision: If this fails broadly, escalate to network. If it fails only to one provider, check blocklists or reputation/policy.

Task 9: Verify IPv6 isn’t silently breaking outbound delivery

cr0x@server:~$ getent ahosts alt1.gmail-smtp-in.l.google.com | head
2607:f8b0:4004:c06::1a STREAM alt1.gmail-smtp-in.l.google.com
2607:f8b0:4004:c06::1a DGRAM
2607:f8b0:4004:c06::1a RAW
142.250.102.26 STREAM
142.250.102.26 DGRAM
142.250.102.26 RAW

What it means: If IPv6 exists, your MTA might try it first. If your IPv6 routing is broken, you’ll see timeouts and deferrals.

Decision: If v6 path is unreliable, fix it or prefer IPv4 for SMTP until it’s trustworthy.
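
If you choose “prefer IPv4 until v6 is trustworthy,” Postfix has a dedicated knob for its SMTP client (a sketch; verify your version supports it):

cr0x@server:~$ sudo postconf -e 'smtp_address_preference=ipv4'
cr0x@server:~$ sudo postfix reload

This keeps IPv6 configured but stops it from being the first thing every delivery attempt trips over.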

Task 10: Check disk space and inode exhaustion (queue writes)

cr0x@server:~$ df -h /var/spool/postfix
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        40G   39G  620M  99% /var

cr0x@server:~$ df -i /var/spool/postfix
Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda2      2621440 2620101   1339  100% /var

What it means: Either space or inodes at 99–100% will cause queue file write errors and 451/452 deferrals.

Decision: Free space/inodes immediately (rotate logs, remove old files), then plan a queue filesystem expansion and inode-appropriate formatting.
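
If inodes are the problem, find the offender before deleting anything (GNU du supports --inodes; paths and counts here are illustrative):

cr0x@server:~$ sudo du --inodes -x /var | sort -n | tail -n 5
18210   /var/lib
44102   /var/log
912401  /var/spool/postfix/maildrop
1733020 /var/spool/postfix
2601318 /var

A maildrop or deferred directory with hundreds of thousands of files is its own incident; clean it up with Postfix tools (postsuper), not bare rm inside the queue.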

Task 11: Check I/O latency (a common hidden villain)

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          3.01    0.00    1.24   18.55    0.00   77.20

Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
sda              5.12   92.40   164.0  8120.3  98.40   1.12  99.60

What it means: await near 100ms and %util near 100% is a saturated disk. Mail queues hate this.

Decision: Reduce concurrency, move queue to faster storage, or stop the I/O bully (backups, antivirus scans, snapshot storms).

Task 12: Confirm Postfix processes are alive and not stuck

cr0x@server:~$ sudo systemctl status postfix --no-pager
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled)
     Active: active (exited) since Wed 2026-01-03 10:58:01 UTC; 10min ago
    Process: 1123 ExecStart=/bin/true (code=exited, status=0/SUCCESS)

cr0x@server:~$ ps -ef | egrep 'postfix/(master|qmgr|smtp|pickup)' | head
root      1180     1  0 10:58 ?        00:00:00 /usr/lib/postfix/sbin/master -w
postfix   1182  1180  0 10:58 ?        00:00:00 pickup -l -t unix -u
postfix   1183  1180  0 10:58 ?        00:00:00 qmgr -l -t unix -u

What it means: Postfix looks alive. If queue isn’t draining, focus on delivery errors and dependencies rather than restarting blindly.

Decision: Only restart if you have evidence of a wedged process or config change; restarts can amplify queue storms.

Task 13: Measure queue age and “how bad is it”

cr0x@server:~$ mailq | awk '/^[A-F0-9]/ {print $3, $4, $5, $6}' | sort -k2,2M -k3,3n -k4,4 | head -n 5
Wed Jan 3 08:41:22
Wed Jan 3 08:41:23
Wed Jan 3 08:41:24
Wed Jan 3 08:41:25
Wed Jan 3 08:41:26

What it means: Old queue entries imply prolonged inability to deliver. It’s not just a transient blip anymore.

Decision: If oldest mail is hours old for transactional traffic, start mitigation: domain throttling changes, separate pools, reroute via alternate MTA, or pause bulk sending.

Task 14: Check for too many open files (classic 451 local errors)

cr0x@server:~$ cat /proc/$(pgrep -n master)/limits | egrep 'open files|Max processes'
Max processes             127528               127528               processes
Max open files            1024                 1048576              files

What it means: If Max open files is low (e.g., 1024), high concurrency can hit FD exhaustion under load.

Decision: Raise limits for Postfix via systemd overrides and tune concurrency to match disk/network reality.
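
A minimal sketch of raising the limit with a systemd drop-in (the unit name and value are examples; on Debian/Ubuntu the workers may run under postfix@-.service, so check which unit actually starts master):

cr0x@server:~$ sudo systemctl edit postfix.service
# in the editor, add:
[Service]
LimitNOFILE=65536

cr0x@server:~$ sudo systemctl daemon-reload && sudo systemctl restart postfix

Then re-run the /proc/<pid>/limits check above to confirm the new soft limit actually applies to the running master process.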

Task 15: Validate TLS from your host to a target (debug handshake failures)

cr0x@server:~$ openssl s_client -starttls smtp -connect alt1.gmail-smtp-in.l.google.com:25 -servername alt1.gmail-smtp-in.l.google.com -brief < /dev/null | head -n 12
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN=mx.google.com
Verification: OK
250 SMTPUTF8

What it means: TLS works end-to-end from this host. If Postfix reports 454, it may be a policy mismatch, CA store issue, or milter interference.

Decision: If this fails, fix system CA bundle, SNI/cipher settings, or network middleboxes before touching mail routing.

Task 16: Spot milter/filter delays (symptom: local tempfail)

cr0x@server:~$ sudo journalctl -u postfix --since "30 min ago" | egrep -i 'milter|timeout|warning' | tail -n 20
Jan 03 11:10:12 server postfix/smtpd[2841]: warning: milter inet:127.0.0.1:8891: can't read response packet: Connection timed out
Jan 03 11:10:12 server postfix/smtpd[2841]: warning: milter inet:127.0.0.1:8891: aborting milter conversation
Jan 03 11:10:12 server postfix/smtpd[2841]: warning: milter inet:127.0.0.1:8891: unavailable: temporary failure

What it means: Your filter dependency is timing out, pushing SMTP into temporary failures.

Decision: Either restore the filter capacity, bypass for critical traffic, or set timeouts/failover policy intentionally.

Three corporate mini-stories from production

Mini-story 1: The outage caused by a wrong assumption (“4xx means the other side is down”)

The company ran a pair of Postfix relays in front of an application fleet. Everything was “fine” until a Tuesday morning when password resets started lagging. The on-call saw 451 4.4.2 Timeout while waiting for server greeting and decided the major provider must be having an incident. That assumption bought everyone two hours of waiting.

Meanwhile the mail queue grew. Not just for the major provider—for almost everyone. The clue was buried in the logs: timeouts happened to domains that had nothing in common. The team kept blaming “the internet,” which is how you know you’re out of ideas.

A network engineer finally ran a direct banner test and noticed something odd: TCP connect succeeded quickly, but reading the SMTP banner hung. That’s not “remote is down.” That’s “packets are getting eaten.” The culprit was a stateful firewall upgrade that reduced the connection tracking table size. Under normal load it was OK; under peak it dropped new and some established flows unpredictably.

The fix was mundane: restore connection tracking capacity and reduce unnecessary SMTP concurrency for a few hours while the backlog drained. The postmortem action item was more important than the firewall knob: “When you see 4.4.x timeouts across multiple unrelated domains, treat it as your network until proven otherwise.”

Mini-story 2: The optimization that backfired (queue on “fast” shared storage)

A different org decided to “modernize” their mail relays. They moved the Postfix queue from local SSD to a shared networked filesystem because it was “durable” and “easier to manage.” The change sailed through because it sounded like reliability.

At first, it worked. Then a backup job was introduced that took consistent snapshots of that same shared storage. Every hour, on the hour, the snapshot process caused a spike in latency. SMTP deliveries slowed, then deferred, then the queue grew. The error was a mix of 451 4.3.0 local processing failures and random timeouts, because when the queue manager stalls long enough, everything else starts to look broken.

The team responded with the standard bad ritual: restart Postfix. It temporarily “helped,” mostly by resetting some in-flight work and giving them the illusion of action. But each restart made recovery slower because it increased connection churn and forced more TLS handshakes.

The eventual fix was to stop treating the mail queue like a log directory. They moved it back to local SSD, enabled proper monitoring for disk latency, and treated the shared storage as a place for backups—not for hot-path queue writes. The optimization’s lesson was blunt: email delivery is I/O sensitive, and “durable shared storage” can be a performance trap if latency isn’t tightly controlled.

Mini-story 3: The boring, correct practice that saved the day (traffic class separation)

One enterprise actually did the unsexy thing: they separated transactional email from bulk marketing. Two logical sender identities, two IP pools, different queues, different rate limits. It sounded like bureaucracy, so it was unpopular—until it wasn’t.

A campaign went out with a typo in a segment that caused a surge of invalid recipients. Bounce rates spiked, and a major receiver started returning temporary policy deferrals (421 4.7.0). The bulk queue piled up quickly, like a snow drift against a door.

But the transactional mail kept flowing. Password resets and invoice notifications were routed through the transactional pool with stricter list hygiene and a stable volume profile. That pool had separate per-domain concurrency limits and a different retry policy. It barely saw the throttling.

The incident response was almost boring: pause bulk sending, fix the segment, let the bulk queue drain slowly, and keep critical mail healthy. The postmortem read like a bedtime story: “Boring separation of concerns prevented revenue-impacting mail delays.” It didn’t make anyone famous, which is usually how you know it was good engineering.

Interesting facts and historical context (useful, not trivia)

  • SMTP predates the modern spam era. It was designed in a more trusting network, which is why so much policy enforcement is bolted on today.
  • Temporary failures became a weapon and a shield. Greylisting popularized the idea of using 4xx to make spammers waste time, betting legitimate MTAs will retry.
  • Enhanced status codes (like 4.7.0) were introduced to add precision beyond the three-digit replies, because “451” alone wasn’t expressive enough for modern policy failures.
  • Mail queues are intentionally persistent. SMTP assumes networks fail; queue-and-retry is the whole point. Your job is to keep the queue from becoming a landfill.
  • Historically, some MTAs retried for a week or more. That made sense when links were unreliable; today it can turn a short outage into days of delayed notifications.
  • RBL/RHSBL checks and DNS-based policy exploded with spam. DNS reliability became critical to mail, which is why resolver failures can look like “email is down.”
  • TLS for SMTP started as opportunistic encryption. It improved privacy without requiring coordination, but it also created weird edge cases where “sometimes TLS works” becomes a delivery variable.
  • Large mailbox providers operate like DDoS targets. Their throttles are part of survival; if you send at scale, you’re negotiating with their anti-abuse systems.
  • Even “temporary” blocks can effectively be permanent if your retry behavior triggers more policy enforcement (e.g., high concurrency retries that look like hammering).

Common mistakes: symptoms → root cause → fix

“Everything is deferred” and the queue grows across many domains

Symptom: Lots of 4.4.x timeouts or “temporary lookup failure,” affecting diverse recipients.

Root cause: Local DNS resolver issues, firewall/NAT exhaustion, packet loss, or broken IPv6.

Fix: Validate DNS with dig, test banner reads with nc or bash’s /dev/tcp, and isolate v4 vs v6. Fix networking before tuning Postfix.

One domain dominates deferrals with “Try again later”

Symptom: Provider-specific 421 4.7.0 or similar, mostly one domain.

Root cause: Throttling due to concurrency, volume spike, or reputation signals.

Fix: Lower per-destination concurrency, add rate delay, separate traffic classes, and stabilize volume. Don’t retry aggressively.

Intermittent 451 4.3.0 local errors during peak

Symptom: 451 4.3.0 “internal error” / queue write errors, often time-correlated with other jobs.

Root cause: Disk full/inodes exhausted, or I/O latency from backups/snapshots/AV scans; sometimes FD exhaustion.

Fix: Check df -h, df -i, iostat, and file descriptor limits. Move queue to low-latency storage and stop noisy neighbors.

454 4.7.0 TLS not available, only for some recipients

Symptom: Opportunistic TLS works for most, but a subset fails temporarily.

Root cause: Cipher mismatch, broken middlebox, SNI quirks, CA store issues, or enforced TLS policy misconfig.

Fix: Reproduce with openssl s_client -starttls smtp from the sending host; adjust TLS settings or bypass middleboxes.

Greylisting is treated like an outage

Symptom: First delivery attempt gets 450, later it succeeds, but people panic anyway.

Root cause: Greylisting is working as designed, and your alerting is too naive.

Fix: Tune alerting to rate/age thresholds, not a single 450. Ensure retry intervals aren’t overly long for critical mail.

Queue drains slowly even after “the issue is fixed”

Symptom: Receiver is accepting mail again, but you’re still backed up for hours.

Root cause: Your own MTA throughput limits, disk latency, or per-domain concurrency caps are too low for backlog recovery.

Fix: Temporarily increase throughput carefully (more worker processes, slightly higher concurrency), watch disk and error rates, then revert.

Checklists / step-by-step plan

Step-by-step: from first alert to stable delivery

  1. Snapshot the state. Queue size, top error strings, top recipient domains, oldest message age.
  2. Classify the dominant error family. 4.4.x timeouts vs 4.7.x policy vs 4.3.x local errors.
  3. Prove or disprove local DNS/network issues. DNS lookups, TCP connect, banner read, v4/v6 sanity.
  4. Check storage health. Space, inodes, I/O latency, queue filesystem.
  5. Check dependencies. Milters/content filters, outbound proxies, NAT/firewalls.
  6. Apply the smallest safe mitigation. Rate-limit per domain, pause bulk traffic, or reroute critical mail.
  7. Monitor drain rate and error rate. Don’t declare victory until the queue is shrinking and new mail delivers normally.
  8. After stability: post-incident hardening. Better alert thresholds, better separation of traffic, better capacity margins.

Operational checklist: what to tune (and what not to)

  • Do tune per-domain concurrency and rate delay for big providers.
  • Do separate transactional and bulk, ideally at the IP/identity and queue level.
  • Do monitor DNS latency, disk latency, and queue age.
  • Do not “fix” a queue backlog by increasing concurrency blindly; you’ll often get more throttling and worse reputation.
  • Do not restart MTAs as a first-line response. It’s a fine last resort. It’s a terrible diagnostic tool.
  • Do not ignore IPv6. Either run it correctly or explicitly prefer IPv4 for SMTP until you can.

Alerting checklist (avoid false “email down” incidents)

  • Alert on oldest queue age for critical classes, not just queue size.
  • Alert on deferred rate by domain (sudden 4.7.0 spike).
  • Alert on DNS SERVFAIL/timeout rate from mail hosts.
  • Alert on disk I/O wait and filesystem fullness for queue volumes.
  • Alert on milter latency/timeouts if you have milters.
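
One way to get “oldest queue age” as a number you can alert on, assuming Postfix 3.1+ (JSON queue listing) and jq installed:

cr0x@server:~$ postqueue -j | jq -r '.arrival_time' | sort -n | head -n 1
1767429682

That number is the epoch arrival time of the oldest queued message; subtract it from “now” per queue or per mail class and you have a metric that pages on real delay instead of on raw queue size.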

FAQ

1) Are 4xx SMTP errors “safe to ignore” because they’ll retry?

No. Retries are a mechanism, not a solution. A 4xx spike is either a policy signal (throttle) or a reliability signal (DNS/network/storage). Treat it like rising latency in any other production system.

2) What’s the fastest way to tell throttling vs network failure?

Look at the enhanced codes and the distribution. 4.7.x clustering on one provider suggests throttling/policy. 4.4.x timeouts across many domains suggest network/DNS/local issues.

3) Why do I see “Try again later” with no useful details?

Receivers intentionally keep policy opaque to avoid helping spammers. Your best signal is behavior: which domains, what volumes, what connection patterns, and whether your IP identity recently changed.

4) How long should I keep retrying deferred messages?

Long enough to handle real outages, short enough that you don’t deliver stale notifications. Many orgs set shorter lifetimes for transactional notifications than for low-value bulk mail. Align TTL with business value.
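
In Postfix, the lifetime knobs look like this (a sketch; pick values that match the business value of the mail class):

cr0x@server:~$ sudo postconf -e 'maximal_queue_lifetime=2d' 'bounce_queue_lifetime=1d'
cr0x@server:~$ sudo postfix reload

If you run separate instances for transactional vs bulk mail, they can have different lifetimes, which is usually the point.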

5) Can SPF/DKIM/DMARC issues cause 4xx instead of 5xx?

Yes. Some receivers defer while they evaluate reputation, wait for DNS, or apply temp policy. Also, your own systems (e.g., milters) might tempfail during DNS lookups for these checks.

6) Should I disable greylisting on the receiving side (if I control it)?

If you can, prefer more modern anti-abuse controls. Greylisting can delay legitimate mail and create operational noise. If you keep it, whitelist critical senders and ensure retry windows are reasonable.

7) Why does restarting Postfix sometimes “fix” it?

Because it resets state, clears stuck processes, or temporarily reduces concurrency while the system comes back up. It can also make things worse by increasing connection churn and TLS handshakes. Use it with evidence, not superstition.

8) What’s the right way to handle a massive backlog once the root cause is fixed?

Drain deliberately. Prioritize critical traffic, throttle domains that penalize bursts, and increase throughput only as far as your disk/network can handle. Otherwise you’ll trigger new throttles and extend the recovery.
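
A sketch of a deliberate drain once the root cause is fixed: trigger one full queue run, then re-check the summary line every few minutes instead of re-flushing in a loop:

cr0x@server:~$ sudo postqueue -f
cr0x@server:~$ mailq | tail -n 1
-- 812 Kbytes in 318 Requests.

Repeatedly flushing against a provider that is still throttling you is exactly the “hammering” behavior that earns longer temporary blocks.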

9) Are 452 errors always about disk full?

No. 452 can mean the recipient is over quota, the receiving server is short on storage, or, for locally generated deferrals, that your own system can’t allocate resources. Check whether the text mentions “insufficient system storage,” and validate your own disk/inodes when the failure is local.

10) How do I prevent 4xx incidents from waking me up at 3 a.m.?

Separate traffic classes, monitor queue age and deferred reasons, keep the queue storage fast and roomy, and treat DNS as a first-class dependency with its own SLO.

Conclusion: next steps that prevent repeats

SMTP 4xx failures are rarely mysterious. They’re just distributed systems wearing a tie. Your job is to stop guessing and start classifying: policy vs network vs local resources vs dependencies.

Next steps that pay off immediately:

  1. Add a “top deferred reasons” panel (error text + enhanced codes) and alert on sudden shifts.
  2. Alert on oldest queue age for transactional mail; stop treating all mail the same.
  3. Enforce traffic separation: transactional vs bulk, ideally with separate queues and rate policies.
  4. Make DNS boring: fast resolvers, monitoring, and tested failover.
  5. Protect the queue storage: low latency, sufficient space/inodes, and no surprise snapshot storms.
  6. Document the fast diagnosis playbook in your runbooks, then rehearse it once while nothing is on fire.

If you do those, “temporary failures” go back to being what they were meant to be: occasional turbulence, not a recurring lifestyle.
