Mail outages are special. Nobody notices email until it fails, and then suddenly it’s the only system that matters. The queue spikes, the load average climbs like it’s late for a meeting, and every internal team discovers they “can’t do business” without that password reset that didn’t arrive.
This is the playbook you want open on a second screen: fast diagnosis, evidence-driven decisions, and recovery steps that won’t turn a backlog into a crater. We’ll treat Postfix like a production system, because it is—one that happens to speak SMTP and store its pain in queues.
Fast diagnosis playbook (first 15 minutes)
If mail “suddenly stops,” you have one job: identify the bottleneck that’s limiting throughput. The second job is not to accidentally amplify the damage by flushing blindly or restarting everything in a panic. Postfix is good at delivering mail slowly and safely. You can still make it deliver mail quickly and dangerously.
Minute 0–2: Confirm the failure mode
- Is mail not entering the system? (inbound SMTP acceptance issue, postscreen, TLS failures, firewall)
- Is mail entering but not leaving? (remote delivery issues, deferred queue growth)
- Is mail leaving but late? (throughput bottleneck: disk, DNS, content filter, rate limits)
Minute 2–5: Identify which queue is growing
- Incoming/cleanup stage jam: many messages in maildrop or heavy cleanup activity
- Active queue jam: active stays huge; qmgr can’t move fast enough
- Deferred queue jam: remote delivery failing or slow; backoff timers stacking
Minute 5–10: Find the resource that’s saturated
- Disk/IOPS: queue directory churn; fsync storms; RAID controller crying quietly
- DNS: slow MX lookups, RBL lookups, DNS timeouts
- Network: outbound port 25 blocked/filtered; packet loss; MTU weirdness
- CPU: TLS handshakes, content scanning, regex maps, milters
- Concurrency limits: per-domain concurrency, default_destination_concurrency_limit, anvil limits
Minute 10–15: Choose one safe action
- If disk is full: stop the bleeding (temporary rejects), free space, don’t flush.
- If DNS is slow: fix resolver or bypass RBL lookups temporarily; don’t increase concurrency.
- If a downstream filter is slow: bypass or scale it; don’t restart Postfix repeatedly.
- If a remote provider is deferring: throttle by domain; don’t brute-force with a flush loop.
One quote to keep you honest, paraphrasing W. Edwards Deming: “Without data, you’re just another person with an opinion.” In queue meltdowns, opinions are how you end up with a second incident.
The queue meltdown mental model
Postfix is a pipeline with buffers. When a stage slows down, buffers upstream fill. Your job is to locate the slow stage and either speed it up or reduce input so the system can drain. The danger is that the queue itself becomes the workload: each message is a tiny file, every retry is more disk IO, every bounce is more mail.
What “meltdown” actually means
A true queue meltdown isn’t just “lots of mail.” It’s “the system spends more time managing the queue than delivering mail.” You see it as:
- rapid growth in deferred and/or active queues
- load average climbing without matching useful throughput
- disk utilization and IO wait rising
- log lines repeating the same temporary failures
- mail processes piling up: smtp, qmgr, cleanup, trivial-rewrite, tlsmgr
Three categories of stoppage
- Acceptance stoppage: inbound connections fail, submissions fail, or messages can’t be queued.
- Scheduling stoppage: messages are queued but qmgr can’t schedule/deliver fast enough.
- Delivery stoppage: remote delivery is failing, slow, or throttled, so deferred dominates.
The key is to resist the temptation to “flush everything.” That’s like solving a traffic jam by opening every on-ramp and telling people to drive faster. The jam just moves to the next intersection, usually disk.
Joke #1: SMTP stands for “Sometimes Mail Takes a Pause.” Unfortunately, the pause is always during business hours.
Interesting facts and historical context
Mail is older than your cloud provider’s branding, and many of its sharp edges are historical artifacts. A few concrete facts that help during incidents:
- Postfix was designed as a secure alternative to Sendmail, with multiple small processes to reduce blast radius when something goes wrong.
- SMTP was built for a friendlier Internet; temporary failures and retries are fundamental, not an exception—your queue is an intended feature.
- Backoff behavior is deliberate: Postfix won’t hammer a remote domain continuously; it spaces retries to be polite and avoid self-DDoS.
- The “queue as files” design means storage performance matters far more than many teams expect; thousands of tiny file ops can beat a slow disk into submission.
- DNS is part of your critical path for remote delivery. When resolvers are slow, mail “stops” even though Postfix is healthy.
- Greylisting became widespread in the spam era; it intentionally causes temporary failures, which can explode queue size if you don’t plan capacity.
- RBL lookups are a hidden dependency: one slow or broken blacklist provider can stall SMTP sessions and starve throughput.
- TLS everywhere increased CPU cost per connection; on busy relays, handshake overhead is a real scaling factor, not a rounding error.
- Mailbox provider throttling is normal: large providers defer aggressively when you spike volume or trip reputation controls; you can’t “out-concurrency” reputation.
Practical tasks: commands, what the output means, what you decide
These are the tasks I actually run when the pager goes off. Each one has three parts: command, how to read it, and the decision it drives. Use them in order, not as a buffet.
Task 1: Confirm Postfix is up and what it thinks it’s doing
cr0x@server:~$ systemctl status postfix --no-pager
● postfix.service - Postfix Mail Transport Agent
Loaded: loaded (/lib/systemd/system/postfix.service; enabled)
Active: active (running) since Tue 2026-02-04 09:12:08 UTC; 2h 14min ago
Main PID: 1243 (master)
Tasks: 6 (limit: 18689)
Memory: 21.3M
CGroup: /system.slice/postfix.service
├─1243 /usr/lib/postfix/sbin/master -w
├─1301 qmgr -l -t unix -u
├─1302 pickup -l -t unix -u
└─1303 tlsmgr -l -t unix -u
Meaning: If Postfix isn’t active, you’re in “service down” territory. If it is active, the problem is usually downstream (DNS, network, remote, filters) or upstream (disk full, permissions).
Decision: If inactive/failed, check journal logs immediately before restarting. If active, don’t restart reflexively; gather evidence first.
Task 2: See queue size and shape at a glance
cr0x@server:~$ mailq | head -n 20
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
A1B2C3D4E5 2456 Tue Feb 4 11:23:05 alerts@example.net
user1@corp.tld
F6G7H8I9J0 1879 Tue Feb 4 11:23:06 noreply@corp.tld
external@bigmail.tld
-- 4821 Kbytes in 217 Requests.
Meaning: The last line is the headline. Requests are queue entries, not necessarily unique emails: one message with multiple recipients still counts as a single request.
Decision: If requests are growing fast, move to “what queue is growing” and “why is delivery failing.”
Task 3: Split active vs deferred vs maildrop counts
cr0x@server:~$ postqueue -p | tail -n 1
-- 4821 Kbytes in 217 Requests.
cr0x@server:~$ find /var/spool/postfix/active -type f | wc -l
53
cr0x@server:~$ find /var/spool/postfix/deferred -type f | wc -l
10241
cr0x@server:~$ find /var/spool/postfix/maildrop -type f | wc -l
0
Meaning: A giant deferred with small active often means remote delivery is failing/throttled. A big maildrop can mean pickup/cleanup issues or content filter slowdown.
Decision: Big deferred: focus on remote responses and DNS/network. Big maildrop: focus on local pipeline (cleanup, milters, disk, permissions).
Task 4: Read the last 15 minutes of mail logs like a grown-up
cr0x@server:~$ journalctl -u postfix --since "15 min ago" --no-pager | tail -n 80
Feb 04 13:18:02 server postfix/smtp[22190]: connect to bigmail.tld[203.0.113.10]:25: Connection timed out
Feb 04 13:18:02 server postfix/smtp[22190]: A1B2C3D4E5: to=<external@bigmail.tld>, relay=none, delay=620, delays=0.1/0.1/620/0, dsn=4.4.1, status=deferred (connect to bigmail.tld[203.0.113.10]:25: Connection timed out)
Feb 04 13:18:05 server postfix/qmgr[1301]: warning: private/anvil: connection refused
Feb 04 13:18:05 server postfix/master[1243]: warning: process /usr/lib/postfix/sbin/anvil pid 22144 exit status 1
Meaning: “Connection timed out” points to network egress, firewall, provider blocking, or routing. Anvil failing is a local symptom (resource exhaustion, limits, or filesystem trouble).
Decision: For timeouts, validate outbound connectivity and port 25 policy. For anvil failures, check disk space/inodes and process limits before restarting.
Task 5: Verify outbound port 25 connectivity to the world
cr0x@server:~$ nc -vz -w 5 203.0.113.10 25
nc: connect to 203.0.113.10 port 25 (tcp) timed out: Operation now in progress
Meaning: If this times out broadly, your network path is blocked or broken. If it works to some hosts but not others, you may be dealing with provider-specific blocking or routing issues.
Decision: If outbound 25 is blocked, stop tuning Postfix and start talking to your network team or cloud provider. No amount of queue flush will fix policy.
Task 6: Inspect DNS health and lookup latency
cr0x@server:~$ dig +tries=1 +time=2 MX bigmail.tld
;; communications error to 10.0.0.53#53: timed out
;; no servers could be reached
Meaning: If DNS can’t answer quickly, Postfix can’t route mail reliably. You will see deferrals, slow sessions, and growing queues.
Decision: Fix resolvers first (reachability, performance, caching). Consider temporarily reducing DNS-dependent checks (RBL) during the incident, safely and with a clear rollback.
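To quantify “slow” instead of eyeballing it, wrap the lookup in a timer. A minimal sketch; the helper is generic, and the commented dig invocation mirrors the one above (bigmail.tld is the example domain from this playbook):

```shell
#!/bin/sh
# measure_ms CMD [ARGS...]: run a command, discard its output,
# and print the elapsed wall-clock time in milliseconds.
# Uses nanosecond timestamps (GNU date).
measure_ms() {
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# sample resolver latency a few times; MX lookups consistently above
# a few hundred ms will show up as SMTP session stalls:
# for i in 1 2 3; do measure_ms dig +tries=1 +time=2 MX bigmail.tld; done
```

Sampling a handful of lookups gives you a trend line you can quote in the incident channel, rather than “DNS feels slow.”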
Task 7: Identify the top deferral reasons
cr0x@server:~$ grep -h "status=deferred" /var/log/mail.log | tail -n 2000 | sed -E 's/.*deferred \((.*)\)$/\1/' | sort | uniq -c | sort -nr | head
312 connect to bigmail.tld[203.0.113.10]:25: Connection timed out
141 host mx.other.tld[198.51.100.25] said: 451 4.7.1 Try again later
88 lost connection with mx.slow.tld[192.0.2.77] while receiving the initial server greeting
Meaning: This turns a “mail is down” complaint into ranked causes. Timeouts differ from 451 deferrals; the fixes differ too.
Decision: Timeouts: network. 451 try later: remote throttling/reputation, so throttle. Lost greeting: remote slow/overloaded or your network is flaky; reduce concurrency and timeouts.
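The same pipeline works on journalctl output for hosts that don’t write /var/log/mail.log. A sketch that factors it into a reusable stdin filter:

```shell
#!/bin/sh
# rank_deferrals: read Postfix log lines on stdin and print deferral
# reasons ranked by frequency, highest first.
rank_deferrals() {
  grep 'status=deferred' \
    | sed -E 's/.*deferred \((.*)\)$/\1/' \
    | sort | uniq -c | sort -rn | head
}

# usage on a journald host:
# journalctl -u postfix --since "15 min ago" | rank_deferrals
```

Because it reads stdin, the same function works against archived logs during the postmortem.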
Task 8: See whether you’re CPU-bound or IO-bound
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 1 0 184320 22112 612300 0 0 120 980 900 1400 12 8 55 25 0
3 2 0 182900 22120 612400 0 0 110 2200 1100 1800 10 7 40 43 0
Meaning: High wa means IO wait: disk is the bottleneck. High us/sy means CPU-bound (TLS, milters, scanning).
Decision: IO wait high: reduce queue churn, slow intake, consider moving queues to faster storage. CPU high: reduce TLS overhead, scale scanning, or reduce concurrency until stable.
Task 9: Check spool filesystem space and inode exhaustion
cr0x@server:~$ df -h /var/spool/postfix
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 19G 700M 97% /
cr0x@server:~$ df -i /var/spool/postfix
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 1310720 1299900 10820 100% /
Meaning: Email queues are inode-hungry. You can have free space and still be dead, or have inodes and be dead. Both feel the same from the user’s perspective: “mail stopped.”
Decision: If space/inodes are tight: stop accepting non-essential mail, clear non-mail junk, and plan a controlled drain. Don’t run a flush loop on a 97% full filesystem.
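For scripting and dashboards, the two df checks reduce to small helpers. A sketch, assuming POSIX df output columns:

```shell
#!/bin/sh
# fs_use_pct PATH: percent of disk space used on PATH's filesystem.
fs_use_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# fs_inode_pct PATH: percent of inodes used. May print "-" on
# filesystems without fixed inode tables (e.g. btrfs).
fs_inode_pct() {
  df -Pi "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# guard-rail sketch: refuse to flush when the spool is under pressure
# if [ "$(fs_use_pct /var/spool/postfix)" -ge 90 ]; then echo "do not flush"; fi
```

These are the numbers worth alerting on before an incident, not during one.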
Task 10: Check which Postfix processes are piling up
cr0x@server:~$ ps -eo pid,comm,pcpu,pmem,args --sort=-pcpu | head -n 15
22501 smtp 22.4 0.2 postfix/smtp
22480 smtp 18.1 0.2 postfix/smtp
22310 cleanup 12.5 0.1 postfix/cleanup
1301 qmgr 6.2 0.2 qmgr -l -t unix -u
1243 master 0.1 0.1 /usr/lib/postfix/sbin/master -w
Meaning: Lots of smtp can mean remote slowdowns or you’ve allowed too much concurrency. High cleanup can indicate header_checks, milters, or disk contention.
Decision: If smtp dominates and deferrals are remote: throttle and reduce concurrency. If cleanup dominates: inspect content filters, maps, and disk health.
Task 11: Confirm what limits Postfix is actually using
cr0x@server:~$ postconf -n | egrep -i 'queue|concurrency|recipient|timeout|anvil|milter|smtpd_client|stress|dns'
default_destination_concurrency_limit = 20
smtp_destination_concurrency_limit = 20
smtp_connect_timeout = 30s
smtp_helo_timeout = 30s
smtp_data_xfer_timeout = 180s
smtpd_milters = inet:127.0.0.1:8891
milter_command_timeout = 30s
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
Meaning: This tells you whether you’re tuned for sane throughput or chaos. Milter timeouts and DNS timeouts are common silent killers.
Decision: If concurrency is high and remotes are slow, reduce it. If timeouts are too strict, you might be causing self-inflicted deferrals; adjust carefully and test.
Task 12: Locate the top recipient domains clogging the queue
cr0x@server:~$ postqueue -p | awk '/^[A-F0-9]/ {id=$1} /@/ {print $NF}' | sed -n 's/.*@//p' | tr -d '><,' | sort | uniq -c | sort -nr | head
842 bigmail.tld
317 other.tld
205 slow.tld
Meaning: Often one destination domain dominates. That’s your lever: throttle that domain instead of punishing everyone equally.
Decision: Apply per-domain concurrency limits or transport maps to isolate the offender. Do not globally crank concurrency to “drain faster.”
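Per-domain throttling is usually done with a dedicated transport. A sketch of the three pieces, assuming the offending domain is bigmail.tld and a cloned transport named slow (the name and the limits are illustrative):

```
# master.cf: clone the smtp transport under a new name,
# capping it at 5 delivery processes
slow      unix  -       -       n       -       5       smtp

# /etc/postfix/transport: route only the hot domain through it
bigmail.tld    slow:

# main.cf: wire up the map and cap the clone's behavior
transport_maps = hash:/etc/postfix/transport
slow_destination_concurrency_limit = 2
slow_destination_rate_delay = 1s
```

After editing, run postmap /etc/postfix/transport and reload Postfix. Every other destination keeps the default limits, which is the whole point: the sick domain gets a slow lane instead of slowing the highway.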
Task 13: Safely inspect a single queue file’s story
cr0x@server:~$ postcat -q A1B2C3D4E5 | sed -n '1,80p'
*** ENVELOPE RECORDS active A1B2C3D4E5 ***
message_size: 2456
message_arrival_time: Tue Feb 4 11:23:05 2026
sender: alerts@example.net
named_attribute: log_ident=postfix/smtp
recipient: external@bigmail.tld
*** MESSAGE CONTENTS A1B2C3D4E5 ***
Received: from app01 (app01 [10.10.10.21])
by server (Postfix) with ESMTP id A1B2C3D4E5
for <external@bigmail.tld>; Tue, 4 Feb 2026 11:23:05 +0000 (UTC)
Subject: Alert: job failed
Meaning: You can confirm sender, recipient, and timing without guessing. Useful when someone claims “it never left our app.”
Decision: If arrival times are old and retries keep failing, stop treating it as transient; it may be policy (blocked IP, reputation, outbound filtering).
Task 14: Measure queue manager behavior and warnings
cr0x@server:~$ postqueue -j | head -n 5
{"queue_name":"deferred","queue_id":"A1B2C3D4E5","arrival_time":1707055385,"message_size":2456,"sender":"alerts@example.net","recipients":["external@bigmail.tld"]}
{"queue_name":"deferred","queue_id":"F6G7H8I9J0","arrival_time":1707055386,"message_size":1879,"sender":"noreply@corp.tld","recipients":["external@bigmail.tld"]}
Meaning: JSON output is script-friendly. You can sample and build quick counts without parsing human formatting.
Decision: Use this for targeted remediation (per-domain throttles, identifying large messages) instead of random flushes.
Task 15: Throttle delivery concurrency (a broad brake, before going surgical)
cr0x@server:~$ postconf -e "smtp_destination_concurrency_limit=10"
cr0x@server:~$ postconf -e "default_destination_concurrency_limit=10"
cr0x@server:~$ systemctl reload postfix
Meaning: These settings lower concurrency for every destination on the smtp transport, which reduces pressure on disk, DNS, and remote servers. For per-domain surgery, combine this with the transport maps from Task 12. Reload is safer than restart during heavy queues.
Decision: If you’re timing out or seeing 451 deferrals, throttling often improves throughput by reducing retries and connection churn.
Task 16: Pause intake when the spool is dying
cr0x@server:~$ postconf -e "smtpd_soft_error_limit=5"
cr0x@server:~$ postconf -e "smtpd_hard_error_limit=10"
cr0x@server:~$ systemctl reload postfix
Meaning: This is not a full stop, but it makes abusive/broken clients back off. For a real intake pause, you can temporarily firewall or adjust master.cf, but start with gentle brakes.
Decision: If disk/inodes are near exhaustion, you must reduce input. Draining cannot win a race against an infinite firehose.
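Gentler still are anvil’s per-client rate limits, which slow abusive submitters without rejecting everyone. A main.cf sketch; the values are illustrative and should be tuned to your traffic:

```
# main.cf: per-client limits enforced by anvil, counted per
# anvil_rate_time_unit (default 60s). Clients in $mynetworks
# are exempt by default.
smtpd_client_connection_count_limit = 20
smtpd_client_connection_rate_limit = 60
smtpd_client_message_rate_limit = 100
```

These brakes bite only on outlier clients, so a single runaway app server can’t monopolize intake while the spool is struggling.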
Task 17: Trigger a controlled requeue (only when you know why)
cr0x@server:~$ postqueue -f
cr0x@server:~$ postqueue -p | tail -n 1
-- 4809 Kbytes in 214 Requests.
Meaning: Flush tells Postfix to retry sooner. That can help after fixing DNS or network. It can also detonate your disk if the root cause is still present.
Decision: Only flush after you fix the bottleneck and you’ve reduced concurrency if needed. Flush is a “now try again,” not a “make it go away.”
Task 18: Check the filesystem latency that murders throughput
cr0x@server:~$ iostat -xz 1 3
avg-cpu: %user %nice %system %iowait %steal %idle
10.12 0.00 7.44 41.20 0.00 41.24
Device r/s w/s rkB/s wkB/s await svctm %util
sda 8.10 210.30 220.1 5400.7 58.4 3.9 92.1
Meaning: High await and high %util suggest the device is saturated. Postfix queue ops are latency-sensitive, not bandwidth-hungry.
Decision: If storage is saturated, throttle, move queue to faster storage, or temporarily reduce logging/filtering work. Do not increase concurrency.
Where Postfix melts down: common bottlenecks
1) Storage: your queue is a tiny-files benchmark you didn’t ask for
Postfix uses the filesystem as a durable queue. That means metadata operations, directory lookups, fsync behavior, and inode availability matter. When queues grow, Postfix performs more file operations: create, rename, unlink, stat, open, close. Multiply by thousands and you’re not “sending email.” You’re load-testing ext4/XFS and whatever sits underneath.
Failure pattern: IO wait climbs, qmgr and cleanup appear busy, throughput drops, queue grows faster, then the disk fills with deferred mail and logs. The system becomes a mail-shaped disk thrash machine.
2) DNS: the silent dependency that can stop all delivery
Every remote delivery needs DNS: MX lookup, A/AAAA lookup, sometimes PTR, plus any policy checks you bolt on. DNS failures often show up as timeouts and deferred mail, which feels like “Postfix is broken” when it’s actually your resolver, your network path to it, or an overloaded caching layer.
Common culprit: RBL or reputation lookups. They’re DNS queries too. A single slow upstream can add seconds to each SMTP session, which at scale becomes a full stop.
3) Network policy: the port 25 reality check
Many environments block outbound port 25 by default. Some do it quietly. If your relay host is in a cloud network with strict egress policy, you can end up with “connect timed out” to many destinations. That’s not Postfix. That’s reality.
Another network failure mode: asymmetric routing, stateful firewalls that age out idle SMTP sessions, or NAT gateways hitting connection limits. In a queue meltdown, you create lots of connections. NAT tables don’t love that.
4) Content filters and milters: the throughput tax
If you run Amavis, ClamAV, SpamAssassin, OpenDKIM, OpenDMARC, or custom milters, congratulations: you have a distributed system. Any part of it can slow, block, or fail, and Postfix will dutifully queue your pain.
Typical symptom: maildrop increases, cleanup processes climb, logs show milter timeouts or “queue file write error.” A slow antivirus update can feel like a full mail outage.
5) Remote throttling and reputation: “451, try again later” is a policy decision
When a major provider returns 4xx responses, they’re telling you to slow down. They might be rate-limiting, detecting spikes, or penalizing reputation. The worst response is to crank concurrency and flush. That increases connection attempts, increases deferrals, and burns your reputation harder.
Better response: throttle per domain, space retries, fix SPF/DKIM/DMARC alignment issues, and stop sending garbage. If you’re sending legitimate mail, stabilize your sending pattern.
Joke #2: “We’ll just restart Postfix” is the email equivalent of turning your laptop off and on—except your laptop doesn’t have a deferred queue of regret.
Recovery patterns that don’t make it worse
Stabilize first: reduce incoming load before you “optimize”
Queue meltdowns often happen during spikes: marketing blasts, password reset storms, alert floods, or a compromised account sending junk. If you keep accepting at full speed, you’re betting your storage can out-run the spike. That’s a bad bet on most days.
Stabilization options, from least invasive to most:
- Throttle concurrency to reduce connection churn and disk activity.
- Temporarily disable expensive checks (like a slow RBL) if it’s clearly the bottleneck and you can accept the risk.
- Rate-limit abusive clients using postscreen/anvil tuning or firewall limits.
- Temporarily reject non-critical mail (for specific senders/domains) while preserving core flows.
Drain intelligently: target the offenders
When the deferred queue is dominated by one or two domains (common with big providers), you should isolate them. If you treat all destinations equally, your healthy destinations get punished by the sick one.
Practical approach:
- Find top domains in queue.
- Apply per-destination throttling (via transport maps or concurrency settings).
- Let everything else flow normally.
Be careful with flush and requeue operations
Flush is appropriate after you fixed a transient infrastructure issue: DNS outage, network policy change, resolver restart, firewall rollback. It is not appropriate if the remote is still deferring or your disk is on fire.
Also beware “requeue everything” habits. Requeuing touches every message and can be more IO than the original backlog. If your bottleneck is disk, requeueing is basically cardio for the thing that’s already exhausted.
Know when to split the role: relay vs mailbox vs submission
One of the most reliable ways to prevent meltdowns is architectural: separate inbound submission from outbound relay, or isolate bulk mail from transactional mail. In a meltdown, the bulk queue doesn’t get to starve password resets.
If you can’t split physically, you can still split logically: separate IPs, separate transports, separate queue directories (advanced), or separate service instances. The point is to contain failure modes.
Storage-specific recovery moves (what SREs hate learning at 3 a.m.)
If IO is the bottleneck:
- Stop making it worse: reduce concurrency and input rate. Every retry cycle is extra IO.
- Consider moving the queue to faster storage (NVMe, better RAID cache policy, dedicated volume). This is not a “during the incident” move unless you’ve rehearsed it.
- Look for external IO hogs on the same filesystem: log storms, backups, antivirus scans of the spool (yes, people do this).
Common mistakes: symptom → root cause → fix
1) Symptom: deferred queue grows, logs show “connect timed out”
Root cause: outbound port 25 blocked, routing issue, firewall/NAT limits, or remote IP unreachable.
Fix: validate connectivity with nc to multiple MX targets; check egress policies; reduce concurrency; stop flush loops; engage network/provider.
2) Symptom: maildrop grows, cleanup processes spike
Root cause: slow milter/content filter, slow disk, or heavy header/body checks.
Fix: inspect milter timeouts in logs; temporarily bypass non-critical milters; scale filter service; confirm disk IO wait; reduce intake.
3) Symptom: queue is huge, but active is small and keeps cycling
Root cause: remote throttling or policy deferrals causing slow retry cadence; or per-destination concurrency too low for your traffic shape.
Fix: identify top domains and deferral messages; apply per-domain throttling and reasonable retry strategy; improve reputation and sending patterns.
4) Symptom: Postfix logs “queue file write error” or “No space left on device”
Root cause: disk full or inode exhaustion on the spool filesystem.
Fix: free space/inodes immediately; stop accepting large/bulk mail temporarily; move logs; clean unrelated junk; then drain slowly.
5) Symptom: CPU high, lots of smtp processes, TLS-related logs
Root cause: TLS handshake overhead, too many concurrent outbound connections, or CPU-starved VM.
Fix: reduce concurrency; ensure you’re not doing expensive per-message operations; consider session reuse tuning; allocate CPU or move the workload.
6) Symptom: slow everything, DNS timeouts, intermittent deliverability
Root cause: broken/overloaded resolver, misconfigured resolv.conf, unreachable DNS servers, or RBL causing slow queries.
Fix: fix resolver reachability and caching; temporarily remove the slowest DNS-based checks; confirm with dig latency and error rates.
7) Symptom: users report “sent” but nothing arrives; no queue growth
Root cause: submission layer issue (auth, TLS, firewall), application pointing to wrong relay, or mail routed elsewhere.
Fix: check submission logs (smtpd), authentication failures, and application configs; confirm messages appear in queue using postcat or logs.
Checklists / step-by-step plan
Checklist A: “Mail stopped” incident response (30–60 minutes)
- Confirm scope: inbound only, outbound only, or both. Check queue and logs.
- Identify queue type growing: maildrop vs active vs deferred. Use directory counts.
- Rank the failure reasons: parse last N deferrals from logs.
- Check disk space and inodes: spool filesystem first.
- Check DNS: resolver health and MX lookup speed.
- Check outbound port 25: connect test to a few destination MX IPs.
- Check saturation: vmstat/iostat for IO wait vs CPU.
- Pick one lever: throttle, bypass a filter, fix DNS, fix egress, stop intake.
- Apply change via reload: prefer systemctl reload postfix over restarts.
- Reassess every 5 minutes: queue size trend, deferral reasons trend, disk IO trend.
- Only then consider flush: after bottleneck is fixed and concurrency is sane.
- Communicate clearly: what’s broken, what’s mitigated, what’s next, and expected drain time.
Checklist B: Controlled drain after a DNS/network fix
- Reduce concurrency temporarily to avoid stampede.
- Flush once.
- Watch IO wait and queue trend for 10–15 minutes.
- If stable, step concurrency up in small increments.
- Stop increasing when IO wait rises sharply or deferrals return.
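Checklist B’s “step up while stable” rule can be scripted. A minimal sketch with a hypothetical step_ok helper; the commented loop is illustrative only (it needs root, and the step values and wait time are assumptions, not recommendations):

```shell
#!/bin/sh
# step_ok PREV CUR: succeed only if the deferred queue kept shrinking.
step_ok() {
  [ "$2" -lt "$1" ]
}

# deferred_count: current number of deferred queue files.
deferred_count() {
  find /var/spool/postfix/deferred -type f | wc -l | tr -d ' '
}

# drain loop sketch: raise concurrency in small steps, stop stepping
# the moment the queue stops shrinking
# prev=$(deferred_count)
# for limit in 10 15 20; do
#   postconf -e "default_destination_concurrency_limit=$limit"
#   systemctl reload postfix
#   sleep 600                      # watch one retry window
#   cur=$(deferred_count)
#   step_ok "$prev" "$cur" || break
#   prev=$cur
# done
```

The point of the helper is discipline: the loop never increases concurrency on hope, only on a measured downward trend.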
Checklist C: When storage is the bottleneck
- Stop accepting optional mail sources (bulk/alerts) if you can.
- Reduce concurrency globally.
- Find and kill non-mail IO on the spool volume (backups, scans).
- Free space/inodes; rotate logs; move large files off the filesystem.
- Drain slowly; avoid requeue operations that rewrite the entire spool.
- Plan a queue storage redesign after the incident.
Three corporate mini-stories from the trenches
Mini-story 1: The outage caused by a wrong assumption
The company had a “simple” architecture: apps sent mail to a Postfix relay, the relay sent mail to the Internet. The relay lived on a VM with decent CPU and what looked like plenty of disk. Everyone assumed email was low bandwidth, so it couldn’t possibly stress storage.
One Monday, a password reset storm hit after an SSO change. The relay accepted mail just fine. Then the deferred queue started growing. Users screamed. The on-call engineer restarted Postfix twice—because that’s what people do when they want to feel involved.
The real issue was inode exhaustion. The filesystem had free space but no inodes left on the root volume where /var/spool/postfix lived. The queue was made of tiny files; each message was a tiny inode tax. The system wasn’t “down,” it was suffocating on metadata.
Once they checked df -i, the fix was mundane: free inodes by cleaning non-mail artifacts and moving the spool to a properly sized filesystem with adequate inode density. The restart wasn’t harmful, but it wasted the only thing you can’t replenish during an incident: time.
Mini-story 2: The optimization that backfired
A different organization ran a high-volume relay and wanted to “drain queues faster.” Someone increased destination concurrency limits and shortened retry backoff. It looked great in a quiet test window. Messages flew out.
Then an upstream DNS resolver had intermittent latency. With high concurrency, Postfix opened more SMTP sessions, which triggered more DNS lookups, which overwhelmed the resolver further. Delivery slowed, retries stacked, and the queue ballooned. The relay’s disk hit high IO wait, then the logs turned into a wall of repeated deferrals.
They had optimized for ideal conditions and accidentally built a feedback loop under failure. High concurrency magnified DNS flakiness; shorter backoff increased retry churn; the spool became a busywork factory.
The fix was to reverse the “optimization” and make it resilient: sane concurrency, default backoff behavior, and a hardened resolver path with caching and monitoring. The queue drained slower on perfect days and much faster on imperfect days—which is the only kind of day you actually need to design for.
Mini-story 3: The boring practice that saved the day
A finance-adjacent company had strict change control and a habit that engineers love to mock: they documented “known good” Postfix settings and kept a lightweight runbook for mail incidents. Nobody bragged about it. It was just there.
One afternoon, outbound mail started deferring with “Try again later” to a major mailbox provider. The queue grew, but not explosively. The on-call engineer pulled up the runbook and followed a simple script: identify top domains, apply per-domain throttling, avoid global flush loops, and keep transactional mail flowing.
They also had pre-existing dashboards for spool filesystem utilization and resolver latency. Within minutes, they confirmed DNS was fine and the provider was throttling. They throttled the destination, communicated expected drain time, and stopped the incident from becoming a crisis.
The provider recovered later that evening. The queue drained overnight. No heroics, no requeue storms, no “we restarted the mail server and it worked” fairy tales. The best incident is the one that stays boring.
FAQ
1) Should I restart Postfix during a queue meltdown?
Rarely. Reload is usually safer. Restarting can drop in-flight sessions and cause more retries, which means more queue churn. Restart only if a process is wedged and you’ve identified why.
2) Is it safe to run postqueue -f when the queue is huge?
Safe for data integrity, risky for system stability. Flush increases retry pressure immediately. Use it after fixing a transient cause (DNS restored, firewall fixed), and consider lowering concurrency first.
3) Why is deferred huge but active small?
Because Postfix schedules a limited number of active deliveries and defers the rest, especially when remote delivery is failing or rate-limited. The small active queue can be a sign of politeness, not weakness.
4) What’s the fastest way to find the root cause?
Rank deferral reasons from logs, then validate the relevant dependency: outbound 25 for timeouts, DNS for lookup issues, storage for IO wait and inode/space. Don’t guess.
5) How do I know if the bottleneck is DNS?
If dig times out or is slow, and deferrals read like “Host or domain name not found, try again” rather than connection errors. If SMTP sessions stall before Postfix even attempts a remote connection, DNS is often the culprit.
6) Can the queue itself cause disk exhaustion even after the original spike ends?
Yes. The queue creates ongoing IO from retries, log writes, and bounces. If your backoff is aggressive or your concurrency is too high, the queue becomes a self-sustaining load.
7) What if only one provider is deferring me?
That’s common. Isolate it: throttle per destination, reduce concurrency for that domain, and keep other mail flowing. Then work on reputation, authentication alignment, and steady sending patterns.
8) Why do I care about inodes? I have free disk space.
Because the queue is many small files. If you run out of inodes, you can’t create queue files even with free bytes. It’s a classic “system looks fine until it isn’t” failure.
9) How do I tell if a milter is the problem?
Look for milter timeout messages, growing maildrop, and elevated cleanup activity. Temporarily bypassing the milter (if allowed) is a fast confirmation test.
10) Should I delete the queue to recover faster?
Only if you are explicitly choosing data loss and you have approval. Deleting queues can violate policy, lose customer mail, and create legal trouble. There are safer levers: throttling, stopping intake, fixing dependencies.
Next steps (the boring part that prevents drama)
If you’re mid-incident: stop flushing blindly, identify the growing queue type, rank the deferral reasons, and verify the big three dependencies—storage, DNS, network egress. Then apply one controlled change and measure the trend. Repeat. Incidents end when the queue is shrinking, not when someone says “it should be fine now.”
After the incident, do the unglamorous work:
- Put spool space and inode monitoring on dashboards with alerting.
- Monitor DNS latency and error rate from the mail host, not from some happy-path probe.
- Document and enforce sane concurrency defaults, plus per-domain throttling procedures.
- Separate transactional vs bulk paths if email matters to your business (it does).
- Practice a controlled drain in a maintenance window so you’re not improvising at 3 a.m.
Postfix is stable. Your environment might not be. The queue is where that truth becomes visible.