Postfix Queue Stuck: The Safe Cleanup Workflow (No Data Loss)

Your monitoring says “mail backlog growing.” Users say “password resets never arrive.” The CEO forwards you an email with a subject line that is mostly punctuation. The Postfix queue is stuck, and now you have to fix it without turning a bad morning into a compliance incident.

This is the production-safe way to handle a jammed Postfix queue: diagnose the bottleneck fast, stop the bleeding without losing mail, clean up only what’s actually broken, and replay the rest in a controlled way.

The rules: don’t “clean up,” recover

When people say “clean the Postfix queue,” they usually mean “make the pain go away.” That’s how mail gets deleted, bounces get generated in bulk, and your incident channel fills with screenshots from legal.

A stuck queue is rarely a queue problem. The queue is just the visible pile of symptoms. The root cause is almost always one of these:

  • DNS is slow or broken (lookups time out; deliveries stall).
  • Network egress is blocked (firewall rules, routing, NAT exhaustion).
  • Remote sites throttle or tarpit you (421/450/451 responses, greylisting, rate limits).
  • Local policy services are down (milter, postscreen, rspamd, opendkim, policy daemon).
  • Disk or inode pressure (queue files can’t be created/renamed).
  • Content filter is slow (amavis, antivirus, DLP), causing backpressure.
  • Misconfiguration (relayhost, TLS policies, sender restrictions, SASL auth, bad transports).
  • Actual queue file corruption (rarer than people think, but real after storage issues).

The safe stance is:

  1. Freeze the scene enough to stop making it worse.
  2. Measure what’s stuck (which queue, which destination, which error).
  3. Fix the cause (DNS/network/service/storage).
  4. Replay gradually (throttle, requeue, targeted flush).
  5. Only then delete anything — and only with a written reason.

One quote I keep taped to the inside of my brain: Hope is not a strategy. — James Cameron. Reliability work is what you do instead of hope.

Short joke #1: Postfix queues are like laundry piles: ignoring them doesn’t make them smaller, it just makes you creatively blind.

Fast diagnosis playbook (first/second/third)

First: confirm what “stuck” means in your reality

  • Is the active queue huge (mail is being attempted constantly) or is everything in deferred (Postfix gave up for now)?
  • Are messages from one sender/domain stuck, or is it global?
  • Are new messages entering the queue faster than they leave?

Second: isolate the bottleneck class

  • DNS latency: long resolver timeouts in logs, spikes in system load, many “Host not found” errors that later succeed on retry.
  • Remote throttling: lots of 4xx responses, “try again later,” “rate limited,” “greylisted.”
  • Local services: milter timeouts, policy daemon not responding, content filter backlog.
  • Storage: “No space left on device” (which also appears on inode exhaustion), failures creating or renaming queue files, or queue directories that are simply slow (high iowait).
  • Network: connect() timeouts, “Network is unreachable,” TLS handshake failures due to MITM/proxy issues.

Third: pick the safest lever

  • If root cause is unresolved, do not flush everything. You’ll amplify load and make logs useless.
  • If one destination is poison (one domain timing out), isolate it with transport maps or concurrency limits.
  • If your server is melting (load, disk), slow down: reduce concurrency, pause nonessential traffic, protect the host.

Interesting facts and history (because it explains the weird bits)

  • Postfix was designed as a security-minded Sendmail replacement in the late 1990s, emphasizing least privilege and multiple small daemons instead of one monolith.
  • The queue is file-based by design. That’s boring and great: messages survive daemon restarts and even some partial failures.
  • “Deferred” isn’t an error state; it’s a scheduling decision. Postfix intentionally backs off to avoid hammering a broken destination.
  • Postfix splits responsibilities (pickup, cleanup, qmgr, smtp, local). When one of these stalls, the queue becomes your early warning system.
  • Queue IDs aren’t random decoration. They’re stable handles you can use to trace a message through logs and queue operations.
  • Backoff behavior is part of being a good internet citizen. The SMTP ecosystem punishes aggressive retriers with throttling and blocklisting.
  • Historically, mail systems taught ops teams about durability before “event sourcing” was fashionable: accept, persist, retry, and only fail with evidence.
  • Maildir vs mbox taught the industry painful lessons about corruption and locking; Postfix’s queue format inherits that “don’t bet on one big file” mindset.

Production tasks: commands, output meaning, and the next decision

These are not “run this and pray” commands. Each task includes what to look for and what you decide next. Run them as root or with appropriate privileges.

Task 1: Measure queue size (and which queue)

cr0x@server:~$ postqueue -p | tail -n 20
...
A1B2C3D4E5      1234 Fri Jan  3 10:41:22 sender@example.com
                                         (connect to mx.remote.tld[203.0.113.10]:25: Connection timed out)
                                         recipient@remote.tld
-- 18451 Kbytes in 392 Requests.

What it means: The summary tells you total size and number of requests. The per-message line shows a dominant error: connection timeouts to a specific destination.

Decision: If you see one or a few destinations repeating, do targeted mitigation (transport throttling or isolate) instead of global flushing.

Task 2: Split active vs deferred counts

cr0x@server:~$ find /var/spool/postfix/active -type f | wc -l
128
cr0x@server:~$ find /var/spool/postfix/deferred -type f | wc -l
9412

What it means: Mostly deferred. Postfix tried and backed off. That usually points to remote issues, DNS, or policy timeouts—not a local send loop.

Decision: Focus on why deliveries are being deferred. Don’t restart Postfix repeatedly; it won’t make remote MTAs respond faster.

Task 3: Check whether Postfix is healthy as a service

cr0x@server:~$ systemctl status postfix --no-pager
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled)
     Active: active (running) since Fri 2026-01-03 10:12:01 UTC; 42min ago
   Main PID: 1123 (master)
      Tasks: 6 (limit: 4672)
     Memory: 34.2M
        CPU: 1min 12s

What it means: The master process is up. This doesn’t mean mail is flowing, but it rules out “it’s dead.”

Decision: If it’s not running, start it and then immediately inspect logs for why it stopped. If it’s running, move on to bottleneck checks.

Task 4: Find the dominant deferral reason in logs

cr0x@server:~$ journalctl -u postfix -S "1 hour ago" | egrep -i "deferred|timed out|refused|4[0-9][0-9]" | tail -n 20
Jan 03 10:43:10 server postfix/smtp[2451]: A1B2C3D4E5: to=<recipient@remote.tld>, relay=mx.remote.tld[203.0.113.10]:25, delay=120, delays=0.1/0/120/0, dsn=4.4.1, status=deferred (connect to mx.remote.tld[203.0.113.10]:25: Connection timed out)
Jan 03 10:43:22 server postfix/smtp[2454]: F6E7D8C9B0: to=<user@mx.other.tld>, relay=none, delay=4, delays=0.1/0.1/3.8/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=mx.other.tld type=MX: Host not found)

What it means: You’ve got connect timeouts (network/remote) and DNS failures (resolver/path). That’s enough to pick the next two probes: DNS and network.

Decision: If errors are mostly 4.4.3 (DNS), fix resolver first. If mostly connect timeouts, check firewall/egress and remote reachability.

Task 5: Check DNS resolution latency and correctness

cr0x@server:~$ dig +tries=1 +time=2 mx remote.tld
;; ANSWER SECTION:
remote.tld.              300     IN      MX      10 mx.remote.tld.
;; Query time: 12 msec

What it means: DNS is fast for this domain, at least. If you see 2000 msec query times, SERVFAIL, or timeouts, that’s why your queue is “stuck.”

Decision: If DNS is slow, fix resolvers, caching, or network path. Don’t touch the queue yet; you’ll just generate more failing lookups.
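
If DNS looks suspect, check what the host actually resolves with before touching the queue. A minimal sketch; the resolver address and the slow response below are hypothetical, and a local caching resolver (unbound, dnsmasq) is a common fix when the upstream is the problem:

cr0x@server:~$ grep -E "^(nameserver|options)" /etc/resolv.conf
nameserver 10.0.0.53
cr0x@server:~$ dig +time=2 +tries=1 @10.0.0.53 remote.tld mx | grep "Query time"
;; Query time: 1840 msec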

Task 6: Verify outbound TCP/25 and TLS reachability

cr0x@server:~$ nc -vz -w 3 203.0.113.10 25
nc: connect to 203.0.113.10 port 25 (tcp) timed out: Operation now in progress

What it means: The OS can’t connect. That’s not Postfix. That’s routing, firewall, remote blackhole, or ISP blocking outbound 25.

Decision: Escalate to network/firewall. If outbound 25 is policy-blocked, you need a relayhost (submission/587) or approved smart host.
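
If outbound 25 stays blocked by policy, point Postfix at a smart host on the submission port instead. A minimal sketch, assuming a hypothetical relay smtp.relay.example and credentials already staged in /etc/postfix/sasl_passwd; test with a single queue ID before requeuing everything:

cr0x@server:~$ postconf -e "relayhost = [smtp.relay.example]:587"   # hypothetical relay; brackets suppress MX lookup
cr0x@server:~$ postconf -e "smtp_sasl_auth_enable = yes"
cr0x@server:~$ postconf -e "smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd"
cr0x@server:~$ postconf -e "smtp_sasl_security_options = noanonymous"
cr0x@server:~$ postconf -e "smtp_tls_security_level = encrypt"
cr0x@server:~$ postmap /etc/postfix/sasl_passwd
cr0x@server:~$ postfix reload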

Task 7: Inspect Postfix concurrency and rate controls

cr0x@server:~$ postconf | egrep "default_destination_concurrency_limit|smtp_destination_concurrency_limit|qmgr_message_active_limit|default_process_limit"
default_destination_concurrency_limit = 20
default_process_limit = 100
qmgr_message_active_limit = 20000
smtp_destination_concurrency_limit = 20

What it means: These numbers dictate how hard you hit the world. If you flush a giant deferred queue with high concurrency, you can DDoS yourself or get blocked.

Decision: During recovery, temporarily lower concurrency (especially per-destination) if remote throttling is happening.
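
A minimal throttle sketch for the recovery window; note the current values first (with the postconf command above) so you can restore them once the backlog drains:

cr0x@server:~$ postconf -e "default_destination_concurrency_limit = 5"
cr0x@server:~$ postconf -e "smtp_destination_concurrency_limit = 5"
cr0x@server:~$ postfix reload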

Task 8: Look for milter/policy timeouts (the invisible handbrake)

cr0x@server:~$ journalctl -u postfix -S "1 hour ago" | egrep -i "milter|policy|timeout" | tail -n 20
Jan 03 10:31:07 server postfix/cleanup[1881]: warning: milter inet:127.0.0.1:8891: connect to Milter service: Connection refused
Jan 03 10:31:08 server postfix/smtpd[1876]: warning: problem talking to server 127.0.0.1:8891: Connection refused

What it means: Postfix is blocking acceptance or processing because a configured milter isn’t answering. Mail may pile up in incoming, active, or hold, depending on configuration.

Decision: Fix the milter service (start it) or temporarily disable it if policy allows. Don’t delete queued mail because your filter died.
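
If policy allows mail to keep flowing while the filter is repaired, Postfix can be told to accept mail when a milter is unreachable instead of tempfailing it. A sketch, assuming the listener on 127.0.0.1:8891 from the log above is opendkim; revert the override once the milter is healthy:

cr0x@server:~$ systemctl restart opendkim    # assumption: opendkim is the milter on 127.0.0.1:8891
cr0x@server:~$ postconf -e "milter_default_action = accept"   # temporary: accept mail when a milter is down
cr0x@server:~$ postfix reload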

Task 9: Check disk space and inodes (queue is files)

cr0x@server:~$ df -h /var/spool/postfix
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        20G   19G  600M  97% /
cr0x@server:~$ df -i /var/spool/postfix
Filesystem       Inodes  IUsed   IFree IUse% Mounted on
/dev/sda2       1310720 1299000  11720   100% /

What it means: You’re out of inodes. That’s catastrophic for a file-based queue: Postfix can’t create queue files even if you have some bytes left.

Decision: Free inodes safely (logs, caches), expand filesystem, or move spool to a larger volume. Do not run “cleanup scripts” that rm random queue files.
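
To find where the inodes went (it’s usually a cache or log directory full of tiny files), a rough sketch using GNU coreutils’ du; on older systems without --inodes, fall back to find piped into wc -l per directory:

cr0x@server:~$ du --inodes -x /var 2>/dev/null | sort -n | tail -n 10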

Task 10: Identify top offending destinations in deferred

cr0x@server:~$ postqueue -p | awk '/\(.+\)/{print}' | sed -E 's/.*\((.*)\).*/\1/' | cut -d: -f1 | sort | uniq -c | sort -nr | head
  812 connect to mx.remote.tld[203.0.113.10]
  503 Host or domain name not found. Name service error for name=mx.other.tld type=MX
  221 451 4.7.1 Try again later

What it means: A few failure modes dominate. This is good news: fix a small number of causes and the queue drains naturally.

Decision: If one destination dominates, isolate it to prevent it monopolizing delivery attempts and log volume.
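
One way to isolate a problem destination is a dedicated low-concurrency clone of the smtp transport, so it can’t monopolize delivery slots. A sketch, assuming remote.tld is the offender and transport_maps isn’t already in use (if it is, merge instead of overwriting):

cr0x@server:~$ cat >> /etc/postfix/master.cf <<'EOF'
# low-concurrency clone of the smtp transport for the problem destination
slow      unix  -       -       n       -       2       smtp
EOF
cr0x@server:~$ echo "remote.tld   slow:" >> /etc/postfix/transport
cr0x@server:~$ postmap /etc/postfix/transport
cr0x@server:~$ postconf -e "transport_maps = hash:/etc/postfix/transport"
cr0x@server:~$ postconf -e "slow_destination_concurrency_limit = 2"
cr0x@server:~$ postfix reload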

Task 11: Inspect a specific message safely

cr0x@server:~$ postcat -q A1B2C3D4E5 | sed -n '1,40p'
*** ENVELOPE RECORDS active ***
message_size:           1234             1234
message_arrival_time: Fri Jan  3 10:41:22 2026
sender: sender@example.com
recipient: recipient@remote.tld
*** MESSAGE CONTENTS active ***
Received: from app01.internal (app01.internal [10.0.0.12])
        by server with ESMTP id A1B2C3D4E5
        for <recipient@remote.tld>; Fri, 03 Jan 2026 10:41:22 +0000 (UTC)
Subject: Password reset

What it means: You can verify sender/recipient/headers without guessing. This is how you avoid deleting “junk” that is actually business-critical.

Decision: If the mail is legitimate, you keep it and fix delivery. If it’s spam from a compromised host, you stop the source first.

Task 12: Put a message on hold instead of deleting

cr0x@server:~$ postsuper -h A1B2C3D4E5
postsuper: A1B2C3D4E5: placed on hold

What it means: The message is removed from normal delivery attempts but not destroyed. This is your “quarantine” lever.

Decision: Use hold for suspected loops, poisoned messages triggering filter crashes, or legal-sensitive mail while you investigate.
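
postsuper also reads queue IDs from standard input when given “-”, which makes bulk quarantine practical. A sketch that holds everything from one hypothetical suspect sender; run the pipeline without the final postsuper stage first and eyeball the selection:

cr0x@server:~$ mailq | awk '$7 == "suspect@example.com" { gsub(/[*!]/, "", $1); print $1 }' | postsuper -h -
# $7 is the sender column of the queue-ID line; gsub strips the * / ! status flags from the ID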

Task 13: Release held mail after fixing the root cause

cr0x@server:~$ postsuper -H A1B2C3D4E5
postsuper: A1B2C3D4E5: released from hold

What it means: The message is eligible again. If the underlying error is fixed, it will deliver on the next queue run.

Decision: Release in batches if you held many items. Watch system load and remote responses.

Task 14: Trigger a controlled queue run (don’t hammer)

cr0x@server:~$ postqueue -i A1B2C3D4E5

What it means: This asks Postfix to schedule the specific queue ID for immediate delivery attempt.

Decision: Use this for surgical verification (“did the fix work?”) rather than flushing the entire backlog at once.
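
Between a single queue ID and a full flush there is a middle ground: asking Postfix to retry everything queued for one destination. A sketch; per-site flushing goes through the flush(8) service and, for domains it doesn’t keep a fast-flush log for, may fall back to a broader (slower) queue scan:

cr0x@server:~$ postqueue -s remote.tld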

Task 15: Requeue mail safely (when configuration changed)

cr0x@server:~$ postsuper -r ALL
postsuper: Requeued: 392 messages

What it means: Postfix rewrites queue files into a fresh state for scheduling. This is useful after changing transport maps or relayhost settings, or after recovering from transient corruption.

Decision: Requeue is safer than delete-and-resend. Still, do it only after the cause is fixed and you’ve throttled concurrency if needed.

Task 16: Verify there’s no queue manager stall

cr0x@server:~$ ps -eo pid,comm,args | egrep "postfix/qmgr|postfix/master" | grep -v egrep
1123 master  /usr/lib/postfix/sbin/master -w
1388 qmgr    qmgr -l -t unix -u

What it means: qmgr exists. If qmgr is missing or constantly respawning, the queue won’t drain regardless of remote reachability.

Decision: If missing, check master.cf errors and logs; fix configuration before touching queue contents.

Safe cleanup workflow: step-by-step (no data loss)

Step 0: Decide what “no data loss” means for your org

Mail retention is policy. But operationally, “no data loss” means you don’t delete queued mail as a first-line response. You preserve evidence, you preserve deliverability, and you preserve the ability to explain what happened later.

Step 1: Stop making it worse (without stopping the world)

If the queue is exploding because an internal app is dumping mail (loop, bug, or compromised host), you need to slow intake. Options, from least disruptive to most:

  • Block the offending sender at the edge (smtpd restrictions) temporarily.
  • Rate-limit or tarpit a specific client network.
  • If you are being flooded, temporarily reject with a 4xx for non-essential sources so legitimate senders can retry.

Avoid “postfix stop” as the first reflex. Stopping can be fine, but it also stops legitimate deliveries and may trigger app retries that create more load elsewhere.
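
If the flood is coming from one internal host (say the hypothetical app01 at 10.0.0.12 from the header example earlier), a temporary access map that answers 4xx makes it back off without losing anything. A sketch; prepend the check to whatever client restrictions you already have rather than replacing them:

cr0x@server:~$ echo "10.0.0.12   450 Deferred during incident, retry later" > /etc/postfix/flood_clients
cr0x@server:~$ postmap /etc/postfix/flood_clients
cr0x@server:~$ postconf -e "smtpd_client_restrictions = check_client_access hash:/etc/postfix/flood_clients, permit"
cr0x@server:~$ postfix reload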

Step 2: Snapshot the situation (cheap evidence)

Before you “fix,” capture a few quick facts so you can verify improvement and write a decent postmortem:

  • Queue size now and 15 minutes later.
  • Top 3 deferral reasons.
  • Top 3 destinations by volume.
  • Disk/inode status.
  • CPU/iowait and network errors.

This is not bureaucracy. It’s how you avoid chasing ghosts.
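
A minimal snapshot sketch, assuming journald logging as in the tasks above; two lines appended to a scratch file are enough to compare before and after:

cr0x@server:~$ { date; postqueue -p | tail -n 1; df -h /var/spool/postfix | tail -n 1; df -i /var/spool/postfix | tail -n 1; } >> /var/tmp/queue-snapshot.txt
cr0x@server:~$ journalctl -u postfix -S "15 minutes ago" | grep -oE "status=deferred \(.*\)" | sort | uniq -c | sort -nr | head -n 3 >> /var/tmp/queue-snapshot.txt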

Step 3: Fix the bottleneck, not the queue

Examples of real fixes that actually drain queues:

  • Repair DNS: correct resolv.conf, fix broken upstream, add local caching resolver.
  • Open outbound TCP/25 or configure relayhost on 587 with auth.
  • Restart/repair milters and content filters; increase their capacity if they’re the choke point.
  • Fix disk/inode exhaustion; move spool to a dedicated filesystem if needed.
  • Address remote throttling by reducing concurrency and respecting 4xx backoff.

Step 4: Replay mail in a controlled way

Once the cause is fixed, you want a steady drain, not a stampede.

  • Start by delivering a few known queue IDs (postqueue -i) to confirm success.
  • Lower per-destination concurrency temporarily if you were being throttled.
  • Use postsuper -r ALL if you changed transports/relayhost and need to reschedule.
  • Monitor queue depth and deferral rate as you ramp back.

Step 5: Only remove mail with a scoped, auditable reason

Deleting queued mail is sometimes correct. Examples:

  • Confirmed spam flood from a compromised account that must not be delivered.
  • A mail loop you can prove will never succeed (bad address expansion creating infinite recipients).
  • Malware payload you must not relay.

When you do delete, do it narrowly: by queue ID, by sender, or by time window. And keep a record.
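
A sketch of a narrow, auditable delete, with a hypothetical compromised sender and ticket number; capturing the queue IDs to a file first gives you both a sanity check and the record:

cr0x@server:~$ mailq | awk '$7 == "compromised@example.com" { gsub(/[*!]/, "", $1); print $1 }' > /var/tmp/INC-1234-deleted-queue-ids.txt
cr0x@server:~$ wc -l /var/tmp/INC-1234-deleted-queue-ids.txt
57 /var/tmp/INC-1234-deleted-queue-ids.txt
cr0x@server:~$ postsuper -d - < /var/tmp/INC-1234-deleted-queue-ids.txt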

Short joke #2: If someone suggests “just delete the queue,” ask which limb they’d like to remove to fix a sprained ankle.

Where queues get stuck: bottlenecks by subsystem

DNS: the silent queue killer

When DNS is slow, Postfix doesn’t fail fast. It waits, because sometimes DNS is flaky and retrying helps. That means the smtp delivery processes spend their time blocked on lookups instead of delivering. Symptoms:

  • Many deferrals with “Name service error” or “Host not found” that later succeed.
  • High load with low CPU usage (processes in IO wait / blocked states).
  • Dig queries from the host feel slow or inconsistent.

Fix DNS first. Then replay queue gently. A flushed queue with broken DNS just creates a thundering herd of timeouts.

Network egress: the firewall that ate your mail

Outbound TCP/25 being blocked is common in corporate and cloud environments. Sometimes it’s intentional. Sometimes a change request went sideways. Postfix can queue forever, but users won’t care that it was “by design.”

If you can’t do direct-to-MX delivery, use a relayhost. The safe recovery is to configure the relayhost, test with a single queue ID, then requeue and drain.

Remote throttling and greylisting: not broken, just unfriendly

Remote MTAs send 4xx responses for all sorts of reasons: rate limits, reputation, temporary resource constraints, greylisting. Postfix treats these as transient and defers.

Do not “solve” greylisting by brute-force retries. Respect backoff, reduce concurrency, and make sure your IP reputation and reverse DNS are sane. Your queue might be a reputation symptom, not a Postfix problem.
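
If one large provider keeps answering 4xx, slowing down toward that destination beats retrying harder. A sketch using a per-destination rate delay; note that a non-zero rate delay effectively serializes deliveries to each destination, so set it back to 0 once the backlog has drained:

cr0x@server:~$ postconf -e "smtp_destination_rate_delay = 2s"
cr0x@server:~$ postfix reload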

Milters and content filters: the expensive middleboxes

Milters are great until they aren’t. A down milter can block mail acceptance or slow cleanup processing. A slow AV scan can turn into a system-wide throughput problem.

Operationally: keep your mail path simple. If you must run filters, monitor their latency and failure modes. Treat them like dependencies, because they are.

Storage: queue health is storage health

Postfix is a storage workload disguised as messaging. It creates lots of small files, renames them, fsyncs them, and expects that to be cheap. If your disk is full, inode-starved, or suffering latency spikes, Postfix will show it as queue buildup.

Queue corruption is uncommon, but it happens after abrupt power loss, filesystem bugs, or broken virtualization storage. When it does, the safe response is to preserve the spool, repair filesystem issues, and requeue — not to rm -rf your way into silence.

Three corporate mini-stories (mistakes you can borrow)

1) The incident caused by a wrong assumption

A mid-sized enterprise had a mail relay that “never touched user mail,” only application notifications. That line got repeated so often it became folklore. One morning, the queue hit tens of thousands of messages, and the on-call engineer did what the folklore suggested: delete the deferred queue to “let new mail through.”

What they assumed: that queued mail was disposable. What they missed: password resets, billing notices, and a chunk of legal notifications came from those applications. Some recipients were external regulators. Those systems don’t “just retry” forever—some had one-shot workflows tied to user actions.

Technically, the real issue was outbound TCP/25 blocked after a firewall change. Postfix was behaving correctly: queue and retry. Deleting the queue removed the only copy of those notifications. There was no upstream retry because the apps had already handed off successfully.

The recovery was ugly and manual: re-triggering workflows where possible, apologizing where not, and auditing which message types were lost. The root-cause fix took ten minutes (restore the firewall rule). The impact lasted weeks because “delete the queue” is not reversible.

The lesson that stuck: treat Postfix like a durable buffer. Once an app has handed off, the queue is the system of record until delivery succeeds or you deliberately bounce.

2) The optimization that backfired

Another org ran a busy outbound relay and wanted faster throughput. Someone increased concurrency limits—globally and per-destination—because the queue looked “too slow.” It worked for about a day. Then remote providers started returning more 421/451 responses and greylisting became constant.

What happened under the hood: their relay began to look like an aggressive sender. More parallel connections, more simultaneous deliveries, and more repeated attempts during transient failures. Some large mailbox providers treat that pattern as suspicious or abusive, regardless of the content.

The queue got worse, not better. Each failed attempt created more log volume, more DNS lookups, and more socket churn. The system spent more time failing faster than it would have spent delivering slower.

They fixed it by reversing the “optimization,” implementing sensible per-destination throttles, and adding visibility: top deferral reasons, top destinations, and trend-based alerting. The key was accepting that SMTP throughput is a negotiation, not a unilateral decision.

The lesson: don’t tune Postfix like it’s a web server. Email is a long game, and remote MTAs get a vote.

3) The boring but correct practice that saved the day

A financial services company treated the mail relay as infrastructure, not a pet project. The configuration was managed, changes were reviewed, and—most importantly—the spool lived on a dedicated filesystem with monitoring for inodes and latency.

During an unrelated incident, their log volume spiked, and a separate application started generating massive temporary files on the root filesystem. On many servers this would quietly fill the disk until Postfix started failing. Here, the mail spool had its own headroom and its own inode pool.

The queue grew, but it didn’t collapse. Postfix kept accepting mail, persisting it safely, and retrying deliveries. The on-call saw inode alerts for the root filesystem but not for the spool. That told them something critical: mail durability was intact; the bottleneck was delivery capacity, not acceptance/persistence.

They throttled outbound concurrency temporarily, fixed the noisy application, and let the queue drain over the next hour. No mail loss. No panic deletions. No “mysterious missing notifications.”

The lesson: a dedicated spool filesystem and inode monitoring is the most boring insurance policy you’ll ever be glad you bought.

Common mistakes: symptom → root cause → fix

1) “mailq shows thousands of deferred messages”

Root cause: Remote throttling, DNS issues, or outbound network blockage.

Fix: Identify top deferral reason and destination; repair DNS/network; reduce concurrency; requeue and drain. Don’t restart Postfix as therapy.

2) “The queue never drains even though remote is reachable”

Root cause: qmgr not running, master.cf misconfiguration, or a stuck cleanup/content filter path.

Fix: Verify qmgr/master processes; inspect logs for cleanup/milter timeouts; restore dependencies; then replay a single message to validate.

3) “After we fixed the firewall, nothing happened”

Root cause: Messages are deferred with backoff timers; Postfix won’t instantly retry everything.

Fix: Trigger controlled retries: postqueue -i for a test ID; then postsuper -r ALL if needed. Avoid stampede flushes.

4) “Postfix rejects new mail with temporary failures”

Root cause: Content filter/milter down, or disk/inodes exhausted, or too many active processes.

Fix: Restore the dependency, free inodes/space, or adjust process limits. Use holds rather than deletes for suspect items.

5) “Queue files missing / weird queue IDs / postsuper errors”

Root cause: Filesystem corruption, manual deletion, or broken storage layer.

Fix: Stop the urge to run random cleanup. Validate filesystem health, preserve spool, and use requeue operations once storage is stable.

6) “Only one domain is stuck; everything else is fine”

Root cause: That domain’s MX is down, rate-limiting you, or you have a routing problem to that network.

Fix: Isolate deliveries to that domain (lower concurrency); don’t let it dominate the queue manager. Communicate with that recipient domain if necessary.

7) “Bounces are exploding; users get backscatter”

Root cause: You’re generating non-delivery reports for forged senders or mis-handling temporary failures as permanent.

Fix: Ensure you reject bad mail at SMTP time when possible; avoid bouncing spam accepted from unauthenticated sources; review policies around 4xx vs 5xx.

Checklists / step-by-step plan

Checklist A: First 10 minutes (triage without regrets)

  1. Get queue size and top error: postqueue -p, log grep for “deferred.”
  2. Split active vs deferred file counts.
  3. Check disk and inodes for /var/spool/postfix.
  4. Confirm Postfix processes exist (master/qmgr).
  5. Check DNS speed with dig for a failing domain.
  6. Check outbound reachability with nc to a failing MX.
  7. Decide whether the intake source is flooding (internal apps). If yes, rate-limit or block temporarily.

Checklist B: Safe recovery (fix + replay)

  1. Fix the root cause (DNS/network/filter/storage).
  2. Test delivery with a single message: postqueue -i QUEUEID.
  3. Lower concurrency temporarily if you expect throttling.
  4. Requeue if scheduling/transport changed: postsuper -r ALL.
  5. Watch queue depth and log error rate for 15–30 minutes.
  6. Release held messages in batches if you quarantined any.
  7. Only then consider deleting truly unwanted mail, narrowly scoped.

Checklist C: When you must delete (controlled, auditable)

  1. Prove the mail is unwanted or harmful (inspect with postcat -q).
  2. Stop the source (compromised account/app) first.
  3. Prefer hold over delete if there’s uncertainty.
  4. Delete by queue ID or sender with an incident ticket reference.
  5. Record what you deleted and why. Future you will be asked.

FAQ

1) Is it safe to run postsuper -d ALL?

It’s “safe” in the same way formatting a disk is safe: it does what it says. It is not a recovery workflow. Use holds, targeted deletes, and fix the root cause first.

2) Should I restart Postfix when the queue is stuck?

Restarting can clear a wedged process, but it rarely fixes DNS, firewall rules, or remote throttling. If you restart, do it once with a reason, then investigate logs immediately.

3) Why is everything in deferred even after the network is fixed?

Deferred messages obey backoff timers. Postfix won’t instantly retry everything. Use postqueue -i for a spot-check, and consider postsuper -r ALL to reschedule once you’re confident.

4) What’s the difference between “flush” and “requeue”?

Flush (a queue run) attempts delivery of eligible mail. Requeue rewrites/reschedules mail in the queue. Requeue is useful after config changes; flush is what happens normally over time.

5) How do I avoid a single bad domain dominating delivery attempts?

Lower per-destination concurrency for that destination and consider transport maps to route it differently. The goal is to keep the rest of your mail flowing while that domain recovers.

6) How do I know if the queue problem is local storage?

Check df -h and df -i for the spool, and look for Postfix log messages about file creation/rename failures. High iowait and inode exhaustion are classic tells.

7) Can I move the Postfix spool to a different filesystem?

Yes, and it’s often a good idea for durability and performance. Do it carefully: stop Postfix, copy with permissions intact, update configuration, and verify queue integrity before starting.
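
A sketch of one way to do it with a dedicated volume mounted over the existing path (so queue_directory stays unchanged); the device name is hypothetical, and the old copy is kept until you’ve verified the queue:

cr0x@server:~$ postfix stop
cr0x@server:~$ mv /var/spool/postfix /var/spool/postfix.old
cr0x@server:~$ mkdir /var/spool/postfix
cr0x@server:~$ mount /dev/mapper/vgdata-mailspool /var/spool/postfix   # hypothetical volume; add it to /etc/fstab as well
cr0x@server:~$ rsync -aHAX /var/spool/postfix.old/ /var/spool/postfix/
cr0x@server:~$ postfix set-permissions
cr0x@server:~$ postfix start
cr0x@server:~$ postqueue -p | tail -n 1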

8) What’s the safest way to inspect queued mail for sensitive content?

Use postcat -q QUEUEID and restrict access to spool directories. Treat spool access like production database access: logged, justified, and minimal.

9) Why are there tons of 4xx errors but hardly any 5xx?

4xx means “temporary” and triggers retries. Many providers intentionally use 4xx to control sender behavior. Your job is to back off, not to argue with SMTP physics.

Next steps (what to do after you’ve stopped the fire)

Once the queue is draining and users stop slacking you screenshots, don’t just walk away. Do these practical next steps while the pain is fresh:

  1. Add queue depth and deferral reason monitoring (trend alerts, not just thresholds).
  2. Monitor spool inodes and latency, not only disk percentage.
  3. Document the “hold, don’t delete” rule and require a reason for destructive actions.
  4. Set sensible concurrency defaults and per-destination limits for your typical mail mix.
  5. Make dependencies explicit: milters, DNS resolvers, relayhosts. If they fail, you should know before the queue becomes your dashboard.
  6. Run a recovery drill: pick a test queue backlog scenario in staging and practice the safe workflow so production isn’t your training environment.

The queue isn’t your enemy. It’s your evidence locker and your shock absorber. Treat it like one.
