Email: Inbound Spam Flood — Survive the Attack Without Blocking Real Users

The first sign is usually not a screaming alert. It’s a quiet complaint: “I’m not getting password resets.”
Then another: “My customer says they emailed twice.” Then your CEO forwards you a screenshot of an unread “mail delivery delayed” notice like it’s a personal insult.

By the time you open the mail logs, you’re standing in a hallway of fire doors: inbound connections spiking, queues swelling, disks churning,
spam scanners pegged, and your legitimate mail trapped behind a conga line of garbage. The goal is not “block everything.” The goal is: keep real mail moving while you absorb the hit.

What you’re fighting (and what you’re not)

An inbound spam flood is not “someone sent us spam.” It’s a throughput and fairness problem that happens to arrive over SMTP.
The attacker (or misconfigured bulk sender, or compromised botnet) doesn’t need to be clever. They just need to be numerous and persistent enough to make your mail pipeline trip over itself.

The failure modes that actually take you down

  • Queue blow-up: your MTA accepts messages faster than it can process them (filtering, DNS lookups, delivery), so the queue grows until disk or inode pressure hits.
  • CPU collapse: content filtering (AV, spam scoring, decompression) becomes your hottest path. One bad attachment type can turn into a fork-bomb of work.
  • DNS latency amplification: every SPF/DKIM/DMARC check and RBL lookup becomes a multiplier on inbound connections. Slow resolver = slow everything.
  • Connection table exhaustion: too many simultaneous SMTP sessions; your kernel and MTA start dropping legitimate connections.
  • Backscatter and retries: if you accept and later reject, you generate bounces or force retries; you become part of the problem and your queue stays hot for hours.
  • False positives: the “fix” blocks customers, partner systems, and password resets. That’s how incidents become executive escalations.

The trick is to make saying no cheap and saying yes predictable. Reject as early as possible (during the SMTP dialogue),
and reserve expensive content analysis for mail that is likely legitimate.

One quote worth taping to your monitor, an old operations maxim: “Hope is not a strategy.”
Email floods are where “we’ll just watch it” goes to die.

Joke #1: Email is the only protocol where “Please try again later” is a legitimate business workflow.

Interesting facts and a little history

Spam floods feel modern, but the ecosystem has been evolving for decades. Here are concrete points that influence today’s defenses:

  1. SMTP predates spam by design: the original model assumed cooperative hosts, which is why so many controls are bolt-ons rather than built-ins.
  2. Open relays were a major early spam vector: widespread relay abuse in the 1990s pushed MTAs to lock down relaying by default.
  3. DNSBLs/RBLs became popular because they were cheap: one DNS query could replace expensive content scanning for obviously bad sources.
  4. Greylisting emerged as an economic weapon: delay the first attempt and many spam bots don’t retry correctly, while real MTAs do.
  5. SPF was designed to stop envelope-from spoofing: it helps, but it doesn’t prove the human sender; it proves domain authorization for that path.
  6. DKIM made message authentication scalable: a cryptographic signature carried in the message header lets receivers verify integrity without shared secrets.
  7. DMARC added policy and reporting: it’s an enforcement and visibility layer, not a magic “no spam” button.
  8. Botnets shifted spam volume patterns: instead of a few giant sources, you get many low-volume sources that evade simple per-IP thresholds.
  9. Content scanning got heavier over time: modern spam uses nested archives, weird encodings, and “document” traps that are expensive to unpack.

Fast diagnosis playbook

When the flood hits, you don’t have time to debate architecture. You need to find the bottleneck in minutes.
This sequence is biased toward Linux MTAs (Postfix, Exim) with common filtering stacks, but the thinking applies to Exchange and hosted gateways too.

First: Are we rejecting early, or accepting and drowning later?

  • Check inbound SMTP session rates and concurrent connections.
  • Check whether queue growth is incoming, active, deferred, or hold.
  • Check the ratio of 2xx accepts vs 4xx/5xx rejects.

Second: What resource is saturated right now?

  • CPU: spam/AV workers pegged, load rising, context switching high.
  • Disk: queue directory on slow storage; iowait spikes; inode exhaustion.
  • DNS: resolver latency; tons of outbound DNS; timeouts causing SMTP stalls.
  • Network: SYN backlog; conntrack exhaustion; upstream rate limits.

Third: What’s the cheapest lever that buys time without blocking real users?

  • Raise friction for abusive senders (connection limits, tarpit, postscreen, greylisting).
  • Lower cost of handling (bypass expensive scans for obviously bad; cache DNS; prefer RBL checks).
  • Protect critical flows (allowlists for known partners; separate MX or port/IP for transactional mail).

The flood response mindset: stabilize, then improve. Do not “refactor your mail system” mid-incident unless you enjoy change tickets that read like crime scenes.

Twelve+ practical tasks: commands, outputs, and the decision you make

The goal here is not to show off commands. It’s to make you fast: run a thing, interpret the result, choose the next action.
Examples assume a Linux mail gateway with Postfix and common tooling; adapt paths for your distro.

Task 1: Measure queue size and shape (Postfix)

cr0x@server:~$ mailq | tail -n 5
-- 12437 Kbytes in 1832 Requests.

What it means: You have 1,832 queued messages; size is moderate but count is the real operational pain (processing overhead per message).

Decision: If count is rising quickly, prioritize early rejection and throttling before you try to “process your way out.”

Task 2: Identify top sending IPs hitting your SMTP service

cr0x@server:~$ sudo ss -tn sport = :25 | awk 'NR>1 {print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
  312 203.0.113.55
  287 198.51.100.22
  190 192.0.2.77

What it means: A few IPs are hogging sessions. Could be bots, could be a broken bulk sender.

Decision: Apply per-client connection limits or temporary blocks for the worst offenders only after checking for legitimate partners.
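
If you do act on specific offenders, a CIDR access map is the cheapest Postfix-native lever. A minimal sketch, assuming the file path is yours to pick and that you have already confirmed neither IP belongs to a partner:

# /etc/postfix/flood_block.cidr  (illustrative path and entries)
203.0.113.55/32    554 5.7.1 blocked during flood response
198.51.100.22/32   554 5.7.1 blocked during flood response

Reference it early in your existing smtpd_client_restrictions (keep whatever is already there after it) as check_client_access cidr:/etc/postfix/flood_block.cidr, then run "sudo postfix reload". CIDR maps need no postmap step, so adding and removing entries mid-incident is fast and easy to audit.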

Task 3: Spot whether you’re spending time on DNS lookups

cr0x@server:~$ sudo journalctl -u systemd-resolved --since "10 min ago" | tail -n 6
Jan 04 10:11:20 mx1 systemd-resolved[642]: Transaction 49231 for 55.113.0.203.zen.spamhaus.org. IN A timed out
Jan 04 10:11:20 mx1 systemd-resolved[642]: Transaction 49232 for 22.100.51.198.zen.spamhaus.org. IN A timed out

What it means: RBL lookups timing out. Your MTA will stall, sessions stay open, concurrency rises, queue grows.

Decision: Fix resolver performance now: local caching resolver, shorter timeouts, fewer parallel expensive checks during the peak.
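
A local recursive resolver is usually the fastest fix, not least because DNSBL operators often rate-limit or refuse queries arriving via big public resolvers. A minimal unbound sketch, assuming you are willing to run unbound on the gateway itself (the drop-in path is distro-dependent):

# /etc/unbound/unbound.conf.d/mx-local.conf
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    prefetch: yes          # refresh popular records before they expire
    cache-min-ttl: 60      # floor very short TTLs; trades freshness for survival
    num-threads: 2

Then point the MTA at 127.0.0.1 (resolv.conf or your resolved configuration) and re-check the timeout rate before touching anything else.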

Task 4: Check CPU and iowait (is it compute or disk?)

cr0x@server:~$ iostat -xz 1 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28.14    0.00    9.83   34.76    0.00   27.27

Device            r/s     w/s   rkB/s   wkB/s  aqu-sz  %util
nvme0n1         12.0   420.0    96.0  8120.0    8.40   98.0

What it means: High iowait and disk utilization. Queue I/O is your limiting factor.

Decision: Reduce queue churn (reject earlier, slow accepts), and consider moving queue to faster storage if you can do it safely.

Task 5: Verify inode pressure (queue directories can die by inodes)

cr0x@server:~$ df -ih /var/spool/postfix
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2   2.0M  1.9M  120K   95% /

What it means: You’re close to inode exhaustion. Even with free space, you can’t create new files. MTAs hate this.

Decision: Emergency cleanup: purge obvious spam from queue (carefully), increase inode capacity longer term, and reduce acceptance rate now.
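
For the “carefully” part: target deletions by envelope sender instead of nuking the queue. A sketch that assumes Postfix 3.1+ (for postqueue -j), jq installed, and the spammer@example.net pattern seen later in Task 14:

# list affected queue IDs and spot-check a few with postcat -q before deleting anything
cr0x@server:~$ sudo postqueue -j | jq -r 'select(.sender | endswith("@example.net")) | .queue_id' | head
# then feed the same selection to postsuper for deletion
cr0x@server:~$ sudo postqueue -j | jq -r 'select(.sender | endswith("@example.net")) | .queue_id' | sudo postsuper -d -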

Task 6: Find which queue state is exploding (Postfix queue IDs)

cr0x@server:~$ sudo qshape deferred | head
                    T  5 10 20 40 80 160 320 640 1280 2560 5120 10240 20480 40960+
            TOTAL 1432 0  2  3  6 15  44  98 210  380  390  230    54     0      0

What it means: Deferred is growing: remote delivery failures, greylisting elsewhere, or your outbound path is clogged.

Decision: Separate inbound survival from outbound delivery: consider rate-limiting outbound retries, and make sure inbound rejection is happening pre-queue.

Task 7: Check Postfix process pressure

cr0x@server:~$ sudo postconf -n | egrep 'smtpd_client_connection_count_limit|default_process_limit|smtpd_client_message_rate_limit'
default_process_limit = 200
smtpd_client_connection_count_limit = 20
smtpd_client_message_rate_limit = 100

What it means: You allow 20 concurrent connections per client and 100 msgs/min per client; global process limit is 200.

Decision: During a flood, lower per-client concurrency (bots benefit from parallelism). Increase global limit only if CPU/disk allow it; otherwise you amplify the crash.
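
The storm-mode version of that decision is a handful of postconf changes. The values below are illustrative, not recommendations; anvil exempts $mynetworks by default, so your own relays stay unthrottled unless you have changed smtpd_client_event_limit_exceptions:

cr0x@server:~$ sudo postconf -e \
    "smtpd_client_connection_count_limit = 5" \
    "smtpd_client_connection_rate_limit = 30" \
    "smtpd_client_message_rate_limit = 50"
cr0x@server:~$ sudo postfix reload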

Task 8: Quickly quantify rejects vs accepts from logs

cr0x@server:~$ sudo grep -hoE 'reject:|status=sent|status=deferred' /var/log/mail.log | sort | uniq -c
   1243 reject:
   8911 status=deferred
   3310 status=sent

What it means: A rough breakdown of rejects vs deliveries vs deferrals. For the storm itself, narrow the window first (grep the last few minutes of timestamps) so older traffic doesn’t mask the trend.

Decision: If rejects are low while queue rises, you’re too permissive at SMTP time. Add cheap rejections (RBL, postscreen, strict HELO) and reduce expensive scans.

Task 9: Extract top recipients being targeted (protect them)

cr0x@server:~$ sudo grep -h "to=<" /var/log/mail.log | sed -n 's/.*to=<\([^>]*\)>.*/\1/p' | cut -d, -f1 | sort | uniq -c | sort -nr | head
  842 info@
  731 sales@
  690 support@

What it means: Generic mailboxes are getting hammered.

Decision: Apply recipient-based controls: stricter filtering for catch-all/generics, separate MX for public aliases, or require captcha/portal for inbound to certain addresses (business decision).

Task 10: Check TLS handshake cost and failures (spam loves expensive handshakes)

cr0x@server:~$ sudo grep -h "TLS" /var/log/mail.log | tail -n 5
Jan 04 10:12:01 mx1 postfix/smtpd[23102]: SSL_accept error from unknown[203.0.113.55]: -1
Jan 04 10:12:02 mx1 postfix/smtpd[23108]: Anonymous TLS connection established from unknown[198.51.100.22]: TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)

What it means: Handshake failures can consume CPU; successful TLS is good, but heavy TLS-only policies can be abused by spammers who can afford CPU.

Decision: Keep TLS enabled, but add postscreen/connection controls so you don’t spend crypto cycles on obvious junk.

Task 11: Validate that your resolver cache is working

cr0x@server:~$ resolvectl statistics | head -n 12
Transactions:           98422
Cache Hits:             61210
Cache Misses:           37212
DNSSEC Verdicts Secure: 0

What it means: A healthy cache hit rate buys you survival. If hits are low, you’re redoing expensive lookups repeatedly.

Decision: Ensure local caching, tune TTL respect, and avoid per-message DNS checks that you can postpone or cache in policy services.

Task 12: Check ClamAV or content scanner backlog (example)

cr0x@server:~$ systemctl status clamav-daemon --no-pager | sed -n '1,12p'
● clamav-daemon.service - Clam AntiVirus userspace daemon
     Loaded: loaded (/lib/systemd/system/clamav-daemon.service; enabled)
     Active: active (running) since Thu 2026-01-04 09:41:22 UTC; 30min ago
     Docs: man:clamd(8)
   Main PID: 1432 (clamd)
      Tasks: 42 (limit: 18956)
     Memory: 1.1G
        CPU: 18min 10.221s

What it means: Scanner is alive, consuming CPU, with many tasks. Doesn’t prove throughput.

Decision: If scanner becomes the bottleneck, temporarily shift to “deny by reputation + basic checks” and reserve deep scanning for allowed sources or authenticated senders.

Task 13: See if the kernel is dropping connections

cr0x@server:~$ netstat -s | egrep -i 'listen|overflow|drops' | head
    2489 times the listen queue of a socket overflowed
    2489 SYNs to LISTEN sockets dropped

What it means: You’re losing sessions before Postfix even sees them. Legit senders will retry, but you’re adding chaos.

Decision: Increase backlog and tune kernel limits, but only after you apply MTA-level throttles; otherwise you just accept more pain.
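
Once the MTA-level throttles are in place, the kernel side is a couple of sysctls. The values here are illustrative starting points, not tuning advice; make them persistent in /etc/sysctl.d/ after the incident:

cr0x@server:~$ sudo sysctl -w net.core.somaxconn=1024 net.ipv4.tcp_max_syn_backlog=4096
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 4096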

Task 14: Sample message headers from queued mail (don’t guess)

cr0x@server:~$ sudo postcat -q 3F2A1C2B9E | sed -n '1,25p'
*** ENVELOPE RECORDS active ***
message_size: 28412              452             1               0               28412
sender: spammer@example.net
*** MESSAGE CONTENTS ***
Received: from user (unknown [203.0.113.55])
Subject: Invoice attached

What it means: You can see patterns: bad HELO, suspicious subjects, repeated sources.

Decision: Turn patterns into cheap SMTP-time rules (HELO restrictions, postscreen, RBL), not heavyweight after-accept filtering.
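
As a sketch of what those cheap rules can look like in Postfix (assuming the pcre map support package is installed and the file path is your choice):

# /etc/postfix/helo_checks.pcre  (illustrative patterns distilled from the sampled headers)
/^user$/                 REJECT bogus HELO hostname
/^\d+\.\d+\.\d+\.\d+$/   REJECT bare IP without brackets is not a valid HELO

# main.cf
smtpd_helo_required = yes
smtpd_helo_restrictions = permit_mynetworks, check_helo_access pcre:/etc/postfix/helo_checks.pcre

With the default smtpd_delay_reject = yes these fire at RCPT time, so the log still records the intended recipient for later analysis.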

Joke #2: The fastest spam filter is a 554 response code. It’s also the only one that never needs a signature update.

Control the blast radius: choke points that matter

In a flood, you don’t “fight spam.” You control resource consumption. Your MTA is basically a factory line:
accept connection → negotiate → receive message → run checks → enqueue → deliver → retry if needed.
Any stage that’s slow becomes a pileup behind it.

1) Connection handling: reduce expensive conversations

The most underrated defense is making inbound SMTP connections boring and cheap. Most spam floods rely on you doing work:
TLS handshakes, banner delays, policy lookups, content filters. Your job is to do less per bad sender.

  • Per-client concurrency limits: bots like parallel sessions. Legit MTAs typically behave.
  • Connection rate limiting: limit new connections per time window per IP/subnet.
  • Postscreen (Postfix): keep the SMTP daemon away from obvious garbage until the client proves basic competence.
  • Tarpitting: add a small delay for suspicious clients. Not too big—minutes are just self-harm.
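
A minimal postscreen sketch, following the standard Postfix layout (chroot columns vary by distro; test this in a maintenance window, not mid-flood):

# master.cf: postscreen answers port 25; smtpd becomes a pass service behind it
# (replaces the stock "smtp inet ... smtpd" line)
smtp      inet  n       -       n       -       1       postscreen
smtpd     pass  -       -       n       -       -       smtpd
dnsblog   unix  -       -       n       -       0       dnsblog
tlsproxy  unix  -       -       n       -       0       tlsproxy

# main.cf: punish clients that talk before the banner or score badly on one solid DNSBL
postscreen_greet_action    = enforce
postscreen_dnsbl_sites     = zen.spamhaus.org*2
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_action    = enforce

postscreen caches its verdicts, so well-behaved senders pass the tests once and then go straight through to smtpd.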

2) SMTP-time rejection: never accept what you already know you’ll reject

If you accept a message and then later decide it’s spam, you have already paid the cost: disk I/O, queue entries, scanning time,
and often a retry storm. Reject at connect/helo/mail-from/rcpt-to time whenever possible.

  • Block invalid HELO/EHLO patterns: “HELO user” from a public IP is rarely a serious mail server.
  • Require sane envelope sender where appropriate.
  • Use RBLs carefully: one solid list beats five flaky ones that time out.
  • Recipient validation: reject unknown recipients at SMTP time to avoid backscatter and queue growth.
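
Put together, the restriction order expresses exactly that “cheap first” idea. A sketch (the allowlist path is an example, and reject_unlisted_recipient only helps if local_recipient_maps / relay_recipient_maps actually enumerate your valid users):

# main.cf: deterministic, cheap rejections before anything expensive
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_unauth_destination,
    check_client_access cidr:/etc/postfix/partner_allow.cidr,
    reject_unlisted_recipient,
    reject_rbl_client zen.spamhaus.org,
    permit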

3) Fairness for real users: protect critical senders and recipients

“Don’t block real users” isn’t sentimental; it’s operational. Your best customers will trigger the same controls that catch spam if you set them bluntly.
The fix is segmentation and explicit exceptions.

  • Partner allowlists at the gateway level (IP ranges, authenticated sources, or known domains with stable infrastructure).
  • Separate inbound MX: a public MX that absorbs junk and a protected channel for transactional/partner traffic.
  • Different policies per recipient: executives and password-reset addresses deserve low latency and higher trust.

Filtering strategy during a flood: prioritize flow over perfection

Content scanning is where good intentions go to get pegged at 100% CPU. In normal times, you can afford to unzip attachments,
run AV, do fuzzy hashing, and compute elaborate spam scores. During a flood, every extra millisecond becomes a multiplier.

Reputation first, content second

Start with cheap signals that terminate early:

  • IP reputation and basic protocol sanity: postscreen, RBL, invalid command pipelining, too many errors.
  • Envelope policy: reject nonexistent recipients; limit recipients per message; block obvious forged local domains.
  • Authentication checks: SPF/DMARC results can be used for scoring or rejection for high-risk domains, but beware DNS dependence.

When to use greylisting (and when it’s a trap)

Greylisting can be lifesaving in a sudden flood: it turns your server into a “try again later” machine, and a lot of spamware gives up.
But it also delays legitimate first-time senders, which is unpleasant if you’re onboarding new customers or receiving time-sensitive tickets.

A practical middle ground: greylist only on suspicious signals (no reverse DNS, invalid HELO, new IP with bad behavior),
and exempt known good senders or authenticated partners.
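
One way to express “greylist only on suspicious signals” in Postfix is a restriction class fed by the access maps you already have. A sketch that assumes the postgrey policy daemon on its Debian-default port:

# main.cf
smtpd_restriction_classes = greylist_suspicious
greylist_suspicious = check_policy_service inet:127.0.0.1:10023

# in an access map (for example the helo_checks.pcre above), return the class instead of REJECT:
#   /^\d+\.\d+\.\d+\.\d+$/   greylist_suspicious

Partners and authenticated senders never reach the class as long as your allowlist checks sit before it in the restriction list.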

DMARC/SPF/DKIM during the storm

These checks help, but don’t let them become your bottleneck:

  • SPF: great for eliminating obvious spoofs, but DNS-heavy. Cache results, and set timeouts aggressively.
  • DKIM: verification cost is usually reasonable, but broken signatures are common; treat failures as signals, not always rejections.
  • DMARC: powerful for domains that publish strict policies. For others, it’s mostly scoring and reporting.

Defer vs reject: choose your pain

A 4xx temporary failure tells legitimate MTAs to retry later. Spammers also retry, but many do it poorly. A 5xx reject is cleaner and cheaper,
but it increases the risk of blocking a real message due to a false-positive signal.

During a flood, use targeted deferrals (suspicious sources, unknown senders to hot recipients) and confident rejects
(known bad reputation, invalid recipients, protocol violations). Avoid blanket deferral that just keeps your connection table full.
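
In access(5) maps the choice is literally the first digit of the reply. Illustrative entries (the IPs are documentation ranges):

# confident reject: known-bad reputation
203.0.113.0/24    554 5.7.1 rejected during flood response
# targeted deferral: suspicious but not proven bad; real MTAs will retry
198.51.100.0/24   450 4.7.1 greylisted during flood response, please retry later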

Queues, disks, and the part storage engineers quietly fix

Spam floods are often diagnosed as “mail problem,” then resolved as “storage problem.” Queue directories are small-file hell:
thousands of tiny writes, fsync patterns, directory lookups, and metadata churn. Your fancy NVMe helps; your overworked network filesystem does not.

Queue architecture basics (why it hurts)

Most MTAs store each message as multiple files (envelope, content, deferred state). Under heavy load:

  • Directory operations dominate (create, rename, unlink).
  • Journaling overhead becomes visible.
  • Inodes get consumed rapidly.
  • Backup agents, antivirus on-access scanning, or indexing tools can turn queue directories into a slow-motion tragedy.

Practical storage moves that help during floods

  • Keep spool on local fast storage: not NFS, not “that shared SAN volume everyone uses.” Local NVMe or SSD is the boring, correct choice.
  • Separate spool from bulk logs: avoid contention with log rotation or shipping.
  • Watch inode usage: size-based alerts miss this failure entirely.
  • Minimize synchronous writes where safe: careful with mount options; reliability matters, but you can often tune without lying to the filesystem.
  • Keep queue cleanup tools ready: you don’t want to write a one-off script under pressure.
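
For the inode point specifically, here is a crude check you can cron and wire into whatever alerting you already have; the threshold and path are examples:

cr0x@server:~$ df --output=ipcent,target /var/spool/postfix | awk 'NR==2 && $1+0 >= 85 {print "WARN: inode usage", $1, "on", $2}'
WARN: inode usage 95% on /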

Queue management is incident response, not janitorial work

A growing queue is not automatically “bad.” It’s a buffer. The incident happens when the buffer grows faster than your recovery rate and starts consuming everything else.
Your goal is to shape the queue so that important mail is delivered first and garbage is rejected early.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company ran a pair of mail gateways and a separate ticketing system. They had decent spam filtering and believed they were safe because “we have a cloud provider in front.”
The wrong assumption was that the provider would absorb any flood and forward only “real email.”

The flood arrived as a slow-boil: thousands of small messages per minute, mostly to support@ and info@.
The provider dutifully delivered them, because they were technically valid SMTP and not obviously malicious.
The on-prem gateways accepted them and then handed them to content scanning. The scanners didn’t crash; they just got slower. Queue grew. Disk iowait grew.

Real customer emails started timing out. Password reset emails (also inbound replies and bounces) got delayed.
The support team’s workaround was to ask customers to “use chat.” The chat vendor rate-limited them. That’s how email incidents grow legs.

The fix was not “buy more filtering.” It was to move rejection forward: recipient validation at SMTP time, strict limits per client, and postscreen-style gating.
Then they created a protected path for high-value inbound mail: partners and transactional replies had a separate MX with tighter allowlists.

The postmortem takeaway: a cloud edge is not a policy engine unless you configured it to be one. If your gateway accepts it, you own it.

Mini-story 2: The optimization that backfired

A retail company tried to speed up mail processing by increasing worker counts: more Postfix processes, more Amavis children, bigger concurrency everywhere.
It worked beautifully in a quiet staging environment, which is a sentence that should always make you suspicious.

During a spam wave, the “optimization” became a multiplier. With more concurrency, they accepted more messages per second and immediately dumped them onto disk and scanners.
The scanners saturated CPU. The disk saturated IOPS. Latency increased, which kept SMTP sessions open longer, which increased concurrency further.
The system didn’t explode; it slowly suffocated.

They also tuned DNS timeouts upward “to be robust.” In practice, this meant they waited longer for slow RBL responses,
holding sessions open and burning processes while the resolver struggled.

The recovery involved doing the opposite of their instincts: reduce concurrency to a level the disk could sustain, shorten DNS timeouts,
and reject more at connection time. Throughput improved because the system stopped thrashing.

The lesson: in pipeline systems, “more parallelism” is not always more throughput. If a downstream stage is saturated, concurrency just increases the size of your problem.

Mini-story 3: The boring but correct practice that saved the day

A financial services org had a reputation for being slow to change and annoyingly strict about runbooks. They also rotated on-call like it was a sacred ritual.
This did not make them popular at happy hour, but it made their mail system resilient.

They had three mundane things in place: (1) queue directory on dedicated local SSD with inode monitoring, (2) a local caching resolver with metrics,
and (3) a pre-approved “incident policy set” for Postfix that could be enabled in minutes: stricter RBLs, lower per-client concurrency, higher postscreen aggressiveness,
and temporary deferrals for suspicious sources.

When a flood hit, the on-call didn’t improvise. They flipped the incident policy, watched queue growth stabilize, and then selectively relaxed rules for a partner
whose mail was getting caught by a too-aggressive HELO check. That partner got an allowlist entry with an expiry ticket.

Business impact was limited to delays for low-priority inbound mailboxes. The executives never noticed.
Later, the team reviewed logs, found the main botnet ASNs, and updated longer-term policy.

The lesson: “boring” is a feature. Pre-approved knobs and monitored fundamentals beat heroics.

Checklists / step-by-step plan

Phase 0: Before the flood (you will not do this during the flood)

  1. Instrument the pipeline: queue size, queue age distribution, SMTP session rate, reject/accept counts, DNS latency, disk iowait, CPU per filter.
  2. Separate roles: inbound MX should not be the same machine that runs unrelated heavy jobs. Email wants predictable resources.
  3. Build an incident policy set: a known-good, pre-tested “storm mode” configuration you can enable quickly.
  4. Maintain allowlists with owners: every allowlist entry should have a reason and an expiry plan.
  5. Know your critical mail flows: password reset, billing, partner integrations, legal notifications. Identify recipients and sending IPs/domains.

Phase 1: First 15 minutes (stabilize)

  1. Confirm it’s inbound: check SMTP connection rate and inbound queue growth.
  2. Identify the bottleneck: CPU vs disk vs DNS vs kernel drops.
  3. Enable storm mode: lower per-client concurrency, enable postscreen/tarpit/greylisting selectively, tighten RBL usage.
  4. Protect critical senders: add temporary allowlists for known transactional providers and key partners if they get caught.
  5. Stop accepting what you can’t process: prefer early rejects; if necessary, temporary 4xx for suspicious sources to preserve service.

Phase 2: Next hour (restore flow)

  1. Reduce expensive scanning: disable deep archive recursion, cap attachment sizes for unauthenticated/untrusted sources, and ensure you fail closed only when you mean it.
  2. Fix DNS: local caching resolver, tune timeouts, and reduce the number of DNS-based checks executed per message.
  3. Manage the queue: prioritize legitimate mail, and consider quarantining obvious flood traffic.
  4. Communicate: tell support what is delayed, what is safe, and what alternative channels exist without promising miracles.

Phase 3: After stabilization (clean up and harden)

  1. Remove temporary blocks with a ticket trail and expiry checks.
  2. Quantify false positives: sample quarantined/rejected mail patterns, adjust rules.
  3. Capacity plan: measure peak acceptance rates and scanning throughput; decide where to spend money (compute, storage, or upstream filtering).
  4. Document what worked, what didn’t, and which knobs were too risky.

Common mistakes (symptoms → root cause → fix)

1) Symptom: Queue grows, CPU is fine, iowait is high

Root cause: queue on slow/contended storage; too many small-file operations; inode pressure; possibly antivirus scanning the queue directory.

Fix: move spool to fast local disk; exclude spool from on-access scanning; reduce acceptance rate with SMTP-time rejects; monitor inodes.

2) Symptom: Lots of SMTP sessions stuck, few rejects, timeouts in logs

Root cause: DNS lookups timing out (RBL/SPF/DMARC), causing sessions to wait; resolver overloaded or upstream blocked.

Fix: local caching resolver; reduce the number of DNS checks; shorten timeouts; ensure outbound DNS is not rate-limited; consider temporary disabling of nonessential lookups.

3) Symptom: Load average spikes, scanner processes multiply, deliveries slow

Root cause: content scanning is the bottleneck (AV/spam scoring, archive recursion, PDF/Office extraction).

Fix: cap recursion depth; skip expensive MIME types for untrusted senders; use reputation gating before scanning; scale scanning horizontally if that’s part of your design.

4) Symptom: Legit partner mail blocked after tightening rules

Root cause: blunt RBL usage, strict HELO checks, or aggressive greylisting without exemptions.

Fix: add allowlists with ownership; use multi-signal scoring instead of single-rule rejection; greylist only suspicious sources; test changes with sampled partner traffic.

5) Symptom: Kernel reports SYN drops / listen queue overflow

Root cause: connection flood; insufficient backlog; MTA not accepting fast enough; too many concurrent SMTP sessions.

Fix: MTA-level connection/rate limits first; then tune kernel backlog; consider upstream filtering or TCP-level protection if available.

6) Symptom: Outbound mail delayed while inbound flood happens

Root cause: shared resources (same queue/disk/CPU) and retry storms; outbound delivery processes starved.

Fix: separate inbound/outbound roles or at least separate queues; throttle outbound retries; ensure inbound rejection prevents needless queue growth.

7) Symptom: You see lots of bounces you didn’t intend to send

Root cause: accepting spam and later rejecting or generating DSNs to forged senders (backscatter).

Fix: reject at SMTP time; validate recipients during RCPT TO; avoid generating DSNs for unauthenticated inbound where possible.

FAQ

1) Should I just block entire countries or ASNs during a flood?

Only if your business reality supports it. Geo-blocking can buy time, but it’s a blunt instrument with predictable collateral damage.
If you do it, make it temporary, logged, and reviewed. Prefer per-ASN blocks based on observed abusive traffic rather than vibes.

2) Is greylisting still useful in 2026?

Yes, selectively. Blanket greylisting annoys legitimate first-time senders and breaks some poorly built SaaS mailers.
Greylist suspicious sources, exempt known good senders, and monitor the delay impact on support and sales workflows.

3) Why not crank spam scoring to “very aggressive” and call it done?

Because the expensive part is often the scoring itself. Also, high aggression increases false positives exactly when your users are most sensitive.
During floods, shift left: reject cheap and early; scan deeply only for mail that passes basic trust gates.

4) How do I avoid blocking password reset and sign-up emails?

Separate transactional flows. Ideally, inbound gateways should have a protected path for mail from your own providers and key partners,
with strict allowlists and authentication where possible. Also, don’t route your own critical emails through the same “public MX” policy as random inbound traffic.

5) What’s the most common “we didn’t think of that” bottleneck?

DNS. Nearly every anti-spam feature that looks “lightweight” turns into DNS queries under load.
If your resolver is slow or your upstream rate-limits you, your mail system becomes a distributed waiting room.

6) Can I delete the queue to recover quickly?

You can, but you probably shouldn’t. First, stabilize acceptance so the queue stops growing.
Then surgically remove obviously bad traffic (by sender IP ranges, recipients like info@ under attack, or known spam signatures) with auditability.
Deleting everything is how you turn an email incident into a business incident with legal seasoning.

7) Should I disable antivirus scanning during a spam flood?

Sometimes, partially. If the flood is high-volume low-sophistication spam, AV may be wasting CPU on junk that you should reject earlier anyway.
But turning AV off entirely can be risky if you’re also receiving targeted malware. A safer approach is gating: run AV only after reputation/protocol checks pass.

8) How do I know if I’m rejecting too much legitimate mail?

Track false positives deliberately: sample rejected mail logs, quarantine suspicious decisions instead of hard rejecting when uncertain,
and watch for business signals (support tickets, partner complaints, missing password resets). If you don’t measure, you’ll guess—and guessing is expensive.

9) Do I need multiple MX records and multiple gateways?

For most orgs that care about uptime, yes. Redundancy isn’t only about hardware failure; it’s about absorbing abuse.
Two gateways let you do maintenance, isolate issues, and apply different policies for different mail flows.

10) Why do my legitimate senders keep getting hit by RBLs?

Because “legitimate” doesn’t mean “well-managed.” Shared hosting, cheap VPS providers, and misconfigured SaaS mailers can land on blocklists due to neighbor behavior.
The fix is to build a process: verify the sender, allowlist with constraints, and push the partner to improve their sending infrastructure.

Conclusion: next steps you can do today

Surviving an inbound spam flood without blocking real users is a systems problem. You win by being cheap to attackers and predictable to everyone else.
Don’t chase every spam sample. Control your bottlenecks, reject early, and keep a protected lane for critical mail.

Do these in the next week

  • Write and test a “storm mode” config: connection limits, postscreen/greylisting policy, and a reduced-cost filtering pipeline.
  • Put the queue on fast local storage and add inode alerts. Disk space alerts alone are a comforting lie.
  • Stand up a local caching resolver with metrics and sane timeouts; measure DNS latency under load.
  • Define critical inbound flows (password resets, billing, partner mail) and ensure they have explicit exceptions and monitoring.
  • Practice queue triage: sample headers, identify top sources/recipients, and rehearse a safe cleanup procedure.

The best time to discover your mail gateway’s true throughput is not during a flood.
The second-best time is before the next one, which is scheduled for an inconvenient moment you haven’t picked yet.
