Postfix Rate Limiting: Prevent Abuse Without Blocking Real Users

Your mail server doesn’t fail politely. It fails at 09:12 on a Tuesday when marketing uploads a list, a compromised mailbox starts spraying spam, and one partner insists “we didn’t change anything.” Meanwhile the queue climbs, delivery gets deferred, and your CEO learns the phrase “blacklist” in real time.

Rate limiting is how you keep Postfix upright under bad behavior—malicious, accidental, or just enthusiastic. Done well, it reduces blast radius without turning your mail system into a random number generator that denies real users. Done poorly, it’s indistinguishable from an outage.

What rate limiting actually means in Postfix (and what it doesn’t)

Postfix rate limiting is not one feature. It’s a set of pressure valves at different layers:

  • Connection-level controls: how many TCP sessions a client can open, how fast they can open them, and how long you’ll tolerate them being idle.
  • Protocol-level controls: how many recipients per message, how many commands per session, how quickly a client can pipeline.
  • Authentication-level controls: limiting SASL-authenticated submission so a compromised account can’t send 50,000 messages before you blink.
  • Queue and delivery controls: preventing a single domain or destination from consuming all delivery slots and getting you throttled by major providers.
  • Policy controls: external “brains” that apply per-user/per-sender/per-client rules based on state, reputation, or business context.

Rate limiting is not a spam filter. It will not decide whether a message is spam. It decides how much damage any one source can do per unit time. It buys you time and preserves service for everyone else.

Also, rate limiting is not “blocking.” The best rate limiting is defer-first—slow down suspicious behavior while still allowing normal traffic through. Rejections are for clear policy violations, not for “my server is busy.” When you reject because you’re overloaded, you’re exporting your incident to someone else, and they will remember.

Dry-funny truth #1: SMTP will happily let a laptop in a coffee shop attempt to become your “bulk mail platform.” It has the confidence of a toddler with a permanent marker.

Three principles that keep you from self-inflicted outages

  1. Shape traffic, don’t guess. Use measurable limits (per-IP, per-user, per-destination) rather than vibes.
  2. Prefer “temporary failure” over “permanent failure.” 4xx responses preserve deliverability patterns and reduce support tickets.
  3. Make exceptions explicit. If you need to allow a relay partner or a bulk sender, put them in a controlled lane rather than raising global limits.

A single quote worth keeping on a sticky note

Hope is not a strategy. — General Gordon R. Sullivan

Rate limiting is where you replace hope with controls.

Facts and context: why Postfix rate limits look the way they do

Some interesting context helps explain why Postfix gives you a grab bag of knobs instead of a single “limit spam” button:

  1. SMTP predates the commercial internet. It was designed for cooperative hosts, not hostile clients, so rate limiting became a bolt-on survival skill.
  2. Postfix was created as a Sendmail alternative with security and performance as first-class goals, including a multi-process architecture that can keep working even when parts are under strain.
  3. “anvil” exists because counting connections is expensive. Postfix added the anvil(8) service to track per-client rates efficiently without each process reinventing it.
  4. Greylisting popularized the idea of “slow down the bad.” Rate limiting is a cousin: make abusive automation pay time costs.
  5. Large receivers started enforcing sender reputation at scale (rate, complaints, bounces), turning outbound rate limiting into a deliverability feature, not just a security feature.
  6. Botnets changed the inbound game. A single IP used to be “the source.” Now abuse comes from swarms, which makes per-IP limits necessary but not sufficient.
  7. Credential stuffing migrated to mail submission because password reuse is evergreen. Rate limiting auth attempts is now basic hygiene.
  8. Cloud NAT made “per-IP fairness” tricky. Hundreds of legitimate users may appear as one IP, so you need limits that consider authenticated identity, not just client address.

Pick a threat model before you pick a knob

If you don’t decide what you’re protecting against, you’ll implement “limits” that punish the wrong people.

Inbound threats (SMTP server role)

  • Connection floods (SYN/accept queues, process exhaustion, TLS CPU): solve with postscreen, anvil limits, and OS-level tuning.
  • Directory harvest attacks (lots of RCPT TO probes): solve with recipient limits, tarpitting, and policy restrictions.
  • Slowloris-style SMTP (keep sessions open, drip commands): solve with timeouts and minimal per-client concurrency.

Outbound threats (submission/relay role)

  • Compromised mailbox: cap per-user/per-sasl sender rates and alert on anomalies.
  • Bulk sender “accidentally” using you: enforce per-client and per-sender policy; dedicate a separate instance or transport.
  • Destination throttling (big providers): cap concurrency per destination and smooth bursts to protect reputation.

Decide what “good mail” looks like

Real users tend to have patterns: a few recipients, moderate rates, consistent destinations. Abuse tends to be spiky, broad, and impatient. Your configuration should encode that difference.

Fast diagnosis playbook: find the bottleneck in minutes

When someone says “mail is slow” or “some users can’t send,” don’t immediately edit main.cf. First, determine which limiter is active: network, smtpd, policy, queue, or downstream receivers.

First: is it inbound acceptance or outbound delivery?

  • If users report “cannot send” from clients: check submission service, SASL, policy daemon, and outbound rate caps.
  • If remote senders report “cannot reach you”: check inbound listeners, postscreen, connection limits, and DNS.

Second: look at the queue and defer reasons

A growing deferred queue means deliveries are failing temporarily, which usually points at downstream throttling or connection problems. A growing active queue means you accept mail faster than you deliver it: local contention, slow transports, or too-low delivery concurrency.

Third: confirm what Postfix is doing to clients

Logs tell you whether you are rejecting (5xx), deferring (4xx), or just slow. Rate limiting usually shows up as “Connection rate limit exceeded” warnings, policy verdicts like “action=defer,” or postscreen drops.

Fourth: identify the top talkers

Find which IPs, SASL users, or sender addresses dominate traffic. Apply limits surgically.

The rate limiting toolbox: anvil, postscreen, policyd, and friends

1) anvil(8): per-client connection and rate tracking

Anvil keeps counters like “connections per client per time unit” and “simultaneous connections per client.” It backs Postfix parameters such as:

  • smtpd_client_connection_count_limit
  • smtpd_client_connection_rate_limit
  • smtpd_client_message_rate_limit
  • smtpd_client_recipient_rate_limit

These are blunt but effective. They’re also easy to misapply to NATed populations (universities, enterprises, mobile carriers) where many legitimate users share one IP.
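
In main.cf those knobs look like this. A minimal sketch with illustrative starting values, not recommendations; all the rate counters share one clock, anvil_rate_time_unit (default 60s):

# main.cf -- per-client anvil limits (illustrative values)
anvil_rate_time_unit = 60s                  # window for all *_rate_limit counters
smtpd_client_connection_count_limit = 20    # simultaneous sessions per client
smtpd_client_connection_rate_limit = 60     # new connections per client per window
smtpd_client_message_rate_limit = 120       # message submissions per client per window
# clients exempt from all smtpd_client_* limits (default: $mynetworks)
smtpd_client_event_limit_exceptions = $mynetworks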

2) postscreen: stop garbage before it becomes SMTP load

postscreen handles the initial connection stage and can drop obvious junk cheaply. It’s your “front desk bouncer.” Use it when inbound SMTP sees large volumes of spam or bot traffic.

postscreen is especially useful when TLS handshakes and SMTP negotiations are consuming CPU. Better to avoid the negotiation than to rate-limit after you already paid for it.
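
Enabling it is a master.cf change: port 25 goes to postscreen, and smtpd becomes a pass service that only sees connections postscreen lets through. This mirrors the sample entries shipped (commented out) in the stock master.cf:

# master.cf -- postscreen fronts port 25; smtpd becomes a pass service
smtp      inet  n       -       y       -       1       postscreen
smtpd     pass  -       -       y       -       -       smtpd
dnsblog   unix  -       -       y       -       0       dnsblog
tlsproxy  unix  -       -       y       -       0       tlsproxy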

3) Policy services: smarter limits by identity and context

Postfix policy delegation (check_policy_service) lets an external service decide accept/defer/reject based on:

  • SASL username
  • Sender address
  • Client IP and HELO
  • Recipient, domain, and message metadata

This is where you implement “per-user messages per hour,” “per-customer plan limits,” and “special lane for trusted systems” without making your entire SMTP daemon a spreadsheet.
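
Wiring one in is a single restriction entry. A minimal sketch, assuming a policy daemon on 127.0.0.1:10031 (the same example endpoint as Task 13 below), which answers each query with an action such as DUNNO, DEFER_IF_PERMIT, or REJECT:

# main.cf -- delegate per-recipient decisions to an external policy daemon
smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    reject_unauth_destination,
    check_policy_service inet:127.0.0.1:10031

Note what Task 13 demonstrates: if the daemon stops answering, smtpd defers mail with 451 4.3.0. Decide deliberately, per service, whether that fail-closed behavior is what you want.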

4) Delivery shaping: don’t get throttled by the big receivers

Postfix can limit concurrency and rate per destination, via transports and per-destination settings (for example, tuning concurrency and delays). This is less about stopping abuse and more about staying welcome.
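
A sketch of shaping one throttling receiver; the transport name “throttled” and all values are illustrative:

# /etc/postfix/transport -- route the problem domain to a dedicated transport
gmail.com    throttled:

# master.cf -- the transport is an ordinary smtp client under another name
throttled  unix  -       -       y       -       -       smtp

# main.cf -- shape only that transport
transport_maps = hash:/etc/postfix/transport
throttled_destination_concurrency_limit = 4
# stronger option: a pause between deliveries; note that a non-zero
# rate delay forces per-destination concurrency to 1, so don't stack both
# throttled_destination_rate_delay = 1s

Run postmap /etc/postfix/transport and postfix reload after editing.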

5) Submission service separation: protect inbound from outbound

If you run both inbound MX and outbound submission on the same host, isolate the services. Different ports, different restrictions, different limits. Same binary, different behavior. The correct setup feels boring. Keep it boring.
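
In master.cf, “different behavior” is spelled with per-service -o overrides. A minimal sketch of a stricter submission personality (values illustrative):

# master.cf -- same smtpd binary, stricter personality on port 587
submission inet  n       -       y       -       -       smtpd
    -o syslog_name=postfix/submission
    -o smtpd_tls_security_level=encrypt
    -o smtpd_sasl_auth_enable=yes
    -o smtpd_recipient_limit=100
    -o smtpd_client_connection_count_limit=10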

Sane baselines: starting values that don’t ruin your day

There is no universal number. But there are universal mistakes: setting limits so low that legitimate retries become a support storm, or so high that a compromise becomes a PR event.

Inbound (MX) baseline suggestions

  • Use postscreen if you get meaningful bot volume.
  • Set conservative per-client concurrency (smtpd_client_connection_count_limit) to prevent single-IP hogging.
  • Use smtpd_timeout to kill slow sessions, and cap recipients per message with smtpd_recipient_limit (not a timeout, but it blunts a common abuse pattern).

Outbound (submission/relay) baseline suggestions

  • Rate limit by SASL user, not IP, if you have NATed clients.
  • Prefer defer over reject for rate excess. Legitimate clients retry automatically; malware retries too, but the delay buys you detection time.
  • Put hard caps on recipients per message for submission to prevent “one message to 10,000 recipients” disasters.

Dry-funny truth #2: Every mail server eventually becomes a bulk mail server. The only question is whether it happens with your permission.

Practical tasks (commands, outputs, decisions)

These are the moves you make in real incidents and real tuning work. Each task includes: the command, what typical output means, and the decision you take next.

Task 1: Confirm Postfix is up and which services are listening

cr0x@server:~$ systemctl status postfix --no-pager
● postfix.service - Postfix Mail Transport Agent
     Loaded: loaded (/lib/systemd/system/postfix.service; enabled)
     Active: active (running) since Fri 2026-01-02 08:14:03 UTC; 1 day ago

Meaning: Postfix is running. If it’s not, rate limits aren’t your problem; it’s service availability, config syntax, or dependencies.

Decision: If inactive/failed, inspect journalctl -u postfix and run postfix check before touching tuning.

Task 2: Inspect key rate-limit parameters currently in effect

cr0x@server:~$ postconf -n | egrep 'anvil|client_.*rate|client_.*count|recipient_limit|smtpd_timeout|postscreen'
smtpd_client_connection_count_limit = 20
smtpd_client_connection_rate_limit = 60
smtpd_client_message_rate_limit = 120
smtpd_client_recipient_rate_limit = 600
smtpd_recipient_limit = 200
smtpd_timeout = 60s

Meaning: These are non-default settings in your live config. Many incidents begin with “someone tuned this months ago.”

Decision: Record these values. If you’re in an outage, treat changes as rollbacks with a plan, not experiments.

Task 3: Verify anvil is enabled (because those limits depend on it)

cr0x@server:~$ postconf -M anvil/unix
anvil      unix  -       -       n       -       1       anvil

Meaning: The anvil service is available to track rates. If it’s missing, some rate-limit settings won’t behave as expected.

Decision: If anvil is disabled or failing, fix that first; otherwise you’ll chase phantom “limits.”

Task 4: Check the mail queue size and composition

cr0x@server:~$ mailq | tail -n 20
-- 12 Kbytes in 23 Requests.

Meaning: Small queue. Your problem is likely acceptance (clients getting blocked) rather than delivery backlog.

Decision: If the queue is huge, pivot to defer reasons and destination throttling (Tasks 5–7).

Task 5: Summarize deferred reasons from logs (what is actually failing)

cr0x@server:~$ grep -E "status=deferred|status=bounced" /var/log/mail.log | tail -n 5
Jan 03 09:12:44 mx postfix/smtp[22184]: 8A3C73C2F: to=<user@gmail.com>, relay=gmail-smtp-in.l.google.com[74.125.140.27]:25, delay=12, delays=0.2/0.1/9.6/2.1, dsn=4.7.0, status=deferred (host gmail-smtp-in.l.google.com[74.125.140.27] said: 421-4.7.0 Try again later, closing connection. (in reply to end of DATA command))

Meaning: Downstream is throttling you (421 4.7.0). This is not fixed by raising your inbound client limits. It’s fixed by smoothing outbound and reducing suspicious volume.

Decision: Implement per-destination concurrency/rate shaping and verify you aren’t spewing spam or bounces.

Task 6: Identify top sending SASL users (who is generating volume)

cr0x@server:~$ grep "sasl_username=" /var/log/mail.log | awk -F'sasl_username=' '{print $2}' | awk '{print $1}' | sort | uniq -c | sort -nr | head
  842 sales.bot@example.com
  113 j.smith@example.com
   77 alerts@example.com

Meaning: One identity dominates. That could be legitimate automation, or a compromise.

Decision: If unexpected: disable credentials, force reset, and add per-user limits via policy. If expected: move it to a dedicated relay policy/instance.

Task 7: Identify top client IPs on inbound SMTP (who is connecting)

cr0x@server:~$ grep "connect from" /var/log/mail.log | awk '{print $NF}' | sed 's/[][]//g' | sort | uniq -c | sort -nr | head
  502 203.0.113.44
  311 198.51.100.77
  148 192.0.2.10

Meaning: A few IPs are hammering you. If these are unknown networks, you’re under bot/abuse traffic or misconfigured upstream relays.

Decision: Use postscreen/anvil limits or firewall-level rate limiting; whitelist only if you can prove legitimacy.

Task 8: Check whether clients are being rate-limited by Postfix

cr0x@server:~$ grep -E "too many (connections|commands|messages)|rate limit" /var/log/mail.log | tail -n 8
Jan 03 09:10:02 mx postfix/smtpd[21901]: warning: too many connections from 203.0.113.44
Jan 03 09:10:03 mx postfix/smtpd[21902]: warning: 203.0.113.44: SMTP command rate limit exceeded

Meaning: Your server is actively throttling. That might be correct (bots) or incorrect (NATed office, load balancer, monitoring system).

Decision: If it’s a legitimate upstream, create a controlled exception. If not, tighten and add postscreen.
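
A controlled exception means exempting the proven-legitimate client from the anvil counters rather than raising the global numbers (the address is this task’s example IP):

# main.cf -- exempt a known-good upstream from smtpd_client_* limits
smtpd_client_event_limit_exceptions = $mynetworks 203.0.113.44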

Task 9: Validate postscreen status (if enabled)

cr0x@server:~$ postconf -M smtp/inet
smtp       inet  n       -       y       -       1       postscreen
cr0x@server:~$ pgrep -l postscreen
21877 postscreen

Meaning: Port 25 hands connections to postscreen, and the process is alive. If you have inbound flood symptoms and postscreen is down, you’re paying the full SMTP cost per bot.

Decision: If not running, fix enablement in master.cf and verify port 25 is routed to postscreen as intended.

Task 10: Inspect current concurrency limits for delivery

cr0x@server:~$ postconf | egrep 'default_process_limit|smtp_destination_concurrency_limit|smtp_destination_rate_delay|smtp_destination_recipient_limit'
default_process_limit = 100
smtp_destination_concurrency_limit = 20
smtp_destination_rate_delay = 0s
smtp_destination_recipient_limit = 50

Meaning: You may be hitting big-provider throttles because you deliver too aggressively (rate_delay=0s) or too concurrently.

Decision: If you see 421 deferrals from a domain, add rate delay and lower concurrency for that destination via transport maps.

Task 11: Spot TLS CPU pressure (a common hidden limiter)

cr0x@server:~$ top -b -n 1 | head -n 12
top - 09:13:01 up 12 days,  2:41,  1 user,  load average: 7.82, 6.90, 5.11
Tasks: 212 total,   3 running, 209 sleeping,   0 stopped,   0 zombie
%Cpu(s): 78.2 us,  3.1 sy,  0.0 ni, 18.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st

Meaning: High CPU with low IO wait suggests compute-bound work (TLS handshakes, content filters, policy calls). Rate limiting at the SMTP layer may be your cheapest relief.

Decision: Enable postscreen and reduce expensive work for unknown clients; consider TLS session reuse and sane cipher choices later.

Task 12: Confirm you’re not blocking yourself with too-low timeouts

cr0x@server:~$ postconf | egrep 'smtpd_.*timeout|smtp_.*timeout'
smtp_connect_timeout = 30s
smtp_data_done_timeout = 600s
smtp_helo_timeout = 300s
smtpd_starttls_timeout = 30s
smtpd_timeout = 60s

Meaning: Aggressive server-side timeouts can murder slow but legitimate senders (especially over high-latency links).

Decision: If you see many “lost connection after CONNECT/HELO” from legit partners, increase timeouts modestly and rely on concurrency limits instead.
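
If you do relax a timeout, change one parameter deliberately and reload; 300s is the smtpd_timeout default, so returning to it is a modest move:

cr0x@server:~$ sudo postconf -e 'smtpd_timeout = 300s'
cr0x@server:~$ sudo postfix reload
postfix/postfix-script: refreshing the Postfix mail system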

Task 13: Verify policy service calls (and latency) in logs

cr0x@server:~$ grep -E "check_policy_service|policy.*timeout|policyd" /var/log/mail.log | tail -n 6
Jan 03 09:11:22 mx postfix/smtpd[22010]: warning: problem talking to server 127.0.0.1:10031: Connection timed out
Jan 03 09:11:22 mx postfix/smtpd[22010]: NOQUEUE: reject: RCPT from mail.example.net[192.0.2.10]: 451 4.3.0 <127.0.0.1:10031>: Temporary lookup failure; from=<a@b> to=<c@d> proto=ESMTP helo=<mail.example.net>

Meaning: Your policy daemon is the bottleneck; Postfix is deferring because it can’t get a decision.

Decision: Fix policy daemon performance/availability or temporarily bypass it for critical flows. Do not “solve” this by raising smtpd process limits; you’ll just create a bigger pileup.

Task 14: Confirm master.cf service separation (submission vs smtp)

cr0x@server:~$ postconf -M | egrep '^(smtp|submission|smtps)'
smtp       inet  n       -       y       -       1       postscreen
smtpd      pass  -       -       y       -       -       smtpd
submission inet  n       -       y       -       -       smtpd
smtps      inet  n       -       y       -       -       smtpd
smtp       unix  -       -       y       -       -       smtp

Meaning: Services exist. But separation is about options per service, not just having ports open.

Decision: Apply stricter auth requirements and per-user limits on submission, and stronger postscreen/anti-abuse on inbound smtp.

Three corporate mini-stories from the trenches

Incident #1: The wrong assumption (NAT is “one user”)

A mid-sized company ran Postfix for employee outbound mail and inbound MX on the same VM. They had just moved offices and, like many moves, the network team did a “temporary” NAT setup that became permanent. Hundreds of users now shared one public IP for submission.

The mail admin noticed a small increase in outbound spam attempts (real, but not huge). They reacted quickly: set smtpd_client_message_rate_limit and smtpd_client_connection_rate_limit on the submission service. The numbers looked generous—if you picture one person on one IP.

At 09:00 the next day, the helpdesk lit up. Outlook clients showed intermittent send failures. Mobile users saw delays. The CEO’s assistant discovered that “retry later” is not a satisfying message when booking flights.

The logs were clear: “Connection rate limit exceeded” from the office public IP, over and over. The system wasn’t under attack. It was being used normally, but the limiter was watching the wrong dimension. They assumed IP == user; the network proved otherwise.

The fix wasn’t “raise the limit until complaints stop.” They moved submission controls to per-SASL identity via a policy service and added a small per-IP concurrency cap as a safety net. Abuse was contained, and real users stopped tripping over each other.

Incident #2: The optimization that backfired (more concurrency, more throttling)

A SaaS provider sent transactional email—password resets, receipts, alerts—through Postfix. Deliverability mattered. Someone noticed that the outbound queue occasionally built up during peak hours, and made a classic performance move: increase default_process_limit and destination concurrency so delivery “goes faster.”

It did go faster—for about an hour. Then the deferred queue started filling with 421 responses from a large mailbox provider. The provider wasn’t angry about volume per day; it was reacting to burstiness and connection churn. More concurrency looked like aggressive behavior.

Support tickets followed: “I’m not receiving my reset email.” Engineering, under pressure, increased concurrency again. That made the provider throttle harder. This is the mail equivalent of pressing the elevator button repeatedly.

The eventual fix was to do the boring thing: lower per-destination concurrency, introduce a small smtp_destination_rate_delay for the affected domain, and keep transactional mail in a prioritized transport. The queue stabilized, and the provider stopped slamming the door.

The lesson: delivery speed isn’t just your throughput. It’s also how receivers perceive your behavior. Rate limiting outbound is sometimes the fastest way to deliver.

Incident #3: The boring, correct practice that saved the day (separate lanes)

A financial services company had learned the hard way that “one Postfix to rule them all” becomes a single failure domain. They ran separate Postfix instances: one for inbound MX, one for authenticated submission, and one for internal application relays. Same hosts, different ports, different policies, separate queues.

One Friday evening, a legacy app went rogue after a deployment. It started generating high-volume notification mail due to a loop. Not malicious, just wrong. It hit the internal relay instance first.

Rate limits on that lane were strict: per-sender and per-app credentials were capped, and burst allowances were small. The relay instance deferred excess and kept its queue under control. The app’s mail slowed down, alerts fired, and on-call found the bug.

Meanwhile, employee mail submission kept working, and inbound MX kept accepting mail. No customer-visible incident. No emergency “turn it all off” decision. The system degraded in the right place.

This is what good engineering looks like: not heroic, not clever, just designed so failures are contained. You can call it “overkill” until it saves your weekend.

Common mistakes: symptoms → root cause → fix

1) Symptom: “Connection rate limit exceeded” for one IP, and it’s your office

Root cause: You applied per-IP anvil limits to a NATed population (submission traffic).

Fix: Rate limit by SASL user via policy service; keep a mild per-IP cap to stop true floods.

2) Symptom: Random partners complain of dropped connections

Root cause: Over-aggressive timeouts or postscreen tests failing legitimate MTAs (older systems, high latency).

Fix: Increase timeouts modestly; whitelist known-good partners in postscreen access tables; keep strictness for unknowns.

3) Symptom: Queue grows, mostly deferred with 421/4.7.x from big providers

Root cause: You are being throttled; you deliver too bursty, or your reputation is under suspicion.

Fix: Lower per-destination concurrency, add rate delay, and verify outbound sources (compromised accounts, bounce storms).

4) Symptom: CPU spikes during spam waves, even though you reject later

Root cause: Expensive work (TLS, policy, content filters) happens before you drop junk.

Fix: Use postscreen to filter earlier; reduce expensive checks pre-auth; cache or harden policy services.

5) Symptom: Legitimate bulk sends fail (newsletters, invoices)

Root cause: You used global recipient/message rate limits without dedicated lanes.

Fix: Create a separate submission identity/transport with explicit limits and monitoring; do not raise global limits.

6) Symptom: Users see intermittent 451/4.3.0 temporary failures

Root cause: Policy daemon timeouts or overloaded database backend.

Fix: Make policy service highly available and fast; implement sane timeouts and fail-open/closed intentionally per service.

7) Symptom: Delivery is slow even with low queue size

Root cause: You’re limiting concurrency too much globally (default_process_limit too low), or a single destination is consuming slots.

Fix: Balance process limits and per-destination limits; isolate problematic destinations via transports.

8) Symptom: After adding rate limiting, you get more spam complaints

Root cause: You deferred too gently for compromised outbound accounts, allowing sustained spam over time.

Fix: For authenticated submission, enforce hard per-user caps plus alerting and auto-lock workflows.

Checklists / step-by-step plan

Step-by-step: implement safe outbound limits for submission

  1. Separate submission service policy in master.cf (submission on port 587) with mandatory auth and TLS.
  2. Set recipient caps for submission (protect against “one email to a thousand”).
  3. Implement per-SASL rate limits using a policy daemon (recommended) or at minimum use anvil limits cautiously (a stopgap sketch follows this list).
  4. Define exception lanes for known automation accounts (alerts, invoicing) with explicit higher limits.
  5. Instrument logs: extract top SASL users, top sender addresses, and rejection/defer counts.
  6. Alert on anomalies: sudden spikes per user, sudden growth in deferred queue, repeated auth failures.
  7. Run a controlled test: one normal user, one automation account, and one synthetic abusive sender in staging if you have it.
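
For step 3, while the policy daemon is still on the roadmap, Postfix 3.1+ at least lets anvil cap AUTH commands per client. It is a stopgap against credential stuffing, not a per-user send limit:

# main.cf -- cap AUTH commands per client per anvil time unit (Postfix 3.1+)
smtpd_client_auth_rate_limit = 10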

Step-by-step: harden inbound MX without blocking real senders

  1. Enable postscreen if inbound volume justifies it (a settings sketch follows this list).
  2. Apply moderate per-client concurrency limits to prevent one host from hogging smtpd workers.
  3. Use protocol sanity checks: limit recipients, require proper HELO if appropriate, and reject obvious garbage early.
  4. Prefer temporary failures for suspicious bursts, especially for unknown clients.
  5. Maintain a small whitelist for known partners who have legitimate quirks, but review it quarterly.
  6. Review timeouts so you don’t punish high-latency legit MTAs.
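
For steps 1 and 5, a conservative postscreen posture plus a small partner allowlist might look like this; the DNSBL choice and weights are illustrative:

# main.cf -- conservative postscreen settings (illustrative)
postscreen_greet_action = enforce
postscreen_dnsbl_sites = zen.spamhaus.org*2
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_action = enforce
# step 5: known partners with legitimate quirks bypass the tests
postscreen_access_list = permit_mynetworks,
    cidr:/etc/postfix/postscreen_access.cidr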

Step-by-step: shape outbound delivery to avoid receiver throttles

  1. Measure deferrals by domain (gmail, outlook, yahoo, corporate partners); a log one-liner follows this list.
  2. Create per-domain transports for problematic receivers.
  3. Lower concurrency and add rate delay where you see 421/4.7.x deferrals.
  4. Keep transactional traffic prioritized via separate queue/transport if you can.
  5. Stop the bleeding first: if compromised users exist, disabling them beats “tuning around” spam.
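
For step 1, a quick tally of deferrals by recipient domain is enough to decide which transports to create:

cr0x@server:~$ grep 'status=deferred' /var/log/mail.log | sed -E 's/.*to=<[^@>]*@([^>]+)>.*/\1/' | sort | uniq -c | sort -nr | head -n 3
   412 gmail.com
    96 outlook.com
    11 partner-corp.example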

FAQ

1) Should I rate limit inbound or outbound first?

If you run authenticated submission, start with outbound per-user limits. Compromised accounts are high-impact and reputation-destroying. Inbound floods are real too, but they’re usually easier to mitigate with postscreen and connection limits.

2) What’s the difference between connection limits and message rate limits?

Connection limits restrict how many sessions a client can hold or open per time unit. Message rate limits restrict how many messages the client can submit. Bots can be efficient: few connections, many messages. You usually need both, but apply them where they match your threat model.

3) Should rate limiting reject (5xx) or defer (4xx)?

For most rate limiting: defer (4xx). It slows abuse while preserving deliverability for legitimate senders. Use reject for policy violations (open relay attempts, invalid recipients if you reject at RCPT, etc.).

4) Why do my users get blocked when one IP goes over the limit?

NAT. Hotels, offices, mobile networks, and some VPNs concentrate many users behind one IP. Don’t treat IP as identity for submission. Rate limit by SASL username or sender.

5) Can I solve throttling from Gmail/Microsoft by increasing Postfix concurrency?

No. That usually makes it worse. When a receiver is throttling you, you need to deliver more politely: lower concurrency, add delays, and reduce suspicious volume.

6) Is postscreen safe to enable on a production MX?

Yes, if you test and monitor. The risk is false positives with unusual senders. Start with conservative settings and maintain a small whitelist for important partners. Don’t whitelist the internet because one vendor’s MTA is weird.

7) How do I rate limit per mailbox without a policy daemon?

Pure Postfix is strongest at per-client/IP limits via anvil. Per-user needs policy logic. Without it, you can approximate using separate submission services (different ports) and authenticated client restrictions, but it’s clumsy. If you need per-user limits, use a policy service.

8) What metrics matter most for tuning rate limits?

Queue size (active vs deferred), defer reasons (especially 421/4.7.x), top SASL users by volume, top client IPs, and rate-limit log hits. Also track complaint/bounce trends—reputation damage shows up there first.

9) How do I avoid breaking legitimate bulk sends?

Give bulk senders a separate lane with explicit limits and monitoring. Do not raise global limits “just for them.” Global limits are for the internet; lanes are for business exceptions.

Conclusion: what to do next

If you want rate limiting that stops abuse without blocking real users, do it in this order:

  1. Measure: find top talkers (IP, SASL user, sender) and read the defer/reject reasons.
  2. Separate lanes: inbound MX, authenticated submission, and application relays should not share one policy bucket.
  3. Implement identity-aware limits: per-user caps for submission; per-IP caps for inbound; per-destination shaping for delivery.
  4. Prefer defer for rate excess, and keep rejects for true policy violations.
  5. Write down your exceptions and review them. Every permanent exception begins life as a “temporary” one.

Most mail incidents aren’t caused by a sophisticated attacker. They’re caused by normal systems behaving at abnormal scale, plus a few bad assumptions. Rate limiting is how you make the abnormal survivable.
