Email: Authenticated relay vs direct send — pick the approach that doesn’t get you blocked


You shipped the feature. The emails “send.” Users still don’t get them. Or worse: they get them three hours late, in Promotions, or in Quarantine, and your support queue turns into a crime scene.

This is where teams discover the difference between “SMTP works” and “mail is delivered.” Choosing between direct send and an authenticated relay isn’t a stylistic preference. It’s an operational decision that determines whether your messages arrive, whether your IP gets torched, and whether your on-call will spend Friday night reading SMTP transcripts like tea leaves.

The decision: direct send vs authenticated relay

Direct send (your MTA delivers straight to recipient MX)

In a direct send model, your server runs an MTA (Postfix/Exim/etc.) and connects to recipient mail exchangers (MX) on the Internet. You control the full SMTP session, the queue, retries, TLS policy, and throughput. You also inherit the full burden of reputation and deliverability engineering.

Use direct send when:

  • You can commit to running mail as a production service (monitoring, abuse handling, reputation management).
  • You need high control over routing, retries, custom policies, and per-domain tuning.
  • You have stable sending volume and can warm IPs properly.
  • You can assign a clean, dedicated IP (or a carefully managed pool) with correct rDNS.

Avoid direct send when:

  • You’re on cloud instances with frequently changing IPs.
  • You don’t control rDNS/PTR records (common in consumer VPS plans and some internal networks).
  • You send low volume but need high deliverability (reputation never “settles”).
  • You don’t have a plan for spam complaints, bounces, and feedback loops.

Authenticated relay (your server submits mail to a provider)

In an authenticated relay model, your systems send mail to a smarthost over SMTP submission (usually port 587, sometimes 465) with authentication (SASL) and TLS. The relay provider then handles delivery to recipient MXs. That provider may be an ESP (transactional email service), your corporate suite (Microsoft 365 / Google Workspace), or an internal relay tier.

Use authenticated relay when:

  • You want deliverability without becoming an email deliverability company.
  • You need predictable outbound connectivity from locked-down networks (587 works where 25 is blocked).
  • You want reputation handled by a party that has staff dedicated to it.
  • You need consolidated audit, rate limiting, abuse detection, and bounce processing.

Avoid authenticated relay when:

  • You can’t tolerate dependency on a third-party sending infrastructure.
  • You have strict data residency constraints and the relay can’t meet them.
  • You need certain SMTP behaviors the relay won’t allow (rare, but it happens).

My operational bias

If you’re running a typical SaaS product sending internal app notifications, password resets, invoices, and alerts: use an authenticated relay. It’s boring. It reduces blast radius. It keeps you off blocklists when some intern’s script decides to “test” by emailing the entire customer table twice.

Direct send is for teams who explicitly choose to own email as infrastructure, including all the miserable edge cases. It can be done well. But if you treat it like “just another daemon,” it will treat your company like “just another spammer.”

Facts and history that still matter in 2026

  • SMTP predates spam as an industrial problem. Early SMTP assumed cooperating peers; modern filters assume you might be lying until proven otherwise.
  • Port 25 egress blocking became common as ISPs fought botnets. Many networks still block outbound 25; submission ports (587/465) are the pragmatic path.
  • SPF showed up because forging envelope senders was trivial. It’s a DNS-era band-aid that still matters because receivers use it to model intent and accountability.
  • DKIM was built to protect headers from tampering and prove domain responsibility. It’s not encryption; it’s a signature that survives forwarding (sometimes).
  • DMARC exists because SPF and DKIM alone weren’t enough. Alignment and policy let domains say “reject this if it isn’t really us.”
  • Bulk senders drove sophisticated reputation systems. The “IP is good/bad” idea evolved into domain reputation, content reputation, engagement signals, and complaint rates.
  • rDNS/PTR remains a blunt but effective filter. A missing PTR won’t always block you, but it often downgrades trust and increases spam placement.
  • TLS became an expectation, then a signal. Opportunistic STARTTLS is table stakes; some receivers now score you for modern ciphers and stable TLS behavior.
  • Greylisting trained MTAs to retry correctly. Sloppy retry behavior is still a sign of low-quality infrastructure and can hurt deliverability.

Delivery models and where each breaks

Model A: App → local MTA → direct to recipient MX

This is the classic. Your app speaks SMTP to localhost, Postfix queues, Postfix does DNS lookups for MX, then delivers outward. Breakpoints include DNS, outbound connectivity, reputation, rDNS, TLS negotiation, and rate limits at the receiver.

Failure mode you’ll recognize: it works for small domains, fails for the big ones. That’s because major providers have the strictest reputation systems and the best telemetry.

Model B: App → local MTA → authenticated smarthost relay

Your app still speaks SMTP locally, but Postfix relays everything to a provider with AUTH and TLS. Breakpoints move from “can we reach every MX on earth?” to “can we authenticate, comply with provider limits, and align identity?”

The sneaky failure: mail is accepted by the relay but never arrives. This is where you need to understand that “accepted” is not “delivered.” It’s “queued somewhere else.” You need message IDs and a way to correlate events.

Model C: App → provider API (HTTP) → provider delivers

Not the headline topic, but it matters because teams often drift into it. API sending removes your local MTA but increases vendor coupling. It can be excellent. It can also break in weird ways when your app deploys a new template that triggers content filters.

Hybrid reality

Most companies end up hybrid:

  • Transactional mail via authenticated relay (password resets, receipts).
  • Internal alerts via direct send to corporate MX (or internal relay).
  • Marketing mail via a dedicated ESP pipeline.

That split is healthy. It also means you must be explicit about which path each message uses. Otherwise you’ll accidentally route password resets through the marketing IP pool and learn new meanings of the word “incident.”

Identity: SPF, DKIM, DMARC, alignment, and why “pass” can still fail

SPF: authorization for the envelope sender

SPF checks whether the sending IP is authorized to send on behalf of the domain in the RFC5321.MailFrom (envelope-from). It’s evaluated via DNS TXT records.

What SPF does well: stops casual spoofing and gives receivers a clean signal when your infrastructure is stable.

What SPF does badly: forwarding. If a message is forwarded through another server, SPF might fail because the forwarder’s IP isn’t in your SPF record.
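
To make that concrete, here are illustrative SPF records for each model (example.com, the IPs, and the include target are placeholders; a domain publishes exactly one SPF record, so these are alternatives, not a pair):

```text
; Direct send: authorize your own outbound IPs explicitly
example.com.  IN TXT  "v=spf1 ip4:203.0.113.10 ip4:203.0.113.11 -all"

; Authenticated relay: authorize the provider's published include
example.com.  IN TXT  "v=spf1 include:_spf.relay.example -all"
```

Remember that every include counts toward SPF's 10-DNS-lookup limit, and -all means unauthorized IPs hard-fail.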

DKIM: signature for message headers/body

DKIM signs selected headers and the body with a private key; receivers verify via a public key in DNS. It survives forwarding better than SPF because the signature remains verifiable—unless the message is modified.

Common operational footgun: a downstream system modifies the message (footer injection, line wrapping, MIME conversion), breaking the DKIM signature. The receiver sees “DKIM fail,” and your DMARC alignment story gets ugly.

DMARC: alignment and policy

DMARC asks: does the domain in the visible From header align with either SPF or DKIM (or both), and what should we do if it doesn’t? Policy is expressed as none/quarantine/reject.

Alignment is where teams get surprised. You can have SPF pass and DKIM pass, and still fail DMARC if they’re not aligned with the From domain. That’s not the receiver being mean. That’s your configuration being incoherent.
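
A minimal sketch of the alignment rule (simplified: real relaxed alignment compares "organizational domains" derived via the Public Suffix List; this approximation just uses the last two DNS labels):

```python
# Simplified DMARC identifier alignment check.
# NOTE: real relaxed alignment uses the Public Suffix List to find the
# organizational domain; the last-two-labels rule below is an approximation.
def org_domain(domain: str) -> str:
    return ".".join(domain.lower().rstrip(".").split(".")[-2:])

def aligned(from_domain: str, auth_domain: str, mode: str = "r") -> bool:
    """auth_domain is the DKIM d= domain or the SPF MailFrom domain."""
    if mode == "s":  # strict: exact match required
        return from_domain.lower() == auth_domain.lower()
    return org_domain(from_domain) == org_domain(auth_domain)  # relaxed

# SPF pass + DKIM pass can still fail DMARC if neither identifier aligns:
print(aligned("example.com", "mail.example.com", "r"))        # True (relaxed)
print(aligned("example.com", "mail.example.com", "s"))        # False (strict)
print(aligned("example.com", "relay-provider.example", "r"))  # False
```

This is why adkim=s in a DMARC record bites teams whose relay signs with its own d= domain.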

Authenticated relay changes who “owns” the sending IP

With direct send, your server IP is evaluated against your domain’s SPF. With an authenticated relay, the relay provider’s IPs matter, and your SPF must authorize them. DKIM can be signed by you (preferred) or by the provider (acceptable if aligned properly).

One quote worth keeping on a sticky note

“Hope is not a strategy.” — General Gordon R. Sullivan

Mail deliverability is the textbook place where this applies. If you “hope” your cloud IP will be trusted, you’re choosing downtime via spam folder.

Infrastructure reality: IP reputation, rDNS, TLS, queues, rate limits

IP reputation is not a moral judgment, it’s a statistical model

Receivers score your IP (and domain) based on complaint rates, bounce rates, spam trap hits, sending patterns, and content signals. Direct send means you carry that score. Authenticated relay means you borrow someone else’s scoring model—and their constraints.

Direct send also means you inherit the sins of your IP’s past. In cloud environments, IP reuse is real. You can get an IP with a history, and your first email will be treated like a returning guest who broke the furniture last time.

Reverse DNS (PTR) is basic hygiene

Many receivers expect an IP to have a PTR record that matches a forward DNS record (A/AAAA) and looks reasonable. It doesn’t need to be poetic. It needs to be consistent and under control.

TLS: you don’t need perfection, you need consistency

Opportunistic TLS is the baseline. Some receivers penalize “weird TLS,” like broken chains, mismatched hostnames, or weak cipher suites. When using an authenticated relay, you usually get modern TLS by default and fewer reasons to debug OpenSSL at 3 a.m.

Queues and retries are part of delivery

An MTA that retries correctly and backs off intelligently looks like legitimate infrastructure. An MTA that retries aggressively looks like a botnet. One of these gets delivered. The other gets blocked.
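
The "backs off intelligently" part can be sketched at the application layer (function name and intervals are illustrative; Postfix's own queue manager handles this for MTA-level retries):

```python
import random

# Illustrative retry schedule: exponential backoff with jitter and a cap.
# App-level senders that talk SMTP directly must not hammer a deferring
# receiver; doubling delays with a ceiling mimics legitimate MTA behavior.
def backoff_seconds(attempt: int, base: float = 300.0, cap: float = 14400.0) -> float:
    """Delay before retry N (attempt starts at 1): 5 min, 10 min, 20 min... capped at 4 h."""
    delay = min(cap, base * (2 ** (attempt - 1)))
    return delay + random.uniform(0, delay * 0.1)  # jitter avoids synchronized retry storms

schedule = [round(backoff_seconds(n)) for n in range(1, 7)]
```

A fixed 30-second retry loop, by contrast, is exactly the fingerprint receivers associate with abuse automation.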

Rate limiting is not personal

Big providers rate-limit new or low-reputation senders. Your job is not to “fight” rate limits; it’s to design around them:

  • Queue and retry gracefully.
  • Send critical mail first (password resets, login alerts).
  • Warm up IPs and domains instead of dumping volume all at once.
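
The "send critical mail first" idea, sketched as a tiny priority queue (the message classes and priority values are invented for illustration):

```python
import heapq

# Priority-aware outbound queue: when a receiver throttles you, bulk mail
# waits and password resets still go out. Classes/values are illustrative.
PRIORITY = {"password_reset": 0, "login_alert": 0, "receipt": 1, "digest": 2}

queue: list = []
_seq = 0

def enqueue(msg: dict) -> None:
    global _seq
    # (priority, sequence, payload): sequence keeps FIFO order within a class
    heapq.heappush(queue, (PRIORITY.get(msg["class"], 3), _seq, msg))
    _seq += 1

def next_to_send() -> dict:
    return heapq.heappop(queue)[2]
```

Under a rate limit, the sender drains this queue as budget allows instead of sending in arrival order.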

Short joke #1: Email deliverability is like dating—if you show up unannounced from a sketchy address, don’t be shocked when you’re ignored.

Three corporate mini-stories (realistic, anonymized, painful)

Mini-story 1: The incident caused by a wrong assumption

The team ran a small fleet of application servers and decided to “keep it simple”: Postfix on each node, direct send to the Internet, same config everywhere. They used a cloud provider’s default outbound IPs. Nobody owned rDNS. Nobody owned reputation. But staging emails arrived, and so did the first week of production mail, so it felt fine.

Then an unrelated scaling event replaced a chunk of nodes. Those replacements came with different IPs. Overnight, password reset emails started bouncing for major mailbox providers with vague 4xx deferrals and occasional 5xx rejections. Support saw a spike: users couldn’t log in, couldn’t reset passwords, and blamed “the app.” The app was innocent. The mail path wasn’t.

The wrong assumption was subtle: “an IP is an IP.” In email, an IP is a reputation object with history. New IPs meant new reputation, and the reputation was either cold or bad. The fleet’s sending behavior changed from stable to chaotic, and the receiving systems reacted like they always do: protect users first.

The fix was operational, not heroic. They moved to an authenticated relay with stable sending IPs, added SPF authorization for the relay, enabled DKIM signing aligned to the From domain, and stopped pretending cloud autoscaling is compatible with DIY deliverability by default. Direct send could have worked, but it would have required dedicated egress IPs, rDNS control, warming, monitoring, and someone on the hook. They didn’t have that, so they shouldn’t have chosen that model.

Mini-story 2: The optimization that backfired

A different company used an authenticated relay, but they wanted to shave latency off their notification pipeline. Someone noticed each app server established new TLS sessions to the relay frequently. So they “optimized” by cranking up connection reuse, increasing concurrency, and batching more emails per connection.

It worked—until it didn’t. The relay provider started throttling. Not because the volume was higher overall, but because the sending pattern changed: bursts, spikes, and connection behaviors that looked like abuse automation. Some messages began to queue on the provider side, then deliver late. The app still returned “sent” because the relay accepted submission. Users experienced delays and duplicates when the app retried at the wrong layer.

The backfire was classic: optimizing for micro-latency without understanding the system’s macro-contract. SMTP submission is not a low-latency RPC. It’s a store-and-forward pipeline with policy and reputation controls. Over-tuning connection behavior changed the sender fingerprint and tripped guardrails.

The recovery involved dialing concurrency back to provider guidance, adding correct retry semantics (don’t retry if you already got a 250 + queue ID), and implementing idempotency at the application layer so the same event couldn’t fan out into multiple emails during transient throttling.
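
The idempotency piece can be sketched like this (names and the in-memory store are illustrative; production needs a durable store shared across app instances):

```python
import hashlib

# Sketch of application-layer send idempotency: derive a stable key per
# logical event so retries during relay throttling can't fan out into
# duplicate emails. The in-memory set stands in for a durable store.
_sent = set()

def idempotency_key(event_id: str, recipient: str, template: str) -> str:
    raw = f"{event_id}:{recipient}:{template}"
    return hashlib.sha256(raw.encode()).hexdigest()

def send_once(event_id: str, recipient: str, template: str, submit) -> bool:
    """Returns True if submitted now, False if this event was already sent."""
    key = idempotency_key(event_id, recipient, template)
    if key in _sent:
        return False             # duplicate attempt: drop, don't resubmit
    submit(recipient, template)  # hand off to MTA/relay exactly once
    _sent.add(key)
    return True
```

Combined with "a 250 + queue ID means stop retrying," this closes the duplicate-email loop at both layers.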

Mini-story 3: The boring but correct practice that saved the day

A finance-oriented org sent invoices and account notices. They used an authenticated relay for outbound mail, but they also ran a small internal Postfix instance as a submission gateway for applications. The gateway did one unsexy thing extremely well: it stamped each message with a correlation header and logged the mapping between application request ID, SMTP queue ID, and relay provider message ID.

One Tuesday, a major mailbox provider started deferring mail from their relay provider’s shared pool due to a localized reputation issue affecting many customers. It wasn’t the org’s fault, and they couldn’t fix it. But they could explain it. Within minutes, on-call pulled a list of affected message IDs, extracted SMTP response codes, and produced a customer-facing status update that was specific: “Messages submitted between time A and time B are delayed due to 4xx deferrals; no data loss; retries in progress.”

Support stopped guessing. Product stopped blaming “email being flaky.” Leadership didn’t demand random config changes. The provider resolved the reputation issue, and delivery resumed. The only reason it was calm is because the org treated email like a production pipeline: traceability, logs, and clear retry boundaries.

Short joke #2: Nothing builds confidence like a well-labeled queue—without it, you’re just yelling into the SMTP void.

Practical tasks with commands: check, prove, decide

These are tasks I actually run during incidents. Each includes: a command, a realistic output snippet, what it means, and the decision you make next.

Task 1: Confirm which path you’re using (direct vs relay)

cr0x@server:~$ postconf -n | egrep '^(relayhost|smtp_sasl_auth_enable|myhostname|myorigin|mydestination)'
relayhost = [smtp.relay.example]:587
smtp_sasl_auth_enable = yes
myhostname = app-01.prod.example.net
myorigin = /etc/mailname
mydestination = localhost

Meaning: relayhost is set to a submission endpoint on 587 and SASL auth is enabled. You are relaying, not delivering direct.

Decision: Troubleshoot provider-side throttling/auth/TLS and identity alignment, not outbound port 25 to arbitrary MX.

Task 2: If direct send, confirm outbound port 25 is not blocked

cr0x@server:~$ nc -vz gmail-smtp-in.l.google.com 25
Connection to gmail-smtp-in.l.google.com 25 port [tcp/smtp] succeeded!

Meaning: Network allows outbound TCP/25 to at least one major MX.

Decision: If delivery still fails, pivot to reputation/TLS/rDNS/content rather than basic connectivity.

Task 3: Test submission to the relay with STARTTLS and AUTH capability

cr0x@server:~$ openssl s_client -starttls smtp -connect smtp.relay.example:587 -servername smtp.relay.example -crlf
...
250-smtp.relay.example
250-PIPELINING
250-SIZE 52428800
250-STARTTLS
250-AUTH PLAIN LOGIN
250 HELP

Meaning: The relay supports STARTTLS and AUTH methods. Good baseline.

Decision: If Postfix can’t authenticate, it’s likely credentials/SASL config, not the relay being down.

Task 4: Check DNS for SPF on the From domain

cr0x@server:~$ dig +short TXT example.com
"v=spf1 include:_spf.relay.example -all"
"google-site-verification=..."

Meaning: SPF authorizes the relay provider via include and uses -all (hard fail for unauthorized IPs).

Decision: If you’re direct-sending from your own IPs, this SPF is wrong; add your IPs or switch to relay intentionally.

Task 5: Validate DKIM public key record exists

cr0x@server:~$ dig +short TXT s1._domainkey.example.com
"v=DKIM1; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAtu6..."

Meaning: DKIM selector s1 is published.

Decision: If receivers show DKIM=fail, check whether the signing system is using the same selector and domain, and whether any middleware is modifying the message.

Task 6: Inspect DMARC policy and alignment mode

cr0x@server:~$ dig +short TXT _dmarc.example.com
"v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com; adkim=s; aspf=r; pct=100"

Meaning: DMARC is enforcing quarantine. DKIM alignment is strict (adkim=s), SPF alignment relaxed (aspf=r).

Decision: Strict DKIM alignment means the DKIM d= domain must match the From domain exactly. If your relay signs with a different d=, you will fail DMARC.

Task 7: Verify reverse DNS (PTR) of your sending IP (direct send path)

cr0x@server:~$ dig +short -x 203.0.113.10
mailout-01.example.net.

Meaning: PTR exists and points to a hostname.

Decision: Confirm forward DNS of that hostname resolves back to the same IP; if not, fix the mismatch or expect deliverability penalties.

Task 8: Verify forward DNS matches PTR (basic sanity)

cr0x@server:~$ dig +short mailout-01.example.net A
203.0.113.10

Meaning: Forward-confirmed reverse DNS (FCrDNS) passes.

Decision: If it doesn’t match, fix DNS before you debate anything else. This is table stakes for direct send.

Task 9: Check if your IP is listed on common DNS-based blocklists (signal, not gospel)

cr0x@server:~$ ip=203.0.113.10; rev=$(echo $ip | awk -F. '{print $4"."$3"."$2"."$1}'); for z in zen.spamhaus.org bl.spamcop.net b.barracudacentral.org; do echo "Query $z:"; dig +short ${rev}.${z} A; done
Query zen.spamhaus.org:
Query bl.spamcop.net:
Query b.barracudacentral.org:

Meaning: No A record returned, so not listed on these three at query time.

Decision: If listed, stop sending direct from that IP until you understand why; switch to authenticated relay or a clean IP while remediating.

Task 10: Inspect Postfix queue depth (are you stuck locally?)

cr0x@server:~$ mailq | head -n 20
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
A1B2C3D4E5     1823 Mon Jan  4 09:12:10  alerts@example.com
                                         user@gmail.com
                                         (connect to gmail-smtp-in.l.google.com[142.250.102.27]:25: Connection timed out)

Meaning: Mail is queued locally due to connection timeouts to recipient MX.

Decision: Check outbound firewall, routing, and whether the recipient is deferring via tarpitting. If this is widespread, suspect outbound 25 block or network ACL changes.

Task 11: Trace a specific message through logs (find the SMTP response)

cr0x@server:~$ grep -R "A1B2C3D4E5" /var/log/mail.log | tail -n 5
Jan  4 09:12:10 app-01 postfix/qmgr[1187]: A1B2C3D4E5: from=<alerts@example.com>, size=1823, nrcpt=1 (queue active)
Jan  4 09:12:12 app-01 postfix/smtp[2214]: A1B2C3D4E5: to=<user@gmail.com>, relay=gmail-smtp-in.l.google.com[142.250.102.27]:25, delay=2.1, delays=0.1/0/2/0, dsn=4.4.1, status=deferred (connect to gmail-smtp-in.l.google.com[142.250.102.27]:25: Connection timed out)

Meaning: DSN 4.4.1 indicates a temporary network failure; Postfix will retry.

Decision: If the DSN is 5.x.x (permanent), stop retrying and fix identity/content/reputation. If it’s 4.x.x, fix connectivity or wait out rate limits depending on message text.
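
That 4.x.x-vs-5.x.x branching can be encoded directly (the regex is an assumption matching typical Postfix log output; adjust for your logging pipeline):

```python
import re

# Sketch: decide the next action from the DSN class in a Postfix log line.
# DSN classes: 2 = success, 4 = transient failure, 5 = permanent failure.
def dsn_action(log_line: str) -> str:
    m = re.search(r"dsn=(\d)\.\d+\.\d+", log_line)
    if not m:
        return "no-dsn: inspect line manually"
    cls = m.group(1)
    if cls == "2":
        return "delivered/queued downstream: stop retrying"
    if cls == "4":
        return "temporary: let the MTA retry; check connectivity/rate limits"
    return "permanent: stop retrying; fix identity/content/reputation"

sample = "dsn=4.4.1, status=deferred (connect timed out)"
action = dsn_action(sample)
```

Alerting on the 5.x.x branch while merely trending the 4.x.x branch keeps on-call sane.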

Task 12: Verify Postfix is using the relay and not leaking direct sends

cr0x@server:~$ grep -E "relay=" /var/log/mail.log | tail -n 3
Jan  4 10:02:41 app-01 postfix/smtp[4021]: 9F8E7D6C5B: to=<user@outlook.com>, relay=smtp.relay.example[198.51.100.20]:587, delay=1.2, dsn=2.0.0, status=sent (250 2.0.0 queued as 7c9f1a2b-1234-5678-9012-acdeffedcba0)
Jan  4 10:02:42 app-01 postfix/smtp[4021]: 1A2B3C4D5E: to=<user@gmail.com>, relay=smtp.relay.example[198.51.100.20]:587, delay=0.9, dsn=2.0.0, status=sent (250 2.0.0 queued as 0d1e2f3a-aaaa-bbbb-cccc-112233445566)

Meaning: Messages are leaving via relayhost on 587 and being queued by the provider (you have provider queue IDs).

Decision: If some lines show relay=recipient-mx:25, you have split paths. Fix transport maps or relay restrictions before you debug deliverability—you can’t debug two systems at once.

Task 13: Check TLS policy and whether you’re failing TLS negotiation

cr0x@server:~$ postconf -n | egrep '^(smtp_tls_security_level|smtp_tls_CAfile|smtp_tls_loglevel|smtp_tls_policy_maps)'
smtp_tls_security_level = may
smtp_tls_loglevel = 1

Meaning: Opportunistic TLS is enabled (may). Log level is low but present.

Decision: If you see TLS failures in logs, temporarily raise smtp_tls_loglevel to debug. Only set smtp_tls_security_level = encrypt if you know your destinations support TLS, or scope the requirement per destination with smtp_tls_policy_maps.

Task 14: Prove what the message actually looked like on the wire (headers)

cr0x@server:~$ postcat -qh A1B2C3D4E5
From: "Example Alerts" <alerts@example.com>
To: user@gmail.com
Subject: Your login code
Date: Mon, 04 Jan 2026 09:12:09 +0000
Message-ID: <20260104091209.12345@app-01.prod.example.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

Meaning: This shows the stored message headers (not the final received headers). Useful for verifying From domain, Message-ID domain, and content type.

Decision: If DMARC alignment is strict and you see From: uses one domain but DKIM signs another (or Message-ID domain confuses receivers), standardize header domains.

Task 15: Confirm your app is not bypassing the MTA and using random SMTP libraries directly

cr0x@server:~$ ss -tpn | egrep ':25 |:587 ' | head
ESTAB 0 0 10.0.1.10:49822 198.51.100.20:587 users:(("master",pid=1123,fd=24))
ESTAB 0 0 10.0.1.10:50112 142.250.102.27:25 users:(("python3",pid=9981,fd=8))

Meaning: Postfix is using the relay on 587, but a python process is connecting directly to a recipient MX on 25.

Decision: Hunt that process. Mandate a single egress path. “Some mail direct, some via relay” is how you end up with half your domain reputation pinned to chaos.

Task 16: Check system time skew (DKIM and TLS can behave badly when clocks drift)

cr0x@server:~$ timedatectl
               Local time: Mon 2026-01-04 10:15:22 UTC
           Universal time: Mon 2026-01-04 10:15:22 UTC
                 RTC time: Mon 2026-01-04 10:15:21
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

Meaning: Clock is synchronized; good.

Decision: If not synchronized, fix NTP. Some receivers are forgiving; others treat time-skewed signatures and TLS sessions as suspicious.

Fast diagnosis playbook

This is the “stop thrashing” sequence. Run it in order. You’re trying to locate the bottleneck: app, local MTA, relay provider, recipient, or DNS/auth.

First: Determine where “sent” ends

  1. Is the app handing mail to something? Look for app logs with message IDs or SMTP response codes. If the app never handed off, you have an application bug, not deliverability.
  2. Is Postfix queue growing? mailq. If local queue grows, your system can’t hand off (network, auth, DNS, relay outage).
  3. Do logs show 250 queued at a relay? If yes, your problem moved downstream. Stop blaming Postfix.

Second: Check identity and alignment, not just “passes”

  1. SPF record exists and includes the sending system (your IP or relay).
  2. DKIM is present and verifiable; selector exists in DNS.
  3. DMARC policy isn’t rejecting you; alignment matches From domain.

Third: Check reputation and basic trust signals

  1. If direct: PTR exists and matches A record (FCrDNS).
  2. If direct: check blocklist signals and whether the IP is new/rotating.
  3. If relay: check provider status, rate limits, and whether you exceeded quotas.

Fourth: Verify recipient-side feedback (bounces, deferrals, spam placement)

  1. Parse SMTP response codes from logs or provider events.
  2. Look for patterns by domain (gmail vs outlook vs yahoo).
  3. Identify if it’s rejection (5xx), deferral (4xx), or spam placement (accepted but not in inbox).

Common mistakes: symptom → root cause → fix

1) “Works in staging, fails in production”

Symptom: Test emails arrive; production emails to large providers bounce or disappear.

Root cause: Production sends from different IPs (autoscaling, NAT, multiple egress paths) with cold/bad reputation and missing PTR.

Fix: Use authenticated relay or pin dedicated outbound IPs with controlled rDNS; enforce a single mail egress path; warm up gradually.

2) “SMTP says 250 OK but users don’t see mail”

Symptom: Logs show 250 queued at relay/provider; recipients claim nothing arrived.

Root cause: Accepted into provider queue; later dropped/throttled; or delivered to spam/quarantine due to content/DMARC misalignment.

Fix: Correlate provider queue IDs, inspect provider event logs, and validate SPF/DKIM/DMARC alignment. Implement bounce processing and complaint monitoring.

3) “Sudden spike in bounces after a deploy”

Symptom: Rejections start right after deploying a template change.

Root cause: Content filters triggered (URL patterns, broken MIME, missing List-Unsubscribe for bulk, weird encoding). Or DKIM breaks because middleware modifies body.

Fix: Roll back template, compare raw MIME, ensure DKIM canonicalization matches transformations, and avoid risky content patterns.

4) “Some recipients get mail, others don’t”

Symptom: Gmail okay, Outlook blocked (or vice versa).

Root cause: Provider-specific reputation scoring; TLS/cipher mismatch; inconsistent SPF includes; IPv6 path issues.

Fix: Segment by recipient domain, check SMTP transcripts and response codes, disable broken IPv6 sending if misconfigured, and tune policies per domain if direct-sending.

5) “DMARC reports show failures but we swear we configured SPF and DKIM”

Symptom: Aggregate reports show DMARC fail at a high rate.

Root cause: Alignment mismatch: From domain differs from DKIM d= domain or SPF MailFrom domain. Or third parties send as you without authorization.

Fix: Align From with DKIM signing domain; use a dedicated subdomain for third-party senders; update SPF includes; enforce DMARC policy as you gain confidence.

6) “We added more servers to send faster and deliverability got worse”

Symptom: Higher throughput, worse inbox placement and more throttling.

Root cause: Volume bursts and concurrency changes altered sender fingerprint; per-IP reputation diluted; complaint rate increased due to speed and lack of throttling.

Fix: Centralize sending through an authenticated relay or a controlled direct-send tier; apply rate limits; warm IPs; prioritize messages.

7) “Bounces are high and support says addresses are valid”

Symptom: 550 user unknown spikes.

Root cause: List hygiene problem, old data, typos, or signup abuse generating invalid addresses.

Fix: Implement bounce handling and suppression; validate addresses at signup carefully (without being a jerk to legitimate users); throttle suspicious signup sources.

Checklists / step-by-step plan

Checklist A: If you choose authenticated relay (recommended for most teams)

  1. Pick the sending identity: Decide the From domain and whether you’ll use a subdomain (e.g., mail.example.com) for operational separation.
  2. Publish SPF: Authorize the relay provider (include or explicit IP ranges). Keep it under 10 DNS lookups.
  3. Enable DKIM with your domain: Prefer signing as your From domain or aligned subdomain. Rotate keys periodically.
  4. Publish DMARC: Start with p=none to observe, then move to quarantine/reject when you’ve eliminated unauthorized senders.
  5. Configure Postfix submission to relay: 587 with STARTTLS and SASL auth. Use stable credentials and secret management.
  6. Set clear retry boundaries: If relay returns 250 queued, do not retry at the application layer.
  7. Implement bounce handling: Parse DSNs and provider events; suppress persistent bounces automatically.
  8. Monitor deliverability signals: Deferred rates, spam complaints, domain-specific rejections.

Checklist B: If you choose direct send (you are now operating a mail system)

  1. Stable egress IPs: No rotating IPs. No autoscaling surprises. Dedicated IPs if possible.
  2. rDNS/PTR control: PTR exists, forward matches, hostname is sensible.
  3. SPF authorizes your IPs: Keep record tight; don’t “~all” your way into ambiguity forever.
  4. DKIM signing on all outbound: Ensure intermediate systems don’t break signatures.
  5. DMARC policy: Align From with SPF/DKIM and move toward enforcement.
  6. Queue tuning and rate limits: Per-domain concurrency limits, sensible backoff, avoid spikes.
  7. Abuse handling: Complaints, unsubscribes, suppression lists, and an actual human process.
  8. Monitoring: Queue depth, deferrals by domain, bounce codes, blocklist checks, TLS errors.
  9. Warming plan: Gradually increase volume; start with engaged recipients and critical transactional traffic.

Step-by-step: migrating from direct send to authenticated relay without chaos

  1. Inventory senders: Find every system that sends email (apps, cron jobs, printers, legacy servers). Yes, printers.
  2. Centralize submission: Require all senders to submit to a local MTA or a relay gateway.
  3. Configure relayhost: Update Postfix relayhost and SASL settings; test with one low-risk mail stream.
  4. Update SPF/DKIM: Authorize relay and ensure DKIM aligns with visible From.
  5. Enable DMARC reporting: Observe who is still sending direct or spoofing you.
  6. Cut over per message class: Password resets first, then notifications, then bulk. Do not flip everything at once unless you enjoy adrenaline.
  7. Remove direct egress: Block outbound 25 from application networks to prevent bypass.
  8. Track message IDs end-to-end: Correlate app request ID → local queue ID → provider queue ID.
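
Step 8 can be sketched as a log-line parser (the regex assumes the Postfix smtp(8) log format shown in Task 12; adjust for your logging pipeline):

```python
import re

# Sketch: extract the local queue ID and the provider's queue ID from a
# Postfix "status=sent" line, so app request IDs can be joined to both.
SENT = re.compile(
    r"postfix/smtp\[\d+\]: (?P<local_qid>[0-9A-F]+): to=<(?P<rcpt>[^>]+)>.*"
    r"status=sent \(250 [\d.]+ queued as (?P<provider_id>[\w-]+)\)"
)

line = ("Jan  4 10:02:41 app-01 postfix/smtp[4021]: 9F8E7D6C5B: "
        "to=<user@outlook.com>, relay=smtp.relay.example[198.51.100.20]:587, "
        "delay=1.2, dsn=2.0.0, status=sent "
        "(250 2.0.0 queued as 7c9f1a2b-1234-5678-9012-acdeffedcba0)")

m = SENT.search(line)
mapping = m.groupdict() if m else {}
```

Ship those mappings to wherever your app logs live, keyed by request ID, and "where is my email?" becomes a join.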

FAQ

1) If I use an authenticated relay, do I still need SPF/DKIM/DMARC?

Yes. The relay solves transport and reputation infrastructure, not identity. Receivers still evaluate your domain’s authentication and alignment. Without it, you’ll leak into spam or get rejected under stricter DMARC ecosystems.

2) Is direct send always worse for deliverability?

No. Direct send can be excellent if you have stable IPs, correct DNS, disciplined sending, and real monitoring. It’s just that many teams choose it accidentally and operate it casually, which is how it becomes worse.

3) Why do my messages get accepted (250) but never show up?

Because SMTP acceptance means “queued,” not “in inbox.” After acceptance, content filters, reputation, and policy checks still decide placement. You need downstream event visibility (provider events or recipient-side bounces) to know what happened.

4) Should I send transactional and marketing from the same domain/IP?

Prefer separation. Use subdomains or separate IP pools. Marketing traffic has different complaint dynamics and can drag down transactional mail reputation, which is the one stream you can’t afford to lose.

5) Does IPv6 help or hurt?

It can help if configured correctly (stable IPv6, proper rDNS, SPF includes, and consistent sending). It can hurt if you accidentally start sending from an unroutable or un-reputable IPv6 address that nobody warmed up. If you’re unsure, disable IPv6 outbound for SMTP until you can validate it.

6) Can I skip rDNS if I use a relay?

Usually yes, because receivers see the relay’s IP and rDNS, not yours. But your local environment still needs sane DNS for its own hostname and TLS operations; don’t run with broken DNS and expect stability.

7) What DMARC policy should I start with?

Start with p=none to collect reports and discover who is sending mail as your domain. Move to quarantine and then reject once you’ve aligned legitimate senders and eliminated unauthorized ones.
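
A typical staged rollout, as illustrative records (example.com and the rua address are placeholders; you publish one record at a time, replacing the previous stage):

```text
; Stage 1: observe only, collect aggregate reports
_dmarc.example.com. IN TXT "v=DMARC1; p=none; rua=mailto:dmarc@example.com"

; Stage 2: quarantine a fraction while you watch reports
_dmarc.example.com. IN TXT "v=DMARC1; p=quarantine; pct=25; rua=mailto:dmarc@example.com"

; Stage 3: full enforcement
_dmarc.example.com. IN TXT "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"
```

Hold each stage until the reports show no legitimate senders failing alignment.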

8) What’s the simplest “don’t get blocked” setup for a small production app?

Authenticated relay on 587 with TLS and AUTH, SPF authorizing the relay, DKIM signing aligned with your From domain, and a DMARC record. Add bounce suppression. Keep your From domain stable and avoid changing it every week.

9) How do I know whether I’m being rate-limited vs blocked?

Look at SMTP response codes. Deferrals are usually 4xx with wording about throttling or temporary rate limits. Blocks are often 5xx with policy or reputation hints. Either way, log and trend by recipient domain.

10) Is running Postfix locally still useful if I use a relay?

Yes. A local MTA gives you buffering, retries, and a clean submission interface for apps. It also gives you logs you control. Just keep it simple: relay everything; don’t let apps “go direct” on the side.

Conclusion: what to do next, in production terms

If you want the approach that doesn’t get you blocked, pick authenticated relay unless you have a specific, resourced reason to run direct send. That’s not cowardice; it’s cost control. Deliverability work is real work.

Next steps that pay off immediately:

  1. Pick one egress path for each class of email (transactional, bulk, internal) and enforce it.
  2. Make identity coherent: SPF authorizes the sender, DKIM signs as the From domain (or aligned subdomain), DMARC policy matches your risk tolerance.
  3. Add traceability: correlate message IDs across app → MTA → relay provider so “where is my email?” becomes a query, not a séance.
  4. Monitor the right metrics: deferrals, rejections by domain, queue depth, and bounce codes. Not “emails sent.”
  5. Stop optimizing the wrong layer: don’t fight SMTP with application retries. Respect the queue.

You can get fancy later. First, get delivered.
