You have one job: explain why an email didn’t arrive. Not “it probably got spam-filtered.” Not “Microsoft/Gmail is flaky.” You need to point at the hop where the message stopped behaving like a message and started behaving like a rumor.
The fastest way there is the Received header chain. It’s the closest thing email has to a flight recorder—messy, inconsistent, and occasionally tampered with. But if you read it like an SRE reads a distributed trace, you can usually nail the failure domain in minutes.
The mental model: what a Received header really is
Email is hop-by-hop. Every SMTP server that accepts a message and passes it onward is supposed to stamp a new Received header at the top. So you get a stack: newest hop at the top, oldest at the bottom.
Think of each hop as a small contract: “I, server X, received this message from Y at time T using protocol P and I’m handing it to the next step.” The contract is written by X, not by Y. That’s the key. The hop that writes the line is the one you can interrogate with logs and metrics.
If you’re used to distributed systems tracing, this is the email version of a span list without IDs. No global trace ID. No canonical schema. Lots of creative punctuation. Yet, if you read it with the right skepticism, it’s enough.
Two rules that will save you from wandering into meetings empty-handed:
- Received lines are evidence only within trust boundaries. Anything below your first trusted ingress can be forged.
- Most “missing email” cases are actually “delayed email” cases. Your job is to decide where the delay was introduced and whether it’s expected (retry) or pathological (queue meltdown, policy block, misrouting).
Interesting facts and historical context (short, concrete)
- SMTP predates most modern authentication. The core protocol was standardized in the early 1980s; authentication bolt-ons arrived decades later.
- Received headers are older than spam. They were designed for tracing and debugging mail routing, not adversarial environments.
- Header format is “structured-ish,” not strict. Servers are encouraged to add details, so you’ll see vendor-specific clauses and creative whitespace.
- Time zones are a recurring source of nonsense. Email hops can show times “going backward” when a server clock is wrong or using odd offsets.
- Message-ID is not guaranteed unique. Most MTAs generate sane IDs, but broken clients and some appliances reuse or mangle them.
- Greylisting became popular as a low-effort anti-spam tactic. It intentionally delays first delivery attempts; Received headers often show retry gaps.
- SPF checks the envelope sender, not the visible From. People still confuse this, including people with “Director” in their title.
- DKIM signs headers and body, but not the entire “truth.” It doesn’t validate the Received chain; it validates content integrity from a signer.
- Large providers run multiple layers of mail handling. You may see internal hops that don’t map 1:1 to what you think is “the mail server.”
Anatomy of a Received line: fields that matter, fields that lie
A typical Received header can look like this (don’t get attached; your reality will be worse):
cr0x@server:~$ sed -n '1,8p' message.headers
Received: from mail-out.example.net (mail-out.example.net [203.0.113.10])
by mx.google.com with ESMTPS id x12-20020a17090a000000b003e000000000si1234567pja.12.2026.01.04.10.11.12
for <user@example.com>
(version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
Sun, 04 Jan 2026 10:11:12 -0800 (PST)
Received: from app01.corp.example (app01.corp.example [10.20.30.40])
by mail-out.example.net (Postfix) with ESMTPA id 4A1B2C3D4E
for <user@example.com>; Sun, 04 Jan 2026 18:11:10 +0000 (UTC)
What matters:
- “by”: the server that added this line. This is your log anchor. If you can’t access logs on “by,” your troubleshooting options shrink.
- “from”: who connected to “by.” Can be a hostname + IP. The hostname is often reverse DNS or what the peer claimed. The IP is harder to fake at the TCP layer (but NAT and proxies complicate it).
- Protocol clause (with ESMTP, with ESMTPS, with LMTP): tells you whether TLS was used and whether auth happened (ESMTPA commonly indicates authenticated submission).
- Timestamp: critical for delay analysis. Treat it like any distributed-system timestamp: trust it only if clocks are sane.
- Queue IDs: Postfix, Exchange, and others often stamp an internal ID. These are gold because they map directly to logs.
What often lies or misleads:
- Client-provided names. Some MTAs include what the client said in HELO/EHLO. That’s not identity; it’s a string.
- Untrusted lower hops. Anything added by a server outside your trust perimeter can be fabricated by a spammer or a buggy relay.
- Overconfidence in formatting. Some servers wrap lines oddly; some omit fields; some jam everything into one line like they’re paid by the character.
Joke #1: Email headers are like incident timelines: everyone adds a line, and somehow it still doesn’t answer “who touched prod last?”
Ordering rules: bottom-up, with sharp edges
The oldest hop is at the bottom. The newest hop is at the top. That means you read Received headers from bottom to top to follow the message’s journey forward in time.
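If you'd rather read the journey forward without eye gymnastics, reverse the chain. A minimal sketch, assuming GNU tac and headers that are already unfolded (Task 1 below shows an unfolding awk):
cr0x@server:~$ grep '^Received:' message.eml | tac
The first line printed is now the first hop the message took; scan downward for the spot where timestamps jump.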
Sharp edges:
- Clock skew can make time appear to travel backward. If one server is five minutes off, you’ll see negative transit times. Don’t panic; verify NTP.
- Local delivery may not add a Received line. Sometimes the final handoff is to a mailbox store or a filtering stage without a new header.
- Gateways can collapse hops. A mail security gateway might accept inbound mail and then re-inject it internally, adding lines that look like “from gateway by internal-mx,” but you won’t see the upstream complexity.
- Mailing lists and forwarders can create “new” messages with new Message-IDs and changed headers, while still appearing like the same user-visible content.
Trust boundaries: which Received lines you can believe
The only Received headers you should treat as reliable are the ones added by systems you trust: your MX, your inbound gateway, your internal relays, your submission service.
Everything before the first trusted hop is “user-submitted evidence.” Useful, but not admissible. This is not paranoia; it’s Tuesday.
Practical approach:
- Identify the first hop you control. That’s typically your MX, inbound gateway, or hosted provider’s ingress.
- Mark everything below it as untrusted chain.
- Correlate the trusted hop with logs using queue ID, Message-ID, and timestamp.
If you operate a system that receives mail from the Internet directly, enforce sane practices that make your own Received lines valuable: include connecting IP, include TLS info when present, include queue ID, and keep clocks accurate.
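A quick way to locate that first trusted hop is to search the chain for your own ingress names. A sketch, assuming your estate lives under corp.example (the domain comes from the samples in this article; adjust the pattern to yours):
cr0x@server:~$ grep -n '^Received:' message.eml | grep 'by [a-z0-9.-]*corp\.example' | tail -n 1
The last match, i.e., the lowest Received line that names your infrastructure in its “by” clause, is your ingress; everything below it is the untrusted chain.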
Delays: how to spot queueing, retries, greylisting, and backpressure
Delivery “breaks” are often just delivery “waiting.” Your job is to classify the waiting:
1) Queue delay at the sender’s side
Symptom: big time gap between the bottommost “client → sender MTA” Received and the next hop “sender MTA → recipient MX.”
Interpretation: the sender accepted the message but didn’t deliver promptly. Causes include outbound queue congestion, DNS issues, rate limits, policy blocks, or the sender being throttled by recipients.
2) Greylisting / temporary deferrals
Symptom: repeated attempts reflected in logs (not usually in headers), with minutes-long gap until acceptance. In headers you might see the final attempt only, with a timestamp long after the original send time the user claims.
Interpretation: recipient refused early tries with a 4xx. Sender retried. This is normal-ish in some ecosystems, but it’s a business decision disguised as SMTP. (A log-side sketch for spotting retries follows at the end of this list.)
3) Inbound queueing at your gateway or MX
Symptom: message is accepted by your perimeter but arrives late in the mailbox. Headers show fast transit to your gateway, then internal “by” hops with timestamps that drift forward slowly.
Interpretation: internal filtering, content scanning, mailbox store backpressure, or internal routing issues. This is where your dashboards matter.
4) Policy holds and quarantine
Symptom: headers show acceptance, but user never sees the message. Logs show it diverted to quarantine, moderation, or held for manual review.
Interpretation: not a “delivery break.” It’s a policy decision. Treat it as such: find the policy rule, justify it, adjust if needed.
5) Silent drop (rare, but real)
True silent drops are less common than people think because SMTP is transactional. But they happen inside ecosystems: filters can discard, mailboxes can reject, and some misconfigured systems accept then lose. You look for where the last trustworthy receipt occurs and then what the system did after that.
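Classifying the waiting usually means looking at logs, not headers. A minimal log-side sketch for a Postfix-based estate (log path and patterns are illustrative):
cr0x@mailgw01:~$ sudo grep -E 'status=deferred|greylist|450 4\.' /var/log/mail.log | tail -n 5
Repeated deferrals for the same queue ID followed by a later status=sent is retry behavior; a queue full of them is backpressure.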
Security signals around Received: SPF, DKIM, DMARC, ARC, and why they matter to delivery breaks
Received headers tell you path. Authentication results tell you whether the message’s identity story matches reality—at least enough for providers to accept it.
Authentication-Results is your friend
Many systems add an Authentication-Results header. It can include SPF, DKIM, DMARC evaluations. It’s written by a receiver, so within trust boundaries it’s valuable.
What you do with it:
- If SPF fails, check the envelope sender domain’s SPF record and sending IP alignment.
- If DKIM fails, check signing domain, selector, canonicalization issues, and whether an intermediary modified the body/headers (a selector lookup sketch follows this list).
- If DMARC fails, check alignment between From domain and SPF/DKIM domains; also check policy (reject/quarantine/none).
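For the DKIM leg, you can fetch the public key the signature points at. A sketch, assuming the message’s DKIM-Signature header shows d=example.net and a selector of sel1 (the selector name here is hypothetical; read the real one from the s= tag):
cr0x@server:~$ dig +short TXT sel1._domainkey.example.net
"v=DKIM1; k=rsa; p=MIIBIjANBg..."
An empty answer, or a key that changed recently, explains a dkim=fail without any deeper archaeology.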
ARC can explain “forwarding broke my DMARC”
Forwarders and mailing lists commonly break DKIM (by modifying the message) and SPF (because they re-send from their own IP). ARC (Authenticated Received Chain) is a way for intermediaries to preserve upstream auth results. When diagnosing delivery failures involving forwarding, ARC presence and validation can be the difference between “mysterious” and “obvious.”
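Checking for an ARC trail is cheap. A sketch using the standard ARC header names (ARC-Seal, ARC-Message-Signature, ARC-Authentication-Results):
cr0x@server:~$ grep -inE '^(ARC-|Authentication-Results:)' message.eml
If ARC-Authentication-Results shows spf=pass or dkim=pass as evaluated by the forwarder, you can often explain a downstream DMARC failure without a packet capture.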
Use auth signals to decide whether delivery broke because of transport or policy. If transport looks fine but DMARC fails with a strict policy, you’re not debugging networking. You’re debugging identity and alignment.
Fast diagnosis playbook
You want speed? Stop reading the whole header novel. Do this:
First: Identify the last trusted “by” hop
- Locate the topmost Received line written by the recipient system you trust (your MX / gateway / provider ingress).
- Extract: timestamp, queue ID (if present), connecting IP, and “for <recipient>”.
- Decision: if you don’t have a trusted hop, you don’t have evidence. Ask the sender for the full headers from their sent item, or request logs from their admin.
Second: Classify the failure domain
- Not accepted by your perimeter: no trusted Received line → sender side or Internet transport (DNS, routing, reputation blocks).
- Accepted by perimeter but not delivered: trusted Received exists → your internal pipeline (filtering, mailbox, routing, quarantine).
- Delivered but “missing”: headers show final delivery but user can’t find it → client-side rules, mailbox search, folders, or downstream archiving.
Third: Measure delay hop-by-hop
- Read Received headers bottom-to-top within the trusted segment.
- Compute time deltas between adjacent hops.
- Decision: the hop with the biggest delta is where you look for queueing, throttling, or holds.
Fourth: Correlate with logs using IDs
Queue IDs, Message-ID, and recipient address are your join keys. Pick two; use three if you can.
Fifth: Decide what you’re telling humans
Don’t say “email is delayed.” Say “accepted by our gateway at 10:11:12 PST, held in content scanning for 34 minutes due to attachment sandbox backlog, then delivered.” People hate uncertainty more than bad news.
Practical tasks (with commands, output meaning, and decisions)
Below are real tasks you can run on common Linux-based MTAs and toolboxes. Each task includes: command, sample output, what it means, and the decision you make.
Task 1: Extract and view just the Received chain (quick sanity)
cr0x@server:~$ awk '/^Received:/{if(r)print r;r=$0;next} /^[ \t]/&&r{sub(/^[ \t]+/," ");r=r $0;next} {if(r){print r;r=""}} END{if(r)print r}' message.eml
Received: from mail-out.example.net (mail-out.example.net [203.0.113.10]) by mx.google.com with ESMTPS id x12-... for <user@example.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 04 Jan 2026 10:11:12 -0800 (PST)
Received: from app01.corp.example (app01.corp.example [10.20.30.40]) by mail-out.example.net (Postfix) with ESMTPA id 4A1B2C3D4E for <user@example.com>; Sun, 04 Jan 2026 18:11:10 +0000 (UTC)
What it means: You have a short chain, newest hop first; only two hops are recorded in this sample. In real mail, you’ll see more.
Decision: If the chain is unexpectedly short, suspect the message was generated inside a provider, or headers were stripped by a gateway or client. Ask for “full original headers” from the mailbox UI, not a forwarded copy.
Task 2: Identify the first trusted hop (pattern match your MX/gateway)
cr0x@server:~$ grep -n '^Received:' message.eml | head
12:Received: from mail-out.example.net (mail-out.example.net [203.0.113.10]) by mx-inbound.corp.example with ESMTPS id 9F3A2B1C for <user@corp.example>; Sun, 04 Jan 2026 18:11:12 +0000 (UTC)
18:Received: from unknown (HELO sender.example) (198.51.100.23) by mail-out.example.net with SMTP; Sun, 04 Jan 2026 18:11:10 +0000 (UTC)
What it means: The “by mx-inbound.corp.example” line is your trusted ingress. Anything below it might be real, might be fiction.
Decision: Start correlation from that queue ID (9F3A2B1C) and timestamp. Ignore the upstream “unknown (HELO …)” for blame decisions unless it matches other evidence.
Task 3: Parse timestamps and compute hop delays (cheap “distributed trace”)
cr0x@server:~$ grep -n '^Received:' message.eml
12:Received: ...; Sun, 04 Jan 2026 18:11:12 +0000 (UTC)
18:Received: ...; Sun, 04 Jan 2026 18:11:10 +0000 (UTC)
What it means: Two seconds from sender relay to your MX. That’s normal.
Decision: If you see minutes or hours between adjacent trusted hops, that’s your bottleneck. Go to the MTA/gateway responsible for the later hop and check queues and deferrals.
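If you want the deltas computed for you, a minimal sketch assuming GNU date and unfolded Received lines (it takes the date after the last semicolon of each header and strips the trailing comment; rough, but fine for triage):
cr0x@server:~$ grep '^Received:' message.eml | sed 's/.*;//; s/(.*)//' | while read -r d; do date -d "$d" +%s; done | awk 'NR>1{print prev-$0" s"} {prev=$0}'
2 s
Headers are newest-first, so each line is the transit time into the newer hop; a negative number means clock skew, not time travel.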
Task 4: On Postfix, search logs by queue ID
cr0x@mailgw01:~$ sudo grep '9F3A2B1C' /var/log/mail.log | tail -n 5
Jan 4 18:11:12 mailgw01 postfix/smtpd[22110]: 9F3A2B1C: client=mail-out.example.net[203.0.113.10]
Jan 4 18:11:13 mailgw01 postfix/cleanup[22122]: 9F3A2B1C: message-id=<CAO9z9kGx12345@mail-out.example.net>
Jan 4 18:11:14 mailgw01 postfix/qmgr[1211]: 9F3A2B1C: from=<noreply@example.net>, size=48212, nrcpt=1 (queue active)
Jan 4 18:47:02 mailgw01 postfix/smtp[23110]: 9F3A2B1C: to=<user@corp.example>, relay=mbx01.corp.example[10.0.10.25]:25, delay=2149, delays=0.2/0.1/2130/18, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 7C8D9E0F)
Jan 4 18:47:02 mailgw01 postfix/qmgr[1211]: 9F3A2B1C: removed
What it means: The message was accepted at 18:11 but not handed to mbx01 until 18:47. Postfix’s delays=a/b/c/d fields break that down: a = time before the queue manager, b = time in the active queue, c = connection setup (DNS, connect, HELO, TLS), d = message transmission. The huge third number is the smoking gun: the delivery attempt stalled during connection setup to the next hop.
Decision: Stop blaming the sender. Investigate why the handoff stalled for ~35 minutes: a content filter proxying the downstream connection, rate limiting, a slow or overloaded mailbox host, or TLS trouble (compare Task 11).
Task 5: Check the Postfix queue for backlog
cr0x@mailgw01:~$ mailq | head -n 20
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
9F3A2B1C 48212 Sun Jan 4 18:11:12 noreply@example.net
user@corp.example
A1B2C3D4 1200 Sun Jan 4 18:12:01 alerts@vendor.example
oncall@corp.example
What it means: There’s an active queue. The presence of multiple messages around the same time hints at systemic delay, not a one-off.
Decision: If queue grows and age increases, treat as incident: check downstream availability, content filtering latency, DNS resolution, and rate limits.
Task 6: Postfix queue age distribution (is it “old mail”?)
cr0x@mailgw01:~$ sudo postqueue -p | awk 'BEGIN{count=0} /^[A-F0-9]/ {id=$1} /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {count++} END{print "queue_entries="count}'
queue_entries=42
What it means: Quick-and-dirty count of queued entries. Not perfect, but it’s a tripwire.
Decision: If entries spike compared to baseline, check why mail isn’t draining: downstream relay= timeouts, TLS handshake failures, or a filter service being slow.
Task 7: Verify DNS for recipient MX and sender reverse DNS (reputation basics)
cr0x@server:~$ dig +short MX corp.example
10 mx-inbound.corp.example.
cr0x@server:~$ dig +short A mx-inbound.corp.example
192.0.2.55
cr0x@server:~$ dig +short -x 203.0.113.10
mail-out.example.net.
What it means: MX is resolvable; sender has PTR. Lack of PTR or mismatched names can trigger stricter receivers and lead to temp fails or spam placement.
Decision: If MX resolution is flaky or PTR is missing, fix DNS. If you can’t fix the sender’s PTR, expect delivery variance and use authenticated submission pathways.
Task 8: Check SPF alignment for the envelope sender domain
cr0x@server:~$ dig +short TXT example.net | sed -n '1,3p'
"v=spf1 ip4:203.0.113.0/24 include:_spf.provider.example -all"
What it means: The domain claims which IPs may send. If your Received shows a different sending IP, SPF will fail at recipients doing enforcement.
Decision: If SPF doesn’t include the real outbound IPs, update SPF. If the sender uses a third-party, make sure they’re included and not exceeding DNS lookup limits.
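While you’re in that record, count the DNS-querying mechanisms; SPF caps them at 10 across all nested includes. A rough floor, assuming a single-string TXT record (nested includes need recursion, so the real count is higher):
cr0x@server:~$ dig +short TXT example.net | tr ' ' '\n' | grep -cE '^(include:|a$|a:|mx$|mx:|ptr|exists:|redirect=)'
1
If you’re near the limit, flatten the record before adding another vendor’s include.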
Task 9: Inspect Authentication-Results from the recipient side
cr0x@server:~$ grep -i '^Authentication-Results:' -n message.eml | head -n 2
25:Authentication-Results: mx-inbound.corp.example; spf=pass smtp.mailfrom=noreply@example.net; dkim=pass header.d=example.net; dmarc=pass header.from=example.net
What it means: Transport may still fail, but identity checks passed. That shifts focus away from DMARC policy rejection and toward routing, queueing, or content filtering.
Decision: If DMARC fails and policy is strict, don’t waste hours on network traces. Fix alignment or adjust policy for legitimate senders.
Task 10: On Exchange (or Exchange-like logs), grep message tracking (Linux collector example)
cr0x@loghost:~$ grep -R "CAO9z9kGx12345@mail-out.example.net" /var/log/exchange-message-tracking.log | tail -n 3
2026-01-04T18:11:12Z,RECEIVE,SMTP,MBX01,user@corp.example,CAO9z9kGx12345@mail-out.example.net,192.0.2.55
2026-01-04T18:45:50Z,AGENTINFO,TRANSPORT,MBX01,user@corp.example,CAO9z9kGx12345@mail-out.example.net,ContentFilter=SandboxWait
2026-01-04T18:47:02Z,DELIVER,STOREDRIVER,MBX01,user@corp.example,CAO9z9kGx12345@mail-out.example.net,Inbox
What it means: The message was received promptly, then sat waiting for sandboxing. This is an internal processing delay.
Decision: Scale or tune the sandbox/AV stage, or exempt low-risk flows if business tolerates it. Don’t ask the sender to “resend” as a strategy.
Task 11: Confirm TLS negotiation failures when a hop shows ESMTPS but delivery stalls
cr0x@mailgw01:~$ sudo grep -E "TLS|handshake|SSL" /var/log/mail.log | tail -n 5
Jan 4 18:20:11 mailgw01 postfix/smtp[24001]: warning: TLS library problem: error:0A000126:SSL routines::unexpected eof while reading
Jan 4 18:20:11 mailgw01 postfix/smtp[24001]: warning: TLS policy lookup failed
Jan 4 18:20:11 mailgw01 postfix/smtp[24001]: warning: TLS handshake failed for mbx01.corp.example[10.0.10.25]:25: unexpected EOF
Jan 4 18:20:12 mailgw01 postfix/smtp[24001]: 9F3A2B1C: to=<user@corp.example>, relay=mbx01.corp.example[10.0.10.25]:25, delay=540, delays=0.2/0.1/520/20, dsn=4.4.2, status=deferred (lost connection with mbx01.corp.example[10.0.10.25] while sending end of data -- message may be sent more than once)
Jan 4 18:20:12 mailgw01 postfix/qmgr[1211]: 9F3A2B1C: deferred: lost connection with mbx01.corp.example[10.0.10.25] while sending end of data
What it means: Your gateway can’t complete TLS or connection with downstream. That will cause deferrals and long queue times, which users will interpret as “email vanished.”
Decision: Fix TLS policy mismatch, certificate issues, or load balancer resets. If needed, temporarily relax TLS enforcement internally while you remediate (carefully, with compensating controls).
Task 12: Validate local system time and NTP (stop time-travel debugging)
cr0x@mailgw01:~$ timedatectl
Local time: Sun 2026-01-04 18:52:10 UTC
Universal time: Sun 2026-01-04 18:52:10 UTC
RTC time: Sun 2026-01-04 18:52:10
Time zone: UTC (UTC, +0000)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
What it means: Clock is sane. If it wasn’t, Received timestamps become unreliable, and you’ll chase the wrong hop.
Decision: If clock is not synchronized, fix NTP first. Then re-evaluate the header timeline.
Task 13: Confirm whether a message is quarantined (example: local mail gateway quarantine DB/log)
cr0x@mailgw01:~$ sudo grep -R "CAO9z9kGx12345@mail-out.example.net" /var/log/mailgw/quarantine.log | tail -n 2
2026-01-04T18:12:02Z action=quarantine reason="AttachmentSandboxPending" queue_id=9F3A2B1C msgid=CAO9z9kGx12345@mail-out.example.net
2026-01-04T18:46:55Z action=release reason="SandboxVerdictClean" queue_id=9F3A2B1C msgid=CAO9z9kGx12345@mail-out.example.net
What it means: Message didn’t “fail.” It was held. You now have a clear explanation and a tuning knob.
Decision: If quarantine is too slow for business needs, scale the sandbox or adjust policy for known-good senders and file types.
Task 14: Track a message using Message-ID across multiple hosts (grep + ssh)
cr0x@ops:~$ for h in mailgw01 mailgw02 mbx01; do echo "== $h =="; ssh $h "sudo grep -RE \"CAO9z9kGx12345@mail-out.example.net|9F3A2B1C|7C8D9E0F\" /var/log/mail.log | tail -n 2"; done
== mailgw01 ==
Jan 4 18:11:13 mailgw01 postfix/cleanup[22122]: 9F3A2B1C: message-id=<CAO9z9kGx12345@mail-out.example.net>
Jan 4 18:47:02 mailgw01 postfix/smtp[23110]: 9F3A2B1C: to=<user@corp.example>, relay=mbx01.corp.example[10.0.10.25]:25, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 7C8D9E0F)
== mailgw02 ==
== mbx01 ==
Jan 4 18:47:02 mbx01 postfix/smtpd[18001]: 7C8D9E0F: client=mailgw01.corp.example[10.0.10.5]
Jan 4 18:47:03 mbx01 postfix/local[18012]: 7C8D9E0F: to=<user>, orig_to=<user@corp.example>, relay=local, status=sent (delivered to mailbox)
What it means: End-to-end trace across your estate. Absence on mailgw02 suggests routing didn’t involve it. (In practice you run the loop twice: first with only the Message-ID to learn each host’s queue ID, then again with those queue IDs in the pattern, since delivery status lines don’t contain the Message-ID.)
Decision: If the Message-ID appears on ingress but not on mailbox, focus on the relay stage. If it appears on mailbox and user can’t find it, pivot to client rules, foldering, and search indexing.
Task 15: Inspect content filter latency (example: amavis/milter stats log)
cr0x@mailgw01:~$ sudo grep -E "milter|amavis|scan time" /var/log/mail.log | tail -n 5
Jan 4 18:12:05 mailgw01 amavis[3101]: (03101-02) Passed CLEAN, [203.0.113.10] [203.0.113.10] <noreply@example.net> -> <user@corp.example>, Queue-ID: 9F3A2B1C, Message-ID: <CAO9z9kGx12345@mail-out.example.net>, Hits: -, size: 48212, queued_as: 9F3A2B1C, 92 ms
Jan 4 18:45:49 mailgw01 milter-sandbox[4021]: queue_id=9F3A2B1C verdict=pending wait=2010s
Jan 4 18:46:55 mailgw01 milter-sandbox[4021]: queue_id=9F3A2B1C verdict=clean wait=2066s
What it means: AV scan is fast; sandbox is the bottleneck and it’s explicit.
Decision: Tune sandbox concurrency, adjust what gets sandboxed, or implement async delivery patterns if your gateway supports it (deliver then remediate, if policy allows).
Task 16: Verify that the user’s mailbox rules didn’t “eat” the mail (server-side rule audit example)
cr0x@mbx01:~$ sudo grep -R "user@corp.example" /var/log/mailbox-rules.log | tail -n 5
2026-01-04T18:47:05Z user=user@corp.example rule="Move vendor mail" match="From contains example.net" action="move" folder="Vendors"
2026-01-04T18:47:05Z user=user@corp.example msgid=CAO9z9kGx12345@mail-out.example.net result="moved" folder="Vendors"
What it means: Delivery succeeded; visibility failed. The message is in a different folder.
Decision: Stop treating it as an MTA incident. Help the user fix rules or adjust corporate defaults.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The ticket said: “Email from our payment processor is not arriving. Must be their outage.” It came in from finance, which means it had urgency, ambiguity, and a calendar invite attached.
The junior admin did what juniors do when the room is loud: they stared at the topmost Received line in the message the processor forwarded as proof. It showed the processor’s hostname, and then nothing that looked like our environment. They concluded: “We never got it. Processor problem.” They were half-right and entirely wrong in the only way that matters.
We asked for the headers from a message that did arrive earlier that day and compared the chains. The missing mail wasn’t missing; it was hitting our perimeter gateway and being quarantined silently due to a new attachment type. Our gateway had a policy hold that did not generate user-visible notifications for external senders, because someone once complained about “too many quarantine emails.” That someone was in a different org now, naturally.
The wrong assumption was that “if it’s not in the Inbox, we didn’t receive it.” The Received chain from our trusted gateway proved we did. The logs showed the queue ID, the quarantine action, and the release never happened because the sandbox service was down. That last part was the real outage—and it was ours.
We restored the sandbox, released the quarantined messages, and then made a decision that stuck: quarantine without notification is operational debt. Sometimes it’s necessary, but then you owe yourself dashboards and alerting, because users will bring pitchforks and you will deserve them.
Mini-story 2: The optimization that backfired
A large-ish company wanted to “optimize email security latency.” The plan sounded reasonable: move from synchronous attachment sandboxing (hold mail until verdict) to a faster pipeline by increasing parallelism and tightening TLS policies. Faster and more secure. What could go wrong.
The change landed on a Friday, because of course it did. Parallelism increased, which increased CPU load on the gateway. TLS policies tightened, which increased handshake failures with an internal mailbox relay that had an old cipher suite configuration. The gateway began deferring delivery, but it still accepted inbound mail, piling it into the queue like plates at a buffet.
The Received headers were telling: inbound acceptance timestamps were normal, but internal handoff to the mailbox servers lagged by 20–60 minutes. Users blamed external senders, senders blamed our reputation, and leadership blamed “email” as a concept. Meanwhile the queue grew, and the gateway’s disk filled, because queues are storage whether you admit it or not.
We rolled back TLS strictness for internal relays, reduced sandbox parallelism to match CPU reality, and added a hard queue size monitoring threshold tied to paging. The painful lesson: performance tuning in mail systems is like tuning a database. You can move the bottleneck, but you can also manufacture a brand-new failure mode with better marketing.
Mini-story 3: The boring but correct practice that saved the day
A different place, different vibe: they had a “mail trace” runbook that nobody bragged about. It was a page of steps: extract Received chain, locate first trusted hop, correlate queue ID, check quarantine, confirm delivery foldering. It was not exciting. It was correct.
One afternoon, a VIP complained that a contract amendment “never arrived” and threatened to escalate. The on-call followed the runbook. Within five minutes they had the gateway queue ID, the exact timestamp of acceptance, and the internal relay ID showing final delivery to the mailbox store.
Then they checked the mailbox rule logs: the message was moved to a “Legal” folder and marked as read by a mobile client. This was not an MTA issue. It was a user workflow issue. The on-call provided a clear explanation: “Delivered at 14:03 UTC, moved by rule X to folder Y.” The escalation evaporated.
The practice that saved the day wasn’t some fancy tool. It was consistent logging, clocks that didn’t lie, and a habit of correlating IDs instead of guessing. Reliability is often just the absence of drama achieved through good hygiene.
Common mistakes: symptoms → root cause → fix
- Symptom: “Received headers show it was delivered, but user says it never arrived.”
Root cause: User-visible placement issue (rules, focused inbox, spam folder, server-side move, archive, mobile client).
Fix: Correlate final trusted hop + mailbox delivery logs; check rule actions; confirm folder path and client behavior; disable problematic rules.
- Symptom: Timestamps go backward between hops.
Root cause: Clock skew or timezone misconfiguration on one server; occasionally, header folding misread by tools.
Fix: Verify NTP (timedatectl); correct timezone; re-check with raw headers.
- Symptom: No trusted Received line from your perimeter.
Root cause: You never accepted the message; sender never reached you; DNS MX wrong; upstream policy block; sender sent to wrong domain.
Fix: Verify MX records; confirm recipient address; ask sender for SMTP logs/bounce; check your perimeter logs for connection attempts.
- Symptom: Big delay between acceptance at your MX and delivery to mailbox.
Root cause: Inbound pipeline backlog: content filter, sandbox, AV, DLP, mailbox store slowness, internal relay TLS failures.
Fix: Identify which internal hop introduced delay; check queue depth; scale bottleneck; fix TLS/cipher mismatch; add alerting on queue age.
- Symptom: Sender claims “we got a 250 OK” but you can’t find the message.
Root cause: 250 was from an intermediate relay (not you), or you accepted then quarantined/rewrote, or logs are missing due to rotation.
Fix: Match the server that issued 250 with the “by” hop in Received; request their logs; ensure you retain mail logs long enough for business timelines.
- Symptom: Certain external domains consistently fail or delay, others fine.
Root cause: Reputation/throttling issues, TLS policy mismatch, DMARC enforcement, or content filter false positives specific to that sender’s patterns.
Fix: Inspect Authentication-Results; check TLS logs; coordinate allowlisting carefully (prefer DKIM-based); adjust content rules with evidence.
- Symptom: Forwarded mail from a personal account doesn’t arrive, but direct mail does.
Root cause: SPF fails and DKIM breaks in forwarding; DMARC policy reject/quarantine triggers at recipient.
Fix: Encourage direct sending; implement ARC-aware handling; for internal forwarders, use SRS and preserve DKIM where possible.
- Symptom: Multiple duplicate copies arrive hours later.
Root cause: Sender retries after uncertain delivery (connection lost after DATA), or a gateway re-injected on timeout.
Fix: Check logs for “message may be sent more than once”; fix downstream connection stability; ensure idempotent filtering where feasible.
Checklists / step-by-step plan
Step-by-step: trace a single missing message
- Get the right artifact: the full original message headers from the recipient mailbox UI (not a forwarded message). If possible, get the raw source.
- Extract Received lines and read bottom-to-top.
- Identify the first trusted “by” hop (your MX/gateway/provider ingress). Everything below is untrusted context.
- Record the join keys: queue ID(s), Message-ID, recipient address, timestamps.
- Check perimeter logs for acceptance, rejects, or deferrals around the timestamp.
- Check queue status (depth + age). Determine if this is systemic.
- Check content pipeline (AV/sandbox/DLP) for holds and latency.
- Check internal relay handoff for TLS failures, timeouts, or backpressure.
- Check mailbox delivery logs and final placement (folders/rules/quarantine).
- Write the human explanation: “Accepted at X, delayed at Y because Z, delivered at T.” Include times and systems.
Step-by-step: decide if it’s sender-side or receiver-side
- If you have no trusted Received line from your infrastructure: treat as sender-side or Internet routing. Ask for bounce codes or their MTA logs.
- If you have a trusted acceptance but no mailbox delivery: treat as receiver-side pipeline. Incident-handling mode.
- If you have mailbox delivery but user can’t find it: treat as visibility/rules problem. Don’t page the MTA team for a UI issue.
Operational checklist: make Received headers useful in your environment
- Keep NTP healthy everywhere that touches mail.
- Ensure your MTAs log queue IDs and Message-IDs and retain logs long enough for business investigations.
- Expose queue depth and oldest-message age as metrics with paging thresholds (a sketch follows this list).
- Instrument content filters with latency and verdict metrics.
- Document trust boundaries: which hops are yours, which are vendors, which are the public Internet.
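For the queue metrics, newer Postfix can emit JSON, which makes the two numbers cheap to extract. A sketch, assuming Postfix 3.1+ (postqueue -j) and jq:
cr0x@mailgw01:~$ postqueue -j | jq -s '{queue_depth: length, oldest_age_seconds: (if length > 0 then ((now - (map(.arrival_time) | min)) | floor) else 0 end)}'
{
  "queue_depth": 42,
  "oldest_age_seconds": 2150
}
Feed both into whatever scrapes your hosts, and page on age as well as depth; a shallow queue of very old mail is still an incident.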
Joke #2: The quickest way to find an email bottleneck is to schedule a meeting about it—mail will arrive during the meeting, just to spite you.
FAQ
1) Why do Received headers appear “backwards”?
Because each server prepends its Received line to the top. The message’s journey forward in time is read from the bottom up.
2) Can Received headers be forged?
Anything outside your trust boundary can be forged because a sender can add arbitrary headers. The Received line added by your first trusted MX/gateway is the anchor you rely on.
3) What’s the difference between “from” and “by” in a Received header?
“by” is the server that wrote the line; “from” is the peer that connected to it. When troubleshooting, “by” is where you look for logs.
4) Which identifier is best for log correlation: Message-ID or queue ID?
Queue ID is best within a single MTA because it’s local and guaranteed to map to logs. Message-ID is best across systems, but can be missing or duplicated. Use both when possible.
5) I see ESMTPA in a Received line. What does that tell me?
Often it indicates authenticated submission (a client logged in). That usually means the mail entered the sender’s system through a submission service, not an open relay. It also narrows the search to that submission host’s logs.
6) The sender claims they got “250 OK.” Doesn’t that prove delivery?
It proves acceptance by some SMTP server, not necessarily final delivery to the recipient mailbox. You need to match which server issued 250 to a Received “by” hop and then trace forward.
7) How do I tell if the delay is greylisting?
Headers alone often won’t show retries; logs will. Look for 4xx deferrals in recipient logs or the sender’s logs, followed by a later successful delivery. The user’s “sent time” may be much earlier than your acceptance time.
8) Why do emails sometimes show many internal hops at big providers?
Large providers pipeline mail through multiple services: edge MTAs, spam filtering, policy engines, mailbox delivery. Each layer may add a Received line, and each is a potential delay point.
9) What if there are no Received headers at all?
That’s unusual for Internet mail. You might be looking at a message fragment, a copy/paste, or a forwarded message where the client replaced headers. Get the raw source again, from the mailbox’s “view original” feature.
10) How do SPF/DKIM/DMARC relate to “where delivery breaks”?
If DMARC fails with a strict policy, the break is policy enforcement at the receiving side, not transport. Received headers will show the path, but auth results explain why a hop refused, quarantined, or spam-foldered it.
Conclusion: next steps you can do this week
Email delivery troubleshooting gets dramatically less mystical when you treat Received headers as a hop trace and logs as your ground truth. Anchor on the first trusted hop, measure delays between hops, and correlate using queue ID and Message-ID. Then speak in timestamps and systems, not vibes.
Practical next steps:
- Write (or steal) a one-page runbook matching the Fast diagnosis playbook and the step-by-step plan above.
- Add two metrics to your monitoring: queue depth and oldest message age on every mail-handling hop you own.
- Audit your quarantine/hold behavior: if you hold mail, you need visibility and notifications, or you’re farming future incidents.
- Verify time sync across MTAs, gateways, and mailbox servers. Time drift turns header analysis into performance art.
“Hope is not a strategy.”
— paraphrased idea commonly used in operations and reliability circles