Mail outages don’t always look like outages. Sometimes the web app is green, the pager is quiet, and the only symptom is sales asking why customers “aren’t getting the email.” In Postfix land, the classic quiet killer is a line that feels harmless until your queue hits five digits: “Host or domain name not found. Name service error”.
This is the DNS failure mode that doesn’t crash a daemon. It just makes your MTA politely defer mail forever, like a coworker “circling back” until the heat death of the universe.
What “host not found” actually means in Postfix
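Postfix isn’t being poetic. “Host not found” usually means it asked the resolver a question and didn’t get a usable answer. That answer might be “NXDOMAIN” (the name doesn’t exist), “SERVFAIL” (the DNS server gave up), a timeout (no response), or “no data” (the name exists, but not the record type requested). When the log line ends in “try again,” the resolver reported a temporary failure, which usually points at SERVFAIL or a timeout rather than an authoritative NXDOMAIN.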
The most common places you’ll see this:
- Outbound delivery to remote recipients: Postfix tries to resolve MX for example.com, then A/AAAA for the MX hostnames.
- Inbound SMTP restrictions like reject_unknown_client_hostname or reject_unknown_sender_domain, where Postfix does lookups during the SMTP conversation.
- Local routing when you use DNS-based transport maps or relayhost rules.
Where the error string comes from
Postfix uses system resolver libraries for most lookups. So the “name service error” phrasing isn’t really a Postfix personality quirk; it often bubbles up from libc resolver behavior, influenced by:
- /etc/resolv.conf contents and search domains
- local caching resolvers (systemd-resolved, unbound, dnsmasq)
- network path to upstream resolvers
- DNSSEC validation failures (if enabled in your path)
- IPv6 behavior (AAAA queries can be surprisingly “loud”)
Most of the time, the immediate operational problem is simple: mail is stuck in the queue as deferred. The more dangerous problem is that nothing alerts you unless you’re watching queue depth, deferral reasons, or log patterns.
Paraphrased idea (attributed): John Allspaw has long pushed the operations mindset that reliability comes from learning how systems actually fail, not from assuming they won’t.
Take that seriously here. DNS failures are normal. Your mail system should treat them as routine turbulence, not as a rare comet strike.
Fast diagnosis playbook (first/second/third)
This is the “stop guessing” sequence. Do it in order. It’s designed to find the bottleneck quickly: is it Postfix config, local resolver, upstream DNS, or the remote domain?
First: confirm the error and isolate the target domain
- Find one failing message in the queue and capture the recipient domain and the exact error text.
- Check whether it’s one domain or many. One domain points to remote DNS; many domains point to your resolver path.
Second: test DNS from the Postfix host (not your laptop)
- dig MX and A/AAAA for the recipient domain and the MX targets.
- Repeat against your configured resolver and then a known-good public resolver (temporarily, for diagnosis). Differences matter.
- Check response codes: NXDOMAIN vs SERVFAIL vs timeout are different species of failure.
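The response-code check can be scripted so nobody has to eyeball headers at 3 a.m. This is a minimal sketch that assumes the usual dig output layout; classify_dig is a hypothetical helper name, not something dig or Postfix ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify dig output read from stdin.
# Prints one of: TIMEOUT, NXDOMAIN, SERVFAIL, NODATA, ANSWER, OTHER(<status>).
classify_dig() {
  awk '
    /connection timed out|no servers could be reached/ { timeout = 1 }
    /->>HEADER<<-/ {
      # e.g. ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6112"
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /^;; flags:/ {
      # e.g. ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1"
      for (i = 1; i <= NF; i++)
        if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
    }
    END {
      if (status == "" && timeout) print "TIMEOUT"
      else if (status == "NXDOMAIN" || status == "SERVFAIL") print status
      else if (status == "NOERROR" && answers + 0 == 0) print "NODATA"
      else if (status == "NOERROR") print "ANSWER"
      else print "OTHER(" status ")"
    }'
}
```

Usage, from the Postfix host: `dig +time=2 +tries=1 @10.20.0.10 MX recipient.tld 2>&1 | classify_dig`. Run it once per resolver and compare the words, not the vibes.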
Third: verify the local resolver chain and timeouts
- Inspect /etc/resolv.conf and local resolver services.
- Check for DNSSEC failures, split-horizon mismatches, and broken IPv6 reachability.
- Confirm Postfix isn’t jailed in a network namespace/container with different DNS.
If you do those three, you’ll stop arguing about whether “DNS is down” and start talking about which resolver path is lying to you.
Facts and history that explain today’s weirdness
DNS and email grew up together, which is why their failure modes are deeply entangled. A few short facts that help make sense of the mess:
- MX records were introduced to decouple mail routing from host addressing. Before MX, mail delivery leaned heavily on A records and implicit conventions.
- DNS TTLs were designed for caching efficiency, not operational clarity. When someone says “we fixed DNS,” you still have to ask “where, and what’s the TTL?”
- Negative caching is a thing. NXDOMAIN responses can be cached too, so a transient typo can linger like a bad smell.
- Resolvers often retry across multiple servers. One flaky resolver plus one good resolver can still create intermittent failures that look like “random Postfix issues.”
- SMTP was built for delay. Temporary failures are expected; MTAs queue and retry. That’s great—until it masks your incident.
- IPv6 adoption introduced “happy eyeballs” behavior in many stacks, but not always in the way you’d hope. AAAA lookup latency can still hurt you even if you don’t have working IPv6 routing.
- DNSSEC can turn “works on my resolver” into “SERVFAIL everywhere else.” Validation failures present as SERVFAIL to clients, which looks like “the domain is broken” even when records exist.
- Split-horizon DNS is common in enterprises. The same domain can resolve differently inside vs outside. Mail servers straddling boundaries are frequent casualties.
- Large providers publish complex DNS setups. Multiple MX records, geo-DNS, and short TTLs can amplify any weakness in your resolver chain.
DNS isn’t “just a phone book.” It’s a distributed database with caching, timeouts, partial failure, and politics. Email depends on it more than most applications, which is why you can lose mail deliverability while everything else looks fine.
Joke #1: DNS is the only database where “it depends” is a feature, not a bug.
How Postfix uses DNS (and where it can go wrong)
Outbound: MX lookup, then A/AAAA
For user@recipient.tld, Postfix does roughly:
- Look up MX for recipient.tld.
- If no MX, fall back to A/AAAA for recipient.tld (depending on policy and RFC behavior).
- For each MX hostname, look up A/AAAA.
- Attempt SMTP delivery to the resolved IPs, respecting priority and retry logic.
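The lookup order above can be walked by hand with dig. The sketch below is an approximation for diagnosis, not a reimplementation of Postfix: it skips the no-MX fallback and ignores a null MX ("0 ."). The function names mx_hosts_by_preference and walk_mx are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dig +short MX` lines ("<pref> <host>") from stdin
# and print the hostnames, lowest preference value first, trailing dot removed.
mx_hosts_by_preference() {
  sort -n -k1,1 | awk 'NF == 2 && $2 != "." { sub(/\.$/, "", $2); print $2 }'
}

# Hypothetical helper: for each MX host of a domain, show its A and AAAA answers.
# An empty answer column is the interesting case: an MX target with no address.
walk_mx() {
  local domain=$1 host
  dig +short MX "$domain" | mx_hosts_by_preference | while read -r host; do
    printf '%s A    %s\n' "$host" "$(dig +short A "$host" | tr '\n' ' ')"
    printf '%s AAAA %s\n' "$host" "$(dig +short AAAA "$host" | tr '\n' ' ')"
  done
}
```

Usage: `walk_mx recipient.tld`. If it prints nothing at all, the MX lookup itself returned no usable records, which is the same stage where the type=MX error is raised.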
“Host not found” can happen at multiple points: MX lookup fails, MX exists but points to a hostname with no A/AAAA, or your resolver returns junk.
Inbound: policy checks that trigger DNS during SMTP
Modern Postfix configs often include restrictions that call DNS:
- reject_unknown_sender_domain checks the sender domain for MX and A/AAAA records.
- reject_unknown_client_hostname performs PTR then forward-confirmed reverse DNS (FCrDNS) checks.
- RBL lookups are DNS queries too; they can stall if your resolvers are slow.
These checks are useful, but they’re also a dependency injection mechanism. You’ve embedded DNS availability into your SMTP response path. If your resolver is slow, your SMTP service becomes slow. If your resolver is broken, your SMTP service becomes “mysteriously rude.”
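To see how much DNS you have wired into the SMTP conversation, you can inventory the DNS-dependent restrictions in the active config. A minimal sketch; dns_dependent_restrictions is a hypothetical helper, and the pattern only covers the reject_unknown_*, reject_rbl_client, and reject_rhsbl_* families, so treat the result as a floor, not a full audit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `postconf -n` output on stdin and count
# restrictions that trigger DNS lookups during the SMTP conversation.
dns_dependent_restrictions() {
  grep -oE 'reject_(unknown_[a-z_]+|rbl_client|rhsbl_[a-z_]+)' | sort | uniq -c
}
```

Usage: `postconf -n | dns_dependent_restrictions`. Every line in the output is a place where a slow resolver becomes a slow or rude SMTP server.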
Local resolver path: your real dependency
Postfix typically relies on whatever the host uses for name resolution. That means your “DNS design” includes:
- Stub resolver (127.0.0.53 with systemd-resolved, or local unbound, etc.)
- Upstream resolvers (corp DNS, VPC DNS, ISP, public)
- Firewall rules and routing (especially UDP/53, TCP/53, and fragmented UDP)
- EDNS0 and DNSSEC behavior
If you want fewer mail outages, treat DNS like a tier-0 dependency. Monitor it. Redundantly provision it. Keep it boring.
Practical tasks: commands, outputs, and decisions (12+)
These are the tasks I actually run when “host not found” appears. Each task includes: command, what you might see, what it means, and the decision you make.
Task 1: Identify the exact deferral reason in the queue
cr0x@server:~$ postqueue -p | sed -n '1,40p'
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
9C2E12A3B5 3123 Fri Jan 3 10:21:54 sender@example.net
(connect to mx1.recipient.tld[203.0.113.10]:25: Connection timed out)
user@recipient.tld
B1A7F0D91C 2210 Fri Jan 3 10:22:09 sender@example.net
(host mx2.recipient.tld[198.51.100.77] said: 450 4.1.8 <user@recipient.tld>: Recipient address rejected: Domain not found)
user@recipient.tld
C4F9A3D2E1 1444 Fri Jan 3 10:22:32 sender@example.net
(Host or domain name not found. Name service error for name=recipient.tld type=MX: Host not found, try again)
user@recipient.tld
What it means: You’re seeing different failure classes. Only the last one is the DNS lookup failure. Don’t conflate them.
Decision: Pick one representative message with the DNS error, and focus on that domain and lookup type.
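With a large queue, sorting deferrals into classes by eye gets old fast. The sketch below buckets the parenthesized reason lines from postqueue -p into rough classes. The function name and the class labels are hypothetical, and the matching is deliberately crude.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `postqueue -p` output on stdin and count deferral
# reasons by rough class. Reason lines are the ones that start with "(".
deferral_classes() {
  grep -E '^[[:space:]]*\(' | awk '
    { line = tolower($0) }
    line ~ /name service error/           { c["dns-lookup"]++; next }
    line ~ /connect to|timed out|refused/ { c["connect"]++; next }
    line ~ /said:/                        { c["remote-reply"]++; next }
    { c["other"]++ }
    END { for (k in c) print c[k], k }' | sort -rn
}
```

Usage: `postqueue -p | deferral_classes`. If dns-lookup dominates, keep reading; if connect dominates, you have a different incident.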
Task 2: Pull a single queue file for full context
cr0x@server:~$ postcat -q C4F9A3D2E1 | sed -n '1,80p'
*** ENVELOPE RECORDS ***
message_size: 1444 1444
message_arrival_time: Fri Jan 3 10:22:32 2026
sender: sender@example.net
named_attribute: rewrite_context=remote
recipient: user@recipient.tld
*** DELIVERED TO THIS RECIPIENT ***
remote: recipient.tld
offset: 1234
status: deferred
reason: Host or domain name not found. Name service error for name=recipient.tld type=MX: Host not found, try again
What it means: The resolver call failed on an MX query for the domain itself, not the MX target hostnames.
Decision: Test MX resolution for recipient.tld from this host, using the same resolver path.
Task 3: Read mail logs for patterns and scope
cr0x@server:~$ sudo grep -E "Name service error|host not found|type=MX" -n /var/log/mail.log | tail -n 15
38412:Jan 3 10:22:32 server postfix/smtp[21903]: C4F9A3D2E1: to=<user@recipient.tld>, relay=none, delay=12, delays=0.1/0.01/12/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=recipient.tld type=MX: Host not found, try again)
38458:Jan 3 10:24:10 server postfix/smtp[21922]: 7A11B7D4C2: to=<ops@another.tld>, relay=none, delay=10, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=another.tld type=MX: Host not found, try again)
What it means: Multiple domains are failing. That’s a strong hint your resolver path is sick, not just one remote domain’s DNS.
Decision: Switch from “debug recipient.tld” to “debug our DNS client path.”
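To put a number on "one domain or many," count failing lookups per name and record type. A minimal sketch that assumes the log line format shown in this task; failing_lookups is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read mail log lines on stdin and count DNS lookup
# failures per "<name> <record type>", most frequent first.
failing_lookups() {
  grep -F 'Name service error for name=' |
    sed -E 's/.*Name service error for name=([^ ]+) type=([A-Za-z]+).*/\1 \2/' |
    sort | uniq -c | sort -rn
}
```

Usage: `sudo cat /var/log/mail.log | failing_lookups | head`. A long list of unrelated domains points at your resolver path; a single line points at one remote domain.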
Task 4: Confirm what resolvers the host will use
cr0x@server:~$ cat /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
search corp.internal
What it means: You’re using the systemd-resolved stub. Postfix will ask 127.0.0.53, which forwards to whatever systemd-resolved believes is correct.
Decision: Inspect systemd-resolved state and upstream servers next. Also watch out for search domains causing surprise queries.
Task 5: Inspect systemd-resolved’s upstream DNS and per-link config
cr0x@server:~$ resolvectl status | sed -n '1,120p'
Global
Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (ens3)
Current Scopes: DNS
Protocols: +DefaultRoute
Current DNS Server: 10.20.0.10
DNS Servers: 10.20.0.10 10.20.0.11
DNS Domain: corp.internal
What it means: Upstream resolvers are 10.20.0.10 and 10.20.0.11. If those are unreachable or misconfigured, Postfix will fail lookups.
Decision: Test queries directly against each upstream server to catch split-brain or one-server failure.
Task 6: Query MX directly against the configured resolvers
cr0x@server:~$ dig +time=2 +tries=1 @10.20.0.10 MX recipient.tld
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 @10.20.0.10 MX recipient.tld
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 45123
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 1998 msec
;; SERVER: 10.20.0.10#53(10.20.0.10) (UDP)
;; WHEN: Fri Jan 03 10:31:22 UTC 2026
;; MSG SIZE rcvd: 56
What it means: Not NXDOMAIN. SERVFAIL implies the resolver couldn’t complete the lookup (DNSSEC failure, upstream problem, lame delegation, recursion broken).
Decision: Query the second resolver; if one works and one SERVFAILs, you have intermittent failures depending on which server gets used.
Task 7: Compare with the second resolver
cr0x@server:~$ dig +time=2 +tries=1 @10.20.0.11 MX recipient.tld
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 @10.20.0.11 MX recipient.tld
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6112
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; ANSWER SECTION:
recipient.tld. 300 IN MX 10 mx1.recipient.tld.
recipient.tld. 300 IN MX 20 mx2.recipient.tld.
;; Query time: 28 msec
;; SERVER: 10.20.0.11#53(10.20.0.11) (UDP)
What it means: One resolver is broken. Your host will still use it sometimes, depending on resolver selection and timeouts.
Decision: Remove or fix 10.20.0.10. Don’t “wait for it to heal.” DNS servers that sometimes SERVFAIL are how you get haunted queues.
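The per-resolver comparison from Tasks 5 to 7 is worth scripting, because you will run it again. A minimal sketch, assuming resolvectl prints upstream servers on a single "DNS Servers:" line as in Task 5 (some versions wrap the list across lines, which this does not handle). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `resolvectl status` output on stdin and print
# each upstream server listed on "DNS Servers:" lines, one per line.
upstream_resolvers() {
  awk '/^[[:space:]]*DNS Servers:/ { for (i = 3; i <= NF; i++) print $i }' | sort -u
}

# Hypothetical helper: ask each given resolver for the domain's MX and
# print "<resolver> <status>", or NO_RESPONSE if no header came back.
status_per_resolver() {
  local domain=$1 r status
  shift
  for r in "$@"; do
    status=$(dig +time=2 +tries=1 @"$r" MX "$domain" 2>/dev/null |
      sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
    printf '%s %s\n' "$r" "${status:-NO_RESPONSE}"
  done
}
```

Usage: `status_per_resolver recipient.tld $(resolvectl status | upstream_resolvers)`. Two resolvers giving two different answers for the same question is your intermittent failure, caught in the act.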
Task 8: Validate the MX targets resolve to addresses
cr0x@server:~$ dig +short A mx1.recipient.tld.
203.0.113.10
What it means: There is an IPv4 address for the MX host. Do the same for AAAA even if you “don’t use IPv6.”
Decision: If A exists but AAAA lookup stalls due to broken IPv6 path, you may need to fix IPv6 or adjust resolver behavior; see Task 12.
Task 9: Check for NXDOMAIN vs NODATA precisely
cr0x@server:~$ dig @10.20.0.11 MX nonexistent-example.tld
; <<>> DiG 9.18.24 <<>> @10.20.0.11 MX nonexistent-example.tld
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13255
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; AUTHORITY SECTION:
. 300 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2026010300 1800 900 604800 86400
What it means: NXDOMAIN is definitive: the domain doesn’t exist in DNS. That’s a permanent failure in principle, but Postfix may still treat it as temporary depending on context.
Decision: For outbound mail: bounce if it’s truly invalid; for inbound checks: consider whether you want to reject mail based on sender-domain existence (it can cause false positives during DNS incidents).
Task 10: Confirm whether queries time out (network/firewall/path MTU)
cr0x@server:~$ dig +time=2 +tries=1 @10.20.0.10 MX recipient.tld
;; connection timed out; no servers could be reached
What it means: Not SERVFAIL—this is a connectivity issue. Could be firewall, routing, DNS service down, or UDP fragments being dropped.
Decision: Test TCP/53, and test basic reachability to the resolver.
Task 11: Test DNS over TCP (catches “UDP works until it doesn’t”)
cr0x@server:~$ dig +tcp +time=2 +tries=1 @10.20.0.10 MX recipient.tld
;; connection timed out; no servers could be reached
What it means: If both UDP and TCP time out, the resolver is unreachable or blocked. If UDP fails but TCP works, you may have a UDP fragmentation/EDNS0 problem.
Decision: Fix firewall rules and MTU/fragmentation. For quick mitigation, some environments reduce EDNS0 UDP size on resolvers, but that’s a bandage you should track and remove later.
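The UDP-versus-TCP comparison from Tasks 10 and 11 fits in a small helper. This sketch relies on dig returning a non-zero exit status when it gets no reply from the server; the function names and the verdict wording are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "ok" if dig got any reply, "fail" otherwise.
# A SERVFAIL or NXDOMAIN reply still counts as "ok": the transport worked.
probe_dig() {
  if dig "$@" +time=2 +tries=1 >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# Hypothetical helper: turn a UDP result and a TCP result into a hint.
udp_tcp_verdict() {
  case "$1/$2" in
    ok/ok)   echo "both transports work" ;;
    fail/ok) echo "UDP-only failure: suspect fragmentation/EDNS0 or a UDP/53 filter" ;;
    ok/fail) echo "TCP-only failure: suspect a TCP/53 filter" ;;
    *)       echo "resolver unreachable or blocked on both transports" ;;
  esac
}

# Hypothetical helper: probe one resolver over both transports.
check_transports() {
  local server=$1 domain=$2
  udp_tcp_verdict "$(probe_dig "@$server" MX "$domain")" \
                  "$(probe_dig +tcp "@$server" MX "$domain")"
}
```

Usage: `check_transports 10.20.0.10 recipient.tld`. Pick a domain with a large MX response if you suspect fragmentation, since small answers can pass while big ones die.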
Task 12: Detect broken IPv6 that slows resolution
cr0x@server:~$ dig +time=2 +tries=1 AAAA mx1.recipient.tld. +stats
; <<>> DiG 9.18.24 <<>> +time=2 +tries=1 AAAA mx1.recipient.tld. +stats
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9001
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; Query time: 1995 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; MSG SIZE rcvd: 49
What it means: “NOERROR with 0 answers” (NODATA) is fine. The bad part is the two-second latency. That’s typically a resolver delay (upstream timeout, v6 path issue, or DNSSEC retries).
Decision: If AAAA queries consistently cost seconds, you’re adding latency to every delivery attempt. Fix the resolver path; don’t “just disable IPv6” unless you fully understand the blast radius.
Task 13: Confirm Postfix isn’t using a different DNS view (containers/namespaces)
cr0x@server:~$ sudo postconf -n | grep -E "smtp_host_lookup|smtp_dns|inet_protocols|resolve"
inet_protocols = all
smtp_host_lookup = dns
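What it means: Postfix is doing DNS lookups normally, but this output alone doesn’t show which resolver configuration it reads. In a containerized deployment it uses the container’s resolv.conf, and chrooted Postfix services read the copy under the queue directory (commonly /var/spool/postfix/etc/resolv.conf), which can go stale.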
Decision: If Postfix runs in a container, run the same dig commands from inside the container. If it runs on the host, keep debugging on the host.
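One "different DNS view" that needs no container at all: chrooted Postfix services read a copy of resolv.conf under the queue directory instead of /etc/resolv.conf. The sketch below compares the two files. It assumes your services are actually chrooted (check master.cf) and that the copy lives at the common default path; compare_resolv_conf is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the system resolv.conf with the copy that
# chrooted Postfix services would read. Returns 1 if they differ.
compare_resolv_conf() {
  local system_file=${1:-/etc/resolv.conf}
  local chroot_file=${2:-/var/spool/postfix/etc/resolv.conf}  # assumed default path
  if [ ! -e "$chroot_file" ]; then
    echo "no chroot copy at $chroot_file"
    return 0
  fi
  if cmp -s "$system_file" "$chroot_file"; then
    echo "in sync"
  else
    echo "DIFFERENT"
    diff "$system_file" "$chroot_file"
    return 1
  fi
}
```

Usage: `compare_resolv_conf /etc/resolv.conf "$(postconf -h queue_directory)/etc/resolv.conf"`. A stale copy here explains the classic "dig works, Postfix doesn't" standoff.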
Task 14: Check for resolver library timeouts and rotate behavior
cr0x@server:~$ grep -E "nameserver|options" /etc/resolv.conf
nameserver 127.0.0.53
options edns0 trust-ad
What it means: You’re not explicitly controlling timeout and attempts here. Defaults vary. Some environments also set rotate, which can spread pain evenly across resolvers.
Decision: If one upstream resolver is sick, remove it or fix it. Tweaking resolver retry parameters is not a substitute for healthy DNS.
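To see at a glance whether timeout, attempts, or rotate are set explicitly, a small summary helps. A minimal sketch; resolv_summary is a hypothetical helper, and it only reports what the file says. When the options are absent, the resolver library's defaults apply (on glibc, a 5-second timeout and 2 attempts).

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the resolver options that shape failover
# behavior. Reads /etc/resolv.conf unless another file is given.
resolv_summary() {
  awk '
    $1 == "nameserver" { count++ }
    $1 == "options"    { for (i = 2; i <= NF; i++) opts = opts " " $i }
    END {
      printf "nameservers: %d\n", count
      printf "timeout set: %s\n",  (opts ~ /(^| )timeout:/    ? "yes" : "no")
      printf "attempts set: %s\n", (opts ~ /(^| )attempts:/   ? "yes" : "no")
      printf "rotate set: %s\n",   (opts ~ /(^| )rotate( |$)/ ? "yes" : "no")
    }' "${1:-/etc/resolv.conf}"
}
```

Usage: `resolv_summary`. With the systemd-resolved stub you will usually see a single nameserver, which means the real failover logic lives in systemd-resolved, not in this file.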
Task 15: Validate that your own hostname and HELO name resolve
cr0x@server:~$ postconf -n | grep -E "myhostname|mydomain|myorigin"
myhostname = mail01.corp.internal
myorigin = /etc/mailname
cr0x@server:~$ dig +short A mail01.corp.internal
10.30.4.21
What it means: If your own hostname doesn’t resolve in the DNS view you’re using, you’ll trigger weird remote rejections and sometimes local policy failures.
Decision: Make sure the identity your Postfix presents (myhostname, smtpd_banner, TLS names) is resolvable and consistent across views that matter.
Task 16: If deferrals are piling up, confirm queue growth and retry behavior
cr0x@server:~$ postqueue -p | tail -n 1
-- 12753 Kbytes in 2843 Requests.
What it means: You have a backlog. Postfix will retry, but you need to manage the operational impact: disk usage, bounce storms, and delayed business workflows.
Decision: Fix DNS first. Then consider controlled queue flush (postqueue -f) during a quiet window, and watch remote rate limits.
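Queue depth is easy to turn into a number you can alert on. A minimal sketch that parses the summary line shown above; queue_requests is a hypothetical helper, and the threshold in the usage line is an arbitrary example, not a recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `postqueue -p` output on stdin and print the
# number of queued requests, 0 for an empty queue, or "unknown".
queue_requests() {
  awk '
    /^-- [0-9]+ Kbytes in [0-9]+ Requests?\.$/ { n = $5 }
    /^Mail queue is empty/                     { n = 0 }
    END { print (n == "" ? "unknown" : n) }'
}
```

Usage: `n=$(postqueue -p | queue_requests); [ "$n" != unknown ] && [ "$n" -gt 1000 ] && echo "queue backlog: $n"`. Sample it every minute and alert on growth rate as well as absolute size.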
Three corporate mini-stories (anonymized, plausible, technically accurate)
Mini-story 1: The incident caused by a wrong assumption
They migrated a pair of mail relays into a new VPC. Everything else was already there: app servers, databases, caches, the usual cloud zoo. The team assumed DNS would “just work,” because DNS had worked for everything else.
Within an hour, the mail queue started to swell. No alarms, because the SMTP service was up. CPU was fine. Disks were fine. From the outside, it looked like a slow Tuesday. From inside, outbound mail was piling up with host not found errors across multiple recipient domains.
The assumption that bit them: “If curl can resolve names, Postfix can too.” Except Postfix lived in a hardened container image with its own /etc/resolv.conf pinned to an old internal resolver IP that didn’t exist in the new VPC. The host’s DNS worked; the container’s did not.
Fixing it was embarrassingly simple: rebuild the container to use the platform DNS, and add a test that runs a real MX lookup during deployment. What hurt was the time spent debugging “Postfix” while DNS was the actual fault line.
They also added a queue-depth alert. Not because it’s fancy, but because it tells you the truth even when everything else is lying.
Mini-story 2: The optimization that backfired
A security team rolled out aggressive SMTP restrictions to reduce spam and spoofing. Good goals. They enabled reject_unknown_sender_domain and some additional client hostname checks. The mail logs looked cleaner. Fewer obvious garbage senders. Everybody felt accomplished.
Then a routine maintenance window hit the corporate DNS resolvers. Not a full outage—just a partial failure where one resolver intermittently SERVFAILed and the other was fine. Web traffic didn’t notice because browsers and app stacks retried, cached, and masked the issue. SMTP, however, does real-time policy decisions per connection. The mail relays started rejecting legitimate mail sporadically during the DNS wobble.
The “optimization” was operationally expensive: they moved failure from “deferred outbound delivery” (recoverable) to “rejected inbound mail” (potentially lost, depending on sender behavior). Some senders retried; some didn’t. That’s not a moral failing by the sender. It’s just how the ecosystem behaves.
They kept the checks, but changed the philosophy: DNS-dependent rejections were applied only where they had high confidence and low loss risk, and they built resolver health monitoring into the mail tier. Most importantly, they decided that when DNS is degraded, the MTA should degrade gracefully—more deferrals, fewer hard rejects.
Mini-story 3: The boring but correct practice that saved the day
A different company had a boring rule: every production mail relay ran a local caching resolver (unbound) on localhost, with two upstream resolvers and strict timeouts. They also had a tiny synthetic check: every minute, resolve MX for three well-known external domains and one internal domain, from the mail host itself. The check wasn’t smart. It was just persistent.
One morning, upstream DNS started dropping UDP fragments due to a firewall change. Some queries still worked—small ones. Others timed out—bigger responses, DNSSEC-heavy answers, certain MX sets. The synthetic checks caught it immediately because MX responses tend to be “chunkier” than basic A records.
The mail queue started to grow, but they already had alerts for queue length and for DNS resolution latency. The incident response was boring too: roll back the firewall policy, flush the resolver cache, watch the queue drain, and don’t touch unrelated knobs.
They didn’t win awards. They delivered mail. In operations, that’s the award.
Common mistakes: symptom → root cause → fix
1) “Host not found” for many domains at once
Symptom: Multiple unrelated recipient domains failing with MX lookup errors.
Root cause: Your resolver path is broken (one upstream resolver down, systemd-resolved confused, firewall blocking DNS, or DNSSEC validation failures).
Fix: Identify which resolver returns SERVFAIL/timeout, remove it from rotation, restore network reachability, and verify with dig @resolver MX domain tests.
2) “Host not found” only for one recipient domain
Symptom: One domain consistently fails; others deliver.
Root cause: Remote domain misconfigured DNS (missing MX and no A fallback, broken delegation, stale DS record causing DNSSEC failure).
Fix: Verify from multiple resolvers. If it fails everywhere, it’s theirs; queue and retry. If it fails only from your corporate resolver, fix your resolver or bypass for outbound mail via a stable recursive resolver.
3) Intermittent failures: half the time it works
Symptom: Same domain alternates between working and “host not found.”
Root cause: Multiple resolvers with inconsistent behavior; resolver selection/timeout behavior produces roulette. Sometimes it’s also split-horizon DNS depending on network path.
Fix: Query each resolver directly. Remove the bad one. If split-horizon is intended, ensure the mail relay is in the correct DNS view consistently.
4) Slow mail delivery, not total failure
Symptom: Mail eventually delivers but takes minutes; SMTP sessions appear sluggish.
Root cause: DNS lookups are timing out before succeeding (common with broken IPv6 routes, overloaded resolvers, or RBL timeouts).
Fix: Measure query time with dig +stats. Fix resolver performance, ensure IPv6 is either working end-to-end or handled correctly, and avoid synchronous DNS checks in SMTP path when your resolver SLO isn’t solid.
5) After “DNS fix,” Postfix still fails for hours
Symptom: You corrected records but Postfix keeps seeing NXDOMAIN or wrong answers.
Root cause: Caches. Negative caching. Local caching resolver holding old data. Or you fixed one authoritative server but not others.
Fix: Check TTLs and SOA. Flush local caches where appropriate. Verify authoritative consistency. Don’t keep restarting Postfix; it’s not a ritual sacrifice that placates the resolver gods.
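How long can a cached NXDOMAIN linger? Under RFC 2308, the negative-caching TTL is the smaller of the SOA record's own TTL and the SOA minimum field. The sketch below pulls that number from the authority section of a dig response; negative_ttl is a hypothetical helper, and resolvers may also apply their own caps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read dig output on stdin, find the first SOA record,
# and print min(SOA TTL, SOA minimum field) in seconds.
# SOA line layout: name ttl class SOA mname rname serial refresh retry expire minimum
negative_ttl() {
  awk '$4 == "SOA" && NF >= 11 {
    ttl = $2 + 0; min = $11 + 0
    print (ttl < min ? ttl : min)
    exit
  }'
}
```

Usage: `dig @10.20.0.11 MX nonexistent-example.tld | negative_ttl`. That is roughly how many seconds a resolver may keep telling you "no such domain" after the record has been fixed.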
6) “Host not found” after enabling strict anti-spam rules
Symptom: Inbound mail gets rejected with domain/hostname checks during DNS blips.
Root cause: You moved DNS dependency into the live SMTP decision path.
Fix: Re-evaluate restrictions. Prefer temporary deferrals over permanent rejects when DNS is unreliable. Monitor resolver health and consider local caching with predictable behavior.
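One concrete thing to check is whether someone changed the unknown-name reply codes from their 450 defaults to a 5xx. A minimal sketch; hard_reject_codes is a hypothetical helper, and it only looks at the three reply-code parameters named in the usage line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `postconf` "name = value" lines on stdin and
# print any unknown_*_reject_code that is set to a permanent (5xx) code.
hard_reject_codes() {
  awk -F' *= *' '
    $1 ~ /^unknown_(address|client|hostname)_reject_code$/ && $2 ~ /^5/ {
      print $1 " = " $2
    }'
}
```

Usage: `postconf unknown_address_reject_code unknown_client_reject_code unknown_hostname_reject_code | hard_reject_codes`. No output means these checks answer with temporary codes, so a well-behaved sender will retry instead of bouncing.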
Joke #2: Restarting Postfix won’t fix DNS, but it does burn calories—mostly yours.
Checklists / step-by-step plan
Step-by-step: when you’re actively losing mail delivery
- Confirm scope: One domain or many? Use grep on mail logs and sample queue entries.
- Run direct DNS queries: dig MX for failing domains against each configured resolver.
- Classify the failure: NXDOMAIN vs SERVFAIL vs timeout. Don’t treat them the same.
- Fix the resolver path:
  - If one resolver is bad, remove it from config or take it out of service.
  - If network is blocking, fix firewall/routing.
  - If DNSSEC is the issue, fix validation chain; don’t disable DNSSEC blindly on the resolver without a clear risk decision.
- Stabilize Postfix behavior: Avoid config churn during the incident. Get DNS healthy first.
- Drain queue carefully: Once DNS is stable, consider postqueue -f and watch for remote throttling.
- Post-incident hardening: Add queue-depth alerts, DNS latency checks, and resolver redundancy.
Hardening checklist: keep “host not found” from being a surprise
- Run a local caching resolver on mail relays (or use a proven stub-to-recursive design) and monitor it.
- Use at least two upstream resolvers that are independently healthy.
- Test DNS from the same network namespace/container that runs Postfix.
- Alert on:
  - mail queue size and growth rate
  - rate of “Name service error” log lines
  - DNS query latency and SERVFAIL rates from mail hosts
- Be cautious with SMTP restrictions that do synchronous DNS in the critical path.
- Document how split-horizon DNS is supposed to work for mail relays. If it’s “nobody knows,” it will fail at 3 a.m.
- Validate outbound deliverability in CI/CD: at least one MX lookup and one test SMTP connection to a known domain.
- Keep IPv6 either fully working or explicitly designed around. Half-configured IPv6 is a latency tax.
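The DNS latency alert in this checklist can start as a tiny synthetic probe run from the mail host itself. A minimal sketch, assuming dig is installed; the function names, the 500 ms default threshold, and the OK/SLOW/FAIL wording are all hypothetical choices to adapt to your monitoring system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the query time in milliseconds from dig output.
query_time_ms() {
  awk '/^;; Query time:/ { print $4; exit }'
}

# Hypothetical helper: resolve MX for one domain through the host's normal
# resolver path. Returns 0 for OK, 1 for slow, 2 for failed.
probe_mx() {
  local domain=$1 threshold_ms=${2:-500} out ms
  out=$(dig +time=2 +tries=1 MX "$domain" 2>&1)
  ms=$(printf '%s\n' "$out" | query_time_ms)
  if [ -z "$ms" ]; then
    echo "FAIL $domain no-response"; return 2
  fi
  if ! printf '%s\n' "$out" | grep -q 'status: NOERROR'; then
    echo "FAIL $domain bad-status"; return 2
  fi
  if [ "$ms" -gt "$threshold_ms" ]; then
    echo "SLOW $domain ${ms}ms"; return 1
  fi
  echo "OK $domain ${ms}ms"
}
```

Usage: `probe_mx recipient.tld 500` from cron or your monitoring agent, against a handful of external domains plus one internal one. It isn't smart. It just needs to be persistent.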
Operational decision points (the ones people argue about)
- Do we bypass corporate DNS? For outbound mail relays, sometimes yes—if corporate DNS is a frequent source of incidents. Make it a managed decision, not a midnight hack.
- Do we reject inbound mail when DNS is flaky? Usually no. Prefer temporary failures so senders can retry.
- Do we disable IPv6? Only if you own the consequences. Better: make IPv6 correct, or ensure resolver and routing behavior doesn’t stall on it.
FAQ
1) What’s the difference between NXDOMAIN and SERVFAIL in this context?
NXDOMAIN means the domain name does not exist in DNS. SERVFAIL means the resolver couldn’t complete the query (upstream failure, DNSSEC validation issue, recursion disabled, lame delegation). NXDOMAIN is a “no such domain” answer; SERVFAIL is “I tried and something broke.”
2) Why does Postfix say “try again” if the domain truly doesn’t exist?
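The “try again” suffix comes from the resolver reporting a temporary failure, typically SERVFAIL or a timeout, not an authoritative “no such domain.” Postfix defers in that case because DNS is a distributed system and such failures are often transient. A name that truly doesn’t exist can still look like this when a resolver glitch or a split-horizon view hides the real answer. If you want deterministic bounce behavior, you need policy decisions informed by reliable DNS signals.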
3) Can one bad resolver really cause intermittent mail failures?
Yes. If your system has two resolvers and the resolver library rotates or fails over imperfectly, you’ll see sporadic SERVFAIL/timeouts. Mail delivery then looks “random,” because it depends on which resolver answered that specific query at that moment.
4) Why do AAAA lookups matter if we only deliver mail over IPv4?
Even if delivery ultimately uses IPv4, the resolver may still query AAAA and wait for timeouts or retries. That adds latency and can trigger “host not found” if the resolver path is unhealthy. You don’t have to love IPv6. You do have to account for it.
5) How do I tell if this is our DNS or the recipient’s DNS?
Test from your mail host against multiple resolvers. If your configured resolver fails but a known-good resolver returns correct MX records, it’s likely your resolver path. If multiple independent resolvers fail the same way, it’s likely the recipient domain’s DNS or its delegation/DNSSEC state.
6) Does restarting Postfix help?
Rarely. Postfix isn’t usually caching DNS aggressively in a way that a restart fixes. Restarts can add churn and delay while the queue continues to grow. Fix DNS, then flush the queue when stable.
7) Why do we see “host not found” but only during peak hours?
Your resolvers may be overloaded, dropping queries, or upstream rate-limited. Peak hour problems are often capacity or packet loss problems disguised as “DNS mysteries.” Measure query latency and timeout rates during peak, not at 2 a.m. when everything is calm.
8) What Postfix settings most commonly amplify DNS failures?
SMTP restrictions that do DNS lookups during connection handling (sender domain checks, client hostname checks) and heavy RBL usage can turn resolver slowness into SMTP slowness or rejection. That’s not inherently wrong—it’s a trade-off you should make intentionally.
9) How do I keep mail flowing during a resolver incident?
Short-term: remove the failing resolver from your configuration, or point the mail relays at a stable recursive resolver designed for production. Medium-term: run a local caching resolver on the mail host to smooth upstream failures and reduce latency.
10) Are “search” domains in resolv.conf a real problem for Postfix?
They can be. A missing dot or a partial name can trigger extra queries as the resolver appends search domains. That wastes time and can create surprising traffic patterns. For mail infrastructure, be conservative with search domains and prefer fully-qualified names in configs.
Conclusion: next steps you can do this week
If you only do three things, do these:
- Add monitoring that reflects reality: alert on queue depth and on DNS lookup failures/latency from the mail hosts themselves.
- Make your resolver path boring: two healthy upstream resolvers, predictable behavior, and ideally a local caching resolver on mail relays.
- Stop turning DNS hiccups into mail loss: be cautious with DNS-dependent SMTP rejections unless you have strong resolver reliability and clear business intent.
Then schedule one short exercise: pick a staging relay, intentionally break one resolver IP, and watch what happens. If your system fails “silently,” you’ve learned something priceless—and you learned it without production being on fire.