Email: TTL during migration — the simple trick that prevents downtime

Email migrations don’t usually fail because IMAP sync was slow or because someone misread a vendor wizard.
They fail because mail keeps going to the old place after you “switched,” or worse: half the world goes old,
half goes new, and you learn what split delivery feels like at 2 a.m.

The fix is boring, reliable, and absolutely not optional: control DNS caching with TTL before you cut over.
You can’t eliminate propagation, but you can make it short, measurable, and reversible.

TTL in one sentence (and why it matters for email)

TTL (Time To Live) is how long a DNS answer is allowed to be cached before the resolver must ask again.
During an email migration, TTL determines how long senders keep using the old MX records after you change them.

That sounds simple because it is. It’s also one of the few migration levers you can pull weeks in advance,
without touching the mail servers at all.

Here’s the operational truth: a cutover isn’t when you change DNS; it’s when enough resolvers
stop believing the old answer. TTL is the schedule you hand to the internet.

What actually breaks during email cutovers

1) Split delivery (the quiet disaster)

Some senders resolve your MX to the old provider. Others resolve to the new provider. Both accept mail.
Users report “missing” messages, and both sides are right. Mail is delivered. Just not to the same place.

If you’re doing coexistence intentionally, split delivery can be planned. Most teams aren’t.
Most teams accidentally create it and call it “intermittent.”

2) Backscatter and retries that look like loss

SMTP is a store-and-forward system. Senders retry on transient failures.
When you change MX and something is misconfigured (TLS, IP reputation, greylisting, firewall),
you can get a backlog of retries that arrive late. Users call it “lost,” then it arrives at 3 p.m.

3) Authentication misalignment (SPF/DKIM/DMARC)

Changing where you send mail from is often coupled with changing where you receive mail.
If the new provider signs DKIM differently, or you send from new IPs without SPF updates,
you can pass delivery but fail trust. That means junk folder, not outage. In practice, it’s an outage.

4) Overconfident assumptions about “DNS propagation”

Teams treat DNS like a magical eventual-consistency fog. It’s not fog. It’s caching with rules.
The rules are measurable. If you don’t measure them, you’re not migrating; you’re gambling.

Joke #1: DNS propagation is like “waiting for the internet to update”—which is adorable until you realize the internet doesn’t take tickets.

Interesting facts and a little history (so you stop guessing)

  • DNS is older than most email SaaS providers. The Domain Name System was designed in the early 1980s to replace HOSTS.TXT distribution at scale. Email rode on top of it, and still does.
  • MX records exist because SMTP needed indirection. Early mail routing needed a way to say “mail for this domain goes there,” not “this domain’s A record is the mail server.” MX formalized that.
  • TTL isn’t new; what changed is caching behavior. Modern recursive resolvers, ISP resolvers, and enterprise caches aggressively store answers, and some add “helpful” behaviors like prefetching.
  • Negative caching is a thing. NXDOMAIN answers can be cached based on the zone’s SOA values, so “we’ll add that record later” can still bite you for a while.
  • SMTP was built for retries, not instant delivery. Queue-and-retry is why email is resilient… and why migration mistakes can take hours to surface.
  • Some senders pin results longer than you expect. While resolvers should honor TTL, some environments effectively hold onto answers due to internal forwarders, policy caching, or stale recursor behavior.
  • Mail transfer agents use MX preference ordering. Lowest preference value wins. Misordering can silently route mail somewhere “valid” but wrong.
  • Long TTLs were historically used to reduce DNS load. In the days when bandwidth and CPU were expensive, admins set multi-hour TTLs to keep resolvers from hammering authoritative servers.
  • Email migrations got harder when authentication got stricter. SPF (early 2000s), DKIM (2007), and DMARC (2012) improved trust but increased the number of moving parts during cutover.

The “simple trick”: TTL staging that prevents downtime

You do not “just lower TTL when you’re ready.” That’s the classic mistake. Lowering TTL only helps after the
new TTL has had time to replace the old cached answers. If your MX TTL is 3600 seconds and you lower it to 300
seconds at noon, a resolver that cached at 11:59 still has permission to believe the 3600-second answer until 12:59.
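
A quick way to turn that into a concrete timestamp, as a minimal sketch (the change time below is an assumption, not a measurement; it uses GNU date on the same kind of Linux box as the tasks later on):

old_ttl=3600                                  # TTL that was live before you lowered it
lowered_at=$(date -d "2026-01-04 12:00" +%s)  # when the authoritative change went in (assumed)
margin=$old_ttl                               # extra safety: wait one more old TTL
date -d "@$((lowered_at + old_ttl + margin))" # earliest moment you should trust the low TTL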

The trick is staging TTL changes so that, on cutover day, the internet is already trained to re-check frequently.
Then your MX flip becomes a short window instead of a half-day mystery.

The core pattern

  1. Days before cutover: lower TTL on MX (and any related records you will change) to something like 300 seconds.
  2. Wait at least the old TTL + safety margin (I prefer 2× the old TTL if you can) so caches naturally refresh; a small convergence check is sketched after this list.
  3. Cutover: change MX to the new provider while TTL is low.
  4. After stability: raise TTL back to a sane value (often 1800–3600 seconds) to reduce query volume and noise.
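
A minimal convergence watcher for steps 2–3, assuming example.com and a few public resolvers (swap in whatever resolvers your important senders actually use):

# Re-check every 5 minutes and log with timestamps; stop once every resolver shows the low TTL.
while true; do
  date -u
  for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
    echo "== $r =="
    dig +noall +answer example.com MX @"$r"
  done
  sleep 300
done | tee -a mx-convergence.log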

Which records to stage (most teams miss at least one)

MX is the headline, but email migrations touch a small constellation of DNS. Stage TTL for anything you will
change during the window (a quick TTL inventory loop is sketched after the list):

  • MX (obvious)
  • A/AAAA for the mail hostnames referenced by MX (sometimes the provider gives you hostnames you don’t control; sometimes you do)
  • TXT for SPF changes (if you’re changing outbound)
  • TXT for DKIM selector records (if you’re rotating or adding)
  • TXT for DMARC (if you’re tightening policy post-migration)
  • CNAME for autodiscover/autoconfig endpoints (client pain is still pain)
  • SRV records in some enterprise setups (less common, but don’t assume)
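
Here is that inventory loop as a sketch, assuming example.com, a DKIM selector named selector1, and an Exchange-style autodiscover CNAME (adjust the names to your zone):

for q in "example.com MX" "example.com TXT" "_dmarc.example.com TXT" \
         "selector1._domainkey.example.com TXT" "autodiscover.example.com CNAME"; do
  set -- $q
  echo "== $1 $2 =="
  dig +noall +answer "$1" "$2"
done

The second column of each answer is the TTL you are staging.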

How low is “low TTL”?

300 seconds (5 minutes) is the workhorse. 60 seconds is tempting but often pointless: not all resolvers behave
better at 60 than 300, and you increase query volume and potential for rate limiting or logging noise.
If your DNS provider is flaky, a too-low TTL can amplify the flakiness into a visible outage.

What TTL does not solve

TTL reduces the time it takes for the world to notice your new records. It does not:

  • Force a sender to retry immediately.
  • Fix firewall issues or port 25 reachability.
  • Repair SPF/DKIM alignment.
  • Magically move mail already delivered to the old mailbox.

A single quote worth keeping on a sticky note

Everything fails all the time. — Werner Vogels

Treat the MX flip as a failure you’re planning to survive. TTL staging is your easiest resilience win.

Timelines that work: T-7 days to T+2 days

Below is a migration schedule that assumes you’re switching inbound mail routing (MX cutover) and possibly outbound.
Adapt the days to your organization, but keep the ordering. Email punishes improvisation.

T-7 to T-3 days: make DNS boring

  • Inventory current TTLs for MX and related records.
  • Lower TTLs to 300 seconds for records you will change.
  • Confirm that authoritative DNS is actually serving the new TTL (don’t trust the UI).
  • Run a “pre-cutover” verification from multiple resolvers (your ISP, public resolvers, corporate recursors).

T-2 to T-1 days: verify the new receiving side can accept mail

  • Confirm inbound SMTP works to the new provider (test domains, pilot users, or alternate MX).
  • Confirm anti-spam, TLS settings, connector rules, and inbound allowlists/denylists.
  • Prepare rollback: old MX records ready to restore, and confirm old side can still accept mail.

T-0 day: cutover, observe, and don’t panic

  • Change MX.
  • Watch inbound mail flow and queue behavior on both sides.
  • Confirm authentication results (SPF/DKIM/DMARC) on actual messages.
  • Track straggler traffic still arriving at the old system; decide whether to forward, relay, or fetch.

T+1 to T+2 days: stabilize and raise TTL

  • Raise TTL back to 1800–3600 seconds once you’re confident.
  • Keep monitoring for retries and delayed mail.
  • Decommission old inbound gradually, not immediately, unless risk demands it.

Fast diagnosis playbook

When the inevitable “email is down” message arrives, you need a fast way to locate the bottleneck.
Don’t start by arguing about DNS propagation. Start by narrowing the blast radius.

First: is it DNS, SMTP reachability, or acceptance policy?

  1. Check what MX the world sees from at least two independent resolvers (checks 1 and 2 are sketched after this list).
  2. Check whether the destination accepts TCP/25 and responds with an SMTP banner.
  3. Check whether the server accepts a test message (even just MAIL FROM/RCPT TO).
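
A combined sketch of checks 1 and 2, using example.com and the illustrative new MX hostname from the tasks below:

for r in 1.1.1.1 8.8.8.8; do
  echo "== MX seen by $r =="
  dig +noall +answer example.com MX @"$r"
done
# Reachability plus banner for the host your MX should now point to
printf 'QUIT\r\n' | nc -w 5 mx1.newmail.example.net 25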

Second: is mail being delivered somewhere else?

  1. Search old system logs/queues for recent inbound.
  2. Check new system message trace for attempted deliveries.
  3. Look for bounces and deferred messages in sender logs (if internal) or user-provided bounce headers.

Third: is authentication causing “soft outage”?

  1. Verify SPF includes the correct sending systems.
  2. Verify DKIM signing and selector correctness.
  3. Verify DMARC policy isn’t rejecting newly legitimate mail.

Fourth: does anything cache longer than TTL?

  1. Check enterprise forwarders/recursors for stale data (a comparison against the authority is sketched after this list).
  2. Check if upstream security appliances do DNS filtering/caching.
  3. Decide whether to temporarily override with local resolver flushes for critical apps (rare, but sometimes needed).
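
A quick way to expose a stale layer is to compare a suspect internal resolver with the authority directly; the 10.0.0.53 forwarder address is a placeholder for your own:

dig +noall +answer example.com MX @ns1.dnsprovider.net   # what the authority serves right now
dig +noall +answer example.com MX @10.0.0.53             # what your internal forwarder still serves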

Hands-on tasks: commands, outputs, and decisions (12+)

The difference between a calm migration and a headline is usually whether you can answer basic questions quickly:
“What MX is active?”, “Who’s caching what?”, “Is port 25 reachable?”, “Are we rejecting mail because of policy?”
Below are practical tasks you can run from a Linux box with common tools.

Task 1: Query current MX and TTL from your default resolver

cr0x@server:~$ dig +noall +answer example.com MX
example.com.        3600    IN      MX      10 mx1.oldmail.example.net.
example.com.        3600    IN      MX      20 mx2.oldmail.example.net.

What it means: TTL is 3600 seconds. Many resolvers may keep this answer for up to an hour after caching.

Decision: If cutover is within the next day, you’re late. Lower TTL now and accept the longer transition, or reschedule.

Task 2: Query authoritative nameservers directly (bypass caches)

cr0x@server:~$ dig +noall +answer example.com MX @ns1.dnsprovider.net
example.com.        300     IN      MX      10 mx1.oldmail.example.net.
example.com.        300     IN      MX      20 mx2.oldmail.example.net.

What it means: The authority is serving TTL 300 now. Good. Caches will converge as their old TTLs expire.

Decision: Note the time you changed it. The earliest safe cutover is after old TTL expiration + margin.

Task 3: Compare different public resolvers (real propagation, not vibes)

cr0x@server:~$ for r in 1.1.1.1 8.8.8.8 9.9.9.9; do echo "== $r =="; dig +noall +answer example.com MX @$r; done
== 1.1.1.1 ==
example.com.        3600    IN      MX      10 mx1.oldmail.example.net.
example.com.        3600    IN      MX      20 mx2.oldmail.example.net.
== 8.8.8.8 ==
example.com.        300     IN      MX      10 mx1.oldmail.example.net.
example.com.        300     IN      MX      20 mx2.oldmail.example.net.
== 9.9.9.9 ==
example.com.        1800    IN      MX      10 mx1.oldmail.example.net.
example.com.        1800    IN      MX      20 mx2.oldmail.example.net.

What it means: Different resolvers cached at different times. This is normal and exactly why staging TTL matters.

Decision: Don’t cut over until the big resolvers are showing the low TTL consistently, unless you can tolerate split delivery.

Task 4: Confirm what hostnames the MX records point to (and whether you control their DNS)

cr0x@server:~$ dig +noall +answer mx1.oldmail.example.net A
mx1.oldmail.example.net. 300 IN A 203.0.113.10

What it means: MX target resolves to an IP with TTL 300. If this record is in a vendor zone you don’t control, you can’t stage its TTL.

Decision: If you control it, stage it. If you don’t, accept that the provider owns that risk.

Task 5: Check for “dangling” MX targets (a classic cutover foot-gun)

cr0x@server:~$ dig +noall +answer mx-new.example.com A

What it means: No answer. If you point MX at a name that doesn’t resolve, senders will queue/retry, and users will claim email is dead.

Decision: Do not cut over. Fix DNS first. Also check for negative caching timing via SOA.

Task 6: Check SOA values to understand negative caching behavior

cr0x@server:~$ dig +noall +answer example.com SOA
example.com. 3600 IN SOA ns1.dnsprovider.net. hostmaster.example.com. 2026010401 7200 900 1209600 300

What it means: The final field (300) is the SOA minimum, which resolvers use as the negative caching TTL (capped by the SOA record’s own TTL).

Decision: If you’re about to add missing records, expect NXDOMAIN to persist up to this value after caches have seen it.
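
If you want that value in a script, the minimum is the last token of the SOA RDATA; against the same example zone shown above this prints 300:

cr0x@server:~$ dig +short example.com SOA | awk '{print $NF}'
300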

Task 7: Validate SMTP reachability to the new MX host (TCP/25)

cr0x@server:~$ nc -vz mx1.newmail.example.net 25
Connection to mx1.newmail.example.net 25 port [tcp/smtp] succeeded!

What it means: Network path is open. This does not prove acceptance, but it clears the first hurdle.

Decision: If it fails, fix firewall/security groups/routing before touching MX.

Task 8: Check the SMTP banner and basic handshake

cr0x@server:~$ openssl s_client -starttls smtp -crlf -connect mx1.newmail.example.net:25 -servername mx1.newmail.example.net
CONNECTED(00000003)
---
220 mx1.newmail.example.net ESMTP ready
...
250-STARTTLS
250 SIZE 52428800

What it means: The server speaks SMTP and offers STARTTLS. You have evidence it’s alive and somewhat modern.

Decision: If STARTTLS is required by policy, ensure it’s offered and certificates validate in your environment.
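
If your policy requires valid certificates, the same kind of connection can feed a certificate check; the hostname is the illustrative one from this task:

cr0x@server:~$ openssl s_client -starttls smtp -connect mx1.newmail.example.net:25 -servername mx1.newmail.example.net </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates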

Task 9: Do a minimal SMTP transaction to validate acceptance policy

cr0x@server:~$ printf "EHLO test.example\r\nMAIL FROM:<probe@example.org>\r\nRCPT TO:<postmaster@example.com>\r\nDATA\r\nSubject: smtp probe\r\n\r\nhello\r\n.\r\nQUIT\r\n" | nc -w 5 mx1.newmail.example.net 25
220 mx1.newmail.example.net ESMTP ready
250-mx1.newmail.example.net
250 PIPELINING
250 2.1.0 Ok
250 2.1.5 Ok
354 End data with <CR><LF>.<CR><LF>
250 2.0.0 Queued
221 2.0.0 Bye

What it means: The new system accepts mail for your domain and queues it for delivery. That’s the core of “inbound works.”

Decision: If RCPT TO is rejected, fix domain verification, accepted domains, routing rules, or anti-spam settings.

Task 10: Check SPF record content before and after outbound changes

cr0x@server:~$ dig +short TXT example.com
"v=spf1 include:_spf.oldprovider.example -all"

What it means: SPF currently authorizes the old provider only.

Decision: If outbound will switch, add the new sender include or IPs before users start sending from the new platform.
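
During the overlap, a transitional record that authorizes both platforms is common; the new include name below is a placeholder for whatever your provider documents:

"v=spf1 include:_spf.oldprovider.example include:_spf.newprovider.example -all"

Keep an eye on SPF’s limit of 10 DNS-querying mechanisms; stacking includes during a migration is exactly how records quietly blow past it.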

Task 11: Confirm DKIM selector records exist (common “it worked in the wizard” trap)

cr0x@server:~$ dig +short TXT selector1._domainkey.example.com
"v=DKIM1; k=rsa; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A..."

What it means: The selector record is present in DNS. Without it, recipients can’t verify your signatures.

Decision: If missing, do not enable “DKIM required” policies or strict DMARC yet. Fix DNS first.

Task 12: Confirm DMARC policy (to avoid self-inflicted rejection)

cr0x@server:~$ dig +short TXT _dmarc.example.com
"v=DMARC1; p=reject; rua=mailto:dmarc-rua@example.com; adkim=s; aspf=s"

What it means: Strict alignment and reject policy. During migration, this can turn minor auth drift into hard failure.

Decision: If you’re changing outbound or DKIM, consider temporarily relaxing to p=quarantine or alignment modes until stable.
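
If you do relax it, a transitional policy might look like this; keep the rua address from the record above so you still receive reports while things settle:

"v=DMARC1; p=quarantine; rua=mailto:dmarc-rua@example.com; adkim=r; aspf=r"

Tighten back to p=reject and strict alignment once the reports show the new platform aligning consistently.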

Task 13: Trace the authoritative chain for MX (what the authority serves, not what clients have cached)

cr0x@server:~$ dig +trace example.com MX
; <<>> DiG 9.18.24 <<>> +trace example.com MX
.                       518400  IN      NS      a.root-servers.net.
...
example.com.            300     IN      MX      10 mx1.newmail.example.net.
example.com.            300     IN      MX      20 mx2.newmail.example.net.

What it means: Trace shows authoritative chain and final MX answers. If this is correct but some clients still see old MX, it’s caching downstream.

Decision: Use this to prove what authority serves; then focus on resolvers, not DNS UI screenshots.

Task 14: Inspect local resolver cache behavior (systemd-resolved example)

cr0x@server:~$ resolvectl query example.com --type=MX
example.com: 10 mx1.oldmail.example.net
             20 mx2.oldmail.example.net

-- Information acquired via protocol DNS in 1.9ms.
-- Data is authenticated: no

What it means: Your host is still seeing old MX. That may be your local cache, not the world.

Decision: Flush local caches (carefully) or query an external resolver directly to avoid self-misdiagnosis.
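
On a systemd-resolved host, flushing the local stub cache looks like this; it only fixes the machine you run it on, so treat it as a diagnostic step rather than a remediation:

cr0x@server:~$ sudo resolvectl flush-caches
cr0x@server:~$ resolvectl statistics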

Task 15: Confirm old system is still receiving mail (to detect split delivery)

cr0x@server:~$ tail -n 5 /var/log/maillog
Jan 04 01:10:21 oldmx postfix/smtpd[18273]: connect from mail-oi1-f170.google.com[209.85.167.170]
Jan 04 01:10:22 oldmx postfix/smtpd[18273]: 3F2A01234: client=mail-oi1-f170.google.com[209.85.167.170]
Jan 04 01:10:22 oldmx postfix/cleanup[18280]: 3F2A01234: message-id=<CA+demo@example.org>
Jan 04 01:10:23 oldmx postfix/qmgr[912]: 3F2A01234: from=<sender@example.org>, size=1842, nrcpt=1 (queue active)
Jan 04 01:10:23 oldmx postfix/local[18290]: 3F2A01234: to=<user@example.com>, relay=local, delay=1.1, dsn=2.0.0, status=sent

What it means: The old MX is still getting mail from at least some sources. That’s split delivery in action.

Decision: Decide whether to forward/relay from old to new, or keep the old mailbox accessible until stragglers taper off.

Task 16: Measure “who is still using old MX” by sampling Received headers

cr0x@server:~$ grep -m 1 -i '^Received:' /var/mail/user | head -n 1
Received: from mx1.oldmail.example.net (mx1.oldmail.example.net [203.0.113.10]) by oldmx.example.com with ESMTP id 3F2A01234

What it means: Message passed through the old infrastructure. Headers are forensic gold during migrations.

Decision: If key partners still route to the old MX days later, investigate whether they use internal resolvers with sticky caching or static routes.
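
To turn spot checks into a rough count, a sketch like this over the old local mail store works (path and hostname match the earlier examples; adjust for Maildir or other layouts):

cr0x@server:~$ grep -h '^Received: from' /var/mail/* 2>/dev/null | grep -c 'mx1.oldmail.example.net'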

Three corporate mini-stories (how this goes wrong in real life)

Mini-story #1: The incident caused by a wrong assumption

A mid-sized company moved from a self-hosted Postfix/Dovecot stack to a hosted email platform. The plan was
straightforward: sync mailboxes, switch MX on Friday night, call it done.

The project lead assumed “DNS propagation” would be done in minutes because their DNS console updated instantly.
They didn’t check the existing TTL. It was set to 14,400 seconds (4 hours), a relic from an era when their
authoritative servers were fragile and they were trying to reduce query load.

They flipped MX at 9 p.m. By 9:15 p.m., the first on-call page hit: some users were receiving mail in the new
inbox, others weren’t. By 10 p.m., executives were forwarding screenshots. By midnight, the helpdesk had a queue
of “missing email” tickets that were real, because mail was split across two systems with no automatic bridging.

The team tried to “force propagation” (which is not a thing) by making more DNS changes. That made it worse.
Some resolvers refreshed and got a different answer; others stayed pinned. Their monitoring, which used a single
public resolver, showed “MX is correct.” Reality didn’t care.

The eventual fix was not heroic. They re-enabled delivery on the old system, set up forwarding from old to new,
and waited out the TTL. The postmortem root cause was a sentence that should be printed on mugs: they confused
“authoritative change is live” with “clients have re-queried.”

Mini-story #2: The optimization that backfired

A large enterprise had a central DNS team that prided itself on performance. They ran aggressive caching
resolvers and even fronted them with internal forwarders at remote sites. Latency was great. Reliability was
mostly great. And then the email team scheduled a tenant migration.

The email team did the right thing: they lowered MX TTL to 300 seconds a week in advance. They verified the
authoritative nameservers. They ran dig from a couple of laptops. Everything looked ready.

Cutover happened and… remote sites kept delivering inbound mail to the old provider for hours. Not minutes.
The DNS team’s “optimization” was a prefetch-and-serve behavior: resolvers fetched ahead of TTL expiration and
remote forwarders had their own caching layers. TTL was honored in the letter-of-the-law sense, but the effective
behavior during change windows was sticky.

The email team initially blamed the new provider. The provider blamed the sender. Meanwhile, the old provider
kept accepting mail because nobody wanted to decommission during a transition. Split delivery became an all-day
issue that only affected certain locations, which is the worst kind of issue: the one that looks like user error.

The fix was to route around the “smart” caching during cutover: temporarily point remote sites at a resolver
that did not prefetch and did not chain caches, then revert after the window. Also, the DNS team documented their
caching layers and added a change-management flag for “migration windows,” where prefetch behavior is disabled.

Mini-story #3: The boring but correct practice that saved the day

A financial firm migrated email as part of a broader identity and endpoint overhaul. The email piece was not
glamorous. That was exactly why it worked. They ran a checklist, stuck to it, and refused to cut over without
evidence.

Two weeks before cutover, they lowered TTL on MX, SPF, DKIM selectors, and autodiscover records. They logged
the change time and calculated the “safe flip” time based on the old TTL. They ran queries against their
authoritative servers and three external resolvers twice daily and recorded results in a shared runbook.

On cutover day, their monitoring didn’t just check “MX equals expected.” It checked “MX TTL is low,” “SMTP banner
matches the new system,” and “test messages are accepted and traced end-to-end.” They also kept the old inbound
system alive but configured it to relay mail to the new system. That meant stragglers were no longer a user-facing
problem. They were just a metric.

The migration still had minor issues: a few partners had hard-coded old MX hosts in their devices (yes, that’s real),
and one internal app used a local resolver that didn’t refresh often. But because DNS was staged, they could
focus on those edge cases without global chaos.

Their secret wasn’t a tool. It was discipline. The boring practice was “measure before you change, and keep a rollback
that works.” The result was a cutover so uneventful that leadership forgot it happened, which is the highest compliment
you can get in operations.

Common mistakes: symptom → root cause → fix

1) Symptom: Some senders deliver to old provider hours after MX cutover

Root cause: TTL wasn’t lowered early enough, or multiple caching layers (enterprise forwarders, site resolvers) keep stale answers.

Fix: Lower TTL days ahead next time. For now, keep old inbound accepting and relay/forward to the new system. Identify stubborn resolvers and bypass temporarily.

2) Symptom: Mail queues on sender side with “connection timed out”

Root cause: Port 25 blocked to the new MX, or the new provider requires specific source IPs/connector configuration.

Fix: Validate TCP/25 reachability from outside your network. Adjust firewall, cloud security groups, and provider inbound connectors/allowlists.

3) Symptom: Bounces saying “Recipient address rejected” after cutover

Root cause: Domain not fully verified/accepted on the new platform; routing rules misconfigured; mailbox not provisioned yet.

Fix: Confirm accepted domains, recipient provisioning, and catch-all/alias policies. Test RCPT TO against multiple recipients.

4) Symptom: Email “works” but is now landing in spam or being rejected by strict receivers

Root cause: SPF not updated for new outbound, DKIM not signing or selector missing, DMARC strict alignment rejects.

Fix: Update SPF to include new outbound sources, publish DKIM selectors, verify signing, temporarily relax DMARC policy during transition.

5) Symptom: Internal users see new MX; external partners see old MX

Root cause: Your internal DNS differs from public (split-horizon) or internal forwarders override public DNS.

Fix: Ensure internal DNS either mirrors public MX or is deliberately managed. During cutover, avoid inconsistent views unless you’re intentionally doing phased routing.

6) Symptom: Outlook/mobile clients break during cutover even though SMTP delivery works

Root cause: Autodiscover/autoconfig DNS records still point to old endpoints; TTL too high on CNAME/SRV records.

Fix: Stage TTL and update client configuration records. Test from clean devices/networks.

7) Symptom: After switching MX, some mail disappears “into the ether”

Root cause: Old system accepts but stores locally; no forwarding/relay; users stop checking old inbox.

Fix: Configure old inbound to relay to new during the transition. Communicate access to old mailboxes until traffic stops.
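
If the old inbound is Postfix, as in the earlier log excerpts, a minimal relay setup is roughly this; the new MX hostname is illustrative and your accepted-domain handling may differ:

# /etc/postfix/main.cf on the old inbound host:
#   remove example.com from mydestination, then relay it onward
relay_domains = example.com
transport_maps = hash:/etc/postfix/transport

# /etc/postfix/transport
example.com    smtp:[mx1.newmail.example.net]

# apply it:
# postmap /etc/postfix/transport && postfix reload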

Joke #2: The only thing more persistent than DNS cache is an executive who “just wants a quick update” during an incident.

Checklists / step-by-step plan

Phase 1: Inventory and staging (do this early)

  1. List all records you will change: MX, SPF TXT, DKIM TXT, DMARC TXT, autodiscover CNAME/SRV, any inbound gateways.
  2. Record current TTLs and calculate your staging window (old TTL is the minimum wait before benefits appear).
  3. Lower TTLs to 300 seconds on those records.
  4. Verify authoritative DNS is serving the new TTL (query @nameserver, not your laptop cache).
  5. Verify multiple external resolvers are converging on the lower TTL over time.

Phase 2: Pre-cutover verification (prove the new system works)

  1. Confirm new MX endpoints accept TCP/25 and present valid SMTP banners.
  2. Perform a controlled SMTP transaction to a test recipient.
  3. Verify message trace on the new platform shows accepted and delivered.
  4. Ensure old platform can remain online to receive/relay mail during transition.
  5. Prepare rollback: exact old MX values, TTLs, and the person authorized to execute.

Phase 3: Cutover execution (keep the blast radius small)

  1. Change MX to new provider while TTL is low.
  2. Immediately query authoritative and multiple resolvers to confirm new answers are live.
  3. Send test mails from at least two external sources (personal mailbox + a service you control).
  4. Watch old system logs for inbound traffic; if it’s still receiving, relay/forward.
  5. Track failures: timeouts, 5xx rejects, spam placement. Fix the biggest first (usually reachability or acceptance policy).

Phase 4: Stabilization and cleanup (don’t leave sharp edges)

  1. Once stable, raise TTL back to 1800–3600 seconds (or your standard).
  2. Revisit DMARC policy if you relaxed it; tighten gradually.
  3. Keep old inbound running long enough to catch retries and stragglers; then decommission deliberately.
  4. Write down what you learned: actual propagation timeline, stubborn resolvers, and what monitoring worked.

FAQ

1) If I lower TTL to 300 seconds, does that guarantee the cutover completes in 5 minutes?

No. It means caches are allowed to keep answers for 5 minutes after they fetch them. If a resolver fetched
right before you lowered TTL, it may keep the old answer for the old TTL duration. That’s why you stage early.

2) How far in advance should I lower TTL?

At least the current TTL, plus buffer. If MX TTL is 3600, lower it at least a day before a major cutover so you’re
not racing random caches. For very high TTLs (8–24 hours), stage a week out and sleep better.

3) Should I lower TTL for SPF/DKIM/DMARC too?

If you plan to change them during the migration window, yes. Authentication failures rarely look like “down,” but
they do look like “email stopped arriving,” which is close enough.

4) What TTL value should I use during the cutover?

300 seconds is a strong default. Go lower only if you have a specific reason and you trust your DNS provider under load.

5) Can I “force DNS propagation” by changing records multiple times?

No. Repeated changes can create oscillation where different caches hold different answers. Make one planned change,
verify it at authority, and let caches expire naturally.

6) If email is delayed after cutover, is that always DNS?

Not even close. SMTP retries, greylisting, inbound policy rejects, TLS negotiation problems, and spam filtering
can all create delays. DNS is just the first thing to prove or eliminate.

7) What’s the safest rollback plan for MX changes?

Keep the old MX values and ensure the old system can still accept mail. With TTL staged low, rolling back is fast.
Without TTL staging, rollback can take as long as cutover took—sometimes longer, because now caches are mixed.

8) Do I need to coordinate with partners or big senders?

If you have key partners who send high-value mail (banks, payroll, ticketing systems), tell them the cutover window.
Also ask whether they hard-code routes. Some legacy systems literally store the MX hostname in configuration.

9) What about on-prem environments with internal DNS split-horizon?

Decide whether internal users should follow public MX or a different route. If internal DNS is different, document it
and test it. Split-horizon is fine when it’s intentional, and painful when it’s accidental.

10) When should I raise TTL back up?

After you’ve observed stable inbound delivery and you’ve confirmed straggler traffic to the old system is near zero
(or at least safely relayed). Typically 24–72 hours post-cutover is conservative and sane.

Conclusion: next steps that keep you employed

TTL staging isn’t a hack. It’s basic operational hygiene. If you do it early, your MX cutover becomes a controlled
change with a short convergence window and a real rollback. If you skip it, you’re signing up for split delivery,
confusion, and an incident call where everyone argues about “propagation” like it’s weather.

Practical next steps:

  1. Right now: run dig and record your current MX TTL.
  2. Today: lower TTL to 300 seconds for every email-related record you expect to change.
  3. Tomorrow: verify authoritative answers and cross-check multiple resolvers.
  4. Cutover day: change MX once, test SMTP acceptance, and monitor both old and new inbound paths.
  5. Afterwards: raise TTL back, then clean up authentication and decommission old inbound deliberately.