Email migrations fail in one of two ways: loudly (users can’t send) or quietly (messages disappear into the fog, and you only find out when Finance says “we never got it”). If you’re moving thousands of mailboxes and you can’t afford downtime, you don’t need heroics. You need choreography.
This is the move plan that production teams use when the inbox is a business process, not a feature. It’s practical, opinionated, and paranoid in the right places. The goal is simple: no lost messages, minimal user disruption, and an exit ramp if anything smells wrong.
What “zero downtime” really means for email
“Zero downtime” in email doesn’t mean “no changes.” It means the user experience stays continuous while the plumbing changes underneath. Users keep receiving mail. They can send mail. Their clients keep authenticating. And if something goes sideways, you can roll back without dropping mail on the floor.
The hard part is that email is not one protocol or one server. It’s a set of contracts:
- Inbound delivery contract: the internet knows where to deliver mail for your domain via MX, and receivers trust you via SPF/DKIM/DMARC.
- Outbound sending contract: your users and apps can submit messages, and your egress IPs/domains aren’t blacklisted.
- Mailbox access contract: IMAP/POP/ActiveSync/REST clients can authenticate and find the same folders and messages as yesterday.
- Identity contract: addresses, aliases, groups, and permissions mean the same thing pre- and post-migration.
- Compliance contract: retention, journaling, holds, and audit logs remain intact.
Most “lost email” during migrations is not truly lost. It’s misrouted, rejected, quarantined, duplicated, or delivered into a mailbox the user isn’t looking at anymore. That’s fixable, but only if you design for observability and rollback.
Here’s the simplest operational definition that holds up in incident reviews:
- No message loss: every message accepted by any of your systems can be accounted for and delivered exactly once (or intentionally duplicated with traceability).
- No hard cutover on day one: you run coexistence long enough to prove routing and sync correctness.
- Rollback is real: you can revert inbound and outbound routing without reconfiguring every client in the company.
One quote to keep your team honest, because migrations are basically risk management with nicer meeting invites:
“Hope is not a strategy.” — Gen. Gordon R. Sullivan
Short joke #1: The only thing more permanent than a temporary migration workaround is the calendar invite that keeps getting rescheduled.
Migration models that actually work (and when to use them)
1) Big-bang cutover (avoid unless you have a tiny domain)
This is the classic “flip MX and pray” approach. It can work for a small org with low mail volume and a simple client profile. In production environments with compliance needs, it’s asking for an incident ticket avalanche.
Use only when: you have a few dozen mailboxes, no complex routing, no third-party senders you don’t control, and you can accept a short mailflow risk window.
2) Staged migration with coexistence (the default for grown-ups)
Some users (or groups) move first. Routing between old and new systems stays correct in both directions. Mailboxes synchronize ahead of time. You test end-to-end flows repeatedly.
Use when: you have hundreds to tens of thousands of mailboxes, multiple domains, shared mailboxes, delegates, or a serious helpdesk.
3) Dual delivery (belt-and-suspenders for inbound mail)
Inbound mail is delivered to both systems for a period. That way, even if a user checks the “wrong” mailbox during the transition, the message exists in both places.
Use when: you can tolerate temporary duplicates and you have a plan to reconcile them (or accept them as an operational cost).
4) Gateway/rerouting strategy (smart when you control SMTP edges)
You keep a stable inbound SMTP gateway (or use an email security gateway) and steer delivery to old vs new based on recipient location. MX stays stable; your internal routing changes. This reduces DNS propagation drama and gives you better rollback.
Use when: you already have a gateway layer, or you can build one without making everything worse.
5) Hybrid/coexistence via vendor tooling (works, but verify the assumptions)
Microsoft 365 hybrid configurations and Google Workspace migration tools can be solid. They can also hide the operational levers you need when troubleshooting. Treat them as automation, not as correctness guarantees.
Use when: you have strong directory integration needs, you want smooth client transitions, and you can invest time in verifying mailflow and identity mapping.
Interesting facts and historical context (so you stop repeating history)
- MX records arrived in the mid-1980s to let domains specify mail exchangers, replacing earlier “just use the host name” assumptions in the early SMTP era.
- SMTP predates modern authentication; it was designed for a smaller, more trusting network. That’s why so much of today’s “email reliability” is bolted on later.
- MIME (early 1990s) is why attachments and rich content work across systems. Migrations that break MIME handling will “work” until someone forwards a signed PDF.
- IMAP became the protocol of choice for server-stored mailboxes; it’s flexible but easy to mis-handle during migration due to flags, UID validity, and folder semantics.
- SPF (2000s) reduced simple spoofing by publishing which servers may send for a domain, but it can break legitimate senders you forgot existed.
- DKIM (2000s) made reputation portable by signing messages; cutovers that change selectors/keys can tank deliverability for days if not staged.
- DMARC (2010s) gave policy teeth (reject/quarantine), but strict policies turn a small migration misconfig into a full-on mail outage.
- Greylisting was once a popular anti-spam tactic; it can amplify perceived migration “slowness” because new sending IPs look suspicious and get deferred.
- Large providers now use engagement signals (opens, replies, spam complaints) as part of reputation; a migration that changes sending patterns can trigger filtering even with correct DNS.
The architecture of a safe move: coexistence, routing, and identity
Start with invariants: what must not change
Before you touch MX, decide what stays stable throughout the migration window:
- User addresses: primary SMTP addresses should not change. Aliases should not vanish.
- Inbound hostname and TLS posture: if you can keep the same MX hostname (fronted by a gateway), do it.
- Outbound identity: keep the same envelope-from and header-from patterns as much as possible.
- Directory source of truth: one system is authoritative for mailbox existence and routing attributes at any moment.
Routing strategy: recipient-based steering beats DNS roulette
DNS is global caching with vibes. Even when TTL is low, resolvers ignore you, middleboxes rewrite things, and some senders cache aggressively. A stable SMTP gateway lets you route per-recipient:
- If recipient is migrated: deliver to new platform.
- If not: deliver to old platform.
- If unsure: queue safely, don’t bounce.
This is how you get real rollback: you change a routing map, not public DNS.
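The three steering rules above are small enough to sketch. A minimal illustration in Python (the routing map, platform labels, and function name are all hypothetical; a real gateway would express this as transport/routing tables, not application code):

```python
# Sketch of recipient-based steering at a stable SMTP gateway.
# Platform labels and the routing map are illustrative, not a real API.
OLD, NEW, DEFER = "old-platform", "new-platform", "defer"

def route(recipient: str, routing_map: dict[str, bool]) -> str:
    """Decide where the gateway should deliver mail for `recipient`.

    routing_map maps a lowercased address to True once the mailbox has
    been migrated. An address missing from the map is ambiguous: queue
    it (tempfail) rather than guess and risk a bounce.
    """
    migrated = routing_map.get(recipient.lower())
    if migrated is True:
        return NEW      # mailbox lives on the new platform
    if migrated is False:
        return OLD      # not migrated yet: deliver to the old platform
    return DEFER        # unknown state: queue safely, never bounce

routing_map = {"alice@example.com": True, "bob@example.com": False}
```

On a real Postfix edge this map would typically live in a transport table or directory lookup; the point is the three-way outcome, especially "defer when unsure."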
Coexistence: the unglamorous superpower
Coexistence is not a checkbox. It’s a period where you prove:
- Mail from internet → migrated users lands in the new mailbox.
- Mail from internet → non-migrated users lands in the old mailbox.
- Mail between old ↔ new users behaves normally, including reply chains and calendar invites if relevant.
- Aliases, groups, shared mailboxes, and delegates still work.
Identity and auth: you’re migrating trust, not just data
Email access is glued to identity. If you change auth (new IdP, new MFA, new conditional access), do it as a separate project unless you enjoy chaotic multi-variable debugging.
My bias: stabilize identity first, then migrate mailboxes. Or migrate mailboxes first while keeping identity stable. Don’t do both simultaneously unless you have a lab that mirrors production and time to use it.
Storage reality: mailboxes are datasets, not vibes
Mailbox stores behave like databases. Migration throughput depends on:
- IOPS and latency on source and destination stores
- Indexing and search rebuild behavior
- Concurrency limits (per-user, per-tenant, per-connector)
- Network throughput and packet loss
- Throttling policies (especially in SaaS)
If you want zero downtime, design for queueing and retries, and assume you’ll be rate-limited.
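The queue-and-retry posture can be sketched as a backoff schedule. The bounds below mirror the Postfix minimal/maximal backoff style shown later in this article; the specific numbers and jitter factor are illustrative:

```python
import random

def backoff_schedule(min_s: int = 300, max_s: int = 4000,
                     attempts: int = 6) -> list[float]:
    """Exponential backoff between Postfix-style min/max bounds.

    Mirrors the 'delay, don't bounce' design: each retry waits longer,
    capped at max_s, with up to 10% jitter so thousands of queued
    messages don't all retry in the same second.
    """
    waits = []
    delay = min_s
    for _ in range(attempts):
        waits.append(delay + random.uniform(0, delay * 0.1))  # add jitter
        delay = min(delay * 2, max_s)  # double, but never exceed the cap
    return waits
```

The cap matters: without it, a long throttling event pushes retries out so far that the backlog drains slower than it needs to.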
Checklists / step-by-step plan: the staged cutover
Phase 0: Decide your migration success criteria (write it down)
- Accepted inbound mail is never lost; worst-case it is queued and delivered late.
- Outbound deliverability stays within an agreed tolerance (bounces, spam placement, delays).
- Users can access mailboxes via agreed clients (desktop, mobile, web).
- Compliance controls (journaling, holds, retention) remain effective.
- Rollback: inbound and outbound can revert within a defined time window.
Phase 1: Inventory and dependency mapping (the part people skip)
- List all domains and subdomains that send or receive mail.
- Identify all inbound paths: direct-to-MX, via security gateway, via cloud filtering.
- Identify outbound senders: user submission, SMTP relays, apps, printers, monitoring systems.
- Catalog shared mailboxes, groups, aliases, resource mailboxes, and delegates.
- Export a list of “special” users: exec assistants, legal, finance, support queues.
- Document existing SPF/DKIM/DMARC and any third-party senders.
Phase 2: Build coexistence routing
- Stand up or configure your stable SMTP edge (gateway) if possible.
- Implement recipient-based routing to old vs new.
- Set conservative queue lifetimes and retry backoff on the gateway.
- Enable logging with message IDs that let you trace end-to-end.
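The last bullet deserves to be concrete. One way to get traceable, greppable routing logs is one structured line per decision; the field names here are invented for illustration, not any gateway's native format:

```python
import json
import time

def log_decision(queue_id: str, message_id: str, recipient: str,
                 decision: str, detail: str = "") -> str:
    """Emit one JSON line per gateway decision so any message can be
    traced end-to-end later by queue ID or Message-ID.

    `decision` might be: accepted / routed-old / routed-new /
    deferred / bounced. Field names are illustrative.
    """
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "queue_id": queue_id,
        "message_id": message_id,
        "recipient": recipient,
        "decision": decision,
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)
```

One greppable record per decision is what later turns "we think mail is missing" into a chain of custody.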
Phase 3: Migrate identities and objects (before mail data)
- Provision mail-enabled users in destination with correct primary/alias addresses.
- Recreate shared mailboxes, groups, resources, and permissions mappings.
- Set mailbox routing attributes (or routing maps) so coexistence knows where to deliver.
- Test authentication paths for a pilot group.
Phase 4: Pre-seed mailbox data (bulk copy while users keep working)
- Run initial sync of historical mail and calendars if applicable.
- Expect this to take the longest and be the most throttled.
- Measure throughput; adjust concurrency; don’t melt the source server.
- Verify folder counts, item counts, and spot-check message integrity.
Phase 5: Pilot cutover (small blast radius)
- Select pilot users across roles, not just IT.
- Cut over mail routing for pilots (not the whole domain).
- Switch client autodiscover/config for pilots.
- Monitor delays, bounces, and user-reported anomalies for at least a full business day.
Phase 6: Wave migrations (repeatable, boring, controlled)
- Migrate in waves sized to your support capacity and platform throttling.
- Run delta sync close to each wave cutover.
- Keep coexistence routing in place until the last wave is stable.
- Keep the old system receiving (or at least queueing) until you have confidence and backups.
Phase 7: Domain-level cutover (if needed)
- Only after coexistence has proven correct.
- Lower TTL in advance, but assume it won’t save you everywhere.
- Change MX and update SPF/DKIM/DMARC deliberately with staging.
- Keep the old MX (or gateway) able to accept mail for a while to catch stragglers.
Phase 8: Decommission safely
- Freeze changes and keep read-only access to old mailboxes for an agreed period.
- Export final logs and migration reports.
- Confirm compliance retention and legal hold behavior in destination.
- Decommission in steps: inbound acceptor last, storage last, DNS last.
Practical tasks with commands: verify, decide, proceed
These are the kinds of tasks that keep you out of status meetings where nobody has data. Each one includes: command, what the output means, and the decision you make.
Task 1: Check current MX records and what the world sees
cr0x@server:~$ dig +nocmd example.com MX +noall +answer
example.com. 300 IN MX 10 mx1.oldmail.example.com.
example.com. 300 IN MX 20 mx2.oldmail.example.com.
Meaning: MX points to old system, TTL is 300 seconds. That TTL is advisory, not a promise.
Decision: If you plan to cut MX, lower TTL days ahead and ensure old hosts will keep accepting mail during propagation.
Task 2: Validate DNS TTL for MX and related records before changing anything
cr0x@server:~$ dig example.com MX +noall +answer
example.com. 3600 IN MX 10 mx-gw.example.com.
Meaning: TTL is 3600 seconds (1 hour). Some senders will cache longer anyway.
Decision: If you want a tighter cutover window, reduce TTL now and wait out at least the old TTL (preferably 24–48 hours) before changing MX.
Task 3: Inspect SPF to find third-party senders you’ll break
cr0x@server:~$ dig +short TXT example.com
"v=spf1 ip4:203.0.113.10 include:_spf.vendor-a.example include:_spf.vendor-b.example -all"
Meaning: You have at least three sending sources: one IP and two includes.
Decision: Inventory what each include represents. If you’re changing outbound, ensure the new sender IPs/hosts are included or you’ll see silent deliverability failures.
Task 4: Check DMARC policy level (because strict policies magnify mistakes)
cr0x@server:~$ dig +short TXT _dmarc.example.com
"v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com; ruf=mailto:dmarc-forensics@example.com; adkim=s; aspf=s"
Meaning: DMARC is set to reject, with strict alignment. Good for security, unforgiving for migrations.
Decision: If you can’t guarantee DKIM alignment during transition, plan a temporary policy relaxation (carefully) or use a gateway that maintains signing consistently.
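Before deciding whether to relax policy, it helps to read the record programmatically rather than by eye. A minimal tag parser sketch (nowhere near a full RFC 7489 validator, which is stricter about ordering and defaults):

```python
def parse_dmarc(txt: str) -> dict[str, str]:
    """Split a DMARC TXT record into tag/value pairs.

    Minimal sketch: ignores malformed parts and applies no defaults.
    Good enough to answer 'is p=reject with strict alignment?'
    """
    return {k.strip(): v.strip()
            for part in txt.split(";") if "=" in part
            for k, v in [part.split("=", 1)]}

record = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com; adkim=s; aspf=s"
```

If `p` is `reject` and `adkim`/`aspf` are `s`, any alignment wobble during the transition becomes rejected mail, not spam-foldered mail.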
Task 5: Verify the TLS certificate served by your inbound SMTP endpoint
cr0x@server:~$ echo | openssl s_client -starttls smtp -connect mx-gw.example.com:25 -servername mx-gw.example.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
subject=CN = mx-gw.example.com
issuer=CN = Example Intermediate CA
notBefore=Jan 5 00:00:00 2026 GMT
notAfter=Apr 5 23:59:59 2026 GMT
Meaning: Cert is valid and not expiring tomorrow. TLS matters for deliverability and security gateways.
Decision: If the certificate is wrong/expired, fix it before cutover. Otherwise you’ll debug “random delivery problems” that are really TLS failures.
Task 6: Confirm your gateway (Postfix) is queueing safely, not bouncing prematurely
cr0x@server:~$ postconf | egrep 'maximal_queue_lifetime|bounce_queue_lifetime|minimal_backoff_time|maximal_backoff_time'
maximal_queue_lifetime = 5d
bounce_queue_lifetime = 1d
minimal_backoff_time = 300s
maximal_backoff_time = 4000s
Meaning: Messages can remain queued up to 5 days. Bounces happen after 1 day for undeliverable mail.
Decision: During migration, prefer longer queue lifetime and fewer bounces. If the destination has a temporary problem, you want delays, not NDRs.
Task 7: Observe the mail queue depth during migration waves
cr0x@server:~$ while true; do mailq | tail -n 1; sleep 300; done
-- 25 Kbytes in 12 Requests.
-- 18 Kbytes in 9 Requests.
-- 302 Kbytes in 140 Requests.
-- 1.8 Mbytes in 820 Requests.
-- 6.2 Mbytes in 2900 Requests.
Meaning: Each line is mailq's summary, sampled five minutes apart. The queue is growing; if it keeps growing, your downstream is slow or rejecting.
Decision: If queue growth correlates with a wave cutover, pause migrations, inspect downstream errors, and protect inbound acceptance.
Task 8: Find why mail is stuck (deferred reasons)
cr0x@server:~$ grep -E 'status=deferred|status=bounced' /var/log/mail.log | tail -n 6
Feb 4 10:11:02 mx-gw postfix/smtp[23144]: 9A2C11234: to=<user@example.com>, relay=new-mx.example.net[198.51.100.20]:25, delay=1220, delays=0.2/0.1/900/320, dsn=4.2.0, status=deferred (host new-mx.example.net[198.51.100.20] said: 451 4.2.0 Temporary server error)
Feb 4 10:11:07 mx-gw postfix/smtp[23146]: 8BCDE1234: to=<user2@example.com>, relay=new-mx.example.net[198.51.100.20]:25, delay=980, delays=0.1/0.1/600/380, dsn=4.7.0, status=deferred (host new-mx.example.net[198.51.100.20] said: 421 4.7.0 Too many connections)
Meaning: Destination is throttling or erroring temporarily (451/421). Your gateway is doing the right thing by deferring.
Decision: Reduce concurrency to the destination and coordinate with the destination platform limits; do not “fix” this by bouncing.
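When deferrals spike, aggregate the DSN codes instead of eyeballing individual lines: a burst of 4.7.x means throttling, a burst of 4.2.x means destination-side errors. A quick sketch over log lines shaped like the ones above (regex and sample lines are illustrative):

```python
import re
from collections import Counter

# Matches the 'dsn=..., status=deferred' fields in Postfix smtp log lines.
DSN_RE = re.compile(r"dsn=(\d\.\d\.\d), status=deferred")

def deferral_histogram(log_lines: list[str]) -> Counter:
    """Count deferred deliveries by DSN code."""
    return Counter(m.group(1) for line in log_lines
                   if (m := DSN_RE.search(line)))

lines = [
    "... dsn=4.2.0, status=deferred (451 4.2.0 Temporary server error)",
    "... dsn=4.7.0, status=deferred (421 4.7.0 Too many connections)",
    "... dsn=4.7.0, status=deferred (421 4.7.0 Too many connections)",
]
```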
Task 9: Verify outbound path and identity (which server is actually sending)
cr0x@server:~$ grep -R "relayhost" /etc/postfix/main.cf
relayhost = [smtp-out.newplatform.example.net]:587
Meaning: Outbound is relayed to a new platform submission endpoint.
Decision: Ensure SPF includes the new outbound, ensure DKIM signing is correct, and watch for rate limits or authentication errors.
Task 10: Check that your outbound IP matches what recipients will see (basic sanity)
cr0x@server:~$ curl -s ifconfig.me
203.0.113.55
Meaning: This is the public IP from the host you’re testing from (useful if this host sends mail directly).
Decision: If outbound IP changed, update SPF and consider warm-up strategy; a cold IP with high volume gets filtered.
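If you are unsure whether a new egress IP is directly listed in SPF, a narrow check against the TXT string is easy. This sketch only inspects ip4: mechanisms and deliberately ignores include:/a/mx, so treat False as "go look further," not "SPF fails":

```python
import ipaddress

def ip4_covered(spf_record: str, sender_ip: str) -> bool:
    """Check whether sender_ip matches an ip4: mechanism in an SPF TXT.

    Deliberately minimal: does not resolve include:/a/mx mechanisms,
    so a False result means 'not directly listed', nothing more.
    """
    ip = ipaddress.ip_address(sender_ip)
    for token in spf_record.split():
        if token.startswith("ip4:"):
            net = ipaddress.ip_network(token[4:], strict=False)
            if ip in net:
                return True
    return False

spf = "v=spf1 ip4:203.0.113.10 include:_spf.vendor-a.example -all"
```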
Task 11: Verify IMAP folder list on source (what you must preserve)
cr0x@server:~$ doveadm mailbox list -u alice@example.com | head
INBOX
Archive
Drafts
Sent
Trash
Projects
Projects/2025
Meaning: Folder structure exists; nested folders matter. Users will notice if “Projects/2025” becomes “Projects.2025” or vanishes.
Decision: Configure your migration tool to preserve hierarchy and special-use folders; test with real messy mailboxes, not IT’s tidy ones.
Task 12: Confirm message counts on key folders (spot-check, don’t trust summaries)
cr0x@server:~$ doveadm mailbox status -u alice@example.com messages INBOX Archive Sent
INBOX messages=18421
Archive messages=90211
Sent messages=13105
Meaning: Baseline counts. Not perfect (flags and duplicates can skew), but good for gross mismatch detection.
Decision: If destination counts are wildly lower, do not cut over that user. Investigate throttling, excluded folders, or size limits.
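The "gross mismatch" decision can be automated per folder. A sketch with a hypothetical tolerance (the 1% default is a judgment call, not a standard; pick one your business signs off on):

```python
def folders_lagging(source: dict[str, int], dest: dict[str, int],
                    tolerance: float = 0.01) -> list[str]:
    """Return folders whose destination count lags source beyond tolerance.

    A small gap is normal (mail arriving during sync); a large one means
    throttling, excluded folders, or size limits. Investigate before cutover.
    """
    problems = []
    for folder, src_count in source.items():
        dst_count = dest.get(folder, 0)
        if src_count and (src_count - dst_count) / src_count > tolerance:
            problems.append(folder)
    return problems

src = {"INBOX": 18421, "Archive": 90211, "Sent": 13105}
dst = {"INBOX": 18400, "Archive": 71000, "Sent": 13100}
```

Here Archive lags by roughly 21% and should block that user's cutover; INBOX and Sent are within normal drift.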
Task 13: Verify that a specific message exists by Message-ID (the forensic way)
cr0x@server:~$ doveadm search -u alice@example.com mailbox INBOX header Message-ID "<CA+abc123@example.net>" | head
3a1f0b2c4d5e6f708192a3b4c5d6e7f8 18293
Meaning: Each match prints the mailbox GUID and message UID on one line. Empty output means the message is not in that mailbox/folder.
Decision: Use Message-ID searches to prove whether mail is missing vs delivered elsewhere. This is how you keep arguments out of the war room.
Task 14: Watch disk and IO pressure on source mail store (migration can DoS your own storage)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (mailstore01) 02/04/2026 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 5.80 28.30 0.00 53.80
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 820.0 1200.0 84500.0 91000.0 8.20 0.40 82.0
Meaning: iowait is high; device utilization is high. Await of 8ms might be fine or not depending on your mailbox store, but rising await during migration is a red flag.
Decision: If %iowait climbs and users complain, reduce migration concurrency or schedule bulk sync off-hours. Protect production mail access first.
Task 15: Confirm that the old server is still accepting mail post-cutover (stragglers happen)
cr0x@server:~$ swaks --to user@example.com --server mx1.oldmail.example.com --from external-test@sender.example --data "Subject: old-mx test"
=== Trying mx1.oldmail.example.com:25...
=== Connected to mx1.oldmail.example.com.
<= 220 mx1.oldmail.example.com ESMTP
=> EHLO mailtester
<= 250-mx1.oldmail.example.com
=> MAIL FROM:<external-test@sender.example>
<= 250 2.1.0 Ok
=> RCPT TO:<user@example.com>
<= 250 2.1.5 Ok
=> DATA
<= 354 End data with <CR><LF>.<CR><LF>
=> .
<= 250 2.0.0 Ok: queued as 12345ABCDE
Meaning: Old MX still accepts and queues. Good: it can forward, relay, or hold during propagation.
Decision: Keep this capability until you’re confident all senders have moved. Don’t decommission the old edge early just because your laptop resolves the new MX.
Task 16: Validate that DNS changes have propagated across multiple resolvers
cr0x@server:~$ for r in 1.1.1.1 8.8.8.8 9.9.9.9; do echo "Resolver $r"; dig @$r +short example.com MX; done
Resolver 1.1.1.1
10 mx-gw.example.com.
Resolver 8.8.8.8
10 mx-gw.example.com.
Resolver 9.9.9.9
10 mx1.oldmail.example.com.
20 mx2.oldmail.example.com.
Meaning: Different resolvers see different MX. This is normal during propagation. It’s also why “flip MX at 9pm” is not a real plan.
Decision: Maintain dual acceptance and forwarding. Do not assume a single resolver’s view represents the internet.
Fast diagnosis playbook: find the bottleneck in minutes
Email migrations don’t fail creatively. They fail predictably: DNS confusion, routing loops, throttling, authentication drift, or storage overload. When users say “email is broken,” your job is to identify which contract is broken first.
First: Is inbound mail being accepted anywhere?
- Check gateway or MX logs for new connections and whether messages are accepted (250) or rejected (5xx).
- Check queue depth. If inbound accept is fine but queue is exploding, downstream is the problem.
- Send a controlled test message from an external source using a tool like swaks and capture the queue ID.
What this tells you: Are you losing messages at the edge (rejections) or delaying them (deferrals/queue)? Delays are survivable; rejections are reputational damage.
Second: Is routing correct for a specific recipient?
- Pick one impacted user and determine if they are “migrated” per your routing map/directory attribute.
- Trace a message by Message-ID through logs and mailbox search on both sides.
- Look for loops: gateway → new → old → gateway patterns show up as repeated Received headers or repeated queue attempts.
What this tells you: Whether coexistence routing logic is wrong, stale, or inconsistent.
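Loop detection from Received headers can be roughed out with the stdlib email parser. The threshold, sample headers, and "host after `by`" parsing are heuristic; real Received headers are messier:

```python
from email import message_from_string

def looks_like_loop(raw_message: str, threshold: int = 2) -> bool:
    """Heuristic loop check: the same host adding more than `threshold`
    Received headers suggests gateway -> new -> old -> gateway routing.
    """
    msg = message_from_string(raw_message)
    hosts = []
    for received in msg.get_all("Received") or []:
        parts = received.split()
        if "by" in parts:
            # The token after 'by' names the server that added the header.
            hosts.append(parts[parts.index("by") + 1].rstrip(";"))
    return any(hosts.count(h) > threshold for h in set(hosts))

raw = (
    "Received: from a by mx-gw.example.com; Tue, 4 Feb 2026 10:00:00 +0000\n"
    "Received: from b by new-mx.example.net; Tue, 4 Feb 2026 10:00:01 +0000\n"
    "Received: from c by mx-gw.example.com; Tue, 4 Feb 2026 10:00:02 +0000\n"
    "Received: from d by mx-gw.example.com; Tue, 4 Feb 2026 10:00:03 +0000\n"
    "Subject: test\n\nbody\n"
)
```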
Third: Is the destination throttling or failing?
- Look for SMTP 4xx bursts like 421 too many connections, 451 temporary errors, 452 insufficient system storage, 4.7.x rate limits.
- Measure latency (queue delay breakdown) on your gateway logs.
- Check destination health dashboards if available, but trust your own telemetry first.
What this tells you: Whether you should pause migrations, reduce concurrency, or escalate to the destination provider.
Fourth: Is outbound deliverability tanking?
- Inspect SMTP responses from recipient domains for blocks, policy failures, or authentication failures.
- Check SPF/DKIM alignment on a sample of outbound messages.
- Confirm outbound IP/domain reputation signals via your bounce logs and complaint feedback loops if you have them.
What this tells you: Whether you have an authentication mismatch or a reputation/volume pattern shift.
Fifth: Are clients failing (auth/autodiscover) rather than mailflow?
- Check auth logs for MFA/conditional access blocks.
- Confirm autodiscover or configuration endpoints resolve to the correct tenant/service.
- Test with webmail to isolate client configuration from server-side issues.
What this tells you: If the mailbox exists and mail is flowing, but clients can’t find it, your problem is identity/config, not SMTP.
Common mistakes: symptoms → root cause → fix
1) Symptom: “Some senders can’t reach us, others can” after MX cutover
Root cause: DNS propagation variance and resolver caching; old MX decommissioned too early; some senders still target old MX.
Fix: Keep old MX accepting and forwarding for a defined overlap. If possible, keep MX pointing to a stable gateway and route internally by recipient.
2) Symptom: Inbound mail is delayed by hours, then arrives in bursts
Root cause: Destination throttling (421/451/4.7.x), too much concurrent delivery, or greylisting behavior triggered by new IP/host.
Fix: Reduce SMTP concurrency and adjust retry intervals. If you control the gateway, implement per-domain rate limits. Don’t bounce; queue.
3) Symptom: Outbound mail lands in spam after migration, even though SPF passes
Root cause: DKIM missing or misaligned; DMARC alignment failing; sending IP reputation reset; changed HELO/EHLO and rDNS mismatch.
Fix: Stabilize DKIM signing and alignment. Ensure rDNS/HELO are sane. Ramp volume rather than blasting from cold IPs.
4) Symptom: Users report missing folders or “Sent” is empty
Root cause: Folder mapping differences; special-use flags not preserved; migration tool excluded certain folders; client is connected to a new empty profile.
Fix: Verify special-use folder mapping (Sent/Drafts/Trash). Test with a mailbox that has deep nesting and non-ASCII names. Force client profile refresh only after data is present.
5) Symptom: Some internal mail between old and new users bounces
Root cause: Coexistence routing missing for subdomains or aliases; directory shows wrong “mailbox location”; connector rules incomplete.
Fix: Implement deterministic routing: if recipient is migrated, route to new; else old. Ensure aliases and group expansion happen in the correct system.
6) Symptom: Duplicated inbound messages for migrated users
Root cause: Dual delivery enabled without de-duplication plan; forwarding rules exist on old system; users have POP clients pulling and deleting inconsistently.
Fix: Pick one dual-delivery mechanism (gateway copy or journaling copy), not three. Disable user-level forwarding during transition if feasible. Communicate clearly about POP.
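Reconciling dual-delivered duplicates usually keys on Message-ID. A deliberately crude sketch of the idea (real tooling also has to decide which copy's flags and folder placement win):

```python
from email import message_from_string

def dedupe_by_message_id(raw_messages: list[str]) -> list[str]:
    """Keep the first copy of each Message-ID; drop later duplicates.

    Messages without a Message-ID are kept unconditionally: dropping
    them risks real loss, which is worse than a duplicate.
    """
    seen = set()
    kept = []
    for raw in raw_messages:
        mid = message_from_string(raw).get("Message-ID")
        if mid is None or mid not in seen:
            kept.append(raw)
            if mid is not None:
                seen.add(mid)
    return kept

msgs = [
    "Message-ID: <a@x>\n\nfirst copy",
    "Message-ID: <a@x>\n\nduplicate copy",
    "Message-ID: <b@x>\n\nanother message",
    "Subject: no id\n\nkeep messages without a Message-ID",
]
```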
7) Symptom: Migration tool says “completed,” but users can’t find old mail
Root cause: “Completed” means “no more items discovered at last scan,” not “user-visible correctness.” Also: some systems exclude archives by default.
Fix: Validate with message counts, folder lists, and Message-ID spot checks. Include archives explicitly in scope.
8) Symptom: Shared mailbox access breaks for delegates
Root cause: Permissions models differ; delegate rights not migrated; automapping/autodiscover changes; group-based permissions not synchronized.
Fix: Export and reapply permissions as a separate migration stream. Test with exec assistant scenarios. Make shared mailbox cutovers explicit, not accidental.
Three corporate mini-stories from the migration trenches
Mini-story 1: The incident caused by a wrong assumption
They had a plan that looked fine in a slide deck: lower TTL, flip MX, monitor, celebrate. The assumption was that a 300-second TTL meant most senders would switch within a few minutes. That assumption belongs in the same museum as “nobody uses fax anymore.”
Their old inbound servers were shut down early because the new platform was receiving mail and internal tests looked clean. Then the trouble started: a handful of critical external partners—large enterprises with conservative resolver caching and outbound MTAs configured by someone who retired in 2012—kept delivering to the old MX. Those deliveries started hard-failing.
The helpdesk reports were weirdly specific: “Supplier A can’t email AP,” “Bank B says messages bounce,” while everyone else was fine. The team spent two hours looking at the new platform because that’s where “the migration happened.” The new platform was innocent.
The fix was embarrassingly simple: bring the old MX back online, accept mail again, and forward to the new system. The painful part was the reputational hit: bounce messages don’t just hurt feelings, they train senders to mistrust your domain.
After the incident, they rebuilt around a stable gateway so MX didn’t need to change per wave. The key lesson wasn’t “lower TTL earlier.” It was “DNS is a hint; your edge must be resilient to senders who didn’t get the memo.”
Mini-story 2: The optimization that backfired
A different company wanted the migration done fast. They increased migration concurrency aggressively: more threads, more parallel mailbox copies, bigger batches. The dashboards looked great for an hour. Then the source mailbox servers started to crawl.
Users complained that search was slow, opening messages lagged, and the web UI timed out. The migration tool kept retrying on failures, which increased load. A classic positive feedback loop: errors create retries, retries create load, load creates more errors.
The storage layer was the silent victim. The mailbox store sat on shared disks sized for “normal usage,” not for a bulk read of every mailbox plus normal user activity. Latency spiked; IO queues deepened. The email platform didn’t crash, it just became miserable.
They tried to “optimize” further by disabling some indexing tasks and background maintenance to free resources. That helped short-term, then bit them: users later complained that search results were incomplete and some folders looked empty until reindexing caught up.
They recovered by throttling migration concurrency, scheduling bulk sync off-hours, and explicitly protecting production QoS. The most valuable optimization was boring: establish a safe migration rate and stick to it, even if it extends the calendar. Short joke #2: If you think you can “just crank the threads,” congratulations—you’ve invented a denial-of-service attack on your own email.
Mini-story 3: The boring but correct practice that saved the day
This one is less dramatic, which is the point. The team built a stable SMTP gateway in front of both old and new systems. They maintained a routing table keyed by recipient address: old vs new. They also logged every decision: accepted, routed, deferred, delivered, bounced.
During wave three, a destination-side throttling event hit. Messages started deferring with 421 responses. Users noticed delays for newly migrated recipients. But there were no bounces. No panic. The gateway queued messages calmly, like a grown-up.
The on-call engineer ran a quick queue inspection, identified the destination throttle, and reduced concurrency to the new platform. They paused the next wave. The backlog drained overnight. The next morning, users had their mail; a handful arrived late, but nothing vanished.
Later, legal requested proof that a specific message sent during the incident window was not lost. The team pulled the gateway log line with queue ID and correlated it with the destination delivery confirmation and the mailbox Message-ID search. That ended the conversation quickly and politely.
The takeaway: a stable gateway plus disciplined logging is not exciting. It also turns “we think mail is missing” into “here is the chain of custody.”
FAQ
1) Can you really do an email migration with zero downtime?
Yes, if you define downtime correctly: continuous ability to send/receive, plus safe queueing during transient issues. You may still have some client reconfiguration moments, but mailflow should remain continuous.
2) Should we flip MX early or late?
If you can avoid flipping MX by using a stable gateway, do that. If you must flip MX, do it after coexistence is proven, and keep old MX accepting/forwarding for overlap.
3) How long should we keep the old system running?
At minimum: through DNS propagation plus a business-defined confidence window (often weeks). Keep inbound acceptance/forwarding capability longer than you think; storage decommission comes last.
4) What about “lost” mail during the migration window?
Most “lost” mail is misrouted or delayed. Design for traceability: gateway queue IDs, logs with Message-ID, and the ability to search mailboxes for a specific Message-ID on both sides.
5) Is dual delivery worth it?
Sometimes. It reduces the chance a message is only in the “other” mailbox during transition. It also creates duplicates and cleanup work. If you do it, do it deliberately and for a limited time.
6) What’s the biggest deliverability risk during migration?
Authentication and reputation changes: broken DKIM alignment, missing SPF includes for new senders, or sending high volume from a cold IP/domain pattern. Mail can be “delivered” but effectively invisible in spam.
7) How do we handle third-party senders like CRMs and ticketing systems?
Inventory them early, update SPF includes, and decide whether they should sign DKIM with your domain or use their own. Test DMARC alignment with your actual policies before you cut over.
8) Can we migrate and change identities/MFA at the same time?
You can, but you’re multiplying failure modes. If you want a calm migration, keep identity stable or migrate identity as a separate phase with its own rollout and rollback plan.
9) What’s the best way to prove to management that the migration is safe?
Metrics and chain-of-custody evidence: queue sizes, defer rates, bounce codes, delivery latency percentiles, and Message-ID trace samples across old/new. Charts beat confidence.
10) How do we avoid hammering the source mail servers?
Throttle migration concurrency, schedule bulk reads off-hours, and watch IO latency. Protect production mail access. Your migration is not allowed to become the top workload on the mailbox store.
Conclusion: next steps you can execute tomorrow
If you want a zero-downtime migration that won’t lose messages, stop thinking of it as “moving mailboxes” and start treating it as maintaining contracts while changing infrastructure. Your users don’t care where the mailbox lives. They care that the inbox remains a reliable system.
Practical next steps:
- Choose a routing model: staged coexistence by recipient, ideally behind a stable SMTP gateway.
- Write your rollback plan in operational terms: what gets changed back, by whom, in what order.
- Inventory senders and DNS auth: SPF includes, DKIM selectors, DMARC policy strictness, and third-party systems.
- Build your verification routine: queue monitoring, log tracing, Message-ID searches, and external test sends.
- Run a pilot that includes weird users: delegates, shared mailboxes, heavy archives, and compliance constraints.
- Throttle for stability: migrations finish when they finish; outages create longer timelines than patience ever will.
Do the boring work. It’s how you get the exciting result: nothing breaks, nobody notices, and you keep your weekend.