Email migration: Move mail to a new server with minimal downtime (real steps)

Mail migrations don’t fail because SMTP is hard. They fail because humans forget that email is a distributed system held together by DNS caching, weird client state, and three different “truths” about where a message lives.

You can do this with close to zero downtime. But you need a plan that respects how mail actually behaves in the wild: messages in transit, cached MX records, IMAP UID quirks, and the unglamorous reality that someone’s Outlook has been running since 2017 without a restart.

What “minimal downtime” really means

For email, “downtime” isn’t a single switch. SMTP delivery is store-and-forward: remote servers will retry for hours or days if your server is down. IMAP is stateful: clients keep connections open and cache mailbox identifiers. DNS is cached: your “instant” MX change might take a day for some senders.

So the objective isn’t “no disruptions.” It’s controlled disruption:

  • No lost mail. Delayed is acceptable; lost is a career-limiting event.
  • Short window where writes are single-homed. During cutover, new mail should land on one server. Everything else is a footgun.
  • Predictable client behavior. IMAP/Submission endpoints and certificates should be stable so clients don’t revolt.
  • A rollback that doesn’t require prayer. Rollback should be “flip MX back and re-enable inbound,” not “rebuild the old box from smoke.”

The migration pattern that works in production is two-pass replication:

  1. Bulk sync while the old server is live.
  2. Short freeze (or controlled dual-delivery) during cutover, then final sync.

You’re not moving “email.” You’re moving mailboxes, identities, cryptographic reputation, and a pile of operational assumptions.

A few facts (and bits of history) that change decisions

  1. SMTP is older than most of your tooling. SMTP was standardized in the early 1980s, and it still assumes intermittent connectivity and retries. That’s why your cutover can be forgiving if you manage queues correctly.
  2. MX records don’t force delivery. Some senders ignore MX preferences or cache aggressively. You reduce risk with TTL planning, but you never get perfect control.
  3. IMAP UIDs are not message IDs. Clients rely on per-mailbox UID sequences and UIDVALIDITY. If you reset them accidentally, clients may duplicate or “lose” messages until a full resync.
  4. Maildir vs mbox is a real operational difference. Maildir is many files; mbox is one file. Migrating mbox with rsync can look “done” while still corrupting state under concurrent writes.
  5. DMARC changed how “just change IPs” works. Modern deliverability is reputation plus alignment (SPF and DKIM). Recipients can reject your outbound mail even when your server is perfectly healthy.
  6. Greylisting is still alive in some places. First-time senders get deferred intentionally. During cutover, new IP + cold reputation can mean more deferrals and slower mail.
  7. TLS on SMTP became mainstream because of opportunistic encryption. Most servers will happily fall back to plaintext if STARTTLS is misconfigured, unless a policy (for example MTA-STS or DANE) enforces encryption. Your migration can accidentally reduce security without anyone noticing.
  8. Large providers run their own rules. Some recipients maintain internal reputation per sending IP and per domain; moving servers can look like “new sender” even if your domain is old.

Decide what you’re migrating: SMTP, IMAP, auth, and storage

Pick a target shape before touching data

A mail server isn’t one daemon. It’s a small ecosystem:

  • SMTP inbound (port 25): usually Postfix. Receives mail for your domains and enforces policy.
  • Submission (587/465): authenticated sending for users; should require TLS and auth.
  • IMAP/POP: Dovecot typically. This is where client pain happens.
  • Auth: local users, LDAP, SQL, OAuth-ish gateways. Migrating auth wrong is how you wake up at 02:00.
  • Content scanning: rspamd/Amavis/ClamAV. Optional, but many orgs rely on it without admitting it.
  • DKIM signing: OpenDKIM/rspamd. Keys must survive the move.
  • Storage: filesystem + quotas + backups + snapshots. This is where you either have boring success or you have a story people tell.

Stop pretending DNS is just MX

For a clean cutover, you’ll usually touch:

  • A/AAAA for mail.example.com (IMAP/submission endpoint)
  • MX for example.com
  • SPF TXT: authorize new sending IPs
  • DKIM TXT: ensure the new server signs with the same selector/key, or publish the new key first
  • DMARC TXT: alignment and policy; don’t tighten policy on migration week
  • Reverse DNS (PTR) for the new IP: many recipients penalize or reject mail from an IP that has no matching PTR

One reliability quote, because it’s true

“Hope is not a strategy.” The line circulates in engineering circles without a firm attribution, so treat it as a paraphrased idea. Either way, build the rollback.

Joke #1: Email is the only product where “it was delivered eventually” is considered a success criterion and a threat model.

Pre-migration prep that actually matters

1) Choose the identity that will not change

For minimal downtime, keep the client-visible hostnames stable. Use mail.example.com for IMAP/submission, and move the IP behind it. Don’t change hostnames and certificates during the same maintenance window unless you enjoy support tickets.

2) Lower DNS TTLs early (and confirm)

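TTL changes only help if the old, longer TTL has already expired from resolver caches before cutover. Lower them at least one full old-TTL period ahead (24–48 hours covers a typical 86400-second TTL with margin) for: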

  • MX records for the domain
  • mail.example.com A/AAAA records

Use a TTL like 300 seconds temporarily. Then restore sane TTLs (3600–14400 seconds) once the new server has proven stable.
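
The lead time is plain arithmetic: a resolver that cached the record one second before you lowered the TTL may keep it for the full old TTL. A minimal sketch of that calculation follows; the function name is made up for illustration, and it assumes resolvers honor TTLs, which not all of them do.

```shell
# earliest_cutover LOWERED_AT_EPOCH OLD_TTL_SECONDS
# Prints the epoch second after which a TTL-honoring resolver can no
# longer hold the old long-lived record. Hypothetical helper, not a
# standard tool.
earliest_cutover() {
    lowered_at=$1
    old_ttl=$2
    echo $((lowered_at + old_ttl))
}
```

With GNU date you could render the result as a timestamp, for example `date -d "@$(earliest_cutover "$(date +%s)" 86400)"`. Treat the answer as a floor, not a promise.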

3) Decide: rsync, dsync, or export/import?

If you’re running Dovecot and IMAP, Dovecot dsync is usually the safest because it respects mailbox metadata and can do incremental syncs cleanly. For pure Maildir, rsync can work, but only if you understand what “in-flight changes” mean.

4) Plan the mail freeze window

The cleanest cutover is:

  • Stop inbound SMTP on the old server (or reject temporarily with a 4xx defer).
  • Drain queues and ensure all new mail lands on the new server.
  • Final sync from old to new for any last mailbox changes.

That freeze window can be 5–30 minutes if you did the bulk sync earlier.

5) Don’t migrate a mess; snapshot it

If your storage supports snapshots (ZFS, LVM, filesystem snapshots), take them. Being able to revert mailbox state is the difference between “oops” and “incident.”

Real tasks with commands: check, replicate, verify, decide

These are not “toy” commands. They’re the things you actually run while migrating a production mail system. Each task includes: command, what the output means, and what decision you make.

Task 1: Inventory listening services (old and new)

cr0x@server:~$ sudo ss -lntp | grep -E ':(25|465|587|993|143)\b'
LISTEN 0      100          0.0.0.0:25        0.0.0.0:*    users:(("master",pid=1123,fd=13))
LISTEN 0      100          0.0.0.0:587       0.0.0.0:*    users:(("master",pid=1123,fd=14))
LISTEN 0      128          0.0.0.0:993       0.0.0.0:*    users:(("dovecot",pid=1402,fd=40))

Meaning: Confirms Postfix master is bound for SMTP/submission and Dovecot for IMAPS. Missing ports means clients will fail or inbound won’t land.

Decision: If new server isn’t listening on the correct ports, stop. Fix firewall/service units before you sync data or touch DNS.

Task 2: Confirm DNS TTLs (what you think you set vs reality)

cr0x@server:~$ dig +nocmd +noall +answer MX example.com
example.com.  300 IN MX 10 mail.example.com.

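Meaning: TTL 300 is low. Good for cutover agility. If you still see 3600/86400, you lowered it too late. A caching resolver reports the time remaining on its cached copy, so query the zone’s authoritative name server directly when you want to see the configured TTL.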

Decision: If TTLs are high and cutover is soon, plan a longer coexistence window and keep old server receiving longer.
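
If you check several records, a small filter saves squinting. This is a sketch only: `ttl_over` is a hypothetical helper that reads `dig +noall +answer` lines on stdin and assumes the TTL is the second column, as in the output above.

```shell
# ttl_over MAX_TTL
# Reads `dig +noall +answer` output on stdin and prints every record
# whose TTL (second column) is above MAX_TTL.
# Exit status: 0 if all records are within the limit, 1 otherwise.
ttl_over() {
    awk -v max="$1" '
        $2 + 0 > max + 0 { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage would look like `dig +nocmd +noall +answer MX example.com | ttl_over 300`. No output and a zero exit status mean the record is at or below the limit.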

Task 3: Verify name resolves consistently (A/AAAA)

cr0x@server:~$ dig +short A mail.example.com
203.0.113.10

Meaning: Shows the IP clients will hit for IMAP/submission (if you use this hostname). Dual-stack? Check AAAA too.

Decision: If A/AAAA doesn’t match your plan, fix DNS now—before you debug phantom “can’t connect” reports later.

Task 4: Check TLS certificate on IMAP (clients care a lot)

cr0x@server:~$ openssl s_client -connect mail.example.com:993 -servername mail.example.com -brief
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Peer certificate: CN = mail.example.com
Verification: OK

Meaning: CN/SAN matches hostname; verification OK. If you see verification errors, clients may prompt or refuse.

Decision: Fix certificate chain and hostname before cutover. Do not “we’ll fix it after”; you won’t.

Task 5: Confirm Postfix identity and domains (avoid accepting for wrong domains)

cr0x@server:~$ sudo postconf myhostname mydestination virtual_mailbox_domains
myhostname = mail.example.com
mydestination = $myhostname, localhost.localdomain, localhost
virtual_mailbox_domains = example.com, example.org

Meaning: You accept mail for the intended domains as virtual mailboxes (common setup). If mydestination mistakenly includes example.com, you might deliver locally into system accounts.

Decision: Correct domain routing now. Wrong classification causes silent misdelivery.

Task 6: Verify mailbox format and location (Maildir vs mbox)

cr0x@server:~$ sudo doveconf -n | grep -E '^(mail_location|mail_privileged_group|first_valid_uid)'
mail_location = maildir:/var/vmail/%d/%n/Maildir
first_valid_uid = 5000

Meaning: Mail is stored as Maildir under /var/vmail, and first_valid_uid is the lowest system UID Dovecot will accept for a mail user. If the new server uses a different mail_location or UID mapping, clients can see empty mailboxes or permission errors.

Decision: Keep mailbox paths and UID strategy consistent across servers unless you are intentionally migrating formats (which is a different project).

Task 7: Measure mailbox data size (so you can estimate sync time)

cr0x@server:~$ sudo du -sh /var/vmail
842G    /var/vmail

Meaning: Rough size of mail storage. If it’s close to disk capacity on the new server, you’re not migrating—you’re moving into a fire.

Decision: Ensure at least 30–50% headroom. Mail systems grow and spikes happen (journaling, indexing, temporary files).
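
The headroom check is simple enough to script. A minimal sketch, with a hypothetical function name; both arguments only need to be in the same unit.

```shell
# headroom_pct DATA_SIZE CAPACITY
# Prints the percentage of the target filesystem that stays free once
# DATA_SIZE has been copied onto it (integer, rounded down).
# Both arguments must use the same unit, for example KiB.
headroom_pct() {
    data=$1
    capacity=$2
    echo $(( (capacity - data) * 100 / capacity ))
}
```

Feeding it the numbers from this article, `headroom_pct 842 1800` gives 53, which clears the 30–50% bar. In practice you would take the first argument from `du -sk /var/vmail | cut -f1` on the old server and the second from `df -Pk /var/vmail | awk 'NR==2{print $2}'` on the new one.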

Task 8: Check filesystem and inode pressure (Maildir eats inodes)

cr0x@server:~$ df -h /var/vmail && df -ih /var/vmail
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       1.8T  1.1T  680G  62% /var/vmail
Filesystem      Inodes  IUsed  IFree IUse% Mounted on
/dev/sdb1        120M   79M   41M   66% /var/vmail

Meaning: Maildir can burn inodes fast. Running out of inodes looks like “disk full” while df -h looks fine.

Decision: If inode usage is high, choose a filesystem/inode ratio appropriate for millions of small files, or plan an archival strategy.
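
To alert on inode pressure instead of eyeballing it, you can pull the IUse% figure out of df. This sketch assumes the POSIX-format output of `df -iP`, where the fifth column of the second line is the usage percentage; the helper name is invented for illustration.

```shell
# inode_use_pct
# Reads `df -iP <path>` output on stdin and prints the IUse% value of
# the first filesystem line as a bare number (no percent sign).
inode_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Something like `df -iP /var/vmail | inode_use_pct` then yields a number you can compare against whatever threshold you consider “high.”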

Task 9: Bulk sync Maildir with rsync (first pass)

cr0x@server:~$ sudo rsync -aHAX --numeric-ids --delete --info=stats2,progress2 /var/vmail/ root@mail-new:/var/vmail/
sending incremental file list
...
Number of files: 18,492,310 (reg: 18,492,000, dir: 310)
Total file size: 901,223,554,120 bytes
Total transferred file size: 901,223,554,120 bytes
Literal data: 210,331,112,455 bytes
Speedup is 4.28

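Meaning: First pass copies nearly everything. “Literal data” is what rsync had to send because it found no match on the destination, and “speedup” is the total file size divided by the bytes actually sent and received. Maildir rarely benefits from delta transfer: messages are immutable, so each file is either already present or copied whole.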

Decision: If performance is terrible, don’t “tune” blindly. Check disk IOPS and network. You may need to throttle, schedule off-peak, or use ZFS send/receive if you can.
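
If throttling is the answer, one hedged way to do it is to combine rsync’s own bandwidth cap with lower CPU and I/O priority. The helper below only prints a command line for review; it is a hypothetical wrapper, the paths and host mirror the example above, and `ionice` only has an effect with I/O schedulers that honor priorities.

```shell
# throttled_rsync_cmd BWLIMIT
# Prints (does not run) a bulk-sync command line with a bandwidth cap.
# BWLIMIT is passed to rsync --bwlimit, which counts in units of
# 1024 bytes per second by default.
throttled_rsync_cmd() {
    printf 'ionice -c2 -n7 nice -n 19 rsync -aHAX --numeric-ids --delete --bwlimit=%s /var/vmail/ root@mail-new:/var/vmail/\n' "$1"
}
```

For example, `throttled_rsync_cmd 20000` prints a command capped at roughly 20 MB/s. Pick the figure from measured disk and network behavior, not from a guess.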

Task 10: Incremental sync (second pass) with rsync

cr0x@server:~$ sudo rsync -aHAX --numeric-ids --delete --info=stats2 /var/vmail/ root@mail-new:/var/vmail/
Number of files: 18,492,400 (reg: 18,492,090, dir: 310)
Number of created files: 120
Number of deleted files: 6
Total transferred file size: 87,991,230 bytes

Meaning: The delta is now small. That’s what you want before cutover: a short final sync window.

Decision: If deltas are still huge, users are actively changing mail (moving folders, deleting). Plan a short freeze or use dsync for safer reconciliation.
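
You can turn that delta into a rough freeze-window estimate. The two helpers below are a sketch with invented names: one extracts the transferred byte count from `--info=stats2` output, the other divides by a throughput you measured yourself during the bulk pass.

```shell
# delta_bytes
# Reads rsync --info=stats2 output on stdin and prints the
# "Total transferred file size" value as a plain integer.
delta_bytes() {
    awk -F': ' '/^Total transferred file size:/ {
        gsub(/[^0-9]/, "", $2)   # drop thousands separators and " bytes"
        print $2
    }'
}

# sync_seconds BYTES BYTES_PER_SECOND
# Prints transfer time in whole seconds, rounded up.
sync_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

With the numbers above and an assumed 50 MB/s, `sync_seconds 87991230 50000000` gives 2 seconds of raw transfer. For Maildir the real figure is higher, because rsync still has to walk millions of files to find the changes.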

Task 11: Dovecot dsync replication (preferred when available)

cr0x@server:~$ sudo doveadm -v sync -u alice@example.com remote:vmail@mail-new
dsync(alice@example.com): Debug: Added connection to remote:vmail@mail-new
dsync(alice@example.com): Debug: Mailbox INBOX: GUIDs match
dsync(alice@example.com): Debug: Finished successfully

Meaning: Dsync compares mailbox GUIDs and syncs changes safely. “GUIDs match” is comforting; mismatches can be fixed but require attention.

Decision: If dsync reports UIDVALIDITY or GUID issues, pause and diagnose mailbox metadata before cutting over. This is where client “duplicates” are born.
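
A quick way to spot UIDVALIDITY drift is to compare the value for the same mailbox on both servers. The sketch below assumes you have captured one status line per server, for instance from `doveadm mailbox status -u alice@example.com 'uidvalidity uidnext messages' INBOX`, and that the line contains a `uidvalidity=N` token. The helper names are hypothetical.

```shell
# uidvalidity_of
# Reads a mailbox status line on stdin and prints the value of its
# uidvalidity=N token (empty output if there is none).
uidvalidity_of() {
    tr ' \t' '\n\n' | sed -n 's/^uidvalidity=//p' | head -n 1
}

# same_uidvalidity OLD_LINE NEW_LINE
# Succeeds only if both lines carry the same, non-empty UIDVALIDITY.
same_uidvalidity() {
    old=$(printf '%s\n' "$1" | uidvalidity_of)
    new=$(printf '%s\n' "$2" | uidvalidity_of)
    [ -n "$old" ] && [ "$old" = "$new" ]
}
```

If the check fails for a mailbox, expect clients to re-download or duplicate that folder, and investigate before cutover.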

Task 12: Verify Dovecot can see mailboxes on new server

cr0x@server:~$ sudo doveadm mailbox list -u alice@example.com
INBOX
Sent
Trash
Archive/2024

Meaning: Mailbox hierarchy is present for a representative user. If it’s empty, your mail_location, permissions, or auth mapping is wrong.

Decision: Fix structure now; do not cut DNS and then discover everyone has “only INBOX.”

Task 13: Check permissions/ownership on migrated Maildir

cr0x@server:~$ sudo namei -l /var/vmail/example.com/alice/Maildir
f: /var/vmail/example.com/alice/Maildir
drwxr-xr-x root  root  /
drwxr-xr-x root  root  var
drwxr-x--- root  vmail vmail
drwxr-x--- vmail vmail example.com
drwxr-x--- vmail vmail alice
drwx------ vmail vmail Maildir

Meaning: Directories are owned by the mail user/group. If you see root:root in mailbox directories, Dovecot may throw “Permission denied” and clients will see empty folders.

Decision: Fix with chown -R vmail:vmail /var/vmail (carefully) and verify first_valid_uid/last_valid_uid align.

Task 14: Validate Postfix queue state before cutover

cr0x@server:~$ sudo postqueue -p | head
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
2A1B2C3D4E*     912 Thu Jan  4 10:15:11  sender@example.net
                                         user@example.com

-- 12 Kbytes in 1 Request.

Meaning: The old server still has queued mail. A small queue is fine; a growing queue means upstream delivery problems.

Decision: If the queue is large or stuck, investigate before cutover. Otherwise you’re migrating a backlog and calling it “downtime.”
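
To track whether the queue is growing or shrinking, sample the request count over time. This sketch parses the summary line that `postqueue -p` prints last, so feed it the full output rather than a `head` excerpt. The function name is invented.

```shell
# queue_requests
# Reads full `postqueue -p` output on stdin and prints the number of
# queued messages. Exits 1 if no summary line was found.
queue_requests() {
    awk '
        /^Mail queue is empty/            { n = 0; seen = 1 }
        /^-- .* in [0-9]+ Requests?\.$/   { n = $(NF - 1); seen = 1 }
        END { if (!seen) exit 1; print n + 0 }
    '
}
```

Running `sudo postqueue -p | queue_requests` every few minutes before cutover gives you a trend, which matters more than any single number.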

Task 15: Watch logs for active delivery errors (during testing)

cr0x@server:~$ sudo tail -n 30 /var/log/mail.log
Jan  4 10:21:07 mail postfix/smtpd[2201]: connect from unknown[198.51.100.77]
Jan  4 10:21:08 mail postfix/smtpd[2201]: NOQUEUE: reject: RCPT from unknown[198.51.100.77]: 450 4.7.1 Client host rejected: cannot find your hostname; from= to= proto=ESMTP helo=

Meaning: You’re rejecting a sender because of reverse DNS policy. That may be intentional, but it will bite you if your policy is stricter than your business tolerance.

Decision: Decide whether you want to be “technically pure” or “receive mail.” Tune smtpd_client_restrictions accordingly, especially during cutover week.

Task 16: Confirm SPF and DKIM are in place before moving sending traffic

cr0x@server:~$ dig +short TXT example.com
"v=spf1 ip4:203.0.113.10 ip4:203.0.113.20 -all"

Meaning: SPF authorizes both old and new IPs temporarily. That’s the safe approach for cutover and rollback.

Decision: Keep old IP authorized until you’re sure no systems are still sending from it (apps, printers, legacy relays). Then remove.
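
A small guard can confirm both addresses are present before you move sending traffic. This is a deliberately narrow sketch: the hypothetical `spf_has_ip4` only matches a literal `ip4:` mechanism for a single address, and does not evaluate CIDR ranges, qualifiers, or `include:` chains.

```shell
# spf_has_ip4 RECORD IP
# Succeeds if the SPF RECORD contains an exact "ip4:IP" mechanism.
# Surrounding double quotes (as printed by dig +short) are ignored.
spf_has_ip4() {
    printf '%s\n' "$1" | tr -d '"' | tr ' ' '\n' | grep -Fxq "ip4:$2"
}
```

For example, `spf_has_ip4 "$(dig +short TXT example.com | grep spf1)" 203.0.113.20` should succeed before cutover and keep succeeding until rollback is off the table.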

Task 17: Confirm DKIM signing actually happens on the new server

cr0x@server:~$ sudo rspamc stat | grep -E 'dkim|dmarc|spf'
Actions: reject: 0, add header: 120, greylist: 4, no action: 980
DKIM sign: 110
DMARC policy: 12
SPF: 940

Meaning: “DKIM sign” count increasing indicates outbound is being signed. If it’s zero, you’re about to have deliverability issues that look like “random bounces.”

Decision: Fix DKIM signing before moving outbound traffic. Deliverability is slow to recover and fast to destroy.

Task 18: Smoke-test end-to-end mail flow (SMTP in, IMAP read)

cr0x@server:~$ swaks --to user@example.com --server mail.example.com --port 25
=== Trying mail.example.com:25...
=== Connected to mail.example.com.
<  220 mail.example.com ESMTP Postfix
>  EHLO client.local
<  250-mail.example.com
...
>  DATA
<  354 End data with <CR><LF>.<CR><LF>
>  Subject: migration test
>
>  hello
>  .
<  250 2.0.0 Ok: queued as 9F8E7D6C5B

Meaning: SMTP acceptance works and message is queued. You still need to confirm delivery to mailbox and client visibility.

Decision: If SMTP accepts but mail doesn’t show up in IMAP, you’re looking at delivery/LMTP issues, mailbox permissions, or filtering rules. Don’t touch DNS until you understand why.

Checklists / step-by-step plan (two-pass migration)

Phase 0: Constraints and rollback (write it down)

  • Define maintenance window and who is on call for DNS, mail, and storage.
  • Decide your rollback trigger: e.g., “inbound delivery error rate > X for Y minutes,” or “IMAP auth failure widespread.”
  • Keep old server intact and reachable until you are done. Do not repurpose the IP on day one.
  • Temporarily authorize both old and new sending IPs in SPF.
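
A rollback trigger is only useful if you can compute it quickly. One possible way to measure an “inbound error rate,” sketched under the assumption that your Postfix logs look like the samples in this article: count rejected recipients per 100 SMTP connections. The function name and the choice of metric are illustrative, and the threshold is yours to pick.

```shell
# reject_pct
# Reads Postfix log lines on stdin and prints rejected RCPTs per
# 100 smtpd connections (integer, rounded down; 0 if no connections).
reject_pct() {
    awk '
        /postfix\/smtpd\[[0-9]+\]: connect from /     { connects++ }
        /postfix\/smtpd\[[0-9]+\]: NOQUEUE: reject: / { rejects++ }
        END {
            if (connects == 0) { print 0; exit }
            print int(rejects * 100 / connects)
        }
    '
}
```

You might run it over a recent slice of the log, for example `sudo tail -n 2000 /var/log/mail.log | reject_pct`, and compare the result with the baseline you measured on the old server.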

Phase 1: Build the new server like you mean it

  • Install and configure Postfix + Dovecot + filtering stack to match existing behavior.
  • Copy DKIM keys and ensure signing selectors match DNS records (or publish new records first).
  • Install TLS certs for the stable hostnames (IMAP and submission especially).
  • Match user database/auth mapping (LDAP/SQL/local). Test with a real user.
  • Provision storage with headroom and correct inode density; enable snapshots/backups.

Phase 2: Bulk sync mail data (first pass)

  • Run rsync or dsync while old server is live.
  • Do not stop services yet; let users work.
  • Measure duration and throughput. If it will take days, plan around it rather than pretending.

Phase 3: Validation before any DNS change

  • Pick 5–10 representative mailboxes: large, small, shared, heavy IMAP folder usage.
  • Verify login, folder list, and message counts on new server.
  • Send test mail in and out; check headers for DKIM signatures and Received chain.
  • Confirm quotas, sieve rules (if used), and any server-side filtering behavior.

Phase 4: Pre-cutover delta sync (second pass)

  • Run an incremental sync to reduce the final window.
  • Monitor old server write activity (active IMAP connections and deliveries).
  • Communicate: a short window where mail might be delayed is better than a long mystery outage.

Phase 5: Cutover (short freeze + DNS)

  1. Stop or defer inbound SMTP on old server (return temporary failure). You want senders to retry, not bounce.
  2. Ensure submission (587/465) for users points to the new server, ideally via the same hostname.
  3. Perform final sync (rsync or dsync).
  4. Switch MX (and A/AAAA if applicable) to the new server.
  5. Confirm the new server is accepting inbound SMTP and watch logs like it’s a thriller.

Phase 6: Post-cutover monitoring and cleanup

  • Watch Postfix queue, Dovecot auth failures, and delivery latency.
  • Keep old server accepting nothing but available for emergency retrieval for a while.
  • After stable period: restore higher TTLs, remove old IP from SPF, and archive the old server safely.

Joke #2: The most reliable mail migration tool is a calendar invite titled “Final sync” that everyone actually attends.

Cutover day: DNS, queues, and client impact

How to “freeze” inbound without losing mail

Your old server should respond with a temporary failure so sending MTAs retry. A controlled defer is safer than shutting the port and causing some senders to treat it as hostile. If you can’t do policy routing cleanly, stopping Postfix is still acceptable—SMTP retry semantics exist for a reason.

One practical approach: keep old server running, but make it refuse new inbound with a 4xx. That way it’s still alive for queue delivery and admin access.
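
In Postfix terms, one way to get that behavior is the generic `defer` restriction, which answers with a temporary failure (defer_code, 450 by default). The sketch below is a dry-run wrapper with hypothetical function names; it assumes a stock setup where port 25 takes its recipient restrictions from main.cf. If your submission service inherits the same setting, authenticated users will be deferred too.

```shell
# run CMD...
# Prints the command unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf 'DRY-RUN:'
        printf ' %s' "$@"
        printf '\n'
    fi
}

# freeze_inbound
# Makes the old server defer every recipient from outside mynetworks
# with a 4xx, then reloads Postfix. This overwrites the existing
# restriction list, so record the old value first for rollback:
#   postconf -h smtpd_recipient_restrictions
freeze_inbound() {
    run postconf -e 'smtpd_recipient_restrictions = permit_mynetworks, defer'
    run postfix reload
}
```

Calling `freeze_inbound` prints the two commands; `APPLY=1 freeze_inbound` (as root) would execute them. Test it against a non-production instance before cutover day.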

Queue drain strategy

Before cutover, you want the old server’s outbound queue stable or shrinking. If it’s growing, you’ll cut over and still have outbound stuck, which users interpret as “email is broken” even if inbound is fine.

After cutover, you may still see some inbound trickle to the old server because of cached MX. Keep a plan:

  • Option A (clean): old server defers all inbound with 4xx for a period, forcing retries to new MX once caches refresh.
  • Option B (messy but workable): old server accepts and relays inbound to the new server for a limited time.

Option B can be useful, but it’s also how you accidentally create mail loops and duplicate deliveries if you misconfigure transport maps. Choose carefully.
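
If you do go with Option B on Postfix, the usual loop guard is to name the next hop by bracketed IP in the transport map, so the old server never consults MX (which may still point at itself in some cache). A minimal sketch, with a hypothetical helper and assumed addresses; on the old server the domain would also have to move from virtual_mailbox_domains to relay_domains.

```shell
# transport_entry DOMAIN NEW_IP OWN_IP
# Prints a transport(5) line that relays DOMAIN straight to NEW_IP on
# port 25, skipping MX lookup. Refuses if NEW_IP is this host's own
# address, because that entry would make mail loop.
transport_entry() {
    domain=$1 new_ip=$2 own_ip=$3
    if [ "$new_ip" = "$own_ip" ]; then
        echo "refusing: next hop $new_ip is this host, mail would loop" >&2
        return 1
    fi
    printf '%s\tsmtp:[%s]:25\n' "$domain" "$new_ip"
}
```

For instance, `transport_entry example.com 203.0.113.20 203.0.113.10` prints a line you could review and add to the transport map, followed by `postmap` and a reload. Time-box the relay and remove the entry when the stragglers stop.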

Client impact and how to keep it boring

If users connect to mail.example.com for IMAP and submission, and the certificate is valid on both servers, you can move the A/AAAA record without requiring client reconfiguration. The main client-visible behavior becomes a short disconnect/reconnect.

If you change hostnames, you’ll get:

  • Certificate warnings
  • Saved-password mismatches
  • Mobile clients stuck “verifying account”
  • Tickets claiming “my mailbox is empty” when it’s just re-indexing

So: keep hostnames stable unless you absolutely cannot.

Fast diagnosis playbook (find the bottleneck in minutes)

This is the triage order when something feels slow or broken right after cutover. Don’t get creative; be systematic.

First: Is it DNS or routing?

  • Check what MX the world sees for the domain.
  • Check what A/AAAA clients resolve for mail.example.com.
  • Confirm the new server is actually receiving connections.
cr0x@server:~$ dig +short MX example.com
10 mail.example.com.
cr0x@server:~$ sudo tail -n 20 /var/log/mail.log
Jan  4 10:31:22 mail postfix/smtpd[3122]: connect from mx.remote[198.51.100.77]
Jan  4 10:31:23 mail postfix/smtpd[3122]: NOQUEUE: reject: RCPT from mx.remote[198.51.100.77]: 550 5.1.1 <user@example.com>: Recipient address rejected: User unknown in virtual mailbox table; ...

Interpretation: If the logs show connections, DNS is probably fine. The error indicates recipient lookup/auth/db mismatch.

Second: Is SMTP accepting but not delivering?

  • Look for mail being queued but not delivered.
  • Check LMTP/local delivery and Dovecot LDA/LMTP logs.
  • Check filesystem permissions on vmail.
cr0x@server:~$ sudo postqueue -p | head -n 20
-Queue ID-  --Size-- ----Arrival Time---- -Sender/Recipient-------
9F8E7D6C5B      1043 Thu Jan  4 10:29:10  sender@example.net
                                         user@example.com

Interpretation: If queue grows with local recipients, delivery pipeline is broken. That’s Dovecot LMTP/socket, permissions, or recipient maps.

Third: Is IMAP/auth the problem?

  • Auth failures spike? LDAP/SQL creds, TLS requirements, or firewall.
  • IMAP slow? Disk IOPS or Dovecot indexes rebuilding.
cr0x@server:~$ sudo doveadm auth test alice@example.com 'CorrectHorseBatteryStaple'
passdb: alice@example.com auth succeeded
extra fields:
  user=alice@example.com

Interpretation: Auth test succeeding points away from LDAP/password issues. If clients still fail, look at TLS hostname mismatch or old clients trying a different port.

Fourth: Is storage the hidden villain?

  • High iowait, slow fsync, saturated pool—Maildir makes this visible fast.
  • Check for inode exhaustion.
cr0x@server:~$ iostat -xm 2 3
Device            r/s     w/s   rMB/s   wMB/s  %util  await
nvme0n1          12.1   210.3    0.3     8.7   98.9   45.2

Interpretation: %util near 100 and high await means storage is saturated. IMAP will feel “down” while services are technically up.

Decision: Reduce load (limit concurrency), move indexes to faster storage, or scale IOPS. Don’t keep blaming Postfix; it’s just standing there.

Common mistakes: symptom → root cause → fix

1) Symptom: Some senders deliver to old server for hours

Root cause: Cached MX records or sender policy ignoring low TTL; you changed TTL too late or some resolvers clamp it.

Fix: Keep old server available. Either defer inbound with 4xx so senders retry and follow new MX later, or relay to new server temporarily with loop protection.

2) Symptom: Users see duplicate messages after migration

Root cause: IMAP UID/UIDVALIDITY changed due to mailbox metadata reset or format conversion without preserving state.

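Fix: Prefer dsync. Preserve Dovecot metadata when possible; for Maildir that means the dovecot-uidlist file in each folder (it holds the UIDVALIDITY and UID assignments) along with the indexes. If already broken, isolate affected mailboxes and resync cleanly with client cache reset guidance.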

3) Symptom: “Authentication failed” everywhere right after cutover

Root cause: New server pointing to wrong LDAP base DN, missing SASL config, or firewall blocking auth backend.

Fix: Use doveadm auth test and check SASL logs. Validate network connectivity to LDAP/SQL from the new host.

4) Symptom: SMTP accepts mail but it never reaches mailboxes

Root cause: Postfix delivery transport misconfigured (LMTP socket path wrong), or vmail permissions/UID mismatch.

Fix: Verify Postfix virtual_transport/mailbox_transport and Dovecot LMTP socket. Confirm ownership and first_valid_uid match actual vmail UID.

5) Symptom: Outbound mail goes to spam or is rejected after migration

Root cause: SPF doesn’t include new IP, DKIM not signing, missing PTR, or cold IP reputation.

Fix: Pre-publish SPF for both IPs; ensure DKIM signing; set PTR and HELO to match; ramp outbound gradually if possible.

6) Symptom: IMAP is “up” but feels painfully slow

Root cause: Storage IOPS bottleneck, Dovecot reindexing, or antivirus/content filter chewing CPU/disk.

Fix: Check iostat, load, and Dovecot process counts. Move indexes to fast storage, tune process limits, and avoid doing full-text indexing during the cutover window.

7) Symptom: Random 550 “User unknown” for valid users

Root cause: Recipient maps not loaded, stale SQL credentials, or case sensitivity differences; sometimes a missing domain in virtual_mailbox_domains.

Fix: Compare old/new Postfix maps and reload. Test lookups directly (SQL/LDAP) and verify domain lists.

Three corporate mini-stories (how it goes wrong and right)

Mini-story 1: The incident caused by a wrong assumption

They assumed lowering TTL on the MX record meant “everyone will switch within five minutes.” It was a tidy assumption, the kind you make when you mostly deal with internal DNS and Kubernetes service discovery.

The cutover happened at 19:00. They flipped MX, watched inbound mail hit the new server, and declared victory. At 21:30, a senior exec forwarded a customer complaint: “We sent an urgent contract update two hours ago. Did you receive it?” The new server logs showed nothing. The old server logs did.

Turns out several partner organizations used resolvers that ignored the new TTL (or cached the old answer anyway). Those senders kept delivering to the old MX. Meanwhile, the team had already set the old Postfix to “reject everything” with a hard 5xx because they wanted to force traffic over. SMTP did what SMTP does: the senders bounced immediately. Some of those bounces never made it back to human inboxes because… yes, their email was also in flux.

The fix wasn’t exotic. They changed the old server to temporary 4xx deferrals, so senders queued and retried. They also added a controlled relay from old to new for a few critical partner IPs. Within a day, everything stabilized.

The lesson: don’t confuse “TTL set low” with “the internet obeyed.” For cutover safety, treat the old server as a compatibility layer for at least one full TTL cycle plus “people who don’t care about TTL.”

Mini-story 2: The optimization that backfired

A different company decided to speed up migration by running parallel rsync jobs per domain. Ten domains, ten rsyncs, all at once. It looked efficient on a whiteboard and terrifying in iostat.

The new server had fast CPUs and “decent” SSDs, so they expected it to swallow the load. But Maildir migrations aren’t big sequential writes; they’re millions of small file creates, metadata updates, and fsync storms. The SSDs hit high latency, Dovecot indexes started timing out, and Postfix deliveries backed up because writing to disk was slow.

The team responded by adding more concurrency. This is the classic mistake: treating a storage bottleneck like a CPU-bound job. Latency got worse. The server wasn’t down; it was just so slow that clients retried and amplified the load. A self-inflicted thundering herd.

They recovered by doing the opposite of what their instincts wanted: they reduced rsync parallelism to one job, enabled rate limiting, and scheduled bulk sync outside business hours. They moved Dovecot index storage to faster local NVMe and kept mail storage on the bigger array. Boring architecture, suddenly excellent.

The lesson: when you’re creating millions of files, “faster” often means “less concurrent.” Your bottleneck is metadata I/O, not bandwidth.

Mini-story 3: The boring but correct practice that saved the day

One organization had a policy: every migration must include a written rollback plan and a live rollback drill for one mailbox. People complained it was bureaucratic. It was also the reason they slept.

During their email move, they noticed outbound mail from the new server was intermittently deferred by a major recipient. Inbound was fine. Users could read mail. But outbound was a slow bleed of “why didn’t they receive my email?” tickets.

Because they had a rollback plan, they didn’t panic. They simply moved submission traffic back to the old server for outbound while keeping inbound on the new server. SPF already authorized both IPs, DKIM keys were consistent, and the client hostname hadn’t changed. Users barely noticed except that mail started flowing again.

Over the next day, they fixed the deliverability issue: reverse DNS mismatch and an overly strict HELO name. Then they moved outbound back to the new server in a controlled way. No heroics. No late-night “we’ll just restart it” rituals.

The lesson: the most valuable engineering work is often the least glamorous. Rollback planning is not pessimism; it’s competence.

FAQ

1) Can I really do an email migration with “zero downtime”?

You can get close: minimal visible disruption and no lost mail. But you can’t stop DNS caching or guarantee every sender’s behavior. Plan for a short freeze and a coexistence period.

2) Should I move MX first or move the IMAP/submission hostname first?

In most orgs: stabilize IMAP/submission first (same hostname, valid TLS), then cut MX. Users notice IMAP breakage immediately; inbound SMTP delays are often tolerated because SMTP retries.

3) Is rsync safe for Maildir?

Mostly, if you do it in two passes and you accept that concurrent changes can race. For the final sync, a short freeze is cleaner. If you have Dovecot on both sides, dsync is typically safer for metadata consistency.

4) What about mbox mail storage?

Mbox under concurrent writes is risky to copy with rsync because you can capture partial writes or lock state weirdness. Prefer application-level migration or convert to Maildir with a controlled procedure and downtime window.

5) Do I need to migrate Dovecot indexes?

Not strictly, but rebuilding them can cause heavy I/O and slow IMAP after cutover. If you can migrate indexes safely and they’re compatible, it can reduce post-cutover pain. Test first.

6) How do I handle mail that arrives at the old server after cutover?

Either defer with 4xx so senders retry to the new MX, or temporarily relay inbound to the new server. If you relay, implement loop prevention and time-box it.

7) Should I keep the same DKIM key or rotate it?

Keep the same DKIM key/selector during migration if you can. Rotate later. Migration week is not the time to “also improve security hygiene” unless you enjoy combined failure modes.

8) How long should I keep the old server around?

At least long enough to cover DNS caches and any straggler systems sending or delivering to the old host. Commonly one to two weeks for cautious orgs, shorter if you have full observability and control.

9) What’s the most common cause of “mailbox is empty” reports?

Permissions/ownership mismatches, wrong mail_location, or auth mapping sending users to a different path than expected. Less often: UIDVALIDITY changes causing client confusion.

10) How do I know it’s safe to restore high TTL values?

When logs show negligible inbound to the old server, outbound is stable from the new IP, and you’re not seeing straggler client connections to old endpoints.

Next steps (practical, not motivational)

  1. Lower TTLs for MX and mail hostnames 24–48 hours ahead, and verify with dig.
  2. Build the new server to match the old behavior: ports, TLS, auth, domains, filters, quotas.
  3. Run a bulk sync, then an incremental sync, and measure deltas so you can predict the final window.
  4. Validate with real mailboxes and end-to-end SMTP/IMAP tests before touching DNS.
  5. Cut over with a short freeze, prefer 4xx deferrals over permanent rejects on the old server.
  6. Monitor like a professional: queues, auth failures, storage latency, and DKIM/SPF behavior.
  7. Keep rollback easy: old server intact, SPF temporarily includes both IPs, and don’t delete anything until you’ve had a quiet week.

If you do these steps, the migration becomes pleasantly boring. And boring is the highest compliment you can pay an email system.
