Email: TLS cert renewal — keep SMTP/IMAP from breaking on renew day


Certificate renewal day is when “nothing changed” turns into a four-alarm fire. Mail is special that way:
the rest of your stack might tolerate a blip, but SMTP retries, IMAP clients scream, and executives rediscover
what “outbox” means.

The good news: keeping SMTP/IMAP stable during TLS certificate renewal is mostly boring mechanics. The bad news:
it only works if you do the boring parts on purpose, and you test the exact things that actually break.

A production mental model: what can break, where, and why

Renewing a TLS certificate sounds like a simple swap: replace a file, restart a daemon, done.
Email servers punish that optimism because there are multiple TLS contexts and multiple clients,
and they don’t all behave the same.

SMTP isn’t one protocol, it’s three deployment modes

Most mail operators run all of these at once:

  • SMTP with STARTTLS (port 25) for server-to-server delivery. Opportunistic TLS, but still picky about chains and hostnames depending on policy.
  • Submission with STARTTLS (port 587) for users/clients. Usually strict validation, especially on modern mobile/desktop clients.
  • SMTPS (port 465) implicit TLS. Still common; clients expect a clean handshake immediately.

You can renew a cert and “test port 25” and still break half your users on 587.
Or vice versa. Ask me how I know. Actually don’t; it’s character building, and you probably already have enough character.

IMAP/POP are “long-lived connections” land

Dovecot (or your IMAP/POP daemon of choice) often holds many TLS sessions open.
A cert swap doesn’t automatically renegotiate. Some clients will keep working until they reconnect; others will reconnect
aggressively and start failing in bursts. That makes it look like “intermittent network issues” until you correlate it
with the renewal event.

The file swap is not the deploy

This is the single most common conceptual bug: “certbot renewed, so we’re done.”
Certbot (or any ACME client) writes files. Your daemons don’t magically reread them unless you reload/restart,
and even then only if your config points at the right paths. Add symlinks and container mounts and you get the classic:
“The cert on disk is new but the service still serves the old one.”

What “breaking” actually looks like

You’ll see one of these:

  • Handshake failures (wrong key, broken chain, unsupported protocol/ciphers, or bad permissions).
  • Hostname mismatch (wrong CN/SAN, wrong vhost mapping, wrong SNI, or clients hitting a different name than you tested).
  • Policy failures (MTA-STS enforced, DANE mismatch, pinned certs in clients, corporate proxies doing “inspection”).
  • Operational failures (reload didn’t happen, config drift, bad file ownership, SELinux/AppArmor blocks reads).

A useful (slightly paranoid) framing: treat a TLS renewal like a tiny config deployment that affects every inbound and outbound
connection. You don’t “renew a certificate.” You “deploy a new trust artifact into a distributed system.”

One line that still holds up in ops: “Hope is not a strategy.” (a traditional SRE saying)

Interesting facts and historical context (yes, it matters)

Email TLS behavior is the result of decades of retrofits. Some history clarifies why your “simple” change can trigger
20 different client behaviors.

  1. STARTTLS was added later. SMTP predates TLS; STARTTLS is an extension bolted onto an older protocol, so “opportunistic” became the default vibe.
  2. Port 465 was “deprecated,” then came back. SMTPS on 465 was briefly registered and then revoked in the late 1990s, submission on 587 with STARTTLS became preferred, and RFC 8314 (2018) now recommends implicit TLS on 465 for submission.
  3. Certificate chains got shorter over time. Many environments moved from long cross-signed chains to cleaner roots/intermediates; old clients sometimes require the “right” intermediate.
  4. Let’s Encrypt accelerated renew cycles. Short-lived certs normalized automation, but also normalized “oops, we forgot the reload hook.”
  5. IMAP IDLE makes failures look random. Long-lived connections mean cert issues surface during reconnect storms, not at the exact renewal time.
  6. TLS 1.0/1.1 removal was a breaking change. Tightening protocol versions improved security, but legacy mail clients (and appliances) reacted poorly.
  7. MTA-STS and TLS-RPT changed incentives. Server-to-server TLS went from “nice to have” to “strict senders will refuse to deliver to you” if your server fails the policy you publish.
  8. DANE exists and is underused. It can enforce TLS at SMTP level using DNSSEC, but it also makes renewal mistakes instantly obvious to strict peers.

Joke #1: TLS renewal is like changing a tire while the car is moving—except the car is also delivering executive email and judging you silently.

Fast diagnosis playbook (first/second/third checks)

When “mail TLS is broken,” you need a fast path to the bottleneck. Don’t start by redeploying everything.
Start by isolating which service, which port, which name, which cert.

First: identify the failing surface area

  • Which protocol? SMTP inbound (25), submission (587), SMTPS (465), IMAPS (993), POP3S (995).
  • Which hostname? mail.example.com vs smtp.example.com vs domain apex.
  • Which client type? Gmail servers, Outlook desktop, iOS Mail, a printer, a SIEM appliance.

Decision: if only one port breaks, you have a config mapping/reload issue. If all ports break, suspect file permissions,
key mismatch, or chain corruption.

Second: fetch the served certificate from the network

Do not trust what’s on disk yet. Trust what’s actually served.

Decision: if the served cert is old, your reload didn’t apply. If it’s new but invalid, it’s chain/hostname/key.

Third: confirm chain and key pairing locally

If the network handshake fails, confirm the private key matches the cert, and the full chain is correct.

Decision: mismatch means wrong files wired; chain issues mean you need the right intermediate/fullchain file in config.

Fourth: read the daemon logs with purpose

Postfix and Dovecot typically tell you exactly what they can’t load. People just don’t read the lines that matter.
Find “cannot load,” “no start line,” “permission denied,” “key values mismatch.”

Fifth: verify policy layer (MTA-STS/DANE) if only certain domains fail

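If most mail flows but specific domains report deferrals or bounces such as “TLS required,” you’re not dealing with “TLS is down,”
you’re dealing with “TLS compliance is down.” Check both directions: strict senders enforcing the MTA-STS/DANE policy you publish,
and your own outbound TLS policy toward strict receivers. Different problem, different fix.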

Practical tasks: commands, outputs, and decisions (12+)

Below are real operational tasks. Each includes: the command, a realistic snippet of output,
what the output means, and what you should decide next.

Task 1: Confirm certificate expiry and SANs on disk

cr0x@server:~$ openssl x509 -in /etc/letsencrypt/live/mail.example.com/fullchain.pem -noout -dates -subject -ext subjectAltName
notBefore=Jan  1 00:12:03 2026 GMT
notAfter=Apr  1 00:12:02 2026 GMT
subject=CN = mail.example.com
X509v3 Subject Alternative Name:
    DNS:mail.example.com, DNS:smtp.example.com, DNS:imap.example.com

Meaning: This cert is currently valid and covers multiple hostnames.

Decision: If the failing hostname is not in SANs, stop. Issue the correct cert (or add the name),
don’t attempt “workarounds” like disabling hostname verification.

Task 2: Verify the private key matches the certificate (modulus check)

cr0x@server:~$ openssl x509 -noout -modulus -in /etc/letsencrypt/live/mail.example.com/cert.pem | openssl md5
MD5(stdin)= 9c0a0c7d5d0f4f1d0c1b1c1f3b7a2a7e
cr0x@server:~$ openssl rsa -noout -modulus -in /etc/letsencrypt/live/mail.example.com/privkey.pem | openssl md5
MD5(stdin)= 9c0a0c7d5d0f4f1d0c1b1c1f3b7a2a7e

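Meaning: The cert and key pair match. This check only works for RSA keys; openssl rsa rejects an ECDSA key, so a failure there does not by itself prove a mismatch.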

Decision: If hashes differ, you’re serving a cert with the wrong key. Fix wiring and redeploy; do not restart services repeatedly hoping it resolves.
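
The modulus comparison above only applies to RSA keys. A key-type-agnostic variant is to compare the public key embedded in the certificate with the one derived from the private key. This is a minimal sketch; the function name cert_matches_key is made up for illustration.

```shell
# Succeeds (exit 0) when the certificate and private key share a public key.
# Works for RSA and ECDSA keys; cert_matches_key is a hypothetical helper name.
cert_matches_key() {
  local cert_pub key_pub
  # Public key embedded in the first certificate of the PEM file.
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 2
  # Public key derived from the private key.
  key_pub="$(openssl pkey -in "$2" -pubout)" || return 2
  [ "$cert_pub" = "$key_pub" ]
}

# Example:
#   d=/etc/letsencrypt/live/mail.example.com
#   cert_matches_key "$d/fullchain.pem" "$d/privkey.pem" && echo MATCH || echo MISMATCH
```

Exit status 2 means one of the files could not be parsed at all, which is worth distinguishing from a genuine mismatch (exit status 1).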

Task 3: Check what Postfix thinks its TLS files are

cr0x@server:~$ postconf -n | grep -E 'smtpd_tls_cert_file|smtpd_tls_key_file|smtpd_tls_CAfile|smtpd_tls_chain_files|smtp_tls_security_level'
smtp_tls_security_level = may
smtpd_tls_cert_file = /etc/letsencrypt/live/mail.example.com/fullchain.pem
smtpd_tls_key_file = /etc/letsencrypt/live/mail.example.com/privkey.pem

Meaning: Postfix is configured to present fullchain.pem.

Decision: If these paths don’t point where you think, fix config first, then reload. If you’re using chroot, ensure Postfix can see these paths from inside it.

Task 4: Check Dovecot TLS config paths

cr0x@server:~$ doveconf -n | grep -E 'ssl =|ssl_cert|ssl_key|ssl_min_protocol|ssl_cipher_list'
ssl = required
ssl_cert = </etc/letsencrypt/live/mail.example.com/fullchain.pem
ssl_key = </etc/letsencrypt/live/mail.example.com/privkey.pem
ssl_min_protocol = TLSv1.2

Meaning: Dovecot reads the full chain and requires TLS.

Decision: If Dovecot points at cert.pem instead of fullchain.pem, you’ll often break clients that don’t fetch intermediates. Use the full chain unless you have a strong reason not to.

Task 5: Test STARTTLS on port 25 from a client perspective

cr0x@server:~$ openssl s_client -starttls smtp -connect mail.example.com:25 -servername mail.example.com -verify_hostname mail.example.com -showcerts -verify_return_error </dev/null
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = R11
verify return:1
depth=0 CN = mail.example.com
verify return:1
250 SMTPUTF8

Meaning: The chain and hostname verify cleanly. The trailing 250 line is the tail of the server’s pre-TLS EHLO reply, which s_client echoes after the handshake.

Decision: If verification fails, read the first “verify error” line. If it’s “unable to get local issuer certificate,” your chain is wrong. If it’s “hostname mismatch,” your SANs are wrong or SNI is misrouted. Note that s_client only checks the name because of -verify_hostname; -servername alone just sets SNI.

Task 6: Test submission port 587 with STARTTLS and AUTH advertisement

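cr0x@server:~$ (printf 'EHLO probe.example.com\r\n'; sleep 2; printf 'QUIT\r\n') | openssl s_client -starttls smtp -connect mail.example.com:587 -servername mail.example.com -quiet | head -n 12
depth=0 CN = mail.example.com
verify return:1
250-mail.example.com
250-PIPELINING
250-SIZE 52428800
250-AUTH PLAIN LOGIN
250-ENHANCEDSTATUSCODES
250 8BITMIME
221 2.0.0 Bye

Meaning: The cert verifies, and the EHLO sent over the encrypted channel shows the server advertising AUTH (typical on submission). STARTTLS is no longer offered because TLS is already active. With </dev/null as input, s_client would send no EHLO after the handshake and you would not see this list.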

Decision: If 25 works but 587 fails, you likely have a separate listener, separate cert config, or a different process/container serving 587.

Task 7: Test implicit TLS on port 465

cr0x@server:~$ openssl s_client -connect mail.example.com:465 -servername mail.example.com -verify_return_error </dev/null | grep -E 'subject=|issuer=|Verify return code'
subject=CN = mail.example.com
issuer=C = US, O = Let's Encrypt, CN = R11
Verify return code: 0 (ok)

Meaning: SMTPS handshake is clean.

Decision: If 465 fails but 587 works, suspect a different TLS stack (e.g., separate service definition, legacy config, or an LB terminating TLS for 465 only).

Task 8: Test IMAPS (993) and verify the presented certificate

cr0x@server:~$ openssl s_client -connect mail.example.com:993 -servername mail.example.com -verify_return_error </dev/null | grep -E 'subject=|issuer=|Verify return code'
subject=CN = mail.example.com
issuer=C = US, O = Let's Encrypt, CN = R11
Verify return code: 0 (ok)

Meaning: IMAPS is presenting the expected cert and chain.

Decision: If IMAPS fails but SMTP works, focus on Dovecot: file permissions, reload, or a separate ssl_cert path.

Task 9: Confirm which certificate Postfix actually loaded (log-driven)

cr0x@server:~$ sudo journalctl -u postfix -n 200 --no-pager | grep -Ei 'tls|certificate|key|error' | tail -n 12
Jan 04 10:21:18 mail postfix/smtpd[21456]: warning: TLS library problem: error:0A00018F:SSL routines::ee key too small
Jan 04 10:21:18 mail postfix/smtpd[21456]: fatal: TLS library initialization failed
Jan 04 10:21:19 mail postfix/master[1023]: warning: process /usr/lib/postfix/sbin/smtpd pid 21456 exit status 1
Jan 04 10:21:19 mail postfix/master[1023]: warning: /usr/lib/postfix/sbin/smtpd: bad command startup -- throttling

Meaning: The TLS library rejected the key size (common after security hardening or when someone installed the wrong key).

Decision: Re-issue a cert/key pair that meets your OpenSSL policy (typically RSA 2048+ or ECDSA). Don’t relax OpenSSL security levels just to “make it work.”

Task 10: Confirm Dovecot can read the files (permissions and SELinux-ish reality)

cr0x@server:~$ sudo -u dovecot test -r /etc/letsencrypt/live/mail.example.com/privkey.pem && echo OK || echo NO
OK
cr0x@server:~$ sudo -u dovecot test -r /etc/letsencrypt/live/mail.example.com/fullchain.pem && echo OK || echo NO
OK

Meaning: The dovecot user can read both files.

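Decision: If you get NO, fix file permissions/ownership or use a deploy copy step into a directory your service can read. Stock Postfix and Dovecot normally read key files as root before dropping privileges, so a NO for the dovecot user only matters when the daemon itself runs unprivileged (containers, custom units); don’t loosen key permissions beyond what it needs. Also consider MAC systems (SELinux/AppArmor): “readable” at Unix permissions doesn’t guarantee allowed.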

Task 11: Reload daemons safely and verify they stayed up

cr0x@server:~$ sudo systemctl reload postfix
cr0x@server:~$ sudo systemctl reload dovecot
cr0x@server:~$ systemctl is-active postfix dovecot
active
active

Meaning: Both services accepted reload and remained active.

Decision: If reload fails or service goes inactive, stop and roll back to last-known-good cert files. A cert deploy that takes down the daemon is worse than an expiring cert by several orders of magnitude.

Task 12: Verify from inside the host network namespace (container/LB gotcha)

cr0x@server:~$ openssl s_client -starttls smtp -connect 127.0.0.1:587 -servername mail.example.com -verify_return_error </dev/null | grep -E 'subject=|Verify return code'
subject=CN = mail.example.com
Verify return code: 0 (ok)

Meaning: Locally the service is correct. If external tests fail, suspect load balancers, proxies, NAT, or a different node serving traffic.

Decision: Compare local vs external cert fingerprints; if different, you have split-brain across instances.

Task 13: Confirm the served certificate fingerprint (good for comparing nodes)

cr0x@server:~$ echo | openssl s_client -connect mail.example.com:993 -servername mail.example.com 2>/dev/null | openssl x509 -noout -fingerprint -sha256
sha256 Fingerprint=6B:12:8E:3B:40:65:8D:29:42:0E:AD:AF:47:3F:9A:5C:0C:52:6A:0F:84:62:26:25:13:73:4C:CA:19:1C:0E:8A

Meaning: This is the actual cert served on that endpoint right now.

Decision: Run the same command against each node or VIP target. If fingerprints differ, fix your deployment pipeline or LB pool membership before you do anything else.
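
To turn the fingerprint comparison into something you can loop over, one option is a pair of small helpers: one reads the fingerprint of the cert on disk, the other reads what a listener actually serves. This is a sketch under assumptions: the helper names and the backend addresses in the example are hypothetical, and served_fp as written targets implicit-TLS ports such as 993 or 465.

```shell
# SHA-256 fingerprint of the first (leaf) certificate in a PEM file.
disk_fp() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 | cut -d= -f2
}

# SHA-256 fingerprint of the certificate served by an implicit-TLS listener.
# $1 = address to connect to, $2 = port, $3 = name to send as SNI.
served_fp() {
  echo | openssl s_client -connect "$1:$2" -servername "$3" 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256 | cut -d= -f2
}

# Example (hypothetical backend addresses):
#   want="$(disk_fp /etc/letsencrypt/live/mail.example.com/fullchain.pem)"
#   for node in 10.0.0.11 10.0.0.12; do
#     [ "$(served_fp "$node" 993 mail.example.com)" = "$want" ] || echo "STALE: $node"
#   done
```

Connecting by node address while sending the public name as SNI lets you test each backend without going through the VIP.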

Task 14: Validate the chain file is structurally sane

cr0x@server:~$ awk 'BEGIN{c=0} /BEGIN CERTIFICATE/{c++} END{print c}' /etc/letsencrypt/live/mail.example.com/fullchain.pem
2

Meaning: fullchain.pem contains two certificates (leaf + intermediate), which is typical.

Decision: If this prints 1, you might be serving leaf-only. If it prints a surprising number, someone concatenated extra certs and confused picky clients.
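
Counting certificates says nothing about their order. To see who issued what, a common openssl idiom is to wrap the bundle in a PKCS#7 structure and print each certificate's subject and issuer. A minimal sketch; list_chain is a hypothetical helper name.

```shell
# Print subject and issuer for every certificate in a PEM bundle, in file order.
list_chain() {
  openssl crl2pkcs7 -nocrl -certfile "$1" \
    | openssl pkcs7 -print_certs -noout
}

# Example:
#   list_chain /etc/letsencrypt/live/mail.example.com/fullchain.pem
```

In a healthy bundle the leaf comes first, and each issuer line matches the subject line of the certificate that follows it.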

Task 15: Check for upcoming expiry across all served names (cheap insurance)

cr0x@server:~$ for host in mail.example.com smtp.example.com imap.example.com; do
> echo -n "$host 587: "
> echo | openssl s_client -starttls smtp -connect $host:587 -servername $host 2>/dev/null \
> | openssl x509 -noout -enddate
> done
mail.example.com 587: notAfter=Apr  1 00:12:02 2026 GMT
smtp.example.com 587: notAfter=Apr  1 00:12:02 2026 GMT
imap.example.com 587: notAfter=Apr  1 00:12:02 2026 GMT

Meaning: All names are serving a consistent certificate with the same expiry.

Decision: If one hostname shows a different date, you have a routing/vhost discrepancy or a stale node.

Three corporate mini-stories (anonymized, painfully plausible)

Mini-story 1: The outage caused by a wrong assumption (the “fullchain is optional” myth)

A mid-sized SaaS company ran Postfix + Dovecot on a pair of VMs behind a TCP load balancer. They renewed with ACME,
and in a tidy-up commit, an engineer “simplified” config by pointing both Postfix and Dovecot at cert.pem
instead of fullchain.pem. The assumption was reasonable: “clients can fetch intermediates.”

The first sign wasn’t a total outage. It was support tickets: “Outlook says the certificate is untrusted.”
Then a sales rep couldn’t connect on hotel Wi‑Fi. Then a partner’s on-prem mail gateway stopped accepting deliveries
with a TLS validation error. Most other clients kept working, because modern stacks often build missing chains using
cached intermediates.

Ops spent hours chasing “network flakiness” because symptoms varied by client and location. The load balancer health checks
were green (it only checked TCP). Mail queues slowly grew, then spiked as retries piled up. Nobody correlated it with
the renewal because “the cert is valid” and the expiry date was in the future.

The fix was one line: serve fullchain.pem. But the lesson stuck: assumptions about client behavior don’t belong
in production mail. The chain you send is the chain you’re responsible for.

Mini-story 2: The optimization that backfired (reload avoidance as a “stability” tactic)

An enterprise IT team had a strict change-control culture, and they’d been burned by restarts.
So they declared a new standard: renew certs automatically, but never reload mail daemons automatically.
“We’ll do it during the monthly window,” they said, and it sounded mature.

For weeks, everything looked fine. Certificates renewed on disk. Dashboards showed green because they monitored
expiry dates by reading local files. Then one day a certificate actually expired because the monthly window moved
due to a business event, and nobody reloaded.

The daemons continued serving the old, now-expired cert. Some clients ignored it temporarily; others refused outright.
Submission failures hit first (587/465). Users could receive mail (IMAP sessions stayed up), but couldn’t send.
That is the exact failure mode that creates executive escalations: “I can’t email the board.”

The postmortem was awkward: the “optimization” was intended to avoid downtime, but it created a delayed-action outage
that was harder to spot. Their monitoring checked the wrong thing: the filesystem, not the network.
They changed policy to: automatic renew and automatic reload, gated by a canary handshake test.

Mini-story 3: The boring but correct practice that saved the day (staging, canaries, and fingerprints)

A payments company ran mail as a “necessary evil” but treated it like production anyway. They had three habits:
(1) always test what’s served externally, (2) deploy certs via a controlled copy to a service-owned directory,
and (3) reload during business hours with a canary check, because that’s when humans are awake.

One renewal cycle, their ACME client produced a valid cert, but the intermediate chain shifted (still valid, just different).
A small set of ancient scanners in a warehouse rejected the new chain. Nobody cared about the scanners—until they did,
because those scanners emailed shipping exceptions.

The canary caught it. Their test suite included a “known-bad legacy” client profile that still existed in production.
The canary didn’t block the deploy permanently; it triggered a decision: keep the modern chain for the public endpoints,
and run a dedicated internal hostname with a compatibility chain for the warehouse segment.

Boring practice wins again. They didn’t have an outage; they had a controlled compatibility fix. That’s the difference
between “renewal day” and “Tuesday.”

Checklists / step-by-step plan for renew day

Principles (the opinionated part)

  • Never deploy a cert you didn’t validate from the network. On-disk checks are necessary, not sufficient.
  • Serve a complete chain. Use fullchain.pem unless you’re doing something intentionally weird.
  • Automate reloads, but gate them. A reload hook that runs a handshake test before and after is not “dangerous automation,” it’s safer than humans.
  • Don’t “fix” certificate problems by weakening TLS. You’re not troubleshooting, you’re creating a future incident.
  • Assume you have split-brain until proven otherwise. Multi-node mail + load balancers + renewals = inconsistency by default.

Step-by-step plan: safe renewal pipeline (single node)

  1. Renew into a staging location.
    Keep the service’s live cert directory stable until you’ve verified key pairing and SANs.
  2. Validate cert and key pairing locally (Task 2). If mismatch, stop.
  3. Validate chain structure (Task 14). If leaf-only, stop.
  4. Copy files into a service-owned directory with stable permissions.
    Avoid pointing daemons directly at ACME directories if your security model makes that brittle.
  5. Reload Dovecot and Postfix (Task 11).
  6. Run external handshake tests for 25/587/465/993 (Tasks 5–8). Confirm SANs and verify return code.
  7. Confirm fingerprints for auditability (Task 13). Record them in the change note.
  8. Watch logs and queues for 15–30 minutes. Mail breaks slowly sometimes.

Step-by-step plan: multi-node with a load balancer

  1. Pick a canary node. Drain it from the LB (or reduce weight) but keep it reachable for tests.
  2. Deploy cert to canary node (staged copy), reload services.
  3. Test canary directly by node IP/DNS using SNI and the expected hostname (Tasks 5–8 plus Task 13).
  4. Re-add canary to LB and monitor error rates and logs. If clean, roll forward node-by-node.
  5. After rollout, test the LB VIP and also each backend to ensure fingerprints match (Task 13).

Renew hook example: validate before reload, validate after

This is not a “script tutorial,” it’s a pattern: renewal event triggers tests, then reload, then re-tests.
The tests should exit non-zero if verification fails.

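cr0x@server:~$ sudo bash -lc 'cat >/usr/local/sbin/mail-tls-postrenew.sh <<'"'"'EOF'"'"'
#!/usr/bin/env bash
# Post-renew hook: check, reload, re-check. Any failed step exits non-zero.
set -euo pipefail

HOST="mail.example.com"
LIVE="/etc/letsencrypt/live/$HOST"

# Before reload: the renewed cert and key must share a public key, and the
# local submission listener must still complete a handshake.
precheck() {
  local cert_pub key_pub
  cert_pub="$(openssl x509 -in "$LIVE/fullchain.pem" -noout -pubkey)"
  key_pub="$(openssl pkey -in "$LIVE/privkey.pem" -pubout)"
  [ "$cert_pub" = "$key_pub" ]
  echo | openssl s_client -starttls smtp -connect 127.0.0.1:587 -servername "$HOST" 2>/dev/null \
    | openssl x509 -noout -enddate >/dev/null
}

# After reload: the public name must serve a cert that verifies on 587 and 993.
postcheck() {
  echo | openssl s_client -starttls smtp -connect "$HOST:587" -servername "$HOST" -verify_return_error 2>/dev/null \
    | openssl x509 -noout -enddate
  echo | openssl s_client -connect "$HOST:993" -servername "$HOST" -verify_return_error 2>/dev/null \
    | openssl x509 -noout -enddate
}

precheck
systemctl reload postfix
systemctl reload dovecot
postcheck
EOF
chmod 0755 /usr/local/sbin/mail-tls-postrenew.sh
/usr/local/sbin/mail-tls-postrenew.sh'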
notAfter=Apr  1 00:12:02 2026 GMT
notAfter=Apr  1 00:12:02 2026 GMT

Meaning: Before and after checks succeeded; reload didn’t break external verification.

Decision: If this script fails, treat it like a failed deploy: roll back or halt rollout. Don’t keep cycling reloads.
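
One way to wire /usr/local/sbin/mail-tls-postrenew.sh into the renewal event, assuming certbot is your ACME client: certbot runs executables placed in /etc/letsencrypt/renewal-hooks/deploy/ after each successful renewal and exports RENEWED_LINEAGE with the live directory of the cert it just renewed. The file name 50-mail-tls.sh and the is_mail_lineage helper below are illustrative, not certbot conventions.

```shell
#!/usr/bin/env bash
# Hypothetical deploy hook: /etc/letsencrypt/renewal-hooks/deploy/50-mail-tls.sh
# certbot sets RENEWED_LINEAGE to the live directory of the renewed cert.
set -eu

MAIL_LINEAGE="/etc/letsencrypt/live/mail.example.com"

# True only when the renewed cert is the one the mail daemons serve.
is_mail_lineage() {
  [ "${1%/}" = "$MAIL_LINEAGE" ]
}

# Other lineages on the same host (web, VPN) should not reload mail.
if is_mail_lineage "${RENEWED_LINEAGE:-}"; then
  /usr/local/sbin/mail-tls-postrenew.sh
fi
```

Filtering on the lineage keeps unrelated renewals from touching Postfix and Dovecot, which matters on hosts that carry more than one certificate.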

Common mistakes: symptoms → root cause → fix

These are not theoretical. These are the ones that show up at 02:00 with a very awake on-call.

1) “It renewed, but clients still see the old certificate”

Symptoms: On disk the cert is new; externally the cert fingerprint is unchanged.
Clients keep warning about expiry or untrusted cert.

Root cause: Service wasn’t reloaded, or the service points at different file paths than you updated.
In clusters, one node didn’t get the update.

Fix: Verify served cert with Task 13. Reload daemons (Task 11). Confirm config paths (Tasks 3–4).
For multi-node: check each node’s fingerprint and fix deployment inconsistency.

2) “Handshake failure: key values mismatch / wrong key”

Symptoms: Postfix/Dovecot logs show key mismatch; TLS won’t initialize; service may refuse to start TLS.

Root cause: Cert and private key don’t match, usually due to copying only some files, mixing ECDSA and RSA,
or pointing at the wrong privkey.pem.

Fix: Run modulus check (Task 2). Correct file paths and deploy the right pair.
If you rotate keys, rotate both together, atomically.

3) “Some clients work, some don’t (especially Outlook or embedded devices)”

Symptoms: Modern clients connect fine; older devices fail with “untrusted” or “cannot verify.”

Root cause: You served leaf-only cert (no intermediate) or served a chain those clients can’t build.

Fix: Serve fullchain.pem (Tasks 3–4, 14). If you truly have legacy dependencies,
you may need a compatibility chain strategy (but don’t cargo-cult it—test with the actual devices).

4) “Hostname mismatch after renewal”

Symptoms: Error like “certificate subject name does not match target host name.”
Happens on some ports/hostnames but not others.

Root cause: Missing SAN entries, incorrect hostname used by clients, or different vhost config per port.
Sometimes SNI routing is wrong (LB sends wrong cert for name).

Fix: Inspect SANs (Task 1). Validate with SNI using -servername (Tasks 5–8).
Fix certificate issuance to include all used hostnames, and align DNS/clients if you have drift.

5) “Only specific external domains bounce mail with TLS required”

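Symptoms: Most mail flows. Specific domains see deferrals or bounces like “TLS required but handshake failed,” either when they send to you or when you send to them.

Root cause: Policy enforcement via MTA-STS or DANE. The receiving domain publishes the policy and the sending server enforces it, so a bad cert on your port 25 fails strict senders first. Outbound failures usually mean your own TLS policy or protocol settings reject the peer.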

Fix: Reproduce with strict verification (Task 5 with -verify_return_error).
Ensure correct chain, correct hostname, and sane protocol versions. Don’t disable TLS; fix compliance.

6) “Reload worked, then hours later IMAP complaints started”

Symptoms: Slow-burning failures, especially on IMAP; not tied to the reload moment.

Root cause: Client reconnect storms reveal cert or chain issues gradually; some clients cached intermediates earlier and later lose cache.

Fix: Validate IMAPS externally (Task 8) and confirm full chain.
Monitor logs for TLS alerts post-renewal. Consider forcing a controlled reconnect window if you must flush stale sessions.

Monitoring and alerting that catches this before users do

The usual monitoring mistake is checking the filesystem expiry date. That tells you what’s available,
not what’s served. Monitor from the network, per port, per hostname, from at least two locations.

What to monitor (minimum viable)

  • Served certificate expiry on 25 (STARTTLS), 587 (STARTTLS), 465 (implicit), 993 (implicit), and any POP ports you expose.
  • Handshake success rate (binary). Either you can establish TLS and verify the chain, or you can’t.
  • Certificate fingerprint drift between nodes and the VIP (optional, but gold for diagnosing split-brain).
  • Queue depth and defer rate for Postfix. Cert failures often show up as deferrals and retries.
  • TLS error rate in logs for both Postfix and Dovecot.

Example: quick expiry check from the network (cron-friendly)

cr0x@server:~$ bash -lc 'host=mail.example.com
for p in 465 993; do
  echo -n "$host:$p "
  echo | openssl s_client -connect $host:$p -servername $host 2>/dev/null | openssl x509 -noout -enddate
done'
mail.example.com:465 notAfter=Apr  1 00:12:02 2026 GMT
mail.example.com:993 notAfter=Apr  1 00:12:02 2026 GMT

Meaning: The served certificate expiry is visible and consistent on implicit TLS ports.

Decision: Alert if expiry is within your threshold (e.g., 14 days) or if the command fails (no output usually means handshake failure).
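
To turn the threshold into an exit code a cron job or monitoring agent can act on, openssl x509 has a -checkend option that takes a number of seconds. A minimal sketch; the helper names are hypothetical, and check_served_expiry as written covers implicit-TLS ports only.

```shell
# Exit 0 while the PEM certificate on stdin stays valid for at least $1 more days.
# Exits non-zero if it expires sooner, or if stdin holds no certificate at all.
valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Served certificate on an implicit-TLS port (465, 993) checked against a threshold.
# $1 = host, $2 = port, $3 = minimum remaining days.
check_served_expiry() {
  local host="$1" port="$2" days="$3"
  echo | openssl s_client -connect "$host:$port" -servername "$host" 2>/dev/null \
    | valid_for_days "$days"
}

# Example:
#   check_served_expiry mail.example.com 993 14 || echo "ALERT: mail.example.com:993"
```

A failed handshake produces empty input, so the same non-zero exit covers both “expiring soon” and “could not connect.”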

Log-based signal: find TLS failures fast

cr0x@server:~$ sudo journalctl -u postfix -S "1 hour ago" --no-pager | grep -Ei 'tls|handshake|SSL|certificate|verify|alert' | tail -n 20
Jan 04 11:02:07 mail postfix/smtpd[22911]: warning: TLS library problem: error:0A000086:SSL routines::certificate verify failed
Jan 04 11:02:07 mail postfix/smtpd[22911]: warning: TLS library problem: error:0A000418:SSL routines::tlsv1 alert unknown ca

Meaning: There are active TLS errors. “unknown ca” often indicates a chain issue or client trust issue.

Decision: If this spikes after renewal, treat it as a regression; confirm you’re serving full chain and not presenting an unexpected cert.

Joke #2: The only thing that renews itself faster than a Let’s Encrypt cert is an email outage rumor in a corporate chat.

Hardening: chains, ciphers, SNI, DANE, MTA-STS, and reality

Use fullchain by default, but understand what it is

fullchain.pem is your leaf cert plus the intermediate(s) needed to build to a trusted root.
Most clients don’t want to go treasure-hunting for intermediates mid-handshake.

Your mail daemon typically should present:

  • Leaf certificate (for mail.example.com)
  • Intermediate CA certificate(s)
  • Not the root CA certificate (clients already have roots; including it can confuse some stacks)

SNI: the silent footgun in multi-domain setups

If you serve multiple domains/certs on one IP, clients using SNI will request the correct name.
Some SMTP clients do, some don’t, and some proxies strip it.
Load balancers may terminate TLS and re-encrypt, changing what the client sees.

Operational advice:

  • Test with -servername always (Tasks 5–8). That approximates real client behavior.
  • Also test without SNI occasionally. If your “default” cert is wrong, legacy clients may get it.
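
For the no-SNI case, s_client in OpenSSL 1.1.1 and later accepts -noservername, which suppresses the SNI extension entirely. A sketch of a helper that shows which certificate the listener falls back to; default_cert_subject is a hypothetical name, and the STARTTLS variant shown targets ports 25 and 587.

```shell
# Subject and SANs of the certificate served when the client sends no SNI.
# $1 = host or address, $2 = port (25 or 587). Needs OpenSSL 1.1.1 or newer.
# For implicit-TLS ports (465, 993), drop "-starttls smtp".
default_cert_subject() {
  echo | openssl s_client -starttls smtp -connect "$1:$2" -noservername 2>/dev/null \
    | openssl x509 -noout -subject -ext subjectAltName
}

# Example:
#   default_cert_subject mail.example.com 25
```

If the names printed here are not ones your legacy clients connect to, the default certificate is the thing to fix.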

Protocol versions and cipher suites: be strict, but not performative

For mail, TLS 1.2 minimum is a sane baseline. TLS 1.3 is great, but you don’t need to treat it like a personality trait.
The bigger pitfall is tightening too fast without inventorying legacy clients.

If you want to know what you’re offering, check it directly:

cr0x@server:~$ openssl s_client -connect mail.example.com:993 -servername mail.example.com -tls1_2 </dev/null | grep -E 'Protocol|Cipher|Verify return code'
Protocol  : TLSv1.2
Cipher    : ECDHE-RSA-AES256-GCM-SHA384
Verify return code: 0 (ok)

Meaning: TLS 1.2 works and verification passes.

Decision: If TLS 1.2 fails but TLS 1.3 works, you may have made the server too strict for some clients. Decide whether that’s acceptable; if not, broaden compatibility intentionally.

DANE: powerful, but it forces you to be correct

DANE for SMTP uses DNSSEC-signed TLSA records to tell senders what cert/key to expect.
That’s great for security and terrible for sloppy renewals.

If you use DANE, any key change requires TLSA updates. If you pin the leaf SPKI and rotate keys without updating DNS,
strict senders will treat you as hostile.
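
If you pin the leaf SPKI, the value to publish is the SHA-256 hash of the certificate's DER-encoded public key (a “3 1 1” TLSA record). A minimal sketch for computing it from the cert you are about to deploy, so you can compare it with DNS before reloading; tlsa_311 is a hypothetical helper name.

```shell
# Print the "3 1 1" TLSA association data for a certificate:
# SHA-256 over the DER-encoded SubjectPublicKeyInfo of the leaf, lowercase hex.
tlsa_311() {
  openssl x509 -in "$1" -noout -pubkey \
    | openssl pkey -pubin -outform DER \
    | openssl dgst -sha256 \
    | awk '{print $NF}'
}

# Example: compare against what DNS publishes for port 25 on the MX host.
#   tlsa_311 /etc/letsencrypt/live/mail.example.com/fullchain.pem
#   dig +short TLSA _25._tcp.mail.example.com
```

dig prints the hash in uppercase and may split it with a space, so normalize before comparing. The hash only changes when the key changes, so renewing with the same key (certbot offers --reuse-key) keeps the record valid across renewals.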

MTA-STS: policy makes your “mostly works” unacceptable

MTA-STS doesn’t make TLS perfect, but it makes failure actionable: if you publish a policy with mode “enforce,”
sending servers that honor MTA-STS will require a valid handshake to one of your listed MX hostnames before they deliver to you.

That means renewal mistakes go from “some clients warn” to “inbound mail is deferred and eventually bounces at the sender.”
If you enable MTA-STS, you’re implicitly agreeing to be disciplined about certificates.
That’s fine. Just don’t do it halfway.
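
Before a renewal it helps to know what your own policy promises. The policy is a small key/value text file served at https://mta-sts.<your-domain>/.well-known/mta-sts.txt. A minimal sketch for pulling fields out of it; sts_field is a hypothetical helper name and example.com stands in for your domain.

```shell
# Read an MTA-STS policy body on stdin and print the value(s) of one key,
# for example "mode" or "mx". Tolerates CRLF line endings.
sts_field() {
  tr -d '\r' | awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

# Example:
#   curl -fsS https://mta-sts.example.com/.well-known/mta-sts.txt | sts_field mode
#   curl -fsS https://mta-sts.example.com/.well-known/mta-sts.txt | sts_field mx
```

If the mode is enforce, every mx value needs to be covered by the SANs of the certificate you serve on port 25 for that host (Task 1).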

Atomicity: deploy as a unit, not as individual files

The cert, key, and chain are a unit. Don’t copy them one by one while the service is reloading.
If you have to copy, copy into a new directory and swap a symlink atomically, then reload.

This is especially important on busy servers where a reload can happen mid-copy, producing intermittent failures that
disappear when you “try again.” Those are the worst incidents: the ones that gaslight the on-call.
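
The copy-then-swap pattern can be sketched as follows. Assumptions: GNU coreutils (mv -T), a daemon config that points at <base>/current/fullchain.pem and <base>/current/privkey.pem, and a made-up directory layout; deploy_bundle is a hypothetical helper name.

```shell
# Publish a cert bundle as one unit: copy into a fresh directory, then
# repoint the "current" symlink with a single rename. Requires GNU mv (-T).
# $1 = source directory (e.g. the ACME live dir), $2 = service-owned base dir.
deploy_bundle() {
  local src="$1" base="$2" release
  mkdir -p "$base/releases"
  # mktemp creates the directory with mode 0700; widen it only if the daemon needs it.
  release="$(mktemp -d "$base/releases/$(date +%Y%m%d%H%M%S).XXXXXX")" || return 1
  # -L follows the symlinks that ACME clients such as certbot keep in live/.
  cp -L "$src/fullchain.pem" "$src/privkey.pem" "$release/" || return 1
  chmod 600 "$release/privkey.pem"
  # Build the new link beside the old one, then rename it into place.
  ln -sfn "$release" "$base/current.new" || return 1
  mv -T "$base/current.new" "$base/current"
}

# Example (hypothetical target directory), followed by the reload from Task 11:
#   deploy_bundle /etc/letsencrypt/live/mail.example.com /etc/mail-tls
```

Because the daemon only ever sees the old directory or the new one, a reload that lands mid-deploy cannot pick up a new cert with an old key. Old release directories also double as a rollback target.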

FAQ

1) Do I need to restart Postfix/Dovecot to pick up a renewed cert?

Usually a systemctl reload is sufficient, but only if the daemon actually re-reads the cert on reload.
Confirm by checking the served fingerprint (Task 13). If it didn’t change, you need a restart or your reload isn’t wired.

2) Should I point configs to cert.pem or fullchain.pem?

Use fullchain.pem for mail services unless you have a tested reason not to. Leaf-only often breaks older or stricter clients.

3) Why does port 25 work but port 587 fails (or vice versa)?

Different listeners, different config blocks, different processes, or different termination points (LB/proxy).
Treat each port as its own product and test it explicitly (Tasks 5–7).

4) How do I know what certificate clients actually see?

Use openssl s_client from outside your network and record the fingerprint (Task 13). Don’t rely on local files.

5) What’s the quickest way to catch split-brain across mail nodes?

Compare fingerprints per node and on the VIP. If two nodes present different certs, your deploy isn’t consistent.
Fingerprints make this obvious in seconds.

6) Can I avoid reloads by relying on hot-reloading of certificate files?

Some software can watch files; many mail stacks don’t do it reliably, and behavior varies by version and distro patches.
Assume you need an explicit reload and design automation around it.

7) What about ECDSA certificates—are they safe for mail clients?

Often yes, but not universally. Some legacy clients and appliances have incomplete ECDSA support.
If your user base includes old devices, consider dual-certs or test before switching algorithms.

8) What if I use a load balancer that terminates TLS?

Then renewal is an LB problem first: update the LB cert store and validate on the VIP.
Also ensure backend TLS (if you use it) is correct, but remember the client sees the LB’s cert.

9) Why do IMAP issues show up later than SMTP issues?

IMAP clients often keep long-lived TLS sessions (especially with IDLE). SMTP submission clients reconnect frequently.
So submission breaks loud and fast; IMAP can break slowly as sessions churn.

10) Is it okay to temporarily disable certificate verification to “get mail flowing”?

For client apps, no. For server-to-server, weakening verification may still not help because peers enforce policies.
You’ll also be training your org to accept insecure fixes. Fix the cert and chain properly.

Next steps (the practical kind)

If you want renew day to feel like a non-event, do these next:

  1. Add network-based certificate checks for 25/587/465/993, per hostname, and alert on both expiry and handshake failure.
  2. Standardize on full chain deployment and verify SAN coverage before reload.
  3. Implement a gated post-renew hook that runs handshake tests, reloads services, and re-tests (pattern shown above).
  4. Make rollout atomic and consistent across nodes: staged copy, symlink swap, then reload, then fingerprint verification.
  5. Write down your “fast diagnosis” steps in your on-call runbook and include the exact commands you’ll run at 02:00.

The goal isn’t to make TLS renewals exciting. The goal is to make them boring—and boring is the highest compliment production systems can earn.
