Dovecot Mailbox Corruption: Recovery Steps That Minimize Damage

You notice it first as a vibe. Users complain that unread counts are haunted, folders show “phantom” messages, searches lie, or Outlook freezes at “Synchronizing…” forever. Then your logs get spicy: “Corrupted index”, “UID validity changed”, “Mailbox GUID mismatch”, “Internal error occurred”. You’ve got mailbox corruption, and every click risks turning a fixable scrape into a full-on data loss event.

This is a production playbook for recovering Dovecot mailboxes in a way that minimizes damage. It’s opinionated because indecision is how you end up “repairing” the only copy of the CFO’s inbox with a recursive delete.

A practical mental model: what “corruption” means in Dovecot

Dovecot is not a monolithic “mail store.” It’s a set of behaviors layered on top of your chosen storage format (Maildir, mdbox, sdbox) plus metadata (indexes, caches, UID lists), plus optional features (quota, full-text search), plus client behavior (IMAP IDLE, CONDSTORE, QRESYNC).

Mailbox corruption in Dovecot usually means one of three things:

  1. Metadata disagrees with reality. Index files say message X exists, but the underlying message file or record does not (or vice versa). Most “corruption” tickets land here.
  2. The underlying storage is inconsistent. Maildir filenames are mangled, duplicated, partially written; or mdbox map/log got out of sync; or the filesystem returned stale/partial data. This is where “minimize damage” matters most.
  3. The client’s view is incompatible with what the server now claims. UIDVALIDITY changes, modseq resets, “unknown UID” errors. The server may be correct, but you still have to handle client fallout.

What not to do: treat corruption as a single bug with a single fix. It’s a mismatch between layers. Your job is to identify which layer is lying, then force a controlled reconciliation in the safest direction.

Default safest direction: prefer underlying message storage as source of truth, then rebuild metadata. But if the underlying storage is damaged, you may need to take a snapshot, copy out what’s readable, and rebuild the store from extracted messages.

Also: “corruption” is often just the symptom of something else: storage flaps, antivirus scanning in-place, buggy network FS semantics, or a too-clever migration script.

One quote worth taping to the rack: “Hope is not a strategy.” It’s a stock SRE proverb rather than anyone’s verified original, and in mailbox repair, hope is how you overwrite the last good copy.

Fast diagnosis playbook (first/second/third)

This is the order that gets you to the bottleneck fastest, with the least self-inflicted injury.

First: confirm the blast radius and whether writes are still happening

  • Is it one user, one mailbox, or everything?
  • Are clients still writing (new mail delivery, IMAP APPEND, flag changes)?
  • Are you seeing active filesystem or disk errors?

If writes are ongoing and the store is unstable, you are not “repairing,” you’re gambling. Freeze or isolate.
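
A quick way to see what is currently attached before you freeze anything (a minimal sketch; doveadm who and doveadm kick query Dovecot's anvil process, which is enabled by default, and the exact output columns vary slightly by version):

cr0x@server:~$ sudo doveadm who user@example.com
username             # proto (pids)        (ips)
user@example.com     2 imap  (11234 11267) (203.0.113.7)
cr0x@server:~$ sudo doveadm kick user@example.com

Kicking only clears existing sessions; it does nothing about new logins or LMTP delivery, so it's the start of a freeze, not the freeze itself.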

Second: classify the failure by log signature

  • Index/caches: “Corrupted index file”, “cache file is corrupted”, “index header mismatch”. Usually fixable by index rebuild.
  • UID/UIDVALIDITY: “UIDVALIDITY changed”, “Invalid UIDNEXT”, “uidlist corruption”. More client impact, but still typically metadata rebuild.
  • Storage layer: “read() failed”, “Short read”, “mdbox map corrupted”, “mdbox: rebuild failed”, “Invalid message size”. This can be real data damage.

Third: decide the recovery posture

  1. Non-destructive repair: snapshot/backup, stop writes, rebuild indexes, force resync.
  2. Controlled reconstruction: export what’s readable (or dsync from replica), build a clean mailbox, reimport.
  3. Forensics mode: preserve everything, make a copy, and work on the copy when legal/compliance matters.

Before you touch anything: safety rules that prevent damage

Mailbox recovery is one of those tasks where the tool will do exactly what you asked, not what you meant. Your prime directive is to avoid turning a local inconsistency into irreversible deletion.

Rule 1: Freeze writes or isolate the mailbox

Rebuilding metadata while the mailbox is being modified is how you get repeated corruption, shifting UIDNEXT, and clients that never converge. For one user, you can temporarily disable their login or route them to a maintenance host. For a whole backend, you might stop IMAP/LMTP briefly or drain the proxy.
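
One concrete way to block a single user's logins, assuming you're willing to add a deny passdb (a documented Dovecot pattern; the file paths here are examples, adapt them to your config layout):

# /etc/dovecot/conf.d/10-auth.conf (excerpt)
passdb {
  driver = passwd-file
  deny = yes
  args = /etc/dovecot/deny-users
}

cr0x@server:~$ echo "user@example.com" | sudo tee -a /etc/dovecot/deny-users
cr0x@server:~$ sudo doveadm reload
cr0x@server:~$ sudo doveadm kick user@example.com

This stops IMAP/POP3 logins only. LMTP delivery is a separate path, so pause or defer delivery for the user at the MTA if the mailbox needs a full write freeze.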

Rule 2: Take a snapshot/backup of the mailbox directory (and indexes)

Yes, indexes too. Sometimes the “corrupted” index contains the only map to a message file that got renamed or misplaced, especially with partial migrations.

Rule 3: Work on a copy when the store itself may be damaged

If the filesystem is throwing errors, do not run aggressive repair commands on the only copy. Clone the volume, or at least rsync the mailbox tree to healthy storage.

Rule 4: Prefer rebuild over “surgical edits”

Hand-editing Dovecot index files is the mail equivalent of editing a database file with a hex editor because “it’s faster.” It isn’t, and it won’t be.

Short joke #1: Mailbox corruption is like glitter: you think you cleaned it up, and then it shows up in your logs for three more weeks.

Facts and historical context (why these failures exist)

  • Dovecot’s index files are an optimization, not the source of truth. They exist to make IMAP fast: quick flag lookups, sorting, threading, and cache.
  • Maildir was designed for safe concurrent delivery using filename semantics, but it assumes a filesystem that provides atomic rename and consistent directory operations.
  • IMAP’s UIDVALIDITY is a contract with clients. Change it and clients are allowed to treat it as a different mailbox; some will resync, some will duplicate, and some will sulk.
  • “dsync” grew out of Dovecot’s replication needs and is now a handy repair tool because it forces a consistent view by re-walking the store.
  • mdbox/sdbox exist because Maildir isn’t always kind to disks or operators at scale: mdbox packs many messages per file (less inode pressure, better directory scaling, cheaper backups), while sdbox keeps one file per message but moves flags out of filenames and into indexes. Both trade filesystem simplicity for Dovecot-managed consistency.
  • Many corruption reports are actually storage semantics bugs—particularly network filesystems that don’t behave like local POSIX filesystems under rename/fsync pressure.
  • Dovecot has historically been picky about index versioning. Upgrades can trigger index rebuilds; mixing old and new binaries against the same indexes can create confusion.
  • Full-text search (FTS) indexes are separate from mailbox indexes and can be wrong while mail is fine. Users will call this “missing emails.” It’s usually “missing search results.”

These are not excuses. They’re clues. The failure modes are shaped by decades of IMAP expectations and the uncomfortable truth that email is a distributed system with opinions.

Triage by storage format: Maildir vs mdbox/sdbox

Maildir: your messages are files

With Maildir, corruption often means directory and filename inconsistencies:

  • Messages exist in tmp/ that never got moved to new/ or cur/.
  • Duplicate filenames or non-unique base names (usually due to broken delivery agent behavior).
  • Clients and server disagree on flags because filenames encode flags, and metadata caches lag behind.
  • Filesystem issues cause partial writes or directory entry weirdness.

Recovery posture: preserve the directory tree, then rebuild Dovecot indexes/caches. If message files are damaged, you’re in extraction territory.
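
A cheap first look at whether the tree is plausible, assuming the maildir layout used throughout this article (mail_location = maildir:/var/vmail/%d/%n; counts are illustrative):

cr0x@server:~$ sudo sh -c 'for d in cur new tmp; do printf "%-4s" "$d"; find /var/vmail/example.com/user/"$d" -type f | wc -l; done'
cur 825
new 0
tmp 17

A big pile in tmp/, or a cur/ count wildly different from what clients report, points you at the Maildir tasks later in this article.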

mdbox/sdbox: your messages are records, and Dovecot manages maps

With mdbox, messages are stored in larger files, and Dovecot uses mapping/indexing to locate message records. Corruption can manifest as:

  • Map/index mismatch: metadata points to a record that doesn’t parse.
  • Log or map corruption after an unclean shutdown or storage hiccup.
  • Disk-full events that truncate a write mid-record.

Recovery posture: be conservative. Snapshot first. Rebuild metadata with Dovecot tools; avoid “fixing” by deleting random files in dovecot.index* unless you understand the format and scope.
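
For orientation, here's roughly what an mdbox user's mail directory looks like (assuming mail_location = mdbox:/var/vmail/%d/%n; names are illustrative). The m.* files are the mail itself and dovecot.map.index* is the map that locates messages; neither is a disposable cache:

cr0x@server:~$ sudo ls /var/vmail/example.com/otheruser/storage/ | head
dovecot.map.index
dovecot.map.index.log
m.1
m.2
m.3
cr0x@server:~$ sudo ls /var/vmail/example.com/otheruser/mailboxes/INBOX/dbox-Mails/
dovecot.index
dovecot.index.cache
dovecot.index.log

The per-mailbox dovecot.index* files under mailboxes/ are the safe-to-regenerate part; everything under storage/ deserves the snapshot-first treatment.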

Practical tasks: commands + what the output means + the decision you make

These are the tasks I actually run. Each includes what to look for and what decision it drives. Adjust paths for your installation and user database.

Task 1: Identify the failing mailbox and error signature

cr0x@server:~$ sudo journalctl -u dovecot --since "2 hours ago" | egrep -i "corrupt|uidvalidity|uidnext|index|mdbox|sdbox|short read|internal error" | tail -n 30
Jan 03 10:41:12 mx1 dovecot[2219]: imap(user@example.com)<11234>: Error: Mailbox INBOX: Corrupted index file /var/vmail/example.com/user/dovecot.index: Header size mismatch
Jan 03 10:41:14 mx1 dovecot[2219]: imap(user@example.com)<11234>: Error: Mailbox INBOX: UIDVALIDITY changed (14211 -> 18602)

Meaning: You have an index corruption signature, plus UIDVALIDITY change (or at least Dovecot thinks it changed). That’s likely metadata rebuild territory, but you must stop writes or risk churn.

Decision: Put the affected mailbox in maintenance mode (block login or route user away), then snapshot before repairs.

Task 2: Confirm Dovecot version and storage backend

cr0x@server:~$ dovecot --version
2.3.19.1
cr0x@server:~$ doveconf -n | egrep -i "mail_location|mail_driver|mail_home|namespace inbox|mail_plugins"
mail_location = maildir:/var/vmail/%d/%n
mail_plugins = quota fts fts_lucene

Meaning: It’s Maildir, and you have quota + FTS. Expect separate FTS issues and quota recalculation needs after repairs.

Decision: Plan to rebuild mailbox indexes first; rebuild FTS indexes only after mail is consistent.

Task 3: Verify filesystem health signals (don’t repair on a dying disk)

cr0x@server:~$ dmesg -T | egrep -i "ext4|xfs|btrfs|zfs|I/O error|blk_update_request|nvme|ata|sense" | tail -n 20
[Fri Jan  3 10:35:22 2026] EXT4-fs warning (device sdb1): ext4_end_bio:342: I/O error 10 writing to inode 928381 (offset 0 size 4096 starting block 1234567)

Meaning: You have write I/O errors. If you rebuild indexes now, you might be writing more metadata into a failing filesystem.

Decision: Stop. Stabilize storage first: failover, remount read-only if necessary, or restore from snapshot/replica.

Task 4: Check free space and inode availability (classic silent killer)

cr0x@server:~$ df -h /var/vmail
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.0T  1.9T   12G 100% /var/vmail
cr0x@server:~$ df -i /var/vmail
Filesystem     Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb1     61054976 61054912     64  100% /var/vmail

Meaning: Full disk and no inodes. Maildir will break in entertaining ways because it can’t create new files for deliveries or index rebuilds.

Decision: Free space/inodes first, then address corruption. Otherwise you’ll “fix” it into a worse state.

Task 5: Snapshot/backup the mailbox directory before changes

cr0x@server:~$ sudo rsync -aHAX --numeric-ids --info=progress2 /var/vmail/example.com/user/ /srv/recovery-snapshots/example.com/user-2026-01-03/
...output...

Meaning: You now have a point-in-time copy you can diff against and roll back from. If rsync errors, that itself is a storage integrity signal.

Decision: If rsync reports read errors, move to forensics mode: clone the filesystem and avoid further writes.

Task 6: Verify mailbox list and basic access with doveadm (read-only check)

cr0x@server:~$ sudo doveadm mailbox list -u user@example.com | head
INBOX
Sent
Drafts
Trash
Archive

Meaning: Dovecot can enumerate mailboxes. If this fails with internal errors, corruption may be broader than a single folder.

Decision: If listing fails, suspect namespace config, permissions, or severe index corruption; plan broader index rebuild.

Task 7: Check mailbox status for inconsistent counters

cr0x@server:~$ sudo doveadm mailbox status -u user@example.com "messages unseen uidvalidity uidnext highestmodseq vsize" INBOX
INBOX messages=842 unseen=19 uidvalidity=18602 uidnext=901 highestmodseq=120044 vsize=188392012

Meaning: You get a coherent status line. If the counters look impossible (uidnext must always exceed the highest assigned UID, so it can never be at or below the message count) or the command errors out, you likely need a resync.

Decision: If UID counters look wrong, schedule doveadm force-resync after backup and write freeze.

Task 8: Force a mailbox resync (metadata rebuild without deleting mail)

cr0x@server:~$ sudo doveadm force-resync -u user@example.com INBOX
cr0x@server:~$ sudo doveadm mailbox status -u user@example.com "messages unseen uidvalidity uidnext" INBOX
INBOX messages=842 unseen=19 uidvalidity=18602 uidnext=902

Meaning: Dovecot re-walked the storage and rebuilt key metadata. UIDNEXT changed by one, which can be normal after reconciliation.

Decision: If errors persist, rebuild indexes explicitly (next task) and check filesystem damage.

Task 9: Rebuild Dovecot indexes for a user (aggressive but common)

cr0x@server:~$ sudo doveadm index -u user@example.com -q INBOX
cr0x@server:~$ sudo doveadm index -u user@example.com -q '*'

Meaning: Recreates index and cache data by walking the mailbox. Note that -q does not mean quiet: it queues the work for Dovecot's indexer process instead of indexing in the foreground. Drop -q if you want doveadm to do the work synchronously, and either way check logs for any “corrupt” messages during indexing.

Decision: If indexing triggers “read() failed” or “Short read”, treat it as storage/message damage, not just metadata.

Task 10: Remove only index/cache files (last resort, but sometimes necessary)

cr0x@server:~$ sudo find /var/vmail/example.com/user/ -maxdepth 2 -type f \( -name "dovecot.index*" -o -name "dovecot-uidlist" -o -name "dovecot-uidvalidity" -o -name "dovecot.list.index*" \) -print
/var/vmail/example.com/user/dovecot-uidlist
/var/vmail/example.com/user/dovecot.index
/var/vmail/example.com/user/dovecot.index.cache
cr0x@server:~$ sudo mv /var/vmail/example.com/user/dovecot.index /var/vmail/example.com/user/dovecot.index.bak
cr0x@server:~$ sudo mv /var/vmail/example.com/user/dovecot.index.cache /var/vmail/example.com/user/dovecot.index.cache.bak
cr0x@server:~$ sudo mv /var/vmail/example.com/user/dovecot-uidlist /var/vmail/example.com/user/dovecot-uidlist.bak

Meaning: You’ve moved the metadata files aside so Dovecot is forced to recreate them from the Maildir. This can change UIDs and annoy clients, but it’s often the cleanest reset.

Decision: Do this only after backup and after stopping writes. If clients support QRESYNC, expect a resync; if not, expect a heavier download.

Task 11: Detect “stranded” Maildir messages in tmp/ (delivery interrupted)

cr0x@server:~$ sudo find /var/vmail/example.com/user/tmp -type f | head
/var/vmail/example.com/user/tmp/1704271042.M12345P6789.mx1,S=2048,W=2090
cr0x@server:~$ sudo find /var/vmail/example.com/user/tmp -type f | wc -l
17

Meaning: Messages stuck in tmp/ can be invisible to clients. They may be half-written, or simply never renamed due to disk-full or crash.

Decision: Inspect a few files (size, headers). If sane, you can move them into new/ cautiously; if not, preserve forensics and don’t inject garbage.

Task 12: Inspect a suspicious message file without mutating it

cr0x@server:~$ sudo ls -lh /var/vmail/example.com/user/tmp/1704271042.M12345P6789.mx1,S=2048,W=2090
-rw------- 1 vmail vmail 2.0K Jan  3 10:30 /var/vmail/example.com/user/tmp/1704271042.M12345P6789.mx1,S=2048,W=2090
cr0x@server:~$ sudo head -n 20 /var/vmail/example.com/user/tmp/1704271042.M12345P6789.mx1,S=2048,W=2090
Return-Path: <sender@example.net>
Delivered-To: user@example.com
Received: by mx1 with LMTP id 7xYk...; Fri, 03 Jan 2026 10:30:41 +0000
Date: Fri, 03 Jan 2026 10:30:40 +0000
From: Sender <sender@example.net>
To: user@example.com
Subject: test
Message-ID: <...>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Meaning: It looks like a valid RFC 5322 message. Good sign.

Decision: If it’s valid, you can move it to new/ (with correct permissions) and resync. If headers are missing/garbled, keep it quarantined.

Task 13: Safely move stranded tmp messages into new/ (only after confirming)

cr0x@server:~$ sudo sh -c 'for f in /var/vmail/example.com/user/tmp/*; do test -f "$f" && mv -n "$f" /var/vmail/example.com/user/new/; done'
cr0x@server:~$ sudo doveadm force-resync -u user@example.com INBOX

Meaning: Messages are now discoverable. The -n prevents overwrite if a name collision occurs.

Decision: If there are collisions, stop and investigate why filenames aren’t unique—likely a broken delivery agent or restored snapshot mixing.

Task 14: Recalculate quota (after repairs; quotas can lie)

cr0x@server:~$ sudo doveadm quota get -u user@example.com
Quota name Type    Value   Limit     %
User quota STORAGE 2097152 2097152 100
cr0x@server:~$ sudo doveadm quota recalc -u user@example.com
cr0x@server:~$ sudo doveadm quota get -u user@example.com
Quota name Type    Value   Limit     %
User quota STORAGE 1495040 2097152  71

Meaning: Before recalculation, the quota system may have been counting messages that weren’t actually present (or missing messages that were). After recalc, it matches reality. (doveadm quota reports STORAGE values in kilobytes.)

Decision: If users report “mailbox full” after repairs, run quota recalc. Do not just raise limits; that hides real storage problems.

Task 15: Diagnose FTS confusion (search broken, mail fine)

cr0x@server:~$ sudo doveadm fts rescan -u user@example.com
cr0x@server:~$ sudo doveadm index -u user@example.com '*'
cr0x@server:~$ sudo doveadm fts optimize -u user@example.com

Meaning: FTS data is rebuilt separately from mailbox indexes. The rescan flags the user’s mail for reindexing, the index run actually rebuilds the search data, and optimize compacts the result (backend-dependent). This fixes “search can’t find the email I’m staring at.”

Decision: Only do this after mailbox indexes are stable; otherwise you index garbage twice.

Task 16: Use dsync as a “truth enforcer” (great with replicas)

cr0x@server:~$ sudo doveadm backup -u user@example.com remote:vmail@replica.example.net
cr0x@server:~$ echo $?
0

Meaning: A successful run is usually silent. doveadm backup (the dsync engine) makes the destination an exact mirror of the source, expunging whatever the source doesn’t have. By default the local side is the source of truth; adding -R reverses the direction so the replica overwrites local. Pick the direction deliberately.

Decision: If the local store is damaged and the replica is good, rerun it with -R so the replica overwrites local, but only after snapshotting the damaged local copy.

Short joke #2: The second most dangerous thing on a mail server is a “quick fix.” The first is a “quick fix” run as root without a snapshot.

Checklists / step-by-step plan (damage-minimizing recovery)

Pick the path that matches your situation. The mistake is treating all corruption like an index rebuild.

Plan A: Index corruption only (most common, least scary)

  1. Confirm scope. Is it one mailbox or many? Use logs and doveadm mailbox status.
  2. Freeze writes for that mailbox. Temporarily block login for the user or move them to a maintenance backend.
  3. Snapshot/backup the mailbox directory. Include hidden and dovecot.* files.
  4. Run a resync. doveadm force-resync -u user@example.com INBOX
  5. Rebuild indexes. doveadm index -u user@example.com -q '*'
  6. Recalculate quota. If you use quota: doveadm quota recalc
  7. Rebuild FTS if users complain about search.
  8. Unfreeze writes and watch logs. If errors recur immediately, stop and check storage/delivery semantics.
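
Plan A's core, as a compact sequence for one user. A sketch: it assumes the user is already locked out and the snapshot target exists; adjust paths and the mailbox mask.

cr0x@server:~$ sudo rsync -aHAX --numeric-ids /var/vmail/example.com/user/ /srv/recovery-snapshots/example.com/user-$(date +%F)/
cr0x@server:~$ sudo doveadm force-resync -u user@example.com '*'
cr0x@server:~$ sudo doveadm index -u user@example.com '*'
cr0x@server:~$ sudo doveadm quota recalc -u user@example.com
cr0x@server:~$ sudo doveadm mailbox status -u user@example.com "messages unseen uidvalidity uidnext" INBOX
INBOX messages=842 unseen=19 uidvalidity=18602 uidnext=902

The index run here is synchronous (no -q), so you see problems immediately instead of hunting through the indexer's logs later.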

Plan B: Maildir directory inconsistencies (tmp/new/cur mess)

  1. Freeze writes. Delivery + IMAP changes must pause for this mailbox.
  2. Backup the entire Maildir. You want the messy version preserved.
  3. Count stranded tmp files. If it’s a few, inspect; if it’s thousands, you have a systemic delivery/fsync issue.
  4. Validate a sample of tmp files. Check headers and size plausibility.
  5. Move valid tmp files to new/ using non-overwrite moves.
  6. Remove/rebuild indexes. Move aside dovecot.index* and dovecot-uidlist, then resync.
  7. Client recovery messaging. Tell users to restart clients; some will re-download headers. If UIDVALIDITY changes, warn about duplicates in some clients.

Plan C: Underlying store damage suspected (I/O errors, short reads, truncated records)

  1. Stop writes immediately. If possible remount mail store read-only or fail over.
  2. Get a byte-for-byte copy or snapshot. Work on the copy.
  3. Attempt non-destructive Dovecot rebuilds on the copy first. If they fail with read errors, do not keep retrying.
  4. Extract readable messages. For Maildir, copy out intact files. For mdbox, use Dovecot tools to export if possible (see the sketch after this list).
  5. Restore from replica/backup for missing parts. Dsync can help reconcile.
  6. Bring up a clean mailbox store. Import extracted messages into a new mailbox and then cut clients over.
  7. Postmortem the storage event. Fix the root cause: disk, controller, network FS semantics, or capacity planning.
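
For step 4, one way to extract what's readable without further stressing the damaged original is to point doveadm at the snapshot copy with a -o override and dsync it into a fresh local Maildir. This is a sketch under assumptions: the snapshot and target paths are examples, a userdb-provided mail location can still override the -o setting, the copy has to be readable enough for dsync to walk it, and the final count is illustrative.

cr0x@server:~$ sudo doveadm -o mail_location=maildir:/srv/recovery-snapshots/example.com/user-2026-01-03 backup -u user@example.com maildir:/srv/rebuilt/example.com/user
cr0x@server:~$ sudo find /srv/rebuilt/example.com/user/cur /srv/rebuilt/example.com/user/new -type f | wc -l
839

Whatever the copy can't yield cleanly is what you go hunting for in replicas and backups in step 5.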

Plan D: Replica exists and seems healthy (the “use the second copy” plan)

  1. Verify replica health signals. Confirm the replica can serve the mailbox without errors.
  2. Snapshot both sides. Yes, even the good one. Replication mistakes can be symmetric disasters.
  3. Choose direction deliberately. If primary is corrupted, prefer replica as source of truth.
  4. Run dsync in the correct mode. Make small tests on one mailbox first (see the sketch after this list).
  5. Validate message counts and a spot-check of recent mail.
  6. Re-enable normal traffic and monitor.
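
For steps 3 and 4, a cautious first pass on a single mailbox might look like this (a sketch: -m restricts the sync to one mailbox, -R reverses the direction so the replica overwrites the local copy, and the replica address is an example):

cr0x@server:~$ sudo doveadm backup -R -u user@example.com -m INBOX remote:vmail@replica.example.net
cr0x@server:~$ sudo doveadm mailbox status -u user@example.com "messages unseen uidnext" INBOX
INBOX messages=842 unseen=19 uidnext=902

If the single-mailbox result looks right, widen the scope; if it doesn't, you've only risked one folder.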

Common mistakes: symptom → root cause → fix

1) Symptom: “Missing emails” but only search is broken

Root cause: FTS index corruption or lag; mailbox is fine.

Fix: Run doveadm fts rescan for affected users/mailboxes after stabilizing mailbox indexes.

2) Symptom: UIDVALIDITY changed; clients duplicate or re-download everything

Root cause: Deleted/recreated dovecot-uidvalidity or forced rebuild that reset UIDVALIDITY; or mailbox got recreated during migration.

Fix: Avoid deleting UIDVALIDITY unless necessary. If already changed, communicate client steps; for stubborn clients, removing and re-adding the account may be required.

3) Symptom: Rebuilding indexes “works” but corruption returns daily

Root cause: Underlying storage semantics issue (network FS, antivirus, backup tool altering files), or disk errors, or out-of-space churn.

Fix: Investigate dmesg, mount options, and third-party file touchers. Fix the environment, then rebuild once.
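
It's worth knowing exactly what the mail store is mounted on; a network filesystem here explains a lot (output is illustrative):

cr0x@server:~$ findmnt -no SOURCE,FSTYPE,OPTIONS /var/vmail
filer01:/export/vmail nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,hard

If it turns out to be NFS or similar, read Dovecot's NFS guidance before rebuilding anything again; settings such as mmap_disable = yes and mail_fsync = always exist precisely because "mostly POSIX" storage eats indexes.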

4) Symptom: Many messages stuck in tmp/ and new deliveries intermittently missing

Root cause: Disk full / inode exhaustion / crash during delivery / broken LMTP/LDA behavior.

Fix: Resolve capacity. Then validate and move tmp to new, followed by force-resync.

5) Symptom: “Internal error occurred. Refer to server log” on mailbox open

Root cause: Permission/ownership mismatch after restore, or index files created by wrong UID, or broken ACL inheritance.

Fix: Verify filesystem ownership and Dovecot’s mail_uid/mail_gid expectations; correct permissions, then rebuild indexes.
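
A quick ownership check against what Dovecot expects (a sketch; the vmail/vmail values and the listing are illustrative, use whatever doveconf reports for your install):

cr0x@server:~$ doveconf mail_uid mail_gid
mail_uid = vmail
mail_gid = vmail
cr0x@server:~$ sudo ls -ln /var/vmail/example.com/user/ | head -n 5
total 1096
drwx------ 2 5000 5000  69632 Jan  3 10:30 cur
-rw------- 1 5000 5000   2210 Jan  3 09:12 dovecot-uidlist
-rw------- 1    0    0 237568 Jan  3 10:41 dovecot.index
-rw------- 1    0    0 524288 Jan  3 10:41 dovecot.index.cache
cr0x@server:~$ sudo chown -R vmail:vmail /var/vmail/example.com/user/

Root-owned index files like the ones above are the classic "someone ran a repair as root" signature; fix ownership, then rebuild indexes.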

6) Symptom: mdbox mailbox shows sudden gaps, “Invalid message size”

Root cause: Truncated mdbox file due to disk-full or I/O error mid-write.

Fix: Snapshot, attempt Dovecot rebuild/export on a copy, restore missing messages from replica/backup. Don’t “rm random files.”

7) Symptom: One folder broken, others fine; error mentions dovecot.list.index

Root cause: Mailbox list index corruption, sometimes after abrupt shutdown or mixed-version binaries.

Fix: Move aside dovecot.list.index* for the user, then have Dovecot regenerate by listing mailboxes and resyncing.
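
A hedged version of that sequence (paths assume the same layout as the earlier tasks; move aside rather than delete, as always):

cr0x@server:~$ sudo mv /var/vmail/example.com/user/dovecot.list.index /var/vmail/example.com/user/dovecot.list.index.bak
cr0x@server:~$ sudo mv /var/vmail/example.com/user/dovecot.list.index.log /var/vmail/example.com/user/dovecot.list.index.log.bak
cr0x@server:~$ sudo doveadm mailbox list -u user@example.com > /dev/null
cr0x@server:~$ sudo doveadm force-resync -u user@example.com '*'

Listing mailboxes forces Dovecot to regenerate the list index; if the .log file isn't present on your install, skip that line.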

8) Symptom: Repairs cause message flag chaos (read/unread flips)

Root cause: Maildir flags in filenames disagree with cache; or clients cache state and reapply changes after reconnection.

Fix: Rebuild indexes; then allow a stabilization window. For key users, recommend restarting clients after server-side repair.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized company ran Dovecot with Maildir on a network filesystem. The team assumed “it’s just files” meant it would behave like local disk. It worked for months, which is how assumptions get promoted to “facts.”

Then a maintenance window happened: a storage failover, a brief stall, and a handful of clients that never stopped syncing. Monday morning: users reported missing messages in Sent, but only in certain subfolders. Logs showed intermittent index corruption and occasional “short read” messages.

The initial response was predictable: rebuild indexes for affected users, one by one. It helped for a few hours, then the same mailboxes “corrupted” again. Someone tried a more aggressive approach—delete dovecot.* across the board—because it had worked once in a lab. This triggered UIDVALIDITY changes for a chunk of mailboxes, and clients responded by re-downloading, duplicating, and generally turning the helpdesk queue into performance art.

The root cause wasn’t Dovecot. The network filesystem did not reliably honor the expected rename+fsync semantics under failover. Maildir depends on those operations to present an atomic “message appears” transition. Under stress, tmp/new transitions were getting weird, and Dovecot’s indexes were just the messenger.

They fixed it by moving mail storage to local disk on the mail nodes and using proper replication at the Dovecot layer. Index rebuilds stopped being a daily ritual. The lesson: a system that “usually behaves like POSIX” is not the same as POSIX when your CEO’s iPhone is doing 20 IMAP operations per second.

Mini-story 2: The optimization that backfired

Another organization wanted faster backups and fewer inodes. Maildir felt “inefficient,” so they migrated to mdbox. It was a reasonable engineering decision with a hidden cost: the team’s operational muscle memory was Maildir-based, and their runbooks were still “just move files around.”

They also tuned for speed: aggressive caching and a busy FTS configuration. Everything was fast, until a disk-full event during peak ingest. Disk-full is not dramatic when it happens; it’s boring, and then everything else becomes dramatic later.

After the event, a subset of users couldn’t open their INBOX. The logs included “Invalid message size” in mdbox, and index rebuilds didn’t resolve it. Someone attempted to delete what looked like “just indexes” in the mdbox directory. That made it worse because they removed mapping files that were not merely caches.

Recovery worked, but it required discipline: snapshot what remained, restore from a replica where possible, and export/import readable messages for the rest. Their optimization did deliver performance, but it demanded higher rigor in capacity monitoring and a better understanding of which files were metadata versus essential mapping.

The lesson: performance optimizations are fine. But every optimization has an operations tax. If you don’t pay it gradually (monitoring, runbooks, training), you’ll pay it all at once during an outage.

Mini-story 3: The boring but correct practice that saved the day

A regulated company ran Dovecot with replication and a strict “snapshot before mutation” policy. It wasn’t glamorous. It also meant on-call had a slower first response, because every runbook began with “freeze, snapshot, verify.”

One day, a batch of mailboxes started throwing index corruption errors after an unclean power event. Users were loud; management was louder. The SRE on duty followed the boring checklist anyway: isolate a mailbox, snapshot it, attempt force-resync, rebuild indexes, then verify with a small set of doveadm status checks.

For the handful of mailboxes that still failed, they didn’t thrash. They immediately switched those users to the replica, then used dsync to reconcile once the primary was stable. No hand edits. No random deletions. No “try it and see” loops.

Because they had snapshots from before each mutation, they could reverse a couple of misguided early attempts by a junior engineer without drama. Auditors later asked how they ensured integrity during recovery. The answer was dull, precise, and correct: immutable snapshots plus controlled rebuilds. That’s the kind of boring that keeps your weekends intact.

FAQ

1) Should I delete dovecot.index to fix corruption?

Sometimes, yes—but only after a backup/snapshot and after stopping writes for that mailbox. Prefer doveadm force-resync and doveadm index first. Deleting index files can cause UID/flag churn and client resync pain.

2) Will doveadm force-resync delete messages?

It’s designed to reconcile metadata with the underlying store, not to delete mail. But if your underlying store is missing messages that indexes referenced, the “result” is that those messages no longer appear. That’s not deletion by the command; it’s reality being acknowledged.

3) Users say mail is gone, but I can see files in Maildir. Why?

Dovecot may not be seeing them due to index corruption, permission issues, or because they’re stuck in tmp/. Validate ownership and location, then resync.

4) How do I handle UIDVALIDITY changes with minimal client chaos?

Avoid triggering them unnecessarily. If they happened, communicate clearly: clients may re-download or duplicate. Encourage users to restart clients; for persistent duplication, removing and re-adding the account is often the cleanest client-side reset.

5) Is Maildir “safer” than mdbox for corruption?

Maildir is easier to inspect and salvage at the file level. mdbox can be more efficient and consistent under heavy load but requires more respect for Dovecot-managed metadata. “Safer” depends on whether your team understands the format and whether your storage behaves correctly.

6) Can antivirus or backup agents cause mailbox corruption?

Yes. Anything that modifies, locks, renames, or partially reads/writes message files and dovecot metadata can create inconsistencies. The classic failure is an agent that scans in-place and races with delivery or index updates.

7) What’s the difference between mailbox index issues and FTS issues?

Mailbox indexes affect folder contents, flags, and basic IMAP operations. FTS affects search results. Users frequently report FTS failures as “missing mail” because they search instead of browsing.

8) Should I run repairs while the server is live?

Not if you can avoid it. For a single mailbox you can sometimes get away with it, but the safest approach is to freeze writes for the affected scope. Repairing while clients are hammering the mailbox is how you get non-repeatable results.

9) How do I know if corruption is caused by the filesystem?

Look for I/O errors in dmesg, filesystem warnings, and rsync/read failures when copying maildirs. If errors correlate with storage events (failover, disk-full, latency spikes), treat storage as the primary suspect.

10) After a repair, why are unread counts wrong?

Unread state is a combination of message flags and index/cache state. After rebuilding, counts may change to match what’s actually on disk. Some clients also “reapply” cached state after reconnect; give it a stabilization window.

Conclusion: next steps you can do today

Mailbox corruption recovery is mostly about discipline: stop writes, preserve evidence, rebuild metadata from the safest source of truth, and only then worry about performance add-ons like FTS. When you skip the snapshot or ignore storage health, you turn a repair into a rewrite of history.

Practical next steps:

  1. Add a runbook gate: “snapshot before mutation” for any mailbox repair.
  2. Instrument disk space and inode alerts specifically for mail storage; disk-full is a corruption factory.
  3. Decide your “source of truth” policy in advance (primary vs replica) and document dsync direction choices.
  4. Audit third-party agents (antivirus, backup, indexing) that touch the mail store directly.
  5. Practice on a staging copy: run doveadm force-resync, doveadm index, quota recalc, and FTS rebuild so on-call isn’t learning during an outage.

If you take nothing else: do fewer things, more carefully. Email is patient; your users are not.
