If you’ve ever had a “but the docs said it was ACID” conversation while a pager keeps vibrating the desk,
you already know the theme: transactions are not a feature, they’re a system behavior. And system behavior
has sharp edges—especially when replication, failover, caches, and application retries get involved.
PostgreSQL and MongoDB both support transactions. They both have isolation knobs. They both claim durability.
In production, the difference is less about “who has transactions” and more about what you actually get
when the network hiccups, the primary steps down, the disk stalls, and your app does the least helpful thing: retry.
Where the docs stop and reality begins
Documentation usually answers “what is supported.” Production asks “what happens when it fails.”
Transactions aren’t just BEGIN/COMMIT; they’re a contract among:
the storage engine, replication protocol, failover behavior, client drivers, and your retry logic.
PostgreSQL’s transaction model is older, stricter, and opinionated. It’s built around MVCC and WAL,
and it tends to fail in recognizable ways: lock contention, bloat, replication lag, hot standby conflicts.
MongoDB’s transactional story arrived later and is layered on top of a replica set and document engine.
It can be perfectly correct—until you mix in write concerns, stepdowns, long-running transactions,
and workload patterns that fight its design.
Here’s the operational truth: if your system needs multi-entity invariants, correct uniqueness, and
predictable rollbacks under concurrency, PostgreSQL is the default. MongoDB can do transactions,
but you pay for them, and you must be explicit about durability and read semantics.
If you treat “supports transactions” as “works like Postgres,” you’ll build a trap for your future self.
Quick history: why these systems behave like they do
Transactions are not a fashion accessory. They’re the product of decades of pain. A few context points that
explain today’s tradeoffs:
- PostgreSQL descends from POSTGRES (late 1980s), designed in the academic era where correctness was a first-class goal, not an optional SKU.
- PostgreSQL’s MVCC became its defining concurrency strategy: readers don’t block writers, but vacuum becomes your silent roommate who eats your CPU at night.
- WAL (Write-Ahead Logging) is the durability backbone of Postgres; it’s why crash recovery is boring in the best way.
- MongoDB started (late 2000s) emphasizing developer agility and document flexibility; early versions leaned heavily on atomicity at the document level.
- Replica sets became MongoDB’s operational center; “primary” and “secondaries” are not just topology, they define read/write semantics.
- MongoDB multi-document transactions arrived later (4.0+), first for replica sets, then for sharded clusters—implemented by coordinating writes with additional bookkeeping.
- “Majority write concern” exists because asynchronous replication is not durability; it’s optimism with a good marketing team.
- PostgreSQL’s isolation defaults to Read Committed, which surprises developers expecting serializable-by-default behavior.
- MongoDB’s read concern levels evolved to address “what did I actually read” in a replicated world (local, majority, snapshot, linearizable).
These histories matter because they predict the failure modes. Postgres tends to protect correctness first,
then makes you tune performance. MongoDB tends to make scaling and schema evolution accessible,
then makes you choose how much correctness you’re willing to buy at runtime.
PostgreSQL transaction model: MVCC, locks, and WAL
MVCC means “reads see a snapshot,” not “locks don’t exist”
PostgreSQL uses MVCC: each row version has visibility metadata, and a transaction sees a snapshot of what’s
committed (at a certain point, depending on isolation). Reads don’t block writes because readers read older
row versions. Writers create new versions. This is elegant, but it creates consequences:
dead tuples, vacuum requirements, and long-running transactions that prevent cleanup.
Locking still matters. Updating a row takes row-level locks. DDL takes heavier locks.
Uniqueness is enforced with index-level locking behavior that’s safe but can be surprising under hot keys.
And if you push concurrency while holding locks too long, Postgres will patiently show you a queue of misery.
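One defensive habit that keeps those queues short: cap how long a statement may wait for a lock and treat the timeout as a signal to back off. A minimal sketch with psycopg2; the DSN, table, and columns are assumptions:
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=app user=app host=pg-primary")  # assumed DSN
try:
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Fail fast instead of joining a long lock queue.
            cur.execute("SET LOCAL lock_timeout = '2s'")
            cur.execute("UPDATE orders SET status = %s WHERE id = %s",  # assumed table/columns
                        ("finalized", 42))
except errors.LockNotAvailable:
    # SQLSTATE 55P03: someone else holds the lock; back off and retry, or surface the error.
    pass
finally:
    conn.close()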
Isolation levels in practice
PostgreSQL supports Read Committed, Repeatable Read, and Serializable.
In production, most systems run Read Committed and rely on constraints and careful queries.
Serializable is real, but it’s optimistic: it may abort transactions with serialization failures under contention.
That’s not a bug; it’s the database telling you “your concurrency story isn’t serializable.”
The “reality differs from docs” moment is when a team turns on Serializable without implementing retries for
SQLSTATE 40001. Then they learn that correctness requires cooperation.
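If you do run Serializable, retries are part of the deal. A minimal sketch of bounded retries with jitter on SQLSTATE 40001, using psycopg2; the DSN and the two-account transfer are assumptions, not a prescription:
import random
import time
import psycopg2
from psycopg2 import errors

def transfer_with_retries(dsn, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # one short transaction per attempt
                with conn.cursor() as cur:
                    cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                    # Assumed business logic: move 10 units between two accounts.
                    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
                    cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
            return  # committed
        except errors.SerializationFailure:
            # SQLSTATE 40001: the schedule was not serializable; back off with jitter and retry.
            if attempt == max_attempts:
                raise
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
        finally:
            conn.close()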
WAL durability: COMMIT is a disk story, not a feelings story
In PostgreSQL, durability is fundamentally about WAL being flushed. The knobs are
synchronous_commit, wal_sync_method, and the behavior of the underlying filesystem and storage.
Replication introduces another layer: is the commit acknowledged by the primary only, or also by replicas?
When someone says “Postgres lost committed data,” it’s usually one of:
asynchronous replication failover, deliberately relaxed durability (synchronous_commit=off),
a storage stack that lied (write cache without proper barriers), or human error.
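One detail worth knowing: synchronous_commit can be relaxed per transaction, which is the sane way to use it. Keep the cluster default strict and opt out only for writes you can genuinely afford to lose. A minimal sketch; the audit-log table and DSN are assumptions:
import psycopg2

conn = psycopg2.connect("dbname=app user=app host=pg-primary")  # assumed DSN
with conn:
    with conn.cursor() as cur:
        # Applies only to this transaction; the instance-wide default stays 'on'.
        cur.execute("SET LOCAL synchronous_commit = 'off'")
        cur.execute("INSERT INTO audit_log (event, payload) VALUES (%s, %s)",  # assumed table
                    ("page_view", '{"user": 7}'))
conn.close()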
The Postgres superpower: constraints are transactionally enforced
Foreign keys, unique constraints, exclusion constraints, check constraints. You can model invariants close
to the data and rely on them under concurrent load. This is not just convenience; it’s reducing the surface
area of application-level race conditions.
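What this looks like in practice is unglamorous DDL. A minimal sketch; the schema is an assumption, but the pattern is the point: the database, not the application, holds the invariant under concurrency:
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=app user=app host=pg-primary")  # assumed DSN
with conn:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS inventory (
                sku      text PRIMARY KEY,
                quantity integer NOT NULL CHECK (quantity >= 0)  -- never oversell
            )""")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS invoices (
                invoice_no text NOT NULL UNIQUE,  -- no duplicate invoice numbers
                order_id   bigint NOT NULL
            )""")

try:
    with conn:
        with conn.cursor() as cur:
            # If this decrement would push quantity below zero, the CHECK constraint
            # aborts the transaction no matter what the application believed.
            cur.execute("UPDATE inventory SET quantity = quantity - 5 WHERE sku = %s", ("ABC",))
except errors.CheckViolation:
    pass  # SQLSTATE 23514: the invariant held; handle it as a business error
conn.close()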
MongoDB transaction model: sessions, write concern, and replica sets
MongoDB transactions exist—but they are not free
MongoDB’s multi-document transactions (replica set and sharded) provide ACID semantics within the transaction.
The “within” is doing a lot of work here. The engine maintains additional state, and the coordinator must manage
commit across participants (especially in sharded deployments). Latency rises. Throughput often drops.
Under contention, you can see more aborts and transient errors that require retries.
If your MongoDB workload is mostly single-document operations, Mongo can be extremely fast and operationally
pleasant. The moment you lean heavily into multi-document transactions for core correctness, you’re asking MongoDB
to behave like a relational database—while still paying the tax of being a distributed document store.
Write concern is the difference between “acknowledged” and “durable enough”
MongoDB’s write concern defines what “success” means for a write. The classic pitfall: using
w:1 (acknowledged by primary) and assuming it means the write will survive failover.
It might. It also might not, depending on timing and replication.
For durability through primary failure, you typically want w:majority plus sane journaling settings
(journaling is enabled by default in modern versions, but confirm). Then you learn the next reality:
majority can be slower, especially with slow secondaries or cross-zone deployments.
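In driver terms, this is a per-collection decision, not a global one. A minimal sketch with PyMongo; the URI, database, and collection names are assumptions:
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://mb-primary,mb-secondary-1,mb-secondary-2/?replicaSet=rs0")  # assumed URI

# Billing writes must survive a primary failover: wait for a majority, but bound the wait.
billing = client.app.get_collection(
    "billing",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)
billing.insert_one({"account_id": 7, "amount_cents": 1250})

# Non-critical telemetry can keep the cheaper deployment default.
client.app.telemetry.insert_one({"event": "page_view"})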
Read concern: what did you actually read?
MongoDB’s read concern controls the visibility guarantees: local, majority,
snapshot, and linearizable (with constraints). Read preference adds another axis:
primary vs secondary reads.
In production, the common mistake is to read from secondaries for “scale” while keeping writes on the primary,
then acting surprised when users see data “go backwards” or when a transaction reads something that later
disappears after rollback. If you treat replication lag as a feature, it will treat your correctness as optional.
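A minimal sketch of the two sane read paths with PyMongo: primary reads with majority read concern for "I just wrote it" flows, and a causally consistent session if you insist on secondary reads. Names and the URI are assumptions:
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://mb-primary,mb-secondary-1,mb-secondary-2/?replicaSet=rs0")  # assumed URI

# Option 1: strict path. Read what a majority has acknowledged, from the primary.
profiles_strict = client.app.get_collection(
    "profiles",
    read_preference=ReadPreference.PRIMARY,
    read_concern=ReadConcern("majority"),
)
profile = profiles_strict.find_one({"user_id": 7})

# Option 2: secondary reads, but keep causal ordering within one session.
with client.start_session(causal_consistency=True) as session:
    client.app.profiles.update_one({"user_id": 7}, {"$set": {"plan": "pro"}}, session=session)
    # This read is causally ordered after the write above, even if it lands on a secondary.
    client.app.get_collection(
        "profiles", read_preference=ReadPreference.SECONDARY_PREFERRED
    ).find_one({"user_id": 7}, session=session)
Per MongoDB's own fine print, causal sessions deliver their guarantees when paired with majority read and write concerns; that caveat is exactly where docs end and production begins.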
Sessions, retries, and the “unknown commit result” tax
MongoDB drivers implement retryable writes and transaction retry behavior. Good—until it’s not.
You can get cases where the client times out and doesn’t know whether the commit succeeded.
Then your application retries and accidentally creates duplicates unless you design idempotency.
That’s not MongoDB being evil; that’s distributed systems being distributed systems. But you need to plan for it.
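Planning for it usually means an idempotency key with a unique index, so a blind retry cannot produce a second side effect. A minimal sketch with PyMongo; the collection and token scheme are assumptions:
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://mb-primary/?replicaSet=rs0")  # assumed URI
payments = client.app.payments
payments.create_index("idempotency_key", unique=True)

def record_payment(idempotency_key, amount_cents):
    try:
        payments.insert_one({
            "idempotency_key": idempotency_key,  # generated once by the caller, reused on retry
            "amount_cents": amount_cents,
            "status": "captured",
        })
    except DuplicateKeyError:
        # The earlier attempt did commit; the retry becomes a no-op instead of a double charge.
        pass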
Semantics that matter: a practical comparison matrix
1) Atomicity scope
PostgreSQL: atomic across any rows/tables touched by the transaction within the database.
MongoDB: atomic at single-document level by default; multi-document atomicity exists via transactions, with overhead
and more operational caveats (especially across shards).
2) Isolation defaults and developer surprise
PostgreSQL default Read Committed: each statement sees a snapshot taken at statement start. Many anomalies are prevented
by constraints and careful query structure, but you can still introduce races if you assume “repeatable” behavior across statements.
MongoDB: outside transactions, you’re mostly in a per-operation world with read concern and write concern.
Inside transactions, you can get snapshot isolation behavior. The surprise is often about what “snapshot” means
relative to replication and read preference.
3) Durability semantics under failover
PostgreSQL primary-only commit is durable on that node once WAL is flushed (modulo storage honesty).
But if you fail over to an async replica, you can lose acknowledged transactions that weren’t replicated.
MongoDB: a write acknowledged with w:1 can roll back on failover. With w:majority,
rollback risk is reduced significantly because the write was replicated to a majority.
4) “Read your writes” and monotonic reads
PostgreSQL on a single node: yes, within the same transaction and session. With replicas, it depends on your
routing; if you read from replicas you can see lag.
MongoDB: if you read from secondaries without causally consistent sessions (and correct settings), you can read stale data.
Even with sessions, topology changes can complicate guarantees. The system can be correct; your assumptions might not be.
5) Observability
PostgreSQL exposes transactional state and locking very directly via system catalogs and views.
MongoDB exposes state via serverStatus/currentOp, profiler, and replication metrics.
Both are observable. Postgres tends to give you clearer “who is blocking whom” pictures; MongoDB tends to make you
reason about replica set health and write concern fulfillment.
6) Operational blast radius
Postgres: one bad query can lock a table, bloat storage, or saturate I/O; failures are often localized to that node,
with replication lag as the secondary symptom.
MongoDB: a sick replica set member, slow secondaries, or elections can turn “fast writes” into “why is everything timing out.”
In sharded clusters, cross-shard transactions can spread pain quickly.
Failure modes you meet at 2 a.m.
PostgreSQL: lock contention and “why is COMMIT slow?”
COMMIT latency in Postgres is usually storage latency (fsync), WAL contention, or synchronous replication waiting.
It’s rarely “Postgres is slow” in the abstract. It’s almost always “your storage or your durability settings are making
your desired semantics expensive.”
Another classic: a long-running transaction prevents vacuum from cleaning dead tuples, bloats indexes,
and then everything becomes slow in a way that looks like “random I/O meltdown.”
MongoDB: elections, write concern waits, and transaction retries
In MongoDB, the operational pain often comes from the replica set doing its job:
elections happen, primaries step down, and clients see transient errors. If your application doesn’t handle these well,
your incident becomes “the database is down” when it’s actually “your retries are a DDoS against your own primary.”
Transactions can amplify this. A transaction held open while doing lots of work keeps resources tied up
and increases the chance that something changes underneath you (like a stepdown), forcing an abort and retry.
The distributed truth: you can’t avoid ambiguity
Both systems can land you in “did it commit?” ambiguity when the client loses the response.
Postgres typically leaves you to implement idempotency at the application layer.
MongoDB makes the issue more explicit with retryable writes and transaction commit behavior—but you still must design
for it.
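On the Postgres side the same design is typically an idempotency key plus ON CONFLICT, so a retry after a lost COMMIT response cannot double-apply. A minimal sketch; the payments schema is an assumption:
import psycopg2

def record_payment(conn, idempotency_key, amount_cents):
    with conn:
        with conn.cursor() as cur:
            # idempotency_key carries a UNIQUE constraint (assumed schema);
            # a retried insert after a lost COMMIT response becomes a no-op.
            cur.execute(
                """
                INSERT INTO payments (idempotency_key, amount_cents, status)
                VALUES (%s, %s, 'captured')
                ON CONFLICT (idempotency_key) DO NOTHING
                """,
                (idempotency_key, amount_cents),
            )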
One quote worth keeping on your wall, because it stays true no matter which database you pick:
Hope is not a strategy.
—Gene Kranz
Joke #1: A transaction is like a promise: it’s only comforting until you have to enforce it in court, aka production.
Three corporate-world mini-stories
Incident: a wrong assumption about “acknowledged writes”
A mid-sized SaaS company ran MongoDB for user profiles and billing state. The schema was clean, the code was modern,
and the team had a habit of reading “acknowledged” as “durable.” Writes used the default write concern, and the app
read from the primary. Things looked fine in staging, and fine for months in production.
Then a routine maintenance window collided with a network flap between availability zones. The primary accepted writes,
acknowledged them, and shortly after stepped down during an election. Some of those last-second writes hadn’t replicated
to a majority. A new primary was elected that didn’t have them.
The incident wasn’t a dramatic outage. It was worse: a handful of customer actions “disappeared.” Support tickets arrived
with screenshots. Engineering initially chased ghosts in the frontend because the backend logs showed successful writes.
The data simply wasn’t there anymore.
The fix was operational and architectural. They moved critical writes to w:majority (with timeouts tuned),
made certain reads use readConcern: "majority" when returning billing-critical state, and added idempotency keys
so retries wouldn’t duplicate side effects. Performance dipped slightly; correctness improved massively.
Optimization that backfired: “let’s speed up Postgres commits”
A fintech team ran PostgreSQL with a write-heavy ledger. They were chasing a p99 latency regression and noticed that
commit latency correlated with I/O spikes. Someone proposed the classic tweak:
set synchronous_commit=off for “non-critical” writes and rely on replication.
It worked. Latency improved instantly. Throughput went up. The graphs looked like a promotion packet.
Then they had an unclean power event on the primary. The box rebooted. WAL for some acknowledged commits had never reached disk.
The database recovered correctly, which here meant coming back without those transactions: under the relaxed setting they were
acknowledged before their WAL was flushed, so they were never guaranteed durable.
The business impact was not catastrophic, but it was humiliating: “confirmed” actions had to be reconciled.
The team learned that durability shortcuts are not free; they are deferred consequences.
They rolled back the setting, moved to faster storage for WAL, and used async processing for truly non-critical events
instead of weakening durability on the core ledger.
Joke #2: Turning off durability to make commits faster is like removing your smoke alarm because it beeps too loudly.
Boring but correct practice: constraint-first design that saved the day
A B2B platform stored orders in PostgreSQL and had a background worker that “finalized” orders by creating invoices,
decrementing inventory, and emailing customers. The workflow was distributed across services, because of course it was.
The team did something unfashionable: they modeled invariants in the database. Inventory decrements were constrained to
never go below zero. Invoice numbers were unique with a strict index. The order state machine was guarded by check constraints
and transitions enforced in transactional SQL.
During a deploy, a bug caused the worker to run twice for a subset of orders. In many systems that becomes a multi-day
cleanup project. Here, the second execution mostly failed fast with constraint violations.
The service logged errors, SREs spotted the spike, and customers mostly never noticed.
The postmortem was short and almost cheerful: the database had acted as a circuit breaker. The fix was to correct the job
scheduling bug and improve idempotency, but the constraints had already prevented financial and inventory corruption.
Boring can be beautiful.
Practical tasks: commands, outputs, and decisions (12+)
These are the checks I run when someone says “transactions are slow” or “we lost data” or “it’s inconsistent.”
Each task includes: a command, a realistic output snippet, what it means, and what decision you make from it.
Task 1 (PostgreSQL): Confirm isolation and durability settings
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SHOW default_transaction_isolation; SHOW synchronous_commit; SHOW wal_level;"
default_transaction_isolation
-----------------------------
read committed
(1 row)
synchronous_commit
--------------------
on
(1 row)
wal_level
-----------
replica
(1 row)
Meaning: Default Read Committed; commits wait for WAL flush; WAL configured for replication.
Decision: If you’re debugging anomalies, confirm whether the app assumes Repeatable Read/Serializable.
If commit latency is high, keep synchronous_commit=on unless you explicitly accept data loss.
Task 2 (PostgreSQL): Identify who is blocking whom
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query FROM pg_stat_activity blocked JOIN pg_locks blocked_locks ON blocked_locks.pid = blocked.pid JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking ON blocking.pid = blocking_locks.pid WHERE NOT blocked_locks.granted;"
blocked_pid | blocked_query | blocking_pid | blocking_query
------------+------------------------------+--------------+-------------------------------
24190 | UPDATE orders SET ... | 23811 | ALTER TABLE orders ADD COLUMN
(1 row)
Meaning: DDL is blocking a write. That’s not “transactions are slow,” that’s “someone took a heavyweight lock.”
Decision: Stop the DDL if it’s unsafe, or schedule it properly. Use concurrent index builds and online migration patterns.
Task 3 (PostgreSQL): Check for long-running transactions preventing vacuum
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT pid, now() - xact_start AS xact_age, state, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_age DESC LIMIT 5;"
  pid  |  xact_age  |        state        |                 query
-------+------------+---------------------+----------------------------------------
 31245 | 02:13:08   | idle in transaction | BEGIN;
 29902 | 00:18:41   | active              | SELECT ... FROM events ORDER BY ...
(2 rows)
Meaning: An idle transaction has been open for hours. That pins old row versions and can cause bloat.
Decision: Fix the application: ensure transactions are short and always committed/rolled back.
Consider idle_in_transaction_session_timeout.
Task 4 (PostgreSQL): Check WAL and checkpoint pressure
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | checkpoint_write_time | checkpoint_sync_time
------------------+-----------------+-----------------------+----------------------
1021 | 487 | 98765432 | 1234567
(1 row)
Meaning: Many requested checkpoints; write time is high. Checkpoint pressure can hurt commit latency and I/O.
Decision: Tune max_wal_size, checkpoint_timeout, and ensure storage can handle WAL and data writes.
Task 5 (PostgreSQL): Measure replication lag and risk window
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
application_name | state | sync_state | write_lag | flush_lag | replay_lag
------------------+-----------+------------+-----------+-----------+------------
pg-replica-1 | streaming | async | 00:00:02 | 00:00:03 | 00:00:05
(1 row)
Meaning: Replica is async and seconds behind. A failover can lose a few seconds of commits.
Decision: If you can’t tolerate that, implement synchronous replication for critical clusters,
or change failover policy to avoid promoting lagging replicas.
Task 6 (PostgreSQL): Spot serialization failures and retry needs
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT datname, xact_commit, xact_rollback, conflicts FROM pg_stat_database WHERE datname='app';"
datname | xact_commit | xact_rollback | conflicts
---------+-------------+---------------+-----------
app | 98765432 | 123456 | 842
(1 row)
Meaning: Rollbacks exist. Not all rollbacks are bad, but spikes can indicate serialization failures or deadlocks.
(The conflicts column counts recovery conflicts and only grows on standbys, so read it there, not on the primary.)
Decision: Correlate with application errors (SQLSTATE 40001, 40P01). Add retry logic with jitter and reduce contention hotspots.
Task 7 (MongoDB): Inspect replica set health and election churn
cr0x@server:~$ mongosh --host rs0/mb-primary,mb-secondary-1,mb-secondary-2 --eval "rs.status().members.map(m=>({name:m.name,stateStr:m.stateStr,health:m.health,optime:m.optime.ts}))"
[
{ name: 'mb-primary:27017', stateStr: 'PRIMARY', health: 1, optime: Timestamp({ t: 1735550101, i: 1 }) },
{ name: 'mb-secondary-1:27017', stateStr: 'SECONDARY', health: 1, optime: Timestamp({ t: 1735550101, i: 1 }) },
{ name: 'mb-secondary-2:27017', stateStr: 'SECONDARY', health: 1, optime: Timestamp({ t: 1735550096, i: 1 }) }
]
Meaning: One secondary is a few seconds behind. That matters for w:majority latency and read-from-secondary staleness.
Decision: If majority writes are slow, investigate lagging members; if secondary reads are used, consider tighter read concerns or routing.
Task 8 (MongoDB): Verify write concern defaults and journaling
cr0x@server:~$ mongosh --host mb-primary --eval "db.getMongo().getWriteConcern()"
{ w: 1 }
Meaning: This deployment defaults to w:1, i.e. “acknowledged by primary,” not “survives primary loss.”
(Recent MongoDB versions default to w:majority in most topologies, so an explicit w:1 default is itself a finding.)
Decision: For critical data, set w:majority at the client or collection level, and use timeouts to avoid hanging writes.
Task 9 (MongoDB): Check for transaction aborts and retry storms
cr0x@server:~$ mongosh --host mb-primary --eval "db.serverStatus().transactions"
{
currentActive: Long('14'),
currentInactive: Long('3'),
currentOpen: Long('17'),
totalAborted: Long('982'),
totalCommitted: Long('184433'),
totalStarted: Long('185415')
}
Meaning: Aborts are happening. Some aborts are normal; spikes during stepdowns or contention are not.
Decision: If aborts climb, shorten transactions, reduce cross-collection work, and confirm driver retry behavior isn’t amplifying load.
Task 10 (MongoDB): Find slow operations that hold transactions open
cr0x@server:~$ mongosh --host mb-primary --eval 'db.currentOp({active:true, secs_running:{$gte:5}}).inprog.map(op=>({secs:op.secs_running,ns:op.ns,desc:op.desc,command:op.command && Object.keys(op.command)}))'
[
{ secs: 31, ns: 'app.orders', desc: 'conn12345', command: [ 'find' ] },
{ secs: 12, ns: 'app.inventory', desc: 'conn23456', command: [ 'update' ] }
]
Meaning: Active ops running for a long time. In transactions, long ops increase conflict probability and resource usage.
Decision: Add indexes, reduce document scan size, and cap transaction work. If it’s a runaway query, kill it deliberately.
Task 11 (MongoDB): Confirm read preference and read concern in the client path
cr0x@server:~$ mongosh --host mb-primary --eval "db.getMongo().getReadPref()"
{ mode: 'secondaryPreferred' }
Meaning: Reads may go to secondaries. That’s a correctness choice, not a free scaling trick.
Decision: For user-facing “I just wrote it” flows, use primary reads or causally consistent sessions and appropriate readConcern.
Task 12 (PostgreSQL): Confirm fsync behavior and catch “lying storage” early
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SHOW fsync; SHOW full_page_writes; SHOW wal_log_hints;"
fsync
-------
on
(1 row)
full_page_writes
------------------
on
(1 row)
wal_log_hints
---------------
off
(1 row)
Meaning: WAL is being fsynced; full page writes are enabled (important for crash safety).
Decision: Keep these on for production unless you have a very specific reason and compensating controls.
If you suspect the storage stack, validate cache settings at the hardware layer as part of incident response.
Task 13 (PostgreSQL): Find deadlocks and the statements causing them
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SELECT deadlocks, conflicts FROM pg_stat_database WHERE datname='app';"
deadlocks | conflicts
-----------+-----------
19 | 842
(1 row)
Meaning: Deadlocks occurred. Postgres will kill one participant; your app sees an error and must retry.
Decision: Identify the tables/queries (via logs), standardize lock ordering in code, and keep transactions small.
Task 14 (MongoDB): Check replication lag in seconds, not vibes
cr0x@server:~$ mongosh --host mb-primary --eval "rs.printSecondaryReplicationInfo()"
source: mb-secondary-1:27017
syncedTo: Mon Dec 30 2025 02:41:55 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: mb-secondary-2:27017
syncedTo: Mon Dec 30 2025 02:41:49 GMT+0000 (UTC)
6 secs (0 hrs) behind the primary
Meaning: One secondary is 6 seconds behind. That affects majority acknowledgment speed and stale reads.
Decision: Investigate that node (disk, CPU, network). If it’s routinely lagging, it will haunt your transaction latencies.
Task 15 (PostgreSQL): Confirm synchronous replication settings if “no data loss” is required
cr0x@server:~$ psql -h pg-primary -U postgres -d app -c "SHOW synchronous_standby_names; SHOW synchronous_commit;"
synchronous_standby_names
--------------------------
'ANY 1 (pg-replica-1)'
(1 row)
synchronous_commit
--------------------
on
(1 row)
Meaning: Primary waits for at least one synchronous standby. This is how you buy fewer “acknowledged but lost” surprises.
Decision: Use this for critical systems, but monitor: it couples write latency to replica health and network quality.
Fast diagnosis playbook
When transaction behavior is “weird,” don’t start by debating databases. Start by locating the bottleneck.
This is the order that gets you to a root cause before the meeting invite arrives.
First: determine whether the pain is latency, throughput, or correctness
- Latency spike: commits slow, queries slow, timeouts.
- Throughput drop: queue growth, worker backlog, rising connections.
- Correctness bug: missing writes, duplicates, out-of-order reads.
Second: check the “durability and replication contract”
- PostgreSQL: is replication async? did failover happen? is synchronous_commit relaxed? is WAL flush slow?
- MongoDB: what is the write concern? are elections happening? are secondaries lagging? are clients reading from secondaries?
Third: check contention
- PostgreSQL: blocking locks, deadlocks, long-running transactions, hot rows, index contention.
- MongoDB: transaction conflicts, long-running ops, write concern waits, shard coordinator overhead.
Fourth: check storage and host saturation
- Disk latency is the invisible hand behind “COMMIT is slow.”
- CPU saturation can look like “locks” because everything slows down and queues form.
- Network jitter can look like “database instability” because replication and elections are sensitive to timeouts.
Fifth: confirm client behavior
- Are you retrying on every error with no backoff? Congratulations, you built an outage multiplier.
- Are timeouts shorter than your commit path during failover? Then you’re generating “unknown commit result” by design.
- Are you mixing read preferences and expecting read-your-writes? Then you’re testing your luck in production.
Common mistakes: symptoms → root cause → fix
1) “Committed data disappeared after failover”
Symptoms: app logs show successful writes; after failover, data is missing.
Root cause: async replication failover (Postgres), or MongoDB writes acknowledged with w:1 rolled back.
Fix: For Postgres, use synchronous replication for critical data or avoid promoting lagging replicas. For MongoDB, use w:majority and design idempotency.
2) “Transactions are slow only during peak”
Symptoms: p95/p99 latency jumps; throughput drops; CPU looks okay; users time out.
Root cause: contention (locks in Postgres; write concern waits or transaction conflicts in MongoDB), or storage fsync saturation.
Fix: Identify blocking chains / long ops. Reduce transaction scope. Add or fix indexes. Move WAL/journal to faster storage if needed.
3) “We enabled Serializable and everything started failing”
Symptoms: Postgres errors with serialization failures; retry storms.
Root cause: Serializable requires retries; high contention triggers aborts.
Fix: Implement bounded retries with jitter; reduce hot spots; consider Repeatable Read plus constraints if appropriate.
4) “MongoDB transaction keeps aborting with transient errors”
Symptoms: high abort rate; stepdowns; app sees transient transaction errors.
Root cause: elections, long-running transactions, conflicts, or cross-shard coordination overhead.
Fix: Shorten transactions, avoid cross-shard transactions when possible, tune timeouts, and ensure driver retry logic is correct and rate-limited.
5) “Reads are inconsistent right after writes”
Symptoms: user updates profile; refresh shows old data; later it appears.
Root cause: reading from replicas/secondaries; insufficient read concern; replica lag.
Fix: Route read-your-writes to primary, or use sessions with causal consistency and appropriate read concern; monitor replication lag.
6) “Postgres disk usage keeps growing; queries slow down gradually”
Symptoms: bloat, rising I/O, autovacuum can’t keep up.
Root cause: long-running transactions and update-heavy tables producing dead tuples; vacuum starvation.
Fix: Eliminate idle-in-transaction; tune autovacuum per table; consider partitioning; schedule maintenance when needed.
7) “We used secondaryPreferred and now we have ‘impossible’ bugs”
Symptoms: counters go backwards; state transitions look out of order.
Root cause: stale reads from secondaries; assumptions about monotonic reads not met.
Fix: Use primary reads for state machines; if using secondary reads, accept staleness explicitly and design UI/logic around it.
8) “Our retry logic made the incident worse”
Symptoms: spike in QPS and connections during outage; DB falls over harder.
Root cause: unbounded retries, no jitter, retries on non-retryable errors, and lack of idempotency keys.
Fix: Implement bounded exponential backoff with jitter; distinguish retryable errors; add idempotency tokens; consider circuit breakers.
Checklists / step-by-step plan
Decision checklist: should this workload use Postgres or MongoDB transactions?
- Do you need cross-entity invariants? If yes, favor PostgreSQL. If MongoDB, expect to use multi-document transactions and pay for them.
- Do you need strong uniqueness guarantees across many documents? PostgreSQL is straightforward. MongoDB needs careful indexing and transaction design.
- Is failover data loss unacceptable? Postgres: synchronous replication and careful failover. MongoDB: w:majority and majority reads where required.
- Is your access pattern mostly single-document? MongoDB shines; don’t accidentally turn it into a relational system via constant multi-document transactions.
- Will developers reliably implement retries? If not, avoid settings that require them frequently (Serializable everywhere; long MongoDB transactions under contention).
- Can you staff operational expertise? Both require it. Postgres expertise often looks like query tuning + vacuum + replication discipline. MongoDB expertise often looks like replica set/shard hygiene + write/read concern discipline.
Implementation plan: Postgres done right for transactional correctness
- Model invariants with constraints (FKs, unique constraints, check constraints).
- Keep transactions short; avoid chatty “transaction as a workflow.”
- Implement idempotency for external side effects (emails, payments, webhooks).
- Choose isolation level intentionally; add retries if using Serializable.
- Decide your failover consistency: async (possible loss) vs synchronous (higher latency).
- Instrument: lock waits, deadlocks, replication lag, checkpoint timing, fsync latency at the host.
- Set guardrails: statement_timeout, idle_in_transaction_session_timeout, connection pool limits (a sketch follows this list).
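A minimal sketch of those guardrails from the application side; the values and the role name are assumptions to tune per workload:
import psycopg2

# Connection-level guardrails (assumed values).
conn = psycopg2.connect(
    "dbname=app user=app host=pg-primary",
    options="-c statement_timeout=5000 -c idle_in_transaction_session_timeout=10000",
)

# Or bake them into the application role once, so every session inherits them.
with conn:
    with conn.cursor() as cur:
        cur.execute("ALTER ROLE app SET statement_timeout = '5s'")
        cur.execute("ALTER ROLE app SET idle_in_transaction_session_timeout = '10s'")
conn.close()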
Implementation plan: MongoDB transactions without self-sabotage
- Default to single-document atomic patterns when possible (embed, use atomic operators).
- When you need multi-document transactions, keep them short and small in document count.
- Set write concern explicitly for critical writes (w:majority plus a timeout); see the sketch after this list.
- Choose read concern and read preference intentionally; don’t mix “secondary reads” with “strict correctness” expectations.
- Implement idempotency keys; design for “unknown commit result.”
- Monitor elections, replication lag, transaction aborts, and lock/queue metrics.
- In sharded clusters, avoid cross-shard transactions for hot paths if you like sleeping.
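A minimal sketch that combines these points with PyMongo's with_transaction helper; the URI, collections, and update logic are assumptions:
from pymongo import MongoClient, WriteConcern, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://mb-primary,mb-secondary-1,mb-secondary-2/?replicaSet=rs0")  # assumed URI
db = client.app

def finalize_order(order_id):
    def txn(session):
        # Keep it small: two documents, no external calls, no user think-time inside.
        db.orders.update_one({"_id": order_id, "status": "pending"},
                             {"$set": {"status": "finalized"}}, session=session)
        db.inventory.update_one({"order_id": order_id},
                                {"$inc": {"reserved": -1}}, session=session)

    with client.start_session() as session:
        # with_transaction retries transient errors and unknown commit results within
        # driver/server limits; the callback may run more than once, so keep it idempotent.
        session.with_transaction(
            txn,
            read_concern=ReadConcern("snapshot"),
            write_concern=WriteConcern(w="majority"),
            read_preference=ReadPreference.PRIMARY,
        )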
FAQ
1) Are MongoDB transactions “real ACID”?
Within the transaction, yes: atomicity and isolation are provided, and durability depends on write concern and journaling.
The production gotcha is that your cluster topology and write concern determine what “durable” means under failover.
2) Does PostgreSQL guarantee no data loss?
On a single node with durable settings and honest storage, committed transactions survive crashes.
In a replicated setup, failover policy matters: promoting an async replica can lose committed transactions.
3) What’s the single most common MongoDB transaction mistake?
Assuming the default write concern implies survivable commits. If you need failover-safe durability, be explicit with w:majority.
4) What’s the single most common PostgreSQL transaction mistake?
Holding transactions open too long (often “idle in transaction”). It causes bloat, lock retention, and performance collapse that looks unrelated—until it’s not.
5) Is PostgreSQL Serializable the same as “just like serial execution”?
It aims for serializable behavior but uses an optimistic approach that can abort transactions.
You must handle retries correctly, or you’ll convert correctness into downtime.
6) Can MongoDB give me linearizable reads?
MongoDB supports linearizable reads in limited scenarios, typically from the primary with specific settings.
They’re slower and more restrictive; use them only when you truly need that guarantee.
7) Why are MongoDB multi-document transactions slower than single-document ops?
Because the database has to coordinate multiple writes, track transaction state, and ensure atomic commit semantics.
In sharded clusters, coordination costs go up further.
8) Why does Postgres sometimes “freeze” during schema changes?
Many DDL operations take heavyweight locks that block reads/writes.
Use safer migration patterns (concurrent indexes, phased migrations, and avoiding long locks in peak hours).
9) If I use MongoDB with w:majority, am I safe?
Safer, not magically safe. You still need correct retry logic, idempotency, and a plan for timeouts and unknown outcomes.
Majority also increases sensitivity to slow secondaries and network issues.
10) If I use Postgres synchronous replication, am I safe?
You reduce the risk of losing acknowledged commits, but you trade for higher commit latency and a dependence on replica health.
You still need operational discipline: monitoring, capacity planning, and tested failover.
Conclusion: next steps you can actually do
Stop treating transactions as a checkbox. Treat them as a contract you must verify under failure.
PostgreSQL and MongoDB can both run reliable transactional systems, but they make different promises by default,
and they punish different kinds of laziness.
- Write down your correctness requirements: what can be stale, what can be lost, what must be unique, what must be atomic.
- Make durability explicit: Postgres replication mode and failover policy; MongoDB write concern and read concern.
- Practice failure: simulate stepdowns, kill clients mid-commit, and verify the application handles ambiguity without duplication.
- Instrument the truth: lock waits, aborts, replication lag, fsync latency, elections. If you can’t see it, you can’t own it.
- Choose the database that matches your invariants: if you need relational constraints and cross-row correctness, default to Postgres. If your model is document-first and mostly single-document atomic, MongoDB is a strong fit—just don’t pretend it’s Postgres.