Your system is calm. Graphs are boring. Then marketing “drops” something, a partner retries aggressively, or one cron job decides it’s time to reindex the universe. Write traffic goes vertical. Latency follows. Suddenly you’re learning what your database actually does when it’s scared.
This is not a religious war between SQL and NoSQL. This is a survival guide for write bursts: what MariaDB (InnoDB) and MongoDB (WiredTiger) do under pressure, where they fail, how to diagnose the bottleneck quickly, and what changes actually move the needle in production.
What a write burst really is (and why it hurts)
A write burst isn’t “high QPS.” It’s a mismatch between incoming write demand and the slowest durable step in your write path. That step might be disk flush latency, log serialization, lock contention, replication acks, or checkpointing.
Under steady load, both MariaDB and MongoDB can look heroic. Under bursts, the illusion ends because both engines eventually hit a wall where they must either:
- Apply backpressure (clients wait; queues grow; tail latency explodes),
- Drop durability (ack before safely persisted),
- Or fall over (OOM, disk-full, replica lag spiral, thread exhaustion).
When you ask “who survives spikes,” the real question is: who fails predictably, and who gives you enough controls to fail safely—without turning the incident channel into a live podcast.
Interesting facts and historical context
- MariaDB was created after Oracle acquired Sun (and therefore MySQL), driven by governance and licensing concerns rather than performance alone.
- MongoDB popularized a developer-friendly document model at a time when sharding relational databases was still mostly “a weekend project and a future regret.”
- InnoDB (the default MariaDB storage engine) centers durability around a redo log and background flushing—writes aren’t “just writes,” they’re log writes plus later page flushes.
- MongoDB’s WiredTiger uses a write-ahead log (journal) and checkpoints; burst behavior often hinges on checkpoint cadence and cache pressure.
- Group commit made transactional databases dramatically better at bursts by batching fsync costs across multiple transactions.
- MongoDB write concerns (w:1, majority, journaling) are effectively a dial between “fast” and “provably durable,” and your choice matters more during spikes.
- Replication lag is not just “a replica problem”; it can feed back into primaries when clients require majority acks.
- NVMe adoption changed the shape of bottlenecks: with high IOPS devices, CPU, contention, and log serialization can become the limiting factor sooner.
- Cloud networking made “write acknowledgment time” a distributed systems problem: majority writes can be limited by the slowest voter’s latency, not the primary’s CPU.
MariaDB under bursts: the InnoDB reality
The write path that matters
Under InnoDB, a typical transactional write involves:
- Modifying pages in the buffer pool (memory) and marking them dirty.
- Appending redo records to the redo log buffer.
- On commit, ensuring redo is durable according to innodb_flush_log_at_trx_commit.
- Later, flushing dirty pages to data files in the background.
During bursts, three things dominate:
- Redo log durability latency (fsync cost and log serialization).
- Dirty page accumulation (buffer pool fill, then forced flushing).
- Contention (row locks, secondary index maintenance, hot pages, and sometimes metadata locks).
How MariaDB “survives” a burst
MariaDB tends to survive bursts if:
- Your storage has predictable fsync latency (not just high IOPS).
- Your redo log is sized and configured to avoid constant “flush pressure.”
- You avoid pathological secondary indexes on write-heavy tables.
- You accept that strict durability costs something, and you tune around it rather than pretending it’s free.
MariaDB’s failure mode under bursts is usually latency explosion, not immediate crash—unless you run it into swap, fill the disk, or let replication/connection handling spiral.
InnoDB knobs that actually matter during spikes
- innodb_flush_log_at_trx_commit: 1 is safest; 2 is common for better throughput; 0 is “I hope the power never blinks.”
- sync_binlog: if you use binlog (replication), this is the other half of durability. Setting it to 1 is safe and can be expensive.
- innodb_log_file_size and innodb_log_files_in_group: larger logs smooth bursts; too small forces aggressive flushing.
- innodb_io_capacity and innodb_io_capacity_max: if too low, dirty pages pile up; too high, you can starve foreground I/O.
- innodb_buffer_pool_size: enough memory prevents reads from fighting writes; too much can hide problems until checkpoints hit like a truck.
- innodb_flush_method=O_DIRECT: often reduces double buffering pain on Linux.
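For concreteness, here is a minimal my.cnf sketch that wires these knobs together, assuming a dedicated 64GB host with NVMe storage. Every value is an illustrative placeholder to size against your own RAM, devices, and recovery objectives, not a recommendation.
[mysqld]
# Durability pair: relax only after a written risk review
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1
# Redo sized to absorb a burst without constant flush pressure (assumed value)
innodb_log_file_size           = 4G
# Buffer pool sized to the host, leaving room for the OS and connections (assumed value)
innodb_buffer_pool_size        = 48G
# Flushing budget matched to what the device actually sustains, not its spec sheet
innodb_io_capacity             = 2000
innodb_io_capacity_max         = 4000
innodb_flush_method            = O_DIRECT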
Backpressure in MariaDB: mostly implicit
MariaDB doesn’t give you a simple “queue depth” dial. Backpressure shows up as:
- Commit latency rising (redo fsync).
- Threads waiting on locks or I/O.
- Replication falling behind (if writes also produce binlog and replicas can’t keep up).
When MariaDB gets into trouble, it often looks like “the database is up, but nothing finishes.” That’s still a form of mercy. Your app can time out. You can shed load. You can do something.
MongoDB under bursts: the WiredTiger reality
The write path that matters
MongoDB’s write behavior depends heavily on write concern and journaling, but the usual path is:
- Write enters the primary and updates in-memory structures (WiredTiger cache).
- Journal (write-ahead log) is appended; durability depends on journaling and commit intervals.
- Replication sends operations to secondaries via the oplog; majority write concern waits for voters.
- Checkpoints periodically flush data to disk, producing bursts of I/O.
How MongoDB “survives” a burst
MongoDB survives write bursts when:
- Your disk can absorb journaling + checkpoint I/O without long-tail fsync spikes.
- You’re not running so hot that the WiredTiger cache is constantly evicting dirty pages.
- Your replication topology can keep up with majority acks (or you’re willing to relax write concern during bursts).
- Your documents and indexes are designed for write locality (or at least not designed to sabotage it).
MongoDB’s nastiest burst failure mode is replication lag + majority write concern + slow disk. That combination turns a burst into a self-inflicted distributed lockstep.
WiredTiger knobs and behaviors that show up during spikes
- Write concern: w:1 vs majority, and j:true. This is your latency vs durability switchboard.
- Checkpointing: checkpoints create periodic I/O spikes; if the system is already stressed, checkpoints can amplify stall behavior.
- Cache pressure: when cache is full, eviction becomes the hidden governor on write throughput.
- Oplog size: too small and secondaries fall off; too big and recovery times, storage, and IO profiles change.
- Index write amplification: each secondary index is extra work. MongoDB is not exempt from physics.
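Two read-only checks put numbers behind several of these: the oplog window (how long a secondary can lag before falling off) and the cluster-wide default write concern. Both are safe to run during an incident; getDefaultRWConcern assumes MongoDB 4.4 or newer.
cr0x@server:~$ mongosh --quiet --eval 'rs.printReplicationInfo()'
cr0x@server:~$ mongosh --quiet --eval 'db.adminCommand({getDefaultRWConcern: 1})'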
Backpressure in MongoDB: more explicit, but easier to misread
MongoDB can apply backpressure when it cannot keep journaling/checkpointing or when replication cannot satisfy write concern fast enough. In practice you’ll see:
- Queueing in the driver, then timeouts.
- Increasing “WT cache eviction” activity.
- Oplog replication lag growing until elections, rollbacks, or read staleness become a business problem.
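If you want to watch that backpressure develop in real time, mongostat (from the MongoDB database tools) is a reasonable first screen; exact column names vary by version, so treat the ones called out here as assumptions.
cr0x@server:~$ mongostat --host mongo01:27017 5
The cache dirty/used percentages and the queued readers/writers columns are the ones to watch: dirty climbing while queues grow means the engine is accepting writes faster than it can persist them.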
The trap: teams interpret “MongoDB accepts writes fast” as “MongoDB is durable fast.” Under bursts, those are different sentences.
Who “survives” spikes: a decision table you can defend
If you need strict transactional guarantees under bursts
Favor MariaDB when you need multi-row transactional integrity, complex constraints, and predictable semantics during bursts. InnoDB’s behavior is well understood: commit cost is about log durability and contention, not mystery.
MongoDB can be durable and consistent, yes. But once you require majority and journaling, you’ve entered distributed latency land. Under spikes, that land gets expensive quickly.
If you need flexible schema and you can tolerate tuning write concern
Favor MongoDB when the document model is actually your data model (not “we didn’t want to design tables”), and you can explicitly choose durability levels during bursts.
MongoDB’s advantage is operational: for certain workloads it’s easier to shard horizontally and keep writes flowing—assuming you’re disciplined about keys and indexes.
If your spikes are “bursty ingestion” and reads are secondary
Both can ingest. The difference is where they pay:
- MariaDB pays at commit (redo/binlog fsync), and later via flushing if you overrun dirty page limits.
- MongoDB pays via journaling and checkpoints, and potentially via replication lag when write concern is strict.
If you can only afford one reliable thing: predictable latency
This is where I’m opinionated: choose the system whose failure mode you can operationalize.
- MariaDB tends to degrade into “slow but correct” when configured sanely.
- MongoDB can degrade into “fast but not where you think the truth is” if you’ve been casual with write concern, replication health, or disk.
Joke #1: A write burst is just your users running a load test you didn’t budget for.
The quote you should tape to the dashboard
Hope is not a strategy.
— General Gordon R. Sullivan
That line survives because it’s the truest thing ever said about production writes.
Fast diagnosis playbook
You have 10 minutes before someone suggests “just add more pods.” Here’s what to check first, second, third—so you can identify the real bottleneck instead of treating symptoms.
First: is it disk flush latency, or CPU/locks?
- Check disk utilization and await: if await is high and util is pegged, you’re I/O-bound.
- Check CPU steal and saturation: if CPU is pegged or you’re throttled, you’re compute-bound.
- Check lock waits: if threads are waiting on locks, I/O graphs can lie.
Second: is durability configuration forcing syncs on every write?
- MariaDB: innodb_flush_log_at_trx_commit, sync_binlog, binlog format, commit batching.
- MongoDB: write concern (w, j), commit interval behavior, majority waits.
Third: is replication the hidden limiter?
- MariaDB: replicas applying binlog slowly, semi-sync settings, or slow network storage on replicas.
- MongoDB: secondaries lagging, elections, majority write concern waiting on a sick node.
Fourth: are you checkpointing/flushing too aggressively (or too late)?
- MariaDB: dirty page percentage, redo log pressure, background flushing settings.
- MongoDB: WiredTiger cache eviction rate, checkpoint durations, journal pressure.
Fifth: are indexes and schema the real tax?
Bursts expose write amplification. A schema that “worked fine” at 1k writes/s can implode at 20k writes/s because every write touches too many indexes, too many hot keys, or too many secondary structures.
Practical tasks: commands, outputs, and what you decide
These are the tasks I actually run when a burst hits. Each includes the command, example output, what it means, and the decision you make. No heroics, just receipts.
1) Linux: confirm disk saturation and latency
cr0x@server:~$ iostat -x 1 5
Linux 6.5.0 (db01) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 6.20 18.40 0.00 63.30
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 110.0 4200.0 3.2 180.5 87.1 35.2 8.4 1.9 8.6 0.23 99.2
What it means: %util near 100 and a high avgqu-sz show a deep queue. Write await is ~8.6ms. Under bursts, this often becomes 20–100ms, and databases start timing out.
Decision: If await rises with load, you’re I/O-bound. You either reduce fsync frequency (carefully), batch commits, or move to faster/less contended storage.
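To measure fsync tail latency directly rather than inferring it from iostat, a short fio probe against the database volume is usually enough. A minimal sketch, assuming a scratch directory on the same device (never point it at live data files); size and runtime are placeholders.
cr0x@server:~$ fio --name=fsync-probe --directory=/var/lib/mysql-scratch --ioengine=sync --rw=write --bs=4k --size=256m --fdatasync=1 --runtime=60 --time_based
Read the fdatasync latency percentiles in the output: that p99 is what gates commit latency, and if it is tens of milliseconds, no database knob will hide it.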
2) Linux: check pressure stalls (CPU/memory/IO contention)
cr0x@server:~$ cat /proc/pressure/io
some avg10=12.45 avg60=8.21 avg300=2.14 total=9382211
full avg10=4.10 avg60=2.01 avg300=0.50 total=1923112
What it means: The kernel is telling you tasks are stalled waiting on I/O. “full” indicates periods where no task could make progress due to IO.
Decision: Treat as a platform bottleneck first. Database tuning won’t beat a saturated storage stack.
3) Linux: verify filesystem space and inode headroom
cr0x@server:~$ df -h /var/lib/mysql /var/lib/mongodb
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 1.8T 1.6T 120G 94% /
/dev/nvme1n1p1 1.8T 1.2T 560G 69% /data
What it means: 94% used is flirting with performance cliffs and operational mistakes. Some filesystems behave badly when nearly full; databases behave badly when they can’t extend files.
Decision: If a write burst is in progress and you’re above ~90%, you prioritize freeing space or adding capacity before you chase query plans.
4) MariaDB: see what the server thinks is happening right now
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" | head -n 40
*************************** 1. row ***************************
Id: 4312
User: app
Host: 10.10.2.14:51132
db: prod
Command: Query
Time: 12
State: Waiting for redo log flush
Info: COMMIT
*************************** 2. row ***************************
Id: 4321
User: app
Host: 10.10.2.18:51410
db: prod
Command: Query
Time: 8
State: Updating
Info: INSERT INTO events ...
What it means: “Waiting for redo log flush” is a flashing sign: commits are gated by log durability. That’s classic write-burst pain.
Decision: Investigate fsync latency, redo log sizing, and durability settings. Don’t start adding indexes right now; you’re drowning in flush cost.
5) MariaDB: confirm durability settings (the real ones)
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog','binlog_format')"
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| binlog_format | ROW |
| innodb_flush_log_at_trx_commit| 1 |
| sync_binlog | 1 |
+-------------------------------+-------+
What it means: This is “maximum safety” territory: redo fsync every commit, binlog fsync every commit. Under bursts, that can turn into a commit-latency bonfire unless your storage is excellent.
Decision: If the business can tolerate a tiny window of loss during power failure, consider innodb_flush_log_at_trx_commit=2 and/or sync_binlog=100 for bursty ingestion—after a deliberate risk review.
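If that risk review approves it, both settings are dynamic in MariaDB, so you can relax them for the burst window and revert without a restart; the values below simply mirror the example above and are not a blanket recommendation.
cr0x@server:~$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2; SET GLOBAL sync_binlog = 100;"
cr0x@server:~$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1; SET GLOBAL sync_binlog = 1;"  # revert once the burst is over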
6) MariaDB: inspect InnoDB dirty pages and flush pressure
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | egrep -i "Modified db pages|Log sequence number|Log flushed up to|pages flushed" -n | head
121:Log sequence number 98422341122
122:Log flushed up to 98422338816
401:Modified db pages 812345
405:Pages flushed up to 255112233
What it means: A large number of modified pages indicates dirty page buildup. If “Log flushed up to” lags LSN during a burst, commits queue behind fsync.
Decision: If dirty pages stay high and flushing can’t keep up, adjust I/O capacity settings and redo log size. If fsync is slow, fix storage first.
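Before touching anything, pull the current sizing so the decision is grounded in numbers rather than memory; variable names can differ slightly across MariaDB versions, so treat this as a sketch.
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('innodb_log_file_size','innodb_log_files_in_group','innodb_io_capacity','innodb_io_capacity_max','innodb_max_dirty_pages_pct')"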
7) MariaDB: detect lock contention hot spots
cr0x@server:~$ mysql -e "SELECT * FROM information_schema.INNODB_LOCK_WAITS\G" | head -n 60
*************************** 1. row ***************************
requesting_trx_id: 123456789
requested_lock_id: 123456789:45:3:12
blocking_trx_id: 123456700
blocking_lock_id: 123456700:45:3:12
What it means: Transactions are waiting on other transactions. During bursts, one hot row or hot index page can serialize throughput.
Decision: If lock waits correlate with spikes, you fix access patterns (shard keys, partitioning, avoid hot counters) rather than tuning fsync.
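To see which statements are actually doing the blocking (not just that blocking exists), a join against the transaction view helps. This is a sketch for MariaDB's information_schema tables; verify the column names on your version before it goes into a runbook.
cr0x@server:~$ mysql -e "SELECT w.requesting_trx_id, r.trx_query AS waiting_query, w.blocking_trx_id, b.trx_query AS blocking_query FROM information_schema.INNODB_LOCK_WAITS w JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id\G"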
8) MariaDB: verify buffer pool hit rate and read/write contention
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';"
+---------------------------------------+-----------+
| Variable_name | Value |
+---------------------------------------+-----------+
| Innodb_buffer_pool_read_requests | 982341122 |
| Innodb_buffer_pool_reads | 8123411 |
+---------------------------------------+-----------+
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| Innodb_buffer_pool_pages_dirty | 81234 |
+--------------------------------+-------+
What it means: If physical reads are climbing during a write burst, your buffer pool may be too small or you’re thrashing. Dirty pages also matter: too many of them trigger flushing storms.
Decision: If reads spike with writes, add memory or reduce read pressure (cache, query changes). If dirty pages are high, tune flushing and log sizing.
9) MongoDB: check replica set health and lag
cr0x@server:~$ mongosh --quiet --eval 'rs.status().members.map(m=>({name:m.name,state:m.stateStr,health:m.health,lagSeconds:(m.optimeDate?Math.round((new Date()-m.optimeDate)/1000):null)}))'
[
{ name: 'mongo01:27017', state: 'PRIMARY', health: 1, lagSeconds: 0 },
{ name: 'mongo02:27017', state: 'SECONDARY', health: 1, lagSeconds: 6 },
{ name: 'mongo03:27017', state: 'SECONDARY', health: 1, lagSeconds: 84 }
]
What it means: One secondary is 84 seconds behind. If clients use w:"majority", this lag can directly add latency (or timeouts) depending on voting and write concern requirements.
Decision: If majority acks are required, fix the lagging node (disk, CPU, network) or remove it from voting temporarily, with change control.
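A minimal sketch of taking a sick member out of the ack path, assuming member index 2 is the lagging node (confirm against rs.conf() first) and that you are connected to the primary; a non-voting member must also have priority 0.
cr0x@server:~$ mongosh --quiet --eval 'cfg = rs.conf(); cfg.members[2].votes = 0; cfg.members[2].priority = 0; rs.reconfig(cfg)'
Revert once the node is repaired; running long-term with fewer voters changes your failover math.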
10) MongoDB: confirm write concern used by the application
cr0x@server:~$ mongosh --quiet --eval 'db.getMongo().getWriteConcern()'
{ w: 'majority', wtimeout: 0 }
What it means: Majority writes. Great for correctness; under bursts, you now depend on replication health and inter-node latency.
Decision: During planned ingestion bursts, consider a different write concern only if the data can be rebuilt and the risk is documented. Otherwise, scale the replica set and storage to match the requirement.
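If you do relax durability for a rebuildable ingest path, prefer setting write concern per operation rather than changing the cluster default; the database and collection names below are assumptions for illustration.
cr0x@server:~$ mongosh --quiet --eval 'db.getSiblingDB("prod").events_raw.insertOne({ts: new Date(), src: "ingest"}, {writeConcern: {w: 1}})'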
11) MongoDB: check WiredTiger cache pressure and eviction churn
cr0x@server:~$ mongosh --quiet --eval 'var s=db.serverStatus().wiredTiger.cache; ({ "bytes currently in cache": s["bytes currently in the cache"], "tracked dirty bytes": s["tracked dirty bytes in the cache"], "pages evicted": s["pages evicted"] })'
{
"bytes currently in cache": 29192355840,
"tracked dirty bytes": 9423123456,
"pages evicted": 182341122
}
What it means: High dirty bytes and rapid eviction often mean the engine is spending cycles pushing dirty data out, which can throttle writes and amplify checkpoint stalls.
Decision: If eviction is intense during bursts, revisit cache sizing, document size/indexes, and disk throughput. Don’t “just add RAM” without checking checkpoint behavior.
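To put the dirty-byte number in context, compare it against the configured cache ceiling; the WiredTiger stat names below come from serverStatus and can shift between versions, so verify them on your build.
cr0x@server:~$ mongosh --quiet --eval 'var c=db.serverStatus().wiredTiger.cache; ({ max_configured: c["maximum bytes configured"], in_cache: c["bytes currently in the cache"], dirty: c["tracked dirty bytes in the cache"] })'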
12) MongoDB: inspect current operations for lock/IO symptoms
cr0x@server:~$ mongosh --quiet --eval 'db.currentOp({active:true,secs_running:{$gte:2}}).inprog.slice(0,3).map(o=>({opid:o.opid,secs:o.secs_running,ns:o.ns,desc:o.desc,waitingForLock:o.waitingForLock,locks:o.locks}))'
[
{
opid: 18231,
secs: 9,
ns: 'prod.events',
desc: 'conn31291',
waitingForLock: false,
locks: { Global: 'w' }
},
{
opid: 18247,
secs: 5,
ns: 'prod.events',
desc: 'conn31340',
waitingForLock: false,
locks: { Global: 'w' }
}
]
What it means: Long-running writes can indicate downstream I/O stalls, index contention, or journaling pressure. Modern MongoDB has fine-grained locking, but global write lock presence can still show in aggregates.
Decision: If operations run long without lock waits, suspect storage latency or checkpoint/journal issues. Correlate with disk metrics.
13) Linux: confirm you’re not being throttled by cgroups
cr0x@server:~$ systemctl show mariadb -p CPUQuota -p MemoryMax
CPUQuota=50%
MemoryMax=8589934592
What it means: You gave the database half a CPU and 8GB RAM and expected it to swallow bursts. Bold.
Decision: Remove artificial throttles for stateful services, or explicitly design for them with admission control and queueing.
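Removing the throttle is a systemd drop-in, not a database change. The snippet below is what you would put into the editor that systemctl edit opens: an empty CPUQuota clears the quota, MemoryMax=infinity removes the cap, and a restart applies it (which is itself a decision to make consciously mid-incident).
cr0x@server:~$ sudo systemctl edit mariadb
[Service]
CPUQuota=
MemoryMax=infinity
cr0x@server:~$ sudo systemctl restart mariadb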
14) Linux: check swap activity (the silent killer)
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 2 524288 112340 8204 3124000 40 120 210 1820 980 2200 22 7 48 23 0
What it means: Swap in/out during bursts turns “fast storage” into “slow misery.” Database latencies become unbounded.
Decision: Reduce memory pressure (buffers, cache sizing), fix co-located workloads, or move the database to a host that doesn’t treat RAM as optional.
Three corporate mini-stories (how this fails in real life)
Mini-story #1: The incident caused by a wrong assumption
A mid-size SaaS company ran MongoDB for event ingestion. The data model made sense: events were documents, schema evolved weekly, and sharding was on the roadmap. They had a replica set with three members across zones and felt pretty mature about it.
Then a partner system started retrying on 500s with no jitter. Writes spiked, and the primary’s CPU stayed reasonable. The team assumed “MongoDB is fine with high write throughput,” because it had been fine before. Latency climbed anyway, and then the app started timing out on writes with w:"majority". The incident commander stared at the primary’s disk graphs and saw something that looked merely “busy,” not catastrophic.
The wrong assumption was subtle: they thought majority write concern meant “two out of three, always quick.” In reality, one secondary had been degraded for weeks—still healthy enough to vote, unhealthy enough to lag badly. Under the burst, it fell further behind. Majority acks now depended on the slowest voter catching up often enough to acknowledge new operations.
They stabilized by pulling the degraded secondary out of the voting set (and re-adding it after repair), which changed what a majority ack had to wait for. The long-term fix was operational: monitor replication lag with alerting that triggers before it’s a crisis, and don’t let a sick voter stay in the quorum for convenience.
They didn’t “outscale” the incident. They removed the broken assumption that topology health is optional during bursts. It is not.
Mini-story #2: The optimization that backfired
An e-commerce platform ran MariaDB for orders and inventory. They expected bursty writes during launches, so they tuned for speed: bigger buffer pool, aggressive thread settings, and they set innodb_flush_log_at_trx_commit=2 to reduce fsync pressure. It worked. Commit latency dropped. Everyone celebrated.
Then the backfire: the same team also increased concurrency in the app to “use the new headroom,” and they added a few secondary indexes to support new dashboards. During the next burst, transactions piled up behind row locks on inventory rows and hot secondary index pages. At the same time, dirty pages accumulated fast. When the system finally had to flush aggressively, I/O spiked, latency went nonlinear, and the app’s retries turned it into a storm.
The optimization wasn’t wrong by itself. The mistake was thinking a single knob is a capacity upgrade. By making commits cheaper, they encouraged more concurrent writes into the same hot spots, amplifying contention and background flushing pressure.
The fix was boring engineering: remove one dashboard index that was punishing the write path, shard the hottest inventory updates by adding a partitioning strategy, and implement client-side rate limiting so retries don’t become load. Only then did their durability tuning produce stable gains.
Mini-story #3: The boring but correct practice that saved the day
A fintech-ish company (the kind that says “fintech” but mostly sells subscriptions) ran both: MariaDB for transactions and MongoDB for audit events. They lived in a world of compliance checks and uncomfortable questions, so they were allergic to “just set it to 0.”
They also did something unsexy: they ran a monthly “burst rehearsal.” Not a synthetic benchmark in a lab—an actual controlled load increase in staging with production-like data volumes and the same storage class. They practiced failure: replica lag, disk latency injection, primary failover. Someone always complained it was time-consuming. Someone else always found something important.
One month, the rehearsal revealed that a new MongoDB secondary in a different zone had worse fsync latency. Majority writes looked fine at normal traffic and became a problem only when write rate doubled. They fixed it before the real campaign hit by moving the node to a better storage tier and adjusting voting so the slow member didn’t dictate ack latency.
During the actual burst weeks later, both databases ran hot but stable. The team didn’t look brilliant. They looked boring. That’s the point.
Common mistakes: symptoms → root cause → fix
1) Symptom: commits stall, many threads “Waiting for redo log flush” (MariaDB)
Root cause: fsync latency or redo log serialization becomes the governor; log too small increases pressure.
Fix: Move to lower-latency storage; increase redo log size; validate innodb_flush_log_at_trx_commit and sync_binlog against business durability needs; ensure binlog and redo are on sane storage.
2) Symptom: sudden write latency spikes every few minutes (MongoDB)
Root cause: checkpoint/journal flush cycles causing periodic I/O bursts; made worse by cache pressure.
Fix: Ensure disk throughput and low tail latency; reduce write amplification (indexes, doc size); ensure WiredTiger cache isn’t starving the OS page cache; review checkpoint behavior in metrics.
3) Symptom: MongoDB writes time out only when using majority
Root cause: one voting secondary is lagging or slow; majority ack now includes a slow path.
Fix: repair/replace slow secondary; consider adjusting voting members; fix network and storage consistency across nodes; increase oplog if secondaries fall off.
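If the oplog window itself is the limiter, it can be resized online on MongoDB 3.6 and newer, one member at a time; the size is in megabytes, and the value here is an assumption to be checked against your lag and retention math.
cr0x@server:~$ mongosh --quiet --eval 'db.adminCommand({replSetResizeOplog: 1, size: 32768})'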
4) Symptom: MariaDB CPU is fine, but QPS collapses
Root cause: lock contention (hot rows, gap locks, long transactions) or I/O wait; CPU graphs are misleading.
Fix: identify lock waits; reduce transaction scope; avoid hot counters; change isolation or access pattern; add appropriate indexes only if they reduce lock time (not just “for reads”).
5) Symptom: both databases “randomly” get slow during bursts in VMs
Root cause: noisy neighbors and storage contention; variable fsync latency; CPU steal.
Fix: isolate storage; dedicated volumes; measure fsync tail; check CPU steal; stop colocating batch workloads with the database.
6) Symptom: write burst causes a retry storm and then everything dies
Root cause: client retries without backoff; database becomes the rate limiter; timeouts amplify load.
Fix: exponential backoff with jitter; circuit breakers; server-side admission control; cap concurrency; return 429/503 intentionally rather than letting timeouts cascade.
7) Symptom: “We scaled up disk IOPS but it’s still slow”
Root cause: you’re bottlenecked by serialization (log mutex, single hot shard), CPU, or network replication latency, not raw IOPS.
Fix: measure where time is spent (fsync, locks, replication acks); reduce hot spots; shard/partition for write distribution; verify you got lower tail latency, not just higher peak throughput.
Checklists / step-by-step plan
Step-by-step: preparing MariaDB for predictable write bursts
- Measure fsync tail latency on the actual volume class used for redo/binlog. If p99 is ugly, stop and fix storage first.
- Set durability intentionally: decide innodb_flush_log_at_trx_commit and sync_binlog with a written risk statement.
- Right-size redo logs to smooth bursts and reduce flush pressure.
- Audit secondary indexes on write-heavy tables. Remove vanity indexes. Keep the ones that prevent full scans that lock too much.
- Control transaction size: small commits are easier to batch and replicate; huge transactions create lock and flush storms.
- Plan for backpressure: set sane timeouts, limit app concurrency, implement jittered retries.
- Replication rehearsal: confirm replicas can apply at burst rate; otherwise replicas become your recovery-time problem.
Step-by-step: preparing MongoDB for predictable write bursts
- Define write concern per workload (not per mood). Critical writes: majority+journal. Rebuildable telemetry: maybe lower, with guardrails.
- Monitor replication lag and elections like it’s a first-class SLO. Under bursts, it is.
- Validate storage for journaling and checkpoints: low tail latency beats headline IOPS.
- Choose shard keys for write distribution if you shard. Hot shards turn “distributed” into “single-node pain with extra steps.”
- Index discipline: each index is a tax on your write budget. Keep the ones that serve real queries.
- Capacity for cache + eviction: watch dirty bytes and eviction churn; avoid running constantly at the edge.
- Driver timeouts and retry policies: configure them consciously; don’t let the driver “help” you into a retry storm.
Joke #2: Databases don’t “handle spikes.” They negotiate with them, and the contract is written in milliseconds.
FAQ
1) Which one is faster at writes: MariaDB or MongoDB?
Neither “wins” universally. MongoDB can ingest fast with relaxed write concern and a good shard key. MariaDB can ingest fast with group commit and sane indexes. Under strict durability, both are gated by fsync tail latency and write amplification.
2) What’s the most common reason MariaDB melts during write bursts?
Commit durability costs (redo/binlog fsync) plus flush pressure from dirty pages, often compounded by too many secondary indexes or hot-row contention.
3) What’s the most common reason MongoDB melts during write bursts?
Replication lag plus majority write concern, or checkpoint/journal I/O spikes when the disk can’t keep tail latency low.
4) Can I “just scale horizontally” to survive bursts?
Sometimes. MongoDB sharding can spread writes if you choose a key that distributes. MariaDB can scale with sharding at the application layer or with clustering solutions, but multi-writer setups have their own constraints. Horizontal scale is a design project, not a knob.
5) Is turning off durability acceptable for bursts?
Only if the data is rebuildable and you have a documented, rehearsed procedure. Otherwise you’re trading an incident now for a data integrity incident later, which is usually more expensive and less forgivable.
6) Should I put redo logs / journal on separate disks?
Separation can help if contention is the issue and you have real isolation (not two partitions on the same underlying device). Often the better win is fewer, faster, more predictable devices rather than complicated layouts.
7) What’s the single best metric for burst survival?
Tail latency of durable writes. Not average. Not peak throughput. If p99 commit/journal latency goes bad, everything upstream starts behaving badly, too.
8) Does “more RAM” fix burst write problems?
Sometimes it delays them. For MariaDB, more buffer pool reduces read contention and can smooth writes, but it can also allow more dirty pages to accumulate. For MongoDB, more cache reduces churn, but checkpointing and journaling still hit the disk. RAM helps; it doesn’t repeal physics.
9) What about Galera (MariaDB) vs MongoDB replica sets for bursts?
Multi-master synchronous-ish systems can reduce some failover pain but can also amplify write latency because coordination becomes part of every commit. Under bursts, coordination overhead is not your friend unless carefully engineered.
10) How do I pick if my workload is “unknown future requirements”?
If you need relational constraints and complex transactions, start with MariaDB and add a document store where it fits. If your domain is truly document-centric and you’re committed to shard design and write concern discipline, MongoDB can be the simpler operational shape.
Conclusion: next steps you can execute
Write bursts don’t reward optimism. They reward systems with predictable durable write latency, disciplined schema/index design, and backpressure that doesn’t turn into a retry hurricane.
- Pick your durability posture (MariaDB commit settings / MongoDB write concern) and write it down as policy.
- Measure fsync/journal tail latency on your real storage. If it’s inconsistent, fix the platform before you touch query plans.
- Run a burst rehearsal monthly: replication lag, checkpoints, failovers, and client retry behavior.
- Implement load shedding: cap concurrency, add jittered retries, and stop pretending timeouts are a strategy.
- Reduce write amplification: audit indexes, avoid hot keys, and keep transactions small and boring.
If you want a single takeaway: MariaDB tends to fail slow and loud; MongoDB tends to fail in topology-dependent ways. Build for the failure mode you can manage at 2 a.m., not the benchmark you can brag about at 2 p.m.