MariaDB vs SQLite for Write Bursts: Who Handles Spikes Without Drama

Write bursts don’t arrive politely. They show up as a thundering herd: job runners waking up together, a backlog draining after a deploy, mobile clients reconnecting after a train tunnel, or an “oops” backfill you promised would run “slowly.” The question isn’t whether your database can write. The question is whether it can write a lot, right now, without turning your on-call rotation into a hobby.

MariaDB and SQLite can both store your data. But under spikes, they behave like different species. MariaDB is a server with concurrency controls, background flushing, buffer pools, and a long history of being yelled at by production workloads. SQLite is a library that lives inside your process, brutally efficient and wonderfully low-maintenance—until you ask it to do something that looks like a multi-writer storm.

The real question: what does “burst” mean for your system

“Write burst” is a vague phrase that causes expensive misunderstandings. There are at least four different beasts that people call a burst:

  1. Short spike, high concurrency: 500 requests hit at once, each doing a tiny insert.
  2. Sustained surge: 10× normal write rate for 10–30 minutes (batch jobs, backfills).
  3. Long tail latency explosion: average throughput looks fine, but every 20 seconds commits stall for 300–2000 ms.
  4. I/O cliff: the disk or storage system hits a flush wall (fsync/flush cache behavior), and everything queues behind it.

MariaDB vs SQLite under “bursts” is mostly about how they behave under concurrency and how they pay for durability. If you only ever have one writer and you can tolerate some queueing, SQLite can be ridiculously good. If you have many writers, many processes, or you need to keep serving reads while writes thrash, MariaDB is usually the grown-up in the room.

But there are traps on both sides. SQLite’s trap is locking. MariaDB’s trap is thinking the database server is the bottleneck when it’s actually the storage subsystem (or your commit policy).

A few facts and history that actually matter

Some context points that are short, concrete, and surprisingly predictive of burst behavior:

  • SQLite is a library, not a server. There’s no separate daemon; your app links it and directly reads/writes the DB file. That’s a performance superpower and an operational constraint.
  • SQLite’s original design optimized for embedded systems. It became popular in desktop/mobile because it’s “just a file” and doesn’t need a DBA to babysit it.
  • WAL mode was added in SQLite 3.7.0 (2010) to improve concurrency. It separates reads from writes by appending to a write-ahead log, allowing readers to proceed during writes—up to a point.
  • SQLite still has a single-writer rule at the database level. WAL helps readers, but multiple concurrent writers still serialize on the write lock.
  • MariaDB is a fork of MySQL. The fork happened after Oracle acquired Sun; MariaDB became the “community-friendly” continuity play for many orgs.
  • InnoDB became the default MySQL/MariaDB engine for a reason. It’s built around MVCC, redo logs, background flushing, and crash recovery—features that matter when bursts hit.
  • MariaDB’s performance during bursts depends heavily on fsync behavior. Your redo log flush policy can shift the pain from “every commit stalls” to “some commits stall but throughput improves.” It’s a trade, not free money.
  • Most “database is slow” incidents during write spikes are actually “storage is slow.” The database is just the first thing to admit it by blocking on fsync.

Write path anatomy: MariaDB/InnoDB vs SQLite

SQLite: one file, one writer, very little ceremony

SQLite writes to a single database file (plus, in WAL mode, a WAL file and a shared-memory index file). Your process issues SQL; SQLite translates it into page updates. During a transaction commit, SQLite must ensure durability based on your pragma settings. This usually means forcing data to stable storage using fsync-like calls, depending on platform and filesystem.

Under bursts, SQLite’s critical detail is how quickly it can cycle through “acquire write lock → write pages/WAL → sync policy → release lock”. If commits are frequent and small, the overhead is dominated by sync calls and lock handoffs. If commits are batched, SQLite can fly.

WAL mode changes the shape: writers append to the WAL and readers can keep reading the main DB snapshot. But there’s still one writer at a time, and checkpoints can become a second kind of burst (more on that later).
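
To make the batching point concrete: below is a minimal sketch using Python's built-in sqlite3 module. The table name and batch size are illustrative assumptions; the point is that one commit covers many rows, so the sync cost is amortized across the batch.

import sqlite3

BATCH_SIZE = 500  # illustrative; tune against your own latency budget

def ingest(rows, path="/data/app.db"):
    """Insert rows in batches so each commit (and its sync) covers many rows."""
    conn = sqlite3.connect(path, timeout=5.0)  # timeout acts like busy_timeout
    conn.execute("PRAGMA journal_mode=WAL")
    try:
        batch = []
        for row in rows:                      # row = (ts, payload)
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                with conn:                    # one transaction, one commit, one sync cycle
                    conn.executemany("INSERT INTO events(ts, payload) VALUES (?, ?)", batch)
                batch.clear()
        if batch:                             # flush the final partial batch
            with conn:
                conn.executemany("INSERT INTO events(ts, payload) VALUES (?, ?)", batch)
    finally:
        conn.close()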

MariaDB/InnoDB: concurrency, buffering, and background I/O

MariaDB is a server process with multiple worker threads. InnoDB maintains a buffer pool (cache) for pages, a redo log (write-ahead), and undo logs for MVCC. When you commit, InnoDB writes redo records and—depending on settings—flushes them to disk. Dirty pages are flushed in the background.

Under bursts, InnoDB’s superpower is that it can accept many concurrent writers, queue the work, and smooth it out with background flushing—assuming you’ve sized it and your I/O can keep up. Its weakness is that it can still hit a hard wall where the redo log or dirty page flushing becomes urgent, and then latency spikes look like a synchronized collapse.

There’s an idea from Werner Vogels (Amazon’s CTO) that operations people repeat, paraphrased, because it keeps being true: everything fails, so design for recovery and minimize the blast radius. In burst-land, that often means: expect write amplification and expect the disk to be the first to complain.

Who handles spikes better (and when)

If you want a clean, honest rule: SQLite handles bursts without drama when you can shape the write workload into fewer transactions and you don’t have many writers across processes. MariaDB handles bursts without drama when you have many concurrent writers, multiple app instances, and you need predictable behavior under contention—assuming your storage and configuration aren’t sabotaging you.

SQLite wins when

  • Single process or controlled writers: one writer thread, a queue, or a dedicated writer process.
  • Short transactions, batched commits: you can commit every N records or every T milliseconds.
  • Local disk, low-latency fsync: NVMe, not a wobbly network filesystem.
  • You want simplicity: no server, fewer moving parts, fewer pages to wake up at 3 a.m.
  • Read-heavy with occasional bursts: WAL mode can keep reads snappy while writes happen.

SQLite loses (loudly) when

  • Many concurrent writers: they serialize, and your app threads pile up behind “database is locked.”
  • Multiple processes write at once: especially on busy hosts or containers without coordination.
  • Checkpointing becomes a burst: WAL grows, checkpoint triggers, and suddenly you have a write storm inside your write storm.
  • Storage has odd fsync semantics: some virtualized or networked storage makes durability extremely expensive or inconsistent.

MariaDB wins when

  • You have real concurrency: multiple app instances, each writing at the same time.
  • You need operational tooling: replication, backups, online schema changes, observability hooks.
  • You need to isolate workload: buffer pool absorbs spikes, thread pools and queueing can prevent total collapse.
  • You need predictable isolation semantics: MVCC with consistent reads under write load.

MariaDB loses when

  • Your disk can’t flush fast enough: redo log flush stalls the world; latency balloons.
  • You mis-size the buffer pool: too small and it thrashes; too big and the OS cache + swapping drama arrives.
  • You “tune” durability away blindly: you buy throughput by selling your future self a data loss incident.
  • Your schema forces hot spots: single-row counters, poor indexes, or monotonic inserts fighting over the same structures.

Joke #1: SQLite is the friend who’s always on time—unless you invite three other friends to talk at once, then it just locks the door.

Durability knobs: what you’re really buying with fsync

Write bursts are where durability settings stop being theoretical. They become a bill your storage must pay, immediately, in cash.

SQLite durability levers

SQLite exposes durability via pragmas. The big ones for bursts:

  • journal_mode=WAL: usually the default recommendation for concurrent reads and steady write performance.
  • synchronous: controls how aggressively SQLite syncs to disk. In WAL mode, NORMAL syncs the WAL around checkpoints while FULL syncs on every commit; more durability means more fsync cost.
  • busy_timeout: doesn’t improve throughput, but prevents useless failures by waiting for locks.
  • wal_autocheckpoint: controls when SQLite tries to checkpoint (move WAL contents into the main DB file); the default is every 1000 WAL pages.

Here’s the subtle part: in WAL mode, the system can feel great until the WAL grows and checkpointing becomes unavoidable. That “checkpoint tax” often shows up as periodic latency spikes that look like the database “hiccuping.” If you’re logging or time-series inserting, this can bite hard.
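
If you manage these pragmas in application code, setting them at connection time keeps the policy in one place. A minimal sketch, assuming Python's sqlite3 and that NORMAL durability is acceptable for this workload (in WAL mode a crash may lose the most recent commits but won't corrupt the database); keep FULL if the business says otherwise.

import sqlite3

def open_burst_friendly(path="/data/app.db"):
    """Open a SQLite connection with burst-oriented pragmas applied up front."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")         # readers keep reading while a writer commits
    conn.execute("PRAGMA synchronous=NORMAL")       # assumption: sync around checkpoints, not every commit
    conn.execute("PRAGMA busy_timeout=5000")        # wait up to 5 s for the write lock instead of failing
    conn.execute("PRAGMA wal_autocheckpoint=1000")  # checkpoint roughly every 1000 WAL pages
    return conn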

MariaDB/InnoDB durability levers

In InnoDB, the critical burst knobs are about redo log flush behavior and how quickly dirty pages can be written:

  • innodb_flush_log_at_trx_commit: the classic durability/throughput trade. 1 (the default) flushes the redo log at every commit and is safest; 2 writes the log at commit but flushes roughly once per second; 0 defers both write and flush, which is fastest and riskiest.
  • sync_binlog: if you use binary logs for replication, this can be an additional fsync cost.
  • innodb_log_file_size (MariaDB’s redo log sizing; MySQL 8.0.30+ calls the equivalent innodb_redo_log_capacity): too small and you hit frequent checkpoints; too big and crash recovery takes longer. Spikes often reveal undersized redo logs.
  • innodb_io_capacity / innodb_io_capacity_max: tells InnoDB how aggressive to be with background flushing.

For burst tolerance, you want the database to absorb the burst and flush steadily rather than panic-flush. Panic flushing is where latency gets “interesting.”

Common burst patterns and what breaks first

Pattern: tiny transactions at high QPS

This is the classic “insert one row and commit” loop, multiplied by concurrency. It’s a commit storm.

  • SQLite: lock contention + fsync overhead. You’ll see “database is locked” or long waits unless you queue writes and batch commits.
  • MariaDB: can handle concurrency, but fsync per commit may dominate latency. You’ll see high trx commits, log flush waits, and I/O saturation.

Pattern: backfill with heavy indexes

You add columns, backfill, and update secondary indexes. Now each write fans out into multiple B-tree updates.

  • SQLite: single writer makes it predictable but slow; the lock window is longer, so everyone else waits longer.
  • MariaDB: throughput depends on buffer pool and I/O. Hot indexes can cause latch contention; too many threads can make it worse.

Pattern: burst coincides with checkpoint/flush cycle

This is the “it’s fine, it’s fine, it’s fine… why is it on fire every 30 seconds?” scenario.

  • SQLite WAL checkpoint: long checkpoint cycles can block or slow writes, depending on mode and conditions.
  • InnoDB checkpoint: redo log fills, dirty pages must flush, and foreground work starts waiting on background I/O.

Pattern: storage latency jitter

Everything is normal until the disk pauses. Cloud volumes, RAID cache flushes, neighbor noise, filesystem journal commits—pick your villain.

  • SQLite: your app thread is the database; it blocks. Latency spikes propagate directly to request latency.
  • MariaDB: can queue and parallelize, but eventually the server threads block too. The difference is you can see it from inside the engine via status counters and logs.

Joke #2: “We’ll just make it synchronous and fast” is the database equivalent of “I’ll just be calm and on time during the airport security rush.”

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “SQLite can handle a few writers, right?”

A mid-sized product team shipped a new ingestion service. Each container took events from a queue and wrote them to a local SQLite file for “temporary buffering,” then another job would ship the file to object storage. The assumption was that “it’s local disk, so it’ll be fast.” And it was—during the happy-path demo.

Then production happened. Autoscaling spun up multiple containers on the same node, all writing to the same SQLite database file via a shared hostPath. The moment traffic spiked, writers collided. SQLite did what it’s designed to do: serialize writes. The application did what it’s designed to do: panic.

Symptoms were messy: request timeouts, “database is locked” errors, and a retry loop that multiplied the burst. The host itself looked underutilized on CPU, which encouraged exactly the wrong debugging instinct: “it can’t be the database; the CPU is idle.”

The fix was embarrassingly simple and operationally adult: one writer per database file. They switched to per-container SQLite files, and introduced an explicit write queue in-process. When they needed cross-container writes, they moved the buffering layer to MariaDB with proper connection pooling and transaction batching.

The takeaway: SQLite is incredible when you own the write serialization intentionally. It’s chaos when you discover the serialization accidentally.

Optimization that backfired: “Let’s relax fsync and crank threads”

An internal admin platform ran on MariaDB. During a quarterly import job, they saw commit latency spikes. Someone (well-meaning, tired) changed innodb_flush_log_at_trx_commit from 1 to 2 and increased concurrency in the importer from 16 to 128 threads. They wanted to “push through the batch faster” and reduce the time window of pain.

Throughput improved for about five minutes. Then the system hit a different wall: buffer pool churn plus write amplification from secondary indexes. Dirty pages accumulated faster than flushing could keep up. InnoDB started aggressive flushing. Latency went from spiky to consistently awful, and the primary started lagging replication because the binary log fsync pattern changed under load.

They didn’t lose data, but they did lose time: the import took longer end-to-end because the system oscillated between bursts of progress and long stalls. Meanwhile user-facing traffic suffered because the database couldn’t keep response times stable.

The eventual solution wasn’t “more tuning.” It was disciplined load shaping: throttle the importer, batch commits, and schedule the job with a predictable rate limit. They kept durability settings conservative and fixed the real issue: the importer had no business behaving like a DDoS test.

The takeaway: turning knobs without controlling concurrency is how you trade one failure mode for a more confusing one.

Boring but correct practice that saved the day: “Measure fsync, keep headroom, rehearse restores”

A payments-adjacent service (the kind where you don’t get to be creative about durability) used MariaDB with InnoDB. Every few weeks they’d get a burst: reconciliation jobs plus a traffic bump. It never caused an outage, and nobody celebrated that. That was the point.

They had a boring routine. They measured disk latency (including fsync) continuously, not just IOPS. They kept a buffer in redo log capacity and sized the buffer pool so the system didn’t thrash during bursts. They also rehearsed restores on a schedule so nobody learned backup behavior during an incident.

One day their storage latency jitter doubled due to a noisy neighbor situation on the underlying hardware. The service didn’t fall over. It got slower, alarms triggered early, and the team applied a known mitigation: temporarily rate-limit the batch jobs and pause non-critical writers. User traffic stayed within SLO.

Later, when they moved to different storage, they already had baselines proving the storage layer was the culprit. Procurement meetings are much easier when you show graphs instead of emotions.

The takeaway: the “boring” practice of measuring the right thing and keeping headroom is the cheapest burst insurance you can buy.

Fast diagnosis playbook

When a write burst hits and everything gets weird, you do not have time to philosophize. You need a fast decision tree: are we lock-bound, CPU-bound, or I/O-bound?

First: confirm the shape of the pain (latency vs throughput)

  • If throughput stays high but p95/p99 latency explodes: look for fsync/journal/checkpoint stalls.
  • If throughput collapses: look for lock contention, thread exhaustion, or storage saturation.

Second: decide whether it’s SQLite or MariaDB specific

  • SQLite: errors like “database is locked,” long waits, WAL file growth, or checkpoint stalls.
  • MariaDB: threads waiting on log flush, dirty page flushing, row lock waits, or replication lag compounding the pressure.

Third: prove or eliminate storage as the bottleneck

  • Check disk latency, queue depth, and fsync behavior under load.
  • If storage is jittery, almost any database will look guilty.

Fourth: stop making it worse

  • Throttle the burst source (batch job, importer, retry loop).
  • Batch commits. Reduce concurrency. Turn off “infinite retry with no jitter.”
  • Capture evidence before restarting services. Restarts erase the clues and rarely fix the physics.

Practical tasks: commands, outputs, and decisions

These are the things you can run during an incident or a tuning session. Each includes: command, what the output means, and the decision you make.

1) See if the system is I/O saturated (Linux)

cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db01)  12/30/2025  _x86_64_ (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.1    0.0    4.3    31.7     0.0    51.9

Device            r/s     w/s   rkB/s   wkB/s  rrqm/s  wrqm/s  %util  await  svctm
nvme0n1          20.0  900.0   640.0 42000.0    0.0    50.0   98.7   22.4   0.8

Meaning: %util near 100% plus high await indicates the device is the limiter; iowait is also elevated.

Decision: Stop “tuning SQL” first. Throttle writers, reduce commit frequency, or move to faster storage. If this is a shared volume, investigate neighbor noise.

2) Identify which process is issuing the most write I/O

cr0x@server:~$ pidstat -d 1 3
Linux 6.5.0 (db01)  12/30/2025  _x86_64_  (8 CPU)

# Time   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
12:01:01  999     2145      0.00  52000.00     0.00  mariadbd
12:01:01 1000     8812      0.00   1200.00     0.00  importer

Meaning: MariaDB is doing the bulk writes; importer is contributing but not dominant.

Decision: Focus on InnoDB flushing/redo behavior and storage latency, not just application batching.

3) Check filesystem and mount options (durability surprises live here)

cr0x@server:~$ mount | grep -E " /var/lib/mysql | /data "
/dev/nvme0n1p2 on /var/lib/mysql type ext4 (rw,relatime,commit=5)
/dev/nvme1n1p1 on /data type xfs (rw,relatime,attr2,inode64,logbufs=8)

Meaning: ext4 commit interval and journaling behavior can affect periodic latency spikes.

Decision: If you see periodic stalls aligned with journal commits, consider filesystem tuning or moving DB files to a volume with better latency characteristics.

4) Measure raw fsync-ish latency with a simple sync test

cr0x@server:~$ dd if=/dev/zero of=/var/lib/mysql/sync-test.bin bs=4k count=4096 oflag=dsync
4096+0 records in
4096+0 records out
16777216 bytes (17 MB, 16 MiB) copied, 3.91 s, 4.3 MB/s

Meaning: oflag=dsync forces sync per block; low throughput implies high sync cost. Not a perfect model, but it reveals “storage is lying.”

Decision: If this looks terrible on “fast” disks, stop and fix storage or virtualization settings before blaming the database.

5) MariaDB: confirm InnoDB flush policy and redo sizing

cr0x@server:~$ mariadb -e "SHOW VARIABLES WHERE Variable_name IN ('innodb_flush_log_at_trx_commit','sync_binlog','innodb_redo_log_capacity','innodb_io_capacity','innodb_io_capacity_max');"
+------------------------------+-----------+
| Variable_name                | Value     |
+------------------------------+-----------+
| innodb_flush_log_at_trx_commit | 1       |
| sync_binlog                   | 1        |
| innodb_redo_log_capacity      | 1073741824|
| innodb_io_capacity            | 200      |
| innodb_io_capacity_max        | 2000     |
+------------------------------+-----------+

Meaning: Full durability on both redo log and binlog (costly during bursts). A 1 GiB redo log may be undersized for a heavy write burst.

Decision: If p99 is dying and you can tolerate small durability tradeoffs, consider adjusting settings—but only with clear business sign-off. Otherwise increase storage performance and consider batching commits.

6) MariaDB: see if you’re waiting on log flush

cr0x@server:~$ mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_log_waits'; SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_log_waits | 1834  |
+------------------+-------+
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| Innodb_os_log_fsyncs | 920044 |
+----------------------+--------+

Meaning: Log waits means transactions had to wait for redo log flushing. Bursts + fsync latency = pain.

Decision: Reduce commit frequency (batch), reduce concurrency, or improve fsync latency. Don’t just add CPU.
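
If you want to watch this counter continuously rather than eyeballing it mid-incident, here is a minimal polling sketch. It assumes the PyMySQL driver and made-up credentials; the idea is to alert on the per-interval delta of Innodb_log_waits, not on the raw lifetime counter.

import time

import pymysql  # assumption: PyMySQL driver; any MariaDB/MySQL client library works similarly

def fetch_log_waits(cur):
    """Read the lifetime Innodb_log_waits counter."""
    cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_log_waits'")
    _, value = cur.fetchone()  # rows come back as (name, value)
    return int(value)

def watch(interval_s=10):
    """Print how many new redo-log waits happened per interval."""
    # hypothetical host/credentials; point this at the server you are diagnosing
    conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
    cur = conn.cursor()
    prev = fetch_log_waits(cur)
    while True:
        time.sleep(interval_s)
        now = fetch_log_waits(cur)
        print(f"Innodb_log_waits: +{now - prev} in the last {interval_s}s")
        prev = now

if __name__ == "__main__":
    watch()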

7) MariaDB: check dirty page pressure (flush debt)

cr0x@server:~$ mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';"
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| Innodb_buffer_pool_pages_dirty | 412345 |
+--------------------------------+--------+
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| Innodb_buffer_pool_pages_total | 524288 |
+--------------------------------+--------+

Meaning: A very high dirty ratio suggests the system is behind on flushing; checkpoints may force stalls.

Decision: Increase I/O capacity settings cautiously, ensure storage can sustain writes, and reduce incoming write rate until dirty pages stabilize.

8) MariaDB: identify lock waits and hot tables

cr0x@server:~$ mariadb -e "SELECT * FROM information_schema.innodb_lock_waits\G"
*************************** 1. row ***************************
requesting_trx_id: 123456
blocking_trx_id: 123455
blocked_table: `app`.`events`
blocked_lock_type: RECORD
blocking_lock_type: RECORD

Meaning: You have contention on a specific table/index.

Decision: Fix the hot spot: add index, change access pattern, avoid single-row counters, or shard by key/time. Throwing more threads at lock contention makes it worse.

9) MariaDB: inspect current thread states (what are they waiting on?)

cr0x@server:~$ mariadb -e "SHOW PROCESSLIST;"
+-----+------+-----------+------+---------+------+------------------------+------------------------------+
| Id  | User | Host      | db   | Command | Time | State                  | Info                         |
+-----+------+-----------+------+---------+------+------------------------+------------------------------+
| 101 | app  | 10.0.0.12 | app  | Query   |   12 | Waiting for handler commit | INSERT INTO events ...     |
| 102 | app  | 10.0.0.13 | app  | Query   |   11 | Waiting for handler commit | INSERT INTO events ...     |
| 103 | app  | 10.0.0.14 | app  | Sleep   |    0 |                        | NULL                         |
+-----+------+-----------+------+---------+------+------------------------+------------------------------+

Meaning: “Waiting for handler commit” commonly correlates with commit/fsync pressure.

Decision: Investigate redo/binlog flush settings and disk latency; consider write batching.

10) SQLite: verify journal mode and synchronous settings

cr0x@server:~$ sqlite3 /data/app.db "PRAGMA journal_mode; PRAGMA synchronous; PRAGMA wal_autocheckpoint;"
wal
2
1000

Meaning: WAL mode is enabled; synchronous=2 is FULL (durable, slower); autocheckpoint at 1000 pages.

Decision: If you’re spiking and seeing stalls, consider whether FULL is required. Also plan checkpoint strategy (manual/controlled) rather than letting autocheckpoint surprise you.

11) SQLite: detect lock contention using a controlled write test

cr0x@server:~$ sqlite3 /data/app.db "PRAGMA busy_timeout=2000; BEGIN IMMEDIATE; INSERT INTO events(ts, payload) VALUES(strftime('%s','now'),'x'); COMMIT;"

Meaning: If this intermittently fails with “database is locked,” you have competing writers or long transactions.

Decision: Introduce a single-writer queue, shorten transactions, and make sure readers aren’t holding locks longer than expected (e.g., long-running SELECTs in a transaction).

12) SQLite: watch WAL growth and checkpoint health

cr0x@server:~$ ls -lh /data/app.db /data/app.db-wal /data/app.db-shm
-rw-r--r-- 1 app app 1.2G Dec 30 12:05 /data/app.db
-rw-r--r-- 1 app app 3.8G Dec 30 12:05 /data/app.db-wal
-rw-r--r-- 1 app app  32K Dec 30 12:05 /data/app.db-shm

Meaning: WAL is bigger than the main DB. That’s not automatically fatal, but it’s a sign checkpointing isn’t keeping up.

Decision: Run a controlled checkpoint during a quiet window, or adjust your workload so checkpoints occur predictably. Investigate long-lived readers preventing checkpoint progress.

13) SQLite: check whether readers are blocking checkpoints (busy database)

cr0x@server:~$ sqlite3 /data/app.db "PRAGMA wal_checkpoint(TRUNCATE);"
0|0|0

Meaning: The three numbers are (busy, log, checkpointed). Zeros after TRUNCATE suggests checkpoint succeeded quickly and WAL truncated.

Decision: If “busy” is non-zero or WAL won’t truncate, hunt for long-running read transactions and fix them (shorten reads, avoid holding transactions open).

14) MariaDB: confirm buffer pool sizing and pressure

cr0x@server:~$ mariadb -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';"
+-------------------------+------------+
| Variable_name           | Value      |
+-------------------------+------------+
| innodb_buffer_pool_size | 8589934592 |
+-------------------------+------------+
+---------------------------+----------+
| Variable_name             | Value    |
+---------------------------+----------+
| Innodb_buffer_pool_reads  | 18403921 |
+---------------------------+----------+

Meaning: If buffer pool reads climb rapidly during the burst, you’re missing cache and doing more physical I/O than planned.

Decision: Increase buffer pool (if RAM allows), reduce working set (indexes, query patterns), or shard workload. Don’t ignore the OS; swapping will ruin your day.

15) Networked storage suspicion: check latency distribution quickly

cr0x@server:~$ ioping -c 10 /var/lib/mysql
4 KiB <<< /var/lib/mysql (ext4 /dev/nvme0n1p2): request=1 time=0.8 ms
4 KiB <<< /var/lib/mysql (ext4 /dev/nvme0n1p2): request=2 time=1.1 ms
4 KiB <<< /var/lib/mysql (ext4 /dev/nvme0n1p2): request=3 time=47.9 ms
...
--- /var/lib/mysql ioping statistics ---
10 requests completed in 12.3 s, min/avg/max = 0.7/6.4/47.9 ms

Meaning: That max latency spike is exactly what commit latency looks like when the disk hiccups.

Decision: If you see jitter like this, stop chasing micro-optimizations in SQL. Fix storage QoS, move volumes, or add buffering/batching.

16) Find retry storms in application logs (the “self-amplifying burst”)

cr0x@server:~$ journalctl -u app-ingester --since "10 min ago" | grep -E "database is locked|retrying" | tail -n 5
Dec 30 12:00:41 db01 app-ingester[8812]: sqlite error: database is locked; retrying attempt=7
Dec 30 12:00:41 db01 app-ingester[8812]: sqlite error: database is locked; retrying attempt=8
Dec 30 12:00:42 db01 app-ingester[8812]: sqlite error: database is locked; retrying attempt=9
Dec 30 12:00:42 db01 app-ingester[8812]: sqlite error: database is locked; retrying attempt=10
Dec 30 12:00:42 db01 app-ingester[8812]: sqlite error: database is locked; retrying attempt=11

Meaning: You’re not just experiencing contention; you’re multiplying it with retries.

Decision: Add exponential backoff with jitter, cap retries, and consider a single-writer queue. Retrying aggressively is how you turn a spike into an outage.
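
Capped retries with exponential backoff and jitter are cheap to implement once and reuse everywhere. A minimal sketch for SQLite writers, in plain Python with only the standard library; the attempt cap and base delay are illustrative:

import random
import sqlite3
import time

def write_with_backoff(conn, sql, params, max_attempts=6):
    """Retry a locked write with exponential backoff + full jitter instead of hammering it."""
    for attempt in range(max_attempts):
        try:
            with conn:              # commits on success, rolls back on error
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc).lower() or attempt == max_attempts - 1:
                raise               # not a lock problem, or we are out of attempts
            # sleep for a random slice of an exponentially growing window (50 ms * 2^attempt)
            time.sleep(random.uniform(0, 0.05 * (2 ** attempt)))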

Common mistakes (symptoms → root cause → fix)

1) Symptom: “database is locked” errors during spikes (SQLite)

Root cause: Multiple concurrent writers or long-lived transactions holding locks; single-writer reality collides with multi-writer workload.

Fix: Serialize writes explicitly (one writer thread/process), use WAL mode, set a sane busy_timeout, and batch commits. Avoid holding read transactions open while writing.

2) Symptom: periodic 200–2000 ms stalls every N seconds (SQLite)

Root cause: WAL checkpoint cycles or filesystem journal commits creating bursty sync behavior.

Fix: Control checkpoints (manual during quiet windows), tune wal_autocheckpoint, reduce synchronous level only with clear durability requirements, and validate storage latency jitter.

3) Symptom: MariaDB p99 spikes while CPU is low

Root cause: I/O-bound commits: redo/binlog fsync latency dominates; threads wait on log flush or handler commit.

Fix: Batch transactions, reduce concurrency, review innodb_flush_log_at_trx_commit and sync_binlog with business approval, and improve storage latency.

4) Symptom: throughput collapses when you “add more workers” (MariaDB)

Root cause: Lock/latch contention or flushing pressure amplified by thread thrash; more concurrency increases context switching and contention.

Fix: Cap concurrency, use connection pooling, fix hot indexes/tables, and tune InnoDB background flushing rather than adding threads.

5) Symptom: WAL file grows forever (SQLite)

Root cause: Long-lived readers prevent checkpoint from completing; or autocheckpoint settings don’t match workload.

Fix: Ensure readers don’t hold transactions open, run wal_checkpoint during controlled windows, and consider splitting workload across multiple DB files if contention is structural.

6) Symptom: MariaDB replication lag spikes during imports

Root cause: Binary log fsync and redo flush patterns under heavy write load; single-threaded apply (depending on setup) can’t keep up.

Fix: Batch writes, schedule imports, review binlog durability settings, and ensure replica apply configuration matches workload. Don’t treat replication as “free.”

7) Symptom: “It’s fast on my laptop, slow in prod” (both)

Root cause: Storage semantics differ: laptop NVMe vs shared cloud volume; fsync and latency jitter are different universes.

Fix: Benchmark on production-like storage, measure latency distribution, and set SLOs around p99 commit latency—not just average throughput.

Checklists / step-by-step plan

If you’re choosing between MariaDB and SQLite for bursty writes

  1. Count writers, not requests. How many processes/hosts can write concurrently?
  2. Decide whether you can enforce a single writer. If yes, SQLite remains on the table.
  3. Define durability requirements plainly. “We can lose 1 second of data” is a real requirement; “must be durable” is not.
  4. Measure storage fsync latency. If it’s jittery, both databases will look flaky under spikes.
  5. Plan for backfills. If you’ll routinely import or reprocess data, design throttling and batching from day one.

SQLite burst-hardening plan (practical)

  1. Enable WAL mode and confirm it stays enabled.
  2. Set busy_timeout to something non-trivial (hundreds to thousands of ms), and handle SQLITE_BUSY with backoff + jitter.
  3. Batch commits: commit every N rows or every T milliseconds.
  4. Introduce a write queue with one writer thread (a sketch follows this list). If multiple processes exist, introduce one writer process.
  5. Control checkpoints: run wal_checkpoint during low traffic; tune wal_autocheckpoint.
  6. Watch WAL size and checkpoint success as first-class metrics.
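
Here is the sketch referenced in steps 3 and 4: a single writer thread draining an in-process queue and committing by batch size or by elapsed time, using Python's sqlite3 and queue modules. The table name, batch size, and wait window are illustrative assumptions.

import queue
import sqlite3
import threading
import time

write_q = queue.Queue()  # producers put (ts, payload) tuples here

def writer(path="/data/app.db", batch_size=200, max_wait_s=0.05):
    """The only code path allowed to write: drain the queue, commit in batches."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    while True:
        batch = [write_q.get()]                     # block until there is work
        deadline = time.monotonic() + max_wait_s
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(write_q.get(timeout=remaining))
            except queue.Empty:
                break
        with conn:                                  # one commit per batch
            conn.executemany("INSERT INTO events(ts, payload) VALUES (?, ?)", batch)

threading.Thread(target=writer, daemon=True).start()
# Anywhere else in the process, enqueue instead of writing directly:
# write_q.put((int(time.time()), "payload"))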

MariaDB burst-hardening plan (practical)

  1. Confirm you’re on InnoDB for bursty write tables.
  2. Size buffer pool so the working set fits as much as reasonable without swapping.
  3. Check redo log capacity; avoid too-small redo that forces frequent checkpoints.
  4. Align innodb_io_capacity with real storage capability (not wishful thinking).
  5. Cap application concurrency; use connection pooling; avoid thread storms.
  6. Batch writes and use multi-row inserts where safe (see the sketch after this list).
  7. Measure and alert on log waits, fsync latency indicators, and dirty page ratio.
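
For step 6, a minimal sketch of chunked multi-row inserts, assuming the PyMySQL driver (its executemany can collapse matching INSERT ... VALUES statements into one multi-row statement); host, credentials, and table are illustrative:

import pymysql  # assumption: PyMySQL; other drivers batch similarly

def bulk_insert(rows, chunk=500):
    """Insert rows in chunks so each commit (and each redo-log flush) covers many rows."""
    # hypothetical connection details
    conn = pymysql.connect(host="db01", user="app", password="secret", database="app")
    try:
        with conn.cursor() as cur:
            for i in range(0, len(rows), chunk):
                cur.executemany(
                    "INSERT INTO events (ts, payload) VALUES (%s, %s)",
                    rows[i:i + chunk],
                )
                conn.commit()  # one commit per chunk, not per row
    finally:
        conn.close()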

When to migrate from SQLite to MariaDB (or vice versa)

  • Migrate SQLite → MariaDB when you can’t enforce a single writer, you need multi-host writes, or operational tooling (replication/online backups) matters.
  • Migrate MariaDB → SQLite when the workload is local, single-writer, and you’re paying unnecessary operational overhead for a small embedded dataset.

FAQ

1) Can SQLite handle high write throughput?

Yes—if you batch transactions and keep writers serialized. SQLite can be extremely fast per core because it avoids network hops and server overhead.

2) Why does SQLite say “database is locked” instead of queueing writers?

SQLite’s locking model is simple and intentional. It expects the application to control concurrency (busy_timeout, retries, and ideally a single writer). If you want the database to manage heavy multi-writer concurrency, you’re describing a server DB.

3) Is WAL mode always the right choice for SQLite under bursts?

Often, but not always. WAL helps concurrent reads during writes and can smooth steady write load. It also introduces checkpoint behavior you must manage. If you ignore checkpoints, you get periodic stalls and giant WAL files.

4) For MariaDB, what setting most affects burst behavior?

innodb_flush_log_at_trx_commit and (if binlog is used) sync_binlog. They directly determine how often you pay the fsync cost. Changing them changes durability, so treat it like a business decision.

5) Why do write spikes sometimes look worse after adding indexes?

Indexes increase write amplification. One insert becomes multiple B-tree updates and more dirty pages. Under spikes, the difference between “one write” and “five writes” is not theoretical; it’s your p99.

6) Should I put SQLite on network storage?

Usually no. SQLite depends on correct, low-latency locking and sync semantics. Network filesystems and some distributed volumes can make locking unpredictable and fsync painfully slow. If you must, test the exact storage implementation under load.

7) If MariaDB is slow during bursts, should I just scale up CPU?

Only after you prove you’re CPU-bound. Most burst pain is I/O latency or contention. Adding CPU to an fsync bottleneck is like adding more checkout clerks when the store only has one register.

8) What’s the simplest way to make either database handle bursts better?

Batch commits and throttle concurrency. Bursts are often self-inflicted by “unlimited workers” and “commit every row.” Fix that first.

9) Which is safer for durability under spikes?

Both can be safe; both can be configured unsafely. MariaDB’s defaults tend to be conservative for server workloads. SQLite can be fully durable too, but the performance cost under bursts is more visible because it sits in your request path.

10) How do I know whether I’m bottlenecked on checkpointing?

SQLite: WAL grows and checkpoints report “busy” or don’t truncate. MariaDB: log waits rise, dirty pages climb, and you see stalls tied to flushing. In both cases, correlate with disk latency spikes.

Conclusion: practical next steps

If you have bursty writes and you’re deciding between MariaDB and SQLite, don’t start with ideology. Start with the write model.

  • If you can enforce one writer, batch commits, and keep the database on low-latency local storage, SQLite will handle spikes quietly and cheaply.
  • If you have many writers across processes/hosts and you need operational tools like replication and robust observability, MariaDB is the safer bet—assuming you respect fsync physics and tune with care.

Then do the unglamorous work that prevents drama: measure fsync latency, cap concurrency, batch writes, and make checkpoints/flushing a controlled part of the system rather than a surprise. Your future self will still be tired, but at least they’ll be bored. That’s the goal.
