You ship a perfectly reasonable feature. It works in staging. In production, you get the kind of error that reads like your database is too polite to say what it really thinks: database is locked, SQLITE_BUSY, or a pile of “Lock wait timeout exceeded” messages.
Locking isn’t a bug. It’s how databases keep your money from being double-spent and your rows from becoming modern art. But the locking model you pick changes the failure modes you get. SQLite fails loudly with “busy.” MariaDB usually fails later, with the slow creep of threads piling up until your app looks like it’s running through wet cement.
Fast diagnosis playbook
When you’re on-call, you don’t want a philosophy lecture. You want “what do I check first so I stop bleeding?” Here’s the triage sequence I use for SQLite “busy” errors and MariaDB lock contention.
First: confirm whether you’re blocked on locks or stuck on I/O
- SQLite: “busy” is often lock contention, but can also be “writer can’t checkpoint” which looks like contention and feels like latency.
- MariaDB: lock waits vs disk stalls can look identical at the app tier (requests time out). You need to separate “threads waiting on locks” from “threads waiting on storage.”
Second: find the one transaction holding the door closed
- SQLite: locate long-lived transactions, open connections, or background jobs that write while the app reads.
- MariaDB: identify the blocking thread/transaction and the wait graph (InnoDB status and processlist).
Third: fix scope and concurrency before you tune timeouts
Timeouts are a Band-Aid. Sometimes the right Band-Aid. But they’re still a Band-Aid.
- Shorten transactions.
- Reduce write frequency or batch intelligently.
- Add the right index to prevent lock amplification.
- Only then: adjust busy_timeout (SQLite) or innodb_lock_wait_timeout (MariaDB), plus retry logic with jitter.
One reliable rule: if your “fix” is “increase the timeout,” you’re just choosing how long it takes to fail.
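Since “retry logic with jitter” gets hand-waved a lot, here is a minimal, driver-agnostic sketch of what it means. Everything in it is an assumption you would adapt: operation is whatever your driver call is, and is_retryable is your own predicate (SQLITE_BUSY for SQLite, error codes 1213/1205 for MariaDB).

import random
import time

def run_with_retries(operation, is_retryable, max_attempts=5, base_delay=0.05, max_delay=2.0):
    # operation: a zero-argument callable that does one transaction's worth of work.
    # is_retryable: your predicate for "transient lock error" (busy, deadlock, lock wait timeout).
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise  # out of attempts, or not a lock error: fail loudly
            # Exponential backoff, capped, with full jitter so clients don't all
            # retry in lockstep and extend the contention window.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))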
Locking models: what actually locks, and when
SQLite: one writer at a time, and it means it
SQLite is an embedded database. It’s a library linked into your process, writing to a file. That file is the shared resource. When you see “busy,” SQLite is telling you it can’t acquire the lock it needs on that file (or on the WAL/shared-memory coordination files that come with WAL mode).
The core operational reality:
- Many readers can coexist.
- Only one writer can commit at a time. Even if multiple connections queue up, only one gets to be “the writer.”
- Transactions define lock lifetime. A “small update” inside a transaction that stays open for seconds becomes “a lock held for seconds.”
SQLite has multiple journaling modes. Two matter in production discussions:
- Rollback journal (the default unless you opt into WAL): a writer needs an exclusive lock to commit, so writers and readers block each other more readily; there is no WAL file to checkpoint, but write concurrency suffers under load.
- WAL (Write-Ahead Logging): readers generally don’t block writers and writers generally don’t block readers, but you get a new failure mode: the WAL file grows until it can be checkpointed, and checkpointing itself can become contentious.
SQLite concurrency is not “bad.” It’s explicit. It forces you to be honest about write patterns. That’s a feature right up until you’re pretending it’s a networked OLTP server.
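If your application code talks to SQLite directly, “being explicit” looks roughly like the sketch below, using Python’s stdlib sqlite3 module. The path, the audit_log table, and the timeout values are illustrative, not recommendations.

import sqlite3

def open_db(path="/var/lib/myapp/app.db"):
    # The connect timeout is SQLite's busy wait, in seconds.
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL;")    # readers and writers mostly stop blocking each other
    conn.execute("PRAGMA busy_timeout=5000;")   # same knob in ms, stated explicitly for the next reader
    conn.execute("PRAGMA synchronous=NORMAL;")  # common WAL pairing; check your durability needs first
    return conn

conn = open_db()
with conn:  # commits on success, rolls back on error: one short, bounded write transaction
    conn.execute(
        "INSERT INTO audit_log (event, created_at) VALUES (?, datetime('now'))",
        ("deploy",),
    )
conn.close()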
MariaDB (InnoDB): row locks, gap locks, and quiet queueing
MariaDB with InnoDB is a classic client-server database. It has a buffer pool, a lock manager, background threads, and a lot of code dedicated to letting many sessions do work concurrently without stepping on each other.
The operational reality here:
- Multiple writers can run concurrently as long as they don’t touch the same rows (or ranges, thanks to next-key locking).
- Lock waits are often silent. Your query runs, then stalls. The client sees “it’s slow,” not “it’s busy.”
- Bad indexing causes lock amplification. A missing index can turn an intended row-level operation into a range scan that locks a whole neighborhood.
InnoDB also has deadlocks, which are not the apocalypse—InnoDB detects them and aborts one participant. The problem is when your application treats deadlocks as “this should never happen” and responds with “panic, retry instantly, repeat forever.”
Paraphrased idea (not a verbatim quote): Werner Vogels has repeatedly pushed the principle that you design for failure and build systems that assume components will misbehave. Locks are one of those components.
Interesting facts and short history you can use in a meeting
- SQLite is older than many “modern” web stacks. Development started in 2000 as a self-contained SQL engine for embedded use.
- SQLite’s “serverless” design is literal. There is no daemon. Every process is its own database client and part-time database “server,” competing for file locks.
- WAL mode was introduced to improve concurrency by separating reads from writes via an append-only log, but it introduces checkpointing dynamics you must manage.
- InnoDB wasn’t originally “the default MySQL engine.” It became the de facto standard because it brought transactions, crash recovery, and row-level locking that MyISAM didn’t provide.
- MariaDB is a fork created after Oracle acquired Sun, largely to keep MySQL development open and community-driven.
- “database is locked” in SQLite is often an application bug, like a transaction left open across network calls or a connection leaked in a pool.
- InnoDB’s gap and next-key locks exist to prevent phantoms under certain isolation levels; they can surprise engineers who assume only “rows I touched” get locked.
- SQLite uses POSIX/Win32 locking primitives, meaning NFS and other network filesystems can turn locking from deterministic to “exciting.”
- SQLite is used in more places than most people realize: browsers, mobile OSes, and countless desktop apps—because the operational footprint is tiny and reliability is high when used as intended.
Joke #1: SQLite is like a one-lane bridge—safe, simple, and everyone gets across, but only one truck at a time.
What “busy” really means (and what it doesn’t)
SQLITE_BUSY is SQLite saying: “I tried to get the lock I need, and I couldn’t, and I’m not going to wait forever unless you told me to.” The default behavior in many bindings is effectively “don’t wait.” That’s why “busy” errors show up as soon as concurrency increases by a tiny amount.
The three usual culprits
- A long-lived write transaction. A background job begins a transaction, writes a bunch of rows, and takes its sweet time. Everyone else experiences “busy.”
- WAL checkpoint pressure. The WAL file grows. Eventually checkpointing needs cooperation from readers. A long reader can prevent checkpoint progress.
- Multiple processes on a filesystem that lies about locks. Networked filesystems and container volume drivers can turn correct SQLite usage into random failures.
What it is not
- It’s not primarily “SQLite is slow.” SQLite can be extremely fast on local SSD, especially for read-heavy workloads.
- It’s not fixed by “just add retries” if your transactions are long and your concurrency is sustained. You’ll just build a retry storm.
- It’s not something you should silence. Treat “busy” as a signal that your concurrency model is mismatched.
MariaDB/InnoDB contention: the quieter, sneakier cousin
MariaDB rarely throws “busy” in your face. It lets you queue. That seems nicer, until your application threads stack up like airplanes in a holding pattern. Your p95 latency goes vertical. Then your connection pool saturates. Then your callers start retrying. And now you’ve invented your own distributed denial-of-service.
How MariaDB lock pain typically shows up
- Slow queries with no CPU usage (threads “Sending data” or “Waiting for row lock”).
- Lock wait timeouts (eventually) and deadlocks (occasionally, but in bursts).
- Replication lag because a replica SQL thread is blocked behind a big transaction or a lock wait.
The common root causes
- Transactions too large (bulk updates, schema changes, or app code that “helpfully” wraps everything in a transaction).
- Missing indexes that turn targeted updates into range scans and broad locking.
- Isolation level surprises that introduce gap locks and block inserts into “empty space.”
- Hot rows (counters, “last_seen,” leaderboards) that cause write contention no matter how good the database is.
Joke #2: InnoDB deadlocks are like office meetings—someone has to leave early so anything can move forward.
Practical tasks: commands, outputs, and decisions
These are the tasks I actually run when I’m diagnosing lock contention. Each one includes: a command you can run, what the output means, and the decision you make next.
SQLite tasks (local file DB behavior)
Task 1: Confirm journal mode and whether WAL is enabled
cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA journal_mode;"
wal
Meaning: WAL is enabled. Readers should not block writers in the common case, but checkpointing becomes a thing.
Decision: If you’re not in WAL and you have any concurrency, switch to WAL unless you’re on a filesystem with unreliable locks or you have constraints that require rollback journal.
Task 2: Check busy timeout configured (SQLite side)
cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA busy_timeout;"
0
Meaning: SQLite will fail immediately with SQLITE_BUSY if it can’t get the lock.
Decision: Set a sane busy timeout (often 2000–10000 ms) and fix transaction scope. Timeout alone is not a strategy.
Task 3: Inspect WAL autocheckpoint threshold
cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA wal_autocheckpoint;"
1000
Meaning: SQLite will try to checkpoint after ~1000 pages are in WAL (page size dependent).
Decision: If you see WAL growth and checkpoint stalls, tune this and add a controlled checkpoint routine during low traffic.
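A “controlled checkpoint routine” can be as small as the sketch below, run from a cron job or an in-app scheduler during low traffic. The path is illustrative; wal_checkpoint(TRUNCATE) is real, but it needs quiet readers to complete.

import sqlite3

def checkpoint_wal(path="/var/lib/myapp/app.db"):
    # TRUNCATE checkpoints the WAL and resets it to zero bytes. It needs
    # cooperation from readers, so a long-lived read transaction can leave
    # `busy` nonzero and frames uncheckpointed.
    conn = sqlite3.connect(path, timeout=30.0)
    try:
        busy, wal_frames, checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE);"
        ).fetchone()
        return {"busy": busy, "wal_frames": wal_frames, "checkpointed": checkpointed}
    finally:
        conn.close()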
Task 4: Check if the WAL file is growing (symptom of checkpoint not keeping up)
cr0x@server:~$ ls -lh /var/lib/myapp/app.db*
-rw-r----- 1 myapp myapp 64M Dec 30 11:40 /var/lib/myapp/app.db
-rw-r----- 1 myapp myapp 512M Dec 30 11:41 /var/lib/myapp/app.db-wal
-rw-r----- 1 myapp myapp 32K Dec 30 11:41 /var/lib/myapp/app.db-shm
Meaning: The WAL is huge relative to the base DB file; checkpointing likely isn’t completing.
Decision: Look for long-running readers and consider manual checkpointing, shortening read transactions, or moving heavy analytics reads elsewhere.
Task 5: Find processes holding the database files open
cr0x@server:~$ sudo lsof /var/lib/myapp/app.db | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
myapp 2134 myapp 12u REG 253,0 67108864 91234 /var/lib/myapp/app.db
myapp 2134 myapp 13u REG 253,0 536870912 91235 /var/lib/myapp/app.db-wal
worker 2201 myapp 10u REG 253,0 67108864 91234 /var/lib/myapp/app.db
Meaning: Two processes have the DB open; that’s normal, but now you know who to blame when locks happen.
Decision: If you see unexpected processes (backup scripts, cron jobs, “just a quick report”), stop them or redirect them to a replica/export.
Task 6: Confirm the filesystem type (SQLite on NFS is a recurring tragedy)
cr0x@server:~$ findmnt -no FSTYPE,TARGET /var/lib/myapp
nfs4 /var/lib/myapp
Meaning: Your SQLite DB lives on NFS. Locking semantics may be unreliable or slow; busy errors can be random.
Decision: Move the DB to local storage, or move the workload to MariaDB/Postgres. SQLite-on-NFS is a “works until it doesn’t” arrangement.
Task 7: Measure whether “busy” correlates with slow disk flushes
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (server) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.10 0.00 1.20 22.50 0.00 73.20
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 5.0 420.0 200.0 5200.0 48.2 1.1 92.0
Meaning: High %iowait, high await, high %util. Writers may be holding locks longer because fsync is slow.
Decision: Fix storage first (disk saturation, noisy neighbor, throttling) before rewriting transaction logic. Locks don’t release until writes complete.
Task 8: Check SQLite compile options (sometimes relevant in embedded distros)
cr0x@server:~$ sqlite3 ":memory:" "PRAGMA compile_options;" | head
COMPILER=gcc-13.2.0
ENABLE_FTS5
ENABLE_RTREE
THREADSAFE=1
USE_URI
Meaning: Thread safety enabled; good. In rare cases, unusual builds can change locking behavior.
Decision: If you’re on a weird embedded build, standardize the SQLite package across environments to reduce “it only happens on prod” mysteries.
MariaDB tasks (InnoDB locking and waiting)
Task 9: See who is running and who is waiting
cr0x@server:~$ mysql -uroot -p -e "SHOW FULL PROCESSLIST\G" | sed -n '1,60p'
*************************** 1. row ***************************
Id: 12431
User: app
Host: 10.0.2.15:53320
db: mydb
Command: Query
Time: 28
State: Waiting for row lock
Info: UPDATE accounts SET balance=balance-10 WHERE id=42
*************************** 2. row ***************************
Id: 12405
User: app
Host: 10.0.2.14:53112
db: mydb
Command: Query
Time: 61
State: Updating
Info: UPDATE accounts SET balance=balance+10 WHERE id=42
Meaning: One session is blocked waiting on a row lock while another is actively updating the same row.
Decision: Find the blocker and shorten that transaction; also consider design changes for hot rows (sharding counters, batching, or moving to atomic primitives).
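If you would rather query than squint at the processlist, information_schema.INNODB_TRX lists open transactions and when they started, so the oldest blocker floats to the top. A sketch, assuming PyMySQL (any MariaDB/MySQL driver works the same way) and placeholder credentials:

import pymysql  # assumption: PyMySQL; swap in your driver of choice

FIND_OLD_TRX = """
SELECT trx_mysql_thread_id AS thread_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds,
       trx_state,
       trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started
"""

conn = pymysql.connect(host="127.0.0.1", user="app_ro", password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        cur.execute(FIND_OLD_TRX)
        for thread_id, started, age, state, query in cur.fetchall():
            # thread_id matches the Id column in SHOW PROCESSLIST, so you can
            # KILL the blocker if it comes to that.
            print(f"{age:>6}s  thread={thread_id}  started={started}  state={state}  query={query!r}")
finally:
    conn.close()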
Task 10: Check InnoDB engine status for lock waits and deadlocks
cr0x@server:~$ mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 987654, ACTIVE 3 sec starting index read
...
*** (2) TRANSACTION:
TRANSACTION 987655, ACTIVE 3 sec updating or deleting
...
WE ROLL BACK TRANSACTION (1)
------------
TRANSACTIONS
------------
Trx id counter 987700
History list length 1240
Meaning: A deadlock occurred; InnoDB chose a victim. Also note history list length (purge lag) can indicate long transactions.
Decision: Ensure the app retries deadlocked transactions with backoff; also reduce transaction size and ensure consistent locking order across code paths.
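“Consistent locking order” is easier to show than to say. A sketch for the transfer-style workload above, assuming a DB-API cursor with %s placeholders (PyMySQL style) and the same accounts table: every code path locks the contested rows in ascending id order, so two concurrent transfers can never each hold one row while waiting on the other.

def transfer(cur, from_id, to_id, amount):
    # Acquire row locks in a deterministic order in every code path.
    for row_id in sorted((from_id, to_id)):
        cur.execute("SELECT id FROM accounts WHERE id = %s FOR UPDATE", (row_id,))
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_id))
    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_id))
    # The caller commits immediately after; keep the transaction tight around these statements.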
Task 11: Verify lock wait timeout and whether it matches reality
cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50 |
+--------------------------+-------+
Meaning: Sessions can wait up to 50 seconds before failing a lock wait.
Decision: If your upstream timeouts are 5–10s, a 50s lock wait just ties up threads. Lower it to align with end-to-end latency budgets, but only after you identify the blocking pattern.
Task 12: Check transaction isolation level (gap locks surprise people)
cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'transaction_isolation';"
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Meaning: Default isolation is repeatable read, which can use next-key locks and block inserts in ranges.
Decision: If you’re seeing insert contention on ranges and can tolerate it, consider READ-COMMITTED for OLTP workloads—after testing for correctness.
Task 13: Identify missing indexes that cause broad locking
cr0x@server:~$ mysql -uroot -p -e "EXPLAIN UPDATE orders SET status='paid' WHERE customer_id=123 AND status='pending'\G"
*************************** 1. row ***************************
id: 1
select_type: UPDATE
table: orders
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 842113
filtered: 10.00
Extra: Using where
Meaning: Full table scan (type: ALL) on an update. This can lock a lot more than you think and take forever.
Decision: Add a composite index (e.g., (customer_id, status)) and re-check the plan. This is one of the highest ROI lock fixes you can make.
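The fix is one statement plus a verification step. A sketch, assuming the same kind of connection object as the other MariaDB examples; the index name is made up, so use your own naming convention:

def add_orders_index_and_verify(conn):
    with conn.cursor() as cur:
        # Composite index matching the WHERE clause columns.
        cur.execute(
            "ALTER TABLE orders ADD INDEX idx_orders_customer_status (customer_id, status)"
        )
        cur.execute(
            "EXPLAIN UPDATE orders SET status='paid' "
            "WHERE customer_id=123 AND status='pending'"
        )
        # Expect `type` to move from ALL to range/ref and `key` to name the new index.
        print(cur.fetchone())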
Task 14: Monitor InnoDB metrics for lock waits (quick signal)
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';"
+-------------------------------+----------+
| Variable_name | Value |
+-------------------------------+----------+
| Innodb_row_lock_current_waits | 12 |
| Innodb_row_lock_time | 184223 |
| Innodb_row_lock_time_avg | 3838 |
| Innodb_row_lock_time_max | 60000 |
| Innodb_row_lock_waits | 48 |
+-------------------------------+----------+
Meaning: Lock waits are happening now (current_waits) and have been expensive on average.
Decision: Escalate to query/transaction analysis. If current_waits stays elevated during incidents, you’re dealing with real contention, not a one-off.
Task 15: Check whether the server is thread-saturated due to waits
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Threads_running'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 187 |
+-----------------+-------+
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 498 |
+----------------------+-------+
Meaning: Many threads are running (often “running” includes waiting). Connections have been heavily used.
Decision: If the app is retrying aggressively, rate-limit retries and add jitter immediately. Then identify and fix the blocking query.
Task 16: OS-level confirmation: are we CPU-bound or waiting?
cr0x@server:~$ top -b -n 1 | sed -n '1,12p'
top - 11:44:12 up 17 days, 3:28, 1 user, load average: 22.15, 20.97, 18.44
Tasks: 312 total, 5 running, 307 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.3 us, 1.1 sy, 0.0 ni, 72.8 id, 21.7 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 32112.0 total, 1120.3 free, 10234.8 used, 20756.9 buff/cache
MiB Swap: 2048.0 total, 1980.0 free, 68.0 used. 18940.2 avail Mem
Meaning: High IO wait (wa). This often stretches lock hold time because commits wait on fsync.
Decision: Investigate storage latency. Lock tuning won’t fix slow durable writes.
Three corporate-world mini-stories (anonymized, painfully plausible)
Incident caused by a wrong assumption: “SQLite will be fine; we only write a little”
At a mid-size company, a team shipped an internal “ops notebook” service. It stored incident notes, runbooks, and a small audit log. Someone chose SQLite because it was one binary, no extra infrastructure, and the writes were “tiny.” It ran on a VM with a shared filesystem mount so two app instances could access the same DB file.
The wrong assumption wasn’t “SQLite can’t handle writes.” SQLite can handle plenty of writes. The wrong assumption was thinking the filesystem and process model didn’t matter. Under light use, the service looked great. Under incident load—when multiple engineers were editing notes and the audit log was spiking—requests started failing with database is locked. Naturally, this happened during incidents, when the tool was needed most.
The first fix was the classic: add retries. That turned “some requests fail fast” into “requests hang, then fail.” The connection pool filled. Latency went nonlinear. Engineers started copy-pasting notes into chat because the notebook was timing out. Productivity cratered. The postmortem included the phrase “tooling became a single point of failure,” which is a polite way of saying “we built a trap.”
The real fix was boring but correct: move the DB file to local disk and run a single writer instance (or switch to MariaDB). They also changed the audit log to batch writes and shortened transactions in the app code. Lock errors dropped to near-zero and, more importantly, the remaining ones were actionable signals rather than background noise.
Optimization that backfired: WAL everywhere, forever, with a never-ending reader
A different company had a local SQLite database embedded in an edge agent. They enabled WAL mode across the fleet to improve concurrency: the agent was reading config while also writing metrics. Initial benchmarks looked better. Everyone high-fived. Then months later, field devices started running out of disk in weird ways.
It wasn’t the base DB file. It was the WAL file. It kept growing, because checkpointing couldn’t complete reliably. The agent had a “watcher” thread that held a read transaction open while streaming changes to another component. That read transaction prevented old WAL frames from being checkpointed. The WAL became a slow-motion disk leak.
Operations noticed devices becoming sluggish. Disk usage alarms triggered. Some devices hit 100% disk and started failing in unrelated subsystems. Debugging was painful because the agent itself was “fine” until it wasn’t. When it crossed the threshold, everything failed at once.
The fix: redesign the watcher to avoid holding a transaction open, and explicitly checkpoint during idle periods. They also set reasonable limits for WAL growth through policy (alerting on WAL size relative to DB size). WAL wasn’t the enemy. Unbounded WAL plus a never-ending reader was.
Boring but correct practice that saved the day: consistent transaction boundaries and backoff
A payments-adjacent team ran MariaDB. They had periodic lock spikes, but incidents were rare. The reason wasn’t luck. They had an internal standard: every write path had a maximum transaction scope, no network calls inside transactions, and all retryable database errors used exponential backoff with jitter.
One day a batch reconciliation job was deployed with an accidental cross-join update. It began locking more rows than intended. In many orgs, that turns into a thundering herd: app retries hammer the DB, the DB becomes unresponsive, and the incident grows teeth.
Here, the job still caused pain, but the blast radius was contained. The app backed off instead of piling on. Other services degraded gracefully. Engineers had time to identify the query, kill it, and deploy a fix. Postmortem action items were straightforward: add a guardrail, improve query review for batch jobs.
This is the unsexy truth of reliability: the practices that feel slow in development are the ones that keep production from turning into performance art.
How to prevent busy errors (SQLite) and lock pileups (MariaDB)
Pick the right engine for the write topology
If you have one process, local disk, and moderate concurrency: SQLite can be excellent. If you have multiple app instances, multiple hosts, and “writes from everywhere”: SQLite becomes a coordination problem you didn’t mean to adopt.
- SQLite is a great choice for embedded, single-node, local-disk workloads; read-heavy systems; caches; durable queues with controlled writers.
- MariaDB is a great choice when you need real concurrent writers, remote access, operational tooling, and predictable behavior under multiple clients.
SQLite: concrete tactics that actually work
- Enable WAL for concurrent read/write (most of the time). Then monitor WAL growth and checkpoint behavior.
- Set busy_timeout and/or a busy handler so short contention doesn’t become errors. Keep it aligned with your request budget (don’t set 60 seconds and pretend it’s fine).
- Keep transactions short. Do not hold a transaction open while you do network calls, parse JSON, or wait for a user to blink.
- Batch writes, but don’t create “mega transactions.” Batch into small commits (e.g., hundreds or thousands) rather than one infinite transaction.
- Use a single-writer pattern if you can. A dedicated writer thread/process that serializes writes can eliminate contention and simplify retries (see the sketch after this list).
- Avoid running SQLite DB files on network filesystems. If you must, test locking behavior under load. Most teams discover the truth too late.
- Be explicit about checkpointing for long-running services that have persistent readers. Consider a periodic PRAGMA wal_checkpoint(TRUNCATE); during low traffic, but test carefully.
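Here is the single-writer pattern promised above, as a minimal sketch: one dedicated thread owns the only writing connection and drains a queue, so application threads never fight over the write lock. The events table and payload shape are made up; adapt the queue item format to your app.

import queue
import sqlite3
import threading

write_queue = queue.Queue()  # items are (sql, params) tuples

def writer_loop(path="/var/lib/myapp/app.db"):
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL;")
    while True:
        sql, params = write_queue.get()   # blocks until a write arrives
        try:
            with conn:                    # one short transaction per item
                conn.execute(sql, params)
        finally:
            write_queue.task_done()

threading.Thread(target=writer_loop, daemon=True).start()

# Application threads enqueue writes instead of opening their own write transactions:
write_queue.put(("INSERT INTO events (kind, payload) VALUES (?, ?)", ("login", "{}")))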
MariaDB/InnoDB: concrete tactics that actually work
- Fix indexing first. Most “locking issues” are really “we’re scanning too much and therefore locking too much.”
- Lower transaction scope. Commit sooner. Split big updates into bounded chunks (see the sketch after this list). Avoid “read-modify-write” patterns on hot rows.
- Design around hot rows. Counters and “last seen” fields can be written by many threads. Use sharded counters, append-only event tables, or periodic aggregation.
- Make retry behavior civilized. Deadlocks happen. Your app should retry with jitter, and only a bounded number of times.
- Align lock wait timeouts with upstream timeouts. If your API times out at 8 seconds, letting DB sessions wait 50 seconds is just resource hoarding.
- Be careful with isolation changes. Switching to READ-COMMITTED can reduce locking contention, but validate correctness (phantoms, repeatable reads) for your workload.
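And here is the chunked version of “split big updates” referenced above, as a sketch: walk the primary key in bounded ranges and commit per chunk, so no single transaction locks a wide slice of the table or stalls replication. The table, columns, and chunk size are illustrative.

def archive_in_chunks(conn, chunk_size=5000):
    # One small transaction per chunk: row locks are released quickly and
    # replicas don't choke on one giant event.
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(id), 0) FROM orders")
        (max_id,) = cur.fetchone()
    last_id = 0
    while last_id < max_id:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE orders SET status = 'archived' "
                "WHERE id > %s AND id <= %s AND status = 'done'",
                (last_id, last_id + chunk_size),
            )
        conn.commit()
        last_id += chunk_size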
Common mistakes: symptoms → root cause → fix
1) SQLite “database is locked” spikes after adding a background job
Symptoms: Busy errors correlate with a cron/worker run; app logs show failures even at low traffic.
Root cause: The job opens a transaction and performs many writes (or holds it open while doing other work), blocking other writers.
Fix: Commit in smaller chunks; move job to a single-writer queue; add busy_timeout; ensure the job doesn’t run concurrently with itself.
2) WAL file grows without bound
Symptoms: app.db-wal keeps growing; disk usage climbs; occasional stalls during checkpoint attempts.
Root cause: A long-lived read transaction prevents checkpointing from recycling WAL frames.
Fix: Avoid long read transactions; change the reader to snapshot smaller scopes; schedule checkpoints; verify your code doesn’t keep a cursor open across long operations.
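The shape of that fix, as a sketch: read a bounded batch, finish the statement, and only then do the slow downstream work, so no cursor or connection pins old WAL frames. Table and column names are made up.

import sqlite3

def snapshot_pending(path="/var/lib/myapp/app.db"):
    conn = sqlite3.connect(path, timeout=5.0)
    try:
        rows = conn.execute(
            "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id LIMIT 500"
        ).fetchall()   # fetchall() finishes the read; nothing is left open
        return rows
    finally:
        conn.close()   # no lingering reader to block checkpointing

# The slow part (network calls, parsing) happens after the read has ended.
for row_id, payload in snapshot_pending():
    pass  # hand off to the slow consumer here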
3) SQLite “busy” happens mostly on certain hosts
Symptoms: Same workload, different lock behavior depending on host or container.
Root cause: DB file on a different filesystem (overlayfs quirks, NFS, network block device), or storage latency causing longer lock hold times.
Fix: Standardize storage location and filesystem; move to local SSD; measure fsync latency; if you need multi-host access, move to MariaDB.
4) MariaDB lock wait timeouts during “small” updates
Symptoms: Lock wait timeout exceeded while updating a single row; intermittent but recurring.
Root cause: Hot row contention (same row updated by many sessions) or a transaction that touches the row and then holds the lock while doing other work.
Fix: Redesign hot row writes (shard, append-only, cache then aggregate); reduce transaction duration; avoid external calls inside transactions.
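“Shard the counter” in practice looks like the sketch below: spread increments across N rows so concurrent writers rarely collide, then sum on read or aggregate periodically. It assumes a counters table with a unique key on (name, shard); the names and shard count are illustrative.

import random

N_SHARDS = 16  # more shards, less contention, slightly more work on read

def bump_counter(cur, name):
    # Each increment lands on a random shard row instead of one hot row.
    cur.execute(
        "INSERT INTO counters (name, shard, value) VALUES (%s, %s, 1) "
        "ON DUPLICATE KEY UPDATE value = value + 1",
        (name, random.randrange(N_SHARDS)),
    )

def read_counter(cur, name):
    cur.execute("SELECT COALESCE(SUM(value), 0) FROM counters WHERE name = %s", (name,))
    (total,) = cur.fetchone()
    return total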
5) MariaDB suddenly “slow everywhere” after an index change
Symptoms: Increased lock waits; replication lag; CPU not pegged, but latency high.
Root cause: Query plan regression causing wider scans and more locking, or online DDL causing metadata contention in a busy schema.
Fix: Validate plans with EXPLAIN before and after; stage index changes; ensure queries use intended indexes; throttle schema changes.
6) Retrying makes everything worse
Symptoms: After adding retries, incidents get longer; DB sees more QPS during lock events.
Root cause: Immediate retries create a herd. Every client retries at once, extending contention windows.
Fix: Exponential backoff + jitter; cap retries; fail fast for non-idempotent operations unless carefully designed.
Checklists / step-by-step plan
Plan A: You’re on SQLite and you want fewer “busy” errors this week
- Verify storage is local and stable. If it’s NFS/remote/overlay weirdness, prioritize moving it. Busy errors there are a lifestyle.
- Enable WAL (if appropriate). Confirm with PRAGMA journal_mode;.
- Set a sane busy timeout. Start with 2000–5000 ms for interactive paths; adjust to match request budgets.
- Audit transaction scope. Ensure no transactions span network calls, sleeps, or “processing loops.”
- Identify your writers. List processes with lsof; ensure you don’t have surprise writers.
- Watch WAL size vs DB size. Alert if the WAL exceeds a ratio you consider dangerous for your disk budget (see the ratio check after this list).
- Introduce a single-writer pattern for high-write paths. Queue writes in-process or via IPC; serialize commits.
- Add bounded retries with jitter. Only for retryable operations; log the retry count and duration so it doesn’t hide problems.
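The WAL-vs-DB ratio check mentioned in this list can be a tiny script wired into whatever alerting you already have. A sketch; the path and threshold are judgment calls, not recommendations.

import os
import sys

DB = "/var/lib/myapp/app.db"   # illustrative path
MAX_RATIO = 2.0                # alert once the WAL passes 2x the base DB file

def wal_ratio_ok():
    db_size = os.path.getsize(DB)
    wal_path = DB + "-wal"
    wal_size = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
    ratio = wal_size / max(db_size, 1)
    print(f"db={db_size} wal={wal_size} ratio={ratio:.2f}")
    return ratio <= MAX_RATIO

if __name__ == "__main__":
    sys.exit(0 if wal_ratio_ok() else 1)  # non-zero exit is easy to alert on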
Plan B: You’re on MariaDB and lock waits are killing p95
- Capture the blocking query. Processlist + InnoDB status during the incident.
- Check for missing indexes on the hot update paths using EXPLAIN.
- Reduce transaction duration. Confirm no code path keeps transactions open across non-DB work.
- Align timeouts. Set innodb_lock_wait_timeout to something that matches upstream budgets and avoids thread hoarding.
- Fix retries. Ensure deadlocks/lock timeouts are retried with exponential backoff and jitter. Cap retries.
- Design away hot rows. If one row is a contested resource, scale won’t fix it. Change the data model.
- Validate storage latency. If fsync is slow, locks last longer; you’ll blame the lock manager for a storage problem.
Decision checklist: when to move from SQLite to MariaDB
- You need multiple app instances writing concurrently and can’t or won’t funnel writes through a single writer.
- You need remote access from multiple hosts to the same database.
- You need operational features: fine-grained privileges, built-in replication, online schema changes, mature monitoring of locks.
- You’re currently “solving” busy errors with bigger timeouts and more retries.
FAQ
1) Is SQLite “database is locked” always a bug?
It’s usually a design mismatch or a transaction-scope bug. Occasionally it’s legitimate brief contention that needs a busy timeout. If it’s frequent, treat it as a system design problem.
2) Should I enable WAL mode by default?
For most local-disk, concurrent read/write workloads: yes. If you’re on a filesystem with questionable locking semantics (common with networked storage), test carefully or avoid WAL.
3) If I set busy_timeout, am I done?
No. You’ve just decided to wait instead of failing fast. If the writer holds the lock for 5 seconds and your timeout is 2 seconds, you still fail. If you set it to 60 seconds, your app will just hang longer.
4) Why does WAL sometimes make things worse?
WAL improves reader/writer concurrency, but it introduces checkpoint behavior. Long-lived readers can block checkpointing, letting the WAL grow and occasionally causing stalls.
5) Can SQLite handle multiple writers?
Multiple connections can attempt to write, but only one can commit at a time. Under sustained write concurrency, you get queueing and busy errors unless you coordinate writers.
6) In MariaDB, why do I see inserts blocked when no one is touching the same rows?
Gap/next-key locks under REPEATABLE-READ can lock ranges, not just individual rows. Missing indexes can also widen scans and lock ranges you didn’t intend.
7) Should I lower innodb_lock_wait_timeout to avoid pileups?
Often, yes—after you confirm the workload. A lower timeout prevents thread hoarding and forces faster failure, which can protect the server. But if you lower it without fixing the blocker, you’ll trade latency for errors.
8) What’s the right retry strategy for lock timeouts and deadlocks?
Retry deadlocks and lock timeouts with exponential backoff and jitter, capped retries, and idempotency awareness. Never retry instantly in a tight loop. That’s how you turn a small lock event into an outage.
9) Can I run SQLite on Kubernetes with a shared volume?
You can, but “can” isn’t “should.” If that shared volume is networked, locking and latency can be unpredictable. If you need multi-pod writers, use MariaDB (or another client-server DB) or enforce a single-writer architecture.
10) How do I know whether I’m lock-bound or I/O-bound?
Look for high IO wait (top, iostat) and long commit times, versus many sessions stuck in lock wait states (processlist, InnoDB status). Often it’s both: slow I/O extends lock hold time, which increases contention.
Conclusion: next steps that actually reduce pager noise
If you remember one thing: locking errors are usually your system telling you the truth about shared resources. SQLite tells you quickly and bluntly. MariaDB lets you pretend for longer.
Do these next:
- Run the fast diagnosis playbook and decide whether you’re lock-bound or I/O-bound.
- Find the longest transaction and make it shorter. This fixes more incidents than any knob.
- Stop retry storms with backoff + jitter and a hard cap.
- For SQLite: WAL + a sane busy_timeout + controlled writers + checkpoint awareness.
- For MariaDB: index correctness, transaction boundaries, hot-row redesign, and timeouts aligned with real latency budgets.
Then choose the database that matches your concurrency reality, not your org chart. Your future self will enjoy the novelty of sleeping through the night.