The database choice that hurts most is the one you don’t realize you made. You ship a tidy app, it works on your laptop,
it works in staging, and then it meets Friday traffic, a noisy neighbor, or a slightly “creative” filesystem. Now you’re
learning about locks, fsync, and backups the hard way.
PostgreSQL and SQLite are both excellent. They also fail differently. This is the practical map of where each one breaks first,
why, and how to diagnose it fast when your pager is doing its little dance.
The decision lens: what are you optimizing for?
“Reliability vs simplicity” sounds like a philosophical debate until you’ve been on-call for both.
In practice it’s a trade between where complexity lives and which failure you can tolerate.
SQLite optimizes for frictionless correctness… until you scale the wrong axis
SQLite is a library. There is no server. That’s not a cute detail; it changes your entire failure surface.
You get fewer moving parts, fewer ports, fewer daemons, fewer credentials, fewer runbooks.
Your app process is the database process.
The first thing to fail isn’t “availability.” It’s usually write concurrency or operational assumptions:
“We can put the DB on NFS,” or “we can have 20 workers writing constantly,” or “we don’t need a backup because it’s a file.”
PostgreSQL optimizes for predictable behavior under load… and demands you pay the operator tax
PostgreSQL is a database server. It’s designed to be shared by many clients doing many things at once, and it’s very good at that.
But you inherit the classic operational chores: memory sizing, connection management, WAL, replication, vacuum, upgrades, monitoring,
and the occasional “why is that query suddenly a crime?”
What fails first in PostgreSQL is rarely the database file. It’s usually latency (bad queries, IO stalls),
operational drift (autovacuum mis-tuned, storage near full), or human error (the wrong setting in prod).
My bias: if your app is single-node, modest write volume, and you can accept “scale up or refactor later,” SQLite is a gift.
If you need multi-writer throughput, remote access, multi-tenant isolation, or serious HA, PostgreSQL is the adult in the room.
You can still hurt yourself with both; PostgreSQL just gives you more ways to do it at higher budgets.
Interesting facts and historical context
- SQLite was created in 2000 by D. Richard Hipp to support a U.S. Navy contract; it was designed to be small, reliable, and self-contained.
- SQLite is famously “serverless,” meaning no separate daemon; it’s a library linked into your process, changing failure domains and deployment models.
- SQLite aims for a stable file format across versions. You can often move a database file between machines with minimal drama—if you respect filesystem rules.
- PostgreSQL descends from POSTGRES at UC Berkeley (1980s). The “SQL” part arrived later; the culture of correctness stuck.
- PostgreSQL introduced MVCC early as a core concurrency model, which is why readers don’t block writers the way they do in simpler locking systems.
- SQLite’s WAL mode (write-ahead logging) became widely used to improve concurrency and performance by separating reads from writes.
- PostgreSQL’s WAL is not just for durability; it’s the backbone of replication and point-in-time recovery.
- SQLite is everywhere—phones, browsers, embedded devices—because “it’s just a file” is exactly the deployment story hardware vendors want.
- PostgreSQL’s “VACUUM” lineage is the operational cost of MVCC: dead tuples don’t vanish unless the system cleans them up.
What fails first: PostgreSQL vs SQLite
SQLite: the first failure is usually contention or the filesystem
SQLite’s pain is unromantic. It’s not a “database outage,” it’s “why are requests stuck,” or “why did we get ‘database is locked’,”
or “why is this file corrupted after a reboot.”
Common “first failures” for SQLite in production:
- Write contention: too many concurrent writers, long transactions, or a busy checkpoint cycle.
- Bad placement: database file on network filesystems or flaky storage semantics; file locks and durability guarantees get weird.
- Misconfigured durability: PRAGMA settings chosen for speed without understanding the crash model.
- Backup by copying the file live: you get a clean-looking file that’s logically inconsistent (or just corrupt).
- Process-level failure domain: memory corruption, SIGKILL, or container eviction kills both app and DB together.
Joke #1: SQLite is like a bicycle—reliable, elegant, and quietly judgmental when you try to haul a refrigerator on it.
PostgreSQL: the first failure is usually ops drift or IO pressure
PostgreSQL’s initial failures are less about correctness and more about operational entropy. It is extremely good at not corrupting data.
It is less good at preventing you from running it into a wall with your own choices.
Common “first failures” for PostgreSQL in production:
- Connection storms: too many client connections, memory overhead, context switching, and lock contention.
- IO stalls: slow storage, mis-sized WAL, checkpoint spikes, or saturated disks.
- Autovacuum lag: bloat, table and index growth, transaction ID wraparound risk, and query plans degrading.
- Replication surprises: lag, slots holding WAL forever, failover scripts that work until they don’t.
- Bad query plans: missing indexes, stale stats, and the occasional “we turned off nested loops globally because a blog post said so.”
Durability and crash safety: what “committed” really means
Reliability begins at the commit boundary. When your code returns “OK,” what exactly did you buy?
Not “probably.” Not “cached.” Not “maybe after the kernel feels like it.” What did you actually guarantee?
SQLite durability is configurable—and that’s both power and a trap
SQLite gives you pragmas that change durability characteristics. If you don’t set them, you inherit defaults that are generally safe,
but performance-sensitive teams often go hunting for speed and accidentally bargain away crash safety.
Key SQLite levers:
- journal_mode: DELETE, TRUNCATE, PERSIST, MEMORY, WAL, OFF. WAL is usually the best practical balance.
- synchronous: OFF, NORMAL, FULL, EXTRA. Lower means faster and riskier on crash/power loss.
- locking_mode: NORMAL vs EXCLUSIVE; EXCLUSIVE can improve speed but can surprise other processes.
- temp_store: affects temp objects; can shift IO patterns.
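A minimal sketch of a sane starting point on a local SSD (the values are a baseline to test against, not gospel; journal_mode is stored in the database file, while synchronous and busy_timeout are per-connection and belong in your application's connection setup):
PRAGMA journal_mode = WAL;    -- persists in the database file across connections
PRAGMA synchronous = FULL;    -- per-connection: set on every connect where durability matters
PRAGMA busy_timeout = 5000;   -- per-connection: wait up to 5 seconds for locks instead of failing instantly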
The hard reality: SQLite can be extremely crash-safe on a local disk with correct settings. But if you run it on a filesystem that lies
about fsync, or a network share with advisory locks behaving differently, your durability story becomes “hope-based.”
PostgreSQL durability is more rigid—and easier to reason about
PostgreSQL also has durability knobs, but the culture and defaults lean toward correctness. The core model:
a commit is durable when the WAL record is safely persisted (subject to your synchronous_commit setting).
PostgreSQL’s WAL is a professional-grade audit trail: it’s used for crash recovery, replication, and point-in-time restore. This is why
Postgres can survive a crash and come back with data intact—assuming the storage obeys durability semantics.
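A quick way to see what “committed” currently means on your server; these are standard settings, and the output here is illustrative:
cr0x@server:~$ psql -X -qAt -c "SELECT name, setting FROM pg_settings WHERE name IN ('fsync','full_page_writes','synchronous_commit','wal_sync_method');"
fsync|on
full_page_writes|on
synchronous_commit|on
wal_sync_method|fdatasync
If fsync is off anywhere outside a disposable benchmark box, treat it as an incident that hasn’t happened yet.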
The storage layer decides who wins the argument
Both databases rely on the filesystem and underlying storage to honor writes and flushes. If the platform lies, the database loses.
Battery-backed caches, write barriers, and correct mount options matter. So does not using a filesystem+transport combo that “mostly works.”
One paraphrased idea often attributed to Werner Vogels (Amazon CTO): reliability comes from building systems that assume failure and recover quickly rather than pretending failure won’t happen.
Concurrency: locks, contention, and the shape of pain
SQLite concurrency: one writer (mostly), many readers (usually)
SQLite allows multiple readers, and with WAL it can keep readers going while a writer appends to the WAL. But it still has a core constraint:
write transactions serialize. You can have concurrent writers in the sense of multiple threads trying, but they’ll queue and time out if you
design the app like a chatty OLTP system.
What this means in practice:
- Short transactions are survival. Long transactions are self-harm.
- Busy timeouts are not a fix; they are a way to mask queueing until latency becomes user-visible.
- WAL checkpoint behavior can create periodic stalls if not managed (especially on slow disks).
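The first point in concrete form: take the write lock up front, do the minimum, get out. A sketch (the counters table is hypothetical):
BEGIN IMMEDIATE;                             -- claim the write lock now; fail fast if another writer holds it
UPDATE counters SET n = n + 1 WHERE id = 1;  -- keep the work inside the transaction small
COMMIT;                                      -- release the lock as soon as possible
Combined with a sensible busy_timeout, this turns surprise “database is locked” errors into bounded waits you can actually reason about.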
PostgreSQL concurrency: MVCC plus locks, which is both better and more complicated
PostgreSQL shines under mixed read/write concurrency because readers don’t block writers in the naive way. But don’t confuse “MVCC” with “no locks.”
Postgres has plenty of locks, plus heavyweight locks, plus lightweight locks, plus internal contention points.
The typical Postgres pain shape:
- One slow query blocks others by holding locks longer than expected.
- One migration changes a table and causes lock queues and timeouts.
- Too many connections burn memory and saturate CPU on context switching.
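One cheap defense for the migration case: make the DDL give up instead of queueing everyone behind it. A sketch; the timeouts and table are examples:
SET lock_timeout = '5s';                  -- abort the DDL if it cannot get its lock quickly
SET statement_timeout = '10min';          -- backstop against a runaway migration
ALTER TABLE orders ADD COLUMN note text;  -- now it fails fast under contention instead of blocking the app
A failed, retried migration is an annoyance. A lock queue at peak traffic is an incident.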
Joke #2: PostgreSQL will let you do almost anything—sometimes including things you’ll deny ever doing during the postmortem.
Backups and restores: simplicity vs guarantees
SQLite backups are simple only if you do them correctly
SQLite feels backup-friendly because it’s a single file. That seduces people into copying it while the app is writing.
Sometimes you get away with it. Then one day you don’t, and you discover your backup strategy was “vibes.”
Safe patterns:
- Use the SQLite backup API (via sqlite3 .backup or application integration).
- If using WAL: capture the database and WAL state coherently (or checkpoint before copying).
- Validate restores regularly. A backup you haven’t restored is a rumor.
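A minimal backup-and-verify pipeline looks something like this (a sketch; paths are examples, and you still need scheduling, rotation, and alerting around it):
cr0x@server:~$ sqlite3 /var/lib/app/app.db ".backup '/var/backups/app-$(date +%F).db'"
cr0x@server:~$ sqlite3 "/var/backups/app-$(date +%F).db" "PRAGMA integrity_check;"
ok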
PostgreSQL backups are more complex—and far more flexible
Postgres has mature tooling for logical backups (pg_dump), physical backups (pg_basebackup),
and point-in-time recovery using WAL archiving. The complexity isn’t optional, but it buys you real operational power:
restore to yesterday, clone production, recover from operator mistakes, and build read replicas.
The typical failure mode isn’t “the backup tool doesn’t work.” It’s “we never tested restore,” or “WAL archiving silently broke,”
or “we filled the disk with WAL because a replication slot pinned it.”
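If WAL archiving is part of your backup story, verify it is actually succeeding; pg_stat_archiver is the standard place to look (output illustrative):
cr0x@server:~$ psql -X -qAt -c "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"
18231|0|
A growing failed_count means your point-in-time recovery story is quietly eroding while the dashboards stay green.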
Operations reality: monitoring, upgrades, migrations
SQLite ops: fewer knobs, but you own the process lifecycle
SQLite doesn’t have a server to monitor, but you still need operational discipline:
file permissions, disk fullness, filesystem consistency, backups, and application-level timeouts.
Your “DB upgrade” might be a library upgrade. Your “DB restart” is an app restart.
If your app runs as a fleet of stateless containers, SQLite becomes tricky unless each instance has local persistent storage and you accept
per-instance databases or a careful synchronization model. SQLite is great at being local. It’s not great at pretending to be shared.
PostgreSQL ops: you get power tools, and power tools cut fingers
PostgreSQL operational excellence looks like boring consistency:
track WAL volume, replication lag, vacuum health, slow queries, disk growth, and lock contention.
Upgrades are manageable, but they’re a thing. So are schema migrations.
The upside is you can scale reads with replicas, scale writes with better design (or sharding if you must), and recover from a wider class of disasters.
Three corporate mini-stories from the trenches
Mini-story 1: Incident caused by a wrong assumption
A product team shipped a small internal tool with SQLite because it was “just metadata.” It ran on a shared VM, and the database file lived on a mounted network share.
That choice wasn’t debated; it was implicit. The team wanted the file to survive VM rebuilds, and the share was “reliable storage.”
Weeks later, a maintenance event triggered a brief network hiccup. The app kept running. SQLite kept trying.
The logs showed intermittent “database disk image is malformed” and “database is locked” errors. The tool didn’t fully die; it just became untrustworthy.
Users retried. Retries amplified writes. The database file became a crime scene.
The postmortem was awkward because nobody made a single obviously stupid change. They made a reasonable-sounding assumption:
“A database file is a file, so any storage that stores files is fine.” But SQLite’s correctness depends on specific locking and durability semantics.
The network filesystem mostly worked—until it didn’t.
The fix was boring: move the SQLite file to local storage, enable WAL, and create a real backup pipeline using .backup.
Eventually, as usage grew, they migrated to PostgreSQL so multiple services could write without treating a file lock like a distributed consensus algorithm.
Mini-story 2: An optimization that backfired
Another team ran PostgreSQL for a customer-facing application with a normal OLTP pattern: lots of reads, steady writes, occasional bursts.
A new engineer noticed latency spikes during checkpoints. They read a few tuning posts and decided the fix was to “reduce disk flush overhead.”
They changed settings to make Postgres less eager to fsync and adjusted checkpoint parameters aggressively.
For a week, graphs looked prettier. Latency smoothed out. The engineer earned a quiet sense of victory.
Then a host reboot happened during a power event. Postgres recovered, but the last few minutes of acknowledged transactions were missing.
Users filed tickets about disappearing updates. The application team initially suspected caching. It wasn’t caching.
The optimization traded durability for performance without a written risk decision. In isolation, the settings were “valid.”
In reality, the business requirement was “don’t lose committed updates,” and the system’s semantics had been altered.
The remediation wasn’t just reverting settings. They implemented a change-management rule: durability-affecting parameters require review,
and any performance test must include “pull the power” style failure injection (or as close as you can safely get in a lab).
Mini-story 3: A boring but correct practice that saved the day
A platform group ran PostgreSQL with streaming replication and a conservative backup routine: nightly base backups, continuous WAL archiving,
and a monthly restore drill. It was not glamorous work. Nobody got promoted for “successful restore drill #12.”
One afternoon, an engineer ran a data cleanup script against the wrong environment. It wasn’t malicious. It was a muscle-memory mistake:
terminal tab, autocomplete, enter. The script executed quickly and did exactly what it was told.
The team detected the issue within minutes via monitoring: a sudden drop in row counts and a spike in delete activity.
They declared an incident, froze writes, and chose a recovery point just before the script ran. Because WAL archiving was healthy
and restore procedures were rehearsed, they performed point-in-time recovery to a new cluster and redirected traffic.
The result: an unpleasant afternoon, but no permanent data loss. The saving grace was not heroics. It was the habit of practicing restore
and verifying that WAL archives were actually usable. Boring practices are underrated because they don’t feel like engineering—until they are.
Practical tasks: commands, outputs, and decisions
These are the kinds of checks you run when you’re deciding between SQLite and Postgres, or when something is already on fire.
Each task includes a command, an example output, what it means, and what decision you make.
Task 1 (SQLite): Identify journaling mode and durability level
cr0x@server:~$ sqlite3 /var/lib/app/app.db "PRAGMA journal_mode; PRAGMA synchronous;"
wal
2
What it means: WAL mode is enabled; synchronous=2 is FULL. Commits are more durable, typically slower than NORMAL.
Decision: For production on local SSD, WAL + FULL is a safe baseline. If latency is too high, test NORMAL, but document the risk.
Task 2 (SQLite): Check for busy timeouts and immediate lock errors
cr0x@server:~$ sqlite3 /var/lib/app/app.db "PRAGMA busy_timeout;"
5000
What it means: This connection will wait up to 5 seconds for locks before failing. busy_timeout is per-connection, so confirm what your application sets on connect, not just what the shell reports.
Decision: If you see user-visible latency, reduce contention (shorter transactions, fewer writers) instead of raising this indefinitely.
Task 3 (SQLite): Run an integrity check after a crash or storage event
cr0x@server:~$ sqlite3 /var/lib/app/app.db "PRAGMA integrity_check;"
ok
What it means: The database structure is consistent.
Decision: If output is not ok, stop writes, take a copy for forensics, and restore from a known-good backup.
Task 4 (SQLite): Inspect WAL/checkpoint state
cr0x@server:~$ sqlite3 /var/lib/app/app.db "PRAGMA wal_checkpoint(TRUNCATE);"
0|0|0
What it means: Checkpoint succeeded; no frames left in WAL; WAL truncated.
Decision: If you see large values or failures, investigate long-running readers or IO bottlenecks. Consider scheduling checkpoints during low load.
Task 5 (SQLite): Perform a safe online backup
cr0x@server:~$ sqlite3 /var/lib/app/app.db ".backup '/var/backups/app-$(date +%F).db'"
What it means: This uses SQLite’s backup mechanism, producing a consistent snapshot.
Decision: Prefer this over cp for live databases. Then run PRAGMA integrity_check on the backup as part of your pipeline.
Task 6 (Linux): Confirm the database is not on a network filesystem
cr0x@server:~$ df -T /var/lib/app/app.db
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/nvme0n1p2 ext4 205113344 73214512 121345024 38% /
What it means: Local ext4, not NFS/CIFS. Good for SQLite durability assumptions.
Decision: If you see nfs or cifs, reconsider SQLite for shared access or move the file to local disk and replicate at a higher layer.
Task 7 (PostgreSQL): Check if the server is actually up and accepting connections
cr0x@server:~$ pg_isready -h 127.0.0.1 -p 5432
127.0.0.1:5432 - accepting connections
What it means: Postgres is up and responsive at the TCP level.
Decision: If it’s “rejecting” or “no response,” check logs, disk full conditions, and recovery status before blaming the app.
Task 8 (PostgreSQL): Identify connection pressure
cr0x@server:~$ psql -X -qAt -c "SELECT count(*) FROM pg_stat_activity;"
187
What it means: 187 backend sessions exist. Depending on instance size, that might be fine or a problem.
Decision: If high and rising, implement a connection pooler and set sane limits in the application.
Task 9 (PostgreSQL): Find blocking locks fast
cr0x@server:~$ psql -X -qAt -c "SELECT blocked.pid, blocked.query, blocking.pid, blocking.query FROM pg_stat_activity blocked JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid)) WHERE blocked.wait_event_type='Lock';"
4123|UPDATE orders SET status='paid' WHERE id=$1;|3999|ALTER TABLE orders ADD COLUMN note text;
What it means: A migration is blocking application writes.
Decision: Stop/rollback the blocking DDL if possible, or reschedule migrations using lock-friendly patterns (e.g., add column without default, backfill in batches).
Task 10 (PostgreSQL): Measure replication lag (if you have replicas)
cr0x@server:~$ psql -X -qAt -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
replica1|streaming|00:00:00.120|00:00:00.180|00:00:00.450
What it means: Sub-second lag; healthy for many workloads.
Decision: If lag is seconds to minutes, investigate IO saturation, network issues, or long-running transactions on the replica.
Task 11 (PostgreSQL): Check for autovacuum falling behind
cr0x@server:~$ psql -X -qAt -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
events|983421|2025-12-30 08:12:41+00
orders|221904|2025-12-30 08:03:10+00
sessions|110992|2025-12-30 08:15:02+00
What it means: Dead tuples are accumulating; autovacuum is running but may not be keeping up with write churn.
Decision: Tune autovacuum for hot tables, add indexes carefully, and consider partitioning for high-churn event tables.
Task 12 (PostgreSQL): Identify top latency queries (requires pg_stat_statements)
cr0x@server:~$ psql -X -qAt -c "SELECT calls, mean_exec_time::numeric(10,2), left(query,80) FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
142|812.34|SELECT * FROM reports WHERE org_id = $1 ORDER BY created_at DESC LIMI
9211|203.11|UPDATE sessions SET last_seen = now() WHERE id = $1
What it means: One query is consistently slow; another is moderately slow but called a lot.
Decision: Fix the consistently slow query first if it blocks user journeys; fix high-call moderate queries if they dominate CPU/IO.
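If pg_stat_statements isn’t installed yet, enabling it is two steps and a restart (a sketch; adjust to your config layout):
shared_preload_libraries = 'pg_stat_statements'   -- add to postgresql.conf, then restart the server
cr0x@server:~$ psql -X -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
CREATE EXTENSION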
Task 13 (OS): Check disk space (because databases hate surprises)
cr0x@server:~$ df -h /var/lib/postgresql
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 196G 189G 1.9G 99% /
What it means: You are one log file away from a bad day.
Decision: Free space immediately. Then implement alerts at 80/90/95% and plan capacity growth. Disk-full is not a “later” problem.
Task 14 (PostgreSQL): Inspect WAL growth pressure
cr0x@server:~$ psql -X -qAt -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),'0/0')) AS wal_bytes_since_origin;"
1643 GB
What it means: This is a crude indicator: measured from LSN 0/0 it is roughly the total WAL ever generated, not the current size of pg_wal. Better: check the pg_wal directory size and replication slots.
Decision: If WAL directory is large, check for stuck replication slots or failed archiving before changing checkpoint knobs.
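For a more direct look at the actual pg_wal footprint, pg_ls_waldir (PostgreSQL 10+, superuser or pg_monitor) avoids the guesswork; output illustrative:
cr0x@server:~$ psql -X -qAt -c "SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();"
8.4 GB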
Task 15 (PostgreSQL): Check replication slots that might pin WAL
cr0x@server:~$ psql -X -qAt -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
analytics_slot|f|1A/2F000000
What it means: A slot exists and is inactive. It can retain WAL until a consumer catches up or the slot is dropped.
Decision: If the consumer is gone, drop the slot after confirming it’s safe. If it’s needed, fix the consumer and capacity-plan WAL retention.
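If you have confirmed the consumer is gone for good, the cleanup itself is a one-liner; double-check the slot name, because dropping it discards the WAL it was protecting:
cr0x@server:~$ psql -X -qAt -c "SELECT pg_drop_replication_slot('analytics_slot');"
No error means the slot is gone and PostgreSQL can recycle the WAL it was pinning.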
Fast diagnosis playbook
When latency spikes or writes fail, you don’t have time for a philosophical debate. You need a short path to “what’s the bottleneck.”
This playbook is intentionally opinionated.
First: identify which class of failure you’re in
- Hard down: cannot connect / file won’t open / corruption errors.
- Soft down: connections work but requests time out.
- Wrong results: missing data, inconsistent reads, or partial updates.
Second: check the substrate (disk and filesystem) before blaming SQL
- Is disk nearly full?
- Is IO saturated or slow?
- Is the database on a network filesystem (SQLite danger) or a flaky volume?
Third: check contention signals
- SQLite: “database is locked,” long transactions, WAL checkpoint stalls, busy timeouts.
- PostgreSQL: lock waits, connection counts, slow queries, autovacuum lag, replication lag.
Fourth: decide the quickest safe mitigation
- SQLite: reduce writer concurrency, shorten transactions, enable WAL, add busy_timeout (as a band-aid), move to local disk.
- Postgres: kill the blocking query, pause a migration, add an index carefully, scale up IO, enable pooling, or fail over if primary is unhealthy.
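For the Postgres “kill the blocking query” option, prefer cancel over terminate; the pid here is the blocker from Task 9 and is only an example:
cr0x@server:~$ psql -X -qAt -c "SELECT pg_cancel_backend(3999);"
t
pg_cancel_backend stops the statement; pg_terminate_backend kills the whole session. Reach for the second only when the first is ignored.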
Fifth: write down the failure mode in one sentence
If you can’t describe it, you can’t reliably fix it. Examples:
“Writes queue behind a long transaction holding a lock” or “SQLite file on NFS lost lock semantics during a network flap.”
Common mistakes: symptom → root cause → fix
SQLite mistakes
- Symptom: frequent “database is locked” errors under load
  Root cause: too many concurrent writers or long write transactions; default rollback journal mode amplifies contention
  Fix: enable WAL, reduce write concurrency, keep transactions short, add targeted batching, set a sane busy_timeout.
- Symptom: database corruption after reboot/power loss
  Root cause: synchronous=OFF/NORMAL chosen without understanding; storage lies about fsync; or DB placed on an unsuitable filesystem
  Fix: use local disk, set synchronous=FULL for critical data, avoid network filesystems, validate with integrity_check and restore drills.
- Symptom: backup restores but app behaves strangely (missing rows, constraint errors later)
  Root cause: live file copy during active writes; WAL not captured coherently
  Fix: use the SQLite backup API; checkpoint appropriately; test restore and run integrity_check on backups.
- Symptom: periodic latency spikes every few minutes
  Root cause: WAL checkpointing causing IO bursts, often worsened by slow disks or long readers preventing checkpoint completion
  Fix: tune checkpoint strategy, reduce long-running read transactions, consider a manual checkpoint during low load.
PostgreSQL mistakes
- Symptom: high CPU and “too many connections” errors
  Root cause: one-connection-per-request patterns; lack of pooling; max_connections raised until memory suffers
  Fix: use a connection pooler, cap connections, fix application connection reuse, monitor pg_stat_activity.
- Symptom: write latency spikes and fsync-heavy periods
  Root cause: checkpoint bursts, WAL on slow storage, mis-sized shared_buffers/checkpoint settings, or saturated IO
  Fix: put WAL on fast storage, tune checkpoint_timeout and checkpoint_completion_target carefully, measure IO with system tools.
- Symptom: queries slow down over weeks, indexes grow, disk usage balloons
  Root cause: vacuum falling behind; bloat accumulation; long transactions preventing cleanup
  Fix: tune autovacuum per table, eliminate long transactions, consider partitioning, run vacuum/analyze where appropriate.
- Symptom: disk fills with WAL unexpectedly
  Root cause: a replication slot pins WAL; a replica is offline; or WAL archiving is broken and retention grows
  Fix: inspect replication slots, restore consumers, drop unused slots, alert on pg_wal size and archiver errors.
- Symptom: a migration causes widespread timeouts
  Root cause: DDL acquiring locks; long-running transactions block DDL and vice versa
  Fix: use lock-minimizing migration patterns, set lock timeouts, deploy in off-peak windows, verify blocking with pg_blocking_pids.
Checklists / step-by-step plan
Checklist A: When SQLite is the right call (and how to not regret it)
- Confirm workload shape: mostly reads, limited concurrent writes, short transactions.
- Place DB on local disk: avoid NFS/CIFS and “shared volumes” with unclear locking semantics.
- Set WAL mode: use WAL for better read/write concurrency.
- Choose durability intentionally: synchronous=FULL for critical writes; document if you choose NORMAL.
- Implement backups using the backup API: schedule, rotate, and verify restores.
- Add integrity checks: run integrity_check in CI for backup artifacts or after unclean shutdowns.
- Plan your exit: define the threshold at which you migrate to Postgres (writers, remote access, multi-instance).
Checklist B: When PostgreSQL is the right call (and how to keep it boring)
- Right-size connections: don’t equate “more connections” with “more throughput.” Use pooling.
- Put WAL on good storage: low-latency disks matter more than you want them to.
- Enable essential visibility: slow query logging, pg_stat_statements, lock monitoring, replication monitoring.
- Make vacuum a first-class citizen: watch bloat, dead tuples, and long transactions.
- Backups + restore drills: pick logical/physical/PITR based on RPO/RTO, and rehearse restores.
- Schema changes are deployments: practice safe migrations and set lock timeouts.
- Have a failover plan: even if it’s manual, write it down and test it when you’re calm.
Step-by-step: choosing between them in a real project
- Write down failure tolerance: acceptable data loss (RPO) and acceptable downtime (RTO).
- Quantify concurrency: number of concurrent writers and longest transaction time in the critical path.
- Decide deployment topology: single node vs multi-instance; need remote access; need replicas.
- Pick the simplest thing that meets the SLO: SQLite if single-node and low write contention; Postgres otherwise.
- Prototype the worst case: load test writes; inject failures (kill -9, reboot in staging, disk latency simulation).
- Operationalize: backups, alerts, dashboards, and restore tests before you call it “done.”
FAQ
1) Is SQLite “less reliable” than PostgreSQL?
Not inherently. SQLite can be extremely reliable on local storage with sane settings. It’s less forgiving when you add concurrency,
shared filesystems, or sloppy backup practices. PostgreSQL is built for shared access and operational recovery patterns, so it tends to stay reliable as complexity grows.
2) What’s the most common way SQLite fails in production?
Lock contention: too many writers or transactions that are too long. The symptom is “database is locked,” timeouts, or user-facing latency.
The fix is architectural (reduce write concurrency) more than magical PRAGMA tuning.
3) Can I run SQLite on NFS or a shared Kubernetes volume?
You can, but you’re betting your data on filesystem semantics you probably haven’t tested under failure. If you need shared access across nodes,
PostgreSQL is the safer default. If you must use SQLite, keep it on node-local persistent storage and treat it as per-instance state.
4) Does WAL mode make SQLite “multi-writer”?
No. WAL improves read/write concurrency: a writer no longer blocks readers the way the rollback journal does, but writes still serialize. You can reduce pain, not change the core model.
5) What’s the most common way PostgreSQL fails first?
Operational overload: too many connections, IO saturation, or vacuum falling behind. Postgres usually keeps correctness; it just gets slow or stuck behind locks.
6) If PostgreSQL is more powerful, why not always use it?
Because the operator tax is real. You need backups, monitoring, upgrades, and capacity planning. For small single-node apps,
SQLite can deliver excellent reliability with dramatically less operational surface.
7) How do backups differ in the “what fails first” sense?
SQLite’s risk is “we copied the file wrong.” PostgreSQL’s risk is “we configured backups but never tested restore,” or “WAL archiving broke silently.”
Both are human problems; Postgres just gives you more ways to do it right if you commit to the practice.
8) Which one is easier to debug during an incident?
PostgreSQL, usually. You have visibility: pg_stat_activity, locks, query stats, replication views. SQLite debugging often starts in your application logs and OS-level file/IO behavior.
SQLite’s simplicity reduces failure surface, but when it does fail, the cause is often outside “SQL.”
9) What about data corruption risk?
Both are designed to avoid corruption. The bigger risk is environment: unreliable storage, unsafe settings, and incorrect backup procedures.
SQLite is more exposed to filesystem quirks because it’s a single file with locking expectations. PostgreSQL is more exposed to operational misconfiguration and disk-full events.
10) What’s the clean migration path if I start with SQLite?
Plan for it early: keep schemas compatible where possible, avoid SQLite-only quirks, and build a migration tool that can export/import deterministically.
When the time comes, do a dual-write or a controlled cutover with verification, not a Friday-night “it’s just SQL.”
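For simple schemas, the unglamorous CSV route is often enough to start (a sketch; the table name t and paths are hypothetical, it ignores types, constraints, and concurrent writes, and the row count shown is illustrative):
cr0x@server:~$ sqlite3 /var/lib/app/app.db <<'EOF'
.mode csv
.once /tmp/t.csv
SELECT * FROM t;
EOF
cr0x@server:~$ psql -X -c "\copy t FROM '/tmp/t.csv' WITH (FORMAT csv)"
COPY 48211
For anything with foreign keys, real types, or serious volume, use a proper migration tool and verify row counts and checksums on both sides.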
Next steps you can take this week
If you’re choosing today: pick SQLite for single-node apps with modest write concurrency and a strong preference for minimal ops.
Pick PostgreSQL when you need concurrency, remote access, HA, or when you can’t tolerate “it worked until load changed.”
Then do the unsexy work:
- For SQLite: set WAL + intentional synchronous level, move the DB to local storage, and back it up with the backup API. Test restore.
- For PostgreSQL: cap connections and add pooling, watch locks and vacuum health, and implement backups with restore drills. Treat migrations as production changes.
- For both: alert on disk space, validate assumptions about the filesystem, and write down your failure model in plain English.
The winning database is the one whose failures you’ve rehearsed. The losing database is the one you chose because it felt simple, right up until it wasn’t.