PostgreSQL vs SQLite: scaling path—how to move from file DB without downtime


You don’t “outgrow” SQLite on a Tuesday at 2 p.m. You outgrow it at 2:07 a.m. when an innocuous deploy adds one more writer, your latency graphs turn into modern art, and someone asks why “a file” is blocking a fleet of web servers.

This is the pragmatic scaling path: how to move from SQLite (a fantastic embedded database) to PostgreSQL (a fantastic server database) with no downtime—or, more realistically, with downtime that’s so small it can hide between load balancer health checks. We’ll talk about dual writes, change capture, data parity, cutover mechanics, and the failure modes you only learn by doing this in production.

When SQLite is the right answer (and when it isn’t)

SQLite is not a “toy DB.” It’s a serious piece of engineering: a transactional SQL database in a single library that writes to a single file. It’s embedded. It’s portable. It’s dead simple to deploy. That combination is so powerful that it’s easy to misuse.

SQLite wins when

  • Your app is mostly single-process or single-writer and the database lives on local disk.
  • You need low operational overhead: no daemon, no network, no user management, no backups that require consistent snapshots across hosts.
  • You ship software to edge devices, desktops, mobiles, kiosks, or appliances. SQLite is basically the lingua franca there.
  • Your “database” is part of your artifact: a bundled catalog, a static dataset, a cache you can rebuild, or a local queue.

SQLite starts hurting when

  • You introduce multiple concurrent writers (or you thought you didn’t, but your code disagrees).
  • You scale horizontally and now several app instances need to share the same truth.
  • Your filesystem gets weird: network filesystems, container overlay layers, slow disks, noisy neighbors.
  • You need operational features like managed HA, read replicas, fine-grained access controls, online schema changes, and observability built into the server.

SQLite’s locking model is the key: it’s great at serializing one writer and many readers, but “great” is not the same as “infinite.” If your workload evolves into “a lot of small writes from multiple hosts,” PostgreSQL becomes the adult in the room.

One short joke (1/2): SQLite is like a bicycle: perfect until you try to tow a boat with it.

PostgreSQL vs SQLite: decision points that matter in production

Concurrency and locking: the real reason migrations happen

SQLite coordinates access to a single database file. Readers can be concurrent; writers are more constrained. With WAL mode, you get significantly improved concurrency—readers don’t block writers as much—but you still have one primary place where the truth lives: that file.

PostgreSQL is a multi-process server with MVCC designed for concurrent writes from many clients: row-level locks, snapshot visibility, and a write-ahead log that provides durability and drives streaming replication. If you’re building something multi-tenant, multi-worker, or multi-host, PostgreSQL is simply designed for that.

Durability and recovery: “it committed” should mean the same thing every day

SQLite can be extremely durable if configured correctly, but it is also easy to configure it into “fast but maybe” mode (synchronous=NORMAL/OFF, journal tweaks, unsafe fsync behavior in certain environments). PostgreSQL has its own footguns, but in general its defaults are oriented toward server durability with WAL and crash recovery as a first-class feature.

Operational ergonomics

SQLite makes your app the database operator. That’s fine until you’re doing the database operator job badly. PostgreSQL moves that job to a dedicated server, which you still need to run, patch, back up, and observe, but at least the boundaries are clean and the tooling ecosystem is mature.

Query planner and features

PostgreSQL’s planner, indexes, and advanced features (CTEs, window functions, partial indexes, GIN/GiST, JSONB, generated columns, robust constraints) change how you model data and how you keep latency predictable. SQLite has many features too, but the “big system under load” playbook generally assumes a server database.

None of this is moral superiority. It’s architecture fit. SQLite isn’t “worse.” It’s “different,” and the difference starts billing you when your workload changes.

Interesting facts and short history (that actually affects your decisions)

  1. SQLite was created in 2000 by D. Richard Hipp to avoid administrative overhead and to replace ad-hoc flat files in embedded systems.
  2. SQLite is public domain, which is unusual for a database engine and a major reason it’s shipped everywhere without licensing drama.
  3. SQLite aims for “small, fast, reliable” as a library—no separate server process—so your app inherits database responsibilities (file permissions, disk space, locking semantics).
  4. PostgreSQL descends from POSTGRES (1986), a research project that evolved into a production-grade relational system with a strong emphasis on correctness and extensibility.
  5. PostgreSQL’s MVCC model became a cornerstone for high concurrency without reader/writer blocking in typical OLTP workloads.
  6. SQLite’s WAL mode was a major step for concurrency; many teams “discover” WAL only after their first lock storm.
  7. PostgreSQL replication matured over time: physical streaming replication is old and battle-tested; logical replication is newer and changes how you do migrations and rollouts.
  8. SQLite is the most deployed database engine by sheer count of devices/software bundles, even if nobody “runs it” as a server.

The scaling path: from file DB to server DB without downtime

Here’s the migration goal stated plainly: keep serving requests while you move the authoritative dataset from a SQLite file to PostgreSQL, then switch reads/writes to PostgreSQL with minimal risk.

If your application can tolerate a maintenance window, great. Do that. But if you can’t—because you’re a B2B SaaS with customers in every timezone, or because you’re running an appliance fleet that never sleeps—then you need a plan that treats migration like a production release: incremental, observable, reversible.

Step 0: Decide what “no downtime” really means

Zero downtime migrations often mean one of these:

  • Zero visible downtime: users don’t notice because retries and health checks absorb the blip.
  • Zero write downtime: reads might degrade but writes continue (or vice versa).
  • Zero planned downtime: you don’t schedule a window, but you may still have a brief cutover incident if you’re sloppy.

Pick a target. Put a number on it. If your SLO says 99.9% monthly availability, your budget is about 43 minutes. If you’ve been spending that budget on deploys already, “no downtime migration” becomes less of a flex and more of a survival strategy.
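
The arithmetic is worth writing down once, because people argue about it mid-incident. A back-of-the-envelope check, assuming a 30-day month:

# Error-budget math for 99.9% availability over a 30-day month.
minutes_in_month = 30 * 24 * 60                   # 43,200 minutes
budget_minutes = minutes_in_month * (1 - 0.999)   # allowed unavailability
print(round(budget_minutes, 1))                   # 43.2 minutes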

Step 1: Make the SQLite side as sane as possible

Before you migrate, stabilize. A migration amplifies existing mess. If your SQLite database is already corrupting occasionally, or you have a half-dozen ad-hoc connection patterns, you’ll copy the mess into PostgreSQL and then call it “Postgres instability.” That’s not how physics works.

At minimum:

  • Enable WAL mode if you aren’t already (unless your environment makes it unsafe).
  • Standardize connection open/close behavior and busy timeout.
  • Identify every writer code path. There are always more than you think.
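
To make the first two items concrete, here is a minimal connection helper. It’s a sketch only: the path and pragma values mirror the task examples later in this article, and the open_db name is illustrative. The point is that every code path opens the database the same way, so lock behavior stops depending on which module you happen to be in.

import sqlite3

DB_PATH = "/var/lib/myapp/app.db"   # same example path as the tasks below

def open_db() -> sqlite3.Connection:
    # One standard way to open the database everywhere in the codebase.
    conn = sqlite3.connect(DB_PATH, timeout=5.0)   # driver-level wait, roughly busy_timeout in seconds
    conn.execute("PRAGMA journal_mode=WAL;")       # readers stop blocking the single writer as much
    conn.execute("PRAGMA synchronous=FULL;")       # keep durability until you can prove NORMAL is enough
    conn.execute("PRAGMA busy_timeout=5000;")      # wait up to 5s for locks instead of failing instantly
    conn.execute("PRAGMA foreign_keys=ON;")        # SQLite only enforces FKs if you ask
    return conn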

Step 2: Translate schema and data types deliberately

SQLite is dynamically typed. PostgreSQL isn’t. If you’ve been storing timestamps as “whatever worked,” PostgreSQL will force you to choose: timestamptz or timestamp, numeric types, text vs jsonb, constraints that you previously hand-waved.

Choose strictness now. Migrations are when you pay the type debt. Otherwise you’ll pay it later with incident interest.
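
Here’s what paying one slice of that type debt looks like for legacy TEXT timestamps (the kind you’ll see in the schema dump in Task 5). The accepted formats and the to_utc name are illustrative assumptions; the important part is that unparseable values go to a quarantine path instead of being silently coerced.

from datetime import datetime, timezone

# Assumed legacy formats; adjust to what your data actually contains.
LEGACY_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d")

def to_utc(raw):
    raw = (raw or "").strip()
    if not raw:
        return None                      # decide explicitly: NULL, a default, or a rejected row
    for fmt in LEGACY_FORMATS:
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unparseable legacy timestamp: {raw!r}")   # quarantine it, don't guess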

Step 3: Build a backfill (bulk copy) that you can rerun

The first big copy is a backfill: extract from SQLite and load into PostgreSQL. Treat it like any other job in production:

  • Idempotent: safe to run multiple times.
  • Chunked: doesn’t explode memory or transaction logs.
  • Observable: logs counts, durations, and errors.

Rerunnable matters because you’ll discover mismatches. You always discover mismatches.
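
A minimal chunked backfill sketch for the users table from Task 5, assuming psycopg2 as the PostgreSQL driver. Keyset pagination keeps memory flat, one transaction per chunk keeps WAL bounded, and the upsert makes reruns safe:

import sqlite3
import psycopg2                       # assumption: psycopg2 as the Postgres driver
from psycopg2.extras import execute_values

BATCH = 5000

def backfill_users(sqlite_path, pg_dsn):
    src = sqlite3.connect(sqlite_path)
    dst = psycopg2.connect(pg_dsn)
    last_id = ""                                           # keyset cursor; ids are TEXT in this schema
    while True:
        rows = src.execute(
            "SELECT id, email, created_at FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH),
        ).fetchall()
        if not rows:
            break
        # If the target column is timestamptz, convert created_at here with your
        # legacy-timestamp parser instead of relying on implicit casts.
        with dst, dst.cursor() as cur:                     # one transaction per chunk
            execute_values(
                cur,
                "INSERT INTO users (id, email, created_at) VALUES %s "
                "ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email, "
                "created_at = EXCLUDED.created_at",        # upsert: safe to rerun
                rows,
            )
        last_id = rows[-1][0]
        print(f"users: copied through id={last_id} ({len(rows)} rows)")   # observable: log progress
    src.close()
    dst.close()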

Step 4: Keep PostgreSQL in sync: dual writes or change capture

You have two practical patterns:

Pattern A: Dual writes (application writes to both)

Pros: simple conceptually, no fancy tooling. Cons: tricky failure handling; you can create split-brain at the application layer if you don’t design it carefully.

Rules for dual writes that keep you out of trouble:

  • Pick one source of truth during the dual-write phase. Usually SQLite remains authoritative until cutover.
  • Make writes idempotent. Use stable primary keys and retry safely.
  • Log and reconcile failures. “Fire and forget” dual writes are how you mint silent data loss.
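
A minimal dual-write wrapper under those rules, assuming the users schema from Task 5, psycopg2 on the PostgreSQL side, and a hypothetical record_failed_write() reconciliation hook:

import logging

log = logging.getLogger("dualwrite")

def upsert_user(sqlite_conn, pg_conn, user):             # user: dict keyed by a stable primary key "id"
    # SQLite stays authoritative: if this write fails, the request fails.
    sqlite_conn.execute(
        "INSERT INTO users (id, email, created_at) VALUES (:id, :email, :created_at) "
        "ON CONFLICT(id) DO UPDATE SET email=excluded.email, created_at=excluded.created_at",
        user,
    )
    sqlite_conn.commit()

    # PostgreSQL is best-effort during the dual-write phase: failures are logged
    # and reconciled later, never silently dropped.
    try:
        with pg_conn, pg_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (id, email, created_at) "
                "VALUES (%(id)s, %(email)s, %(created_at)s) "
                "ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email, "
                "created_at = EXCLUDED.created_at",
                user,
            )
    except Exception:
        log.exception("pg dual-write failed for users id=%s", user["id"])
        record_failed_write("users", user["id"])          # hypothetical hook feeding reconciliation

Note the asymmetry: a SQLite failure fails the request, a PostgreSQL failure only creates reconciliation work. That is what “SQLite remains authoritative” means in code.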

Pattern B: Change data capture (CDC) from SQLite

SQLite doesn’t have built-in logical replication like PostgreSQL. But you can approximate CDC by:

  • Adding triggers that write to an append-only change table.
  • Using a write-ahead log reader approach (harder, more fragile, and environment-dependent).
  • Writing changes to an outbox table in SQLite that a worker ships to PostgreSQL.

Triggers + an outbox is often the least-bad way. It’s not glamorous. It works.
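
Here’s the trigger-plus-outbox shape as a sketch for the users table, assuming a reasonably modern SQLite build with the JSON1 functions available; apply_to_postgres() is a hypothetical idempotent writer on the PostgreSQL side:

import json
import sqlite3

OUTBOX_DDL = """
CREATE TABLE IF NOT EXISTS outbox (
  seq        INTEGER PRIMARY KEY AUTOINCREMENT,   -- strictly increasing ship order
  tbl        TEXT NOT NULL,
  op         TEXT NOT NULL,                       -- 'I', 'U', 'D'
  pk         TEXT NOT NULL,
  payload    TEXT,                                -- JSON snapshot of the row
  shipped_at TEXT                                 -- NULL until delivered to PostgreSQL
);
CREATE TRIGGER IF NOT EXISTS users_outbox_ins AFTER INSERT ON users BEGIN
  INSERT INTO outbox (tbl, op, pk, payload)
  VALUES ('users', 'I', NEW.id,
          json_object('id', NEW.id, 'email', NEW.email, 'created_at', NEW.created_at));
END;
CREATE TRIGGER IF NOT EXISTS users_outbox_upd AFTER UPDATE ON users BEGIN
  INSERT INTO outbox (tbl, op, pk, payload)
  VALUES ('users', 'U', NEW.id,
          json_object('id', NEW.id, 'email', NEW.email, 'created_at', NEW.created_at));
END;
CREATE TRIGGER IF NOT EXISTS users_outbox_del AFTER DELETE ON users BEGIN
  INSERT INTO outbox (tbl, op, pk, payload) VALUES ('users', 'D', OLD.id, NULL);
END;
"""

def install_outbox(conn):
    conn.executescript(OUTBOX_DDL)

def ship_once(sqlite_conn, apply_to_postgres, batch=500):
    # apply_to_postgres: hypothetical idempotent writer; this loop is at-least-once delivery.
    rows = sqlite_conn.execute(
        "SELECT seq, tbl, op, pk, payload FROM outbox "
        "WHERE shipped_at IS NULL ORDER BY seq LIMIT ?", (batch,)
    ).fetchall()
    for seq, tbl, op, pk, payload in rows:
        apply_to_postgres(tbl, op, pk, json.loads(payload) if payload else None)
        sqlite_conn.execute("UPDATE outbox SET shipped_at = datetime('now') WHERE seq = ?", (seq,))
        sqlite_conn.commit()
    return len(rows)                                 # expose as a metric: changes shipped per tick

Mark a row as shipped only after the PostgreSQL write succeeds; that gives you at-least-once delivery, which is exactly why the apply side has to be idempotent.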

Step 5: Read shadowing and parity checks

Before you switch reads, shadow them: read from SQLite as usual, also read from PostgreSQL in the background, compare results, and emit metrics. This catches type issues, collation differences, and query semantic differences.

Don’t compare entire payloads blindly. Compare what matters: primary keys returned, counts, aggregates, specific fields with deterministic ordering. PostgreSQL and SQLite can disagree on ordering unless you specify it.
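
A minimal comparison helper for shadow reads. The shadow_compare name and the assumed row shape (primary key first, then the compared fields, already normalized) are illustrative; the real decisions are which fields you compare and the deterministic sort key:

import hashlib
import logging

log = logging.getLogger("shadow")

def _digest(rows):
    # Sort by primary key, hash only the compared fields: deterministic on both engines.
    # Normalize field values (timestamps, case, whitespace) before they get here.
    canon = "|".join(",".join(str(v) for v in row) for row in sorted(rows))
    return hashlib.sha256(canon.encode()).hexdigest()

def shadow_compare(query_name, sqlite_rows, pg_rows):
    # Serve sqlite_rows to the user; call this in the background and count mismatches as a metric.
    if _digest(sqlite_rows) != _digest(pg_rows):
        log.warning("shadow mismatch query=%s sqlite=%d rows pg=%d rows",
                    query_name, len(sqlite_rows), len(pg_rows))
        return False
    return True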

Step 6: Cut over reads first, then writes (usually)

A common low-risk sequence:

  1. Backfill to PostgreSQL.
  2. Dual-write with SQLite as source of truth.
  3. Shadow-read and compare.
  4. Switch reads to PostgreSQL (keep dual writes).
  5. Switch writes to PostgreSQL (SQLite becomes fallback / audit, temporarily).
  6. Remove dual-write after confidence and a set bake time.

Write cutover is the moment that can hurt. Read cutover is where you learn. Stage it.
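
One way to stage it is a small routing layer driven by flags you can flip per instance, so a canary can move ahead of the fleet. The environment variable names here are hypothetical:

import os

# Hypothetical cutover flags, set per stage (and per canary instance):
#   READS_FROM = sqlite | postgres
#   WRITES_TO  = sqlite | dual | postgres
READS_FROM = os.environ.get("READS_FROM", "sqlite")
WRITES_TO = os.environ.get("WRITES_TO", "dual")

def read_conn(sqlite_conn, pg_conn):
    return pg_conn if READS_FROM == "postgres" else sqlite_conn

def write_targets(sqlite_conn, pg_conn):
    if WRITES_TO == "sqlite":
        return [sqlite_conn]
    if WRITES_TO == "postgres":
        return [pg_conn]
    return [sqlite_conn, pg_conn]    # dual-write phase; SQLite first, as the source of truth

Set READS_FROM=postgres on one instance first; if it misbehaves, you roll back one process, not the fleet.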

Step 7: Keep a rollback plan that isn’t fantasy

Rollback plans fail because they assume perfect symmetry. After you switch writes to PostgreSQL, rolling back to SQLite means you have to replay the writes back. If you didn’t build that pipeline, rollback is a slide deck, not a capability.

A realistic plan is: after write cutover, keep the dual-write or outbox in place long enough that rollback remains feasible. Define a time horizon (hours/days). Measure how hard it is to replay. Practice once in staging with a real-ish dataset.

One quote (paraphrased idea): Werner Vogels (Amazon CTO) has often pushed the idea that “everything fails, all the time,” so you design systems assuming failure, not hoping it won’t happen.

Practical tasks: commands, outputs, and what decision you make

These are real tasks you can run while planning, executing, or troubleshooting the migration. Each includes (1) a command, (2) sample output, and (3) the decision you make from it.

Task 1: Identify SQLite journal mode and busy timeout behavior

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA journal_mode; PRAGMA synchronous; PRAGMA busy_timeout;"
wal
2
5000

What it means: WAL mode is enabled; synchronous=2 (FULL); busy_timeout=5000ms.

Decision: WAL is good for concurrency; FULL is safer but slower. If you’re seeing lock contention, increase busy_timeout and fix writer frequency. Don’t flip synchronous=OFF as a “performance fix” unless you enjoy explaining data loss.

Task 2: Detect SQLite lock contention quickly

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA compile_options;" | grep -i threads
THREADSAFE=1

What it means: SQLite was compiled threadsafe, so your lock issues are not “because sqlite is single-threaded.” They’re workload and access-pattern driven.

Decision: Focus on transaction scope and writer count, not folklore. Instrument where writes happen.

Task 3: Get table sizes in SQLite (to estimate backfill time)

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "SELECT name, SUM(pgsize) AS bytes FROM dbstat GROUP BY name ORDER BY bytes DESC LIMIT 5;"
events|2147483648
users|402653184
sessions|268435456
sqlite_schema|4096

What it means: events is ~2 GiB; backfill strategy needs chunking and maybe partitioning in PostgreSQL.

Decision: If you backfill this in one transaction, you’ll blow up WAL and maybe disk. Plan chunked copy by primary key or time windows.

Task 4: Verify SQLite integrity before copying

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "PRAGMA integrity_check;"
ok

What it means: Basic corruption check passes.

Decision: If it’s not “ok,” stop. Fix corruption first (restore from backup, rebuild from source events). Migrating corruption just moves the blame.

Task 5: Dump SQLite schema (don’t trust what you remember)

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db ".schema users"
CREATE TABLE users(
  id TEXT PRIMARY KEY,
  email TEXT,
  created_at TEXT
);
CREATE INDEX idx_users_email ON users(email);

What it means: Timestamps are stored as TEXT. That will become an argument later.

Decision: Decide how to map created_at to timestamptz and how to parse legacy values. Build a conversion function and tests.

Task 6: Create PostgreSQL roles and a dedicated database

cr0x@server:~$ sudo -u postgres psql -c "CREATE ROLE myapp LOGIN PASSWORD 'REDACTED' NOSUPERUSER NOCREATEDB NOCREATEROLE;"
CREATE ROLE

What it means: You now have a least-privileged app role.

Decision: Don’t run your app as postgres. If you do, you will eventually do something exciting and irreversible.

Task 7: Check PostgreSQL is actually durable (fsync on, full_page_writes on)

cr0x@server:~$ sudo -u postgres psql -tAc "SHOW fsync; SHOW synchronous_commit; SHOW full_page_writes;"
on
on
on

What it means: You’re not in “YOLO durability” mode.

Decision: Keep these on for production. If performance is bad, fix schema and I/O, not correctness.

Task 8: Estimate PostgreSQL capacity and bloat risk (baseline DB size)

cr0x@server:~$ sudo -u postgres psql -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) FROM pg_database ORDER BY pg_database_size(pg_database.datname) DESC;"
  datname  | pg_size_pretty
-----------+----------------
 postgres  | 8456 kB
 template1 | 8280 kB
 template0 | 8208 kB
(3 rows)

What it means: Baseline is small; you have room. You’re about to make it not small.

Decision: Confirm disk headroom for: data + indexes + WAL + overhead + backups. If you can’t articulate that, you’re not ready to cut over.

Task 9: Backfill using a safe, chunked pipeline (example: CSV with COPY)

cr0x@server:~$ sqlite3 -header -csv /var/lib/myapp/app.db "SELECT id,email,created_at FROM users;" > /tmp/users.csv
cr0x@server:~$ sudo -u postgres psql myapp -c "\copy users(id,email,created_at) FROM '/tmp/users.csv' WITH (FORMAT csv, HEADER true);"
COPY 120543

What it means: 120,543 rows loaded.

Decision: If COPY is slow, check indexes and constraints. During backfill, load into a staging table without heavy indexes, then build indexes after.

Task 10: Verify row counts match (coarse parity check)

cr0x@server:~$ sqlite3 /var/lib/myapp/app.db "SELECT COUNT(*) FROM users;"
120543
cr0x@server:~$ sudo -u postgres psql myapp -tAc "SELECT COUNT(*) FROM users;"
120543

What it means: Counts match for this table.

Decision: Counts are necessary but not sufficient. If counts match but data differs, you still lose.

Task 11: Verify key coverage and missing IDs (spot silent loss)

cr0x@server:~$ sqlite3 -csv /var/lib/myapp/app.db "SELECT id FROM users ORDER BY id LIMIT 5;"
001a
00b9
00c1
00d0
00f2
cr0x@server:~$ sudo -u postgres psql myapp -tAc "SELECT id FROM users ORDER BY id LIMIT 5;"
001a
00b9
00c1
00d0
00f2

What it means: Low IDs exist; ordering seems consistent for this sample.

Decision: Also check random samples and max ranges; bugs hide in the tail.

Task 12: Check PostgreSQL slow queries during shadow reads

cr0x@server:~$ sudo -u postgres psql myapp -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
                query                 | calls | mean_exec_time
--------------------------------------+-------+---------------
 SELECT * FROM events WHERE user_id=$1 |  9321 |        187.42
 SELECT * FROM sessions WHERE id=$1    | 21144 |         12.11
(2 rows)

What it means: Your hottest query is slow in PostgreSQL; likely missing index or poor query shape.

Decision: Add an index (events(user_id, created_at) maybe), or change query to select specific columns. Don’t cut over reads until this is under control.

Task 13: Confirm index usage with EXPLAIN (avoid guessing)

cr0x@server:~$ sudo -u postgres psql myapp -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM events WHERE user_id='00f2' ORDER BY created_at DESC LIMIT 50;"
 Limit  (cost=0.42..12.55 rows=50 width=128) (actual time=0.311..0.829 rows=50 loops=1)
   Buffers: shared hit=210
   ->  Index Scan Backward using idx_events_user_created_at on events  (cost=0.42..812.11 rows=3400 width=128) (actual time=0.309..0.814 rows=50 loops=1)
         Index Cond: (user_id = '00f2'::text)
 Planning Time: 0.220 ms
 Execution Time: 0.901 ms

What it means: Index scan; sub-millisecond execution. That’s what you want.

Decision: If you see sequential scans on large tables, don’t argue with the planner—fix the schema/indexes or the query.

Task 14: Watch connection saturation (a classic Postgres cutover faceplant)

cr0x@server:~$ sudo -u postgres psql -tAc "SHOW max_connections; SELECT count(*) FROM pg_stat_activity;"
200
187

What it means: You’re close to max connections already.

Decision: Add connection pooling (PgBouncer), reduce app pool sizes, or scale the DB. If you cut over now, you’ll get connection storms and self-inflicted outages.

Task 15: Monitor WAL growth during backfill (don’t fill disk silently)

cr0x@server:~$ sudo -u postgres psql -tAc "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_bytes_since_start;"
42 GB

What it means: pg_wal_lsn_diff against '0/0' reports total WAL written since the cluster was initialized (example output), so sample it before and after the backfill to see what the load itself generates. During bulk loads, WAL can balloon.

Decision: Ensure disk headroom; consider batching and checkpoints. If you’re replicating, ensure replicas can keep up or you’ll retain WAL and fill disk.

Task 16: Confirm replication lag (if using replicas for safety)

cr0x@server:~$ sudo -u postgres psql -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
 application_name |   state   | write_lag | flush_lag | replay_lag
------------------+-----------+-----------+-----------+------------
 standby1          | streaming | 00:00:01  | 00:00:02  | 00:00:03
(1 row)

What it means: Replication is healthy with low lag.

Decision: Don’t cut over writes if replicas are minutes behind and your failover plan assumes they’re current.

Task 17: Validate constraints that SQLite didn’t enforce the way you assumed

cr0x@server:~$ sudo -u postgres psql myapp -c "SELECT COUNT(*) FROM users WHERE email IS NULL;"
14

What it means: You have null emails. If you planned email NOT NULL, your migration will fail or your app will.

Decision: Decide: fix data (backfill defaults), change constraint, or add a staged constraint later. Don’t “temporarily” ignore it forever.

Task 18: Cutover switch check: app points to Postgres and health checks pass

cr0x@server:~$ systemctl restart myapp
cr0x@server:~$ journalctl -u myapp -n 20 --no-pager
Dec 30 12:11:02 server myapp[19422]: db: connected to postgres host=pg1 dbname=myapp
Dec 30 12:11:02 server myapp[19422]: migrations: none pending
Dec 30 12:11:03 server myapp[19422]: http: listening on :8080

What it means: The app is using PostgreSQL and started cleanly.

Decision: Proceed with a canary release first. If only one instance is on Postgres, you can bail fast.

One short joke (2/2): Nothing is more permanent than a “temporary dual-write” that nobody dares to remove.

Fast diagnosis playbook: find the bottleneck in minutes

This is the “pager is buzzing, Slack is on fire” path. You’re mid-migration or post-cutover and something is slow or stuck. Check these in order.

First: Is it lock contention or connection exhaustion?

  • SQLite symptoms: “database is locked” errors, elevated write latency, threads stuck on commit.
  • Postgres symptoms: connection timeouts, too many clients, app pool waiting.
cr0x@server:~$ sudo -u postgres psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
 state  | count
--------+-------
 active | 42
 idle   | 131
(2 rows)

Decision: If idle dominates (as here), your app connection pools are over-provisioned; shrink them or add a central pooler. If active is pegged and queries are slow, you have query/index issues.

Second: Is the database I/O bound?

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rkB/s   wkB/s  await  %util
nvme0n1         120.0   340.0  8200.0 42000.0   9.4  92.1

Decision: If %util is near 100% and await is high, you’re I/O bound. Stop adding indexes mid-flight. Reduce write rate, batch, or scale storage/instance.

Third: Are slow queries the culprit (and are they new)?

cr0x@server:~$ sudo -u postgres psql myapp -c "SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 3;"
                  query                   | calls | total_exec_time
------------------------------------------+-------+----------------
 SELECT * FROM events WHERE user_id=$1     |  9233 |       1730042.0
 UPDATE users SET last_seen_at=$1 WHERE id=$2 | 60321 |        402112.3
(2 rows)

Decision: If one query dominates total time, fix that first. Add the missing index, reduce returned columns, or cache. Don’t “tune Postgres” generically.

Fourth: Is autovacuum keeping up (Postgres-specific slow burn)?

cr0x@server:~$ sudo -u postgres psql myapp -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
 relname | n_dead_tup |        last_autovacuum
---------+------------+-------------------------------
 events  |   18200421 | 2025-12-30 11:40:22.12345+00
(1 row)

Decision: If dead tuples are huge and autovacuum is stale, your update-heavy table will degrade. Tune autovacuum for that table, or redesign updates.

Fifth: Is replication lag breaking your safety assumptions?

cr0x@server:~$ sudo -u postgres psql -c "SELECT application_name, replay_lag FROM pg_stat_replication;"
 application_name | replay_lag
------------------+------------
 standby1          | 00:05:12
(1 row)

Decision: If lag is minutes, failover will lose data relative to primary unless you use synchronous replication (which has its own tradeoffs). Adjust your cutover and durability assumptions.

Three corporate mini-stories (pain, regret, and one quiet win)

Mini-story 1: The incident caused by a wrong assumption

They were a mid-size SaaS with a “simple” job scheduler. The scheduler stored job state in SQLite because it started life as a single VM utility. Over time, the scheduler became a service: multiple workers, autoscaling, and a load balancer. The SQLite file moved to a shared network filesystem because “all workers need to see it.”

The wrong assumption was: a shared filesystem makes a file database a shared database. On a calm day, it worked. On a busy day, it behaved like a traffic circle with one lane and a thousand cars. Writers piled up behind locks. Readers retried. Latency rose. The app started timing out, which retried jobs, which increased writes, which increased locking. A feedback loop that looked like “sudden growth.”

When they finally looked at it, the smoking gun was in plain sight: errors that literally said the database was locked. But because the system was distributed, each node saw a tiny slice of the pain. Nobody had the whole picture until they aggregated logs. By then, the customer-visible damage was done.

The fix wasn’t heroic. They stopped pretending the shared filesystem was a database. They moved the scheduler state to PostgreSQL, put PgBouncer in front, and used a single-writer pattern for a few critical tables. The funniest part: once they shipped the fix, they found a bunch of “optimizations” (sleep loops, backoff hacks) that were only there to cope with SQLite locks. Those hacks then became latency bugs in PostgreSQL because they delayed legitimate work.

Mini-story 2: The optimization that backfired

A retail company had a local SQLite cache on each app host. They wanted to migrate the authoritative store to PostgreSQL, but they didn’t want to “waste time” building correct dual-write semantics. So they did what people do under deadline pressure: they made SQLite writes asynchronous.

Specifically, they queued writes in memory and flushed them in batches every few seconds. It worked in benchmarks. Production, of course, did something else. Under load, batches grew. Memory increased. Then a deploy restarted the service and the in-memory queue disappeared. Users saw stale carts and missing updates. The team’s postmortem headline was basically: “We invented data loss to save 15% CPU.”

When they tried to patch it, they added a disk-backed queue. Better, but still subtle: the queue was on the same disk as SQLite, and flush spikes caused lock contention again. They had reduced syscall overhead and increased worst-case latency. It was a net loss: the system became harder to reason about and still didn’t scale.

Eventually, they did the boring thing they should have done first: an outbox table in SQLite with triggers, a reliable shipper to PostgreSQL, and a parity dashboard. It wasn’t “fast,” but it was deterministic. Their incident rate dropped mostly because they stopped surprising themselves.

Mini-story 3: The boring but correct practice that saved the day

A B2B company planned a no-downtime cutover from SQLite to PostgreSQL for a configuration service. The dataset was small, but the blast radius was enormous: every request in their platform read config. They treated the migration like a reliability exercise, not a data exercise.

They built read shadowing early. Every config read still came from SQLite, but a background goroutine fetched the same key from PostgreSQL and compared a hash. Mismatches were counted and tagged by key prefix. This sounded paranoid. It was. And it was correct.

Two days before cutover they saw a mismatch spike on keys containing Unicode. SQLite and PostgreSQL collation and normalization differences were biting them: string comparisons and ordering weren’t identical, and one code path depended on “first match wins” behavior without explicit ordering. Without shadow reads, they would have cut over and slowly corrupted config reads in ways that look like “random outages.”

They fixed the query to be explicit, added a deterministic order, and wrote a migration-time assertion that rejected ambiguous keys. Cutover day was boring. Nobody noticed. That’s the highest compliment you can pay to operations work.

Common mistakes: symptom → root cause → fix

These are the patterns that show up repeatedly in SQLite-to-Postgres migrations. If you recognize yourself here, good. That means you can stop doing it.

1) “database is locked” storms after adding just one feature

  • Symptom: Spiky latency, intermittent write failures, lots of retries.
  • Root cause: Multiple writers increased transaction overlap; long transactions or unbounded transaction scope.
  • Fix: Shorten transactions; enable WAL; add busy_timeout; consolidate writes; or accept reality and move writes to PostgreSQL.

2) Postgres is “slow” right after cutover

  • Symptom: P95 latency jumps; CPU is fine; I/O is moderate; queries are slower than expected.
  • Root cause: Missing indexes due to schema translation gaps; queries used SQLite quirks (implicit casts, loose typing).
  • Fix: Use pg_stat_statements and EXPLAIN (ANALYZE); add the correct indexes; enforce types; rewrite queries to be explicit.

3) Unique constraints suddenly fail during backfill

  • Symptom: COPY fails with duplicate key errors.
  • Root cause: SQLite allowed duplicates because the constraint didn’t exist, or because values differed only by collation/case/whitespace.
  • Fix: Audit duplicates in SQLite before migration; decide canonicalization rules; implement in application and in migration scripts.
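
The duplicate audit itself is cheap; the hard part is choosing the canonicalization rule. A minimal sketch, assuming email should be unique case-insensitively (adjust the normalization to your actual rule):

import sqlite3

conn = sqlite3.connect("/var/lib/myapp/app.db")   # example path from the tasks above
dupes = conn.execute("""
    SELECT lower(trim(email)) AS canon, COUNT(*) AS n
    FROM users
    WHERE email IS NOT NULL
    GROUP BY canon
    HAVING n > 1
    ORDER BY n DESC
""").fetchall()
for canon, n in dupes:
    print(f"{canon}: {n} rows")    # decide merge/keep rules before the backfill, not during it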

4) Timestamps become nonsense (off by hours or invalid)

  • Symptom: Data looks shifted; queries by time window miss records.
  • Root cause: SQLite stored timestamps as TEXT without timezone; Postgres parses into timestamp or timestamptz with different assumptions.
  • Fix: Choose timestamptz unless you have a reason not to; parse legacy strings explicitly; set connection timezones intentionally.

5) Migration “worked” but data is subtly missing

  • Symptom: Counts match; users report missing items; audits show gaps.
  • Root cause: Backfill ran once, but new writes during backfill were never copied; dual-write didn’t handle failures; no parity checks.
  • Fix: Implement change capture (outbox/trigger) and verify lag; run parity checks; block cutover until drift is zero (or explained).

6) Postgres runs out of connections during traffic bursts

  • Symptom: “too many clients,” timeouts, cascading failures.
  • Root cause: Each app instance kept a large pool; plus background jobs; plus admin scripts. PostgreSQL uses a process per connection, so connections are expensive to pile up.
  • Fix: Use PgBouncer; right-size pools; reduce idle connections; set statement timeouts and circuit breakers.

7) Autovacuum doesn’t keep up after cutover

  • Symptom: Performance degrades over days; table bloat; indexes grow; VACUUM runs forever.
  • Root cause: Update-heavy workload (common for “status” fields) and autovacuum settings not tuned per table.
  • Fix: Tune autovacuum thresholds for hot tables; avoid pointless updates (only update on change); consider append-only modeling for event-like data.

Checklists / step-by-step plan (boring, therefore effective)

Phase A: Pre-migration engineering checklist

  • Inventory all SQLite databases, files, and write paths. If you can’t list them, you can’t migrate them.
  • Standardize SQLite settings: journal_mode, synchronous, busy_timeout.
  • Decide target PostgreSQL version and deployment model (managed vs self-hosted).
  • Define data type mapping rules (TEXT timestamps, booleans, JSON blobs, numeric precision).
  • Define a “source of truth” policy for the dual-write phase.
  • Build parity checks: counts, sums, hashes, and spot checks for high-value entities.

Phase B: Build the migration machinery

  1. Create PostgreSQL schema with explicit types and constraints staged sensibly (start permissive, tighten later).
  2. Backfill into staging tables, then swap/rename once loaded and indexed.
  3. Implement change capture (dual writes or SQLite outbox with a shipper).
  4. Implement shadow reads for a representative slice of queries.
  5. Operationalize: metrics for lag, mismatch counts, write failures, and query latency.

Phase C: Cutover plan you can execute under stress

  1. Canary: route a small percentage of reads to PostgreSQL; keep SQLite as authority.
  2. Expand read cutover gradually; watch error budgets and slow query dashboards.
  3. Write cutover: switch the write path to PostgreSQL while still capturing changes for rollback (temporarily).
  4. Freeze and validate: short window where you ensure drift is zero and constraints are satisfied.
  5. Remove dual write only after a bake period and a successful “replay to SQLite” drill (if rollback is required by policy).

Phase D: Post-cutover hardening

  • Turn on statement timeouts for the app role.
  • Add monitoring for replication lag, disk usage, and autovacuum health.
  • Run a constraint-tightening campaign: NOT NULLs, CHECKs, FK constraints where appropriate.
  • Document operational runbooks: restore, failover, index rebuild, and emergency traffic shedding.

FAQ

1) Can SQLite handle high traffic if I enable WAL?

WAL helps a lot, especially for read-heavy workloads with occasional writes. It doesn’t turn SQLite into a multi-writer, multi-host database server. If your scaling problem is “many writers across many instances,” WAL is a bandage, not a cure.

2) What’s the safest “no downtime” migration strategy?

Backfill → change capture (outbox or dual writes) → shadow reads → cut over reads → cut over writes. The safest version includes parity dashboards and a rollback window where you can replay writes backward if needed.

3) Should I dual-write from the application or use triggers?

If you can modify the app safely and handle retries/idempotency, dual-write in the app is straightforward. If you have multiple apps writing, or you want a centralized mechanism, SQLite triggers into an outbox table can be cleaner. Either way, you need reconciliation and observability.

4) Why not just stop the world and do one big dump/restore?

If you can afford it, do it. The “no downtime” path has more moving parts and more ways to create subtle drift. The only reason to take it is that you must keep serving while migrating.

5) How do I deal with SQLite’s loose typing when moving to PostgreSQL?

Decide types explicitly and write conversion code. Expect garbage: numeric strings, empty timestamps, mixed formats. Build a quarantine path for bad rows rather than failing the whole migration.

6) What about SQLite features that don’t map cleanly to Postgres?

The sharp edges are usually around type affinity, implicit conversions, conflict clauses, and date/time functions. Audit queries, add explicit casts in Postgres, and don’t rely on unspecified ordering.

7) Do I need a connection pool for PostgreSQL?

In most production environments: yes. Particularly if you have many app instances. PgBouncer is common because Postgres connection overhead is real and “just raise max_connections” tends to end badly.

8) How do I prove data parity beyond row counts?

Use multiple checks: per-table counts, per-partition counts (by day/customer), checksums/hashes of key fields, and random sample verification. Also run shadow reads on real application queries and compare outputs.
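
A sketch of a per-day parity check, assuming ISO-style TEXT timestamps on the SQLite side, timestamptz rendered in UTC on the PostgreSQL side, and psycopg2 as the driver; the queries and table are illustrative:

import hashlib
import sqlite3
import psycopg2                     # assumption: psycopg2 as the Postgres driver

SQLITE_Q = ("SELECT substr(created_at, 1, 10) AS day, COUNT(*), group_concat(id) "
            "FROM users GROUP BY day")
PG_Q = ("SELECT to_char(created_at AT TIME ZONE 'UTC', 'YYYY-MM-DD') AS day, COUNT(*), "
        "string_agg(id, ',' ORDER BY id) FROM users GROUP BY day")

def digest(ids_csv):
    # Order-independent digest of the ids in one partition.
    return hashlib.sha256(",".join(sorted((ids_csv or "").split(","))).encode()).hexdigest()

def partition_parity(sqlite_path, pg_dsn):
    lite = {day: (n, digest(ids))
            for day, n, ids in sqlite3.connect(sqlite_path).execute(SQLITE_Q)}
    with psycopg2.connect(pg_dsn) as pg, pg.cursor() as cur:
        cur.execute(PG_Q)
        drift = [day for day, n, ids in cur if lite.get(day) != (n, digest(ids))]
    return drift                     # an empty list is the only acceptable cutover answer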

9) Can I keep SQLite as a local cache after migrating to Postgres?

Yes, but treat it as a cache: rebuildable, expirable, and never authoritative unless you’re explicitly designing for offline mode. Otherwise you reintroduce split-brain under a nicer name.

10) What’s the biggest migration risk people underestimate?

Not the backfill. The ongoing sync. Drift is the silent killer: it looks fine until the first audit, billing run, or security incident. Build drift detection from day one.

Conclusion: next steps you can execute this week

If you’re on SQLite and feeling pain, don’t panic-migrate. Stabilize first, then migrate like an operator: measurable, reversible, and boring.

  1. Inventory writes: find every code path that mutates SQLite, including “background maintenance” jobs.
  2. Pick your sync strategy: dual writes with idempotency, or an outbox with triggers and a shipper.
  3. Build backfill + parity checks that you can rerun without drama.
  4. Stand up PostgreSQL with guardrails: pooling, monitoring, durable settings, and capacity headroom.
  5. Shadow reads before you cut over. If you skip this, you’re gambling with unknown unknowns.
  6. Cut over reads, then writes, and keep rollback feasible for a defined window.

The win isn’t just “we’re on Postgres now.” The win is you stop treating your database as a file you hope behaves, and start treating it as a system you can observe, reason about, and recover under pressure.
