MariaDB vs PostgreSQL Migration: Move Without Downtime and Without Surprises


Most database migrations don’t fail because engineers can’t copy bytes. They fail because production systems are full of assumptions pretending to be facts: collation quirks, implicit casts, “helpful” defaults, and app queries that only work because the old engine politely tolerated them.

If you’re moving from MariaDB to PostgreSQL and you want zero downtime, you’re signing up for a distributed systems project wearing a database hat. This is the field guide: what to measure, what to change, what not to touch on Friday, and how to cut over without the classic “why is everything slow now?” moment.

Pick the right migration shape (and why “just dump/restore” lies)

You have three broad ways to move from MariaDB to PostgreSQL:

  1. Big-bang (dump/restore): stop writes, take a snapshot, restore into Postgres, flip the app. It’s clean, fast to reason about, and it requires downtime. Even if you call it “maintenance mode,” your users call it “broken.”
  2. Shadow read: migrate data and gradually send read traffic to PostgreSQL while writes still go to MariaDB. Useful for performance validation. Not enough for a full cutover unless you add a write strategy.
  3. Zero-downtime (backfill + CDC + controlled cutover): keep MariaDB as source of truth while you backfill historical data into PostgreSQL, then apply ongoing changes with change data capture (CDC) or dual writes. Cut over reads and writes in a planned, reversible way.

For “no downtime and no surprises,” you want #3. It’s more work up front, but it trades panic for process. You end up with:

  • A repeatable pipeline you can run in staging and prod.
  • Quantified lag (seconds behind, not vibes behind).
  • A rollback plan that doesn’t involve prayer.

One operational truth: if you can’t measure replication lag and data divergence, you don’t have a migration. You have a hope.

Facts and historical context that actually matter

  • MariaDB exists because of trust fallout. It was created by MySQL’s original developers after Oracle acquired Sun Microsystems (and, with it, MySQL), with a pledge to stay open and community-driven.
  • PostgreSQL’s lineage is academic in the best way. It evolved from the POSTGRES project at UC Berkeley, which is why it’s so stubborn about correctness.
  • InnoDB became the default MySQL/MariaDB engine for a reason. MyISAM’s table-level locking and crash recovery limitations taught everyone the hard way.
  • PostgreSQL MVCC is “always on.” It shapes vacuuming, bloat, and index behavior; it’s not a toggle you can ignore.
  • MariaDB replication grew up around the binlog, with statement-based and row-based formats. CDC tools ride on that ecosystem, so you inherit its edge cases (DDL, non-determinism, time zones).
  • PostgreSQL treats identifiers differently. Unquoted identifiers fold to lowercase, which quietly breaks migrations from ecosystems that rely on case-insensitive behavior.
  • PostgreSQL’s JSONB is not just “JSON, but faster.” It stores a parsed representation with different indexing behavior and subtle equality semantics.
  • UTF-8 everywhere wasn’t always the norm. Collations and charsets are still a migration minefield because “same string” can mean “different order.”
  • MariaDB’s “zero” dates were a compatibility crutch. PostgreSQL refuses them. Your app code has to grow up.

MariaDB vs PostgreSQL: differences that bite in production

SQL dialect and type system: strictness is not a bug

MariaDB (like MySQL) is historically permissive: implicit casts, silent truncation depending on SQL mode, “helpful” conversions. PostgreSQL is stricter and more explicit. That’s good for correctness, but it exposes the parts of your application that were freeloading on undefined behavior.

Examples that commonly break:

  • Implicit string-to-int casts that worked before, now error.
  • GROUP BY behavior differences when selecting non-aggregated columns.
  • Timestamp defaults and time zone handling.
  • NULL sorting behavior and collation order affecting pagination.
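
A minimal SQL illustration of the first two items (the orders table and id column are stand-ins; MariaDB’s exact behavior depends on its SQL mode):

-- Implicit cast: MariaDB coerces '42abc' to 42 with a warning;
-- PostgreSQL errors: invalid input syntax for type integer.
SELECT * FROM orders WHERE id = '42abc';

-- Non-aggregated column: MariaDB without ONLY_FULL_GROUP_BY returns an arbitrary id;
-- PostgreSQL errors: column "orders.id" must appear in the GROUP BY clause or be used in an aggregate function.
SELECT status, id FROM orders GROUP BY status;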

Transactions and locking: same words, different consequences

Both have transactions; both have isolation levels; both can deadlock. But the shape differs:

  • PostgreSQL uses MVCC plus row-level locks; readers don’t block writers and vice versa, but long transactions prevent cleanup and create bloat.
  • MariaDB/InnoDB also uses MVCC, but lock patterns around gap locks, next-key locks, and unique checks show up differently, especially under REPEATABLE READ.
  • DDL locking differs. PostgreSQL can do some operations without long exclusive locks (CREATE INDEX CONCURRENTLY, many metadata-only ALTERs), but not all; MariaDB has its own “online DDL” rules that depend on version and engine.

Auto-increment vs sequences: you can’t ignore the mapping

MariaDB’s AUTO_INCREMENT is tied to table metadata. PostgreSQL typically uses SEQUENCE objects, often via GENERATED ... AS IDENTITY. During migration, you must set sequence values correctly or you’ll get primary key collisions after cutover.
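
A minimal sketch of the mapping on a hypothetical orders table (Task 9 below covers fixing the sequence value after backfill):

-- MariaDB side, for reference:
--   id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
-- PostgreSQL side: the identity column owns an implicit sequence.
-- BY DEFAULT (rather than ALWAYS) lets the backfill insert explicit ids without OVERRIDING SYSTEM VALUE.
CREATE TABLE orders (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY
  -- ...remaining columns
);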

Charset and collation: the silent correctness killer

Even if you’re “all UTF-8,” collations differ. Ordering and case-folding differ. That breaks:

  • Unique constraints (two strings considered equal in one system but distinct in another, or vice versa).
  • Index usage (collation-dependent).
  • Pagination when ordering by text columns (users see duplicates or missing rows across pages).
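
A self-contained probe you can run on the PostgreSQL side; if byte order and the default collation disagree for your data, compare both against MariaDB’s ordering before trusting any cursor built on text comparisons:

-- Byte order ("C") vs the database's default linguistic collation.
SELECT x FROM (VALUES ('apple'), ('Banana'), ('Åland')) AS t(x) ORDER BY x COLLATE "C";
SELECT x FROM (VALUES ('apple'), ('Banana'), ('Åland')) AS t(x) ORDER BY x;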

Operational model: vacuum is your new recurring meeting

PostgreSQL needs vacuuming to reclaim space and keep visibility maps healthy. Autovacuum is usually fine—until it isn’t. If your workload is heavy-update and you run long transactions, you will learn what “bloat” means in a way that becomes personal.

One quote worth keeping on a sticky note:

“Hope is not a strategy.” — General Gordon R. Sullivan

Joke #1: If your migration plan is “we’ll monitor it,” you don’t have a plan—you have a hobby.

Zero-downtime architecture: backfill + CDC + controlled cutover

The baseline pattern

A sane zero-downtime plan usually looks like this:

  1. Prepare Postgres: schema, roles, extensions, parameter tuning, observability.
  2. Initial backfill: copy existing tables from MariaDB to Postgres while the app stays online and writing to MariaDB.
  3. CDC / replication into Postgres: stream ongoing changes (inserts/updates/deletes) from MariaDB binlog to Postgres to keep it near-real-time.
  4. Consistency validation: checksums, counts, sampled row comparisons, plus application-level invariants.
  5. Cutover: switch reads, then writes (or do both with a short write freeze measured in seconds, not minutes).
  6. Post-cutover hardening: index tuning, vacuum tuning, query fixes, and a rollback window where MariaDB remains available.

CDC vs dual-write: pick your poison carefully

CDC (recommended) means the app keeps writing only to MariaDB until cutover. You replicate changes to Postgres out-of-band. This reduces application complexity and avoids dual-write consistency bugs. The cost is you need a robust CDC pipeline and a plan for DDL changes during the sync window.

Dual-write means your app writes to both databases for a period. It can work, but you now own the distributed transaction problem. If you do this, you must design for partial failure: one write succeeds, the other fails, and your system must reconcile. Most teams overestimate their appetite for this.

Practical advice: if you can avoid dual-write, avoid it. CDC plus a carefully managed cutover is less heroic and more repeatable.

Cutover mechanics that don’t ruin your weekend

The cleanest cutover is “reads first, writes last”:

  • Move read-only traffic (reporting, batch jobs) to Postgres early to shake out query differences and index gaps.
  • Keep OLTP writes on MariaDB while CDC keeps Postgres current.
  • Schedule a short write freeze (seconds to a few minutes) to drain lag to zero, validate, then flip writes.

If your product team demands “literally zero write freeze,” you can do it, but you’ll pay with complexity: idempotent writes, conflict resolution, and sometimes a temporary logical clock system. For most businesses, a brief write pause is the cheapest honesty you’ll ever buy.

DDL during migration: the enemy of “simple”

Schema changes mid-migration are where timelines go to die. Handle it deliberately:

  • Freeze DDL during the final sync window. Make it a policy, not a suggestion.
  • If you must apply DDL, apply it to both systems and validate that your CDC tool handles it correctly (many don’t, or only handle a subset).
  • Prefer backward-compatible schema evolution (add columns nullable, backfill, then enforce constraints later).
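
A hedged sketch of that pattern on a hypothetical column:

-- Step 1: additive and backward-compatible; old code simply ignores the column.
ALTER TABLE orders ADD COLUMN fulfillment_region text;

-- Step 2: backfill in bounded batches from a job, not one giant UPDATE.
UPDATE orders SET fulfillment_region = 'unknown'
WHERE fulfillment_region IS NULL AND id BETWEEN 1 AND 100000;

-- Step 3: enforce only after backfill is complete and both databases agree.
ALTER TABLE orders ALTER COLUMN fulfillment_region SET NOT NULL;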

Practical tasks with commands: what you run, what it means, what you decide

These are the tasks that turn “migration project” into “migration with receipts.” Commands are representative and runnable; adapt credentials and hosts.

Task 1: Verify MariaDB binary logging (CDC prerequisite)

cr0x@server:~$ mysql -h mariadb01 -u admin -p -e "SHOW VARIABLES LIKE 'log_bin'; SHOW VARIABLES LIKE 'binlog_format'; SHOW VARIABLES LIKE 'server_id';"
Enter password:
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| binlog_format | ROW       |
+---------------+-----------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 101   |
+---------------+-------+

What it means: log_bin=ON and binlog_format=ROW are the standard baseline for reliable CDC. server_id must be set and unique.

Decision: If binlog is off or not ROW, fix this before anything else. Statement-based replication will eventually create “but it worked in staging” moments.

Task 2: Check MariaDB GTID mode (helps resuming CDC)

cr0x@server:~$ mysql -h mariadb01 -u admin -p -e "SHOW VARIABLES LIKE 'gtid_strict_mode'; SHOW VARIABLES LIKE 'gtid_binlog_pos';"
Enter password:
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| gtid_strict_mode | ON    |
+------------------+-------+
+-----------------+------------------------------------------+
| Variable_name   | Value                                    |
+-----------------+------------------------------------------+
| gtid_binlog_pos | 0-101-9876543                            |
+-----------------+------------------------------------------+

What it means: GTID makes it easier to restart a CDC pipeline without guessing binlog file/position.

Decision: If GTID isn’t enabled, you can still do CDC, but your operational burden goes up. Decide early whether to standardize on GTID.

Task 3: Inventory schema features that won’t translate cleanly

cr0x@server:~$ mysql -h mariadb01 -u admin -p -N -e "SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema','sys') ORDER BY TABLE_SCHEMA, TABLE_NAME LIMIT 10;"
Enter password:
appdb	audit_log	InnoDB	utf8mb4_unicode_ci
appdb	inventory	InnoDB	utf8mb4_unicode_ci
appdb	order_items	InnoDB	utf8mb4_unicode_ci
appdb	orders	InnoDB	utf8mb4_unicode_ci
appdb	payments	InnoDB	utf8mb4_unicode_ci
appdb	products	InnoDB	utf8mb4_unicode_ci
appdb	refunds	InnoDB	utf8mb4_unicode_ci
appdb	sessions	InnoDB	utf8mb4_unicode_ci
appdb	settings	InnoDB	utf8mb4_unicode_ci
appdb	users	InnoDB	utf8mb4_unicode_ci

What it means: You’re checking engines and collations. Non-InnoDB tables, weird collations, and legacy encodings are migration risk multipliers.

Decision: If you find MyISAM/Aria or mixed collations, budget time to normalize or explicitly map behavior in Postgres (including app-level expectations).

Task 4: Capture MariaDB SQL mode (tells you how much bad data you have)

cr0x@server:~$ mysql -h mariadb01 -u admin -p -e "SELECT @@sql_mode;"
Enter password:
+----------------------------------------------------------------------------------+
| @@sql_mode                                                                       |
+----------------------------------------------------------------------------------+
| STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+----------------------------------------------------------------------------------+

What it means: Less strict modes often hide truncations and invalid dates. Strict mode reduces surprises, but doesn’t eliminate historical bad rows.

Decision: If strictness is lax, run data quality checks before migrating. Postgres will refuse some values MariaDB accepted.

Task 5: Identify “zero dates” and invalid timestamps

cr0x@server:~$ mysql -h mariadb01 -u admin -p -e "SELECT COUNT(*) AS zero_dates FROM appdb.orders WHERE created_at='0000-00-00 00:00:00';"
Enter password:
+------------+
| zero_dates |
+------------+
|         42 |
+------------+

What it means: PostgreSQL doesn’t support year 0000 timestamps. Those rows will fail to load unless transformed.

Decision: Pick a remediation policy: convert to NULL, convert to epoch, or move the field to text temporarily. Document it; your auditors will ask later.

Task 6: Stand up Postgres with baseline settings and verify

cr0x@server:~$ psql -h pg01 -U postgres -c "SHOW server_version; SHOW wal_level; SHOW max_wal_senders;"
 server_version
----------------
 16.3
(1 row)

 wal_level
-----------
 replica
(1 row)

 max_wal_senders
-----------------
 10
(1 row)

What it means: You’re confirming the target engine and replication settings (even if you’re not using PG replication, WAL behavior impacts ops).

Decision: If you plan logical replication or heavy ingestion, set WAL parameters and checkpoint tuning deliberately. Default settings are not a performance plan.

Task 7: Create roles and enforce least privilege early

cr0x@server:~$ psql -h pg01 -U postgres -c "CREATE ROLE app_user LOGIN PASSWORD 'REDACTED'; GRANT CONNECT ON DATABASE appdb TO app_user;"
CREATE ROLE
GRANT

What it means: You’re avoiding the classic “we’ll fix permissions after cutover” lie.

Decision: If your app currently relies on wide privileges (e.g., DDL), decide whether to keep that (not recommended) or refactor now.

Task 8: Convert schema (example using pgloader) and inspect results

cr0x@server:~$ pgloader mysql://migrate:REDACTED@mariadb01/appdb postgresql://postgres@pg01/appdb
2025-12-30T10:01:12.000000Z LOG pgloader version "3.6.9"
2025-12-30T10:01:13.000000Z LOG Migrating from #<MYSQL-CONNECTION mysql://migrate@.../appdb>
2025-12-30T10:01:13.000000Z LOG Migrating into #<PGSQL-CONNECTION postgresql://postgres@pg01/appdb>
2025-12-30T10:05:44.000000Z LOG Data errors in table "orders": 0
2025-12-30T10:05:44.000000Z LOG Total import time: 00:04:31

What it means: The tool did a first pass at schema + data. The log line about data errors is your first red flag detector.

Decision: Treat this as scaffolding, not the final answer. Review types, constraints, indexes, and sequences; then run CDC for ongoing changes.

Task 9: Confirm sequences are set correctly after backfill

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT pg_get_serial_sequence('orders','id') AS seq; SELECT MAX(id) FROM orders;"
           seq
--------------------------
 public.orders_id_seq
(1 row)

  max
--------
  984211
(1 row)

What it means: You found the sequence name and the current max id in the table.

Decision: If the sequence is behind, fix it now, before cutover.

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT setval('public.orders_id_seq', (SELECT MAX(id) FROM orders));"
 setval
--------
 984211
(1 row)
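
To automate this across every id-bearing table in the cutover runbook, one workable sketch is a DO block (the table list here is an assumption; extend it to every table that carried AUTO_INCREMENT):

-- Hedged sketch: advance each sequence to the owning table's current MAX(id).
-- Tables without an attached sequence are skipped silently (setval on NULL is a no-op).
DO $$
DECLARE
  t text;
BEGIN
  FOREACH t IN ARRAY ARRAY['orders', 'order_items', 'payments', 'refunds'] LOOP
    EXECUTE format(
      'SELECT setval(pg_get_serial_sequence(%L, %L), COALESCE((SELECT MAX(id) FROM %I), 1))',
      t, 'id', t);
  END LOOP;
END $$;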

Task 10: Measure CDC lag (example using Debezium-style offsets stored locally)

cr0x@server:~$ jq -r '.source.ts_ms' /var/lib/cdc/offsets/appdb.json
1735559142000

What it means: That timestamp (ms since epoch) is the source event time of the last processed binlog change.

Decision: Compare to current time; if lag grows, don’t cut over. Fix throughput first (network, apply speed, indexes, batch sizing).
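
To turn that into seconds-behind-now, a small shell sketch against the same hypothetical offset file (ts_ms is milliseconds since epoch, hence the division):

cr0x@server:~$ echo "lag_seconds=$(( $(date +%s) - $(jq -r '.source.ts_ms' /var/lib/cdc/offsets/appdb.json) / 1000 ))"

Single-digit seconds, holding steady under peak write load, is the shape you want before you even think about scheduling cutover.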

Task 11: Verify row counts across systems (fast, imperfect, still useful)

cr0x@server:~$ mysql -h mariadb01 -u admin -p -N -e "SELECT COUNT(*) FROM appdb.orders;"
Enter password:
984211
cr0x@server:~$ psql -h pg01 -U postgres -d appdb -t -c "SELECT COUNT(*) FROM orders;"
 984211

What it means: Counts match for this table. That doesn’t prove data equality, but it catches missing chunks.

Decision: If counts differ, stop and find out why: filters in the loader, failed rows, CDC missing deletes, or PK collisions.

Task 12: Validate checksums for a sampled slice (better than counts)

cr0x@server:~$ mysql -h mariadb01 -u admin -p -N -e "SELECT MD5(GROUP_CONCAT(CONCAT(id,':',status,':',total_cents) ORDER BY id SEPARATOR '|')) FROM appdb.orders WHERE id BETWEEN 900000 AND 900500;"
Enter password:
2f1c7c0a9b0e0a2f8dd7d8a11d5f4b3a
cr0x@server:~$ psql -h pg01 -U postgres -d appdb -t -c "SELECT md5(string_agg(id||':'||status||':'||total_cents, '|' ORDER BY id)) FROM orders WHERE id BETWEEN 900000 AND 900500;"
 2f1c7c0a9b0e0a2f8dd7d8a11d5f4b3a

What it means: For that slice, the content matches. This method is sensitive to formatting/casting; that’s a feature if you’re worried about implicit conversions.

Decision: Run this for multiple ranges and high-churn tables. If mismatched, investigate transforms, rounding, timezone conversions, and text normalization.

Task 13: Catch slow queries on PostgreSQL immediately after read shadowing

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT calls, mean_exec_time, rows, query FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
 calls | mean_exec_time | rows |                     query
-------+----------------+------+------------------------------------------------
  1200 |         87.332 |    1 | SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT $2
   310 |         54.910 |   25 | SELECT * FROM products WHERE name ILIKE $1
    98 |         42.774 |    1 | UPDATE inventory SET qty = qty - $1 WHERE sku = $2

What it means: You’re looking at query patterns, not individual incidents. Mean execution time is your “what got worse?” indicator.

Decision: Add indexes, rewrite queries, or change pagination strategy before cutover. Do not “wait and see” with customer-facing latency.

Task 14: Explain the worst offender and decide index vs query change

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at DESC LIMIT 20;"
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..152.12 rows=20 width=256) (actual time=24.112..24.145 rows=20 loops=1)
   Buffers: shared hit=12040 read=210
   ->  Index Scan Backward using orders_created_at_idx on orders  (cost=0.43..81234.55 rows=10680 width=256) (actual time=24.110..24.140 rows=20 loops=1)
         Filter: (user_id = 123)
         Rows Removed by Filter: 420000
         Buffers: shared hit=12040 read=210
 Planning Time: 0.231 ms
 Execution Time: 24.201 ms
(8 rows)

What it means: It’s scanning by created_at and filtering by user_id. That’s expensive: many rows removed by filter.

Decision: Create a composite index matching the predicate + order, e.g. (user_id, created_at DESC), or rewrite query to use a seek method if possible.
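
A hedged fix for exactly this shape (the index name is arbitrary; CONCURRENTLY avoids blocking writes but cannot run inside a transaction block):

CREATE INDEX CONCURRENTLY IF NOT EXISTS orders_user_id_created_at_idx
  ON orders (user_id, created_at DESC);

Re-run the EXPLAIN afterwards: the plan should switch to an index scan on the new index with no “Rows Removed by Filter” line.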

Task 15: Check vacuum health and dead tuples (prevents “it was fast yesterday”)

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
   relname   | n_dead_tup |        last_autovacuum        |       last_autoanalyze
-------------+------------+-------------------------------+-------------------------------
 audit_log   |    1983321 | 2025-12-30 09:11:02+00        | 2025-12-30 09:12:40+00
 sessions    |     822110 | 2025-12-30 09:08:14+00        | 2025-12-30 09:10:05+00
 orders      |     110422 | 2025-12-30 09:00:01+00        | 2025-12-30 09:03:33+00
 products    |       9221 | 2025-12-30 08:55:40+00        | 2025-12-30 08:56:10+00
 inventory   |       1087 | 2025-12-30 08:54:11+00        | 2025-12-30 08:54:55+00

What it means: High dead tuples mean churn; vacuum is running but might not be keeping up, especially if transactions are long-lived.

Decision: Tune autovacuum per table for hot churn tables, reduce long transactions, and watch bloat before it becomes a ticket flood.

Task 16: Confirm connections and pooler behavior before cutover

cr0x@server:~$ psql -h pg01 -U postgres -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"
  state  | count
---------+-------
 idle    |   180
 active  |    35
(2 rows)

What it means: Connection count and states tell you whether you need pooling. PostgreSQL doesn’t love 2000 idle app connections holding memory.

Decision: If you see connection storms or high idle counts, deploy a pooler (or fix app pooling) before the day you need calm.

Joke #2: A database cutover is like a parachute jump—discovering you packed it wrong is a memorable learning experience.

Three corporate-world mini-stories (pain included)

Mini-story #1: The incident caused by a wrong assumption

The company: mid-sized SaaS, multiple regions, a MariaDB primary with replicas. They wanted PostgreSQL for better query features and a saner extension ecosystem. The migration team did the right big pieces: backfill, CDC, shadow reads. Everything looked green. Cutover window booked.

The wrong assumption was tiny: “string ordering is basically the same.” Their app used cursor-based pagination for customer lists, ordering by a name column and applying WHERE name > last_seen_name. In MariaDB with their chosen collation, the ordering and comparison rules matched what their UI expected. In PostgreSQL, collation and locale-specific rules produced a different ordering for accented characters and some mixed-case edge cases.

After cutover, the symptom was subtle: customers reported “missing users” in lists. The data wasn’t missing. Pagination skipped records because the cursor logic no longer aligned with the sort order. Support escalations arrived with screenshots; engineers arrived with disbelief. That’s the usual order of operations.

The fix wasn’t heroic. They made ordering deterministic by adding a secondary sort key (stable unique id), changed the cursor to include both fields, and aligned collations where possible. They also added a regression test that generates random Unicode names and validates pagination across engines. It was dull, and it worked.

Mini-story #2: The optimization that backfired

The company: e-commerce platform with big tables and bursty traffic. They had a backfill pipeline that loaded into PostgreSQL. Someone noticed ingestion was slower than expected and decided to speed it up by disabling indexes and constraints during load, then recreating everything at the end. That’s not automatically wrong.

The backfire came from two directions. First, their CDC pipeline was already applying changes. With constraints absent, duplicates and orphan rows slipped in during the load phase, because their apply logic assumed the database would enforce uniqueness and foreign keys. Second, after they recreated indexes, the index builds saturated I/O and CPU and starved the CDC applier, which increased lag. Increased lag increased the cutover risk. Nobody slept well.

They eventually stabilized by changing the sequence: load in chunks with the essential constraints in place (primary keys, not every foreign key), keep CDC apply performance predictable, and schedule index builds using CREATE INDEX CONCURRENTLY in controlled batches. They also implemented dead-letter handling in the CDC pipeline for rows that violate constraints, so one bad event didn’t stall everything.

The moral: optimization is not “make the numbers bigger.” It’s “make the failure modes smaller.”

Mini-story #3: The boring but correct practice that saved the day

The company: enterprise B2B, lots of integrations, and an audit trail obsession. The team did something unsexy: they ran the entire migration pipeline in a production-like staging environment with production-like data volume and production-like traffic replays. Not a “unit test.” A dress rehearsal.

During rehearsal they discovered that one table had a TEXT column storing JSON blobs with inconsistent encoding. MariaDB accepted it. Postgres refused some rows when cast to jsonb. Because they were in rehearsal, they had time to implement a transformation: store raw text in a json_raw column, populate a validated jsonb column where possible, and add an async cleanup job to normalize old rows.

They also practiced rollback. Not just “we can roll back,” but actually flipping the app back to MariaDB in staging with cached connections, poolers, and DNS TTLs. They measured how long it took for every service to reconnect. That number became a requirement for the production cutover runbook.

On the real cutover day, a network ACL change slowed traffic to Postgres from one app subnet. Because they had rehearsed, their runbook’s first steps caught it quickly, and rollback readiness kept nerves low. They fixed the ACL and proceeded. Nobody wrote a postmortem titled “We assumed the network was fine.” Boring saved the day.

Fast diagnosis playbook

This is for the moment right after you start shadow reads, ramp traffic, or cut over—and latency climbs. Don’t argue with graphs. Triage.

First: is it the database, the network, or the app?

  1. Check app-side error rate and timeouts: are we saturating connection pools or hitting new errors from stricter SQL?
  2. Check network RTT and packet loss between app and Postgres (especially if Postgres is in a different subnet/region).
  3. Check Postgres connection states: active vs idle, waiting vs running.

Second: find the top wait, not the top query

  1. Look for locks: are transactions stuck waiting on locks because of long-running jobs or DDL?
  2. Look for I/O pressure: are reads coming from disk because indexes aren’t used or caches aren’t warm?
  3. Look for CPU pressure: is a missing index turning simple filters into scans?

Third: confirm Postgres isn’t fighting itself

  1. Autovacuum saturation: too many dead tuples or blocked vacuum due to long transactions.
  2. Checkpoint storms: WAL/checkpoint tuning causing periodic latency spikes.
  3. Mis-sized work_mem: sorts and hashes spilling to disk.

Practical “fast diagnosis” commands

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE wait_event IS NOT NULL GROUP BY 1,2 ORDER BY count(*) DESC;"
 wait_event_type |     wait_event     | count
-----------------+--------------------+-------
 Lock            | transactionid       |    12
 IO              | DataFileRead        |     9
 LWLock          | WALWrite            |     4
(3 rows)

What it means: You see what sessions are waiting on. Locks and I/O waits are common during cutover if indexes are wrong or transactions are long.

Decision: If Lock waits dominate, identify blockers and fix transaction behavior. If I/O waits dominate, add indexes or adjust queries; don’t “tune Postgres” before you fix access paths.
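
A quick way to name the blockers, using built-in functions only:

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

Each blocked_by entry is a PID you can inspect and, if justified, terminate with pg_terminate_backend().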

cr0x@server:~$ psql -h pg01 -U postgres -d appdb -c "SELECT pid, age(clock_timestamp(), xact_start) AS xact_age, state, wait_event, left(query,120) AS query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_age DESC LIMIT 5;"
 pid  | xact_age |        state        | wait_event |                       query
------+----------+---------------------+------------+----------------------------------------------------
 4412 | 00:34:51 | active              |            | UPDATE inventory SET qty = qty - $1 WHERE sku = $2
 3201 | 00:12:03 | idle in transaction | ClientRead | SELECT * FROM orders WHERE id = $1
(2 rows)

What it means: “idle in transaction” is the silent vacuum killer and lock amplifier.

Decision: Fix app transaction scope; kill pathological sessions if needed. Postgres will not save you from leisurely transactions.

Common mistakes: symptom → root cause → fix

1) “Postgres is slower” right after cutover

Symptom: Latency jumps, CPU climbs, and you see sequential scans on hot paths.

Root cause: Missing composite indexes and different optimizer assumptions. MariaDB often got away with different index usage, and your queries may rely on it.

Fix: Use pg_stat_statements + EXPLAIN (ANALYZE, BUFFERS). Add indexes that match filters + ordering. Validate with real parameters, not toy examples.

2) Primary key collisions after enabling writes on Postgres

Symptom: Inserts fail with duplicate key on tables that used AUTO_INCREMENT.

Root cause: Sequences not advanced to the max existing id after backfill/CDC.

Fix: Run setval() for each sequence to at least the current max id, then re-check. Automate this in the cutover runbook.

3) CDC lags forever and never catches up

Symptom: Lag grows during peak traffic; Postgres apply can’t keep up.

Root cause: Apply side doing heavy index maintenance during backfill, insufficient batching, or network/disk throughput limits.

Fix: Backfill in chunks, build only essential indexes early, and add secondary indexes later. Tune CDC batch sizes. Ensure Postgres has enough IOPS and WAL settings aren’t throttling ingestion.

4) “It worked in staging” but production has data load failures

Symptom: Loader rejects rows, often around dates, numeric precision, or invalid UTF-8.

Root cause: Production data contains historical garbage: zero dates, out-of-range integers, invalid bytes.

Fix: Run data quality queries on MariaDB early. Decide transformation rules explicitly and implement them consistently in backfill and CDC.

5) Pagination duplicates/missing rows

Symptom: Users see the same item twice across pages or items disappear.

Root cause: Non-deterministic ordering, collation differences, or ordering by non-unique columns.

Fix: Add a stable tie-breaker (unique id) to ORDER BY and cursor tokens. Avoid offset pagination for high churn lists.
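
A keyset (seek) pagination sketch with the tie-breaker built in, assuming a users table with a text name and a unique integer id:

-- First page: a deterministic total order (unique id breaks ties).
SELECT id, name FROM users ORDER BY name, id LIMIT 50;

-- Next pages: a row-value comparison as the cursor; :last_seen_name and :last_seen_id
-- are bind parameters taken from the final row of the previous page.
SELECT id, name
FROM users
WHERE (name, id) > (:last_seen_name, :last_seen_id)
ORDER BY name, id
LIMIT 50;

A composite index on (name, id) lets the row-value comparison seek instead of scan.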

6) Deadlocks suddenly appear in Postgres

Symptom: Errors about deadlocks on update-heavy tables; retries spike.

Root cause: Different lock acquisition order across queries, plus new indexes that change execution paths.

Fix: Standardize lock order in application logic. Keep transactions small. Add retries with jitter. Analyze deadlock graphs via server logs.

7) Disk grows unexpectedly fast on Postgres

Symptom: Storage usage increases daily; performance degrades over time.

Root cause: Autovacuum not keeping up, long transactions preventing cleanup, or too-low autovacuum thresholds on hot tables.

Fix: Tune autovacuum per table; eliminate idle-in-transaction; consider partitioning for append-heavy logs; monitor bloat and vacuum lag.
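
If you take the partitioning route for append-heavy logs, a declarative range-partitioning sketch (table and partition names are hypothetical, monthly ranges as an example):

CREATE TABLE audit_log_p (
  id        bigint GENERATED BY DEFAULT AS IDENTITY,
  logged_at timestamptz NOT NULL,
  payload   jsonb,
  PRIMARY KEY (id, logged_at)  -- the partition key must be part of any unique constraint
) PARTITION BY RANGE (logged_at);

CREATE TABLE audit_log_p_2025_12 PARTITION OF audit_log_p
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

Dropping an old partition is a metadata operation, which is far kinder than vacuuming millions of deleted rows.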

Checklists / step-by-step plan

Phase 0: Decide the non-negotiables

  • Define acceptable write freeze (if any) in seconds.
  • Define rollback window (how long MariaDB stays ready).
  • Freeze DDL policy during final sync.
  • Pick migration strategy: CDC-first (recommended) or dual-write (only if you must).

Phase 1: Preflight inventory (where surprises are born)

  • List tables, sizes, and churn rate; mark hot tables.
  • Enumerate data types that don’t map cleanly: JSON-as-text, zero dates, unsigned ints, enums, collations.
  • Enumerate critical queries and their latency SLOs.
  • Identify background jobs that run long transactions.

Phase 2: Build PostgreSQL like you mean it

  • Pick Postgres version and stick to it across environments.
  • Enable pg_stat_statements and baseline observability.
  • Set up roles, migrations pipeline, and schema management.
  • Decide partitioning strategy for big append-only tables (audit logs, events).

Phase 3: Backfill safely

  1. Backfill in chunks by primary key ranges or time windows.
  2. Keep primary keys and essential indexes available to support CDC apply.
  3. Track rejected rows and fix upstream data or transform rules.
  4. Record backfill checkpoints so reruns are deterministic.

Phase 4: CDC sync window

  • Start CDC from a known binlog position/GTID.
  • Monitor lag and apply errors continuously.
  • Stop schema drift: no uncoordinated DDL.
  • Run consistency checks nightly: counts + sampled checksums + business invariants.

Phase 5: Shadow reads and performance tuning

  • Route a small percentage of read traffic to Postgres and compare results.
  • Fix query incompatibilities and performance regressions now, not during cutover.
  • Validate collation-sensitive features (search, sorting, pagination).
  • Warm caches and validate connection pool sizing.

Phase 6: Cutover runbook (a sequence you can execute under stress)

  1. Announce change freeze and confirm on-call coverage.
  2. Confirm CDC lag is low and stable; identify current lag in seconds.
  3. Enable brief write freeze (or strict rate limit) at the app layer.
  4. Wait for CDC to reach zero lag; confirm last GTID/position captured.
  5. Run final consistency checks on critical tables.
  6. Flip writes to Postgres (feature flag / config / service discovery).
  7. Flip reads to Postgres everywhere.
  8. Monitor errors, p95 latency, locks, replication, and storage.
  9. Keep MariaDB read-only and available for rollback during the agreed window.

Phase 7: Post-cutover hardening

  • Turn on stricter app error handling for query failures; don’t mask exceptions.
  • Review pg_stat_statements weekly for the first month.
  • Tune autovacuum for hot tables.
  • Retire CDC pipeline only after the rollback window closes and data is validated.

FAQ

1) Can I do MariaDB → PostgreSQL with literally zero downtime?

Yes, but “zero downtime” usually hides a write consistency problem. The pragmatic approach is a brief write freeze to drain CDC lag to zero. If you must avoid it, you’ll need dual-write plus reconciliation and idempotency—expect complexity and more edge cases.

2) Should I use CDC or dual-write?

Prefer CDC. It keeps the application simpler and localizes complexity to a pipeline you can observe and restart. Dual-write is for cases where you already have idempotent writes, strong retry semantics, and a willingness to build reconciliation tooling.

3) What breaks most often at the SQL level?

Implicit casts, GROUP BY semantics, date/time handling, and case sensitivity of identifiers. PostgreSQL being strict is a feature; it surfaces bugs you already had.

4) How do I handle MariaDB UNSIGNED integers?

Decide per column: map to a larger signed type in Postgres (e.g., unsigned int → bigint), or enforce constraints and accept reduced range. Don’t blindly map and hope—range overflows show up later as “impossible” bugs.
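
A per-column sketch of the “widen plus constrain” option (hypothetical inventory.qty, originally INT UNSIGNED in MariaDB):

-- bigint covers the full 0..4294967295 range of INT UNSIGNED;
-- the CHECK preserves the "never negative" intent the unsigned type used to enforce.
ALTER TABLE inventory
  ALTER COLUMN qty TYPE bigint,
  ADD CONSTRAINT inventory_qty_nonnegative CHECK (qty >= 0);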

5) What about ENUM types?

In Postgres you can use native ENUM or a lookup table with a foreign key. Native enums are fine if values are stable; lookup tables are better if values change often and you want auditability.
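
Both options in one sketch (names are hypothetical and assume orders.status is text; pick one per column, not both):

-- Option A: native enum. Tidy and compact while the value set is stable.
CREATE TYPE order_status AS ENUM ('pending', 'paid', 'shipped', 'refunded');

-- Option B: lookup table. Easier to extend, audit, and join against.
CREATE TABLE order_statuses (code text PRIMARY KEY);
INSERT INTO order_statuses VALUES ('pending'), ('paid'), ('shipped'), ('refunded');
ALTER TABLE orders
  ADD CONSTRAINT orders_status_fk FOREIGN KEY (status) REFERENCES order_statuses (code);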

6) How do I validate data correctness without comparing every row?

Combine techniques: row counts for coverage, sampled checksums for content, and business invariants (e.g., order totals match item sums). Also validate “edge” slices: newest rows, oldest rows, and high-churn ranges.

7) Why did my Postgres disk usage grow faster than expected?

MVCC means updates create dead tuples until vacuum cleans them. If you have long transactions or autovacuum can’t keep up, bloat grows. Fix transaction length and tune autovacuum; don’t just throw disks at it.

8) Do I need a connection pooler for Postgres?

Often yes. PostgreSQL connections are heavier than many teams expect, and “one connection per thread” can become expensive. If you have many app instances, pooling is usually the calm choice.
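
If you land on PgBouncer, a minimal transaction-pooling sketch (host, paths, and sizes are assumptions to adapt; session-level features such as SET, advisory locks, and some prepared-statement usage need extra care in transaction mode):

; /etc/pgbouncer/pgbouncer.ini
[databases]
appdb = host=pg01 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 20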

9) How do I plan rollback?

Keep MariaDB available and ideally read-only after cutover. To roll back writes cleanly, you need either (a) no writes occurred in Postgres (rare), or (b) a reverse sync plan (hard). In practice, rollback means flipping back quickly while you investigate, then re-cut when fixed. This is why minimizing post-cutover divergence time matters.

10) Is pgloader enough?

It’s a strong tool for initial schema and backfill, not a full zero-downtime story on its own. Treat it as the bulk copy step, then use CDC for ongoing changes and a disciplined cutover.

Conclusion: next steps you can execute

If you want a migration without downtime and without surprises, do the boring engineering: measure everything, constrain change, and practice the cutover. The systems that survive are the ones where you can answer, in real time, “how far behind are we?” and “can we roll back in five minutes?”

Next steps you can run this week:

  • Inventory MariaDB: binlog format, GTID, collations, SQL mode, and dirty data (zero dates, invalid UTF-8).
  • Stand up Postgres with observability: pg_stat_statements, connection monitoring, vacuum visibility.
  • Do one table end-to-end: backfill, CDC, checksum validation, and query performance on shadow reads.
  • Write the cutover runbook and rehearse it with traffic replay and a rollback drill.

Run the migration like you run production: controlled change, hard evidence, and a plan that doesn’t require optimism to function.
