Most database outages aren’t caused by a novel bug. They’re caused by an assumption you didn’t know you made—because the database default made it feel safe. You ship code that “works” in staging, and then production teaches you a new definition of the word “works.”
MariaDB and PostgreSQL are both excellent. They’re also opinionated in opposite ways. MariaDB’s defaults often try to keep the app running, even if the data gets a little… interpretive. PostgreSQL’s defaults are more likely to stop the train at the station and make you explain yourself. That difference is not academic. It changes incident rates, debugging time, and what kind of mistakes your team is allowed to make.
The thesis: forgiving vs punishing
In production, “forgiving” doesn’t mean “friendly.” It means the database will attempt to interpret your intent, and that interpretation might be different from what your auditors, your analysts, or your future self wanted.
MariaDB (especially in MySQL-compatibility land) has a long tradition of being permissive: coercing types, accepting weird date values depending on modes, quietly truncating, and generally trying not to break applications. PostgreSQL tends to be stricter: it throws errors where ambiguity exists, demands explicit casts, and forces you to face constraints early. One keeps the app up. The other keeps the data honest.
Both approaches can be correct. The trick is choosing the failure mode you prefer:
- MariaDB’s default vibe: “We’ll store something; don’t page anyone.” Good when uptime is king and you have strong upstream validation. Dangerous when data correctness is king and you don’t.
- PostgreSQL’s default vibe: “This is ambiguous; fix your code.” Good when correctness matters and you want mistakes to surface early. Dangerous when you can’t tolerate write errors and your app isn’t designed to retry properly.
Joke 1: MariaDB sometimes feels like a database that says “sure” the way a tired bartender says “sure.” You’ll regret it in the morning.
If you’re building systems that affect money, inventory, access control, compliance, or anything that ends in a meeting with Legal: prefer “punish mistakes early.” If you’re running a high-volume event ingestion pipeline with solid validation at the edges: forgiving defaults can be fine, as long as you make them explicit and observable.
Interesting facts and history (the parts that still matter)
- MariaDB exists because of fear of acquisition gravity: it was created after MySQL’s acquisition (Sun, then Oracle), to keep a community-led fork with compatibility as a core promise.
- PostgreSQL’s roots are academic but its culture is conservative: it descends from POSTGRES research; the project historically prioritized correctness and standards over “whatever works.”
- MySQL’s permissive behaviors shaped an ecosystem: ORMs and app codebases learned to rely on silent coercions. Porting those apps to Postgres often surfaces “bugs” that were always there.
- InnoDB became the default storage engine in MySQL because MyISAM kept losing arguments: transactions and crash recovery stop being optional once you’ve had your first real outage.
- PostgreSQL’s MVCC is central to its operational story: it avoids reader/writer blocking in many cases, but it “pays” with vacuum requirements and potential bloat if you ignore housekeeping.
- MariaDB’s ecosystem split over time: features and defaults diverged across MariaDB, MySQL, and Percona Server; operational recipes are not always portable even when SQL syntax is.
- Postgres’ extension culture is a superpower: key operational capabilities often arrive as extensions (for stats, indexing, partitioning tooling). That’s flexibility, but also more knobs to own.
- SQL standard compliance is a practical tool, not a trophy: Postgres hews closer to the standard, and that changes how predictable your query semantics are during migrations and across drivers.
Defaults that decide your fate
SQL modes and “helpful” coercions
If there’s one reason MariaDB feels forgiving, it’s that it will often accept your input and reshape it into something storable. Sometimes it warns you; sometimes you won’t notice because your client library doesn’t surface warnings. Your monitoring probably doesn’t track them either.
MariaDB: behavior is heavily influenced by sql_mode. Without strict modes, inserts that overflow, truncate, or contain invalid values can succeed with warnings. In strict modes, many of those become hard errors. The database can be either forgiving or punishing; the problem is that many environments do not make that choice explicitly.
PostgreSQL: tends to error on invalid casts, out-of-range values, invalid timestamps, and so on. This is operationally “loud,” but it’s also how you prevent bad data from becoming a quiet, long-term incident that shows up in analytics six months later.
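A minimal sketch of the difference, assuming a hypothetical MariaDB table t with a TINYINT amount column:

-- Permissive session: the out-of-range value is clamped and only a warning is raised
SET SESSION sql_mode = '';
INSERT INTO t (amount) VALUES (999);   -- succeeds, stores 127, SHOW WARNINGS reports code 1264

-- Strict session: the same statement becomes a hard error your app has to handle
SET SESSION sql_mode = 'STRICT_TRANS_TABLES';
INSERT INTO t (amount) VALUES (999);   -- fails with ERROR 1264 (out of range)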
What this changes in practice:
- In MariaDB, you must treat sql_mode as a production contract and version it like code.
- In Postgres, you must treat application retries and error handling as a production contract and test them under realistic concurrency.
Transactions and isolation: what you get for free
Default isolation levels are not trivia. They’re the difference between “our counters are sometimes weird” and “we have a financial reconciliation incident.”
MariaDB (InnoDB): historically defaults to REPEATABLE READ. That reduces some anomalies but introduces others (like gap locks and more surprising lock behavior). It can make certain write patterns deadlock-prone under concurrency in ways teams don’t anticipate.
PostgreSQL: defaults to READ COMMITTED. It’s sane for many workloads and reduces some locking surprises, but it allows non-repeatable reads unless you use stronger isolation or explicit locking.
Operational implication: neither default is “safe” in the moral sense. They are safe for different kinds of developer assumptions. If your app assumes “I read it, so it stays true until I commit,” Postgres will happily violate that unless you ask it not to. If your app assumes “locking is simple,” InnoDB will punish you with a deadlock that only happens at peak traffic.
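If a code path genuinely depends on stable reads or simpler locking, say so explicitly instead of relying on the default. A sketch, with an illustrative accounts table:

-- PostgreSQL: request repeatable reads only where the logic needs them
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 42;
-- ... business logic that assumes the row stays consistent within this transaction ...
COMMIT;

-- MariaDB: a session that suffers from gap-lock deadlocks can drop to READ COMMITTED
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;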
Character sets and collations: the silent corruption
Encoding and collation problems rarely page you at 2 a.m. They page you during a migration, or during an acquisition data merge, or when you start serving users in a new language. Which is worse, because now the issue is political as well as technical.
MariaDB: defaults historically varied. latin1 was common; utf8 (which in MySQL/MariaDB means 3-byte UTF-8, not the full range) caused surprises; newer defaults trend toward utf8mb4 depending on version and configuration. Collations differ, and some collations have changed behavior across versions.
PostgreSQL: encoding is set at database creation and is typically UTF-8 in modern deployments, but collations depend on OS/ICU settings. That’s great until you restore to a different OS image and sorting changes subtly.
My strong advice: pick UTF-8 end-to-end, and in both systems, make collations explicit for user-visible sorting/searching. “Default collation” is not a strategy; it’s an untested dependency.
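One way to make it explicit, with illustrative table and collation names (available collations depend on your build and ICU/OS locale data):

-- PostgreSQL: pin the sort order in the query instead of inheriting the OS locale
SELECT name FROM customers ORDER BY name COLLATE "C";

-- MariaDB: make charset and collation part of the table definition, not the server default
ALTER TABLE customers CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;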
Constraints, foreign keys, and how mistakes surface
Constraints are how you turn database behavior from “best effort” to “contract.” If you don’t define the contract, you still have one—you just don’t know what it is.
PostgreSQL: constraints are first-class and commonly used. Deferred constraints exist and are useful when you know what you’re doing. Postgres will happily enforce your rules and loudly reject violations.
MariaDB: supports foreign keys in InnoDB, but many MariaDB/MySQL ecosystems historically underused them, often because of legacy MyISAM history, replication concerns, or fear of write overhead. The result: integrity enforced in application code—until it isn’t.
If you’re moving from MariaDB to Postgres, you’ll often discover that your data already violates the constraints you always assumed were true. Postgres isn’t “being difficult.” It’s doing the job you didn’t ask MariaDB to do.
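In Postgres you can add the contract first and confront the historical violations separately. A sketch, assuming a customer_id column that should have been a foreign key all along:

-- NOT VALID skips the full-table scan at ALTER time; new writes are checked immediately
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Validation scans existing rows and names the orphans you inherited
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;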
Autovacuum vs purge: housekeeping is policy
PostgreSQL’s MVCC means rows don’t disappear when you delete them; they become dead tuples and must be vacuumed. Autovacuum runs by default, but it’s tuned for “average” and your workload is never average. If you ignore it, performance degrades slowly, then suddenly, and the first symptom is usually “disk is full” or “queries are inexplicably slow.”
In InnoDB, you have different housekeeping dynamics: undo logs, purge threads, and history list length. It’s not “no vacuum,” it’s “different plumbing.” In both systems, defaults aim for safety and generality, not your specific write pattern.
This is where the forgiving/punishing divide flips. Postgres will keep your reads consistent, but it will punish you later if you don’t vacuum. MariaDB/InnoDB may keep trucking, but it can punish you with long stalls, purge lag, or replication issues when internal cleanup can’t keep up.
Replication defaults: consistency vs availability tradeoffs
Replication is where defaults become politics: do you want to acknowledge writes before they are durable on a replica? Do you want the primary to wait? Do you want reads from replicas that might be stale?
MariaDB replication: often starts life as asynchronous. It’s easy to set up, easy to misunderstand, and extremely good at producing “we lost data during failover” postmortems if you don’t explicitly design for it.
PostgreSQL streaming replication: is also often asynchronous by default, but synchronous replication is a common and well-supported pattern. The database makes you be explicit about whether you want “no data loss” semantics, and it will make you pay for them in latency.
Defaults don’t save you from physics. They just choose who gets surprised first: you or your customers.
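A sketch of making that choice explicit on the Postgres side (the standby name is illustrative):

# postgresql.conf on the primary
synchronous_standby_names = 'FIRST 1 (replica1)'   # commits wait for this standby to confirm
synchronous_commit = on                            # or remote_apply if replica reads must see the write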
Fast diagnosis playbook
When latency spikes or errors rise, you need a sequence that works under pressure. Not a philosophy discussion. This is the order that tends to find the bottleneck fastest across MariaDB and Postgres.
First: confirm what kind of pain you’re in (CPU, IO, locks, or saturation)
- Are queries slow, or are they waiting? “Waiting” usually means locks, IO, or connection pool saturation.
- Is the database CPU pegged? Think bad plan, missing index, stats drift, or too many concurrent workers.
- Is IO maxed? Think bloat/vacuum, checkpointing, buffer pool too small, random reads, or noisy neighbors.
- Are connections maxed? Think pool misconfiguration, runaway clients, or slow queries holding connections.
Second: identify the top wait/lock offender
- PostgreSQL: look at pg_stat_activity and wait events; find blockers and long transactions.
- MariaDB: look at SHOW PROCESSLIST, InnoDB status, and transaction/lock views.
Third: validate the storage engine’s housekeeping health
- PostgreSQL: autovacuum progress, dead tuples, bloat suspicion, checkpoint pressure.
- MariaDB/InnoDB: history list length, purge lag, buffer pool hit rate, IO capacity settings.
Fourth: check replication and backups (because incident response gets worse when you can’t fail over)
- Replication lag and errors.
- Disk space trends and WAL/binlog growth.
- Backup freshness and restore confidence.
Paraphrased idea (attributed): Gene Kim has often pushed the operational truth that fast feedback beats heroics; shorten the loop and you prevent the midnight surprises.
Practical tasks with commands (what output means, what decision you make)
These are the bread-and-butter checks you run during incidents, performance work, and migrations. Each task includes a command, sample output, what it means, and the decision it drives.
Task 1: Identify MariaDB SQL mode (strictness and coercion risk)
cr0x@server:~$ mysql -uroot -p -e "SELECT @@GLOBAL.sql_mode, @@SESSION.sql_mode\G"
Enter password:
*************************** 1. row ***************************
@@GLOBAL.sql_mode: STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
@@SESSION.sql_mode: STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
What it means: strict mode is enabled, so truncations/invalid values are more likely to error rather than warn.
Decision: if strict modes are missing in production, align them with staging, update app error handling, and schedule data-cleanup before enabling strictness.
Task 2: Check MariaDB warnings after a suspicious insert (silent data damage detector)
cr0x@server:~$ mysql -uroot -p -e "INSERT INTO t(amount) VALUES ('999999999999'); SHOW WARNINGS;"
Enter password:
Level Code Message
Warning 1264 Out of range value for column 'amount' at row 1
What it means: the insert succeeded but the value was coerced/truncated depending on column type and mode.
Decision: if you see warnings in app-critical paths, either tighten sql_mode or fix application validation and start collecting warnings in logs/metrics.
Task 3: Confirm PostgreSQL isolation level (debugging “it changed under me” reads)
cr0x@server:~$ psql -U postgres -d appdb -c "SHOW default_transaction_isolation;"
default_transaction_isolation
-----------------------------
read committed
(1 row)
What it means: each statement sees a fresh snapshot; within a transaction, later statements may see committed changes from other sessions.
Decision: if the app assumes repeatable reads, fix the app or explicitly use REPEATABLE READ for the critical transaction.
Task 4: Find Postgres sessions waiting on locks (fast path to “who is blocking?”)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS age, left(query,120) FROM pg_stat_activity WHERE wait_event_type IS NOT NULL ORDER BY age DESC LIMIT 10;"
pid | usename | state | wait_event_type | wait_event | age | left
------+--------+--------+-----------------+---------------+----------+-----------------------------------------------------------
8421 | app | active | Lock | transactionid | 00:03:12 | UPDATE orders SET status='paid' WHERE id=$1
8399 | app | active | Lock | relation | 00:01:48 | ALTER TABLE orders ADD COLUMN notes text
(2 rows)
What it means: the UPDATE is waiting on a transaction ID lock; the ALTER is waiting on a relation lock. DDL can block DML.
Decision: identify the blocker next (Task 5). If it’s an accidental DDL during peak, kill it or postpone with a proper migration plan.
Task 5: Identify the blocking query in Postgres (turn “mystery” into a name)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT blocked.pid AS blocked_pid, blocker.pid AS blocker_pid, now()-blocker.query_start AS blocker_age, left(blocker.query,120) AS blocker_query FROM pg_locks blocked JOIN pg_locks blocker ON blocker.locktype=blocked.locktype AND blocker.database IS NOT DISTINCT FROM blocked.database AND blocker.relation IS NOT DISTINCT FROM blocked.relation AND blocker.page IS NOT DISTINCT FROM blocked.page AND blocker.tuple IS NOT DISTINCT FROM blocked.tuple AND blocker.virtualxid IS NOT DISTINCT FROM blocked.virtualxid AND blocker.transactionid IS NOT DISTINCT FROM blocked.transactionid AND blocker.classid IS NOT DISTINCT FROM blocked.classid AND blocker.objid IS NOT DISTINCT FROM blocked.objid AND blocker.objsubid IS NOT DISTINCT FROM blocked.objsubid AND blocker.pid!=blocked.pid JOIN pg_stat_activity blocked_act ON blocked_act.pid=blocked.pid JOIN pg_stat_activity blocker ON blocker.pid=blocker.pid WHERE NOT blocked.granted AND blocker.granted LIMIT 5;"
blocked_pid | blocker_pid | blocker_age | blocker_query
------------+------------+-------------+---------------------------------------------------------
8421 | 8302 | 00:06:40 | BEGIN; SELECT * FROM orders WHERE id=$1 FOR UPDATE;
(1 row)
What it means: a long transaction holding a row lock is blocking updates.
Decision: fix the app path (missing commit/rollback), add timeouts, and consider killing the blocker if it’s safe and the incident demands it.
Task 6: MariaDB quick lock/transaction clue via processlist
cr0x@server:~$ mysql -uroot -p -e "SHOW FULL PROCESSLIST;"
Enter password:
Id User Host db Command Time State Info
21 app 10.0.2.15 appdb Query 187 Waiting for table metadata lock ALTER TABLE orders ADD COLUMN notes TEXT
34 app 10.0.2.16 appdb Query 183 Updating UPDATE orders SET status='paid' WHERE id=?
What it means: DDL is waiting for a metadata lock; it may also be blocking others depending on timing.
Decision: stop doing online DDL the naive way. Use a migration approach that avoids long metadata locks, and schedule it.
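A sketch of the less naive version, assuming a MariaDB version that supports the online DDL clauses below:

-- Fail fast instead of queueing behind a metadata lock at peak (value in seconds)
SET SESSION lock_wait_timeout = 10;

-- Ask for a non-blocking change explicitly; the statement errors instead of silently locking
ALTER TABLE orders ADD COLUMN notes TEXT, ALGORITHM=INPLACE, LOCK=NONE;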
Task 7: Check InnoDB engine health quickly (purge lag and deadlock hints)
cr0x@server:~$ mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
Enter password:
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
=====================================
2025-12-30 01:42:11 0x7f4a4c1f9700 INNODB MONITOR OUTPUT
=====================================
...
History list length 148732
...
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 923847, ACTIVE 3 sec starting index read
...
What it means: large history list length suggests purge is behind, often due to long-running transactions. Deadlock section shows patterns.
Decision: hunt long transactions, fix app transaction scope, and tune purge/IO capacity only after eliminating long transactions.
Task 8: PostgreSQL autovacuum health by table (bloat early warning)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, vacuum_count, autovacuum_count FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
relname | n_live_tup | n_dead_tup | last_autovacuum | vacuum_count | autovacuum_count
------------+------------+------------+------------------------+--------------+------------------
events | 81234921 | 23999110 | 2025-12-29 22:11:05+00 | 0 | 148
orders | 392110 | 88112 | 2025-12-29 20:03:12+00 | 2 | 67
(2 rows)
What it means: lots of dead tuples on events and frequent autovacuums. It may be keeping up—or barely.
Decision: if dead tuples stay high, tune autovacuum thresholds per table, consider partitioning, and check for long transactions preventing cleanup.
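Per-table tuning is usually the right lever here, not global settings. A sketch for the hot table above:

-- Fire autovacuum at roughly 2% dead tuples instead of the global default of 20%
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.02   -- keep planner stats fresh on the same cadence
);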
Task 9: Postgres “are we checkpoint thrashing?” sanity check
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time, buffers_checkpoint FROM pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | checkpoint_write_time | checkpoint_sync_time | buffers_checkpoint
------------------+-----------------+-----------------------+----------------------+-------------------
214 | 987 | 8241132 | 119883 | 18933211
(1 row)
What it means: many requested checkpoints relative to timed ones: the system is forcing checkpoints, often due to WAL volume and settings.
Decision: adjust checkpoint/WAL settings and/or reduce write amplification (indexes, bloat). Also verify storage latency; slow disks turn checkpointing into a visible outage.
Task 10: MariaDB buffer pool hit rate and dirty pages (is the cache working?)
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';"
Enter password:
Variable_name Value
Innodb_buffer_pool_read_requests 9134829123
Innodb_buffer_pool_reads 18239481
Variable_name Value
Innodb_buffer_pool_pages_dirty 48211
What it means: if Innodb_buffer_pool_reads is high relative to requests, you’re missing cache and doing disk reads. Dirty pages show write pressure.
Decision: if cache misses dominate and you have RAM, increase buffer pool. If dirty pages stay high and IO is saturated, look at flush settings and write patterns.
Task 11: Identify top SQL in Postgres via pg_stat_statements (the “who’s burning CPU?” list)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT calls, total_exec_time::int AS total_ms, mean_exec_time::numeric(10,2) AS mean_ms, rows, left(query,100) FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
calls | total_ms | mean_ms | rows | left
-------+----------+---------+--------+----------------------------------------------------
89122 | 9123412 | 102.38 | 89122 | SELECT * FROM events WHERE tenant_id=$1 AND ts>=$2 ORDER BY ts DESC LIMIT 100
12211 | 3421901 | 280.21 | 12211 | UPDATE orders SET status=$1, updated_at=now() WHERE id=$2
(2 rows)
What it means: you have a “top talker” query consuming execution time.
Decision: run EXPLAIN (ANALYZE, BUFFERS) on the top query, add/adjust indexes, or change query shape. Also check whether the ordering forces a sort and memory spill.
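A sketch of that next step; the literal values stand in for $1 and $2 and should be replaced with representative production parameters:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE tenant_id = 42 AND ts >= now() - interval '1 day'
ORDER BY ts DESC
LIMIT 100;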
Task 12: Identify top SQL in MariaDB using the slow query log (the boring, effective hammer)
cr0x@server:~$ sudo tail -n 20 /var/log/mysql/mysql-slow.log
# Time: 2025-12-30T01:38:44.112233Z
# User@Host: app[app] @ 10.0.2.15 []
# Query_time: 4.982 Lock_time: 0.001 Rows_sent: 100 Rows_examined: 81234921
SET timestamp=1767058724;
SELECT * FROM events WHERE tenant_id=42 ORDER BY ts DESC LIMIT 100;
What it means: full scan behavior (Rows_examined huge) and an ORDER BY likely forcing extra work.
Decision: add a composite index (e.g., (tenant_id, ts)), validate the plan with EXPLAIN, and keep the slow log on in production with sane thresholds.
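A sketch of the fix, assuming tenant_id and ts are the actual column names:

-- Lets the query seek by tenant and read rows already in ts order (a reverse scan serves the DESC)
ALTER TABLE events ADD INDEX idx_events_tenant_ts (tenant_id, ts);

-- Confirm the plan changed before declaring victory
EXPLAIN SELECT * FROM events WHERE tenant_id=42 ORDER BY ts DESC LIMIT 100;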
Task 13: Postgres disk usage by table (find the real storage hogs)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS total_size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
relname | total_size
---------+-----------
events | 412 GB
orders | 38 GB
(2 rows)
What it means: events dominates storage. It’s the table that will dominate vacuum, cache behavior, and backup time.
Decision: consider partitioning, retention policies, and index review. Don’t tune the whole database when one table is the entire weather system.
Task 14: MariaDB replication lag check (failover readiness)
cr0x@server:~$ mysql -uroot -p -e "SHOW SLAVE STATUS\G" | egrep 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running|Last_SQL_Error'
Enter password:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 87
Last_SQL_Error:
What it means: replication is working but 87 seconds behind. If you fail over now, you lose up to 87 seconds of acknowledged writes.
Decision: if your RPO is near-zero, do not pretend this is fine. Fix the lag cause (IO, long transactions, big writes) or adopt synchronous/semisync patterns where appropriate.
Task 15: Postgres replication status and lag (primary visibility)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
application_name | state | sync_state | write_lag | flush_lag | replay_lag
------------------+-----------+------------+-----------+-----------+-----------
replica1 | streaming | async | 00:00:02 | 00:00:03 | 00:00:05
(1 row)
What it means: replica is slightly behind; it’s asynchronous, so some loss is possible on failover.
Decision: decide if async is acceptable. If not, implement synchronous replication for the critical data path and budget the latency cost.
Task 16: Confirm Postgres long transactions (vacuum and bloat killer)
cr0x@server:~$ psql -U postgres -d appdb -c "SELECT pid, usename, now()-xact_start AS xact_age, state, left(query,120) FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_age DESC LIMIT 10;"
pid | usename | xact_age | state | left
------+--------+----------+--------+----------------------------------------------------------
8302 | app | 01:12:44 | idle in transaction | BEGIN; SELECT * FROM orders WHERE id=$1 FOR UPDATE;
(1 row)
What it means: an idle-in-transaction session is holding locks and preventing vacuum from advancing.
Decision: kill it if needed, then fix the app (connection pool settings, transaction scope), and set idle_in_transaction_session_timeout.
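A sketch of both moves; the pid, role name, and timeout values are illustrative:

-- Emergency: terminate the offender (its transaction rolls back, locks are released)
SELECT pg_terminate_backend(8302);

-- Prevention: stop future sessions from camping on locks while idle in a transaction
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Statement timeouts are better scoped per role than globally
ALTER ROLE app SET statement_timeout = '30s';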
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption (forgiving defaults hid bad data)
A mid-sized marketplace ran MariaDB for years. The app team was proud of their “resilience.” If an upstream service sent slightly malformed data, the ingest pipeline kept going. No customer-visible errors. Green dashboards. Everybody slept.
Then finance asked a simple question: “Why do refunds not match chargebacks for the last quarter?” The numbers were close, but not close enough to ignore. The investigation was slow and humiliating because there was no single “bad deploy.” The badness was smeared across months.
The root cause was an assumption: “Invalid timestamps will fail inserts.” In their MariaDB environment, strict SQL mode had drifted between clusters. Some clusters accepted certain invalid date values with warnings; others rejected them. The pipeline didn’t check warnings. It stored coerced values that looked like legitimate dates but sorted wrong. A few downstream reports treated those dates as truth and produced subtle mismatches.
The fix was not heroic. They standardized sql_mode, added upstream validation, and wrote a data repair job. The cultural fix was harder: they stopped treating “no errors” as “correct.” They started treating warnings as a defect budget they could see and burn down.
Mini-story 2: The optimization that backfired (punishing defaults surfaced an app design flaw)
A SaaS company migrated a hot path from MariaDB to PostgreSQL. Their reason was sound: better indexing options, better query planning for analytics-like queries, and fewer footguns around type coercion. The migration went fine in tests. Then they rolled it to production and got hit with sporadic write failures.
The failures were not random. They correlated with peak traffic. The application used a pattern that “usually worked” in MariaDB: read a row, do some business logic, update it, repeat. Under MariaDB’s default behaviors and the way their code handled conflicts, they got “last write wins” more often than they realized. Not correct, but it kept the UX moving.
On Postgres, they tightened constraints and used transactions more rigorously. Now conflicts surfaced as errors. The database wasn’t punishing them for sport; it was refusing to lie. The app, however, was not built to retry transactions safely. It retried whole requests, duplicating side effects and producing user-visible weirdness.
They tried to “optimize” by increasing connection pool size and removing some locks in code. That made it worse: more concurrency increased conflict rate, and the pool saturated CPU with context switching and lock contention. The fastest query is still slow when it’s competing with 2,000 siblings.
They recovered by doing the boring fix: idempotency keys, proper transaction retries, and a clear policy on which operations require serialization. Performance improved after correctness improved. That ordering is not optional.
Mini-story 3: The boring but correct practice that saved the day (defaults don’t matter if you can see)
A payments-adjacent company ran both MariaDB and Postgres in different product lines. They had a simple rule: every database had a “defaults manifest” in version control—startup parameters, SQL modes, isolation expectations, encoding/collation choices, and the operational SLOs they implied.
This was not sexy work. It didn’t win architecture awards. It did, however, turn migrations into predictable engineering instead of archaeology. When a team wanted to clone an environment, they didn’t “copy what prod looks like.” They applied the manifest, then validated it with a script that ran sanity queries.
One day, an OS image update changed locale behavior in a Postgres replica. Sorting order differed subtly. Normally that would have become a slow-motion incident: users complain about search results, support files tickets, engineering argues about relevance ranking. Instead, their manifest validation caught a collation mismatch immediately after provisioning. No production impact. They fixed the image and moved on.
Joke 2: The most expensive database bug is the one that politely waits until you’re in the middle of a migration window.
Common mistakes: symptoms → root cause → fix
1) “Data looks wrong, but nothing errored”
Symptom: analytics discrepancies, weird dates, truncated strings, numeric values rounded unexpectedly.
Root cause: MariaDB running without strict SQL modes; warnings ignored; implicit casts and truncation accepted.
Fix: set strict sql_mode consistently, surface warnings in application logs, add validation at ingestion, and run a one-time data audit for coercion damage.
2) “Postgres is throwing errors in production after migration”
Symptom: insert/update failures for values that used to “work,” unexpected constraint violations, cast errors.
Root cause: the old system accepted ambiguous inputs; Postgres refuses. The app relied on permissive behavior.
Fix: correct the application types and validations, add explicit casts where appropriate, and make retries/idempotency part of the design.
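A small illustration of the difference, reusing the events table from the tasks above; the literal values are made up:

-- MariaDB without strict mode may coerce or zero out an impossible date; Postgres refuses
SELECT '2025-02-30'::date;    -- ERROR: date/time field value out of range

-- Explicit casts at the boundary make the intended type part of the statement
INSERT INTO events (ts) VALUES ('2025-12-30 01:00:00+00'::timestamptz);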
3) “Random deadlocks in MariaDB at peak”
Symptom: deadlock errors under concurrent updates; the same code path mostly works off-peak.
Root cause: InnoDB lock behavior under REPEATABLE READ, gap locks, inconsistent update order, and long transactions.
Fix: reduce transaction scope, access rows in a consistent order, add appropriate indexes to avoid range scans, and consider isolation changes only after query/index fixes.
4) “Postgres gets slower over weeks; disk keeps growing”
Symptom: rising latency, rising disk usage, vacuum activity spikes, occasional IO storms.
Root cause: autovacuum not keeping up, long-running transactions preventing cleanup, table/index bloat.
Fix: kill/limit long transactions, tune autovacuum per hot tables, implement partitioning/retention, and schedule manual vacuum/reindex only with evidence.
5) “Failover happened and we lost recent writes”
Symptom: after failover, some recent transactions disappear; users report missing updates.
Root cause: async replication with non-zero lag; failover treated as “safe” without measuring RPO.
Fix: measure replication lag continuously, define acceptable RPO, adopt synchronous replication where required, and adjust app UX for eventual consistency where allowed.
6) “Connection storms: database is up but everything times out”
Symptom: DB CPU moderate, but app timeouts spike; connection count at max.
Root cause: pool misconfiguration, slow queries holding connections, or retry loops amplifying load (especially on Postgres where errors are more common under strictness).
Fix: cap concurrency, use a pooler appropriately (and configure it), fix slow queries, and add jitter/backoff to retries.
Checklists / step-by-step plan
Checklist A: If you run MariaDB and want “Postgres-like” safety
- Standardize sql_mode across all clusters and sessions. Make it part of provisioning, not folklore.
- Turn warnings into signals: collect them in logs and alert when they spike.
- Use InnoDB everywhere (if you’re not already). If anything is still MyISAM, treat it as technical debt with interest.
- Add constraints and foreign keys where the domain is strict (money, entitlements, inventory). Let the database say “no.”
- Measure deadlocks and lock waits. Fix transaction scope and indexes before tuning magic knobs.
- Define encoding/collation explicitly. Prefer UTF-8 (utf8mb4) and test migrations with real multilingual data.
- Make replication lag visible and tie it to failover policy. Don’t “auto-failover” into data loss without a decision.
Checklist B: If you run Postgres and want “MariaDB-like” uptime without losing correctness
- Design application retries for serialization failures and transient errors. Test them under concurrency.
- Set sensible statement timeouts and idle_in_transaction_session_timeout.
- Install and use pg_stat_statements and track top queries as part of standard ops.
- Tune autovacuum for the tables that matter; defaults are a starting point, not a promise.
- Watch checkpoints and WAL volume; make sure storage latency is known and acceptable.
- Use synchronous replication only where the business requires it; otherwise be honest about RPO.
- For large tables, plan partitioning and retention early. Bloat is easier to prevent than to cure.
Checklist C: Migration sanity plan (MariaDB ↔ Postgres)
- Audit type usage: dates, timestamps, booleans, and numeric precision. List every implicit cast your app currently relies on.
- Freeze defaults: SQL mode, isolation expectations, encoding/collation, time zone handling.
- Run a data quality scan: invalid dates, truncated values, orphan records (missing FKs), duplicate keys where uniqueness was assumed.
- Rehearse cutover with production-like concurrency and failure injection (killed connections, deadlocks, retry storms).
- Define rollback criteria and validate backups with an actual restore test, not just a “backup succeeded” log line.
- Post-cutover, watch the boring charts: replication lag, checkpoint rate, dead tuples/history list, and slow query top talkers.
FAQ
1) Is MariaDB “unsafe” by default?
No. But it can be permissive in ways that let bad data enter quietly if you don’t set strict SQL modes and watch warnings. You can make it strict; you just have to do it deliberately.
2) Is PostgreSQL “better” because it’s stricter?
Stricter means your mistakes surface earlier. That’s usually better for correctness-heavy systems. But it also means you need robust error handling and retry logic, or you’ll trade silent data issues for loud outages.
3) Which database has safer default transaction isolation?
Neither is universally safer. MariaDB’s default REPEATABLE READ can reduce some anomalies but increase locking surprises. Postgres’ READ COMMITTED is simpler but allows non-repeatable reads unless you request stronger isolation.
4) Why do MariaDB-to-Postgres migrations “find” data problems?
Because Postgres enforces types and constraints more strictly, and it doesn’t silently coerce as much. The migration is acting like a truth serum for assumptions.
5) What’s the biggest operational footgun in Postgres defaults?
Assuming autovacuum will always keep up. It usually does—until you have a hot table, long transactions, or a write-heavy workload. Then it becomes a performance and disk problem.
6) What’s the biggest operational footgun in MariaDB defaults?
Assuming “it inserted” means “it stored what I meant.” Without strict modes and warning visibility, you can accumulate incorrect data while everything looks healthy.
7) Can I get “no data loss” failover with either?
Yes, but you must pay for it. You need synchronous replication semantics (or equivalent guarantees) and an application model that tolerates the added latency. Async replication is an RPO decision, not a default convenience.
8) Should I rely on the database defaults at all?
Rely on defaults only when you’ve written them down, validated them, and monitored the behavior they imply. Otherwise you’re relying on whatever version you happened to install.
9) If I can only do one improvement this month, what should it be?
Make your defaults explicit and observable: SQL modes / isolation expectations, encoding/collation, and a “top queries + lock waits + replication lag” dashboard. It prevents both slow-burn corruption and surprise outages.
Next steps you can actually do this week
- Write down your current defaults (MariaDB sql_mode, Postgres isolation, encoding/collation, time zone) and commit them to version control as a “runtime contract.”
- Run the fast diagnosis queries during a calm period, not during an incident. Baselines are how you recognize “weird.”
- Pick one correctness guardrail and enforce it: strict modes on MariaDB, or timeouts and retry-safe transactions on Postgres.
- Make housekeeping visible: dead tuples/autovacuum stats on Postgres; InnoDB history list length and deadlocks on MariaDB.
- Decide your replication truth: measure lag, define acceptable RPO, and design failover accordingly. Don’t let the default decide in the middle of an outage.
If you take nothing else: the database defaults are part of your production API. Treat them like code, because they behave like code—just without the unit tests.