You don’t migrate databases because you’re bored. You migrate because you hit a wall: write scaling, cross-region latency, operational overhead, exploding shard counts, or a compliance deadline that doesn’t care about your weekend plans.
MariaDB and TiDB both show up in the “we need MySQL-ish behavior but better” conversation. The marketing story is neat: keep SQL, gain scale, reduce pain. Production reality is messier: different failure modes, different knobs, and different ways to ruin your on-call rotation. Let’s talk about those.
What you’re really choosing (not a database)
“MariaDB vs TiDB” sounds like a product comparison. In production, it’s really a choice between two operational models:
- MariaDB is a familiar single-node InnoDB engine with mature replication patterns (async replication, semi-sync, GTID, and in some deployments Galera). You can run it boringly and well. You can also run it “creatively” and learn new synonyms for pain.
- TiDB is distributed SQL: compute nodes (TiDB), storage nodes (TiKV), and a control plane (PD). You’re buying horizontal scaling and HA by design—plus the complexity tax of a distributed system. Every “simple” question becomes “which tier and which component is lying?”
Here’s the decision I push in real orgs:
- If your bottleneck is one or a few hot primaries but the workload is still mostly single-region OLTP, and you can still buy headroom by scaling vertically and adding read replicas, MariaDB remains hard to beat for operational simplicity.
- If you’re already living in shard hell, or you must scale writes across nodes with strong consistency and acceptable latency, TiDB can be the adult choice—if you invest in observability and SRE maturity.
One paraphrased idea from Werner Vogels (Amazon's CTO) still holds: everything fails, all the time; the job is designing so the system keeps working anyway.
That idea maps cleanly here: MariaDB is about engineering careful failure boundaries around a mostly single-node source of truth. TiDB is about engineering around constant partial failure across many nodes.
Quick historical facts that actually matter
Some context points that change how you operate these systems. Short. Concrete. Useful.
- MariaDB was forked from MySQL in 2009, in response to Oracle's acquisition of Sun; that origin explains why compatibility is a first-class concern and why “mostly like MySQL” is part of the brand DNA.
- InnoDB became the default MySQL storage engine years before many teams stopped thinking in MyISAM terms. If your legacy app still assumes table-level locking behavior, you’re already playing on hard mode.
- Galera “multi-master” is really synchronous replication with certification. It’s not magic. It trades write concurrency for global ordering and can punish hot rows.
- TiDB’s architecture was inspired by Google Spanner/F1-style separation (without TrueTime). The important part is the split between stateless SQL and stateful KV.
- TiKV uses Raft replication. That means quorum writes, leader placement issues, and “one slow disk can slow a region” stories.
- PD (Placement Driver) is a real control-plane dependency. If you treat it like “just another service,” you’ll meet it at 3 a.m.
- MySQL compatibility is a moving target for TiDB. A query that works isn’t the same as a query that performs the same. SQL surface area is not operational equivalence.
- Online schema changes in the MySQL world evolved as a coping mechanism. Tools and patterns exist because “ALTER TABLE” has been hurting people since forever.
Architecture in plain ops terms: where the dragons live
MariaDB: one engine, many ways to replicate it
MariaDB performance troubleshooting still feels like home to most SREs: CPU, buffer pool, IO latency, mutex contention, bad queries, missing indexes, and replication lag. When it’s on fire, it’s usually on fire in one place.
Even in clustered setups (Galera), your unit of pain is still “the database node.” You can measure it, isolate it, replace it. The hard parts are:
- Replication semantics: async means lag and possible loss on failover; semi-sync reduces loss but increases latency; Galera means conflict handling and flow control.
- Schema changes: online DDL exists, but not everything is online, and large tables will find the sharp edges.
- Write scaling: you scale writes by scaling up, reducing write amplification, or changing the app. “Just add nodes” is mostly a read story.
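If Galera is in the mix and you suspect flow control, two status counters tell you quickly whether the cluster is throttling writers. A minimal sketch, using the same client invocation as the tasks below; host names are illustrative:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SHOW STATUS LIKE 'wsrep_flow_control_paused'; SHOW STATUS LIKE 'wsrep_local_recv_queue';"
A wsrep_flow_control_paused value creeping toward 1.0 means the cluster spends most of its time paused waiting on its slowest node; a persistently large wsrep_local_recv_queue on a node marks it as the one applying too slowly.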
TiDB: stateless SQL nodes + Raft storage + a control plane
TiDB spreads your database across tiers:
- TiDB (SQL layer): parses SQL, plans queries, pushes work down, and does distributed transactions.
- TiKV (storage): stores key/value data in regions, each replicated by Raft. Leaders matter. Network matters. Disk matters. Compaction matters.
- PD: schedules regions, balances leaders, provides timestamps and placement decisions.
That separation is why TiDB can scale out reads and writes. It’s also why you can have a “database incident” where CPU is fine, queries are fine, and the real villain is a single TiKV node with miserable IO, causing leader elections and transaction retries.
Dry truth: a distributed database is a database plus a distributed systems curriculum you’ll be forced to take during incidents. The exam is always timed.
Joke #1: Distributed databases are great because they turn one slow disk into a team-building exercise across three services.
Migration promises vs what breaks first
Promise: “It’s MySQL-compatible, so the app just works”
Compatibility gets you through parsing. It doesn’t get you through production behavior:
- Query plans diverge. A query that’s “fine” on MariaDB with a single-node optimizer can become a distributed table scan party in TiDB.
- Transaction patterns matter more. Long transactions, large batch updates, and hot keys expose contention in Raft and MVCC behavior.
- DDL behavior differs. TiDB supports online DDL patterns, but operationally you still need to plan for backfill load and PD scheduling.
Promise: “TiDB removes sharding complexity”
Yes, but you trade it for different complexity:
- Hotspotting is the new shard. Instead of “user_id modulo N,” you get “this region leader is melting.”
- Capacity planning is tiered. Compute nodes scale differently from storage nodes. Your bill and your bottleneck might not agree.
- Operational dependencies expand. You now operate a control plane, a storage layer, and stateless SQL nodes.
Promise: “MariaDB cluster gives HA and scale”
HA? Absolutely, when done right. Write scale? Cautiously. In Galera, every node must certify writes; heavy write workloads can suffer from conflicts and flow control. Async replication can scale reads easily but writes still funnel through a primary.
Promise: “We can migrate with minimal downtime”
You can, but the critical work is not the data copy. It’s the behavioral compatibility work:
- SQL modes and implicit type conversions
- Isolation level assumptions
- Use of non-deterministic functions
- Stored procedures, triggers, and edge-case DDL patterns
- Operational runbooks: backups, restores, failovers, DR
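Before any cutover, diff the semantics-relevant settings on both sides; anything that differs here is a latent correctness bug, not a style preference. A minimal sketch (host names are illustrative):
cr0x@server:~$ mysql -h mariadb01 -uroot -p -N -e "SELECT @@sql_mode, @@global.time_zone, @@collation_server\G"
cr0x@server:~$ mysql -h tidb01 -P4000 -uroot -N -e "SELECT @@sql_mode, @@global.time_zone, @@collation_server\G"
Capture both outputs into files, diff them as part of the migration checklist, and repeat after every config change.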
Production realities: performance, correctness, and sleep
Performance reality #1: latency has new floors
MariaDB can be very fast at single-row OLTP when data is in buffer pool and you’re not IO-bound. TiDB adds network hops and quorum commits. That doesn’t mean TiDB is “slow”; it means p99 latency has different physics.
If your product requires sub-5ms p99 for write transactions in one AZ, TiDB can still work, but you’ll sweat placement, leader locality, and transaction design. If you’re at 20–50ms p99 already because your app is chatty, TiDB’s overhead might be hidden under existing bad decisions.
Performance reality #2: hotspots are inevitable
In MariaDB, hotspots are mostly locking and index contention. In TiDB, hotspots often manifest as:
- One region getting most writes due to sequential keys
- A leader concentrated on one TiKV
- Skewed access patterns causing PD to chase balance forever
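When the root cause is a monotonically increasing key, TiDB's key-scattering options are the usual first lever. A minimal sketch with hypothetical table names; AUTO_RANDOM requires a BIGINT clustered primary key, and SHARD_ROW_ID_BITS only applies to tables without one:
CREATE TABLE events (
  id BIGINT AUTO_RANDOM PRIMARY KEY,
  payload JSON,
  created_at DATETIME
);
ALTER TABLE legacy_events SHARD_ROW_ID_BITS = 4;
Neither change is free: you give up insertion-ordered keys and cheap ordered primary-key scans in exchange for spreading write load across regions.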
Correctness reality: “multi-master” semantics are different
Galera’s synchronous replication means a transaction must be certified cluster-wide; conflicts happen at commit time, and retry logic matters. TiDB provides strong consistency via Raft replication and MVCC, but distributed transactions amplify the cost of contention and long-running transactions.
Operational reality: backups and restores are where confidence goes to die
Every database can back up. The question is: can you restore quickly, correctly, and under pressure? MariaDB logical dumps are portable but can be slow; physical backups are fast but require discipline. TiDB backups involve large distributed state; restores must consider placement, load, and version compatibility.
Joke #2: A backup is Schrödinger’s file: it’s both valid and useless until you try a restore.
Fast diagnosis playbook
This is the “stop debating architecture and find the bottleneck” sequence I use. Do it in order. Don’t freestyle until you have evidence.
First: is it the app, the network, or the database?
- Check end-to-end latency breakdown. If you don’t have it, grab it from app tracing or at least compare DB time vs total request time.
- Confirm saturation. CPU pegged? IO wait? Network retransmits? If nothing is saturated, you’re chasing contention or query plans.
- Confirm concurrency pattern. Spikes in connections, thread pools, queue depths.
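If you lack tracing, a thirty-second proxy for the concurrency question is to compare connected versus actively running threads. A minimal sketch for the MariaDB case (host name illustrative):
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL STATUS LIKE 'Threads_running';"
Hundreds connected but only a handful running usually means the app is waiting on something else; many running with flat throughput means the database is the queue.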
Second: decide whether the bottleneck is “SQL layer” or “storage layer”
- MariaDB: check InnoDB metrics, slow log, buffer pool, replication lag.
- TiDB: check TiDB CPU and query duration, then TiKV IO/CPU and raftstore metrics, then PD scheduling and leader balance.
Third: classify the failure mode
- Plan regression: sudden shift in query latency after deploy/schema change.
- Contention: lock waits, deadlocks, raft write stalls, hot regions.
- Capacity: disk latency, compaction debt, buffer pool misses, TiKV rocksdb stalls.
- Control plane: PD unavailable, scheduling stuck, metadata issues.
- Replication/failover: lag, elections, split brain prevention kicking in.
Fourth: pick the fastest “reduce load” lever while you diagnose
- Rate limit the hottest endpoints.
- Disable expensive background jobs.
- Temporarily increase connection pool timeouts to avoid thundering herds.
- In TiDB, consider moving leaders away from a sick TiKV and reducing hotspot pressure.
Practical tasks (commands, outputs, decisions)
Below are real tasks you can run during migration planning, performance debugging, and incident response. Each includes: command, sample output, what it means, and what decision you make.
Task 1: Confirm MariaDB version, server variables, and engine basics
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SELECT VERSION(); SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'log_bin';"
Enter password:
VERSION()
10.11.6-MariaDB-1:10.11.6+maria~ubu2204
Variable_name Value
innodb_buffer_pool_size 17179869184
Variable_name Value
log_bin ON
Meaning: You’re on MariaDB 10.11, buffer pool is 16GiB, binlog enabled.
Decision: If you rely on CDC or async replicas, binlog stays on. If you’re IO-bound and don’t need binlog, that’s a bigger conversation—don’t “optimize” away your recovery path.
Task 2: Spot top wait events and lock pain in MariaDB
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,120p'
Enter password:
*************************** 1. row ***************************
Type: InnoDB
Name:
Status:
=====================================
2025-12-30 10:41:23 0x7f5a2c1ff700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 28 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 714 srv_active, 0 srv_shutdown, 1082 srv_idle
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 24417
--Thread 140025024026368 has waited at row0lock.cc line 1134 for 12.00 seconds the semaphore:
Meaning: Threads are stuck waiting on semaphores inside InnoDB’s lock subsystem; that usually means heavy row-lock contention or a transaction holding locks far too long.
Decision: Find the blocking transaction (next task) and decide: kill it, reduce batch size, add index, or rewrite the update pattern.
Task 3: Identify long-running transactions in MariaDB
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5\G"
Enter password:
*************************** 1. row ***************************
trx_id: 9A3F2B11
trx_started: 2025-12-30 10:12:01
trx_mysql_thread_id: 18841
trx_query: UPDATE orders SET status='CLOSED' WHERE customer_id=412993;
Meaning: A transaction has been open ~30 minutes. That can block purge and lock other writers.
Decision: If it’s safe, kill the session and fix the application logic to avoid long transactions (commit more often, smaller batches).
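If killing it is the right call, use the trx_mysql_thread_id from the output above. A sketch; confirm with the application owner that rolling the transaction back is safe first:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "KILL 18841;"
Expect the rollback itself to take time roughly proportional to the work the transaction did; killing a huge transaction is not an instant fix.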
Task 4: Confirm replication status and lag in MariaDB
cr0x@server:~$ mysql -h mariadb-replica01 -uroot -p -e "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Using_Gtid"
Enter password:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 37
Using_Gtid: Slave_Pos
Meaning: Replica is healthy but ~37 seconds behind.
Decision: If your failover RTO/RPO can’t tolerate 37 seconds, you need semi-sync, better hardware, reduced write load, or different topology.
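If the answer is semi-sync, enabling it in MariaDB (10.3 and later, where the plugin is built in) is a few variables plus an IO-thread restart on the replica. A minimal sketch; the 2000 ms timeout means the primary silently falls back to async if no replica acknowledges in time, so persist and monitor these settings:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SET GLOBAL rpl_semi_sync_master_enabled=ON; SET GLOBAL rpl_semi_sync_master_timeout=2000;"
cr0x@server:~$ mysql -h mariadb-replica01 -uroot -p -e "SET GLOBAL rpl_semi_sync_slave_enabled=ON; STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"
Put the same settings in server config so they survive restarts, and alert on Rpl_semi_sync_master_status flipping to OFF.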
Task 5: Check binary log format for migration/CDC assumptions
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SHOW VARIABLES LIKE 'binlog_format';"
Enter password:
Variable_name Value
binlog_format ROW
Meaning: Row-based binlog is enabled; good for many CDC pipelines.
Decision: Keep ROW for correctness unless you have a tested reason not to. Statement-based binlogs plus non-deterministic queries is how you get “replica creativity.”
Task 6: Surface slow queries in MariaDB (with real evidence)
cr0x@server:~$ sudo tail -n 20 /var/log/mysql/slow.log
# Time: 2025-12-30T10:38:13.512345Z
# User@Host: app[app] @ 10.20.3.14 []
# Query_time: 2.871 Lock_time: 0.004 Rows_sent: 25 Rows_examined: 1820031
SELECT * FROM orders WHERE created_at >= '2025-12-01' ORDER BY created_at DESC LIMIT 25;
Meaning: Large scan (1.8M rows examined) to return 25 rows. Classic missing index or wrong access pattern.
Decision: Add a covering index (e.g., on created_at plus any filter columns), or redesign query to avoid sorting large ranges.
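The usual fix is a narrower index plus a narrower SELECT. A minimal sketch; the second column is hypothetical, so build the index from the query's real filter and sort columns:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "ALTER TABLE orders ADD INDEX idx_created_at_status (created_at, status);"
Then rerun the query through EXPLAIN (next task) and confirm the filesort is gone before declaring victory.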
Task 7: Validate index usage with EXPLAIN in MariaDB
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "EXPLAIN SELECT * FROM orders WHERE created_at >= '2025-12-01' ORDER BY created_at DESC LIMIT 25\G"
Enter password:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
type: ALL
possible_keys: idx_created_at
key: NULL
rows: 1900000
Extra: Using where; Using filesort
Meaning: Full table scan and filesort; index not used.
Decision: Fix index or query. If the index exists but isn’t used, check collation/type mismatch, functions on the column, or poor cardinality stats.
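Before rewriting, rule out stale statistics and prove whether the optimizer would benefit from the index at all. A minimal sketch:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "ANALYZE TABLE orders;"
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "EXPLAIN SELECT * FROM orders FORCE INDEX (idx_created_at) WHERE created_at >= '2025-12-01' ORDER BY created_at DESC LIMIT 25\G"
If the forced plan is dramatically cheaper, you have a statistics or costing problem; if it isn't, the query or schema needs the redesign, not the optimizer.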
Task 8: TiDB cluster health snapshot (control plane sanity)
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 store
{
"count": 3,
"stores": [
{ "store": { "id": 1, "address": "tikv01:20160", "state_name": "Up" } },
{ "store": { "id": 2, "address": "tikv02:20160", "state_name": "Up" } },
{ "store": { "id": 3, "address": "tikv03:20160", "state_name": "Up" } }
]
}
Meaning: PD is reachable; TiKV stores are Up.
Decision: If PD can’t answer quickly, stop blaming SQL. Stabilize PD/quorum first; everything depends on it.
Task 9: Detect hotspot regions in TiDB/TiKV (where “shards” hide)
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 hot read
{
"as_leader": [
{ "region_id": 74219, "store_id": 2, "hot_degree": 97, "flow_bytes": 183274112 }
]
}
Meaning: Store 2 is leader for a very hot region. Expect uneven latency and CPU on that node.
Decision: Investigate access pattern (sequential keys, time-series inserts, or a single tenant). Consider splitting regions, changing key design, or moving leaders as a mitigation.
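If the hotspot is the tail of a time-ordered table, pre-splitting regions buys breathing room while you fix the key design (AUTO_RANDOM or SHARD_ROW_ID_BITS, as discussed earlier). A minimal sketch; the key range and region count are hypothetical and should match your actual key distribution:
cr0x@server:~$ mysql -h tidb01 -P4000 -uroot -e "SPLIT TABLE orders BETWEEN (0) AND (2000000000) REGIONS 16;"
Splitting is a mitigation, not a cure: new writes to a monotonic key will still pile into the newest region.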
Task 10: Check leader balance and scheduling in TiDB (PD doing its job?)
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 scheduler show
[
"balance-leader-scheduler",
"balance-region-scheduler",
"balance-hot-region-scheduler"
]
Meaning: Expected schedulers are enabled.
Decision: If balance-hot-region-scheduler is missing/disabled, hotspots will linger. Enable it only if you understand the side effects on stability during peak load.
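Enabling it is a one-liner through pd-ctl; do it in a calm window and watch leader and region movement afterward. A sketch using the same tiup invocation as above:
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 scheduler add balance-hot-region-scheduler
If the scheduling churn itself becomes a problem, back it out the same way with scheduler remove.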
Task 11: Spot transaction retries and latch contention from TiDB SQL layer
cr0x@server:~$ mysql -h tidb01 -P4000 -uroot -e "SHOW STATUS LIKE 'tidb_txn_retry%'; SHOW STATUS LIKE 'tidb_server_execute%';"
Variable_name Value
tidb_txn_retry_total 1842
Variable_name Value
tidb_server_execute_duration_seconds_count 1298821
Meaning: Non-trivial transaction retries. That’s often contention, conflicts, or slow storage responses.
Decision: If retries climb during incidents, prioritize hotspot and storage health checks over SQL tuning trivia.
Task 12: Confirm TiKV disk health and IO wait on a suspected bad node
cr0x@server:~$ ssh tikv02 "iostat -x 1 3 | tail -n +4"
Device r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await %util
nvme0n1 210.4 488.7 12.1 44.8 121.3 8.72 15.90 98.7
Meaning: Disk is near-saturated (%util ~99) with elevated await. TiKV on this node will respond slowly; Raft leaders here will hurt the whole cluster.
Decision: Move leaders/regions away from this store and fix storage (NVMe health, filesystem, noisy neighbor, compaction pressure). If it’s a cloud volume, check burst credits and throughput caps.
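The standard leader-eviction mitigation targets the sick store by id (store 2 here) and should be removed once the disk is healthy again. A sketch:
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 scheduler add evict-leader-scheduler 2
cr0x@server:~$ tiup ctl:v8.5.0 pd -u http://pd01:2379 scheduler remove evict-leader-scheduler-2
Eviction moves leadership, not data: follower replicas stay on the node, so fix the disk rather than leaving the scheduler in place forever.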
Task 13: Validate TiDB query plan, not just query text
cr0x@server:~$ mysql -h tidb01 -P4000 -uroot -e "EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at >= '2025-12-01' ORDER BY created_at DESC LIMIT 25;"
+---------------------------+---------+---------+-----------+----------------------------+
| id | estRows | actRows | task | operator info |
+---------------------------+---------+---------+-----------+----------------------------+
| TopN_5 | 25.00 | 25 | root | order by:created_at, limit |
| └─TableFullScan_7 | 1.90e+6 | 1.82e+6 | cop[tikv] | keep order:false |
+---------------------------+---------+---------+-----------+----------------------------+
Meaning: Full scan is happening in TiKV. The distributed system is faithfully doing the wrong thing.
Decision: Add the right index or rewrite. If you must scan, schedule it off-peak and isolate via resource controls; don’t let analytics cosplay as OLTP.
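If the scan genuinely has to run, TiDB's resource control can keep it from starving OLTP, assuming a version with resource groups available; the group and user names here are hypothetical. A minimal sketch:
cr0x@server:~$ mysql -h tidb01 -P4000 -uroot -e "CREATE RESOURCE GROUP IF NOT EXISTS reporting RU_PER_SEC = 2000; ALTER USER 'report_app' RESOURCE GROUP reporting;"
Size RU_PER_SEC from observed consumption rather than guesswork, and confirm the group is actually bound to the reporting user before relying on it.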
Task 14: Check MariaDB disk usage and growth before a migration cutover
cr0x@server:~$ sudo du -sh /var/lib/mysql; sudo df -h /var/lib/mysql
182G /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 500G 412G 63G 87% /var/lib/mysql
Meaning: You’re at 87% disk usage. That’s not “fine.” That’s “one index rebuild away from paging the team.”
Decision: Before migration tooling runs, expand disk or clean up. Migrations amplify temporary space needs (DDL, backfill, logs, snapshots).
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company was mid-migration from MariaDB to TiDB. The plan was conservative: dual-write for a while, read from MariaDB, then flip reads to TiDB. The team was proud of their SQL compatibility testing; the integration suite passed; the dashboards were green.
They flipped a small percentage of read traffic to TiDB. Within an hour, p99 latency doubled for a handful of endpoints. Nothing dramatic, just the kind of slow bleed that ruins conversion funnels and makes product managers discover the phrase “database regression.”
The assumption: “If a query is indexed in MariaDB, it will behave similarly in TiDB.” The reality: the query used a composite index in MariaDB due to a particular selectivity pattern, but TiDB’s plan chose a full scan under that distribution. It was technically correct SQL and functionally correct results—just operationally wrong.
During the incident review, the painful lesson wasn’t “TiDB is bad.” It was that they had not built a plan-validation gate. They were testing correctness, not performance. The fix was boring: capture top queries, run EXPLAIN ANALYZE on both systems, and block migrations when the plan changed in expensive ways without explicit sign-off.
What saved them from repeating it was a single rule: if a query touches more than X rows or performs a full scan in TiDB, it needs either an index, a rewrite, or a resource isolation plan. No exceptions for “but it’s always been fine.”
Mini-story 2: The optimization that backfired
Another team ran MariaDB with replicas and had a periodic reporting job that did large range scans. Someone decided to “optimize” by moving the reporting job to a replica and increasing parallelism. The job finished faster. Everyone clapped quietly in Slack.
A week later, a failover happened during a network flap. The promoted replica was lagging more than the team realized because the reporting job had been hammering IO and applying replication events slowly. The failover succeeded, but they served stale-ish data for long enough that finance noticed, which is the kind of noticing that gets calendar invites.
The root issue wasn’t “reporting on a replica.” It was the combination of: increased IO load, less headroom for replication apply, and a failover process that didn’t enforce an RPO gate (“don’t promote if lag > threshold”).
The fix was not heroic. They rate-limited the reporting job, added a dedicated analytics replica with different hardware, and made the failover automation check replication lag explicitly. The lesson: performance optimizations that ignore failover paths are just outages with better PR.
Mini-story 3: The boring but correct practice that saved the day
A company running TiDB had an uneventful practice that nobody loved: quarterly restore drills. Not “we ran a backup command.” Actual restores into an isolated environment, with validation queries and a timed runbook.
Then came the day: a human error in an automation script dropped the wrong table in production. The change was caught quickly, but the data was gone quickly too. Panic tried to move in.
They executed the runbook. Restore to a new cluster, validate row counts and checksums for critical tables, and then re-point the application for the affected dataset. It wasn’t instant, but it was controlled. The incident was measured in downtime minutes, not existential dread.
The postmortem wasn’t exciting. No villain. No miracle. Just evidence that boring disciplines—restore drills, clear RTO/RPO targets, and tested cutover procedures—beat cleverness. The team kept their weekends, which is the rarest KPI of all.
Common mistakes (symptom → root cause → fix)
1) “TiDB is slow” (p95/p99 climbs after migration)
Symptom: Reads and writes are both slower; CPU is not pegged; app sees retries/timeouts.
Root cause: Cross-AZ/region placement, leaders not local, or disk latency on a subset of TiKV nodes. Distributed commits pay for the slowest quorum member.
Fix: Enforce locality (labels/placement rules), confirm leader distribution, and fix the slow storage node(s). Measure raft commit latency and disk await.
2) MariaDB replica “randomly” lags during peak
Symptom: Seconds_Behind_Master spikes; reads get stale; failover risk increases.
Root cause: Replica IO saturation, long transactions on primary creating huge binlog bursts, or single-threaded apply constraints depending on config/workload.
Fix: Reduce write amplification (indexes, batch size), ensure replica IO headroom, consider parallel replication if applicable, and set failover gates on lag.
3) Galera cluster throughput collapses when write concurrency increases
Symptom: Adding nodes doesn’t add write throughput; conflicts and flow control appear; latency spikes.
Root cause: Certification conflicts on hot rows, too many concurrent writers, or large transactions causing cluster-wide coordination overhead.
Fix: Reduce write contention (schema changes, key design), break up transactions, pin writes to one node (if acceptable), or stop pretending Galera is a write-scaling silver bullet.
4) TiDB hotspot that never goes away
Symptom: One TiKV node stays hot; PD “balances” but the pain persists; transaction retries rise.
Root cause: Sequential keys or monotonic inserts creating region hotspots; insufficient region splitting; workload skew by tenant.
Fix: Change key design (avoid monotonic primary keys for hot tables), enable/pre-split for hot tables, isolate tenants, and use hotspot scheduling carefully.
5) Migration succeeded, but correctness bugs appear weeks later
Symptom: Occasional missing/duplicated rows, weird rounding, case-sensitivity differences, or time zone surprises.
Root cause: Differences in SQL mode, collation, implicit conversions, or application reliance on undefined behavior.
Fix: Lock down SQL modes and collations, add data validation checks, and eliminate “string-to-int magic” and time zone ambiguity in the application.
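A validation pattern that runs identically on MariaDB and TiDB is per-range row counts plus an order-insensitive checksum; compare the output of the same query on both sides. A minimal sketch with hypothetical column names; the CONCAT_WS list must cover every column you actually care about:
cr0x@server:~$ mysql -h mariadb01 -uroot -p -e "SELECT COUNT(*) AS row_count, COALESCE(SUM(CRC32(CONCAT_WS('#', order_id, status, updated_at))), 0) AS range_checksum FROM orders WHERE order_id BETWEEN 1 AND 1000000;"
Run it per key range so a mismatch narrows to a window you can re-copy, and be explicit about NULL handling and time zone normalization before trusting the numbers.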
6) Backup exists, restore fails
Symptom: Restore takes far longer than planned, or fails due to permissions, missing binlogs, or incompatible versions.
Root cause: Untested restore path, inconsistent snapshots, or under-provisioned restore environment.
Fix: Regular restore drills, documented runbooks, and capacity planning for restore throughput (network + disk + CPU).
Checklists / step-by-step plan
Decision checklist: should you stay on MariaDB?
- Your write workload fits a single primary with headroom after reasonable tuning.
- You can tolerate replica lag or you can afford semi-sync tradeoffs.
- Your main pain is query/index quality, not shard count.
- You want the simplest failure model and the smallest operational surface area.
Decision checklist: should you move to TiDB?
- You need write scaling beyond a single node without application sharding.
- You can invest in observability: tracing, per-component metrics, and capacity management.
- You can accept higher baseline latency in exchange for scale and HA.
- You’re willing to redesign hot keys and transaction patterns.
Step-by-step migration plan that survives contact with production
- Inventory workload: top queries by latency and frequency, top tables by writes, biggest tables by size.
- Lock down semantics: SQL mode, timezone, collation, isolation expectations.
- Plan for indexes first: validate query plans in target system for the top N queries.
- Build data validation: counts, checksums per partition/range, and business invariants.
- Design the cutover: dual-write vs CDC; define failure rollback.
- Practice failover and restore: for both the old and new systems before production flip.
- Do a dark launch: mirror reads or shadow traffic; compare results and latencies.
- Ramp slowly: 1% → 5% → 25% → 50% with explicit acceptance criteria.
- Freeze risky changes: no big schema refactors during the ramp unless you enjoy drama.
- Document operational ownership: who pages for PD? Who pages for slow compactions? Who owns schema gates?
Runbook checklist: what “ready for prod” looks like
- Backup job runs with monitored success and known restore time.
- Restore drill completed in the last 90 days with recorded timings.
- Dashboards exist per tier (SQL, storage, control plane) with p95/p99 focus.
- Alert thresholds are tied to user impact (latency, error rate), not vanity metrics.
- Capacity headroom exists: disk, IO, and CPU, plus space for maintenance/compaction.
FAQ
1) Is TiDB a “drop-in replacement” for MariaDB/MySQL?
For many apps, it’s close enough to start testing quickly. But “drop-in” ends at performance and operational behavior. Treat it as a rewrite of your assumptions.
2) Does TiDB replace the need for read replicas?
It changes the model. You scale out TiDB stateless SQL nodes for read throughput, but you still must plan for failure domains, placement, and noisy neighbors.
3) Is Galera a simpler way to get what TiDB offers?
No. Galera can deliver HA and some forms of multi-node writing, but it’s not the same as a distributed storage layer with Raft replication. Under heavy write contention, Galera’s certification model can hurt.
4) What workloads make TiDB shine?
High write scale needs without application sharding, large datasets with horizontal growth, and mixed workloads where scaling reads and writes independently helps. Also: teams that can operate distributed systems well.
5) What workloads should stay on MariaDB?
Single-region OLTP with predictable growth, low-latency requirements, and teams that value operational simplicity over horizontal write scaling. MariaDB is still a very competent workhorse.
6) What’s the first thing that surprises teams moving to TiDB?
Hotspots. Not because they didn’t exist before, but because they become visible as “one region leader is melting,” not just “the primary is busy.”
7) How do I think about cost differences?
MariaDB tends to be cheaper for moderate workloads because you buy fewer nodes and less network. TiDB often costs more in infrastructure but may cost less than sharding engineering. The correct model includes on-call burden and migration risk, not just node counts.
8) Can I migrate with near-zero downtime?
Yes, with CDC/dual-write patterns and careful cutover. The downtime risk is rarely the copy; it’s the last-mile: schema drift, plan regressions, and operational surprises during traffic ramp.
9) What’s the single most valuable pre-migration test?
Run the top 50–200 production queries on the target system with production-like data distribution, and compare EXPLAIN ANALYZE results. Correctness tests won’t catch plan disasters.
10) What’s the single most valuable operational habit for either system?
Restore drills. Not optional. Not “we verified the file exists.” Actual restores with validation and timed steps.
Conclusion: next steps you can execute Monday
If you’re choosing between MariaDB and TiDB, decide based on what you’re willing to operate. MariaDB rewards disciplined simplicity. TiDB rewards teams who treat distributed systems as a first-class job, not an accidental hobby.
Do this next:
- Pull your top queries and run plan comparisons (MariaDB EXPLAIN, TiDB EXPLAIN ANALYZE).
- Run the fast diagnosis playbook on your worst p99 endpoint and classify the bottleneck.
- Pick one migration strategy (CDC/dual-write) and write the rollback plan before you write the cutover plan.
- Schedule a restore drill. Put it on the calendar. Make it real.
You don’t need perfect certainty. You need controlled risk, measured performance, and a system that fails in ways you already rehearsed.