MySQL vs TiDB: MySQL Compatibility vs Operational Complexity—What You’re Signing Up For

You don’t choose a database in a vacuum. You choose it at 2:13 a.m., with an on-call phone buzzing, a queue building,
and someone asking why “the database” is slow like it’s a single appliance you can reboot.

If you’re deciding between MySQL and TiDB, you’re really deciding between one big, familiar, sharp tool and
a distributed system that promises to feel familiar. Both can work. Both can hurt you. The difference is
which pain you want: MySQL’s scaling ceiling and manual partitioning work, or TiDB’s operational surface area and
distributed failure modes.

The decision that actually matters

Most “MySQL vs TiDB” debates pretend the question is about features. It’s not. It’s about who does the complexity work.

  • MySQL pushes complexity upward into your application and your team: sharding strategy, cross-shard
    queries, dual-writes, id generation, schema changes with minimal downtime, replication topology management,
    read/write split logic, and the joy of finding out your “simple” join is now a distributed join across shards.
  • TiDB pulls complexity downward into the database platform: placement rules, Raft replication,
    region splits/merges, PD scheduling, distributed transaction coordination, hotspots, and multi-component upgrades.
    Your app code gets simpler. Your on-call runbooks get thicker.

Here’s the practical heuristic I use:

  • If you can scale MySQL with replicas, caching, careful indexing, and occasional vertical scaling—and your biggest
    pain is operational discipline—stick with MySQL. It’s boring, proven, and your team already knows how it
    fails.
  • If you’ve hit the point where sharding is inevitable (or already a mess), and you need strong consistency across
    a large dataset with elastic horizontal scaling, TiDB is a reasonable bet—but only if you can staff and run it
    like a platform, not like “a DB node with friends.”

And yes, TiDB speaks the MySQL protocol. That’s the start of compatibility, not the finish line.

Facts and historical context that change your intuition

A few anchors to keep your brain honest. These are not trivia; they explain why the systems behave the way they do.

  1. MySQL’s replication started as asynchronous statement-based replication. That legacy explains why
    “replication lag” and “read-your-writes” workarounds became common cultural knowledge for MySQL teams.
  2. InnoDB became the default MySQL storage engine in MySQL 5.5. Before that, MyISAM was common, and
    that era baked “MySQL is fast but not safe” into the internet’s collective memory.
  3. Percona and MariaDB rose because operators wanted better observability and operational knobs.
    MySQL’s ecosystem history is basically “SREs demanded tools and got an industry.”
  4. Google’s Spanner (2012) set expectations for “SQL + horizontal scale + strong consistency.”
    TiDB is in the family of systems that made that dream more accessible without requiring atomic clocks.
  5. TiDB’s storage layer (TiKV) is built on RocksDB and Raft replication. That means performance and
    failure modes look more like a distributed key-value store than a single-node B-tree engine.
  6. TiDB splits data into Regions (ranges) and schedules them via PD. That region rebalancing is both
    a superpower and a source of “why did my latency spike?” moments.
  7. MySQL’s GTID and semi-sync replication were responses to real pain. They improved operational
    reliability, but they don’t turn MySQL into a distributed SQL database; they just reduce the blast radius of
    human error.
  8. Online schema change tooling (gh-ost, pt-online-schema-change) became mainstream because MySQL DDL was
    historically disruptive (a canonical invocation is sketched just after this list). In TiDB, many schema changes
    are online, but you still pay in cluster work and downstream effects.
  9. “MySQL-compatible” is a moving target. MySQL itself has multiple major versions, vendor forks,
    and behavior differences. Compatibility isn’t binary; it’s a matrix.
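
To ground fact 8: the shape of a pt-online-schema-change run looks roughly like the sketch below. The host, database,
table, ALTER text, and load thresholds are placeholders for your environment; gh-ost takes a similar shape with its
own flags, and both tools deserve a rehearsal before the real thing.

# Rehearse with --dry-run first, then swap in --execute.
pt-online-schema-change \
  --alter "ADD COLUMN shipped_at DATETIME NULL" \
  --max-load Threads_running=50 \
  --critical-load Threads_running=100 \
  --chunk-size 1000 \
  h=mysql-primary-01,D=app,t=orders \
  --execute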

MySQL in production: what breaks, what scales, what lies

What MySQL is genuinely good at

MySQL is a strong choice when the workload fits a single primary (or a small number of primaries) with replicas.
It’s straightforward to operate if you keep it simple: one writer, many readers; predictable queries; stable schema;
conservative configs; and a team that respects the database as a shared resource, not an infinite vending machine.

For many companies, MySQL’s “ceiling” is far higher than they think—because the real bottleneck was missing indexes,
huge transactions, chatty ORMs, or a job queue doing full table scans like it was paid per row.

Where MySQL makes you pay

MySQL’s scaling story for write-heavy workloads is basically: scale up, tune harder, then shard. Sharding works,
but it’s a commitment. You’ll rewrite assumptions:

  • Cross-shard transactions become application coordination or eventual consistency.
  • Global uniqueness needs IDs that won’t collide across shards (one classic approach is sketched after this list).
  • Reporting queries become ETL jobs or analytics stores.
  • Operational work shifts to managing many smaller databases, backups, and schema migrations.
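
One classic answer to the ID collision problem, assuming you control server configuration on every shard and rely on
auto-increment keys: interleave the sequences so each shard generates disjoint values. The numbers below are
illustrative for a four-shard layout; many teams use an application-side ID generator instead, which is fine too.

-- On each shard, set the stride to the shard count and the offset to this shard's position (1-based).
SET GLOBAL auto_increment_increment = 4;
SET GLOBAL auto_increment_offset    = 1;
-- Shard with offset 1 now issues 1, 5, 9, ...; offset 2 issues 2, 6, 10, ...; no overlap.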

MySQL’s failure modes are well understood: replication lag, lock contention, buffer pool pressure, disk IO stalls,
and “one query ruined everyone’s day.” The upside is that most engineers have seen these before. The downside is
that teams often normalize them and stop questioning architecture choices.

Exactly one joke, as promised: MySQL is like a kitchen knife—excellent until someone decides it’s also a screwdriver.

TiDB in production: what you gain, what you inherit

What TiDB is genuinely good at

TiDB’s pitch is attractive because it addresses the classic MySQL pain: scaling without manual sharding while keeping
SQL and transactions. In practice, the wins are real when:

  • Your dataset and write throughput outgrow a single primary and you want horizontal scaling.
  • You need strong consistency semantics without building your own cross-shard logic.
  • You can tolerate the operational overhead of running a distributed system.
  • You need to grow capacity by adding nodes rather than planning “big box” upgrades.

What TiDB makes you pay

TiDB is not “MySQL, but faster.” It is a distributed database with a MySQL-compatible interface. You now run:

  • TiDB stateless SQL layer (scale-out).
  • TiKV stateful storage layer (Raft replicated, region-based).
  • PD placement driver (cluster brain for scheduling and metadata).
  • Often TiFlash for analytics acceleration and HTAP workloads.

Each component has its own saturation points, metrics, upgrade paths, and “it’s technically healthy but effectively
degraded” states. TiDB makes scaling easier; it does not make operations simpler. It shifts the complexity from
application-layer sharding to platform-layer scheduling and coordination.

You’re also opting into a different kind of performance tuning: less “buffer pool size and query cache nostalgia,”
more “region hotspots, coprocessor tasks, and raftstore latency.”

MySQL compatibility: the fine print that bites

Protocol compatibility is not behavior compatibility

Many migrations begin with “our app speaks MySQL, so TiDB should be drop-in.” That’s true right up until your app
depends on edge-case behavior. Most apps do; they just don’t know it yet.

Compatibility gaps show up in four practical places (a quick pre-flight check is sketched after the list):

  • SQL grammar and functions: some functions differ, some are missing, and some behave subtly
    differently in corner cases.
  • Transactions and locking semantics: distributed transactions do not behave exactly like single-node
    locking under contention.
  • DDL and metadata: “online DDL” is great until it creates background load you didn’t budget for.
  • Operational assumptions: in MySQL, you can sometimes “just restart mysqld” to clear a wedged state.
    In TiDB, restarting a component may trigger rescheduling or leader elections—fine, but not free.
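
The pre-flight check mentioned above is not sophisticated: point the same client at both systems and diff the session
defaults that most often cause silent behavior drift. A minimal sketch; identical values don’t prove compatibility,
but differences here are a cheap early warning.

-- Run against MySQL and against TiDB, then compare the two result sets.
SELECT @@version,
       @@sql_mode,
       @@character_set_server,
       @@collation_server,
       @@transaction_isolation;

-- While you're at it, make ordering explicit anywhere the app assumes it:
-- "SELECT ... LIMIT 50" without ORDER BY is undefined on both systems, but it
-- tends to look stable on a single node and drift on a cluster.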

Consistency: MySQL’s simplicity vs TiDB’s coordination

MySQL with a single primary provides a clear mental model: the primary decides. Replicas follow. Strong consistency
is local to the primary; reads from replicas are “maybe stale,” and everyone learns to live with it.

TiDB aims for strong consistency across a distributed cluster. That means:

  • Writes must reach a quorum of replicas in each Raft group (Region); the configured replica count is easy to
    confirm with pd-ctl, as sketched after this list.
  • Transactions may involve multiple Regions, potentially across nodes.
  • Network jitter and slow storage nodes become query latency in new and creative ways.
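
To see what “quorum” means for your cluster in practice, ask PD for its replication settings. A sketch using the same
pd-ctl pattern as the tasks later in this article; the version tag and PD address are illustrative.

# max-replicas is the replica count per Region; location-labels are the
# failure domains PD spreads those replicas across.
tiup ctl:v7.5.1 pd -u http://10.0.2.21:2379 config show replication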

Performance surprises: the “distributed tax”

Some queries get faster in TiDB because you can scale out compute and storage. Others get slower because a query that
was a local index lookup in MySQL becomes a distributed operation involving multiple Regions.

This is where you need to be brutally honest about workload shape:

  • Single-row point lookups by primary key can be great, but hotspots can form if many clients hammer
    the same key range.
  • Large range scans can be decent if parallelized, but they can also trigger heavy coprocessor load.
  • Joins are where you pay attention. Distributed join planning and statistics quality matter more.

Operational complexity: what TiDB adds to your pager

You’re operating a database platform

With MySQL, the unit of operation is a server (or a cluster of servers with one primary). With TiDB, the unit of
operation is a system. You will manage:

  • Rolling upgrades across multiple components
  • Capacity management across compute, storage, and network
  • Failure domains (racks/AZs), placement rules, and replica counts
  • Hotspot detection and mitigation
  • Backup/restore that respects distributed consistency
  • Schema changes that trigger background data reorganization

Observability is not optional

In MySQL, you can get surprisingly far with slow query logs, performance_schema, and basic host metrics. In TiDB,
you need cluster-level observability: PD health, TiKV store latency, region balance, raft apply duration, and SQL
layer saturation.

Here’s the operational truth: if you don’t have good metrics and alerting, TiDB will still work—until it doesn’t—and
then your incident timeline will read like abstract art.

One quote (paraphrased idea)

Paraphrased idea: If you can’t measure it, you can’t reliably improve it. — attributed to W. Edwards Deming

Exactly one quote, because that’s how reliability works: limited scope, clear ownership.

Practical tasks with commands, outputs, and decisions (hands-on)

The fastest way to understand operational complexity is to touch it. Below are concrete tasks I’d expect an SRE or
DBRE to run during migration planning or an incident. Each includes: a command, sample output, what it means, and
what decision you make from it.

Task 1: Check MySQL replication lag (are reads trustworthy?)

cr0x@server:~$ mysql -h mysql-replica-01 -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Replica_IO_Running|Replica_SQL_Running"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 37

Meaning: replica is healthy but ~37s behind. Any “read after write” from this replica can be stale.

Decision: route critical reads to primary (or a low-lag replica), or fix lag before using replicas for correctness-sensitive paths.

Task 2: Find top MySQL wait events (is it locks, IO, or CPU?)

cr0x@server:~$ mysql -e "SELECT EVENT_NAME, SUM_TIMER_WAIT/1e12 AS seconds_waited FROM performance_schema.events_waits_summary_global_by_event_name ORDER BY SUM_TIMER_WAIT DESC LIMIT 5;"
EVENT_NAME	seconds_waited
wait/io/file/innodb/innodb_data_file	1843.22
wait/synch/mutex/innodb/buf_pool_mutex	912.45
wait/lock/table/sql/handler	211.10
wait/io/file/sql/binlog	199.08
wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done	133.57

Meaning: heavy InnoDB data file IO and buffer pool contention. You’re not “CPU bound”; you’re waiting.

Decision: check storage latency, buffer pool sizing, and hot tables. If this is chronic at your scale, TiDB might help—if your workload is shard-like.

Task 3: Identify MySQL queries with worst 95th percentile latency

cr0x@server:~$ pt-query-digest /var/log/mysql/mysql-slow.log --limit 3
#  1.2s user time, 60ms system time, 22.18M rss, 221.20M vsz
# Rank Query ID           Response time Calls R/Call V/M   Item
# ==== ================== ============= ===== ====== ===== ====
#    1 0x8D7A...          521.7340  38.4%  120 4.3478  0.02 SELECT orders
#    2 0x44B1...          301.0092  22.1%  560 0.5375  0.01 UPDATE inventory
#    3 0xA1E3...          198.5521  14.6%   45 4.4123  0.00 SELECT users

Meaning: two slow SELECTs dominate response time; an UPDATE is also expensive and frequent.

Decision: tune schema and queries first. Migrating a slow query to TiDB just makes it slow on more servers.

Task 4: Inspect MySQL table/index size (are you IO-bound due to bloat?)

cr0x@server:~$ mysql -e "SELECT table_schema, table_name, ROUND((data_length+index_length)/1024/1024/1024,2) AS gb FROM information_schema.tables WHERE table_schema='app' ORDER BY (data_length+index_length) DESC LIMIT 5;"
table_schema	table_name	gb
app	events	412.77
app	orders	188.34
app	order_items	141.02
app	users	62.11
app	inventory	51.89

Meaning: you have a few very large tables. Those dominate IO and buffer pool pressure.

Decision: consider partitioning, archival, or secondary storage for cold data. If access is broad and write-heavy, TiDB’s horizontal storage may help.
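
If you go the partitioning route for a table like events, time-based range partitioning is the usual starting point.
A sketch, assuming created_at is part of the primary key (MySQL requires every unique key to include the partitioning
column) and noting that adding partitioning to an existing large table rebuilds it, so treat it as heavy DDL. Table
and partition names are illustrative.

-- Range-partition by day-resolution time so cold data can be dropped cheaply.
ALTER TABLE events
  PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2024q4 VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION p2025q1 VALUES LESS THAN (TO_DAYS('2025-04-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
  );

-- Archival then becomes metadata-speed instead of DELETE-speed:
-- ALTER TABLE events DROP PARTITION p2024q4;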

Task 5: Check TiDB cluster health (is the platform stable?)

cr0x@server:~$ tiup cluster display tidb-prod
Cluster type:       tidb
Cluster name:       tidb-prod
Cluster version:    v7.5.1
ID                   Role          Host            Ports        OS/Arch       Status  Data Dir
--                   ----          ----            -----        -------       ------  --------
tidb-01              tidb          10.0.2.11        4000/10080   linux/x86_64   Up      /tidb-data/tidb-4000
pd-01                pd            10.0.2.21        2379/2380    linux/x86_64   Up      /tidb-data/pd-2379
tikv-01              tikv          10.0.3.31        20160/20180  linux/x86_64   Up      /tidb-data/tikv-20160
tikv-02              tikv          10.0.3.32        20160/20180  linux/x86_64   Up      /tidb-data/tikv-20160
tikv-03              tikv          10.0.3.33        20160/20180  linux/x86_64   Up      /tidb-data/tikv-20160

Meaning: all components are up. This doesn’t guarantee performance, but it rules out “node down” as the first-order problem.

Decision: if any TiKV or PD is down, stop tuning SQL and fix cluster health first.

Task 6: Verify PD health and leader (is the brain confused?)

cr0x@server:~$ tiup ctl:v7.5.1 pd -u http://10.0.2.21:2379 member
{
  "members": [
    {
      "name": "pd-01",
      "member_id": 1234567890,
      "client_urls": ["http://10.0.2.21:2379"],
      "peer_urls": ["http://10.0.2.21:2380"]
    }
  ],
  "leader": {
    "name": "pd-01",
    "member_id": 1234567890
  }
}

Meaning: PD is single-member here; it’s the leader. In production you usually want multiple PD nodes for HA.

Decision: if PD is not HA, fix architecture before betting the company on the cluster.

Task 7: Check TiKV store capacity and balance (are you about to fall off a cliff?)

cr0x@server:~$ tiup ctl:v7.5.1 pd -u http://10.0.2.21:2379 store
{
  "count": 3,
  "stores": [
    {"store": {"id": 1, "address": "10.0.3.31:20160"}, "status": {"capacity": "3.6TiB", "available": "0.4TiB"}},
    {"store": {"id": 2, "address": "10.0.3.32:20160"}, "status": {"capacity": "3.6TiB", "available": "0.3TiB"}},
    {"store": {"id": 3, "address": "10.0.3.33:20160"}, "status": {"capacity": "3.6TiB", "available": "0.2TiB"}}
  ]
}

Meaning: all stores are running low on free space. Compaction, snapshots, and rebalancing get riskier near full disks.

Decision: add capacity before performance tuning. “Almost full” in RocksDB-land is how you schedule your own outage.

Task 8: Find TiDB slow queries quickly (are we dealing with plan regressions?)

cr0x@server:~$ mysql -h tidb-01 -P 4000 -e "SELECT time, query_time, digest, left(query,120) AS sample FROM information_schema.cluster_slow_query ORDER BY query_time DESC LIMIT 3;"
time	query_time	digest	sample
2025-12-30T09:40:11Z	8.214	9f2d...	SELECT * FROM orders WHERE user_id=... ORDER BY created_at DESC LIMIT 50
2025-12-30T09:41:02Z	6.882	2a11...	SELECT COUNT(*) FROM events WHERE created_at BETWEEN ... AND ...
2025-12-30T09:41:33Z	5.477	0bd3...	UPDATE inventory SET qty=qty-1 WHERE sku=...

Meaning: you have slow queries similar to MySQL, but now you need to examine distributed execution (cop tasks, regions touched, etc.).

Decision: pull execution plans and check stats freshness. If this coincided with a schema or version change, suspect plan regression.
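
Two quick follow-ups for the stats-freshness question, sketched against the orders table from the sample above and
assuming your TiDB version supports filtering SHOW STATS_HEALTHY.

-- How healthy does TiDB think the statistics are? (100 = fresh, low = stale)
SHOW STATS_HEALTHY WHERE Db_name = 'app' AND Table_name = 'orders';

-- If they're stale, refresh them and re-check the plan.
ANALYZE TABLE orders;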

Task 9: Explain a TiDB query and spot distributed pain (Index? coprocessor? scatter?)

cr0x@server:~$ mysql -h tidb-01 -P 4000 -e "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id=42 ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
id: TopN_10
actRows: 50
task: root
execution info: time:1.2s, loops:2
operator info: order by:created_at, offset:0, count:50
*************************** 2. row ***************************
id: IndexLookUp_21
actRows: 5000
task: root
execution info: time:1.1s, loops:5
operator info: index:idx_user_created(user_id,created_at), table:orders
*************************** 3. row ***************************
id: IndexRangeScan_19
actRows: 5000
task: cop[tikv]
execution info: time:980ms, loops:8
operator info: range:[42,42], keep order:true

Meaning: most time is spent in cop tasks on TiKV. That points to storage latency, hotspotting, or too many regions involved.

Decision: investigate TiKV latency, region distribution for that index range, and whether the workload creates a hotspot on user_id=42-like keys.

Task 10: Check region hotspots (is one key range melting a store?)

cr0x@server:~$ tiup ctl:v7.5.1 pd -u http://10.0.2.21:2379 hot read
{
  "as_peer": {
    "stats": [
      {"store_id": 3, "region_id": 918273, "hot_degree": 97, "flow_bytes": 125829120}
    ]
  }
}

Meaning: store 3 is serving a very hot region for reads. That can dominate tail latency.

Decision: mitigate hotspot: add TiDB nodes for SQL scale, consider splitting regions, adjust application access patterns, or use caching for that hot key range.
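
If PD keeps pointing at the same key range, pre-splitting it is one mitigation worth sketching. The statement below is
illustrative: the boundaries and region count depend on your actual key distribution, and there is an INDEX variant
for splitting a hot secondary index range instead of the row keys.

-- Manually split the orders row-key range into more Regions so PD can
-- spread leaders across stores instead of serving one hot Region.
SPLIT TABLE orders BETWEEN (0) AND (100000000) REGIONS 16;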

Task 11: Check TiKV latency via Prometheus text endpoint (is storage the bottleneck?)

cr0x@server:~$ curl -s http://10.0.3.33:20180/metrics | egrep "tikv_engine_write_stall|rocksdb_compaction_pending_bytes" | head
tikv_engine_write_stall 1
rocksdb_compaction_pending_bytes 2.184532e+10

Meaning: write stall is active and compaction backlog is huge. RocksDB is behind; writes and reads will suffer.

Decision: stop blaming SQL. Fix disk throughput, reduce write amplification, review TiKV options, and consider adding stores to spread compaction load.

Task 12: Validate time sync (distributed transactions hate sloppy clocks)

cr0x@server:~$ chronyc tracking
Reference ID    : C0A80101 (ntp-01)
Stratum         : 3
System time     : 0.000143219 seconds slow of NTP time
Last offset     : -0.000012034 seconds
RMS offset      : 0.000084321 seconds
Frequency       : 10.123 ppm slow
Residual freq   : -0.002 ppm
Skew            : 0.090 ppm
Root delay      : 0.001234 seconds
Root dispersion : 0.001901 seconds
Update interval : 64.0 seconds
Leap status     : Normal

Meaning: clock is well synced (sub-millisecond offsets). Good. Poor clock sync creates weird transaction and timeout behavior in distributed systems.

Decision: if offsets are large or leap status is bad, fix NTP/chrony before chasing phantom latency spikes.

Task 13: Validate TiDB transaction conflicts (are you fighting contention?)

cr0x@server:~$ mysql -h tidb-01 -P 4000 -e "SHOW STATUS LIKE 'tikv_txn_mvcc_%conflict%';"
Variable_name	Value
tikv_txn_mvcc_conflict_counter	18422
tikv_txn_mvcc_write_conflict_counter	12011

Meaning: conflicts are happening. Under hot-row workloads, TiDB will behave like any consistent system: it forces serialization somewhere.

Decision: redesign hot-row patterns (counters, “last seen” updates), batch writes, or move contention-heavy primitives to a different store.

Task 14: Check MySQL disk latency (is the host lying to you?)

cr0x@server:~$ iostat -x 1 3 | egrep "Device|nvme0n1"
Device            r/s     w/s   r_await   w_await  aqu-sz  %util
nvme0n1         210.4   980.2     3.12    18.44    9.87  99.10
nvme0n1         198.1  1050.6     2.88    21.77   11.22  99.54
nvme0n1         205.7  1012.3     3.05    19.90   10.45  99.33

Meaning: %util pegged; write latency ~20ms. That’s enough to make both MySQL and TiKV miserable.

Decision: before migrating to anything, fix storage: isolate workloads, upgrade disks, check RAID/controller settings, and reduce write amplification.

Task 15: Check network retransmits (distributed systems magnify network sins)

cr0x@server:~$ netstat -s | egrep "segments retransmited|packet receive errors" | head
    13245 segments retransmited
    0 packet receive errors

Meaning: retransmits exist; whether it’s “bad” depends on baseline and time window, but in TiDB it can show up as raft latency and p99 spikes.

Decision: correlate with incident window. If retransmits spike, investigate NIC drops, oversubscription, or noisy neighbors.

Task 16: Confirm TiDB node saturation (SQL layer scale-out is a lever)

cr0x@server:~$ top -b -n 1 | head -n 12
top - 09:45:11 up 34 days,  3:21,  1 user,  load average: 24.12, 21.88, 19.05
Tasks: 312 total,   2 running, 310 sleeping,   0 stopped,   0 zombie
%Cpu(s): 92.1 us,  3.4 sy,  0.0 ni,  3.9 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :  64384.0 total,   1200.3 free,  45210.8 used,  17972.9 buff/cache
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
22411 tidb      20   0 14.1g  3.2g  112m S  860.2   5.1  12:11.43 tidb-server

Meaning: TiDB SQL node is CPU-hot. Since TiDB is stateless at the SQL layer, you can often scale this tier horizontally.

Decision: add TiDB nodes or reduce expensive queries; don’t immediately blame TiKV if TiDB is already melting.

Fast diagnosis playbook (first/second/third checks)

When latency spikes or throughput drops, you want to identify the bottleneck without reinventing distributed systems
theory on a whiteboard during an incident.

First: confirm the blast radius and which layer is hot

  • Is it all queries or specific endpoints? If it’s a single endpoint, suspect a query plan, stats, or
    a single table hotspot—not “the cluster.”
  • Check the SQL layer saturation: TiDB CPU, connection count, threadpool. In MySQL, check primary CPU
    and active threads.
  • Check error rates/timeouts: timeouts often precede visible saturation metrics.

Second: check storage latency and compaction/backpressure

  • MySQL: iostat, InnoDB history list length, buffer pool hit rate, fsync latency (quick checks sketched after this list).
  • TiDB/TiKV: RocksDB compaction pending bytes, write stalls, raftstore apply duration, store disk
    utilization.
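
For the MySQL side of the list above, two commands cover most of it; treat the numbers as trends against your own
baseline rather than absolute verdicts.

# Purge lag: a steadily growing history list length means long transactions
# are holding back InnoDB's MVCC cleanup.
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep "History list length"

# Buffer pool effectiveness: physical reads vs. logical read requests.
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"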

Third: check distribution pathologies

  • Hot regions: single region dominating reads/writes.
  • Leader imbalance: one store holds too many leaders.
  • Network: retransmits, packet loss, or cross-AZ latency surprises.
  • Stats and plan regressions: sudden query plan changes after deploy/DDL/analyze changes.

If you do these in order, you usually find the bottleneck fast. If you do them out of order, you can spend hours
tuning SQL while a storage node is write-stalling itself into the ground.

Three corporate mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company migrated a payments-adjacent service from MySQL to TiDB. The application used the MySQL
protocol and passed integration tests. Everyone celebrated. The team scheduled the cutover for a Tuesday because
Tuesday “isn’t a busy day,” which is something people say right before Tuesday becomes busy.

The wrong assumption was simple: they assumed their read path was safe to run at REPEATABLE READ semantics with
long-lived transactions. In MySQL, their pattern was “open transaction, read a few things, call a couple of internal
services, then write.” It was slow, but survivable on one primary and a couple replicas.

On TiDB, the same pattern created a pile-up of transaction locks and conflicts during a traffic burst. The
distributed transaction coordination amplified what used to be mild contention into p99 latency spikes and timeout
retries, which made contention worse. A retry storm is just a self-fulfilling prophecy with better logging.

The fix wasn’t exotic. They shortened transaction scopes, removed network calls from inside DB transactions, and
made the “read then write” logic idempotent with proper unique constraints. After that, TiDB behaved fine. The
compatibility was never the issue; their transactional habits were.

Mini-story 2: The optimization that backfired

Another company ran MySQL on high-end NVMe and had a performance review where someone suggested “let’s add more
indexes to make reads faster.” They added several composite indexes across a large event table. Reads improved on a
dashboard. Everyone nodded. Then the write path started to slip.

The backfire was classic: the ingestion pipeline was write-heavy, and each index turned every insert into additional
write amplification. InnoDB started spending more time maintaining secondary indexes than doing the actual work. The
binlog grew, replication lag increased, and the team started routing more reads to the primary “temporarily,” which
made the primary even busier. Temporarily is how outages become habits.

They tried to “solve” it by moving to TiDB, expecting horizontal scaling to erase the problem. It didn’t. TiDB
distributed the write amplification across TiKV nodes, but it also increased compaction pressure. The cluster stayed
up, but tail latency went sideways during peak ingest.

The eventual fix was boring: remove low-value indexes, introduce rollup tables for dashboard queries, and move some
analytic workloads to TiFlash. The lesson wasn’t “indexes are bad.” It was: optimization without understanding your
write path is just sabotage with good intentions.

Mini-story 3: The boring but correct practice that saved the day

A third org ran TiDB for a customer-facing workload and MySQL for several internal tools. Their TiDB cluster had a
strict practice: quarterly game days and monthly restore drills. Not “we restored once in staging,” but “we can
restore a consistent snapshot and bring the app up against it.”

One morning, a deploy introduced a subtle bug: a background job wrote incorrect data to a critical table. The bug
wasn’t immediately obvious. It passed unit tests and only manifested under certain data distributions. By the time
it was caught, bad writes had been happening for hours.

Because they had practiced it, the team did not panic-mutate production. They halted the job, took an incident
snapshot for forensics, and restored to a known-good point-in-time backup into a parallel cluster. Then they replayed
a scrubbed subset of correct writes. The service stayed partially available, and customer impact was limited.

Nobody got applause for “we ran restore drills.” But it worked because it was boring. Reliability is rarely clever.
It’s mostly rehearsal and refusing to skip the annoying steps.

Common mistakes: symptom → root cause → fix

1) Symptom: “TiDB is slow” only for one tenant or one user_id

Root cause: hotspotting on a narrow key range (single Region, single leader) due to monotonically increasing IDs or tenant-keyed access patterns.

Fix: redesign keys (hash prefix, scatter), enable region splitting strategy, add caching, or separate hot tenants. Confirm with PD hot region outputs.
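
Two TiDB-specific levers for the key-design part of that fix, sketched with hypothetical table definitions. Both
change how row keys are generated, so they belong in a planned migration, not a hotfix.

-- Option 1: let TiDB scatter auto-generated primary keys instead of
-- allocating them monotonically (new table with a clustered integer PK).
CREATE TABLE events_v2 (
  id BIGINT PRIMARY KEY AUTO_RANDOM,
  tenant_id BIGINT NOT NULL,
  payload JSON
);

-- Option 2: for a table without an integer primary key, shard the hidden
-- row ID and pre-split Regions so inserts don't all land on one Region.
CREATE TABLE audit_log (
  tenant_id BIGINT NOT NULL,
  created_at DATETIME NOT NULL,
  note TEXT,
  KEY idx_tenant_created (tenant_id, created_at)
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 3;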

2) Symptom: p99 latency spikes during peak writes; CPU looks fine

Root cause: TiKV RocksDB compaction backlog and write stalls; storage can’t keep up with write amplification.

Fix: increase disk throughput, add TiKV nodes, reduce index count, batch writes, review compaction settings and compression choices.

3) Symptom: MySQL primary looks healthy but replicas lag unpredictably

Root cause: large transactions, heavy write bursts, or IO-bound replicas; sometimes single-threaded SQL apply becomes the limiter.

Fix: break up transactions, tune replication parallelism, reduce binlog churn (indexes, row format), and validate storage latency on replicas.
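
For the parallelism lever, the relevant knobs on recent MySQL 8.0 releases look roughly like this (older versions use
the slave_* names). A sketch to validate against your exact version; apply the first statement on the source and the
rest on the replica.

-- Source: track dependencies by write set so independent transactions
-- can be applied in parallel on replicas.
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- Replica: stop the applier, allow parallel workers, keep commit order.
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA SQL_THREAD;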

4) Symptom: Deadlocks or lock timeouts increase after moving to TiDB

Root cause: long transactions and high contention rows become more expensive in distributed commit paths; retries amplify load.

Fix: shorten transactions, avoid “select then update” patterns, use optimistic concurrency patterns, and make retries jittered and bounded.

5) Symptom: Schema change causes cluster-wide latency and elevated IO

Root cause: online DDL still triggers background work (backfill/reorg) and increases read/write load across TiKV.

Fix: schedule DDL during low traffic, throttle if available, ensure headroom, and watch TiKV compaction/raft apply metrics during DDL.

6) Symptom: After migration, some queries return different results

Root cause: relying on MySQL-specific behavior: implicit casts, collation differences, non-deterministic GROUP BY usage, or undefined ordering without ORDER BY.

Fix: make SQL explicit: proper ORDER BY, strict SQL modes where possible, explicit casts, verify collations, add tests for edge-case queries.

7) Symptom: TiDB cluster “healthy” but throughput drops after a node failure

Root cause: leader re-election and rebalancing changed region leader distribution; surviving stores are overloaded.

Fix: ensure capacity headroom, use placement rules aligned with failure domains, and rebalance leaders. Validate with PD store and hot region outputs.

8) Symptom: MySQL is “fine” until backups run, then everything crawls

Root cause: backup IO contention and snapshot overhead; backup tooling competing with production IO and cache.

Fix: run backups from replicas, throttle backup IO, schedule off-peak, isolate backup traffic/storage, and verify restore capability regularly.
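
Concretely, both ecosystems have a throttle you should know exists before your first backup-induced slowdown. The
flags below are real, but the paths, rates, hosts, and bucket names are placeholders.

# MySQL (Percona XtraBackup): run on a replica and cap backup IO operations per second.
xtrabackup --backup --throttle=100 --target-dir=/backups/mysql/2025-12-30

# TiDB (BR): back up through PD, with a per-store rate limit in MiB/s.
tiup br backup full \
  --pd "10.0.2.21:2379" \
  --storage "s3://backups/tidb-prod/2025-12-30" \
  --ratelimit 128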

Checklists / step-by-step plan

Checklist A: When you should stay on MySQL (and sleep better)

  • Your primary fits in a single strong box and write QPS isn’t exploding.
  • You can use replicas, caching, and query/index tuning to meet SLOs.
  • You don’t need strong consistency across multiple primaries.
  • Your team is not staffed to run a distributed database platform 24/7.
  • You can accept read replica staleness for non-critical reads.

Checklist B: When TiDB is a rational move (not a resume-driven one)

  • You are already sharding MySQL or about to, and it’s becoming product risk.
  • You need horizontal scaling with transactional semantics.
  • You can commit to observability, on-call training, and upgrade discipline.
  • You can provision enough nodes to keep headroom (compute and storage).
  • You have a plan for hotspots, not just a hope.

Step-by-step migration plan (pragmatic version)

  1. Inventory “MySQL-isms” in your app: SQL modes, implicit casts, reliance on undefined ordering, non-standard functions.
  2. Baseline performance: top queries by latency and by total time; transaction sizes; peak write throughput.
  3. Design for contention: identify hot rows, counters, and “last updated” patterns; redesign before migration.
  4. Provision TiDB with headroom: don’t start at “just enough.” Distributed systems punish tight margins.
  5. Decide placement and failure domains: replicas across AZs/racks; PD quorum; TiKV count.
  6. Load test with production-like data and skew: uniform synthetic load is how you miss hotspots.
  7. Run dual reads carefully: compare results for critical queries; watch for collation/cast differences.
  8. Plan cutover with rollback: either dual-write with verification or a controlled freeze window.
  9. Run restore drills before go-live: backup without restore is theater.
  10. Train on-call: “what is a region hotspot” should not be discovered during an incident.
  11. After cutover, freeze DDL and tuning changes: observe stable behavior before you “optimize.”
  12. Set SLO-driven alerts: latency and error rates first, then saturation signals (CPU, compaction, leader balance).

Second and final joke: A distributed database is just like a regular database, except it can fail in more places at
once.

FAQ

1) Is TiDB really “drop-in” for MySQL applications?

Drop-in for the protocol and many SQL patterns, yes. Drop-in for behavior, no. Assume you’ll find edge cases:
collations, implicit casts, transaction scope patterns, and performance differences under skewed workloads.

2) If I’m not sharding today, should I still move to TiDB “to be safe”?

Usually no. If MySQL meets your SLOs with sane operational practices, stay. TiDB is a strategic move when sharding or
multi-primary needs are becoming unavoidable product risk.

3) Will TiDB automatically fix my slow queries?

It can help if your bottleneck is single-node capacity and the query parallelizes well. It will not fix missing
indexes, bad query patterns, or huge transactions. In fact, distributed execution can make some bad queries more
expensive.

4) What’s the most common TiDB performance failure mode in real life?

Hotspots and storage backpressure. Hotspots concentrate load on a few regions/stores. Backpressure shows up as
RocksDB compaction backlog and write stalls, which then cascade into query latency.

5) What’s the most common MySQL performance failure mode in real life?

Lock contention and IO stalls, often triggered by one “innocent” query or a migration running at the wrong time.
Replication lag is the runner-up that quietly breaks correctness assumptions.

6) How do backups differ operationally?

MySQL backups are conceptually simpler: one primary dataset, plus replicas. TiDB backups must capture distributed
consistency across stores. In both cases, the only meaningful test is restoring and running the application against
it.

7) Can TiDB replace read replicas and read/write splitting?

TiDB can scale reads by adding TiDB nodes, and it can also use followers in some configurations. But you still need
to think about where load lands: SQL layer CPU, TiKV hotspots, and network. It’s not “no splitting,” it’s “different
splitting.”
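
The “followers in some configurations” part refers to follower reads, which can be enabled per session. A sketch,
assuming a TiDB version that supports the tidb_replica_read variable; the query is illustrative.

-- Spread read load: allow this session to read from Raft followers
-- instead of always hitting the Region leader; writes are unaffected.
SET SESSION tidb_replica_read = 'follower';

-- A reporting-style query that benefits from being served off the leader:
SELECT COUNT(*) FROM events WHERE created_at >= NOW() - INTERVAL 1 DAY;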

8) What about schema changes—easier in TiDB?

Often easier from an application availability standpoint, because many DDL operations are online. But they still
create background work that can degrade performance if you don’t have headroom and good observability.

9) Do I need a dedicated platform team to run TiDB?

You don’t need a huge team, but you need clear ownership, on-call competence, and a culture of rehearsal. If your org
treats databases as pets, TiDB will become an expensive pet with opinions.

10) What’s a reasonable pilot approach?

Start with a service that has clear key access patterns, strong test coverage, and limited “clever SQL.” Mirror
traffic for reads, validate results, then progressively move write paths with a rollback plan.

Conclusion: practical next steps

If you remember one thing: MySQL and TiDB are both serious databases. The difference is where the complexity lives
and how it fails under stress.

  • If you’re on MySQL: do the unglamorous work first—index hygiene, transaction scope discipline,
    replica lag monitoring, and tested restores. You may buy years of runway without a migration.
  • If you’re evaluating TiDB: run a workload-shaped pilot. Validate not just functional correctness
    but tail latency under skew, hotspot behavior, and what happens during node loss and rolling upgrades.
  • If you’re already migrating: assume compatibility gaps exist and hunt them deliberately. Remove
    long transactions, make SQL explicit, and build an incident playbook before production makes one for you.

Pick the system whose failure modes you can diagnose quickly, whose operational work you can staff honestly, and
whose “simple” story still holds at peak load. That’s the grown-up choice.
