You bought “three small servers” because budgets love threes. Then you deployed a distributed SQL database because marketing loves the word “resilient.”
Now your dashboard’s p95 latency looks like it started taking the scenic route, and every incident review includes the phrase “but it worked fine locally.”
This is the latency tax nobody puts on the slide: distributed correctness has a very real per-query cost, and small servers are where that cost becomes loud.
Not catastrophic. Just loud enough to ruin your sleep.
The latency tax: what you pay for consensus
If you remember one thing, make it this: MySQL’s default OLTP path is “do the write locally, then replicate later.”
CockroachDB’s default OLTP path is “coordinate the write with a quorum, then commit.”
On a decent network and healthy disks, quorum writes are fine. On small servers with commodity storage and 1Gbps networking (or worse: virtualized “best effort”),
quorum writes become a latency amplifier. Not always in median latency. In tail latency. The kind users notice and SREs inherit.
The operational difference isn’t subtle:
- MySQL: A single primary can be extremely fast. It is also a single point where physics is kind and correctness is local. Replicas are for read scale and failover. Consistency is a policy choice.
- CockroachDB: Every write participates in consensus for the range it touches. Consistency is the default, not a bolt-on. Latency is the fee.
A practical framing: if your app is sensitive to p99 latency and you’re running small nodes, distributed SQL can be like hiring a committee to approve every email you send.
One quote that survives a lot of incident reviews: "Everything fails, all the time" (Werner Vogels).
He wasn’t talking about your three-node cluster specifically, but he might as well have been.
Where the tax shows up (and why it surprises people)
The surprise isn’t that a distributed system has overhead. The surprise is where it manifests.
You’ll look at CPU and see plenty of headroom. You’ll look at average query time and feel okay. Then you’ll open the histogram and see p95/p99 spiking.
Common places the tax appears:
- Quorum writes: a commit needs acknowledgements from a majority of replicas for the range. That’s at least one round trip, sometimes more.
- Read consistency: strongly consistent reads can need coordination (leaseholder placement matters). Stale reads can be faster, but you must opt in and understand the consequences (see the follower-read sketch after this list).
- Range splits and rebalances: background work isn’t “free.” On small boxes, it competes directly with foreground latency.
- Compactions: LSM-based storage engines trade write amplification for read efficiency. That amplification becomes a disk and CPU tax at inconvenient times.
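In CockroachDB, that opt-in usually takes the shape of a follower read: you accept a few seconds of staleness so a nearby replica can answer instead of the leaseholder. A minimal sketch, assuming a placeholder table app.orders with placeholder columns, using the same insecure-mode CLI as the tasks below (follower-read availability can depend on version and licensing, so verify on your cluster):
cr0x@server:~$ cockroach sql --insecure -e "SELECT id, status FROM app.orders AS OF SYSTEM TIME follower_read_timestamp() WHERE customer_id = 42;"
The clause buys you exactly the coordination you are choosing not to pay for: no leaseholder hop, bounded staleness instead.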
Two numbers that should haunt your sizing spreadsheet
You can’t “optimize” away physics, only route around it.
- Network RTT between nodes: every millisecond matters when your commit path includes cross-node coordination. A 1–2ms RTT sounds fine until you need multiple hops under load.
- Storage fsync latency: if your storage layer occasionally turns fsync into a 20–100ms adventure, distributed consensus will faithfully incorporate that adventure into user-visible latency.
Joke #1: A three-node cluster on tiny VMs is like a three-person relay race where everyone insists on tying their shoes between handoffs.
Facts and history you can use in arguments
These aren’t trivia for a pub quiz. They explain why the systems behave the way they do.
- MySQL shipped in the mid-1990s and became the default for web OLTP because it was easy to deploy, easy to replicate, and fast on a single box.
- InnoDB became the default MySQL storage engine (after years of MyISAM dominance), bringing crash recovery, row-level locking, and real transactions to mainstream deployments.
- Most MySQL HA patterns historically favored asynchronous replication because it’s simple and keeps primary latency low; the tradeoff is potential data loss on failover.
- Consensus algorithms like Raft were designed to make distributed consensus easier to understand and implement than Paxos, not free in latency.
- CockroachDB was built with Raft from the beginning, aiming for PostgreSQL-like SQL semantics with a shared-nothing, geo-distributed design.
- Google Spanner popularized “distributed SQL with strong consistency” using TrueTime; CockroachDB targets similar goals without specialized clock hardware, which affects how uncertainty and transaction retries show up.
- LSM-tree storage engines became common in distributed databases because they handle high write throughput well, but they require compaction—i.e., background I/O that can bite tail latency.
- The CAP theorem era made “consistency vs availability” a boardroom conversation; distributed SQL systems typically pick consistency and partition tolerance, forcing availability compromises during partitions.
- p99 latency became an SRE obsession because user experience is dominated by tail behavior, not averages; distributed coordination is a classic tail-latency generator.
Why small servers make distributed SQL look worse
“Small servers” usually means some combination of: fewer cores, less RAM, slower NVMe (or worse: network-attached), less predictable neighbors, and a network that was sized for “normal” traffic.
Distributed SQL does not hate small servers. It just refuses to pretend they are big servers.
Small nodes don’t just reduce capacity; they reduce slack
Slack is the invisible margin that keeps tail latency stable.
On a big machine, compactions, flushes, or minor network jitter might be absorbed.
On a small node, the same background work competes with foreground traffic and wins more often than you’d like.
Three-node clusters are operationally seductive (and mechanically harsh)
Three nodes is the minimum shape that looks like “a cluster.” It’s also the minimum shape where any disturbance can become user-visible.
Lose one node and you’re in “no more failures please” mode.
- MySQL on three nodes typically means 1 primary + 2 replicas. The primary stays hot; replicas are mostly passengers until failover.
- CockroachDB on three nodes means all three are in the write path (quorum). Every node’s health matters every minute.
The quiet killer: unevenness
Small fleets are rarely homogeneous. One node is on a slightly noisier hypervisor.
One disk is slightly slower.
One NIC is sharing interrupts with something rude.
Distributed systems tend to run at the pace of their worst participant—especially at p99.
MySQL: the low-latency baseline with sharp edges
MySQL’s superpower is boring: one machine, one commit log, one cache hierarchy.
If your workload fits on one primary, MySQL can deliver very low latency with predictable behavior.
You can absolutely run MySQL badly, but you have to work at it.
What MySQL is great at (on small servers)
- Low-latency single-row writes when your schema and indexes are sane.
- Stable p95/p99 when storage is decent and you avoid contention hotspots.
- Operational simplicity for teams that can tolerate a single primary and have a clear failover plan.
MySQL’s failure modes are not subtle
- Primary saturation: you don’t get graceful degradation; you get a queue.
- Replication lag: read scaling and failover safety depend on a lagging async stream unless you pay the sync cost.
- Failover correctness: split brain is avoidable, but not automatic; you need tooling and discipline.
A MySQL reality check
If you need strong consistency across multiple writers in multiple locations, MySQL won’t magically become a distributed database.
You can build that with semi-sync replication, group replication, or external coordination—but you’ll earn every pager.
CockroachDB: correctness at scale, even when you didn’t need scale
CockroachDB’s pitch is simple: keep SQL, lose the single-primary fragility.
Writes are replicated. Reads can be served from the “right” place. Failovers are automated and fast.
It is real engineering, not fairy dust.
What CockroachDB is great at (even on small servers)
- Surviving node failures without manual failover rituals.
- Scaling reads and writes by adding nodes (within reason and with attention to locality).
- Consistent semantics that keep application logic simpler than hand-rolled replication correctness.
What CockroachDB demands from you
- Respect for latency: you’re coordinating, so measure network and disk like you mean it.
- Understanding of contention: hotspots don’t disappear in a cluster; they become distributed hotspots with retries.
- Careful locality decisions: leaseholders and replicas must be placed intentionally if you care about latency.
Transaction retries: the “feature” that feels like a bug
CockroachDB will retry transactions under contention or uncertainty. That’s part of how it preserves correctness.
The symptom is application-level errors (or driver retries) and latency spikes that look random until you correlate them with contention metrics.
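If your driver or framework does not retry for you, the documented client-side pattern is a savepoint named cockroach_restart plus a retry when the server returns SQLSTATE 40001. A minimal SQL sketch; the UPDATE and its table are placeholders:
BEGIN;
SAVEPOINT cockroach_restart;
-- transactional work goes here (placeholder statement)
UPDATE app.orders SET status = 'paid' WHERE id = 1001;
-- if every statement succeeded:
RELEASE SAVEPOINT cockroach_restart;
COMMIT;
-- if anything failed with SQLSTATE 40001 ("restart transaction"):
-- ROLLBACK TO SAVEPOINT cockroach_restart; then re-issue the work and attempt RELEASE/COMMIT again
The syntax matters less than the policy: retries must be bounded and measured, because every retry is latency your user is already experiencing.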
Joke #2: Distributed SQL is the only product that can sell you “no single point of failure” and still fail your query because two clocks disagreed politely.
Workload-by-workload: who wins and why
Simple OLTP (single-row reads/writes, tight p99 SLO)
On small servers, MySQL usually wins on raw latency. The commit is local. The cache is local. The code path is mature and short.
CockroachDB can meet reasonable SLOs here, but you’ll feel the quorum and storage engine costs more, especially if your nodes are underpowered or your network is twitchy.
If your product lives or dies on 5–20ms p99 for small transactions, don’t casually swap in a consensus-based database and hope it “tunes out.”
High availability with automatic failover
CockroachDB wins on operational behavior. If you’re not staffed to run crisp MySQL failover drills, you will eventually fail over badly.
CockroachDB makes the “boring” thing (node loss) boring.
MySQL can be highly available, but you must implement and rehearse it. The technology exists. The organizational consistency often doesn’t.
Write scaling (multiple writers, horizontal growth)
MySQL’s horizontal write scaling is not its native comfort zone.
You can shard, but then your app becomes a routing layer. That’s a legitimate architecture, but it’s a commitment.
CockroachDB is designed to scale writes by splitting ranges and distributing them. On small nodes, you may still be limited by per-node disk and CPU,
but at least the design goal matches the requirement.
Read scaling
MySQL read replicas are blunt but effective. They’re also operationally forgiving: if a replica lags, you route around it.
CockroachDB can distribute reads too, but you must understand leaseholders and locality; otherwise reads bounce around the cluster and pay extra hops.
Multi-region latency
If you’re truly multi-region, the speed of light becomes your product manager. Strong consistency across oceans is expensive.
CockroachDB gives you tools to place data and control read/write locality, but if you demand single-digit milliseconds globally, your demand is wrong.
MySQL in multi-region often becomes a primary-in-one-region truth with async replicas elsewhere. That’s fast locally and “eventually correct” remotely.
Sometimes that’s exactly what you want.
Schema evolution and operational friction
Both have schema change pitfalls. MySQL has decades of operational folk wisdom for online schema changes.
CockroachDB supports online schema changes but can still surprise you with backfills and their impact on small clusters.
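On CockroachDB, those backfills at least show up as jobs you can watch and, if needed, pause. For example:
cr0x@server:~$ cockroach sql --insecure -e "SELECT job_id, job_type, status, fraction_completed FROM [SHOW JOBS] WHERE job_type = 'SCHEMA CHANGE';"
If a backfill lands in peak hours on a small cluster, PAUSE JOB now and RESUME JOB off-peak is often the cheapest mitigation available.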
Practical tasks: commands, outputs, and the decision they drive
You don’t fix distributed latency with vibes. You fix it with measurements that point to the slowest part of the commit path:
network, disk, contention, or topology.
Below are concrete tasks. Each one includes (1) a command, (2) representative output, (3) what it means, and (4) the decision you make.
Run them on your real nodes. Preferably during a load test that resembles production, not during a quiet Tuesday afternoon.
Task 1: Confirm network RTT between database nodes
cr0x@server:~$ ping -c 5 db-node-2
PING db-node-2 (10.0.0.12) 56(84) bytes of data.
64 bytes from 10.0.0.12: icmp_seq=1 ttl=64 time=0.78 ms
64 bytes from 10.0.0.12: icmp_seq=2 ttl=64 time=0.91 ms
64 bytes from 10.0.0.12: icmp_seq=3 ttl=64 time=1.02 ms
64 bytes from 10.0.0.12: icmp_seq=4 ttl=64 time=0.85 ms
64 bytes from 10.0.0.12: icmp_seq=5 ttl=64 time=0.80 ms
--- db-node-2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4092ms
rtt min/avg/max/mdev = 0.78/0.87/1.02/0.09 ms
Meaning: Sub-millisecond to ~1ms RTT is decent for a small cluster. If you see 3–10ms inside one “datacenter,” expect p99 pain under quorum writes.
Decision: If RTT is high or jittery, fix network placement (same rack/zone), NIC offload settings, and noisy-neighbor issues before blaming the database.
Task 2: Measure TCP latency and retransmits under load
cr0x@server:~$ ss -ti dst 10.0.0.12:26257 | head -n 12
ESTAB 0 0 10.0.0.11:52144 10.0.0.12:26257
cubic wscale:7,7 rto:204 rtt:1.3/0.4 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:812345 segs_out:12034 segs_in:11877 send 89.2Mbps lastsnd:12 lastrcv:12 lastack:12 pacing_rate 178Mbps retrans:3/12034
Meaning: rtt shows observed RTT and variance. retrans indicates packet loss or congestion.
Decision: If retransmits climb during spikes, your tail latency isn’t “database tuning.” It’s congestion. Fix QoS, bandwidth, MTU mismatches, or noisy east-west traffic.
Task 3: Check disk latency distribution (the real villain)
cr0x@server:~$ iostat -x 1 5
Linux 6.5.0 (db-node-1) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.20 0.00 6.40 3.60 0.00 71.80
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme0n1 210.0 8120.0 0.0 0.0 1.20 38.7 480.0 24576.0 0.0 0.0 14.80 51.2 7.40 92.0
Meaning: w_await at ~15ms with %util ~92% means writes are queuing. That’s where p99 goes to die.
Decision: If disk util is consistently high or await spikes, move to faster storage, separate WAL devices, reduce write amplification, or lower compaction pressure before changing SQL knobs.
Task 4: Validate filesystem and mount options
cr0x@server:~$ findmnt -no TARGET,SOURCE,FSTYPE,OPTIONS /var/lib
/var/lib /dev/nvme0n1p2 ext4 rw,relatime,discard,data=ordered
Meaning: Options like discard can add latency on some setups; ext4 defaults are usually fine, but the details matter.
Decision: If you see network filesystems, thin-provisioning weirdness, or unsafe options, fix storage first. Distributed consensus doesn’t forgive flaky disks.
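If discard is your suspect, the usual move is to drop it from the mount and rely on periodic TRIM instead. A hedged example; confirm the mount point and your distro's fstrim unit before copying:
cr0x@server:~$ sudo mount -o remount,nodiscard /var/lib
cr0x@server:~$ sudo systemctl enable --now fstrim.timer
Also remove discard from /etc/fstab, or the next reboot quietly reverts your experiment.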
Task 5: Check CPU steal time (virtualization tax)
cr0x@server:~$ mpstat 1 5 | tail -n 7
12:22:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:22:11 PM  all   15.40    0.00    6.90    2.10    0.00    4.80    0.00    0.00    0.00   70.80
12:22:12 PM  all   17.20    0.00    8.10    2.40    0.00    6.20    0.00    0.00    0.00   66.10
12:22:13 PM  all   14.80    0.00    6.30    1.90    0.00    7.60    0.00    0.00    0.00   69.40
Meaning: %steal here is 0.00. If it's non-trivial, the hypervisor is borrowing your CPU time, and that inflates coordination latency.
Decision: High steal → move to dedicated instances, pin vCPUs, or accept that your p99 is now a shared resource with your neighbors’ batch jobs.
Task 6: Measure fsync behavior quickly with fio
cr0x@server:~$ fio --name=fsync-test --directory=/var/lib/dbtest --size=1G --bs=4k --rw=write --ioengine=sync --fdatasync=1 --runtime=30 --time_based --direct=1
fsync-test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.33
Starting 1 process
fsync-test: (groupid=0, jobs=1): err= 0: pid=22811: Tue Dec 30 12:23:40 2025
write: IOPS=4200, BW=16.4MiB/s (17.2MB/s)(492MiB/30001msec); 0 zone resets
clat (usec): min=70, max=24500, avg=220, stdev=410
clat percentiles (usec):
| 95.00th=[ 310], 99.00th=[ 950], 99.90th=[ 8500]
Meaning: The 99.90th at 8.5ms is okay-ish; if you see 50–200ms at the 99.9th, your database will periodically “freeze” at p99.
Decision: Bad fsync tail → change storage class, stop trusting write caches that lie about flushes, ensure proper power-loss protection, or move the WAL (Pebble) or InnoDB redo logs to better media.
Task 7: MySQL—check InnoDB log flush policy
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';"
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1 |
+--------------------------------+-------+
Meaning: 1 is safest: flush on every commit. 2 or 0 can lower latency but risks losing recent commits on crash.
Decision: If you’re comparing MySQL to CockroachDB, keep this at 1 for honesty. If you change it, write the risk into your incident runbook.
Task 8: MySQL—verify semi-sync is actually on (or not)
cr0x@server:~$ mysql -e "SHOW STATUS LIKE 'Rpl_semi_sync_master_status';"
+-----------------------------+-------+
| Variable_name | Value |
+-----------------------------+-------+
| Rpl_semi_sync_master_status | OFF |
+-----------------------------+-------+
Meaning: If semi-sync is OFF, primary commits are not waiting for replica acknowledgement. That’s why MySQL is fast, and also why failover can lose data.
Decision: If the business requires “no data loss,” you either enable semi-sync (paying latency) or you stop pretending async replication meets that requirement.
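For reference, "enable semi-sync" on the primary looks like the classic plugin setup below. MySQL 8.0.26+ spells these names with "source" instead of "master", and the 1000ms timeout is a placeholder you must choose deliberately, because it defines when the primary silently falls back to async:
cr0x@server:~$ mysql -e "INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';"
cr0x@server:~$ mysql -e "SET GLOBAL rpl_semi_sync_master_enabled = 1; SET GLOBAL rpl_semi_sync_master_timeout = 1000;"
Replicas need the matching replica-side plugin enabled too, and that fallback behavior is exactly what bites people during network trouble (see mini-story 2).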
Task 9: MySQL—spot replication lag and decide what “fresh” means
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Replica_IO_Running|Replica_SQL_Running"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 42
Meaning: 42 seconds behind is not “a little.” It’s a different database.
Decision: Route reads that need freshness to primary, tune replication (parallel apply, schema/index fixes), or accept eventual consistency explicitly in application behavior.
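Parallel apply is usually the first tuning lever worth pulling. A hedged example on the replica; 4 workers is an arbitrary starting point, and pre-8.0.26 versions spell this slave_parallel_workers:
cr0x@server:~$ mysql -e "STOP REPLICA SQL_THREAD; SET GLOBAL replica_parallel_workers = 4; START REPLICA SQL_THREAD;"
If lag persists while the workers sit idle, the bottleneck is more likely a single hot table, row-based events against tables without a primary key, or plain I/O saturation.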
Task 10: CockroachDB—check node health and liveness
cr0x@server:~$ cockroach node status --insecure
id | address | sql_address | build | started_at | updated_at | is_available | is_live
-----+-------------------------------+---------------+---------+----------------------------------+----------------------------------+--------------+----------
1 | db-node-1:26257 | db-node-1:26257 | v23.2.6 | 2025-12-30 11:02:17.123456+00:00 | 2025-12-30 12:25:01.123456+00:00 | true | true
2 | db-node-2:26257 | db-node-2:26257 | v23.2.6 | 2025-12-30 11:02:19.223456+00:00 | 2025-12-30 12:25:00.923456+00:00 | true | true
3 | db-node-3:26257 | db-node-3:26257 | v23.2.6 | 2025-12-30 11:02:21.323456+00:00 | 2025-12-30 12:25:01.003456+00:00 | true | true
Meaning: All nodes live and available. If one flips to unavailable intermittently, expect write latency spikes and transaction retries.
Decision: Fix node stability before tuning SQL. “It flaps sometimes” is not a minor detail in consensus systems.
Task 11: CockroachDB—inspect range distribution and under-replication
cr0x@server:~$ cockroach node status --ranges --insecure
  id |     address     | is_available | is_live | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated
-----+-----------------+--------------+---------+------------------+-----------------------+--------+--------------------+------------------------
   1 | db-node-1:26257 | true         | true    |              190 |                   190 |    540 |                  0 |                      0
   2 | db-node-2:26257 | true         | true    |              145 |                   145 |    540 |                  0 |                      0
   3 | db-node-3:26257 | true         | true    |              205 |                   205 |    540 |                  0 |                      0
Meaning: Reasonably balanced. If one node holds most leaseholders, it becomes the latency hot spot for reads/writes on those ranges.
Decision: If skewed, adjust locality constraints, add capacity, or investigate why rebalancing is stuck (often disk or network).
Task 12: CockroachDB—look for contention and transaction retries in SQL stats
cr0x@server:~$ cockroach sql --insecure -e 'SELECT application_name, sum("count") - sum(first_attempt_count) AS retries, sum("count") AS stmts FROM crdb_internal.node_statement_statistics GROUP BY application_name ORDER BY retries DESC LIMIT 5;'
  application_name | retries | stmts
-------------------+---------+--------
  web-api          |    1842 | 98412
  worker           |     210 | 12001
  console          |       0 |  1422
Meaning: Your API is getting retried a lot. That is time the user experiences as latency (even if the transaction ultimately succeeds).
Decision: Find hot keys, reduce contention (shard counters, batch writes differently), and ensure your client retries are configured correctly rather than failing loudly.
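The classic contention fix is the sharded counter: split one hot row into N rows and sum them on read. A sketch with placeholder names; 8 shards is arbitrary:
CREATE TABLE page_views (page_id INT, shard INT, views INT NOT NULL DEFAULT 0, PRIMARY KEY (page_id, shard));
-- writers hit a random shard, spreading updates across 8 rows instead of one hot row
INSERT INTO page_views (page_id, shard, views) VALUES (42, floor(random() * 8)::INT, 1)
  ON CONFLICT (page_id, shard) DO UPDATE SET views = page_views.views + 1;
-- readers sum the shards
SELECT sum(views) FROM page_views WHERE page_id = 42;
You trade one contended key for a slightly more expensive read, which on a hot-counter workload is almost always a good deal.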
Task 13: CockroachDB—identify leaseholder locality for a hot table
cr0x@server:~$ cockroach sql --insecure -e "SHOW RANGES FROM TABLE app.orders WITH DETAILS;"
start_key | end_key | range_id | replicas | lease_holder
------------+---------+----------+----------+--------------
NULL | /1000 | 84 | {1,2,3} | 2
/1000 | /2000 | 91 | {1,2,3} | 2
/2000 | /3000 | 99 | {1,2,3} | 1
Meaning: Leaseholders concentrated on node 2 means node 2 handles the “coordination” for those ranges. If node 2 is slower, everyone is slower.
Decision: Pin leaseholders via zone configs/locality, or fix node 2’s disk/network/CPU so it stops being the cluster’s mood ring.
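Pinning means a zone configuration with lease preferences. The locality tier below (+zone=rack1) is a placeholder and must match a --locality value your nodes were actually started with:
cr0x@server:~$ cockroach sql --insecure -e "ALTER TABLE app.orders CONFIGURE ZONE USING num_replicas = 3, lease_preferences = '[[+zone=rack1]]';"
Preferences are advisory, not magic: if the preferred node is unhealthy, leases move anyway, which is usually what you want.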
Task 14: Compare query latency distributions from the client side
cr0x@server:~$ pt-query-digest --type slowlog /var/log/mysql/mysql-slow.log | head -n 12
# 2.1s user time, 20ms system time, 28.11M rss, 100.00% CPU
# Query_time distribution
# 1us
# 10us
# 100us ################################################
# 1ms ###############################################
# 10ms ######
# 100ms ##
# 1s #
Meaning: The distribution tells you if you have a small number of pathological slow queries or broad tail degradation.
Decision: Pathological few → index/plan fixes. Broad tail → look at storage/network/lock contention and background work.
Task 15: Check Linux scheduler and IRQ imbalance (small-node classic)
cr0x@server:~$ cat /proc/interrupts | egrep "eth0|nvme" | head -n 6
48: 18823401 0 0 0 IR-PCI-MSI 524288-edge eth0-TxRx-0
49: 0 0 0 0 IR-PCI-MSI 524289-edge eth0-TxRx-1
50: 0 0 0 0 IR-PCI-MSI 524290-edge eth0-TxRx-2
51: 0 0 0 0 IR-PCI-MSI 524291-edge eth0-TxRx-3
Meaning: If all interrupts land on one CPU (like above), you can get weird latency under load even with “low” average CPU usage.
Decision: Enable irqbalance or pin queues properly; for small nodes, this can be the difference between stable p99 and chaos.
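Concretely, the fix is either the balancing daemon or explicit pinning, not both at once (irqbalance will overwrite manual masks). A hedged example of each option; IRQ 49 and the hex mask 2 (CPU1) are placeholders taken from the output above:
cr0x@server:~$ sudo systemctl enable --now irqbalance
cr0x@server:~$ echo 2 | sudo tee /proc/irq/49/smp_affinity
On tiny nodes, keeping NIC queues and NVMe interrupts off the core your busiest database threads live on is often worth more than a week of SQL tuning.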
Fast diagnosis playbook
When latency spikes, you don’t have time for philosophy. You need a short sequence that finds the bottleneck with high probability.
Here’s the playbook I use when someone says, “Cockroach is slow,” or “MySQL is suddenly lagging,” on small servers.
First: is it network, disk, or contention?
- Check node-to-node RTT and retransmits (ping + ss). If RTT/jitter or retransmits correlate with latency spikes, stop and fix the network.
- Check disk await and utilization (iostat). If writes are queuing (w_await elevated, %util high), stop and fix storage or write amplification.
- Check transaction retries / lock waits (Cockroach SQL stats for retries; MySQL performance_schema or InnoDB status for lock waits). If retries/waits spike, you have contention, not "slow hardware" (see the example after this list).
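For the MySQL half of that third check, the sys schema ships a ready-made view. For example:
cr0x@server:~$ mysql -e "SELECT waiting_pid, waiting_query, blocking_pid, wait_age FROM sys.innodb_lock_waits\G"
For the CockroachDB half, the retry query from Task 12 plays the same role.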
Second: confirm topology and locality are sane
- CockroachDB: leaseholder placement for hot ranges; range distribution balance; under-replication.
- MySQL: primary/replica roles, replication lag, read routing, and whether semi-sync is unexpectedly enabled (or disabled).
Third: identify background work stealing your lunch
- CockroachDB: compactions, range rebalancing, schema backfills. On small nodes these can be loud.
- MySQL: purge lag, checkpoint pressure, long transactions, or a backup job doing “polite” I/O that isn’t actually polite.
Stop conditions (a.k.a. avoid random tuning)
- If iostat shows high await, do not touch SQL settings yet.
- If packet retransmits spike, do not redesign schema yet.
- If retries/lock waits are dominant, do not buy new hardware before fixing the hotspot.
Three corporate-world mini-stories (anonymized, plausible, technically accurate)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company migrated a user profile service from MySQL to CockroachDB. The architecture doc said, “We need HA; a three-node cluster gives us that.”
The team was smart, the code was clean, and the migration rehearsals went well—on a staging environment with low load and friendly neighbors.
The assumption that failed: “If each node is under 30% CPU, we have plenty of headroom.”
Production latency spiked after launch, but CPU stayed calm. The on-call engineer stared at CPU graphs like they were going to confess.
Meanwhile, users were waiting an extra 200–600ms for profile updates during peak traffic.
The real bottleneck was storage tail latency on one node. Its SSD had occasional long fsync pauses.
MySQL had masked this before because the primary was on a different box, and replication lag wasn’t user-facing.
In CockroachDB, that one node was part of the quorum for a large share of ranges, so its pauses became quorum pauses.
The fix was embarrassingly non-database: replace the drive and rebalance leaseholders.
They also changed procurement rules: no “mystery SSDs,” and no mixed storage classes inside a consensus cluster.
The lesson that stuck: with distributed SQL, a single slow node isn’t just “a bit slower.” It becomes the cluster’s personality.
Mini-story 2: The optimization that backfired
An e-commerce platform ran MySQL with async replication. Checkout was fast, but leadership wanted “stronger consistency” during outages.
Someone proposed enabling semi-sync replication to ensure at least one replica acknowledged writes before commit.
It sounded like a free safety upgrade. It was not.
They enabled semi-sync during a maintenance window and saw only a modest increase in average latency.
A week later, the first real network disturbance hit: a top-of-rack switch started dropping packets intermittently.
The primary began waiting on replica acks that arrived late or not at all.
Latency didn’t just increase. It became spiky. Checkout traffic piled up, thread pools saturated, and then the app tier started timing out.
The incident looked like “database is slow.” The root cause was “network is unreliable, and we put it in the commit path.”
The backfired optimization wasn’t semi-sync itself; it was enabling it without validating east-west network quality and without timeouts tuned to reality.
After the incident, they kept semi-sync—but only after placing replicas on a better network segment and setting clear behavior for when semi-sync falls back.
The lesson: anything that adds an acknowledgement to the commit path turns network hiccups into user-visible pain. This is true in MySQL and it’s the default in CockroachDB.
Mini-story 3: The boring but correct practice that saved the day
A financial services team ran CockroachDB on small but decent nodes. They didn’t have infinite budget, so they leaned into discipline:
stable instance types, dedicated disks, and a strict change calendar for schema migrations.
Their least glamorous practice was also the most valuable: before every release, they ran a fixed “tail-latency acceptance test” that measured p95 and p99 across a set of critical transactions,
while simultaneously running a background compaction-like I/O job to simulate noisy conditions.
One day, the test failed. Median latency looked fine; p99 doubled. The change? A new index on a high-write table plus a backfill that increased write amplification.
The developers argued the code was correct, and they were right. But production correctness includes physics.
Because they had the test, they caught it before customers did. They rescheduled the backfill to off-peak, adjusted constraints to keep leaseholders near the API tier,
and added throttling for the migration job.
The day was saved by a boring routine and a refusal to treat p99 as “someone else’s metric.”
Common mistakes: symptoms → root cause → fix
These are the patterns that show up in incident channels. The symptoms are repetitive; the root causes are usually boring.
The fixes are specific, not inspirational.
CockroachDB: p99 write latency spikes every few minutes
- Symptoms: brief latency cliffs, transaction retries, nodes “healthy” at a glance.
- Root cause: storage compactions or flushes competing with foreground writes; or one node with long fsync tail.
- Fix: validate fsync tail with fio, reduce disk saturation, use better storage, and ensure nodes are homogeneous. Consider separating workloads and sizing for compaction headroom.
CockroachDB: reads are unexpectedly slow in a “single region” cluster
- Symptoms: read latency higher than expected, even for simple SELECTs.
- Root cause: leaseholders not located near the SQL gateway/app tier; queries bouncing between nodes.
- Fix: inspect leaseholder distribution, set locality/zone constraints, and route SQL connections appropriately.
CockroachDB: application sees sporadic serialization failures
- Symptoms: retryable errors, increased end-to-end request time, “random” failures under load.
- Root cause: contention on hot rows/ranges; long transactions; counter-style updates.
- Fix: redesign hotspots (sharded counters, batching), shorten transactions, add appropriate indexes, and ensure client retry logic is correct and bounded.
MySQL: replicas lag during peak and never catch up
- Symptoms: rising Seconds_Behind_Source, delayed reads, risky failovers.
- Root cause: slow queries on replica, insufficient apply parallelism, heavy DDL, or I/O saturation.
- Fix: optimize the hot queries, enable/adjust parallel replication, isolate replicas for reads, and schedule DDL carefully.
MySQL: sudden latency increase after “making it safer”
- Symptoms: commits slower, app timeouts, CPU normal.
- Root cause: enabling semi-sync or stricter durability settings without verifying network/disk tail latency.
- Fix: measure RTT and retransmits, set sane timeouts, place replicas better, and decide explicitly what failure semantics you want.
Both: average latency is fine, users still complain
- Symptoms: dashboards show “green,” support tickets say “slow sometimes.”
- Root cause: tail latency from contention, GC/compaction, background jobs, or network jitter.
- Fix: instrument and alert on p95/p99, correlate with disk/network metrics, and make tail latency part of release criteria.
Checklists / step-by-step plan
Checklist A: choosing MySQL vs CockroachDB for small servers
- If you need single-digit ms p99 for OLTP in one location: start with MySQL on a strong primary. Add replicas for reads and backups. Don’t apologize.
- If you need automated failover with minimal operational toil and can tolerate a latency premium: CockroachDB is a reasonable choice, but size disks and network like they matter (because they do).
- If you need multi-writer without sharding the app: CockroachDB fits the shape. MySQL will force sharding or careful coordination patterns.
- If your nodes are “tiny” and storage is mediocre: don’t pick distributed SQL for performance. Pick it for availability and operational behavior, and accept the tax.
- If the business demands “no data loss”: make sure you’re not relying on async semantics in MySQL, and make sure your CockroachDB cluster can actually keep quorum under failure.
Checklist B: making CockroachDB behave on small nodes
- Keep nodes as identical as possible (CPU, RAM, disk type, kernel settings).
- Measure and control fsync tail latency; don’t guess.
- Watch transaction retries and contention; fix hotspots in schema and access patterns.
- Check leaseholder locality for hot ranges; don’t assume the database “knows” your app topology.
- Schedule schema changes and backfills; throttle them; treat them as production events.
- Leave headroom for compactions and rebalancing. “80% disk busy” is not headroom.
Checklist C: making MySQL safe without destroying latency
- Keep innodb_flush_log_at_trx_commit=1 unless you can explain the risk in one sentence to a non-engineer.
- Define what consistency you need for reads; route accordingly (primary vs replicas).
- Rehearse failover. A runbook you’ve never executed is fiction.
- Monitor replication lag and set thresholds that trigger operational action, not just dashboards.
- Keep backups and restores boring and tested; replication is not a backup.
- Do online schema changes with proven tooling and a rollback plan.
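"Proven tooling" in practice usually means gh-ost or pt-online-schema-change. A hedged sketch of the latter; the database, table, and column are placeholders, and --dry-run always comes before --execute:
cr0x@server:~$ pt-online-schema-change --alter "ADD COLUMN archived_at DATETIME NULL" D=app,t=orders --max-lag=5 --dry-run
cr0x@server:~$ pt-online-schema-change --alter "ADD COLUMN archived_at DATETIME NULL" D=app,t=orders --max-lag=5 --execute
The --max-lag flag pauses the copy when replicas fall behind, which is the entire point of doing this carefully on a small fleet.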
FAQ
1) Is CockroachDB “slower” than MySQL?
On small servers for low-latency OLTP, often yes—especially at p95/p99—because quorum replication and LSM compactions add unavoidable work.
On workloads that need scale-out writes or automated failover, “slower per transaction” can still be “faster for the business” because outages cost more than milliseconds.
2) Why does CockroachDB latency look fine on average but bad at p99?
Because tail latency is where coordination and background storage work show up: occasional slow fsync, compaction bursts, replica hiccups, network jitter, and retries.
Medians hide these; users don’t.
3) Can I make CockroachDB as fast as a single-node MySQL primary?
You can narrow the gap with good hardware, locality-aware configuration, and contention fixes. You can’t remove quorum and still claim the same failure semantics.
If you need single-node latency, run a single-node database for that path.
4) Is a three-node CockroachDB cluster enough?
It’s the minimum for quorum-based HA, and it works. It’s also fragile in the sense that any maintenance or node instability consumes your safety margin immediately.
If you can afford it, five nodes buys operational breathing room and often better tail behavior.
5) What’s the MySQL equivalent of CockroachDB’s strong consistency?
MySQL can approximate stronger durability/consistency using semi-sync replication or group replication, but you’re changing the commit path and taking a latency hit.
The key difference is that CockroachDB is designed around that model; MySQL can do it, but it’s easier to misconfigure and harder to reason about during failures.
6) Which one is easier to run with a small team?
If your workload fits on one primary, MySQL is usually simpler day-to-day.
If you need automated failover and can’t afford manual HA operations, CockroachDB can reduce toil—provided you invest in measuring disk/network and understanding contention.
7) Does CockroachDB require fast NVMe?
It doesn’t require it to function. It requires it to feel good under load, especially on small nodes.
Slow or jittery storage shows up as quorum commit latency and compaction stalls.
8) Can I use CockroachDB on 1Gbps networking?
Yes, but you must watch retransmits and east-west congestion, and you should keep nodes close (low RTT, low jitter).
If your network is shared and bursty, your p99 will reflect that reality.
9) Why do small clusters feel “fine” until traffic grows a bit?
Because the first thing you run out of is slack, not CPU. Background work and queueing effects rise nonlinearly.
At small scale, the system has enough idle time to hide sins; at moderate scale, it starts charging interest.
10) If I’m already on MySQL, when should I move to CockroachDB?
Move when your pain is primarily around failover, multi-writer growth, operational fragility, or the complexity of sharding and correctness.
Don’t move because you want a trendy database. If your main pain is query latency, a distributed system is an expensive way to buy more latency.
Conclusion: what to do next week
If you’re on small servers and debating MySQL vs CockroachDB, treat it like an engineering decision, not a branding decision.
MySQL is the latency baseline. CockroachDB is the availability baseline. You can tune both, but you can’t negotiate away their defaults.
Practical next steps that actually change outcomes:
- Measure RTT and disk fsync tail latency on your real nodes. If either is ugly, fix that before changing databases.
- Pick your truth model: async replication (fast, riskier) vs quorum/strong consistency (safer, slower). Write it down.
- Run a p99-focused load test with background I/O running. If you don’t test tails, you’ll discover them in production.
- If you choose CockroachDB on small nodes: invest in homogeneous hardware, locality/leaseholder awareness, and contention diagnosis from day one.
- If you choose MySQL: invest in disciplined failover, replication lag monitoring, and tested backups. Don’t let “simple” become “untended.”
The latency tax isn’t a hidden fee. It’s on the receipt. You just have to look at p99 to see it.