Replication looks like a straight line on diagrams: primary writes, replicas follow, everyone goes home.
In production it’s a crime scene: a replica that’s “running” but wrong, a failover that “worked” but lost data,
or a lag graph that flatlines right when the CEO is demoing.
MariaDB and Percona Server both speak “MySQL-ish,” but their replication edge cases don’t line up neatly.
The worst failures come from assuming they do. This is a field guide to the sharp corners: GTIDs, binlogs,
crash safety, filtering, semi-sync, and the operational habits that keep you out of incident review hell.
What you’re actually choosing (hint: operational semantics)
“MariaDB vs Percona Server” sounds like a features checklist. Replication makes it a behavior choice.
In normal times, both will stream binlog events to replicas and apply them. Under stress—network partitions,
disk stalls, failovers, crash recovery—the differences show up as: what gets written to the binlog, what gets
acknowledged, what’s considered safe, and what your tooling assumes.
Percona Server is a MySQL variant that typically tracks Oracle MySQL behavior closely while adding instrumentation,
knobs, and operational patches (often performance- or observability-oriented). MariaDB is its own fork with
replication features that are sometimes ahead, sometimes simply different, and occasionally incompatible in subtle
ways—especially around GTID semantics.
Here’s the opinionated framing that holds up in outages:
- If you need “MySQL ecosystem compatibility first,” Percona Server is usually the safer bet. Many tools assume Oracle-style GTID and binlog details.
- If you want MariaDB-specific features (and accept MariaDB semantics), commit to them fully: use MariaDB GTID end-to-end, MariaDB-aware tooling, and test failover paths like you test backups—routinely.
- If you plan to mix MariaDB and Percona in the same replication topology, do it only with a written compatibility matrix and a lab that replays your real binlogs. “It worked once” is not a matrix.
Historical facts that explain today’s weirdness
These are short, concrete bits of history that matter because replication “weirdness” is usually “legacy decisions
made in public.” Keep these in your head when you debug.
- MySQL replication began as statement-based replication (SBR), which was fragile with non-deterministic SQL; row-based replication (RBR) came later and changed operational expectations.
- MariaDB forked from MySQL in 2009 and evolved replication features independently, including its own GTID implementation and metadata.
- Oracle MySQL GTID (5.6+) and MariaDB GTID are not the same protocol; they solve similar problems with different identifiers and failover rules.
- Percona Server historically shipped “operator-friendly” patches such as better instrumentation and performance knobs; this influenced how teams monitor replication (and what they expect to see).
- MySQL 5.7/8.0 made row-based replication the de facto default in many shops because it reduced “replica drift,” but it raised binlog volume and IO sensitivity.
- Multi-threaded replication arrived in stages: first coarse parallelism, then more granular “logical clock”/GTID-based parallel apply in MySQL variants; MariaDB has its own parallel replication approach and knobs.
- Semi-synchronous replication was bolted on as a plugin concept and evolved over time; it reduces but does not eliminate data loss risks under certain failover sequences.
- Crash-safe replication improved over the years via metadata repositories and relay log recovery, but configuration choices can still put you back in 2012.
- Checksums and binlog event metadata grew over time; cross-version and cross-flavor replication can break on things as petty as “the binlog says CRC32 and the replica disagrees.”
Replication models and where MariaDB and Percona diverge
Async replication: the default that lies politely
Classic async replication means the primary commits and tells the client “OK” without waiting for replicas.
It’s fast. It’s also honest about one thing: if the primary dies, you might lose the last few transactions.
The edge cases show up when people forget that the client ACK is not a durability guarantee beyond the primary’s
storage settings.
The MariaDB vs Percona difference here isn’t philosophical. It’s in the details:
how GTID state is tracked, how crash recovery repositions relay logs, what defaults are shipped, and what your
orchestration tools assume.
Row vs statement vs mixed: replication format is not a preference
If you’re still running statement-based replication on business-critical systems, you’re basically betting your
job on every query being deterministic under all execution plans, time zones, SQL modes, and function semantics.
That’s a bold lifestyle choice.
RBR is safer for correctness but heavier on disk, IO, and sometimes network. Mixed tries to be clever and often
surprises you at 3 a.m. The important edge case is: a single non-deterministic statement can cause replica drift,
and then every “successful” failover is just switching to a wrong database faster.
Percona Server vs MariaDB: what operators feel
Operators don’t care about slogans; they care about whether failover tools make correct choices.
In Percona Server (tracking Oracle MySQL semantics), GTID and replication internals tend to behave like the broader
MySQL ecosystem expects. MariaDB’s replication features can be excellent, but they’re not always drop-in compatible.
Translation: you can run either well, but you can’t wing it. Replication punishes improvisation.
GTIDs: the biggest foot-gun with the best marketing
GTIDs are supposed to make failover boring: each transaction has a global ID, replicas know what they executed,
and you can promote a replica without hand-counting binlog positions.
The edge cases come from two facts:
- MariaDB GTID and Oracle/MySQL GTID are different systems with different formats and state handling.
- GTID correctness depends on how you set up logging, crash safety, and your failover tool’s assumptions.
MariaDB GTID vs MySQL/Percona GTID: “same acronym, different contract”
In MySQL/Percona-style GTID, the server UUID (or server identifier set) is baked into GTID sets and state variables
like gtid_executed. MariaDB uses a domain-based GTID with different variables and conventions.
Both can support “auto-positioning” behavior, but the wire-level and metadata are not interchangeable.
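To make the "different contract" concrete, here is a minimal comparison sketch; the UUIDs, domain/server IDs, and sequence numbers are invented for illustration:

-- MySQL/Percona GTID: source server UUID plus a transaction range
--   e.g. 3e11fa47-71ca-11e1-9e33-c80aa9429562:1-1054
SELECT @@GLOBAL.gtid_executed;

-- MariaDB GTID: domain-server_id-sequence
--   e.g. 0-101-20457
SELECT @@GLOBAL.gtid_current_pos;

-- Replica configuration differs by flavor too:
--   MySQL/Percona: CHANGE MASTER TO ..., MASTER_AUTO_POSITION=1;
--   MariaDB:       CHANGE MASTER TO ..., MASTER_USE_GTID=slave_pos;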
Mixing flavors across GTID boundaries is where teams bleed. If you replicate from MariaDB to Percona or vice versa,
you must validate:
- Whether GTID is supported in that direction and version combination.
- Whether your replication channel understands the GTID format emitted by the primary.
- Whether your failover tooling reads the right status variables and interprets them correctly.
GTID failover: what “safe” actually means
A “safe” GTID failover requires two conditions:
- The candidate replica has received all transactions that were acknowledged to clients (or your app is OK losing some).
- The candidate replica has applied those transactions and will not reapply or skip them after promotion.
GTID helps with condition #2 (not reapplying transactions) if configured correctly. It does not guarantee #1. That’s
where semi-sync, durability settings (sync_binlog, innodb_flush_log_at_trx_commit), and network timing matter.
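Here is a pre-promotion check sketch for that second condition, assuming you know the GTID set the old primary acknowledged; the set value and variable readings are illustrative:

-- MySQL/Percona: returns 1 if every GTID in the first set is already in gtid_executed on the candidate
SELECT GTID_SUBSET('3e11fa47-71ca-11e1-9e33-c80aa9429562:1-1054',
                   @@GLOBAL.gtid_executed) AS candidate_has_applied_everything;

-- MariaDB: compare what the candidate has applied vs what it has written to its own binlog,
-- and compare gtid_slave_pos across candidates to find the most advanced node per domain
SELECT @@GLOBAL.gtid_slave_pos  AS applied_via_replication,
       @@GLOBAL.gtid_binlog_pos AS in_local_binlog;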
Joke #1
GTID is like labeling your leftovers: you still have to put them in the fridge if you don’t want regret later.
Binlog compatibility: checksums, formats, and silent mismatch
Replication is ultimately a contract: “I will write events like this; you will understand and apply them.”
Contracts break in three ways: the primary changes what it writes, the replica changes what it understands,
or a middlebox quietly edits the traffic (rare, but I’ve seen “helpful” proxies do dumb things).
Binlog checksums: small setting, big outage
Binlog checksums protect against corruption, but compatibility depends on both ends agreeing.
A classic failure mode is upgrading one side or changing a default, then watching replication stop with a parse error.
Practical advice: standardize binlog checksum settings fleet-wide and make them part of your baseline audits.
Don’t let one server be “special.”
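One way to prevent "special" servers is to carry the baseline in configuration management rather than in someone's memory. A minimal my.cnf fragment, assuming ROW format and CRC32 checksums are your chosen fleet standard (an example baseline, not a universal recommendation):

[mysqld]
log_bin                        = mysql-bin   # enable binary logging with a fixed basename
binlog_format                  = ROW
binlog_checksum                = CRC32
sync_binlog                    = 1
innodb_flush_log_at_trx_commit = 1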
Binlog format: RBR is safer, but watch for volume-induced lag
When you flip to RBR, you often fix correctness issues while creating new performance bottlenecks:
replicas become IO-bound applying large row images, and relay logs bloat.
Percona Server and MariaDB both handle RBR, but their parallel apply and tuning knobs differ. The edge case isn’t
“can it replicate” but “can it keep up during write bursts and schema changes.”
Event types and version skew
Cross-version replication (even within the same flavor) is where weird event types appear.
A newer primary may emit events a slightly older replica can’t parse. Sometimes you get a clean error. Sometimes
it looks like lag, then a stop, then a confused operator reruns START SLAVE as a coping strategy.
Semi-sync and “data loss anyway” scenarios
Semi-synchronous replication reduces the window of loss by waiting for at least one replica to acknowledge receipt
of the transaction before the primary returns OK. Receipt, not apply. If the acknowledged replica has the event in
a relay log but hasn’t committed it, a failover can still lose it—or worse, fork history depending on your procedure.
This is where a lot of executives hear “semi-sync” and think “no data loss.” That belief has caused more than one
awkward meeting with Finance.
What semi-sync actually buys you
- Fewer acknowledged transactions exist only on the primary.
- Better odds that a promoted replica contains the last committed writes.
- More latency and more sensitivity to replica/network jitter.
Failure sequence that still loses data
Primary commits, sends to replica, replica ACKs receipt, primary ACKs client. Then the replica crashes before apply,
or the relay log is corrupted, or the failover promotes a different replica that never received it. Congratulations:
semi-sync, async outcomes.
Operator stance
Use semi-sync if your latency budget allows it and you understand its guarantees. Pair it with promotion rules:
only promote replicas that are confirmed caught up and durable enough for your RPO.
Otherwise, you’re just paying latency rent for vibes.
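If you do run semi-sync, verify it is actually engaged rather than silently fallen back to async after a timeout. A quick probe sketch using the MySQL/Percona-style plugin variable names (MariaDB 10.3+ ships semi-sync built in with similarly named variables):

-- Is semi-sync enabled and currently active on the primary?
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_enabled';
SHOW GLOBAL STATUS    LIKE 'Rpl_semi_sync_master_status';   -- OFF means you are effectively async right now

-- How long does the primary wait before giving up, and how often has it committed without an ACK?
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';
SHOW GLOBAL STATUS    LIKE 'Rpl_semi_sync_master_no_tx';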
Parallel replication: faster until it isn’t
Parallel replication is a performance feature that turns into a correctness debate the moment you have cross-schema
transactions, hot rows, or a workload that’s “mostly parallel except for the parts that matter.”
The edge case: you “optimize lag” by increasing worker threads, but the replica becomes CPU-bound, or mutex-bound,
or starts thrashing the buffer pool, and apply slows down. Or it keeps up—until a DDL arrives and everything piles up
behind it like cars behind a parade.
Percona Server (aligned with MySQL) typically uses MySQL-style parallel apply mechanics and instrumentation that
many tools expect. MariaDB has its own parallel replication system and variables, and the tuning patterns are not
always transferable. You must measure, not guess.
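A side-by-side sketch of the knobs you would actually inspect; exact names are version-dependent, so treat this as a starting point rather than a tuning recipe:

-- MySQL/Percona-style parallel apply
SHOW VARIABLES LIKE 'slave_parallel_workers';        -- replica_parallel_workers on newer releases
SHOW VARIABLES LIKE 'slave_parallel_type';           -- DATABASE vs LOGICAL_CLOCK
SHOW VARIABLES LIKE 'slave_preserve_commit_order';

-- MariaDB parallel replication
SHOW VARIABLES LIKE 'slave_parallel_threads';
SHOW VARIABLES LIKE 'slave_parallel_mode';           -- conservative, optimistic, aggressive, minimal, none
SHOW VARIABLES LIKE 'slave_domain_parallel_threads';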
Filters and schema changes: the slow-burn disasters
Replication filters (replicate-do-db, replicate-ignore-db, wildcards, table filters)
look like a neat way to “keep only what you need.” In reality they are a long-term liability unless you treat them
like code: reviewed, tested, and re-validated after every schema change.
Filters + mixed binlog format + cross-database queries can cause replicas to miss changes silently.
Silent is the worst word in replication.
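A concrete example of how that silence happens, assuming statement format and a replica filtering with replicate_do_db=app (table names are hypothetical): with SBR, do-db filters match on the session's default database, not on the tables a statement touches.

-- On the primary, binlog_format=STATEMENT; the replica runs replicate_do_db=app
USE reporting;                        -- default database is 'reporting'
UPDATE app.orders SET status = 'paid' -- modifies a table in 'app'
 WHERE id = 42;
-- The replica's filter sees default database 'reporting' and skips the event.
-- No error, no lag, just a row that is quietly different forever.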
DDL replication: where “it worked in staging” goes to die
DDL is brutal on replicas. It’s often single-threaded, it blocks, and it can create enormous row change events.
Online schema change tools reduce impact but introduce their own edge cases: triggers, ghost tables, and lag spikes.
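A hedged sketch of running a large ALTER through Percona Toolkit's pt-online-schema-change with a lag gate instead of a direct DDL; the schema/table names, replica host, and thresholds are placeholders, and you would rehearse with --dry-run before switching to --execute:

cr0x@server:~$ pt-online-schema-change \
    --alter "ADD COLUMN loyalty_tier TINYINT NOT NULL DEFAULT 0" \
    --max-lag=10 --check-slave-lag h=replica1.internal \
    --chunk-size=1000 \
    --dry-run \
    D=app,t=orders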
If you have to choose a philosophy: prioritize correctness and predictable recovery over peak throughput.
You can buy hardware for throughput. You cannot buy your way out of inconsistent data after a bad failover.
Practical tasks: commands, outputs, and the decisions you make
Below are real operational tasks. Each includes a command, sample output, what it means, and what decision you make.
Use these as muscle memory during incidents and during boring Tuesday audits. Boring Tuesday is where you win.
Task 1: Identify server flavor and version (stop guessing)
cr0x@server:~$ mysql -e "SELECT @@version, @@version_comment\G"
*************************** 1. row ***************************
@@version: 10.11.6-MariaDB-1:10.11.6+maria~ubu2204
@@version_comment: MariaDB Server
What it means: You’re on MariaDB 10.11.x. If your runbook assumes MySQL/Percona GTID variables, it’s already wrong.
Decision: Use MariaDB replication syntax/variables and ensure your failover tooling supports MariaDB GTIDs (if enabled).
Task 2: Check primary binlog and durability posture
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('log_bin','binlog_format','sync_binlog','innodb_flush_log_at_trx_commit')"
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| binlog_format                  | ROW   |
| innodb_flush_log_at_trx_commit | 1     |
| log_bin                        | ON    |
| sync_binlog                    | 1     |
+--------------------------------+-------+
What it means: Binlogs are enabled, row-based replication is used, and durability is strict (higher IO cost, better crash safety).
Decision: If latency is acceptable, keep this. If you’re tempted to relax it, do it only with an explicit RPO/RTO sign-off and a tested recovery plan.
Task 3: Confirm replica status (the boring baseline)
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_SQL_Error|Last_IO_Error"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Last_IO_Error:
Last_SQL_Error:
What it means: Threads are running and the replica claims no lag. This is necessary, not sufficient, for correctness.
Decision: If you’re diagnosing “wrong results,” you must go beyond this and verify executed GTIDs / positions / checksums.
Task 4: Measure replication lag correctly (don’t worship Seconds_Behind_Master)
cr0x@server:~$ pt-heartbeat --check --database=percona --table=heartbeat --host=127.0.0.1 --user=monitor --password='***'
0.183
What it means: Actual observed lag is ~183ms. This is often more reliable than Seconds_Behind_Master under load or parallel apply.
Decision: Use heartbeat-based lag for alerting and failover gating. Treat Seconds_Behind_Master as a hint, not a truth oracle.
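For --check to mean anything, something must be writing heartbeat rows on the primary. A minimal setup sketch (database, host, and credentials are placeholders):

cr0x@server:~$ pt-heartbeat --update --daemonize --interval=1 \
    --database=percona --table=heartbeat --create-table \
    --host=primary.internal --user=monitor --password='***'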
Task 5: Validate binlog checksum compatibility
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'binlog_checksum';"
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| binlog_checksum | CRC32 |
+-----------------+-------+
What it means: The primary emits CRC32 checksums. A replica that expects NONE (or lacks support) can fail reading events.
Decision: Ensure replicas are configured to understand the same checksum mode; standardize it fleet-wide before upgrades.
Task 6: Spot relay log pressure and disk risk
cr0x@server:~$ du -sch /var/lib/mysql/relay-bin.* 2>/dev/null | tail -n 3
1.1G    /var/lib/mysql/relay-bin.000112
1.1G    /var/lib/mysql/relay-bin.000113
36G     total
What it means: Relay logs total tens of gigabytes across many files. This usually correlates with lag or slow SQL thread apply.
Decision: Before you restart anything, find the apply bottleneck (IO, locks, DDL). If disk is near full, prioritize stopping the bleeding: throttle writes, add space, or provision a new replica.
Task 7: Find the current replication coordinates (binlog position still matters)
cr0x@server:~$ mysql -e "SHOW MASTER STATUS\G"
*************************** 1. row ***************************
File: mysql-bin.001874
Position: 918273645
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set:
What it means: You have a binlog file/position. Even with GTID, this is useful for point-in-time comparisons and debugging. (An empty Executed_Gtid_Set means GTID is not enabled on this MySQL/Percona-style server; MariaDB does not show this field at all.)
Decision: Record this before risky operations (failover, restart, config change). It’s your breadcrumb trail when you’re tired.
Task 8: Check whether the replica is applying in parallel (and if it’s helping)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slave_parallel_workers';"
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| slave_parallel_workers | 16    |
+------------------------+-------+
What it means: The replica is configured for 16 parallel workers (MySQL/Percona-style variable name; MariaDB uses different knobs depending on version).
Decision: If lag is worse after increasing workers, back off and profile: CPU saturation, mutex waits, IO stalls, and DDL serialization are common culprits.
Task 9: Inspect replication worker states to identify the blocker
cr0x@server:~$ mysql -e "SELECT THREAD_ID,PROCESSLIST_STATE,PROCESSLIST_INFO FROM performance_schema.threads t JOIN performance_schema.processlist p ON t.PROCESSLIST_ID=p.ID WHERE p.COMMAND IN ('Connect','Binlog Dump GTID','Binlog Dump','Query') LIMIT 5\G"
*************************** 1. row ***************************
THREAD_ID: 82
PROCESSLIST_STATE: Waiting for master to send event
PROCESSLIST_INFO: NULL
*************************** 2. row ***************************
THREAD_ID: 109
PROCESSLIST_STATE: Waiting for handler commit
PROCESSLIST_INFO: INSERT INTO orders ...
What it means: IO thread is idle (good), but a worker is stuck waiting on commit—often IO, fsync, or contention.
Decision: Look at disk latency and InnoDB log flush pressure. Don’t just crank workers; you’ll multiply contention.
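On MySQL/Percona, performance_schema also exposes per-worker applier state directly, which is usually faster than joining threads and processlist by hand. A sketch (available columns vary by version):

cr0x@server:~$ mysql -e "SELECT WORKER_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE FROM performance_schema.replication_applier_status_by_worker\G"

Look for workers in the OFF state or carrying a non-zero error number; that is your stuck lane.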
Task 10: Check disk latency (replication lag is often storage lag)
cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s  r_await  w_await  aqu-sz  %util
nvme0n1         120.0   980.0     1.10    18.70    7.42   92.0
What it means: Writes are waiting ~19ms on average and the device is ~92% utilized. Your replica SQL thread is probably IO-gated.
Decision: Reduce write amplification (buffer pool tuning, log file sizing), move replicas to faster storage, or reduce incoming write rate temporarily. CPU tuning won’t fix a saturated disk.
Task 11: Confirm crash-safe relay log recovery settings
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('relay_log_recovery','master_info_repository','relay_log_info_repository');"
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| master_info_repository    | TABLE |
| relay_log_info_repository | TABLE |
| relay_log_recovery        | ON    |
+---------------------------+-------+
What it means: Metadata is stored in tables and relay log recovery is enabled—good for crash safety.
Decision: If these are FILE/OFF in a fleet, schedule a controlled change. The day after an unclean reboot is not the time to discover you can’t recover relay logs cleanly.
Task 12: Detect replication break due to duplicate key (classic after wrong promotion)
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Last_SQL_Errno|Last_SQL_Error|Slave_SQL_Running"
Slave_SQL_Running: No
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '1842' for key 'PRIMARY'' on query. Default database: 'app'. Query: 'INSERT INTO users(id,...) VALUES(...)'
What it means: Replica applied something out of order or diverged. Often caused by writing to a replica, or by failover promoting a replica missing transactions.
Decision: Stop. Don’t skip errors blindly. Identify divergence scope: compare GTID sets/positions, run consistency checks, and consider rebuilding the replica from a known-good snapshot.
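To scope divergence instead of guessing, Percona Toolkit's pt-table-checksum checksums chunks on the primary and lets replicas compare via replication itself. A hedged sketch (hosts, user, and database are placeholders; it adds write load, so schedule it deliberately):

cr0x@server:~$ pt-table-checksum --replicate=percona.checksums \
    --databases=app --max-lag=5 \
    h=primary.internal,u=checksum_user,p='***'
cr0x@server:~$ pt-table-checksum --replicate=percona.checksums --replicate-check-only \
    h=primary.internal,u=checksum_user,p='***'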
Task 13: Verify no one is writing to replicas (it happens)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'read_only'; SHOW VARIABLES LIKE 'super_read_only';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| super_read_only | OFF   |
+-----------------+-------+
What it means: Replica is writable. This is fine only if you intentionally use it (rare) and understand the consequences.
Decision: Turn on read_only and super_read_only for replicas by policy. If an app needs writes, it needs a different architecture, not a sneaky exception.
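Enforcing that policy is one statement on MySQL/Percona. Note the flavor gap: MariaDB has read_only but (through the 10.x series, to my knowledge) no super_read_only, so privileged accounts can still write unless you lock them down separately.

-- MySQL/Percona replica: also blocks users with SUPER/admin privileges
SET GLOBAL super_read_only = ON;   -- implicitly sets read_only = ON

-- MariaDB replica: read_only only; pair it with restricted grants for humans and apps
SET GLOBAL read_only = ON;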
Task 14: Check for replication filters that silently drop data
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'replicate%';" | egrep -v "replicate_(same_server_id|rewrite_db)"
replicate_do_db app
replicate_ignore_db mysql
replicate_wild_ignore_table app.tmp_%
What it means: You’re filtering replication. This can be valid, but it’s dangerous with cross-db queries and DDL. (This SHOW VARIABLES check works on MariaDB; on MySQL/Percona, read the Replicate_* fields of SHOW REPLICA STATUS / SHOW SLAVE STATUS instead.)
Decision: Audit every filter against actual workload patterns. If you can’t prove it’s safe, remove it and isolate data another way (separate instances, separate clusters).
Fast diagnosis playbook
When replication is “slow” or “broken,” your job is not to stare at one metric. Your job is to find the limiting
resource and the failure mode quickly, then choose the least risky intervention.
First: Is replication stopped or lying?
- Check SHOW SLAVE STATUS\G (or equivalent) for IO/SQL thread state and errors.
- If both threads run, do not assume correctness. Validate lag via heartbeat and sample data checks.
Second: Is the bottleneck network, disk, or locks?
- Network/IO thread: IO thread not running, “connecting,” or “reconnecting.”
- Disk/commit: SQL thread waiting on commit; high iostat await; saturated %util.
- Locks/DDL: SQL thread stuck on metadata locks; long-running DDL blocks apply.
Third: Is it a format/compatibility mismatch?
- Check binlog format and checksum variables on both ends.
- Check version skew, plugin differences, and GTID mode compatibility.
Fourth: Decide intervention level
- Low risk: throttle writes, add disk space, fix network, improve IO.
- Medium risk: adjust parallel workers or apply tuning with measurement.
- High risk: skip errors, reset slave, rebuild replicas, promote a different node. High risk requires a written plan and a rollback.
Common mistakes: symptoms → root cause → fix
1) “Seconds_Behind_Master is 0, but the data is wrong”
Symptom: App reads from replicas show missing rows or stale aggregates; replication status looks “fine.”
Root cause: Statement-based drift, unsafe functions, or filtering dropping events; sometimes writes leaked to replicas.
Fix: Switch to row-based binlog where possible; remove risky filters; enforce super_read_only; run consistency checks and rebuild divergent replicas.
2) “Replication broke after a minor upgrade”
Symptom: IO thread stops with event parsing errors; or SQL thread errors on unknown event types.
Root cause: Version skew emitting binlog features the replica can’t parse; checksum mismatch; GTID mode mismatch across flavors.
Fix: Align versions within supported ranges; standardize binlog_checksum; validate GTID compatibility in a lab and in a canary replica before rollout.
3) “Lag grows during write bursts, never recovers”
Symptom: Relay logs balloon; SQL thread shows “Waiting for handler commit”; IO is hot.
Root cause: Replica disk is the bottleneck (fsync pressure, saturated storage), not CPU. RBR increases write amplification.
Fix: Move replicas to faster storage; tune InnoDB log configuration; ensure buffer pool is sized; consider delaying heavy analytics queries on replicas.
4) “We did failover with GTID and still lost transactions”
Symptom: Users report missing recent writes; primary died; new primary promoted cleanly.
Root cause: Async replication window, semi-sync acknowledgment only on receipt, or promoted replica not fully caught up/applied.
Fix: Gate promotion on verified caught-up state; if using semi-sync, ensure your failover promotes the acknowledged replica when possible; tighten durability settings where required.
5) “Parallel replication made it worse”
Symptom: Increasing workers increases lag and CPU; replication throughput drops.
Root cause: Contention, hot rows, commit serialization, buffer pool churn, or DDL blocking apply.
Fix: Reduce workers; profile waits; treat DDL as a scheduled event; shard write hotspots if needed; validate tuning under realistic load.
6) “Replica crashes and comes back broken”
Symptom: After reboot, replica can’t resume; relay logs inconsistent; requires manual repositioning.
Root cause: Non-crash-safe metadata repositories; relay log recovery disabled; filesystem/disk issues.
Fix: Enable crash-safe repositories (TABLE) and relay log recovery; validate filesystem health; rebuild the replica if state is suspect.
Three corporate mini-stories from the replication trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-sized SaaS company ran a mixed estate: MariaDB for a legacy product line, Percona Server for newer services.
A new team inherited a cross-datacenter replication topology and standardized the failover runbook around “GTID auto-positioning.”
They assumed GTID was GTID. Same acronym, same safety, same outcome.
During a primary outage, they promoted a replica using the steps that worked flawlessly in Percona.
Replication resumed, the app recovered, dashboards went green. Everyone exhaled.
Then customer support escalations started: recent subscription changes were missing, but only for some tenants.
The postmortem was grimly educational. MariaDB GTID state wasn’t being interpreted the way the failover tooling expected.
The tooling chose a promotion candidate that looked caught up by one metric but wasn’t the correct “most advanced” node
in MariaDB’s GTID domain semantics.
They could have caught it with a pre-flight: validating executed transaction sets using the correct server variables,
and testing cross-flavor failover in a lab with real binlogs. Instead, they learned in production, which is the most expensive classroom.
Fix wasn’t heroic. They split the operational playbooks by flavor, enforced flavor-specific monitoring, and stopped pretending
that one universal automation script could safely drive both without explicit compatibility logic.
Mini-story 2: The optimization that backfired
An e-commerce company had a familiar pain: replicas lagged during promotions.
Someone proposed a simple tweak: relax durability to speed commits and let replicas catch up.
They lowered sync_binlog and changed log flush settings on replicas too, because “replicas don’t need to be as durable.”
Lag improved for a while. The team celebrated with the quiet satisfaction of shaving milliseconds off a graph.
Then a replica host rebooted during a kernel update, and the replica came back with relay logs that couldn’t be trusted.
Recovery required manual intervention, and then a rebuild.
During the rebuild window, the remaining replicas took on more read traffic and more replication load.
Lag returned, now with fewer safety nets. They had optimized for steady-state performance while increasing blast radius during routine events.
The lesson wasn’t “never tune.” It was “tune with a failure model.” If you lower durability, you must model what happens on crash,
power loss, and failover. If your model is “probably fine,” congratulations: you’ve adopted folk engineering.
They reverted the risky settings, moved replicas to faster disks, and stopped using durability knobs as a performance band-aid.
Storage was the bottleneck all along—like it usually is.
Mini-story 3: The boring but correct practice that saved the day
A payments platform ran Percona Server with strict operational discipline.
They had a weekly ritual: validate replica catch-up, run a checksum consistency pass on a rotating subset of tables,
rehearse a controlled promotion in a staging environment that mirrored production configuration, and record the exact commands.
Nobody bragged about it. It wasn’t “innovation.” It was housekeeping.
One afternoon, a primary node began throwing intermittent fsync stalls. Replication lag started climbing.
The on-call pulled the fast diagnosis playbook and confirmed the issue was storage latency, not SQL locks.
They throttled a batch writer, drained traffic, and prepared to fail over.
The difference-maker was the promotion gate: they only promoted a replica that was verified caught up via heartbeat and transaction state,
and they had a pre-approved checklist to ensure read-only flags and binlog settings were correct on the new primary.
The failover was unglamorous. It was also clean. No missing transactions, no drift, no “why is this table different” scavenger hunt.
Their biggest stress was waiting for DNS TTLs, which is the correct kind of stress.
Reliability is mostly a pile of boring habits. Boring is underrated.
One quote worth keeping on your wall
Paraphrased idea (attributed to Richard Cook): “Success comes from adapting to surprises; failure comes from the same system operating under strain.”
Checklists / step-by-step plan
Checklist A: Designing replication so edge cases hurt less
- Pick a flavor strategy: MariaDB-only GTID, MySQL/Percona-only GTID, or no-GTID with explicit positions. Avoid “we’ll figure it out later.”
- Standardize binlog format (prefer ROW) and binlog checksum across the fleet.
- Decide your durability posture and document the RPO. Align sync_binlog and innodb_flush_log_at_trx_commit with that decision.
- Enable crash-safe replication metadata (master_info_repository=TABLE, relay_log_info_repository=TABLE, relay_log_recovery=ON) where supported and tested.
- Enforce read-only on replicas (read_only and super_read_only).
- Choose a lag measurement method that reflects reality (heartbeat-based).
- Decide how DDL is rolled out (scheduled windows, online schema change tooling, or application-level migration patterns).
Checklist B: Promotion / failover procedure (human-safe version)
- Freeze writes if you can (application maintenance mode or writer throttling). If you can’t, at least know your expected loss window.
- Pick promotion candidate based on verified catch-up: heartbeat lag near zero, replication threads healthy, no SQL errors.
- Verify transaction state using flavor-correct GTID/position variables. Don’t “translate in your head.”
- Confirm candidate is not writable before promotion unless explicitly needed. Then flip to writable intentionally.
- Promote: stop replication cleanly, reset replication config as needed, ensure binlog is enabled on the new primary.
- Point other replicas at the new primary using the correct auto-position method for your flavor (syntax sketch after this checklist).
- Validate correctness with spot checks and a small consistency probe. Then unfreeze writes.
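The "correct auto-position method" in the re-pointing step is flavor-specific. A sketch with placeholder hostnames and credentials (newer MySQL/Percona releases also accept the CHANGE REPLICATION SOURCE TO / START REPLICA spelling):

-- MySQL/Percona replica re-pointed at the new primary
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-primary.internal',
  MASTER_USER='repl', MASTER_PASSWORD='***',
  MASTER_AUTO_POSITION=1;
START SLAVE;

-- MariaDB replica re-pointed at the new primary
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new-primary.internal',
  MASTER_USER='repl', MASTER_PASSWORD='***',
  MASTER_USE_GTID=slave_pos;
START SLAVE;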
Checklist C: When you must rebuild a replica (and you will)
- Stop replication and record current status output for the incident timeline.
- Choose a clean snapshot source (prefer a replica that is verified correct, not “the least broken”).
- Restore data, then set replication coordinates/GTID state carefully.
- Start replication and watch for early SQL errors (duplicates, missing tables, checksum issues).
- Run a targeted consistency check on high-value tables before returning it to service.
Joke #2
Replication filters are like office politics: they seem helpful until you realize they’ve been silently dropping important information.
FAQ
1) Can MariaDB replicate from Percona Server (or vice versa) safely?
Sometimes, in specific version combinations and configurations, but “safely” requires testing your exact binlog format,
checksum mode, and GTID strategy. Cross-flavor GTID is where teams get hurt. If you must mix, do it with a lab replaying real binlogs.
2) Should I use GTID everywhere?
Yes if your tooling and team procedures are GTID-native for your chosen flavor. No if you’re mixing flavors without a compatibility plan.
GTID improves failover ergonomics, but it doesn’t eliminate data loss windows or bad promotion decisions.
3) Is semi-sync replication “no data loss”?
No. It reduces the window by ensuring at least one replica received the event before client ACK, but receipt is not apply.
You can still lose transactions depending on crash timing and failover choice.
4) Why does replication lag spike during DDL even with parallel replication?
DDL often serializes execution and can take metadata locks that block apply. Parallel workers can’t run past a blocking event.
Treat large schema changes as planned events and test them under production-like load.
5) What’s the most reliable lag metric for failover decisions?
Heartbeat-based measurements (e.g., a timestamp written on the primary and read on replicas) are usually more truthful than
Seconds_Behind_Master, especially under parallel apply or when the SQL thread is intermittently blocked.
6) If replication is broken, should I skip the offending transaction?
Only as a last resort, and only when you fully understand what you’re skipping and how you will repair consistency.
Skipping is an operational debt instrument with compounding interest.
7) Why do replicas diverge even when replication says it’s running?
Common causes: statement-based replication with non-determinism, writes leaking to replicas, replication filters, and
application behavior that depends on session settings not replicated as expected. Divergence is usually a policy failure, not a mystery.
8) What’s the fastest way to tell if the bottleneck is storage?
If SQL thread states show commit waits and iostat shows high await/%util, it’s storage. Fix storage or reduce write pressure.
Don’t “optimize” replication threads into a disk wall.
9) Do Percona-specific features change replication behavior?
Percona Server is typically close to Oracle MySQL replication semantics, but extra instrumentation and knobs can change performance behavior.
Treat it like a first-class flavor: keep consistent configs and test upgrades, especially around parallel replication and durability.
10) What’s the safest binlog format choice today?
Row-based replication is the usual default for correctness. The cost is higher binlog volume and more IO on replicas.
If you choose statement-based, you need tight controls on SQL patterns and a high tolerance for “surprise drift.”
Next steps you can do this week
If you want fewer replication incidents, don’t start by debating brands. Start by removing ambiguity.
- Inventory flavors and versions across your topology. Write it down. Stop assuming “it’s all MySQL.”
- Standardize binlog format and checksum and verify with commands on every node.
- Decide and document your RPO, then align durability settings with it. If the business won’t decide, pick conservative settings and make them sign the latency bill.
- Implement heartbeat-based lag measurement and use it in alerts and failover gates.
- Run a controlled failover drill in staging with production-like settings. Time it. Fix the steps that require tribal knowledge.
- Pick one “boring correctness practice” (read-only enforcement, crash-safe replication metadata, weekly consistency checks) and make it policy.
Replication edge cases don’t care which distribution you installed. They care what you assumed, what you tested,
and what you’re willing to keep consistent when nobody is watching.