Failover is where good replication designs go to get judged. Everything looks clean in diagrams: a primary, a replica, a VIP, maybe a proxy layer, and a runbook nobody reads until it’s 3 a.m. Then a disk stalls, a network flap happens, and suddenly the “new primary” refuses writes, replicas point to the wrong host, or worse: two primaries accept traffic and your business invents a new kind of accounting.
MySQL and MariaDB both “do replication.” They do not fail over the same way in real life. The differences are rarely about headline features and almost always about edge behavior: GTID semantics, binlog compatibility, metadata tables, proxy assumptions, and the sort of operational paper cuts that only show up under stress.
Ground truth: replication is not failover
Replication moves data changes from A to B. Failover moves authority from A to B, plus traffic, plus safety rails. Those rails are where the bodies are buried.
In production, failover has three jobs:
- Pick the right candidate (closest to current data, healthiest, not mid-crash-recovery).
- Make it writable (and ensure the old primary cannot accept writes, even if it wakes up angry).
- Move clients (proxies, DNS, VIPs, connection pools, app caches, and “helpful” retry logic).
MySQL and MariaDB both support async replication. Both can use GTIDs (with important differences). Both can do semi-sync flavors. Both can be managed by orchestrators or proxies. And both can betray you if you treat replication like a magic wand instead of a contract.
One operational truth worth printing on a sticker: failover is a consistency decision. You are always trading between availability now and data correctness later. Your tooling should force that decision to be explicit, not implicit.
Joke #1: Replication lag is just your database practicing mindfulness—living in the past so it doesn’t worry about the future.
Before we compare MySQL and MariaDB, let’s set the vocabulary the way SREs actually use it:
- Promotion: turning a replica into a writable primary.
- Repointing: changing replicas to follow the new primary.
- Fencing: preventing the old primary from accepting writes (shutdown, network isolation, STONITH, storage fencing).
- Split brain: two nodes accept writes for the same dataset. This is not “an incident,” it’s “a new era of your data.”
- Failover tool: Orchestrator, MHA, MaxScale, ProxySQL scripts, Pacemaker/Corosync, Kubernetes operators, or your own bash that you pretend is “temporary.”
Interesting facts and history that still matters
These aren’t trivia-night facts. They explain why MySQL and MariaDB replication failover differs in subtle, operational ways.
- MariaDB forked from MySQL in 2009 after MySQL’s acquisition path (Sun, then Oracle). Divergence began slowly, then accelerated, especially around GTID and replication features.
- MySQL 5.6 introduced GTIDs as Oracle MySQL’s first mainstream “global transaction ID” feature; MariaDB implemented a different GTID format and behavior.
- MySQL’s multi-threaded replication matured later (notably 5.7+ with parallel replication improvements, then 8.0 enhancements). MariaDB pursued different knobs and implementations; tuning recipes don’t transfer cleanly.
- MySQL 5.7 made InnoDB the default and pushed reliability improvements around crash recovery and replication metadata. MariaDB also improved InnoDB (and variants like XtraDB historically), but the operational symptoms differ.
- MySQL 8.0 changed defaults and removed old behavior: the query cache is gone, the default authentication plugin changed (caching_sha2_password), and the data dictionary became transactional. Mixed-version failover can fail for reasons that feel unrelated to replication.
- MariaDB introduced “domain IDs” in GTIDs which can be useful (multi-source) and confusing (domain mismanagement during failover).
- MySQL’s Group Replication/InnoDB Cluster is a first-class HA track in Oracle MySQL; MariaDB has Galera-based clustering and its own HA story. People compare async replication failover while secretly wanting a consensus system.
- Both ecosystems grew operational tooling separately: Orchestrator started in MySQL land but can work with MariaDB; MaxScale is MariaDB’s proxy with MariaDB-first behaviors. Mixing tools and assuming feature parity is a common failure pattern.
What breaks during failover (MySQL vs MariaDB)
1) GTID failover: “It’s GTID, so it’s safe”… until it isn’t
GTID-based failover is supposed to simplify promotion: pick the most up-to-date replica, promote it, and repoint others using GTID auto-positioning. In practice, GTID does not mean “no data loss,” it means “better bookkeeping.”
MySQL GTID behavior typically centers around gtid_executed, gtid_purged, and GTID sets. Failover tooling often uses MySQL’s GTID set comparisons to find the most advanced replica.
MariaDB GTID behavior uses a different GTID format and includes concepts like domain IDs and sequence numbers. It’s powerful, but it also makes “just compare GTIDs” less uniform across mixed fleets.
What breaks:
- GTID strictness mismatches: a replica that can’t accept certain GTIDs after promotion due to gaps, purge history, or configuration differences.
- Tooling that assumes MySQL-style GTID variables and fails on MariaDB (or interprets them incorrectly).
- Cross-grade failover: the “best” replica is on a slightly different version with different defaults; promotion succeeds, clients fail.
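To see what that bookkeeping looks like in practice, MySQL lets you do the set arithmetic by hand. A minimal sketch, assuming MySQL-style GTIDs and two hypothetical candidate replicas named db-replica-1 and db-replica-2:
cr0x@server:~$ mysql -h db-replica-1 -e "SELECT @@gtid_executed\G"
cr0x@server:~$ mysql -h db-replica-2 -e "SELECT @@gtid_executed\G"
cr0x@server:~$ mysql -h db-replica-1 -e "SELECT GTID_SUBTRACT('<gtid_executed from db-replica-2>', @@gtid_executed) AS missing_here;"
An empty result means db-replica-1 has executed everything db-replica-2 has; anything else is exactly the set it is missing. MariaDB does not ship these set functions; you compare per-domain sequence numbers from gtid_current_pos instead, which is one reason mixed-fleet tooling gets this wrong.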
2) Replication metadata gets weird under stress
During failover, you do a lot of state transitions quickly: stop replication, reset replication info, change primary, start replication, and sometimes rebuild relay logs. The details differ between MySQL and MariaDB, including command syntax and status fields.
In MySQL 8.0 (the new verbs arrived around 8.0.22), replication commands changed (START REPLICA instead of START SLAVE), status output was renamed (SHOW REPLICA STATUS), and replication metadata moved more firmly into transactional tables. MariaDB stayed with the traditional naming longer and has its own behavior around relay logs and GTIDs.
What breaks:
- Runbooks written for one syntax run against the other at 3 a.m. and quietly do nothing.
- Replication info tables become inconsistent after partial commands (especially when automation times out mid-change).
- Relay log corruption appears after a crash, and your “simple” promotion turns into a rebuild.
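One cheap guard against the 3 a.m. syntax problem: have the wrapper script detect the engine before issuing replication commands, so a MySQL-flavored command never silently no-ops against MariaDB. A minimal sketch, assuming shell access and a working mysql client (note that SHOW REPLICA STATUS only exists on MySQL 8.0.22 and later):
# branch the runbook on engine, not on hope (sketch)
ver=$(mysql -N -e "SELECT VERSION();")
case "$ver" in
  *MariaDB*) mysql -e "SHOW SLAVE STATUS\G" ;;    # MariaDB keeps the classic verbs
  *)         mysql -e "SHOW REPLICA STATUS\G" ;;  # MySQL 8.0.22+ terminology
esac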
3) Semi-sync: latency and safety knobs don’t match your mental model
Semi-sync replication reduces the “ack gap” by requiring at least one replica to acknowledge receipt before commit returns. That sounds like durability. It’s not the same as durability across failures unless you understand the ack semantics and timeouts.
MySQL’s semi-sync plugin and MariaDB’s semi-sync implementation behave similarly in spirit, but operationally you’ll see different status counters and different failure signatures. The risk is the same: when semi-sync falls back to async during a turbulence event, your RPO can silently change right when you need it most.
What breaks:
- Failover assumes “no data loss” because semi-sync was enabled, but it had already degraded to async due to timeouts.
- Overly aggressive timeouts cause frequent fallbacks, increasing commit latency jitter and causing app retries that amplify load.
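The degradation is observable if you watch status, not configuration. A minimal check, assuming the MySQL semi-sync plugin with the older master/slave naming (as in Task 9 below); newer MySQL source/replica plugins and MariaDB's built-in semi-sync expose similarly named counters, so verify on your version:
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status'; SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_no_tx'; SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout';"
If the status is OFF or the no_tx counter keeps climbing, commits are being acknowledged without a replica ack, and whatever happened in that window is your real RPO. Alert on the status variable, not on the fact that semi-sync is configured.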
4) Proxies and VIPs: the database might be fine, but traffic isn’t
Most failovers fail at the client layer. Proxies cache backend state, apps cache DNS, connection pools pin sessions, and some drivers behave like toddlers when their favorite endpoint disappears.
MySQL vs MariaDB differences show up when you use ecosystem tools:
- MaxScale (MariaDB’s proxy) can do monitor-driven failover with MariaDB-specific assumptions. It can work with MySQL, but check support boundaries carefully.
- Orchestrator (common in MySQL fleets) is excellent but opinionated. MariaDB support exists, yet features like GTID interpretation can differ depending on versions.
- ProxySQL is database-agnostic-ish but relies on health checks and query rules; you must define “writer” and “reader” correctly and keep it in sync with promotion logic.
What breaks: you promote correctly, but the proxy still routes writes to the old primary for a minute. Congratulations, you just built split brain with extra steps.
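For illustration, here is roughly what keeping ProxySQL in step with a promotion looks like through its admin interface. The hostgroup IDs (10 for the writer, 20 for readers) and hostnames are assumptions for this sketch, not your config:
cr0x@server:~$ mysql -h 127.0.0.1 -P 6032 -u admin -p -e "UPDATE mysql_servers SET hostgroup_id=20 WHERE hostname='db-old-primary'; UPDATE mysql_servers SET hostgroup_id=10 WHERE hostname='db-new-primary'; LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"
If your promotion tooling does not perform (or at least verify) this step, the database-level promotion and the traffic-level promotion are two separate events, and the gap between them is where stray writes land. ProxySQL can also move servers between hostgroups automatically based on read_only via mysql_replication_hostgroups; if you rely on that, your read_only handling becomes part of your routing logic, so test them together.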
5) Non-determinism and statement-based replication ghosts
If you still run statement-based replication (SBR), you are choosing pain. Row-based replication (RBR) is the modern default for a reason: it avoids non-deterministic behavior that turns failover into a forensic exercise.
Some MariaDB deployments keep mixed formats for legacy reasons. Some MySQL deployments do too, usually after a long migration where “temporary” became permanent.
What breaks:
- Replica applies a statement differently due to timezone, collation, SQL mode, or function behavior differences across versions.
- Failover reveals hidden drift: the promoted node has data that is subtly different, not obviously corrupt.
6) “read_only” is not fencing
Both MySQL and MariaDB support read_only; super_read_only is a MySQL-side addition, while MariaDB gates the bypass through privileges instead (READ ONLY ADMIN in newer versions). Either way, these are advisory controls, not a hard fence. Users with sufficient privileges can still write in many scenarios (especially without super_read_only), and background tasks may write as well.
Failover plans that rely on setting read_only=ON on the old primary are plans that assume the old primary will cooperate. Old primaries do not cooperate. They sulk and then accept writes the moment a connection sneaks through.
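Real fencing is enforced outside the database. A minimal sketch on the old primary, assuming root access, a systemd unit named mysql (substitute mariadb where applicable), and port 3306 as the only client path; adapt to nftables, security groups, or your VIP manager:
cr0x@server:~$ sudo systemctl stop mysql && sudo systemctl disable mysql
cr0x@server:~$ sudo iptables -I INPUT -p tcp --dport 3306 -j REJECT
Disabling the unit covers the "systemd helpfully restarts it" failure mode; the firewall rule covers the case where someone starts it anyway. Neither depends on the old primary cooperating, which is the whole point.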
7) Backup/restore during re-seeding: GTID and binlog settings matter
After failover, you often need to rebuild a replica. If your backup tool doesn’t preserve GTID state correctly, or if you restore with the wrong binlog settings, you end up with a replica that cannot join cleanly.
The MySQL vs MariaDB divergence shows up in GTID state handling, plus differences in default configuration and replication variables. Backups that are “compatible with MySQL” are not always “compatible with MariaDB failover GTID positioning,” especially if you cross versions.
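On the MySQL side, the knob that bites most often is mysqldump's GTID handling. A minimal re-seeding sketch for a replica that will use auto-positioning (MariaDB's dump tooling records GTID position differently, so don't reuse this recipe verbatim there):
cr0x@server:~$ mysqldump --single-transaction --set-gtid-purged=ON --all-databases > seed.sql
The dump embeds a SET @@GLOBAL.gtid_purged statement, and restoring it requires the target's GTID state to be empty (typically right after RESET MASTER). Discovering that requirement mid-incident is exactly the kind of surprise this section is about.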
GTID reality: similar acronym, different contract
People say “we use GTID” like it’s a yes/no question. It’s not. It’s “we use GTID, with these constraints, and we know what happens when those constraints break.”
MySQL GTID: sets and auto-positioning
MySQL GTID is usually managed with:
- gtid_mode=ON
- enforce_gtid_consistency=ON
- log_slave_updates=ON (so replicas can become primaries)
- MASTER_AUTO_POSITION=1 (auto-positioning; SOURCE_AUTO_POSITION=1 in newer MySQL syntax)
Failover tooling likes MySQL GTID because it can compare GTID sets. But there’s a catch: GTID sets only tell you what was executed, not what your clients believe committed if the old primary died mid-flight. Semi-sync helps, but it can degrade.
MariaDB GTID: domain IDs and operational sharp edges
MariaDB GTID is different. Domain IDs can represent different replication domains. That can be useful for multi-source replication and for tracking streams. It can also introduce failure modes where a promoted node has a GTID state that looks “complete” but doesn’t match what other replicas expect.
MariaDB also has settings like GTID strict mode (depending on version) that change whether gaps are tolerated. That’s good for correctness, but it can cause promotions to fail hard in exactly the moment you want a soft landing.
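Know your strictness posture before the incident, not during it. A quick check on a MariaDB node (these variable names are MariaDB-specific):
cr0x@server:~$ mysql -N -e "SELECT @@gtid_strict_mode, @@gtid_domain_id, @@gtid_binlog_pos, @@gtid_slave_pos;"
With strict mode on, a promotion that would produce out-of-order sequence numbers within a domain fails loudly instead of diverging silently. Decide in advance which behavior you want; toggling it mid-incident is how "temporary" settings become permanent.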
What to do with this knowledge
If you’re homogeneous (all MySQL or all MariaDB), pick the native GTID path and commit to it. Don’t treat GTID as a migration checkbox. Treat it as a design: backup method, replica rebuild, promotion procedure, and monitoring all must match.
If you’re mixed (MySQL and MariaDB), stop pretending replication failover is symmetrical. Either decouple promotion tooling per engine, or standardize on one engine for the writer role. Mixed-engine replication can work for some use cases, but failover is where the incompatibilities show up first.
Topology choices that change your blast radius
Async primary-replica with manual failover: the honest baseline
This is the simplest system that can possibly work. It also forces humans to understand the tradeoffs because a human presses the button.
Pros: easy to reason about; fewer moving parts. Cons: slower recovery; human errors; inconsistent procedure unless you drill.
Async with orchestrated failover: automation that can hurt you faster
Orchestrated failover is great when it’s designed with fencing and careful candidate selection. It’s dangerous when it’s “monitor sees primary down → promote something.” The monitor is not your data correctness layer.
Candidate selection should consider (a minimal comparison sketch follows this list):
- Replication lag and apply lag (not just IO thread running)
- Errant transactions (especially with GTID)
- Server health (disk full, crash recovery, filesystem errors)
- Network reachability from clients (a promoted node in a dead subnet is a very fast outage)
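A minimal sketch of that comparison pass, assuming a MySQL 8.0 fleet with GTID, SSH access, and hypothetical replica hostnames; it gathers evidence, it does not make the decision for you:
# survey candidates: GTID state, apply state, disk pressure (sketch)
for h in db-replica-1 db-replica-2 db-replica-3; do
  echo "== $h =="
  mysql -h "$h" -e "SELECT @@gtid_executed\G"
  mysql -h "$h" -e "SHOW REPLICA STATUS\G" | egrep "Replica_SQL_Running|Seconds_Behind_Source|Last_SQL_Error"
  ssh "$h" "iostat -x 1 2 | tail -n 20"
done
Send the output somewhere durable; after the incident, the record of why you picked a candidate is worth almost as much as the pick itself.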
“Multi-primary” dreams: Group Replication, Galera, and reality checks
If you truly need automatic failover without data loss, you’re probably looking for a consensus-based system. In MySQL land, that’s typically Group Replication/InnoDB Cluster. In MariaDB land, it’s often Galera-based solutions.
But: those are not “replication with failover,” they are different beasts with different failure modes, performance characteristics, and operational requirements.
Async replication with failover can be perfectly fine if you accept RPO>0 and you have good fencing. It can also be a career-limiting move if you pretend it’s strong consistency.
Fast diagnosis playbook
When failover goes wrong, you need to find the bottleneck quickly. Not the philosophical bottleneck. The actual thing preventing you from restoring write service safely.
First: establish authority and prevent split brain
- Who is currently writable? Check read_only/super_read_only and whether clients are writing successfully.
- Fence the old primary (shutdown, isolate network, disable VIP, block at proxy). If you can’t fence, you don’t have failover; you have a wager.
- Stop automation loops that might re-promote, reparent, or flip VIPs back and forth.
Second: find the most correct candidate
- Compare GTID positions (or binlog file/pos if non-GTID). Identify the replica(s) closest to the old primary.
- Check SQL thread state: Is it applying? Stalled? Error? Waiting for metadata lock?
- Check InnoDB health: crash recovery ongoing, dirty page flush backlog, IO saturation.
Third: move traffic without lying to the app
- Update proxies/VIP/DNS and verify from multiple app subnets.
- Confirm writer/reader separation in proxies; make sure writes actually hit the new primary.
- Watch error rates and latency. A “successful” failover that causes a 10x latency spike is still an outage—just quieter.
Fourth: repair the replica set
- Repoint replicas to new primary and validate they replicate cleanly.
- Rebuild any replicas that have relay log corruption or errant transactions.
- Confirm backups and binlog retention are sufficient for future rebuilds.
Practical tasks: commands, outputs, and the decision you make
These are the tasks you actually run during a messy failover. Each includes: the command, what typical output means, and the decision you make from it. Commands are shown as if you have shell access and MySQL client connectivity.
Task 1: Identify engine/version quickly (because syntax and variables differ)
cr0x@server:~$ mysql -N -e "SELECT VERSION(), @@version_comment;"
10.6.18-MariaDB-log MariaDB Server
What it means: You’re on MariaDB 10.6. Don’t paste MySQL 8.0 replication commands without thinking.
Decision: Use MariaDB status fields (SHOW SLAVE STATUS) and MariaDB GTID variables. Adjust runbook branch.
Task 2: Check whether the server is currently writable (MySQL variables shown; on MariaDB, check @@read_only)
cr0x@server:~$ mysql -N -e "SELECT @@read_only, @@super_read_only;"
0 0
What it means: This node can accept writes.
Decision: If this is not the intended primary during an incident, you must fence or flip it to read-only immediately (and also fix traffic routing).
Task 3: Confirm replication threads and lag (MySQL 8 naming shown)
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Last_SQL_Error"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 2
Last_SQL_Error:
What it means: Replication is running with minimal lag.
Decision: This replica is a good promotion candidate from a lag perspective. Still verify GTID state and health.
Task 4: Same check on MariaDB (field names differ)
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_SQL_Error"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Last_SQL_Error:
What it means: MariaDB replica is caught up (as far as this metric can tell).
Decision: Still verify it isn’t lying: check GTID/binlog positions and error logs if the primary died uncleanly.
Task 5: Check GTID execution state (MySQL)
cr0x@server:~$ mysql -e "SELECT @@gtid_executed\G"
*************************** 1. row ***************************
@@gtid_executed: 3E11FA47-71CA-11E1-9E33-C80AA9429562:1-984321
What it means: This node has executed GTIDs through 984321 for that server UUID.
Decision: Compare across replicas; the one with the superset is usually the best candidate (subject to errant transactions and health).
Task 6: Check GTID state (MariaDB)
cr0x@server:~$ mysql -N -e "SELECT @@gtid_current_pos, @@gtid_slave_pos;"
0-1-542118 0-1-542118
What it means: Domain 0, server_id 1, sequence 542118; replica has applied up to that position.
Decision: Compare candidate replicas. If you see multiple domains unexpectedly, stop and map what they represent before promotion.
Task 7: Detect replication break due to missing row (common after unsafe settings)
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Last_SQL_Errno|Last_SQL_Error"
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table app.orders; Can't find record in 'orders', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND
What it means: Data drift exists; replica cannot apply events cleanly.
Decision: Do not promote this replica unless you accept corruption. Rebuild from a clean backup or repair drift deliberately (pt-table-checksum style approaches—whatever your environment allows).
Task 8: Confirm binlog format and durability knobs (promotion readiness)
cr0x@server:~$ mysql -N -e "SELECT @@log_bin, @@binlog_format, @@sync_binlog, @@innodb_flush_log_at_trx_commit;"
1 ROW 1 1
What it means: Binary logging enabled, row-based, and durable-ish settings (sync_binlog=1, flush at commit).
Decision: Good. If you see sync_binlog=0 or innodb_flush_log_at_trx_commit=2, you must assume higher risk of losing committed transactions on crash—this changes your failover RPO expectations.
Task 9: Check if semi-sync is actually active (MySQL plugin counters)
cr0x@server:~$ mysql -e "SHOW STATUS LIKE 'Rpl_semi_sync_master_status';"
+-----------------------------+-------+
| Variable_name | Value |
+-----------------------------+-------+
| Rpl_semi_sync_master_status | OFF |
+-----------------------------+-------+
What it means: Semi-sync is not currently active, regardless of config. It likely fell back to async.
Decision: During failover, assume there may be missing transactions on replicas. Choose candidate based on actual applied state, not on hoped-for safety.
Task 10: Validate the “old primary” is truly fenced
cr0x@server:~$ ssh db-old-primary "systemctl is-active mysql || systemctl is-active mariadb"
inactive
What it means: MySQL/MariaDB service is stopped on the old primary.
Decision: Good, but not sufficient if the host can come back. If you have automated restarts, disable them or isolate the node at the network layer too.
Task 11: Confirm who clients are actually connected to (catch proxy lies)
cr0x@server:~$ mysql -N -e "SELECT SUBSTRING_INDEX(HOST, ':', 1) AS client_host, COUNT(*) FROM information_schema.processlist GROUP BY 1 ORDER BY 2 DESC LIMIT 5;"
10.12.4.31 87
10.12.4.18 54
10.12.4.22 41
What it means: Lots of connections from app subnets—this node is receiving traffic.
Decision: If you promoted this node, great. If you didn’t, you need to fix routing immediately and consider whether writes happened here unexpectedly.
Task 12: On candidate replica, stop replication cleanly before promotion
cr0x@server:~$ mysql -e "STOP REPLICA;"
Query OK, 0 rows affected (0.02 sec)
What it means: Replication stopped (MySQL 8 syntax).
Decision: Now you can safely evaluate state and promote without it continuing to apply events from a potentially dead source.
Task 13: Make the promoted node writable (and verify)
cr0x@server:~$ mysql -e "SET GLOBAL super_read_only=OFF; SET GLOBAL read_only=OFF;"
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
cr0x@server:~$ mysql -N -e "SELECT @@read_only, @@super_read_only;"
0 0
What it means: Node is writable.
Decision: Only do this after fencing the old primary. If you can’t fence, keep the candidate read-only and switch to a controlled outage while you resolve fencing.
Task 14: Repoint another replica using GTID auto-positioning (MySQL)
cr0x@server:~$ mysql -e "STOP REPLICA; CHANGE REPLICATION SOURCE TO SOURCE_HOST='db-new-primary', SOURCE_USER='repl', SOURCE_PASSWORD='***', SOURCE_AUTO_POSITION=1; START REPLICA;"
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.04 sec)
Query OK, 0 rows affected (0.01 sec)
What it means: Replica now follows the new primary with GTID-based positioning.
Decision: Immediately check status for errors and lag; if it doesn’t catch up, you may have errant transactions or purged GTIDs.
Task 15: Repoint a MariaDB replica (classic syntax and GTID)
cr0x@server:~$ mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_HOST='db-new-primary', MASTER_USER='repl', MASTER_PASSWORD='***', MASTER_USE_GTID=slave_pos; START SLAVE;"
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.05 sec)
Query OK, 0 rows affected (0.01 sec)
What it means: MariaDB replica reconfigured to use GTID starting from its own applied position.
Decision: If replication errors mention GTID strictness or duplicates, stop and reconcile GTID positions; do not keep “skip counter” as a lifestyle.
Task 16: Verify new primary binlog is enabled (or replicas can’t follow)
cr0x@server:~$ mysql -N -e "SELECT @@log_bin;"
1
What it means: Binary logging is enabled.
Decision: If this returns 0, replicas cannot replicate from this node. You either misconfigured it, or you promoted the wrong host type (a “read replica” profile). Fix config and restart, or pick a different candidate.
Task 17: Check for long transactions blocking apply (often mistaken for “replication lag”)
cr0x@server:~$ mysql -e "SHOW ENGINE INNODB STATUS\G" | egrep -n "TRANSACTIONS|LOCK WAIT|history list length" | head
3:TRANSACTIONS
45:---TRANSACTION 18446744073709551615, not started
112:History list length 987654
What it means: A very large history list length can indicate purge lag, long-running transactions, or heavy write workload; it often correlates with replication apply struggling.
Decision: Before promoting a lagging replica, understand whether it’s behind due to IO saturation, lock waits, or purge. The “most up-to-date” replica that’s also overloaded may fall over immediately after promotion.
Task 18: Confirm disk saturation (the silent failover killer)
cr0x@server:~$ iostat -x 1 3
Linux 5.15.0 (db-replica-2) 12/29/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 4.20 38.50 0.00 45.20
Device r/s w/s rkB/s wkB/s await aqu-sz %util
nvme0n1 120.0 980.0 5200.0 42000.0 35.4 18.2 99.8
What it means: Disk is pegged (%util ~100), high await. Replication apply and client writes will suffer.
Decision: Don’t promote this host if you have another candidate with healthier IO. If it’s the only candidate, throttle workload, increase buffer pool safety, and brace for a slow recovery.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “read_only means no writes”
They had a standard two-node setup: one primary, one replica, plus a proxy. The failover plan was a single page: if primary dies, set the old primary to read_only=ON, promote the replica, and flip the proxy writer endpoint.
The day came. A kernel upgrade rebooted the primary unexpectedly. Monitoring paged, automation ran, and the replica was promoted. The proxy flipped. Everyone exhaled.
Then the primary came back. Systemd restarted the database service automatically. The proxy still had a stale connection pool to the old primary from before the failover, and some application nodes had hard-coded fallbacks. A privileged user existed for “maintenance,” with enough rights to bypass read_only behavior in ways the team didn’t understand. Writes trickled into the old primary for several minutes.
When they later repointed that old primary as a replica, it refused: errant transactions existed. That’s the polite failure mode. The impolite one is when it accepts replication anyway and you ship garbage silently.
The fix wasn’t “set read_only harder.” The fix was fencing: either shut the old primary down and keep it down, or isolate it at the network layer so it cannot accept client traffic. They also removed “helpful fallback hosts” from application configs and made the proxy the single source of truth.
An optimization that backfired: “Let’s make commits faster”
A team was chasing latency. They had a write-heavy service and occasional replication lag spikes. Someone proposed lowering durability settings: innodb_flush_log_at_trx_commit=2 and sync_binlog=0. The argument: “We can tolerate losing a second of data; we have replication anyway.”
It worked—until it didn’t. A primary crashed during a storage hiccup. The database restarted cleanly, but the last window of “committed” transactions wasn’t actually durable on disk. Some transactions were acknowledged to clients but missing from the binlog and InnoDB redo state.
Failover happened to the most advanced replica. The app came back. Then the incident turned into a compliance issue because downstream systems had received confirmations for events that now never existed in the database. The bug report read like a conspiracy theory, because the application had logs saying “success,” and the database had no record of it.
They eventually restored correctness by replaying events from an external queue and reconciling. Nobody enjoyed it. The optimization did not just increase RPO; it made failures non-intuitive. That’s the real cost.
The eventual compromise: keep durable settings on the primary, use better indexing and batching, and invest in semi-sync (with monitoring for when it falls back). Performance tuning that changes correctness should be treated like a product decision, not a quick win.
A boring but correct practice that saved the day: “We rehearsed”
A different company had what looked like overkill: a monthly failover drill, a written runbook with engine-specific branches, and a small script that performed read/write probes through the same path as the application.
During a real incident (network partition between AZs), the primary became unreachable from most apps but still reachable from a few. That is the split-brain prequel. The automation paused because the health checks disagreed across vantage points—by design.
The on-call followed the playbook: fence the primary at the network layer, then promote the best replica based on GTID state and disk health. The probe script verified that “writer traffic” really reached the new primary, not just that TCP was open.
The outage was not fun, but it was short and controlled. Later, in the postmortem, the team’s proudest achievement was not “five nines.” It was that nobody argued about what happened. The evidence was clean because they had rehearsed the evidence collection steps, not just the failover steps.
Rehearsal is boring. It’s also one of the few things that reliably converts a theoretical HA design into a practical one.
Common mistakes: symptoms → root cause → fix
1) Symptom: promotion “succeeds,” but application still writes to the old primary
Root cause: No fencing; proxy still routes or apps have fallback host lists; stale DNS/connection pools.
Fix: Fence the old primary (shutdown + network isolation), make proxy the single writer endpoint, and validate with a write probe through the app path.
2) Symptom: replicas won’t connect to new primary after promotion
Root cause: Promoted node has log_bin=OFF, missing replication user grants, or wrong server_id.
Fix: Ensure candidate replicas are built with “promotion-ready” config: unique server_id, binlog enabled, replication user present, log_slave_updates set where required.
3) Symptom: GTID-based reparent fails with duplicate or missing GTIDs
Root cause: Errant transactions on one replica, GTID purged inconsistently, or mixed GTID modes across fleet.
Fix: Standardize GTID config everywhere. During incident, pick a candidate that is a superset of executed transactions; rebuild inconsistent replicas rather than forcing them with skips.
4) Symptom: Seconds_Behind_* is small but data is missing after failover
Root cause: Metric lies under certain conditions (SQL thread stopped, multi-threaded apply, clock issues) or semi-sync had fallen back to async.
Fix: Verify with GTID/binlog position comparisons and application-level reconciliation signals. Monitor semi-sync status counters, not just configuration flags.
5) Symptom: replication lag explodes right after failover
Root cause: Promoted node was already IO-saturated, or buffer pool is cold, or a long transaction blocks purge and causes contention.
Fix: Choose promotion candidates based on health (IO, CPU, InnoDB metrics), not just “most up-to-date.” Warm replicas, use sensible buffer pool sizing, and avoid failover to a host that’s melting.
6) Symptom: failover script works on MySQL but fails on MariaDB (or vice versa)
Root cause: Syntax differences (START REPLICA vs START SLAVE), variable differences, and GTID semantics differences.
Fix: Maintain engine-specific automation paths. Detect engine/version before executing. Treat “compatible enough” as a risk, not a comfort.
7) Symptom: after failover, replicas show “connecting” forever
Root cause: Network ACL/firewall not updated, wrong source address, or proxy/VIP moved but replication uses direct hostnames.
Fix: Keep replication traffic paths explicit and tested. Use dedicated replication hostnames that are routable from replicas. Validate connectivity with TCP checks from replica subnets.
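A quick validation from a replica host (hostname and account are illustrative):
cr0x@server:~$ nc -vz db-new-primary 3306
cr0x@server:~$ mysql -h db-new-primary -u repl -p -e "SELECT 1;"
The first proves the route and port are open; the second proves the replication account can actually authenticate from that subnet, which is a different and more common failure.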
8) Symptom: replica SQL thread stops with “DDL mismatch” or metadata errors
Root cause: Mixed versions/SQL modes/collations; statement-based replication; or a DDL executed differently.
Fix: Use row-based replication. Standardize SQL mode and collation. Avoid cross-version failover when possible; if unavoidable, test DDL replication behavior before production.
Checklists / step-by-step plan
Failover readiness checklist (do this before you need it)
- Fencing exists and is tested: you can stop the old primary from accepting traffic even if it reboots.
- Promotion-ready config on replicas: binlog enabled, unique server_id, replication users present, correct GTID settings.
- Consistent durability stance: decide whether you accept crash-loss; don’t let it be an accidental performance tweak.
- Row-based replication: binlog_format=ROW unless you have a proven reason not to.
- Runbook with engine branches: MySQL vs MariaDB syntax differences captured and rehearsed.
- Health signals: replication apply, disk latency, InnoDB metrics, semi-sync active status, proxy routing state.
- Write probe tooling: a minimal script that confirms the new writer actually accepts writes through the real client path (a sketch follows this checklist).
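A minimal write-probe sketch, assuming a dedicated probe schema and that the probe connects through the same proxy endpoint the application uses; every name here is illustrative:
cr0x@server:~$ mysql -h app-writer-endpoint -u probe -p probe_db -e "CREATE TABLE IF NOT EXISTS failover_probe (id INT AUTO_INCREMENT PRIMARY KEY, noted_at DATETIME(3) NOT NULL, src VARCHAR(64) NOT NULL); INSERT INTO failover_probe (noted_at, src) VALUES (NOW(3), @@hostname); SELECT @@hostname, @@read_only;"
If @@hostname is not the node you just promoted, or @@read_only is 1, the routing layer is lying to you; fix that before declaring the failover done.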
Incident failover plan (the controlled version)
- Stop the bleeding: pause automation, reduce write traffic if possible, and freeze schema changes.
- Fence the old primary: shutdown + network isolation. Verify it’s down and unreachable from clients.
- Pick a candidate: compare GTID/binlog positions; reject any replica with SQL errors or drift; check IO health.
- Stop replication on candidate: prevent it from applying further events from a compromised source.
- Promote: set read-only off; ensure binlog on; verify it can accept writes.
- Move traffic: update proxy/VIP/DNS; verify with a write probe; watch error rates.
- Reparent replicas: use GTID where possible; verify each replica catches up and stays stable.
- Post-failover cleanup: rebuild broken replicas, validate backups, and document exactly what you observed.
When to choose MySQL vs MariaDB for replication failover-heavy environments
Choose MySQL (especially 8.0+) if:
- You want a clearer “single vendor” HA story with Group Replication/InnoDB Cluster as an eventual option.
- Your tooling and staff muscle memory is already MySQL-oriented (Orchestrator, MySQL shell, 8.0 semantics).
- You value predictable GTID set behavior and common operational patterns across managed offerings.
Choose MariaDB if:
- You are invested in MariaDB ecosystem tooling like MaxScale and understand its failover behavior deeply.
- You benefit from MariaDB-specific features and are willing to treat GTID and failover as MariaDB-native, not “MySQL-like.”
- You plan around MariaDB’s replication capabilities and your fleet is homogeneous enough to avoid compatibility traps.
Avoid mixed-engine failover unless you’ve tested the exact versions, GTID mode, binlog settings, and tooling behavior under failure. “It replicates in staging” is not a proof; it’s a starting rumor.
FAQ
1) Can I fail over between MySQL and MariaDB replicas?
You can sometimes replicate between them in certain version combinations and configurations, but failover is risky. GTID semantics differ, and tooling often assumes one engine. If you must, keep the writer on one engine family and treat the other as a read-only consumer, not a promotion target.
2) Does GTID guarantee no data loss during failover?
No. GTID helps you identify what executed where and helps reparent replicas. Data loss depends on what transactions were acknowledged on the old primary before it died and whether replicas received/applied them.
3) Is semi-sync enough to guarantee zero data loss?
Not by itself. Semi-sync can fall back to async under load or network issues. You must monitor whether it’s active at the time of failure and understand the ack requirements you configured.
4) Why does “Seconds_Behind_Master” sometimes lie?
Because it’s an estimate based on timestamps and thread state. If SQL thread is stopped, if relay logs are delayed, if clocks drift, or if parallel apply is in play, the number can be misleading. Use GTID/binlog position comparisons and error status as primary signals.
5) What’s the single most important failover control?
Fencing. If you cannot guarantee the old primary won’t accept writes, everything else is theater.
6) Should I use statement-based or row-based replication for failover?
Row-based. Statement-based and mixed formats create non-deterministic drift and version-dependent behavior. Failover is when drift becomes visible—and expensive.
7) What tools should I use for failover?
Use a tool that understands your engine and GTID mode, and supports fencing or integrates with something that can fence (or at least pause promotions until a human confirms fencing). Orchestrator is common in MySQL fleets; MaxScale is common in MariaDB fleets; ProxySQL can work as a routing layer but needs correct health logic.
8) How do I pick the best replica to promote?
Pick the replica with (a) the most complete executed/applied transaction set, (b) no replication errors, and (c) good host health (disk latency, crash recovery not ongoing). “Most advanced but overloaded” is a trap.
9) Why do promotions sometimes fail due to “errant transactions”?
Because a replica executed transactions not present on the primary’s replication stream (manual writes, bad automation, or previous split brain). GTID makes this visible. The fix is to rebuild or surgically reconcile—usually rebuild.
10) How often should we rehearse failover?
Monthly if you can, quarterly if you must. Rehearsal is how you discover that your proxy caches DNS for 10 minutes, or that your “fencing” is actually a polite request.
Conclusion: next steps you can do this week
Failover doesn’t break because MySQL or MariaDB is “bad.” It breaks because you relied on assumptions you never verified under failure. The engines differ in GTID semantics, replication command surfaces, and ecosystem tooling expectations. Those differences are manageable—if you acknowledge them.
Do these next, in order:
- Write down your failover contract: acceptable RPO/RTO, who decides to lose data (if needed), and what “success” means (not “the database is up,” but “writes go to exactly one place”).
- Implement real fencing: not read_only. Something that stops old primary writes even if it reboots.
- Standardize replication mode: row-based, consistent SQL mode/collation, consistent GTID settings per engine.
- Build a promotion-ready replica profile: binlog enabled, server_id unique, replication users and privileges present, and monitored.
- Run a failover drill and collect the evidence the way you would during a real incident: status outputs, IO stats, proxy state. Fix the first surprise you find. There will be surprises.
Paraphrased idea from John Allspaw: reliability comes from learning continuously, not from pretending systems are predictable under pressure.
Joke #2: The only thing more optimistic than “failover is automatic” is “the runbook is up to date.”