You upgraded in staging. Tests passed. Dashboards were green. Then production hit you with a 2 a.m. mystery: replication lag climbing, p95 latency doubling,
and a slow query log that reads like a thriller novel written by a nervous ORM.
The “works on staging” lie is rarely malicious. It’s usually just physics: different data shape, different concurrency, different storage behavior, different
edge-case SQL, different plugins, different TLS, and the one thing nobody wants to admit—different human urgency.
MariaDB vs Percona Server: what actually changes when you upgrade
“MariaDB vs Percona Server” is often framed as a philosophical debate. In production, it’s more like moving apartments: your furniture mostly fits, but the
doorframes are different, the water pressure changes, and the landlord has opinions about what counts as “standard.”
Percona Server is a MySQL fork with extra instrumentation, performance features, and operational knobs. MariaDB is its own fork with its own optimizer path,
replication choices, storage engines, and versioning trajectory. Both run InnoDB (MariaDB shipped Percona's XtraDB as its default InnoDB through the 10.1 era, then returned to its own InnoDB lineage, which has since diverged from upstream),
both speak “mostly MySQL,” and both will happily accept your application’s SQL right up until the day they don’t.
Compatibility isn’t a checkbox; it’s a matrix
When people say “compatible,” they usually mean “clients connect and basic CRUD works.” That’s not the bar. Your bar is:
same correctness, same performance envelope, same failure behavior under real load and real data.
The tricky parts when upgrading between MariaDB and Percona Server (or upgrading within either family) tend to cluster in five places:
- SQL semantics and optimizer plans: same query, different plan; same plan, different row estimates; different defaults for SQL modes.
- Replication behavior: GTID flavor, binlog format defaults, crash-safety, parallel replication tuning.
- InnoDB/redo/undo/log behavior: redo log sizing, flush policy defaults, doublewrite changes, checksum defaults.
- Authentication/TLS/plugins: auth plugins, default TLS versions/ciphers, plugin availability and naming.
- Operational tooling assumptions: Percona Toolkit expectations, system tables, performance_schema/information_schema differences.
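Before debating individual differences, get a ground-truth diff of the two servers you actually run. A minimal sketch using Percona Toolkit's pt-config-diff to compare runtime variables across hosts (hostnames and credentials are placeholders, and the output below is illustrative):
cr0x@server:~$ pt-config-diff h=staging-db,u=audit,p=*** h=prod-db,u=audit,p=***
3 config differences
Variable                        staging-db           prod-db
=============================== ==================== ====================
binlog_format                   MIXED                ROW
innodb_flush_log_at_trx_commit  2                    1
sql_mode                        STRICT_TRANS_TABLES  ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,...
If the tool isn't available, a sorted dump of SHOW GLOBAL VARIABLES from each host piped through diff gets you most of the way there.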
If you’re migrating across families (MariaDB ↔ Percona Server), the biggest mistake is thinking the upgrade story is the same as
“MySQL minor version bump.” It isn’t. You’re crossing a divergence boundary, and divergence is where staging lies the loudest.
One paraphrased idea from Gene Kim (reliability/operations author): improving outcomes comes from improving the system, not from asking people to try harder under pressure.
Treat your upgrade like system design, not hero work.
Historical context you can use in a postmortem
History doesn’t fix outages, but it explains why your “simple upgrade” has sharp edges. Here are facts worth keeping in your head when you plan:
- MariaDB started as a fork after Oracle acquired MySQL (2009–2010 era), which shaped its “open governance” positioning and divergence pace.
- Percona Server grew out of production tuning culture: the fork prioritized performance instrumentation and operational controls for busy systems.
- GTID is not one universal thing: MariaDB’s GTID implementation differs from Oracle MySQL/Percona’s GTID, which matters in cross-family replication and failover tooling.
- Performance Schema matured significantly over MySQL 5.6 → 5.7 → 8.0, and Percona typically tracks MySQL’s model; MariaDB’s instrumentation and sys schema story differs.
- MariaDB introduced features with no MySQL equivalent (e.g., some optimizer strategies, system-versioned tables), which can change execution behavior.
- Percona Toolkit became a de facto operational standard in many companies; it works best when the server behaves like MySQL/Percona expectations.
- InnoDB defaults have shifted across generations (redo log sizing, flushing heuristics, metadata handling). Upgrades often change defaults even if your config file doesn’t.
- Authentication plugins evolved (caching_sha2_password in MySQL 8.0 world; varied defaults elsewhere), making “client works in staging” dependent on the exact connector build.
- UTF8 got political: the difference between utf8 (3-byte) and utf8mb4 (4-byte) and how collations default has caused countless “it passed tests” surprises.
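The utf8 history is also the easiest one to turn into a concrete check. A quick inventory of 3-byte utf8 columns via information_schema (the schema name 'app' is a placeholder; newer servers report the charset as utf8mb3, and the rows below are illustrative):
cr0x@server:~$ mysql -NBe "SELECT table_name, column_name, character_set_name, collation_name FROM information_schema.columns WHERE table_schema='app' AND character_set_name IN ('utf8','utf8mb3');"
orders	note	utf8	utf8_general_ci
customers	legacy_ref	utf8	utf8_general_ci
An empty result is what you want; anything else is a conversion task to schedule before, not during, the upgrade window.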
Why staging lies: the top failure modes
1) Your data distribution is fake
Staging usually has fewer rows, fewer hot keys, fewer pathological outliers, and less fragmentation. InnoDB behaves differently when the working set fits in
memory versus when it doesn’t. The optimizer behaves differently when histograms and index statistics represent real skew.
A classic lie: staging has a uniform user_id distribution; production has one “enterprise tenant” that owns half the rows. Your “safe” query becomes a
mutex hotspot under load.
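That skew is measurable, so measure it in both environments. A rough sketch, assuming an orders table keyed by customer_id (table and column names are placeholders, and the counts are illustrative); run it on a replica or off-peak, because it walks the whole index:
cr0x@server:~$ mysql -NBe "SELECT customer_id, COUNT(*) AS rows_owned FROM app.orders GROUP BY customer_id ORDER BY rows_owned DESC LIMIT 3;"
100042	48211020
88431	122301
77215	119876
If the top key owns orders of magnitude more rows in production than in staging, your staging plans and lock behavior are not evidence of anything.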
2) Your concurrency is imaginary
Most staging load tests run with a polite concurrency level and a warm cache. Production is a stampede. Lock contention, purge lag, redo pressure, and
fsync cost are all nonlinear. You can’t extrapolate from “works at 50 QPS” to “works at 5,000 QPS” just because the query is the same.
3) Defaults changed, but you didn’t notice
An upgrade can change defaults for binlog format, SQL modes, InnoDB flushing heuristics, temp table handling, TLS versions, and more. If staging uses a
hand-tuned my.cnf but production has config drift (or vice versa), you don’t have an upgrade test—you have a comparison of two different universes.
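On the MySQL/Percona side you can ask the server where each setting's value came from, which turns "did a default change underneath us?" into a query instead of archaeology. MariaDB doesn't ship this table, so there you fall back to diffing sorted SHOW GLOBAL VARIABLES dumps. Variable list, paths, and output are illustrative:
cr0x@server:~$ mysql -NBe "SELECT VARIABLE_NAME, VARIABLE_SOURCE, VARIABLE_PATH FROM performance_schema.variables_info WHERE VARIABLE_NAME IN ('sql_mode','binlog_format','innodb_flush_log_at_trx_commit','internal_tmp_mem_storage_engine');"
binlog_format	GLOBAL	/etc/mysql/mysql.conf.d/mysqld.cnf
innodb_flush_log_at_trx_commit	GLOBAL	/etc/mysql/mysql.conf.d/mysqld.cnf
internal_tmp_mem_storage_engine	COMPILED
sql_mode	COMPILED
Anything marked COMPILED is riding on the build's default and is free to change when the build changes; pin it explicitly if you care about it.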
4) The storage stack is not the same
“Same instance type” is not “same I/O.” Your staging volume might be quiet, your production volume might be contended. Your staging might be local NVMe,
production might be network-attached. Your filesystem mount options might differ. And yes, your RAID controller cache policy matters.
Joke #1: Storage is the place where “but it’s SSD” goes to die, usually during the one maintenance window you can’t extend.
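If you want numbers instead of vibes, measure fsync latency on the actual volume with something like fio. A minimal sketch (directory, size, and runtime are placeholders; point it at a scratch path on the same volume, never inside the live datadir, and treat the output as illustrative):
cr0x@server:~$ fio --name=fsynctest --directory=/var/lib/mysql-scratch --ioengine=sync --rw=write --bs=16k --size=256m --fdatasync=1 --runtime=60 --time_based
fsynctest: (groupid=0, jobs=1): err= 0
  write: IOPS=806, BW=12.6MiB/s
  fsync/fdatasync/sync_file_range:
    sync (usec): min=302, max=48211, avg=1184.77
    sync percentiles (usec): 99.00th=[ 9372], 99.90th=[31065]
Compare staging and production side by side; if the p99 sync latency differs by an order of magnitude, your upgrade test told you nothing about commit latency.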
5) Observability differs, so you notice problems later
Staging often has more debug logging, less traffic, and fewer compliance constraints. Production often has stricter log retention, fewer metrics, and more
things yelling at the CPU. When you upgrade, you also change what’s measurable and what’s expensive to measure.
6) Your application behavior changes under production feature flags
Staging rarely runs the exact same feature flags, timeouts, retry logic, circuit breakers, queue depths, and batch sizes. A database upgrade can expose a
minor change in query latency, which triggers a retry storm, which becomes a write amplification event, which becomes your weekend.
Fast diagnosis playbook (first/second/third)
When the upgrade goes sideways, you don’t have time to admire graphs. You need to find the bottleneck quickly and pick a safe mitigation. Here’s a fast
playbook that works across MariaDB and Percona Server.
First: is the database CPU-bound, I/O-bound, or lock-bound?
- CPU-bound: high mysqld CPU, high “executing” time, lots of handler reads, plans changed, missing indexes, or different optimizer choices.
- I/O-bound: high iowait, low buffer pool hit rate, redo/fsync pressure, temp tables spilling, checkpoint age issues.
- Lock-bound: threads waiting on row locks, metadata locks, or internal latches; replication SQL thread stuck; DDL blocking.
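A crude but fast way to answer that first question from inside the server is to count what active sessions are doing. This works on both families; state names vary by version, and the output is illustrative:
cr0x@server:~$ mysql -NBe "SELECT state, COUNT(*) AS sessions FROM information_schema.processlist WHERE command <> 'Sleep' GROUP BY state ORDER BY sessions DESC LIMIT 5;"
Waiting for table metadata lock	212
updating	45
Sending data	31
statistics	12
waiting for handler commit	9
A pile of lock or commit waits points at lock-bound or I/O-bound; mostly "Sending data"/"executing" with hot CPUs points at plans and indexes.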
Second: is the pain concentrated in one query pattern?
Look for one or two dominant fingerprints: a query that changed plan, a background thread saturating I/O, or a replication apply bottleneck. If the top
offender explains more than ~30–40% of time, treat it as the incident cause until proven otherwise.
Third: is it correctness, performance, or stability?
- Correctness: data mismatches, silent truncation, collation changes, SQL mode differences, timezone drift.
- Performance: latency regression, throughput drop, replication lag, connection storms.
- Stability: crashes, OOM kills, disk full, corrupt logs, plugin failures, TLS negotiation failures.
Your response differs. Correctness issues often mean rollback or read-only mode until verified. Performance issues usually mean mitigating load and reverting
configs. Stability issues mean stop the bleeding: keep the data safe, then diagnose.
Practical verification tasks (commands, outputs, decisions)
These are the tasks that separate “we tested” from “we proved.” Each task has: a command, an example output, what it means, and what decision to make.
Run them on staging and production before and after the upgrade. Diff the results. If you don’t diff, you’re basically trusting vibes.
Task 1: Confirm server identity and version lineage
cr0x@server:~$ mysql -NBe "SELECT VERSION(), @@version_comment, @@version_compile_machine;"
8.0.36-28 Percona Server (GPL), Release 28, Revision 1234abcd x86_64
Meaning: VERSION() and version_comment tell you whether you’re on MariaDB, Percona, or something else wearing a costume.
Decision: If it’s not the exact intended distribution/version, stop. Your “upgrade” might be a mixed package repo problem.
Task 2: Dump and diff runtime variables that cause behavior changes
cr0x@server:~$ mysql -NBe "SHOW VARIABLES WHERE Variable_name IN ('sql_mode','binlog_format','transaction_isolation','innodb_flush_log_at_trx_commit','sync_binlog','character_set_server','collation_server','log_bin','gtid_mode','enforce_gtid_consistency');"
sql_mode ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
binlog_format ROW
transaction_isolation REPEATABLE-READ
innodb_flush_log_at_trx_commit 1
sync_binlog 1
character_set_server utf8mb4
collation_server utf8mb4_0900_ai_ci
log_bin ON
gtid_mode ON
enforce_gtid_consistency ON
Meaning: These variables shape correctness, replication safety, and performance. They also change across major versions and between forks.
Decision: If staging and production don’t match, fix drift before you trust test results. If an upgrade changes defaults, pin them explicitly.
Task 3: Verify InnoDB buffer pool sizing and pressure
cr0x@server:~$ mysql -NBe "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
innodb_buffer_pool_size 34359738368
Innodb_buffer_pool_reads 918273
Innodb_buffer_pool_read_requests 5544332211
Meaning: Reads vs read_requests approximates cache misses. A jump in buffer_pool_reads after upgrade often indicates new working set,
plan change, or background churn.
Decision: If miss ratio worsened significantly, you’re likely I/O-bound. Increase buffer pool (if safe), fix query plans, or reduce churn.
Task 4: Inspect redo log and checkpoint pressure (InnoDB log waits)
cr0x@server:~$ mysql -NBe "SHOW GLOBAL STATUS LIKE 'Innodb_log_waits'; SHOW VARIABLES LIKE 'innodb_redo_log_capacity';"
Innodb_log_waits 742
innodb_redo_log_capacity 4294967296
Meaning: Innodb_log_waits > 0 means foreground transactions waited for redo. That’s classic “your redo is too small or flushing too slow.”
Decision: If waits climb during load, increase redo capacity (where supported), review flush settings, and check storage fsync latency.
Task 5: Check disk latency and saturation at the OS layer
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (db-prod-01) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.11 0.00 6.08 18.47 0.00 53.34
Device r/s w/s rkB/s wkB/s aqu-sz await svctm %util
nvme0n1 320.0 980.0 20480.0 65536.0 9.40 8.90 0.55 72.00
Meaning: High await combined with high %util indicates storage pressure. Sustained iowait above roughly 10–20% during an incident is a loud hint.
Decision: If await spikes with the upgrade, suspect redo/fsync, temp table spills, or changed background flushing. Mitigate by reducing write rate,
increasing memory, or moving to faster volumes.
Task 6: Validate replication health and coordinates
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep 'Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Last_SQL_Error|Retrieved_Gtid_Set|Executed_Gtid_Set'
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 84
Last_SQL_Error:
Retrieved_Gtid_Set: 3b2c1d10-aaaa-bbbb-cccc-111111111111:1-982733
Executed_Gtid_Set: 3b2c1d10-aaaa-bbbb-cccc-111111111111:1-982649
Meaning: Seconds_Behind_Source rising plus a widening gap between Retrieved_Gtid_Set and Executed_Gtid_Set means apply can't keep up. (On MariaDB the command is SHOW SLAVE STATUS and the GTID fields differ, e.g. Gtid_IO_Pos.)
Decision: If lag appears only after the upgrade, check parallel replication settings, binlog format, and fsync pressure on the replica.
Task 7: Catch plan regressions with EXPLAIN and optimizer trace (targeted)
cr0x@server:~$ mysql -e "EXPLAIN SELECT o.id FROM orders o JOIN customers c ON c.id=o.customer_id WHERE c.email='x@example.com' AND o.status='OPEN'\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: idx_email
key: idx_email
rows: 1
filtered: 100.00
Extra: Using index
*************************** 2. row ***************************
table: o
type: ref
possible_keys: idx_customer_status
key: idx_customer_status
rows: 12
filtered: 10.00
Extra: Using where
Meaning: You’re looking for changes in join order, access type, and row estimates. “type: ALL” where you used to have “ref” is a red flag.
Decision: If plan changed, consider updating statistics, adding indexes, or pinning with hints (last resort). Also verify collations and data types match.
Task 8: Compare table statistics freshness
cr0x@server:~$ mysql -NBe "SELECT table_schema, table_name, update_time FROM information_schema.tables WHERE table_schema='app' ORDER BY update_time DESC LIMIT 5;"
app orders 2025-12-30 01:12:44
app customers 2025-12-30 01:10:02
app order_items 2025-12-29 23:45:11
app invoices 2025-12-29 21:03:19
app payments 2025-12-29 19:58:07
Meaning: update_time can be misleading depending on engine and settings, but it’s a quick clue whether maintenance ran and whether tables are “fresh.”
Decision: If tables look stale after migration/import, run ANALYZE TABLE for critical tables and re-check EXPLAIN for key queries.
Task 9: Check for temp table spills and sort pressure
cr0x@server:~$ mysql -NBe "SHOW GLOBAL STATUS LIKE 'Created_tmp%'; SHOW VARIABLES LIKE 'tmp_table_size'; SHOW VARIABLES LIKE 'max_heap_table_size';"
Created_tmp_tables 182773
Created_tmp_disk_tables 44291
Created_tmp_files 9312
tmp_table_size 134217728
max_heap_table_size 134217728
Meaning: A high ratio of Created_tmp_disk_tables means queries are spilling to disk. Upgrades can change internal temp table behavior and defaults.
Decision: If disk spills increased post-upgrade, increase tmp_table_size/max_heap_table_size cautiously, and fix the offending queries and indexes.
Task 10: Identify top query patterns by digest (performance schema)
cr0x@server:~$ mysql -NBe "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, ROUND(AVG_TIMER_WAIT/1e9,2) AS avg_ms FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 3;"
SELECT * FROM orders WHERE customer_id = ? AND status = ? ORDER BY created_at DESC LIMIT ? 882771 912.22 1.03
UPDATE inventory SET qty = qty - ? WHERE sku = ? AND qty >= ? 221009 644.11 2.91
SELECT id FROM customers WHERE email = ? 775002 312.88 0.40
Meaning: If one digest dominates total time, you have a surgical target. If everything got slower evenly, suspect I/O or global contention.
Decision: Fix top digests first. If the top digest is new after upgrade, suspect plan changes or changed SQL from the app due to connector behavior.
Task 11: Check for metadata lock pileups (the quiet killer)
cr0x@server:~$ mysql -e "SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, COUNT(*) cnt FROM performance_schema.metadata_locks GROUP BY 1,2,3,4 ORDER BY cnt DESC LIMIT 5;"
+---------------+-------------+-------------+-------------+-----+
| OBJECT_SCHEMA | OBJECT_NAME | LOCK_TYPE   | LOCK_STATUS | cnt |
+---------------+-------------+-------------+-------------+-----+
| app           | orders      | SHARED_READ | GRANTED     | 132 |
| app           | orders      | EXCLUSIVE   | PENDING     |   1 |
+---------------+-------------+-------------+-------------+-----+
Meaning: One pending EXCLUSIVE lock plus a pile of granted shared locks is the classic “someone started DDL and now everyone waits.”
Decision: If you see this during upgrade, stop the DDL, or schedule it properly with online schema change tooling and off-peak windows.
Task 12: Validate error log for plugin/auth/TLS regressions
cr0x@server:~$ sudo tail -n 30 /var/log/mysql/error.log
2025-12-30T01:20:11.123456Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2025-12-30T01:20:15.765432Z 12 [ERROR] [MY-010054] [Server] Slave SQL: Error 'Unknown collation: 'utf8mb4_0900_ai_ci'' on query. Default database: 'app'. Query: 'CREATE TABLE ...', Error_code: 1273
Meaning: Collation mismatch and TLS warnings are common cross-version/cross-fork tripwires. “Unknown collation” is a migration boundary problem.
Decision: If collation isn’t supported on the target, you must convert schema/columns or choose a compatible collation before replicating/importing.
Task 13: Validate schema drift between staging and production (DDL diff via mysqldump)
cr0x@server:~$ mysqldump --no-data --routines --triggers --databases app | sha256sum
c0a5f4c2f6d7f6bb38a39b9c1d9e2c11caa8d9b7a0f1b2c3d4e5f6a7b8c9d0e1 -
Meaning: Hashing the schema dump is crude, but effective. If hashes differ, your environments are not equivalent.
Decision: If the schema differs, stop claiming staging validated production. Rebuild staging from production schema (and preferably masked data).
Task 14: Verify backup/restore correctness before touching production
cr0x@server:~$ xtrabackup --prepare --target-dir=/backups/xb_2025-12-30_0100
xtrabackup: This target seems to be prepared.
Meaning: A backup you haven’t prepared/restored is a theory, not a backup. Prepare validates the backup’s ability to become consistent.
Decision: If prepare fails or takes wildly longer post-upgrade, treat it as a release blocker: your recovery path is compromised.
Task 15: Run a production-shape micro-benchmark (not a vanity benchmark)
cr0x@server:~$ sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-user=sb --mysql-password=sb --mysql-db=sbtest --tables=16 --table-size=500000 --threads=64 --time=120 --report-interval=10 run
SQL statistics:
queries performed:
read: 1456720
write: 416206
other: 208103
transactions: 104051 (867.03 per sec.)
Latency (ms):
min: 2.31
avg: 73.81
max: 512.24
Meaning: This isn’t about absolute TPS; it’s about comparing before/after on the same hardware and config. Watch latency and max spikes.
Decision: If max latency explodes post-upgrade, you likely have fsync/redo issues, mutex contention, or temp spills. Don’t ship it.
Task 16: Validate connection errors and handshake behavior
cr0x@server:~$ mysqladmin -uroot -p extended-status | egrep 'Aborted_connects|Connection_errors_internal|Connection_errors_max_connections|Threads_connected'
Aborted_connects 17
Connection_errors_internal 0
Connection_errors_max_connections 4
Threads_connected 412
Meaning: Upgrades can change auth plugin defaults, TLS requirements, and timeouts. Spikes in Aborted_connects can mean client incompatibility or plain load.
Decision: If aborted connects climb only after upgrade, validate connector versions, TLS settings, and max_connections/thread_cache.
Joke #2: The only thing more confident than an ORM is an ORM after you upgrade the database under it.
Three corporate mini-stories (painful, real, useful)
Mini-story 1: The incident caused by a wrong assumption (GTID “is GTID”)
A mid-size SaaS company planned a “simple” migration: replace an aging MariaDB cluster with Percona Server because the new observability stack spoke
MySQL/Percona more fluently. Staging went fine. Replication worked. Failover drills looked clean enough to sign off.
The wrong assumption was subtle: they assumed GTID behavior was portable the way “port 3306 is portable.” In staging, they used logical dumps and short-lived
replicas. In production, they depended on GTID-based automation in their failover tooling and expected it to keep working after cutover.
During the production cutover, the promoted replica came up, but the automation refused to reattach a lagging node. Then a second replica applied transactions
in an order that didn’t match their expectations. Nobody lost data, but they lost time, and time is an availability bug.
The postmortem found the real culprit: the migration plan treated MariaDB GTID and MySQL/Percona GTID as interchangeable enough to “just work,” without
validating the failover tooling’s assumptions. Staging didn’t include the automation layer or the long-lived replication topology.
The fix wasn’t heroic. They rebuilt the plan: treat GTID flavor as a hard boundary, validate failover procedures end-to-end, and include the automation tool
in staging drills with long-lived replicas and realistic lag injection. The next cutover was boring, which is the highest compliment in operations.
Mini-story 2: The optimization that backfired (bigger redo logs, worse tail latency)
A fintech team upgraded Percona Server within the same major family and saw intermittent write stalls in production. Somebody spotted Innodb_log_waits rising
and made a reasonable call: increase redo capacity. Bigger redo means fewer checkpoints, fewer stalls. That’s the folklore.
They rolled out the change, and average latency improved. Then tail latency got weird. p99 write latency spiked during traffic bursts, and replicas lagged
further behind. The graphs looked like a calm sea with occasional sea monsters.
The real issue wasn’t redo size; it was I/O behavior under bursty load combined with storage characteristics. Bigger redo changed the timing of flush and
checkpoint work. Instead of smaller, frequent background work, they created larger, less predictable writeback phases that aligned badly with their storage
volume’s throughput variability.
The fix was to stop treating redo as a single knob. They instrumented fsync latency, adjusted flushing, verified doublewrite settings, and ensured replicas
had enough I/O headroom. Redo sizing stayed larger than before—but not “as large as possible,” and not without verifying tail behavior.
The lesson: performance tweaks are not free candy. Any knob that shifts when work happens can turn into a tail-latency generator. If you only watch averages,
you will ship regressions with a smile.
Mini-story 3: The boring but correct practice that saved the day (schema hash + config freeze)
An enterprise team had to move from MariaDB to Percona Server due to vendor constraints in their platform. They weren’t excited about it. They were also
disciplined, which is the adult version of being excited.
They started with two non-negotiables: a config freeze and a schema hash gate. For four weeks pre-cutover, production my.cnf changes were banned unless tied
to an incident. Any change required a diff review and a staging replay. Meanwhile, staging was rebuilt nightly from production schema, and the schema dump was
hashed and compared in CI.
During final dress rehearsal, the schema hash gate caught a tiny difference: a collation on one table in staging didn’t match production. It wasn’t intentional.
It was the residue of a half-finished migration months earlier that never reached production.
They fixed staging to match production, re-ran the upgrade rehearsal, and found a query plan regression that only appeared with the production collation and
index cardinality. This is the part where people say “lucky catch.” It wasn’t luck. It was boring process doing its job.
Cutover night was uneventful. The best upgrade stories don’t make it into company lore because nobody enjoys reminiscing about things that went according to plan.
Common mistakes: symptom → root cause → fix
1) Symptom: replication breaks with “Unknown collation” or “Unknown character set”
Root cause: Source schema uses a collation/charset not supported on target (common across MariaDB ↔ MySQL/Percona boundaries, and across MySQL 5.7 → 8.0).
Fix: Convert schema before migration. Standardize on utf8mb4 and choose collations supported by both sides. Verify by restoring a schema-only dump on the target.
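Turn that conversion from a hope into an inventory. A sketch that hunts for MySQL-8.0-era collations before moving toward MariaDB (the schema name is a placeholder; flip the predicate for the opposite direction, and check table defaults in information_schema.tables as well):
cr0x@server:~$ mysql -NBe "SELECT table_name, column_name, collation_name FROM information_schema.columns WHERE table_schema='app' AND collation_name LIKE 'utf8mb4_0900%';"
customers	email	utf8mb4_0900_ai_ci
orders	status	utf8mb4_0900_ai_ci
Every row returned is a schema change you need to make (and re-test plans for) before replication or import will work.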
2) Symptom: queries got slower, but CPU is the same and I/O is higher
Root cause: buffer pool misses increased due to plan changes or different temp table behavior; staging data fit in RAM, production does not.
Fix: Compare EXPLAIN plans, update statistics (ANALYZE), add/adjust indexes, and verify tmp_table_size and internal temp engine differences.
3) Symptom: p99 write latency spikes after upgrade
Root cause: redo/fsync pressure, changed flushing heuristics, or storage stack variability. Sometimes caused by “helpful” config changes during upgrade.
Fix: Measure fsync latency (OS + database), review innodb_flush_log_at_trx_commit and sync_binlog, tune redo capacity responsibly, and ensure volume headroom.
4) Symptom: “too many connections” or connection storms after cutover
Root cause: client handshake/auth plugin mismatch, TLS negotiation failures, or application retries amplifying mild latency regressions.
Fix: Pin auth plugin and TLS settings, validate connector versions in production-like tests, adjust timeouts, and deploy connection pooling sanity limits.
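The auth-plugin half of that is one query plus a connector matrix. List which plugin each account actually uses, then confirm every client library in production can negotiate it (account names below are placeholders):
cr0x@server:~$ mysql -NBe "SELECT user, host, plugin FROM mysql.user WHERE user NOT LIKE 'mysql.%' ORDER BY user;"
app_ro	10.0.%.%	mysql_native_password
app_rw	10.0.%.%	caching_sha2_password
replicator	10.0.2.%	caching_sha2_password
A mix like this is a warning sign: some accounts were created under old defaults, some under new, and your connectors get to discover which is which at 2 a.m.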
5) Symptom: DDL causes system-wide stalls during upgrade window
Root cause: metadata locks; online DDL assumptions not valid for that operation/version; long transactions holding locks.
Fix: Use online schema change tooling (or native online DDL where safe), kill/avoid long transactions, set lock wait timeouts, and stage DDL earlier.
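When the DDL can't be avoided, an online schema change tool with load guards is usually safer than hoping native online DDL covers your exact operation. A hedged sketch with pt-online-schema-change (database, table, index, and thresholds are placeholders; always start with --dry-run and read the docs for your tool version before --execute; output is illustrative):
cr0x@server:~$ pt-online-schema-change --alter "ADD INDEX idx_status_created (status, created_at)" D=app,t=orders --max-load Threads_running=50 --critical-load Threads_running=200 --chunk-time 0.5 --dry-run
Starting a dry run.  `app`.`orders` will not be altered.
Creating new table...
Created new table app._orders_new OK.
Dry run complete.  `app`.`orders` was not altered.
The point of --max-load/--critical-load is that the tool backs off or aborts instead of stacking its copy work on top of peak traffic.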
6) Symptom: replicas fall behind only after the upgrade
Root cause: parallel replication not tuned or changed defaults; binlog format changed; commit order differences; replica I/O weaker than primary.
Fix: Verify binlog_format=ROW, tune replica parallelism, increase replica I/O capacity, and ensure replica configuration matches the post-upgrade workload.
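What "tuned" actually means is checkable in one query; the variable names differ by family and version (replica_* on newer MySQL/Percona 8.0 releases, slave_* before that, and MariaDB uses slave_parallel_threads / slave_parallel_mode with different semantics). Illustrative output from the Percona side:
cr0x@server:~$ mysql -NBe "SHOW VARIABLES WHERE Variable_name IN ('replica_parallel_workers','replica_parallel_type','replica_preserve_commit_order','binlog_transaction_dependency_tracking');"
binlog_transaction_dependency_tracking	WRITESET
replica_parallel_type	LOGICAL_CLOCK
replica_parallel_workers	4
replica_preserve_commit_order	ON
If workers sit at a tiny number while the replica's SQL thread is pegged, parallelism is your cheapest lever; just test it, because commit-order settings interact with it.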
7) Symptom: “Deadlocks increased” and app errors spike
Root cause: changed execution plans or index usage leads to different lock acquisition order; increased concurrency; different isolation/SQL mode defaults.
Fix: Compare plans, add supporting indexes, reduce transaction scope, and ensure transaction_isolation and sql_mode are pinned and reviewed.
8) Symptom: disk fills unexpectedly during migration or soon after
Root cause: bigger binary logs, temp table spills, slow log enabled at verbose settings, backup artifacts not rotated, or redo/undo growth.
Fix: Pre-allocate space budget, enforce log rotation, cap retention, monitor /var/lib/mysql and tmpdir, and validate migration scripts clean up after themselves.
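The binary log directory is the usual first suspect and costs nothing to check. Also confirm retention (binlog_expire_logs_seconds here; MariaDB uses expire_logs_days or binlog_expire_logs_seconds depending on version). Names, sizes, and mount point below are illustrative for the Percona side:
cr0x@server:~$ mysql -NBe "SHOW BINARY LOGS;" | tail -n 3
binlog.000413	1073741824	No
binlog.000414	1073741824	No
binlog.000415	287309824	No
cr0x@server:~$ df -h /var/lib/mysql | tail -n 1
/dev/nvme0n1p1  1.8T  1.5T  213G  88% /var/lib/mysql
If retention is measured in weeks and the free space is measured in days, the migration window is exactly when those two facts will meet.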
Checklists / step-by-step plan (boring on purpose)
Phase 0: Decide what kind of “upgrade” you’re actually doing
- In-family upgrade (Percona → Percona, MariaDB → MariaDB): still risky, but tooling and semantics are closer.
- Cross-family migration (MariaDB ↔ Percona): treat as a migration, not an upgrade. Different GTID, different collations, different optimizer, different system tables.
- Major version jump: assume behavior changes unless proven otherwise.
Phase 1: Make staging stop lying
- Rebuild staging schema nightly from production. Hash the schema dump. Gate on mismatch.
- Load staging with production-like data shape: masked production snapshot or a generated dataset with similar skew and cardinality.
- Match config: same my.cnf, same kernel/sysctl, same filesystem mount options, same storage class.
- Replay production traffic patterns (including bursts). Don’t just run unit tests and call it a day.
- Include the automation layer: failover tooling, backups, monitoring, schema migration system.
Phase 2: Preflight compatibility gates (blockers)
- Collation/charset compatibility check across the entire schema.
- SQL mode and strictness: ensure expected behavior for inserts, group by, and implicit conversions.
- Auth plugins and TLS: ensure every client library can connect under the new defaults.
- Replication plan: decide on GTID strategy; avoid “we’ll figure it out during cutover.”
- Backup/restore rehearsal: prove you can restore and promote within your RTO.
Phase 3: Performance gates (prove you didn’t move the bottleneck into darkness)
- Benchmark with production concurrency and burst patterns; record p95/p99 and max latencies.
- Compare top digests before/after; flag new top offenders and plan regressions.
- Validate I/O headroom and redo behavior under write bursts.
- Verify temp table spills and sort pressure.
Phase 4: Cutover plan that respects rollback reality
- Define rollback triggers: correctness error, replication instability, sustained latency regression, rising error rate.
- Freeze schema changes during cutover window.
- Reduce blast radius: canary a subset of traffic or tenants, if your architecture allows it.
- Run the same verification tasks during cutover as you did in rehearsal. Same commands, same diffs, same gates.
- Do not improvise “quick optimizations” during incident unless you can measure them and undo them safely.
FAQ
1) Is Percona Server “just MySQL” and MariaDB “just MySQL”?
Not in the way your upgrade plan cares about. Percona Server tracks upstream MySQL closely but adds its own features and builds. MariaDB has diverged further over time. Treat cross-family
moves as migrations with compatibility validation, not casual upgrades.
2) Can I replicate from MariaDB to Percona Server (or the reverse) during migration?
Sometimes, but the edge cases are where you bleed: GTID flavor mismatch, collation differences, and statement/row format assumptions. If you do it, prove it
with a long-lived replication rehearsal, not a 30-minute staging test.
3) What’s the single most common “works on staging” cause?
Data shape. Staging datasets rarely reproduce production skew, fragmentation, and hot keys. Optimizers and caches behave differently when reality is messy.
4) Should I pin all configuration variables before upgrading?
Pin the ones that change semantics or durability: sql_mode, binlog_format, transaction isolation, flush and binlog sync settings, charset/collation, and
authentication/TLS policy. Don’t pin random knobs you don’t understand; you’ll fossilize old mistakes.
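A minimal sketch of what pinning can look like in my.cnf, using the variables from the answer above (values are examples to adapt, not recommendations; some names and plugins differ on MariaDB, which has no caching_sha2_password, so keep one pinned config per family):
[mysqld]
# Semantics and correctness
sql_mode = ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
transaction_isolation = REPEATABLE-READ
character_set_server = utf8mb4
collation_server = utf8mb4_general_ci  # pick a collation both families support if you replicate across them

# Replication and durability
binlog_format = ROW
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1

# Authentication (MySQL/Percona 8.0 naming)
default_authentication_plugin = caching_sha2_password
Check it into version control next to the upgrade plan so the diff itself becomes part of the review.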
5) How do I catch query plan regressions early?
Capture top query digests, run EXPLAIN on representative queries, compare before/after, and validate with production-like stats (ANALYZE). Also watch temp
spills and handler read patterns, not just query time.
6) Why did replication lag increase after upgrade when writes didn’t?
Replica apply can be limited by single-threaded apply, commit order constraints, fsync pressure, or different parallel replication defaults. The primary can
“handle it,” but the replica can’t apply it fast enough under its I/O budget.
7) Is it safe to change redo/flush settings during an incident?
Only if you can measure the result quickly and you understand the durability tradeoff. Some settings can reduce fsync cost at the expense of crash safety.
If your incident is correctness-adjacent, don’t trade integrity for speed without executive-level buy-in.
8) How do I know staging is “production-like enough”?
When your verification tasks diff cleanly (schema/config), your benchmark reproduces production bottlenecks, and your failure drills (backup restore, replica
rebuild, failover) behave the same way. If you can’t reproduce production pain in staging, staging isn’t realistic.
9) Should we use logical dumps or physical backups for migration?
Logical dumps are portable but slow and can change subtle things (definers, collations, character sets). Physical backups are fast but require engine/version
compatibility. For cross-family moves, many teams end up combining methods: physical within family, logical across boundaries, always rehearsed.
10) What’s a reasonable success criterion for an upgrade?
Correctness validated, replication stable, and performance within a defined regression budget (for example, no sustained p95 regression and no p99 explosion).
Plus: backup/restore still works and monitoring still tells the truth.
Next steps you can do this week
If you’re planning a MariaDB ↔ Percona Server move (or even an “easy” in-family upgrade), don’t start by debating features. Start by removing the staging lies.
- Build a diff report: run Tasks 1–3 and 13 on both environments and fix drift until they match.
- Pick 10 critical queries: run EXPLAIN before/after, and record the plans as artifacts.
- Do one realistic load replay: sysbench is fine as a baseline, but also replay a slice of production traffic if you can.
- Rehearse recovery: prepare and restore backups, prove promotion, and verify you can rebuild a replica inside your RTO.
- Write rollback triggers: make them measurable (lag, error rate, p99), and get agreement before the maintenance window.
Upgrades don’t fail because engineers are careless. They fail because reality is different at scale and the plan assumed it wouldn’t be. Make reality part of
your test harness, and “works on staging” becomes less of a lie and more of a useful prediction.