MySQL vs MariaDB upgrades: how to update without nuking production

If you’ve ever upgraded a database “real quick” and then watched your application fall into a silent, polite panic—latency up, errors sideways, replicas drifting—you already know the punchline: upgrades don’t fail loudly; they fail creatively.

This is a production guide. Not theory. Not vendor marketing. The goal is simple: move MySQL or MariaDB forward (or sideways between them) while keeping your data safe, your latency sane, and your rollback credible.

Choose your path: in-place, rolling, or migration

Most database upgrades go wrong for one of three reasons: you picked the wrong upgrade method, you skipped rehearsal, or you lied to yourself about rollback.

Option A: In-place major upgrade (fastest, riskiest)

In-place means you stop MySQL/MariaDB on a host and restart it with the new version pointing at the same datadir. This can be fine for small, non-critical systems. In production, it’s usually a last resort because:

  • Rollback often isn’t rollback. Once data files are upgraded, going back can be “restore from backup and pray.”
  • Downtime is mandatory.
  • If you discover a performance regression, you’re debugging under time pressure.

If you must go in-place, do it on a replica first. If you don’t have replicas, your first “upgrade step” is: build replication.

Option B: Rolling upgrade using replication (recommended for most)

This is the adult way. You upgrade replicas first, verify, then fail over. Your downtime becomes a controlled failover window, not an existential event.

  • Best for: MySQL async replication, semi-sync, MariaDB replication, many Galera scenarios (with version constraints).
  • Rollback is credible: fail back to the old primary (assuming you kept it intact and replication remains compatible).
  • Risk shifts from “data file format” to “replication compatibility and query behavior.” That’s a better kind of risk.

Option C: Migration via logical dump/restore or CDC (slowest, cleanest)

If you are changing engines (MySQL → MariaDB or MariaDB → MySQL), or you need to jump across incompatible major versions, you might do:

  • Logical: mysqldump / mydumper, restore to new cluster, cut over.
  • CDC: binlog-based replication (native), or a change data capture pipeline that streams changes to the new cluster.

Logical migration is slow but deterministic. CDC is elegant but operationally heavier. Either way, you get a fresh start and fewer file-format surprises.
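
A minimal sketch of the seed-then-stream pattern, assuming a same-family target where native binlog replication from old to new is supported; hostnames, credentials, and the binlog coordinates are placeholders (take the real coordinates from the header that --source-data writes into the dump; older MySQL and MariaDB spell the option --master-data):

cr0x@server:~$ mysqldump --single-transaction --source-data=2 --routines --events --triggers --hex-blob --databases app > /backups/seed.sql
cr0x@server:~$ mysql -h new-primary.internal -u admin -p < /backups/seed.sql
cr0x@server:~$ mysql -h new-primary.internal -u admin -p -e "CHANGE REPLICATION SOURCE TO SOURCE_HOST='old-primary.internal', SOURCE_USER='replicator', SOURCE_PASSWORD='***', SOURCE_LOG_FILE='binlog.000333', SOURCE_LOG_POS=4207; START REPLICA"

Let the new cluster stream changes until lag is stable at zero, and only then start talking about cutover dates.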

Opinionated rule: if the business cares, do rolling upgrades or CDC migration. In-place upgrades in production are how you learn the true meaning of “unplanned maintenance.”

A few facts and history (the parts that still bite)

Upgrades are easier when you know why the ecosystem looks the way it does. Here are concrete facts that still matter operationally:

  1. MariaDB was forked from MySQL in 2009, as Oracle’s acquisition of Sun Microsystems was underway (the deal closed in 2010). That split created two diverging roadmaps with “mostly compatible” as a moving target.
  2. MySQL 5.7 → 8.0 is not a “normal” major upgrade. The optimizer, data dictionary, defaults, and authentication changed in ways that surface as application bugs, not just DBA chores.
  3. MySQL 8.0 moved metadata into a transactional data dictionary (InnoDB-based). Operationally: fewer .frm-era oddities, but upgrades rewrite internal structures and can take real time.
  4. MariaDB kept Aria as a system table engine in places and developed features independently (e.g., different JSON behavior and optimizer features). That independence is power—and compatibility risk.
  5. “JSON” is a great example of divergence: MySQL’s JSON is a binary column type with functions and indexing patterns tuned around it; MariaDB treats JSON as an alias for LONGTEXT (with a validity check), and its JSON functions have evolved differently.
  6. Authentication plugins changed expectations in MySQL 8.0 (caching_sha2_password default). Clients and proxies that were “fine for years” suddenly aren’t.
  7. GTID implementations differ between MySQL and MariaDB. You can’t assume you can just “turn on GTID” and keep moving between them without planning.
  8. MySQL removed the query cache years ago (8.0), while MariaDB kept it longer. If you relied on query cache behavior (even accidentally), upgrades will feel like a performance “mystery.”
  9. InnoDB page size, redo log settings, and crash recovery behavior can change defaults and heuristics across versions. Your recovery time objective (RTO) might move without asking permission.

Joke #1: The only “one weird trick” in database upgrades is making a backup you’ve actually restored at least once.

Compatibility map: what breaks between MySQL and MariaDB

“Compatible” is not a binary property. It’s a list of things that will eventually hurt you in production, with a timestamp attached.

MySQL → MySQL (same family, still dangerous)

  • SQL mode and defaults: Behavior changes can surface as data truncation errors or “suddenly strict” inserts.
  • Optimizer regressions: The same query can choose a new plan and become slow under load.
  • Authentication and TLS: Client libraries, HAProxy/ProxySQL, and old JDBC versions can fail to connect.
  • Replication filtering and metadata: Minor configuration differences can break replication at cutover.

MariaDB → MariaDB

  • Galera version constraints: You can’t just roll any major version across a cluster. Check supported rolling upgrade paths; otherwise you’re doing a cluster rebuild.
  • System tables changes: mysql.* tables and privilege schemas evolve; upgrades can require explicit post-steps.
  • InnoDB vs XtraDB history: Older MariaDB versions had distinct storage engine code paths; modern versions are closer, but legacy environments still exist.

MySQL ↔ MariaDB (engine switch)

This is where “it starts up fine” becomes a trap.

  • Data file format and system tables differ: in-place swapping binaries is not a strategy; it’s a gamble.
  • GTID differences: migrating replication setups between them requires careful design (and often re-seeding).
  • SQL features diverge: window function support, JSON functions, optimizer hints, and reserved words drift apart.
  • Client expectations: connector behavior, authentication defaults, and TLS versions vary across distributions.

Opinionated rule: treat MySQL ↔ MariaDB as a migration, not an “upgrade.” Plan for dual-running, validation, and cutover with a rollback window.
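
One concrete example of the GTID divergence: attaching a replica “by GTID” is a different statement, and a different GTID format, on each side. Hostnames and credentials below are placeholders; the two commands are not interchangeable.

cr0x@server:~$ mysql -e "CHANGE REPLICATION SOURCE TO SOURCE_HOST='primary.internal', SOURCE_USER='replicator', SOURCE_PASSWORD='***', SOURCE_AUTO_POSITION=1; START REPLICA"
cr0x@server:~$ mariadb -e "CHANGE MASTER TO MASTER_HOST='primary.internal', MASTER_USER='replicator', MASTER_PASSWORD='***', MASTER_USE_GTID=slave_pos; START SLAVE"

MySQL GTIDs look like server_uuid:1-98233411; MariaDB GTIDs look like domain-server-sequence (0-1-98233411). Tooling that parses one format will choke on the other.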

Fast diagnosis playbook: what to check first/second/third

When an upgrade goes sideways, you don’t need a 40-tab browser session. You need a triage sequence that tells you where the time and pain live.

First: is it CPU, IO, or locks?

  • CPU-bound: query plan regression, missing indexes, increased sort/temporary tables, or TLS/auth overhead under high churn connections.
  • IO-bound: changed flushing behavior, doublewrite pressure, redo logging changes, buffer pool too small, storage latency jump.
  • Lock-bound: metadata locks, new DDL behavior, long transactions, or replication apply thread choking.

Second: is replication healthy and comparable?

  • Replica lag and apply errors after upgrade are early warnings. They’re also often ignored until cutover. Don’t.
  • Confirm binlog format, GTID mode (if used), and replica SQL thread status.

Third: did the app change its behavior?

  • Connection pool thrash due to auth plugin mismatch and reconnect loops.
  • New strictness causing retries.
  • Time zone, collation, or character set changes producing subtle differences and cache misses.

Fourth: did observability survive the upgrade?

  • Performance Schema settings, slow log format, and metrics names can change. If dashboards went dark, you’re debugging blind.

Single best move under pressure: compare one known expensive query on old vs new using the same data and the same parameters. If the plan differs, you have a direction.

Joke #2: “We didn’t change anything” is the database equivalent of “the dog ate my homework,” except the dog is an optimizer.

Production tasks: commands, outputs, and decisions (12+)

These are tasks you can run on real servers. Each includes: command, a plausible output snippet, what it means, and the decision you take.

Task 1: Identify exact server version and distribution

cr0x@server:~$ mysql -e "SELECT VERSION() AS version, @@version_comment AS distro, @@version_compile_machine AS arch\G"
*************************** 1. row ***************************
version: 8.0.36
distro: MySQL Community Server - GPL
arch: x86_64

Meaning: You’re not guessing. “MySQL 8” is not a version; 8.0.36 is a version. MariaDB will show different comments.

Decision: Pin upgrade targets explicitly. If you can’t name your current and target patch versions, you’re not planning an upgrade; you’re planning a surprise.

Task 2: Check critical configuration diffs (effective runtime)

cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('datadir','log_bin','binlog_format','gtid_mode','enforce_gtid_consistency','innodb_buffer_pool_size','innodb_flush_log_at_trx_commit','sync_binlog','sql_mode','character_set_server','collation_server')"
+--------------------------------+------------------------+
| Variable_name                  | Value                  |
+--------------------------------+------------------------+
| binlog_format                  | ROW                    |
| character_set_server           | utf8mb4                |
| collation_server               | utf8mb4_0900_ai_ci     |
| datadir                        | /var/lib/mysql/        |
| enforce_gtid_consistency       | ON                     |
| gtid_mode                      | ON                     |
| innodb_buffer_pool_size        | 34359738368            |
| innodb_flush_log_at_trx_commit | 1                      |
| log_bin                        | ON                     |
| sql_mode                       | ONLY_FULL_GROUP_BY,... |
| sync_binlog                    | 1                      |
+--------------------------------+------------------------+

Meaning: This is your operational personality: durability, replication format, strictness, and character handling.

Decision: Any upgrade/migration must preserve intent. If the new version changes defaults, override them explicitly in config, not in your hopes.
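
A sketch of what “preserve intent” looks like as configuration rather than hope; the path is illustrative (use your distro’s include directory) and the values should be copied from your own Task 2 output, not from here:

cr0x@server:~$ cat /etc/mysql/conf.d/pinned-intent.cnf
[mysqld]
# Pin behavior explicitly so new-version defaults cannot change it silently
sql_mode                       = ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
character_set_server           = utf8mb4
collation_server               = utf8mb4_0900_ai_ci
binlog_format                  = ROW
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1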

Task 3: Confirm disk space headroom before upgrade

cr0x@server:~$ df -h /var/lib/mysql
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  900G  720G  135G  85% /var

Meaning: Major upgrades can create temporary files, rebuild internal structures, and expand redo/undo during catch-up.

Decision: Under ~20% free is a red flag. Either expand storage or plan a migration to a new volume. “We’ll be fine” is not a filesystem strategy.

Task 4: Look for corrupted tables or lurking InnoDB warnings

cr0x@server:~$ sudo tail -n 60 /var/log/mysql/error.log
2025-12-29T02:14:11.102334Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2025-12-29T02:15:03.981245Z 0 [Warning] [MY-012639] [InnoDB] Retry attempts for reading partial data failed.
2025-12-29T02:15:04.001103Z 0 [Note] [MY-010431] [Server] Detected data dictionary initialization

Meaning: Upgrades amplify existing damage. If InnoDB is already complaining, don’t stack risk on top.

Decision: Investigate and remediate storage/log issues first (SMART/NVMe errors, filesystem errors, kernel messages). If needed, take a fresh logical backup.
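
A quick storage-health pass before stacking an upgrade on top; device names are placeholders for your hardware, and clean output here proves nothing, but errors here are a hard stop:

cr0x@server:~$ sudo journalctl -k --since "24 hours ago" | grep -iE "nvme|i/o error|ext4|xfs" | tail -n 20
cr0x@server:~$ sudo smartctl -a /dev/nvme0 | grep -iE "critical warning|integrity errors|percentage used"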

Task 5: Measure replication health on replicas (before touching anything)

cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Last_SQL_Error|Retrieved_Gtid_Set|Executed_Gtid_Set"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_SQL_Error:
Retrieved_Gtid_Set: 2f9c3b3a-...:1-98233411
Executed_Gtid_Set: 2f9c3b3a-...:1-98233411

Meaning: A replica that can’t replicate today will not magically replicate after you upgrade it.

Decision: Fix replication drift first. Don’t upgrade on top of broken replication; you’ll lose your easiest rollback path.

Task 6: Check for long transactions (upgrade and failover killers)

cr0x@server:~$ mysql -e "SELECT trx_id, trx_started, TIME_TO_SEC(TIMEDIFF(NOW(), trx_started)) AS age_s, trx_mysql_thread_id, trx_query FROM information_schema.innodb_trx ORDER BY age_s DESC LIMIT 5\G"
*************************** 1. row ***************************
trx_id: 145922993
trx_started: 2025-12-29 01:47:12
age_s: 2081
trx_mysql_thread_id: 17322
trx_query: UPDATE orders SET status='PAID' WHERE id=...

Meaning: Long transactions block purge, can block DDL, and can make failovers take forever (and replicas lag).

Decision: Before cutover, enforce a “no long transactions” window: pause batch jobs, fix stuck writers, and consider setting max execution time where appropriate.
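
If you want a guardrail rather than a policy document, a sketch assuming MySQL 8.0: max_execution_time caps SELECT statements only, is set in milliseconds, and applies to sessions opened after the change (MariaDB’s rough equivalent is max_statement_time, in seconds). The thread id is the stuck writer from the output above; killing it rolls the transaction back, which itself takes time.

cr0x@server:~$ mysql -e "SET GLOBAL max_execution_time = 30000"
cr0x@server:~$ mysql -e "KILL 17322"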

Task 7: Find top waits (locks vs IO vs CPU) using Performance Schema

cr0x@server:~$ mysql -e "SELECT event_name, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1000000000000,2) AS total_s FROM performance_schema.events_waits_summary_global_by_event_name WHERE SUM_TIMER_WAIT > 0 ORDER BY SUM_TIMER_WAIT DESC LIMIT 8"
+------------------------------------------+------------+---------+
| event_name                               | COUNT_STAR | total_s |
+------------------------------------------+------------+---------+
| wait/io/table/sql/handler                | 192233331  | 8421.33 |
| wait/synch/mutex/innodb/buf_pool_mutex   |  98223311  | 3120.10 |
| wait/io/file/innodb/innodb_data_file     |  12233411  | 1777.54 |
| wait/lock/table/sql/handler              |   2233111  |  601.22 |
+------------------------------------------+------------+---------+

Meaning: You’re looking at what the server spends time waiting on. Not feelings. Not Slack.

Decision: If IO waits dominate, plan storage and flush tuning validation. If lock waits dominate, audit queries and transaction scope before you upgrade.

Task 8: Compare query plans pre-upgrade with EXPLAIN ANALYZE

cr0x@server:~$ mysql -e "EXPLAIN ANALYZE SELECT * FROM sessions WHERE user_id=123 AND created_at > NOW() - INTERVAL 7 DAY ORDER BY created_at DESC LIMIT 50\G"
*************************** 1. row ***************************
EXPLAIN: -> Limit: 50 row(s)  (actual time=0.321..0.338 rows=50 loops=1)
    -> Index range scan on sessions using idx_user_created (user_id=123)  (actual time=0.320..0.333 rows=50 loops=1)

Meaning: This is your truth serum. If the “actual time” or access method changes after upgrade, you found your regression.

Decision: Capture a baseline set of critical queries and their plans. After upgrading a replica, compare. If plans diverge, fix with indexes, hints (sparingly), or query rewrites before cutover.
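
A minimal way to snapshot plans for later diffing, assuming you keep one statement per file (no trailing semicolon) under /root/baseline-queries/, a layout invented for this sketch; EXPLAIN ANALYZE is MySQL 8.0 syntax, MariaDB’s equivalent is ANALYZE FORMAT=JSON:

cr0x@server:~$ mkdir -p /root/baseline-plans
cr0x@server:~$ for q in /root/baseline-queries/*.sql; do mysql -e "EXPLAIN ANALYZE $(cat $q)\G" > /root/baseline-plans/$(basename $q .sql).txt; done
cr0x@server:~$ diff -ru /root/baseline-plans /root/upgraded-plans | less

Run the same loop on the upgraded replica into the second directory; a changed access method on a hot query is your cue to stop and investigate, not to hope.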

Task 9: Validate character set/collation drift that breaks indexes and comparisons

cr0x@server:~$ mysql -e "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA='app' AND DATA_TYPE IN ('varchar','text','char') AND COLLATION_NAME NOT LIKE 'utf8mb4%' LIMIT 10"
+--------------+------------+-------------+--------------------+-------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | CHARACTER_SET_NAME | COLLATION_NAME    |
+--------------+------------+-------------+--------------------+-------------------+
| app          | users      | email       | latin1             | latin1_swedish_ci |
+--------------+------------+-------------+--------------------+-------------------+

Meaning: Mixed collations cause weird ordering, broken uniqueness expectations, and slow comparisons.

Decision: Don’t change collations during an engine upgrade unless you’re ready for a schema migration project. If you must, do it as a separate change with its own blast radius and rollback plan.

Task 10: Check for deprecated/removed features used by your workload

cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'query_cache%';"
cr0x@server:~$

Meaning: No rows come back on MySQL 8.0; the query cache and its variables are gone. On MariaDB they may still exist, depending on version.

Decision: If your app was accidentally benefiting from query cache, you’ll see extra load after upgrade. Prepare capacity, and fix caching at the application tier where it belongs.

Task 11: Verify client compatibility with authentication plugins

cr0x@server:~$ mysql -e "SELECT user, host, plugin FROM mysql.user WHERE user IN ('app','replicator')\G"
*************************** 1. row ***************************
user: app
host: %
plugin: caching_sha2_password
*************************** 2. row ***************************
user: replicator
host: 10.0.0.%
plugin: mysql_native_password

Meaning: If your app client library can’t do caching_sha2_password, it will fail to connect, or reconnect in a tight loop and DoS you politely.

Decision: Either upgrade clients/connectors/proxies before server upgrade, or set the user plugin explicitly. Prefer upgrading clients; downgrading auth is a security trade you should make consciously.
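
If you consciously take the downgrade-auth trade for one account during the transition, a sketch assuming MySQL 8.0 (the password is a placeholder; mysql_native_password is deprecated and disappears in newer releases, so treat this as a bridge with an expiry date):

cr0x@server:~$ mysql -e "ALTER USER 'app'@'%' IDENTIFIED WITH mysql_native_password BY '********'"
cr0x@server:~$ mysql -e "SELECT user, host, plugin FROM mysql.user WHERE user='app'"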

Task 12: Confirm binary logging and binlog retention won’t explode during migration

cr0x@server:~$ mysql -e "SHOW BINARY LOGS;"
+------------------+------------+
| Log_name         | File_size  |
+------------------+------------+
| binlog.000331    |   10485776 |
| binlog.000332    | 1073741824 |
| binlog.000333    | 1073741824 |
+------------------+------------+

Meaning: Your binlogs are large and frequent. During cutover rehearsals and replica rebuilds, binlog churn can fill disks.

Decision: Ensure disk headroom and set sane binlog expiration. For MySQL 8.0, prefer binlog_expire_logs_seconds. Avoid setting it “low” during upgrades unless you like rebuilding replicas.
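
A sketch of pinning retention explicitly on MySQL 8.0; three days is an illustrative number, size it to how long a replica rebuild plus catch-up actually takes in your environment:

cr0x@server:~$ mysql -e "SET PERSIST binlog_expire_logs_seconds = 259200"
cr0x@server:~$ mysql -e "SELECT @@binlog_expire_logs_seconds"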

Task 13: Take a logical backup you can actually restore (spot-check restore)

cr0x@server:~$ mysqldump --single-transaction --routines --events --triggers --hex-blob --set-gtid-purged=OFF app > /backups/app.sql
cr0x@server:~$ mysql -e "CREATE DATABASE restore_test"
cr0x@server:~$ mysql restore_test < /backups/app.sql

Meaning: The backup exists and is usable, which is rarer than it should be. Dumping the schema without --databases omits the CREATE DATABASE/USE statements, so you can restore it into restore_test (ideally on a separate host) instead of accidentally replaying it over the live app database.

Decision: If restore fails, stop. Fix backup/restore before you touch production binaries. Backups are not a checkbox; they’re your rollback when everything else is on fire.

Task 14: Dry-run application compatibility with a shadow instance

cr0x@server:~$ mysql -e "SET GLOBAL log_output='TABLE'; SET GLOBAL slow_query_log=ON; SET GLOBAL long_query_time=0.2;"
Query OK, 0 rows affected (0.00 sec)

Meaning: You can capture slow queries on the upgraded replica while running mirrored read traffic (or synthetic tests).

Decision: If the upgraded replica shows new slow queries for the same workload, do not cut over until you understand why. “It’s probably fine” is how you upgrade into a week-long incident.
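
With log_output='TABLE', captured statements land in mysql.slow_log, so ranking the worst offenders after a shadow-traffic run is one query away (truncate the table between runs so comparisons stay clean):

cr0x@server:~$ mysql -e "SELECT start_time, query_time, rows_examined, LEFT(CONVERT(sql_text USING utf8mb4), 80) AS query_head FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10"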

Three corporate mini-stories from the trenches

Incident caused by a wrong assumption: “replication is replication”

They ran a busy SaaS platform with a tidy async replication setup: one primary, two replicas, nightly backups. The plan was to “modernize” by moving from MariaDB to MySQL because a vendor toolchain “preferred MySQL 8.” The migration doc was two pages long. That should have been a clue.

The wrong assumption was subtle: they assumed their MariaDB GTID-based replication approach would map cleanly to MySQL GTID. The team built a new MySQL replica, tried to attach it to the MariaDB primary, and discovered that “GTID” is not a single standard across forks. They pivoted to file/position replication, which sort of worked until a failover test introduced a binlog position mismatch and the replica started rejecting events.

In the cutover window, the application switched writes to the new MySQL primary, and the old MariaDB cluster was meant to remain as rollback. But rollback required reverse replication to keep it close. That reverse path wasn’t stable. Within an hour, they had two diverged realities and no clean line back.

The fix wasn’t glamorous. They rebuilt the migration as a CDC pipeline with explicit consistency checks, and they stopped trying to make GTID mean the same thing across engines. They also implemented a hard rule: rollback must be tested with real write traffic in a rehearsal, not “seems plausible.” The platform survived. The two-page doc did not.

An optimization that backfired: “let’s tune flushing during the upgrade”

A different company had a MySQL 5.7 fleet on fast NVMe. During a planned 8.0 upgrade, someone suggested a performance tweak: lower durability settings temporarily to speed catch-up and reduce failover time. Specifically, they reduced innodb_flush_log_at_trx_commit and sync_binlog on the upgraded replicas.

In staging, it looked great. Replication applied faster; queries were snappy. Production, however, had one extra ingredient: occasional power events in one rack and a storage controller firmware quirk that had been harmless under strict flushing. Two days into the rollout, one upgraded replica crashed. Crash recovery completed, but it replayed logs differently than expected and came up with an inconsistency in an edge-case table that used a mix of autocommit patterns and a legacy trigger.

No data was “lost” in the existential sense—most tables were fine—but the inconsistency was enough to block promotion of that replica and forced a rebuild during the rollout. The overall upgrade slowed down dramatically. The “optimization” added more downtime risk than it removed.

The lesson: don’t introduce tuning changes inside an upgrade unless you’re ready to debug them as if they were production features. If you want to change durability knobs, do it as its own change, with its own test window, and accept the risk trade explicitly.

A paraphrased idea often attributed to Werner Vogels: “Everything fails; design and operate assuming it will.” That includes your clever upgrade-time settings.

A boring but correct practice that saved the day: rehearsed cutover with a real rollback

One enterprise team had a habit that everyone teased: they ran full dress rehearsals for database upgrades, including the rollback. Not “we have backups,” but “we practiced switching back under load.” It felt excessive—until it didn’t.

They were upgrading MariaDB in a three-node setup used by an internal finance system. During rehearsal #1, they found that their schema migration tool opened long-running transactions, which caused metadata locks during failover. That would have turned a 2-minute cutover into a 45-minute outage. They fixed the tool configuration and separated schema changes from the engine upgrade.

During rehearsal #2, they discovered their monitoring agent used a deprecated status variable and stopped reporting replication lag on the upgraded nodes. They updated dashboards and alerts before production.

On the real night, the upgrade went fine—until a network ACL change elsewhere caused intermittent packet loss to one replica. The team saw replication jitter immediately (because monitoring still worked), paused the promotion, and failed over to the other upgraded replica instead. No incident. The boring rehearsal work did all the dramatic work for them.

Checklists / step-by-step plan

There are two upgrade projects hiding in one: the database engine change and the operational change. Treat both seriously.

Phase 0: Decide what you’re actually doing

  • Same engine, major version (MySQL 5.7→8.0, MariaDB 10.x→11.x): use rolling upgrade where possible.
  • Engine switch (MySQL ↔ MariaDB): plan migration with validation. Assume incompatibilities until proven otherwise.
  • Galera: verify supported rolling upgrade paths. If not supported, plan a new cluster build and cutover.

Phase 1: Preflight checklist (do this a week before)

  • Inventory: versions, configs, plugins, storage engine usage.
  • Baseline: p95/p99 query latency, top queries, replication lag, buffer pool hit rate, redo/undo pressure.
  • Backups: full logical backup restore test; physical backup restore test if you use it.
  • Client compatibility: connectors, proxies, TLS, auth plugins.
  • Disk headroom and IOPS headroom: upgrades cause bursts.
  • Define success: acceptable error rate, latency, and cutover duration.

Phase 2: Build an upgrade rehearsal environment (do not skip)

Use production-like data. If you can’t, at least use production schema and a workload replay.

  • Restore last night’s backup into staging.
  • Attach a read-only workload replay or synthetic benchmark that approximates your query mix (a sketch follows this list).
  • Run application integration tests against the new version.
  • Compare query plans for critical queries.
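
If you have no replay tooling, a repeatable synthetic baseline is still better than nothing. A sketch using sysbench; hostnames, credentials, and sizes are placeholders, and oltp_read_only will not mirror your real query mix, it only gives you the same yardstick on both versions:

cr0x@server:~$ sysbench oltp_read_only --mysql-host=staging-db.internal --mysql-user=bench --mysql-password='***' --mysql-db=sbtest --tables=16 --table-size=1000000 prepare
cr0x@server:~$ sysbench oltp_read_only --mysql-host=staging-db.internal --mysql-user=bench --mysql-password='***' --mysql-db=sbtest --tables=16 --table-size=1000000 --threads=32 --time=600 run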

Phase 3: Rolling upgrade plan (replication-based)

Step 1: Upgrade one replica

  • Stop replication (or keep it running depending on method), upgrade binary, run required post-upgrade steps.
  • Let it catch up, verify apply speed, verify query latency.
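
The mechanics of this step, sketched for a Debian/Ubuntu-style install; package names, service names, and log paths vary by distro and by whether you use vendor repositories, and recent MySQL versions run their post-upgrade steps automatically at startup (watch the error log for them):

cr0x@server:~$ mysql -e "STOP REPLICA"
cr0x@server:~$ sudo apt-get update && sudo apt-get install -y mysql-server
cr0x@server:~$ sudo tail -n 100 /var/log/mysql/error.log | grep -iE "upgrad|error"
cr0x@server:~$ mysql -e "START REPLICA; SHOW REPLICA STATUS\G" | egrep "Running|Seconds_Behind_Source|Last_.*Error"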

Step 2: Validate deeply before touching the primary

  • Run read-only traffic against upgraded replica (or mirror reads) and compare errors/latency.
  • Confirm replication is stable over hours, not minutes.
  • Confirm backups work on the new version (tools sometimes change behavior).

Step 3: Upgrade remaining replicas

Never upgrade all replicas at once if you can avoid it. Keep at least one known-good replica on the old version until cutover is successful and stable.

Step 4: Cutover

  • Quiesce writes briefly if your failover process requires it.
  • Promote an upgraded replica to primary (a minimal promotion sequence is sketched after this list).
  • Repoint application, verify quickly, then gradually re-enable background jobs.
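
A bare-bones manual promotion sequence for async replication, assuming MySQL 8.0 and no orchestrator doing this for you; hostnames in the prompts are placeholders, and most failover tools wrap exactly these steps:

cr0x@old-primary:~$ mysql -e "SET GLOBAL super_read_only = ON"
cr0x@new-primary:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Retrieved_Gtid_Set|Executed_Gtid_Set"
cr0x@new-primary:~$ mysql -e "STOP REPLICA; RESET REPLICA ALL"
cr0x@new-primary:~$ mysql -e "SET GLOBAL super_read_only = OFF; SET GLOBAL read_only = OFF"

Wait for lag to hit zero and the GTID sets to match before RESET REPLICA ALL, then repoint the application and any remaining replicas.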

Step 5: Rollback plan (you practice this)

  • Keep old primary intact for a defined rollback window.
  • Decide rollback trigger conditions (latency, error rate, replication instability).
  • Have a scripted, tested procedure to repoint traffic back.

Phase 4: Migration plan (MySQL ↔ MariaDB)

If you are switching engines, prefer a new cluster build and a controlled cutover:

  • Provision new cluster with target engine/version.
  • Load baseline data via physical restore (if compatible) or logical restore.
  • Stream changes using replication/CDC (engine-dependent).
  • Validate with checksums and application shadow reads (a lightweight digest check is sketched after this list).
  • Cut over writes; keep old cluster read-only for rollback window.
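
Validation needs more than row counts. A lightweight spot-check you can run on both sides is a per-table digest; the columns (id, email, updated_at) are hypothetical stand-ins for your own, and this assumes a quiet window or a consistent snapshot on both ends. Tools like pt-table-checksum do this properly for live replication.

cr0x@server:~$ mysql -h old-primary.internal -e "SELECT COUNT(*) AS rows_total, SUM(CRC32(CONCAT_WS('#', id, email, updated_at))) AS digest FROM app.users"
cr0x@server:~$ mysql -h new-primary.internal -e "SELECT COUNT(*) AS rows_total, SUM(CRC32(CONCAT_WS('#', id, email, updated_at))) AS digest FROM app.users"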

Common mistakes: symptom → root cause → fix

1) Symptom: upgrade succeeds, but app can’t connect

Root cause: authentication plugin mismatch (e.g., MySQL 8 caching_sha2_password) or TLS/cipher mismatch in proxies/clients.

Fix: upgrade connectors/proxies first; verify with a canary host. If emergency, set user plugin to mysql_native_password for that user and schedule proper remediation.

2) Symptom: sudden query latency spike, same CPU usage

Root cause: optimizer plan regression or changed statistics behavior.

Fix: capture EXPLAIN ANALYZE on old vs new; add/adjust indexes; update histograms where used; consider plan-stability techniques (but treat hints as debt).
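
A sketch of refreshing statistics and adding a histogram on MySQL 8.0; the table and column are placeholders borrowed from earlier examples, and histograms pay off mostly on filtered columns that are not already well indexed (MariaDB’s histogram syntax differs):

cr0x@server:~$ mysql -e "ANALYZE TABLE app.orders"
cr0x@server:~$ mysql -e "ANALYZE TABLE app.orders UPDATE HISTOGRAM ON status WITH 32 BUCKETS"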

3) Symptom: replicas lag heavily only after upgrade

Root cause: upgraded replica applies row events slower due to different fsync behavior, different parallel replication settings, or IO limits.

Fix: validate IO latency, increase replica parallelism where supported, ensure binlog_format and row image settings are consistent, and fix storage saturation before cutover.
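
A sketch of raising apply parallelism using MySQL 8.0 variable names (older releases and MariaDB use the slave_* spellings; MariaDB’s knobs are slave_parallel_threads and slave_parallel_mode). Changing the worker count requires the applier to be stopped, and more workers is not automatically faster, so measure lag before and after, and review replica_parallel_type / replica_preserve_commit_order while you are there:

cr0x@server:~$ mysql -e "STOP REPLICA SQL_THREAD; SET GLOBAL replica_parallel_workers = 8; START REPLICA SQL_THREAD"
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Replica_SQL_Running|Seconds_Behind_Source"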

4) Symptom: failover takes forever, connections pile up

Root cause: long transactions / metadata locks / slow crash recovery on candidate primary.

Fix: identify and stop long transactions pre-cutover; enforce maintenance window rules for batch jobs; verify crash recovery time by controlled restart in rehearsal.

5) Symptom: data looks “different” (ordering, uniqueness, comparisons)

Root cause: collation/character set differences, or behavior changes in JSON/text comparison functions across engines.

Fix: pin character_set_server/collation_server explicitly; audit columns with mixed collations; add explicit COLLATE in critical queries if needed; avoid changing collations during upgrade.

6) Symptom: monitoring dashboards broke on upgrade night

Root cause: changed status variables, Performance Schema settings, or privilege changes for monitoring user.

Fix: test monitoring agent against upgraded replica during rehearsal; update queries and grants; keep a minimal “DB heartbeat” check independent of fancy dashboards.

7) Symptom: disk fills up mid-upgrade

Root cause: binlog growth, temp table spill, redo log expansion during catch-up, or leftover upgrade artifacts.

Fix: ensure headroom; monitor /var/lib/mysql and tmpdir; set tmpdir on a volume sized for worst-case sorts; do not aggressively shorten binlog retention during migration.

8) Symptom: queries that used to “work” now error

Root cause: stricter SQL mode, reserved words, or changed default settings.

Fix: inventory sql_mode and pin it; run application test suite; search logs for “deprecated” and “error near” patterns; fix queries rather than loosening correctness globally unless you have to.

FAQ

1) Should I upgrade MySQL 5.7 to 8.0 in-place?

Only if downtime is acceptable and you have a tested restore path. For anything important, upgrade replicas first, then fail over.

2) Can I switch from MariaDB to MySQL by just replacing the binaries?

No. Treat it as a migration. Data dictionaries, system tables, GTIDs, and feature behavior diverge. Build a new cluster and move data with validation.

3) What’s the safest rollback strategy?

Keep the old primary intact and writable only if you’re doing a controlled failback with known replication direction. Otherwise keep it read-only and rely on restore/CDC. Rollback must be rehearsed.

4) How do I detect optimizer regressions before cutover?

Pick 20–50 critical queries, capture EXPLAIN ANALYZE and runtime stats on old version, then compare on an upgraded replica with production-like data and load.

5) My app uses an old connector. What breaks first on MySQL 8?

Authentication and TLS are common failure points: caching_sha2_password, stricter cipher expectations, and proxy support. Upgrade clients/proxies before the server if you can.

6) Do I need to run mysql_upgrade?

It depends on version and distribution. MySQL 8.0.16 and later run the data dictionary and system table upgrade automatically at server startup, and mysql_upgrade is deprecated; MariaDB still ships mariadb-upgrade (mysql_upgrade) and generally expects it to run after a version change, though some distro packages invoke it for you. The operational rule: verify system tables and run the recommended upgrade procedure on a replica first.

7) What about Galera clusters—can I do rolling upgrades?

Sometimes. It depends on the exact MariaDB/Galera versions. Validate the supported path and test in rehearsal. If rolling isn’t supported, build a new cluster and cut over.

8) Should I change config defaults during the upgrade?

Not unless you have a specific reason and a test proving it. Changing engine version is already a big variable. Split tuning changes into a separate project unless you like ambiguous root causes.

9) How long should I keep the old cluster around after cutover?

Long enough to cover your realistic “bug discovery” window—often days, not hours—balanced against cost and operational risk. Keep it in a state that supports your rollback plan (read-only or ready-to-failback).

10) What’s the fastest way to gain confidence right after cutover?

Watch three things: error rate from the app, p95/p99 latency on a few key endpoints, and replication health (if you have downstream replicas). If any drift, pause and decide quickly.

Conclusion: practical next steps

If you take nothing else: upgrades aren’t a weekend hobby. They’re a controlled change to the most stateful system you run.

  1. Pick the method: rolling upgrade via replication for same-engine upgrades; migration for MySQL ↔ MariaDB switches.
  2. Rehearse: restore real data into staging, run workload, compare plans, test monitoring.
  3. Make rollback real: define triggers, keep an old primary/cluster in a rollback-ready state, and practice the procedure.
  4. Upgrade replicas first: validate for hours under load, not minutes.
  5. Cut over with discipline: quiesce risky batch jobs, watch error/latency/replication, and be willing to abort.

The goal isn’t a successful upgrade command. The goal is a production system that behaves predictably the morning after.
