MySQL vs RDS MySQL: the hidden limits that bite during incidents

There’s a special kind of outage where the database isn’t “down.” It’s up, accepting connections, returning some queries, and radiating smug health checks. Meanwhile your API is timing out, replicas are drifting into the future, and every engineer is staring at a dashboard that says everything is “green.”

This is where the MySQL you think you’re running and the MySQL you’re actually running—RDS MySQL—stop being synonyms. The differences are subtle until they are catastrophic. Most incidents aren’t caused by one big mistake; they’re caused by a wrong assumption meeting a hidden limit at 2:13 a.m.

The core mismatch: “MySQL” is not a deployment model

Self-managed MySQL is a pile of choices. File system, block device, RAID level (or not), CPU pinning, OS page cache behavior, kernel scheduler quirks, TCP settings, backup tooling, and the kind of weird cron job a former coworker wrote in 2019 and nobody wants to touch. It’s flexible. It’s also your responsibility, which means it’s your fault during incidents.

RDS MySQL is a product. It behaves like MySQL at the SQL layer, but it lives inside guardrails: managed storage, managed backups, managed patching, managed failover, managed observability. That management comes with constraints you don’t control and sometimes can’t even see. During a calm week, that’s a feature. During an incident, it’s a negotiation.

The typical failure mode is not “RDS is worse.” It’s “you planned like it was self-managed,” or vice versa. In production, the system you assume is the system you debug. If your assumptions are wrong, you’ll debug confidently and fix nothing.

One quote you should keep taped to your monitor comes from Werner Vogels: “You build it, you run it.” It’s short, and it hurts because it’s true.

Facts and history that explain today’s sharp edges

  • MySQL’s default storage engine changed the world. InnoDB became the default in MySQL 5.5, and suddenly “ACID” wasn’t a premium feature. It also made I/O and redo log sizing a first-class operational concern.
  • Amazon introduced RDS in 2009. The promise was simple: stop babysitting servers. The trade-off was simpler: you don’t get root, and you accept opinionated infrastructure.
  • Performance Schema wasn’t always standard practice. It matured over years; many “MySQL experts” learned in an era where you tailed slow logs and guessed. RDS makes deep OS-level tooling harder, so you should learn the modern MySQL instrumentation.
  • InnoDB’s “doublewrite buffer” exists because storage lies. Torn pages happen. Managed storage reduces some risk but doesn’t remove the need for crash-consistent design choices.
  • Replication has multiple personalities. Classic asynchronous replication is different from semi-sync, and both are different from group replication. RDS supports some modes, restricts others, and adds its own operational behaviors around failover.
  • “General purpose SSD” wasn’t always good enough. gp2/gp3 and io1/io2 exist because IOPS is a product decision now, not just an engineering detail. On-prem you fight physics; in the cloud you fight your last purchase order.
  • Online DDL changed incident response. MySQL’s DDL has improved, but not all ALTERs are equal, and RDS adds constraints around long-running operations and maintenance windows.
  • Backups moved from “files” to “snapshots.” Logical dumps are portable but slow; snapshots are fast but coupled to the platform’s storage model. RDS encourages snapshots; your recovery strategy has to account for that coupling.

Hidden limits that bite during incidents (and why)

1) Storage isn’t just “size”: it’s latency, burst, and write amplification

On self-managed MySQL, you can attach faster disks, tune the RAID controller, or throw NVMe at the problem. On RDS, you pick an EBS-backed storage class and live inside its IOPS and throughput constraints. If you picked gp2 years ago and never revisited it, you may be living on “burst credits” without realizing it.

Incident pattern: latency rises, CPU looks fine, queries slow down, and everyone blames “a bad deploy.” Meanwhile, the real villain is disk queue depth. InnoDB is I/O-hungry when its working set exceeds the buffer pool or when it’s flushing dirty pages under pressure.

What’s hidden: the storage subsystem is managed. You can’t run iostat on the host. You must rely on RDS metrics and MySQL status variables. This isn’t worse; it’s different, and it demands different reflexes.
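
If you suspect the disk, pair the engine’s view with the platform’s. A minimal sketch, assuming the AWS CLI is configured; the instance identifier and time window are illustrative, while DiskQueueDepth and BurstBalance are standard RDS CloudWatch metrics:

cr0x@server:~$ # hypothetical instance name; adjust the window to your incident
cr0x@server:~$ aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS --metric-name DiskQueueDepth \
    --dimensions Name=DBInstanceIdentifier,Value=prod-mysql \
    --start-time 2025-12-30T01:00:00Z --end-time 2025-12-30T02:00:00Z \
    --period 60 --statistics Average

A sustained queue depth with flat CPU is the classic “blame the deploy, it was the disk” signature. On gp2, pull BurstBalance the same way.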

2) “Free storage” is a lie during temp table spills and online DDL

RDS can autoscale storage in some configurations, but it doesn’t autoscale the kind of space you need right now for a massive temp table spill, a big ALTER that builds a copy, or a long transaction that bloats undo. You can end up with plenty of allocated storage and still hit a “temporary space” wall that feels random if you’ve only run on bare metal.

Self-managed environments often put tmpdir on a separate volume and size it intentionally. On RDS, temporary tables and sort files land on storage you don’t fully control, under a configuration you may never have reviewed. You need to know what your engine does under sort pressure and how much headroom you have before the system starts failing in creative ways.
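
A quick sanity check, reusing the example endpoint from the tasks below (these are standard MySQL 8.0 variables; the point is to know your own values before an incident):

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('tmp_table_size','max_heap_table_size','internal_tmp_mem_storage_engine');"

If in-memory temp tables are capped low relative to your reporting queries, every big GROUP BY quietly becomes a disk event.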

3) Max connections is not a number; it’s a blast radius

RDS sets defaults via parameter groups and ties the default max_connections to the instance class’s memory through a formula. On a box you own, you can crank max_connections until you run out of RAM and then wonder why the kernel OOM killer is writing your postmortem. On RDS, you can still kill the instance with connections—just more politely.

The hidden limit is usually not the configured max_connections but what happens before you hit it: per-connection memory (sort buffers, join buffers), thread scheduling overhead, and mutex contention inside MySQL. RDS adds another twist: connection storms during failover, app retries, and pooling misconfiguration can stack up and turn a small latency event into a complete pileup.
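
It helps to know what each connection can cost before the storm. A small sketch against the same example endpoint (all standard system variables; some per-connection buffers can be allocated more than once per query):

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT @@max_connections, @@sort_buffer_size, @@join_buffer_size, @@read_rnd_buffer_size, @@tmp_table_size\G"

Multiply the buffer sizes by a realistic number of busy connections and compare the result to instance memory. That arithmetic is the real limit, not the max_connections number.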

4) Failover is a product feature, not a free lunch

Self-managed failover can be instant or terrible, depending on how much you invested in automation. RDS failover is generally good, but it isn’t magical. There is detection time, promotion time, DNS propagation time, and application reconnection behavior. During that window, your app might hammer the endpoint with retries like a toddler hitting an elevator button.

Hidden limit: the “writer endpoint” abstraction can mask topology changes, but it doesn’t remove the need for your application to handle transient errors, stale connections, and idempotency. If your app doesn’t tolerate a 30–120 second brownout, you don’t have high availability. You have a hope-and-pray architecture.
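
During that window, it’s worth verifying what you’re actually talking to. A minimal sketch, assuming a bastion with dig and the mysql client; the endpoint is the article’s example and the resolved address is illustrative:

cr0x@server:~$ dig +short prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com
10.0.5.27
cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT @@hostname, @@read_only\G"

If cached DNS or a pooled connection still points at the old address, you’ll see connection errors or a read_only node. Either way, that’s a client-side problem, not proof the failover failed.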

5) Replication lag is often an I/O story wearing a SQL costume

People treat replication lag as a replication setting problem. Sometimes it is. Often it’s just the replica being slower at applying writes because it’s saturated on storage, CPU, or single-threaded apply constraints. RDS gives you metrics and some knobs; it does not give you permission to ssh in and “just check a thing.”

Hidden limit: the replica instance class might be smaller, the storage class might differ, or the binlog format and workload might make parallel replication ineffective. In incidents, the wrong fix is usually “restart replication.” The right fix is usually “reduce write pressure” or “fix the slow apply path.”
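
One engine-side check worth knowing: whether the replica is even allowed to apply in parallel. A sketch assuming MySQL 8.0.26+ variable names (older versions use the slave_* spellings):

cr0x@server:~$ mysql -h prod-mysql-replica.aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('replica_parallel_workers','replica_parallel_type','replica_preserve_commit_order');"

A single apply worker plus a write-heavy hot table is a lag generator that no amount of “restart replication” will fix.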

6) Parameter groups make configuration safer—and slower to change

On a self-managed box, you edit my.cnf, reload, and move on. On RDS, a parameter change can be dynamic, can require a reboot, or can be unavailable entirely. You also have to track which parameter group is attached to which instance, which sounds boring until you discover prod and staging are “almost” the same.

Hidden limit: operational latency. During an incident, “we can tune X” isn’t a plan unless you know whether X is modifiable without downtime and whether you can do it quickly and safely.
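
You can answer the “is it dynamic?” question before the incident. A sketch using the AWS CLI with a hypothetical parameter group name; ApplyType in the output tells you whether a change is dynamic or static (reboot required):

cr0x@server:~$ aws rds describe-db-parameters \
    --db-parameter-group-name prod-mysql-params \
    --query "Parameters[?ParameterName=='max_connections'].[ParameterName,ParameterValue,ApplyType,IsModifiable]" \
    --output table

Keep a short list of the parameters you might touch under pressure, with their ApplyType, in the runbook.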

7) Observability: you don’t get the host, so you must get good at the engine

On self-managed MySQL, you can use OS-level tools: perf, strace, tcpdump, cgroup stats, filesystem latency histograms, and the comfort of knowing you can always dig deeper. On RDS, you use CloudWatch metrics, Enhanced Monitoring (if enabled), Performance Insights (if enabled), and MySQL’s own tables and status variables.

If you don’t enable those features ahead of time, your future incident self will be staring at a blurry photo of a crime scene. Joke #1: Observability you didn’t enable is like a fire extinguisher still in the Amazon cart—very affordable, very useless.
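
Checking whether the lights are on takes one command. A sketch with a hypothetical instance identifier; PerformanceInsightsEnabled and MonitoringInterval are real fields, and a MonitoringInterval of 0 means Enhanced Monitoring is off:

cr0x@server:~$ aws rds describe-db-instances --db-instance-identifier prod-mysql \
    --query "DBInstances[0].{PerformanceInsights:PerformanceInsightsEnabled,EnhancedMonitoringInterval:MonitoringInterval}"

If either answer surprises you, fix it this week, not during the next incident.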

8) Backups and restores: snapshots are fast; recovery is still a process

RDS snapshots are convenient, and point-in-time recovery is powerful. But the incident trap is assuming “we can just restore quickly.” Restores take time, and the new instance needs warm-up, parameter verification, security group checks, and application cutover. If you’re used to restoring a local backup to a spare VM, you may be shocked by the orchestration overhead.

Hidden limit: time to operational readiness, not time to create the instance.
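
Two things worth checking on a calm day, again with a hypothetical instance identifier; LatestRestorableTime bounds your point-in-time recovery, and the timestamp shown is illustrative:

cr0x@server:~$ aws rds describe-db-instances --db-instance-identifier prod-mysql \
    --query "DBInstances[0].LatestRestorableTime"
"2025-12-30T01:58:00+00:00"

Then time a full practice restore end to end, including parameter group attachment, security groups, and application cutover. That end-to-end number, not the snapshot age, is your real recovery-time conversation.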

9) You cannot fix everything with “bigger instance”

Scaling up helps CPU-bound workloads and gives you more RAM for buffer pool. It does not automatically fix I/O limits if storage throughput is the choke point. It does not fix contention in hot rows or metadata locks. It does not fix pathological queries. And during an incident, scaling can be slow or disruptive.

Self-managed environments often have more “unsafe” options (drop caches, restart services, detach volumes, run emergency scripts). RDS reduces the number of sharp tools you can hurt yourself with. It also reduces the number of sharp tools you can save yourself with. Plan accordingly.

Fast diagnosis playbook: what to check, in order

First: decide whether you’re CPU-bound, I/O-bound, or lock-bound

  • CPU-bound signals: high CPU, high “active sessions” in Performance Insights, queries slow even when working set fits in memory, many functions/sorts, poor indexes.
  • I/O-bound signals: low/moderate CPU, rising query latency, increased InnoDB reads/writes, elevated disk queue (CloudWatch), buffer pool misses, dirty page flushing spikes.
  • Lock-bound signals: CPU may be low, but threads are “waiting for” locks; many connections stuck; a few transactions block everything; replication lag spikes due to apply blocking.

Second: confirm the bottleneck with two independent views

Don’t trust one metric. Pair engine-internal data (status, processlist, performance_schema) with platform metrics (CloudWatch/Enhanced Monitoring) or query-level evidence (top digests, slow log).
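
A lightweight way to get the engine-side half of that pairing is to sample status deltas for a minute. A sketch reusing the example endpoint; the grep list is a starting point, not a complete diagnosis:

cr0x@server:~$ mysqladmin -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -i 10 -c 6 extended-status | grep -E 'Threads_running|Innodb_data_pending_reads|Innodb_data_pending_writes|Innodb_row_lock_current_waits'

Rising pending I/O with flat Threads_running points at storage; rising row lock waits points at contention; both flat with high CPU points at query cost.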

Third: choose the safest intervention

  • If lock-bound: identify blocker, kill the right session, reduce transaction scope, fix app behavior. Avoid “restart MySQL” unless you’re choosing downtime on purpose.
  • If I/O-bound: stop doing the thing that writes/reads too much (batch job, big report), reduce concurrency, temporarily scale IOPS/storage class, and then fix schema/query patterns.
  • If CPU-bound: find top SQL by time, add or fix indexes, reduce expensive queries, consider scaling instance, and then fix the query plan for real.

Fourth: prevent the rebound

Incidents love rebounds: you kill the blocking query, latency drops, autoscalers or retry storms reintroduce load, and you’re back where you started. Rate-limit retries, pause batch workers, and control connection pool behavior.

Practical tasks: commands, outputs, and decisions

These are designed for incident response. Each task includes a command, what the output means, and the decision you make from it. Run them against self-managed MySQL or RDS MySQL (from a bastion or app host). Replace host/user names as needed.

Task 1: Confirm basic connectivity and server identity

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT @@hostname, @@version, @@version_comment, @@read_only\G"
Enter password:
*************************** 1. row ***************************
@@hostname: ip-10-11-12-13
@@version: 8.0.36
@@version_comment: Source distribution
@@read_only: 0

Meaning: You’re on the writer (read_only=0) and you know the major version. Version matters because behavior and instrumentation differ.

Decision: If you expected a replica and got the writer (or vice versa), stop and fix your target before making “helpful” changes.

Task 2: See what threads are doing right now

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW FULL PROCESSLIST;"
...output...
10231 appuser 10.0.8.21:51244 appdb Query  38  Sending data SELECT ... 
10244 appuser 10.0.7.19:49812 appdb Query  38  Waiting for table metadata lock ALTER TABLE orders ...
10261 appuser 10.0.8.23:50111 appdb Sleep  120  NULL
...

Meaning: You can spot obvious stalls: “Waiting for table metadata lock” is a giant red arrow pointing to DDL or long transactions.

Decision: If you see lock waits dominating, pivot to lock diagnosis instead of chasing CPU or I/O.

Task 3: Identify the blocking transaction (InnoDB lock waits)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT * FROM sys.innodb_lock_waits\G"
*************************** 1. row ***************************
wait_started: 2025-12-30 01:12:18
wait_age: 00:01:42
waiting_trx_id: 321889203
waiting_pid: 10252
waiting_query: UPDATE orders SET status = ... WHERE id = ...
blocking_trx_id: 321889199
blocking_pid: 10198
blocking_query: UPDATE orders SET ...
blocking_lock_mode: X

Meaning: This shows who’s blocked on an InnoDB row lock and who’s doing the blocking, with PIDs you can act on. A DDL stuck on a table metadata lock (like the ALTER in Task 2) won’t appear here; check sys.schema_table_lock_waits for that.

Decision: If the blocker is an app transaction stuck in a loop, kill the blocker (not the victim) and mitigate at the app layer.

Task 4: Kill the right session (surgical, not emotional)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "KILL 10198;"
Query OK, 0 rows affected (0.01 sec)

Meaning: The blocking thread is terminated. The lock should clear; blocked queries should proceed or fail fast.

Decision: Immediately watch for reconnection storms and retry spikes. Killing one query can invite a hundred replacements.

Task 5: Check InnoDB engine status for deadlocks, flushing, and history length

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW ENGINE INNODB STATUS\G"
...output...
History list length 51234
Log sequence number 89433222111
Log flushed up to 89433111822
Pending writes: LRU 0, flush list 128, single page 0
...

Meaning: Large history list length suggests long-running transactions preventing purge. Pending flush list suggests write pressure.

Decision: If history list length is exploding, find and end long transactions; if pending flush is high, reduce write load and consider storage/IOPS tuning.

Task 6: Confirm buffer pool health (memory vs I/O pressure)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+---------------------------------------+------------+
| Variable_name                         | Value      |
+---------------------------------------+------------+
| Innodb_buffer_pool_read_requests      | 9812234432 |
| Innodb_buffer_pool_reads              | 22334455   |
+---------------------------------------+------------+

Meaning: Innodb_buffer_pool_reads are physical reads. If they spike relative to requests, you’re missing cache and hitting storage.

Decision: If physical reads climb, consider increasing buffer pool (bigger instance class) and/or reduce workload working set (indexes, query fixes).

Task 7: Detect temp table spills (the quiet disk killer)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';"
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Created_tmp_disk_tables | 18443321 |
| Created_tmp_tables      | 22300911 |
+-------------------------+----------+

Meaning: A high disk temp table count usually means sorts, GROUP BYs, or joins are spilling. On RDS, this can collide with temp space constraints and I/O throughput.

Decision: Identify the offending queries (Performance Insights / slow log / statement digests) and fix them; don’t “just raise tmp_table_size” and call it done.

Task 8: Find top queries by total time using Performance Schema digests

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,1) AS total_s, ROUND(AVG_TIMER_WAIT/1e9,1) AS avg_ms FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE customer_id = ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 188233
total_s: 6221.4
avg_ms: 33.1
...

Meaning: You get a ranked list of the SQL shapes consuming time. This is usually the fastest path to reality.

Decision: Take the top 1–2 digests and analyze their plans. Do not optimize the 19th query because it’s “ugly.” Optimize what’s burning time.

Task 9: Explain the plan (and spot missing indexes)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "EXPLAIN FORMAT=TRADITIONAL SELECT * FROM orders WHERE customer_id = 123 ORDER BY created_at DESC LIMIT 50;"
+----+-------------+--------+------+---------------+------+---------+------+--------+-----------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows   | Extra                       |
+----+-------------+--------+------+---------------+------+---------+------+--------+-----------------------------+
|  1 | SIMPLE      | orders | ALL  | idx_customer  | NULL | NULL    | NULL | 932112 | Using where; Using filesort |
+----+-------------+--------+------+---------------+------+---------+------+--------+-----------------------------+

Meaning: type=ALL with “Using filesort” indicates a full table scan followed by a sort. On big tables, that’s an incident waiting to happen.

Decision: Add a composite index (e.g., (customer_id, created_at)) and verify it matches query patterns. Plan the change safely (online DDL behavior matters).
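
A hedged sketch of that change (the index name is hypothetical; the ALGORITHM and LOCK clauses make MySQL refuse the operation instead of silently blocking writes if it can’t run online):

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "ALTER TABLE appdb.orders ADD INDEX idx_customer_created (customer_id, created_at), ALGORITHM=INPLACE, LOCK=NONE;"

Even an “online” index build reads the whole table and writes to storage, so schedule it with the I/O budget in mind, not just the lock behavior.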

Task 10: Check replication lag and apply status (replica-side)

cr0x@server:~$ mysql -h prod-mysql-replica.aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW REPLICA STATUS\G"
...output...
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 187
Retrieved_Gtid_Set: ...
Executed_Gtid_Set: ...

Meaning: IO and SQL threads are running, but lag is 187s. That’s performance, not a broken link.

Decision: Check replica resource saturation and query apply bottlenecks; consider scaling replica, improving storage throughput, or reducing write load temporarily.

Task 11: Validate whether you are hitting connection saturation

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Threads_connected','Threads_running','Max_used_connections');"
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 1980  |
| Threads_connected    | 1750  |
| Threads_running      | 220   |
+----------------------+-------+

Meaning: Many connections exist, but only 220 running. That often indicates lock waits, I/O waits, or connection pool misbehavior.

Decision: If connected is near max, protect the instance: throttle app, enable pooling, and consider lowering connection limits per service to stop the loudest client from winning.

Task 12: Check for long transactions that bloat undo and block purge

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT trx_id, trx_started, trx_rows_locked, trx_rows_modified, trx_query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5\G"
*************************** 1. row ***************************
trx_id: 321889101
trx_started: 2025-12-30 00:02:11
trx_rows_locked: 0
trx_rows_modified: 812331
trx_query: UPDATE orders SET ...

Meaning: A transaction running since 00:02 with 800k rows modified is not “normal background noise.” It’s a durability and latency tax.

Decision: Work with app owners to chunk work, commit frequently, or move heavy updates off-peak. During incident, consider killing it if it’s blocking or destabilizing.

Task 13: Confirm binlog pressure and retention behavior

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SHOW BINARY LOGS;"
+------------------+------------+
| Log_name         | File_size  |
+------------------+------------+
| mysql-bin.012331 | 1073741824 |
| mysql-bin.012332 | 1073741824 |
| mysql-bin.012333 | 1073741824 |
...

Meaning: Lots of large binlogs suggest heavy write activity. On RDS, retention settings and storage growth can become a surprise bill and a surprise incident.

Decision: If binlogs are ballooning, verify replica health, retention configuration, and whether a stuck replica is preventing log purging.
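
On RDS specifically, binlog retention is typically controlled through RDS stored procedures rather than the expire_logs settings you may remember from self-managed boxes. A sketch (the 24-hour value is an example, not a recommendation):

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "CALL mysql.rds_show_configuration;"
cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "CALL mysql.rds_set_configuration('binlog retention hours', 24);"

Raise retention when downstream consumers (replicas, CDC) need slack; lower it when storage is the problem, and only after confirming nothing still needs those logs.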

Task 14: Check table/index bloat and cardinality drift (quick sanity)

cr0x@server:~$ mysql -h prod-mysql.cluster-aaaa.us-east-1.rds.amazonaws.com -u ops -p -e "SELECT table_name, engine, table_rows, data_length, index_length FROM information_schema.tables WHERE table_schema='appdb' ORDER BY (data_length+index_length) DESC LIMIT 5;"
+------------+--------+------------+--------------+--------------+
| table_name | engine | table_rows | data_length  | index_length |
+------------+--------+------------+--------------+--------------+
| orders     | InnoDB |   93211234 |  90194313216 |  32112201728 |
...

Meaning: Big tables dominate your fate. If the largest table’s indexes are enormous or mismatched to queries, you will pay in I/O and cache misses.

Decision: Prioritize indexing and data lifecycle policies (partitioning, archiving) on the top tables, not the ones people complain about loudly.

Three corporate-world mini-stories from the incident floor

Mini-story 1: The incident caused by a wrong assumption

A mid-sized SaaS company moved from self-managed MySQL on a tuned NVMe RAID box to RDS MySQL. The migration was clean. App latency improved. Everyone declared victory and went back to shipping features.

Three months later, a sales campaign worked too well. Write traffic doubled for a few hours. Nothing “broke” immediately; instead, p95 climbed, then p99 exploded. The primary instance CPU hovered at 35%. Engineers stared at it like it was a liar.

The wrong assumption was simple: “If CPU is fine, the database is fine.” On their old servers, that assumption was often valid because storage had enough headroom and was monitored directly. On RDS, the team hadn’t enabled Performance Insights and barely watched storage metrics. The gp2 volume was burning through burst credits, I/O latency climbed, and InnoDB started flushing dirty pages more aggressively. The system was I/O-bound while the CPU looked relaxed.

They tried scaling the instance class. It helped a bit but didn’t fix the root problem: the storage configuration and write amplification of their workload. The fix was to move to a storage option with predictable IOPS, tune write-heavy patterns (smaller transactions, fewer secondary indexes on hot tables), and add dashboards that made “I/O debt” visible before it turned into an incident.

Afterward, they adopted a rule: every migration includes a “new bottleneck rehearsal.” If you can’t explain how it fails, you haven’t finished migrating.

Mini-story 2: The optimization that backfired

A finance-adjacent company had a nightly job that recalculated aggregates. It was slow, so an engineer “optimized” it by increasing concurrency: more workers, bigger batches, and a larger connection pool. In staging it looked great. In production it turned into a brownout generator.

The job mostly did updates on a hot table with several secondary indexes. More workers meant more simultaneous index maintenance and more row-level locking conflicts. The redo log and flushing behavior became the bottleneck, and temp tables spilled to disk because the job also did group-bys for intermediate steps.

On self-managed MySQL, they used to watch iostat and tune the host. On RDS, they watched CPU and assumed it was the limiter. It wasn’t. The “optimization” increased write amplification and lock contention, pushing the storage subsystem into sustained high latency. Replication lag climbed, and read traffic started hitting the writer because replicas were too far behind to be trusted.

The boring fix was to reduce concurrency, chunk updates deterministically by primary key ranges, and add an index that turned one expensive join into a cheap lookup. The job finished slightly slower than the “optimized” version in isolation, but it stopped wrecking the rest of the platform.

Joke #2: In databases, “more parallelism” sometimes means “more people trying to fit through the same door at once.”

Mini-story 3: The boring but correct practice that saved the day

An e-commerce platform ran RDS MySQL with Multi-AZ and a read replica used for analytics. Their engineers were not famous for thrilling architecture. What they were famous for was runbooks, routine game days, and an almost annoying obsession with parameter groups.

One afternoon, a schema migration introduced a query plan regression. The writer’s latency spiked. Then the replica lag spiked. Then the app began timing out and retrying. Classic spiral.

The on-call didn’t start by “tuning MySQL.” They started by executing a practiced playbook: confirm whether it’s lock/I/O/CPU, identify top digests, verify whether retries are amplifying, and then take the least risky load-shedding action. They temporarily disabled the analytics consumer, reduced worker concurrency, and applied a query-level feature flag to stop the worst offender. Replication lag stabilized.

Here’s where the boring practice paid: they had tested restoring from snapshot and promoting replicas, and they had a documented procedure to detach read traffic from the writer. They didn’t end up needing a failover, but they could have done it calmly if they had to. The incident lasted under an hour, and the postmortem was mostly about query review discipline, not heroics.

Common mistakes: symptoms → root cause → fix

1) Symptom: CPU is low, latency is high

Root cause: I/O saturation (storage throughput/IOPS) or lock waits. Common on RDS when gp2 volumes exhaust their burst credits or when temp tables spill to disk.

Fix: Verify buffer pool misses and temp disk tables; check CloudWatch storage latency/queue; reduce write/read pressure; move to predictable IOPS; fix queries and indexes.

2) Symptom: “Too many connections” appears after failover

Root cause: Connection storm due to app retry behavior and lack of pooling; old connections not being recycled; max_connections set without understanding memory per thread.

Fix: Enforce pooling (ProxySQL/RDS Proxy/app pool), cap per-service connections, add jittered retries, and set sane timeouts. Prefer fewer, healthier connections over thousands of idle ones.

3) Symptom: Replicas lag steadily but never catch up

Root cause: Replica apply throughput lower than write workload, often due to storage or single-threaded apply constraints, or a weaker instance class.

Fix: Scale replica resources and storage throughput, reduce write volume (batch throttling), and check for long-running transactions or lock waits on replica.

4) Symptom: ALTER TABLE “hangs” and everything else slows

Root cause: Metadata locks and long transactions; or online DDL creating heavy temp/redo pressure; sometimes a blocked DDL waiting for a transaction to finish.

Fix: Identify blockers with sys.innodb_lock_waits and processlist; schedule DDL off-peak; use safe online schema change methods; shorten transactions.

5) Symptom: Disk space alarms but data size didn’t grow much

Root cause: Binlogs, undo growth from long transactions, temp table spills, or snapshots/retention side effects.

Fix: Inspect binlog volume and retention; kill or refactor long transactions; fix queries that spill; confirm backup retention policies.

6) Symptom: After scaling instance, performance barely improves

Root cause: Bottleneck is storage throughput/IOPS or lock contention, not CPU/RAM.

Fix: Move to higher-throughput storage, reduce write amplification, and fix contention hot spots (indexes, application transaction patterns).

7) Symptom: “Free memory” looks high but system is slow

Root cause: MySQL memory is not the whole story; InnoDB may be under-allocated or the workload may be I/O-bound. RDS memory metrics can mislead if you don’t inspect buffer pool and working set.

Fix: Check buffer pool hit ratio indicators, examine top queries and indexes, and tune instance class and innodb_buffer_pool_size appropriately (where permitted).

8) Symptom: A read replica is “healthy” but returns stale data

Root cause: Replication lag or delayed apply; application assumes read-after-write consistency from replicas.

Fix: Route read-after-write to writer (session stickiness), enforce consistency for critical paths, monitor lag, and set thresholds for replica usage.

Checklists / step-by-step plan

Before you migrate (or before the next incident, if you already did)

  1. Enable the right telemetry now: Performance Insights, slow query log (with sane sampling), and Enhanced Monitoring where appropriate.
  2. Define SLOs and the “stop the bleeding” actions: which batch jobs can be paused, which endpoints can be degraded, and who can flip those switches.
  3. Map limits explicitly: max connections, storage type/IOPS, temp behavior, failover expectations, and parameter group change semantics (dynamic vs reboot).
  4. Run a load rehearsal: simulate the top two traffic spikes you’ve seen historically and confirm how the system degrades.
  5. Write the runbook like you’ll read it half-asleep: short steps, exact queries, decision points, and rollback criteria.

During an incident (triage sequence that works in the real world)

  1. Stop amplification: rate-limit retries, pause batch workers, and cap connection pools. If you do nothing else, do this.
  2. Classify the bottleneck: CPU vs I/O vs locks using processlist + key status variables + platform metrics.
  3. Identify top SQL shapes: digests / PI top waits / slow log. Pick the top offender, not the loudest team.
  4. Apply least-risk mitigation: feature-flag off a query, reduce concurrency, or reroute reads/writes carefully.
  5. Only then tune or scale: instance scaling and parameter changes are valid tools, but they’re not first-aid.
  6. Record timestamps: every action, every metric shift, every change. Your postmortem depends on it.

After the incident (prevent recurrence, not just embarrassment)

  1. Convert the root cause into a guardrail: query linting, migration review, capacity alerts on the real bottleneck (often I/O), and load-shedding automation.
  2. Fix the data lifecycle: archiving, partitioning, and index hygiene on the top tables.
  3. Practice restores and failovers: time them end-to-end, including application cutover and verification.
  4. Make performance regressions harder to ship: capture query plans for critical queries and compare them across releases.

FAQ

1) Is RDS MySQL “real MySQL”?

At the SQL layer, yes. Operationally, it’s MySQL inside a managed environment with constraints: no root access, managed storage behaviors, and product-defined failover and backup mechanics.

2) What’s the single biggest “hidden limit” difference?

Storage performance predictability. On self-managed hosts you can often see and tune the full stack; on RDS you must choose the right storage class and monitor I/O with the tools RDS provides.

3) Why do incidents on RDS feel harder to debug?

Because you can’t drop to the host and run OS-level tooling. You must rely on MySQL instrumentation and RDS telemetry. If you didn’t enable them, you’re debugging with half the lights off.

4) Should we just crank max_connections to avoid connection errors?

No. That’s how you turn a small traffic spike into a memory and contention disaster. Use connection pooling, cap per-service connections, and treat connection errors as a backpressure signal.

5) Why does scaling the instance class sometimes not fix performance?

Because you might be I/O-bound or lock-bound. More CPU doesn’t fix disk latency, and more RAM doesn’t fix metadata lock waits. Diagnose first, then scale for the bottleneck you actually have.

6) Are read replicas a safe way to scale reads during incidents?

Only if you monitor lag and your application is designed for it. Replicas are great until they’re behind, then they’re a correctness problem, not a capacity solution.

7) What’s the fastest safe intervention when latency spikes?

Stop amplification: pause batch jobs, reduce worker concurrency, and rate-limit retries. Then identify whether you’re locked, I/O-bound, or CPU-bound before making deeper changes.

8) Do parameter group changes always require downtime?

No, but many do require a reboot. During an incident, “we’ll change a parameter” is only helpful if you already know whether it’s dynamic and what side effects it has.

9) How do we avoid temp table incidents on RDS?

Fix the query shapes that spill (indexes, reduce sort/group by cost), limit concurrency of heavy reports, and watch Created_tmp_disk_tables as an early warning, not a trivia stat.

10) Is Multi-AZ enough for high availability?

It’s necessary, not sufficient. Your app must handle transient errors, reconnect correctly, avoid retry storms, and tolerate brief brownouts. HA is a system property, not a checkbox.

Conclusion: what to change on Monday

MySQL and RDS MySQL run the same SQL, but they fail differently. Self-managed failures often involve a host you can poke until it confesses. RDS failures often involve a limit you agreed to months ago and forgot to monitor.

Next steps that actually reduce incident time:

  1. Enable and verify engine-level visibility (Performance Insights and performance_schema usage), and practice using it under load.
  2. Build dashboards that answer one question fast: is the bottleneck CPU, I/O, or locks?
  3. Write a load-shedding runbook: which jobs get paused, which endpoints degrade, and how you stop retry storms.
  4. Revisit storage choices and replication topology based on real workload, not defaults you inherited.
  5. Turn your top 5 expensive query shapes into owned items with indexes, plan stability checks, and safe rollout paths.

If you do those five, the next incident will still be stressful. But it won’t be mysterious. And mysterious incidents are the ones that age you.
