There are two kinds of engineers: those who have taken production down with an ALTER TABLE, and those who haven’t… yet. Schema changes look innocent in a ticket. Then the pager rings because your “quick” change is waiting on a lock held by a long-running transaction started by a BI tool that forgot how to exit politely.
MySQL and PostgreSQL both let you mutate reality in-place. They just charge you in different currencies: MySQL loves metadata locks and online DDL edge cases; PostgreSQL loves transactional purity and the occasional table rewrite that turns your disk into a space heater. If you operate either at scale, the question isn’t “can we run ALTER TABLE?” It’s “what exactly will it lock, rewrite, replicate, and surprise us with at 2 a.m.?”
What “ALTER TABLE nightmare” actually means in production
A schema change becomes a nightmare when it violates one of three production truths:
- Latency budgets are real. Anything that blocks writes for seconds will surface as timeouts, retries, queue buildup, and sometimes cascading failure.
- Replication is part of your serving path. Even if you “don’t read from replicas,” lag can break failovers, backups, analytics, and any safety mechanism that depends on fresh secondaries.
- Storage isn’t infinite and I/O isn’t free. Rewriting a 2 TB table isn’t just slow—it can fill disks, churn buffer caches, and trigger autovacuum or purge storms.
In practice, schema change pain clusters around:
- Locks that block traffic (explicit locks, implicit locks, and the particularly spicy kind: metadata locks in MySQL).
- Table rewrites (common in PostgreSQL; possible in MySQL depending on engine and DDL mode).
- Replication behavior (statement-based vs row-based implications, DDL serialization, replica apply lag).
- Long transactions (both databases hate them, they just hate them differently).
- Tooling expectations (someone assumes “online” means “zero impact,” and production teaches humility).
One quote worth keeping on a sticky note:
Werner Vogels’ idea, paraphrased: everything fails eventually, so you build reliability by assuming things will fail and designing to recover fast.
Schema changes are reliability work. Treat them like deploys, not like editing a spreadsheet column.
A few historical facts that explain today’s behavior
These aren’t trivia; they’re why your DDL behaves the way it does.
- PostgreSQL chose transactional DDL early: DDL participates in transactions, so it can be rolled back. Great for correctness, but it also means locks can be held until commit.
- MySQL’s “online DDL” is mostly an InnoDB story: the behavior depends on storage engine capabilities, and historically MyISAM would happily lock the world.
- InnoDB introduced “Fast Index Creation” (FIC) to reduce rebuild cost, but the fine print matters: some alters still rebuild the table.
- PostgreSQL has offered CREATE INDEX CONCURRENTLY since 8.2, and it grew into the standard operational tool for adding indexes without blocking writes, with tradeoffs: it takes longer and can fail if you don’t babysit it.
- PostgreSQL 11 improved “add column with default” by avoiding a full table rewrite for constant defaults in many cases—one of the biggest operational quality-of-life upgrades.
- MySQL 5.6/5.7 broadened online DDL options with ALGORITHM and LOCK clauses, but they’re more like a negotiation than a guarantee.
- MySQL 8.0 introduced “instant ADD COLUMN” for some cases, but the moment you touch certain column attributes or ordering, you’re back to heavier operations.
- PostgreSQL’s MVCC means old row versions linger until vacuum cleans up; large updates during schema changes can create a vacuum/I/O aftershock.
- MySQL replication historically applied DDL in a single thread (and even with parallel replication, DDL has serialization constraints), making “fast on primary” sometimes “slow everywhere else.”
Joke #1: Calling an ALTER “online” because it doesn’t take an exclusive lock is like calling a bear “friendly” because it hasn’t bitten you yet.
MySQL: online DDL, metadata locks, and the illusion of “instant”
MySQL’s core problem: metadata locks don’t care about your intent
In MySQL, a lot of ALTER pain isn’t the work itself—it’s MDL (metadata locks). Any DDL needs metadata locks, and those locks interact with ongoing queries in ways that surprise teams who equate “online DDL” with “no blocking.”
The classic failure mode looks like this:
- You start ALTER TABLE.
- It waits on an MDL because some session is reading the table.
- New queries pile up behind the waiting ALTER because they want conflicting locks.
- Your app melts down even though “the ALTER hasn’t started.”
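One cheap guardrail, sketched below (the five-second value is an assumption to tune): lower lock_wait_timeout for the migration session so a blocked ALTER errors out instead of parking itself at the head of the queue.

-- Session-scoped: affects only this migration connection.
SET SESSION lock_wait_timeout = 5;  -- seconds to wait for the metadata lock
ALTER TABLE orders ADD COLUMN promo_code varchar(32);
-- If the MDL isn't granted within 5 seconds, the ALTER fails with a lock wait
-- timeout error and traffic never piles up behind it. Retry later, deliberately.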
ALGORITHM and LOCK clauses: useful, but not a force field
MySQL lets you ask for a DDL algorithm and locking behavior. Example:
- ALGORITHM=INSTANT for metadata-only changes when supported.
- ALGORITHM=INPLACE to avoid a table copy when possible.
- LOCK=NONE or LOCK=SHARED to reduce blocking.
The operational trick: always specify them when you have strict availability requirements, because they turn surprises into immediate errors. An error is a gift. It’s telling you: “Not today, not like this.”
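In practice that means spelling out both clauses every time; a minimal example, using the same hypothetical orders table as the tasks below:

-- Ask for the best case explicitly; MySQL errors immediately if it can't comply.
ALTER TABLE orders ADD COLUMN promo_code varchar(32), ALGORITHM=INSTANT, LOCK=NONE;
-- If that errors, fall back deliberately (more I/O, but still avoids a full copy):
ALTER TABLE orders ADD COLUMN promo_code varchar(32), ALGORITHM=INPLACE, LOCK=NONE;
-- If even INPLACE is refused, that's your cue to reach for gh-ost or pt-osc.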
What MySQL does well
- Some changes can be genuinely instant (e.g., some ADD COLUMN cases on 8.0), which is lovely when it works.
- Tooling ecosystem for online schema change is mature: pt-online-schema-change, gh-ost, and homegrown variants are common.
- InnoDB can do many index operations online-ish, with manageable impact if you understand the constraints.
Where MySQL hurts
- MDL pileups are catastrophic: they look like sudden “database is down” symptoms even when the server is healthy.
- Replication lag can become your hidden outage: DDL can block apply or conflict with parallelism.
- “Online” still consumes I/O and CPU: building a secondary index on a large table will compete with production workload. The absence of a big lock doesn’t mean the absence of pain.
PostgreSQL: transactional DDL, rewrites, and vacuum-shaped consequences
PostgreSQL’s core problem: rewrites are honest, brutal, and sometimes unavoidable
PostgreSQL is consistent: if a schema change requires touching every row, it will do exactly that. And it will do it under transactional semantics. That means:
- Table rewrites can be huge (time + I/O + extra disk).
- Locks can be strong (some operations require ACCESS EXCLUSIVE).
- Long-running transactions make everything worse, because they prevent cleanup of old tuples and can keep locks open longer than you think.
PostgreSQL’s operational superpower: predictable tooling primitives
Postgres gives you building blocks that behave predictably:
- CREATE INDEX CONCURRENTLY avoids heavy locking (at the cost of longer runtime and more sensitivity to failures).
- ALTER TABLE ... ADD COLUMN is often cheap, and since v11 adding a constant default often avoids rewriting the entire table.
- Rich introspection via system catalogs and pg_stat_activity, making “what is blocking what?” an answerable question.
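A habit worth pairing with any Postgres DDL, sketched under the assumption that failing fast beats queueing: set lock_timeout on the migration session so a blocked ALTER aborts instead of stalling everything that arrives after it.

-- Session-scoped; the values are starting points, not gospel.
SET lock_timeout = '5s';           -- give up if the lock isn't granted quickly
SET statement_timeout = '15min';   -- hard ceiling on the DDL itself
ALTER TABLE orders ADD COLUMN promo_code text;
-- A timeout here is an error you can retry, not an outage you have to explain.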
Where Postgres hurts
- Some DDL still takes an ACCESS EXCLUSIVE lock, which blocks reads and writes. The list is not short: column type changes, some constraint validations, some table rewrites.
- MVCC side effects: changes that touch many rows can generate bloat; vacuum then competes for I/O. Your migration can “finish” and your pain begins later.
- Replication on physical standbys: heavy write churn during a rewrite means WAL volume, which means standby lag and storage pressure.
Joke #2: PostgreSQL will let you do almost anything in a transaction; it’s just occasionally the transaction is “rebuild the universe.”
Who makes it worse? A practical verdict by change type
1) Add a nullable column
Usually: both are fine.
- MySQL 8.0: often ALGORITHM=INSTANT if you don’t force column order changes and you stay within supported patterns.
- PostgreSQL: metadata-only, fast.
Nightmare risk: low, unless you trigger MDL pileups (MySQL) or take strong locks by bundling multiple operations (Postgres).
2) Add a column with a default
PostgreSQL improved dramatically in v11. For constant defaults, it can store the default in metadata and apply it at read time, avoiding a full rewrite in many cases.
MySQL: may still require table rebuild depending on exact change and version/engine. “Default” is rarely the only change; people sneak in NOT NULL and a reorder and then wonder why the table copies itself.
Verdict: modern Postgres often wins for constant defaults. MySQL can win if it’s instant; it can also lose spectacularly if it rebuilds.
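If you want to confirm Postgres took the fast path, a minimal check (v11+, hypothetical status column): the default should live in the catalog rather than in rewritten rows.

-- Metadata-only on v11+ for a constant default.
ALTER TABLE orders ADD COLUMN status text DEFAULT 'new';
-- atthasmissing = true means existing rows were not rewritten;
-- the stored default is supplied at read time.
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'public.orders'::regclass AND attname = 'status';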
3) Add an index
Neither is free. You’re reading a lot of data and writing a lot of structure.
- MySQL: index creation can be “online” but still consumes I/O/CPU; metadata locks still matter, and replication apply can lag.
- PostgreSQL: CREATE INDEX CONCURRENTLY is the go-to for uptime, with the operational caveat that it may take longer and cannot run inside a transaction block.
Verdict: Postgres wins on lock avoidance with CONCURRENTLY. MySQL wins when the workload is tolerant and you can schedule the operation—plus it has mature external online schema change tools.
4) Change a column type
This is where you earn your paycheck.
- PostgreSQL: may rewrite the table, and frequently requires ACCESS EXCLUSIVE. There are tricks (add new column, backfill, swap) that avoid long exclusive locks.
- MySQL: often rebuilds the table; can be blocked by MDL; “inplace” support varies.
Verdict: both can be nightmares. Postgres is more transparent about the rewrite; MySQL is more likely to surprise you with MDL and “why is everything stuck?” symptoms.
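A sketch of the staged alternative (Postgres syntax; the column names and cents conversion are assumptions, and MySQL needs only minor syntax changes):

-- 1. Expand: add the new column, nullable, no rewrite.
ALTER TABLE orders ADD COLUMN total_cents bigint;
-- 2. Dual-write from the application, and backfill history in bounded batches:
UPDATE orders SET total_cents = round(total * 100)::bigint
WHERE id >= 1 AND id < 10001 AND total_cents IS NULL;
-- ...advance the id range, pause between batches, watch locks and replica lag...
-- 3. Switch reads to total_cents behind a feature flag; much later:
ALTER TABLE orders DROP COLUMN total;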
5) Add a foreign key
Operationally risky in both. The database must validate existing data, which can mean scanning tables and taking locks.
- PostgreSQL: you can add the constraint NOT VALID and validate later to manage lock time. That’s a real production feature, not a party trick.
- MySQL: validation and locking behavior depends on engine/version; you may end up with long blocking or heavy I/O.
Verdict: Postgres usually gives you a safer staged approach. MySQL can be fine but has fewer “defer validation” primitives.
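The staged Postgres approach in miniature (constraint and column names are hypothetical):

-- Takes only a brief lock; existing rows are not scanned yet.
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;
-- Later, in a quiet window: scans existing rows without blocking normal writes.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;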
6) Drop a column
Both are deceptively dangerous. Not because dropping is hard, but because applications and ORMs are excellent at being wrong in new ways.
Verdict: the database is rarely the nightmare; your deploy choreography is. Do it in phases.
Fast diagnosis playbook
This is the “we started an ALTER and now things are on fire” sequence. The goal is to identify whether you’re blocked on locks, saturated on I/O/CPU, or drowning in replication lag—fast.
First: is it blocked on locks or waiting on metadata?
- MySQL: check processlist and metadata lock waits; look for sessions in “Waiting for table metadata lock.”
- PostgreSQL: check pg_stat_activity and pg_locks for blocking chains and lock modes like AccessExclusiveLock.
If you find blocking: decide whether to kill the blocker, cancel the DDL, or wait. Killing the wrong session can be worse than the migration.
Second: is the server resource-saturated?
- CPU pegged: index build, constraint validation, or heavy triggers.
- I/O pegged: table rewrite, index build, vacuum, InnoDB background activity.
- Memory pressure: buffer pool churn, sorts spilling to disk.
If saturated: throttle, reschedule, or move to online tooling that copies in smaller chunks.
Third: is replication lag turning into a second outage?
- MySQL: monitor replica apply delay and whether SQL thread is blocked behind DDL.
- PostgreSQL: monitor WAL generation and replay lag; DDL that rewrites generates a lot of WAL.
If lag is critical: pause the change, reduce write rate, or accept you can’t fail over safely until lag recovers.
Hands-on tasks: commands, outputs, and decisions (14 of them)
These are practical probes you can run during planning or incident response. Each task includes the command, what the output means, and the decision it drives.
Task 1 (MySQL): see who is waiting on metadata locks
cr0x@server:~$ mysql -uroot -p -e "SHOW FULL PROCESSLIST\G"
*************************** 1. row ***************************
Id: 12345
User: app
Host: 10.0.2.15:53412
db: prod
Command: Query
Time: 120
State: Waiting for table metadata lock
Info: ALTER TABLE orders ADD COLUMN promo_code varchar(32)
*************************** 2. row ***************************
Id: 12201
User: bi
Host: 10.0.9.20:41233
db: prod
Command: Query
Time: 980
State: Sending data
Info: SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 30 DAY
Meaning: The ALTER is blocked by an ongoing query holding a conflicting metadata lock.
Decision: If uptime matters more than that BI query, kill the blocker (or cancel the ALTER). Otherwise, schedule DDL in a window or enforce query timeouts.
Task 2 (MySQL): identify InnoDB lock waits (row locks, not MDL)
cr0x@server:~$ mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '1,160p'
=====================================
2025-12-29 10:18:07 0x7f9a2c1a0700 INNODB MONITOR OUTPUT
=====================================
------------
TRANSACTIONS
------------
Trx id counter 92839110
Purge done for trx's n:o < 92839000 undo n:o < 0 state: running
History list length 2341
...
LATEST DETECTED DEADLOCK
...
Meaning: Shows transactional pressure, history list length (purge lag), deadlocks, and lock waits.
Decision: If history list length is growing fast during DDL + backfill, slow the backfill and shorten transactions.
Task 3 (MySQL): check whether your ALTER can be INSTANT/INPLACE
cr0x@server:~$ mysql -uroot -p -e "ALTER TABLE orders ADD COLUMN promo_code varchar(32), ALGORITHM=INSTANT, LOCK=NONE;"
ERROR 1845 (0A000) at line 1: ALGORITHM=INSTANT is not supported for this operation. Try ALGORITHM=INPLACE.
Meaning: MySQL refused INSTANT; it might still be INPLACE, or it might need COPY (table rebuild).
Decision: Do not “let it pick.” Retry with ALGORITHM=INPLACE, LOCK=NONE and accept failure if it can’t comply—then move to online schema change tooling.
Task 4 (MySQL): verify replica lag and whether SQL thread is stuck on DDL
cr0x@server:~$ mysql -uroot -p -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_SQL_Running_State|Slave_IO_Running|Slave_SQL_Running"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Seconds_Behind_Master: 187
Meaning: Replica is lagging; SQL thread apply is waiting, often behind a large transaction or DDL serialization.
Decision: If failover is part of your safety plan, stop the migration or throttle the writer until lag recovers.
Task 5 (MySQL): check table size before a potentially rebuilding ALTER
cr0x@server:~$ mysql -uroot -p -e "SELECT table_name, ROUND((data_length+index_length)/1024/1024/1024,2) AS size_gb FROM information_schema.tables WHERE table_schema='prod' AND table_name='orders';"
+------------+---------+
| table_name | size_gb |
+------------+---------+
| orders     |  612.84 |
+------------+---------+
Meaning: You’re about to touch a ~613 GB object. Rebuilds will take time and space, and will hammer I/O.
Decision: Treat this as a project: consider gh-ost/pt-osc, ensure disk headroom, and coordinate a throttle plan.
Task 6 (PostgreSQL): see active queries and whether your DDL is waiting
cr0x@server:~$ psql -d prod -c "SELECT pid, state, wait_event_type, wait_event, now()-query_start AS age, left(query,80) AS query FROM pg_stat_activity WHERE datname='prod' ORDER BY age DESC LIMIT 10;"
pid | state | wait_event_type | wait_event | age | query
-------+--------+-----------------+--------------------+---------+--------------------------------------------------------------------------------
81231 | active | Lock | relation | 00:02:01| ALTER TABLE orders ALTER COLUMN total TYPE numeric(12,2)
80910 | active | | | 00:18:22| SELECT * FROM orders WHERE created_at > now() - interval '30 days'
Meaning: The ALTER is waiting on a lock; another query is active long enough to be suspicious.
Decision: Find blockers, then either cancel blockers, or cancel the migration and reschedule with a safer strategy (dual-write, backfill, swap).
Task 7 (PostgreSQL): find the blocking chain
cr0x@server:~$ psql -d prod -c "SELECT blocked.pid AS blocked_pid, blocking.pid AS blocking_pid, blocked.query AS blocked_query, blocking.query AS blocking_query FROM pg_locks blocked JOIN pg_stat_activity blocked_activity ON blocked.pid = blocked_activity.pid JOIN pg_locks blocking ON blocking.locktype = blocked.locktype AND blocking.database IS NOT DISTINCT FROM blocked.database AND blocking.relation IS NOT DISTINCT FROM blocked.relation AND blocking.page IS NOT DISTINCT FROM blocked.page AND blocking.tuple IS NOT DISTINCT FROM blocked.tuple AND blocking.virtualxid IS NOT DISTINCT FROM blocked.virtualxid AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid AND blocking.classid IS NOT DISTINCT FROM blocked.classid AND blocking.objid IS NOT DISTINCT FROM blocked.objid AND blocking.objsubid IS NOT DISTINCT FROM blocked.objsubid AND blocking.pid != blocked.pid JOIN pg_stat_activity blocking_activity ON blocking.pid = blocking_activity.pid WHERE NOT blocked.granted AND blocking.granted LIMIT 5;"
blocked_pid | blocking_pid | blocked_query | blocking_query
------------+--------------+--------------------------------------------------+-----------------------------------------------
81231 | 80910 | ALTER TABLE orders ALTER COLUMN total TYPE ... | SELECT * FROM orders WHERE created_at > ...
Meaning: You have a direct blocker. Often it’s a read query that you assumed “doesn’t block schema changes.” It does.
Decision: Cancel the blocking query if it’s safe (pg_cancel_backend), or terminate it if it’s stuck (pg_terminate_backend)—but understand the application impact.
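If that join is more than you want to type during an incident, pg_blocking_pids() (available since PostgreSQL 9.6) answers the same question more directly:

SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
-- blocked_by lists the PIDs holding the locks each waiting backend needs.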
Task 8 (PostgreSQL): estimate table size and whether rewrite is scary
cr0x@server:~$ psql -d prod -c "SELECT pg_size_pretty(pg_total_relation_size('public.orders')) AS total_size, pg_size_pretty(pg_relation_size('public.orders')) AS heap_size;"
total_size | heap_size
------------+----------
835 GB | 612 GB
Meaning: Rewriting the heap could mean hundreds of GB of new writes, plus WAL, plus free space requirements.
Decision: Don’t attempt a rewrite during peak. Use an expand/backfill/contract approach unless you can tolerate long exclusive locks and heavy I/O.
Task 9 (PostgreSQL): check if adding a default will rewrite in your version
cr0x@server:~$ psql -d prod -c "SHOW server_version;"
server_version
----------------
14.10
Meaning: On v11+, constant default optimization exists; on older versions, ADD COLUMN DEFAULT could rewrite the table.
Decision: On older Postgres, avoid adding a default directly; add nullable column, backfill in batches, then set default for future rows.
Task 10 (PostgreSQL): create an index without blocking writes
cr0x@server:~$ psql -d prod -c "CREATE INDEX CONCURRENTLY idx_orders_created_at ON public.orders(created_at);"
CREATE INDEX
Meaning: The index was built without taking the heaviest lock. It likely took longer than a normal create.
Decision: Prefer this in production. Monitor progress and be ready to retry if it fails due to conflicts or cancellations.
Task 11 (PostgreSQL): watch progress of index builds
cr0x@server:~$ psql -d prod -c "SELECT pid, relid::regclass AS table_name, index_relid::regclass AS index_name, phase, blocks_done, blocks_total FROM pg_stat_progress_create_index;"
pid | table_name | index_name | phase | blocks_done | blocks_total
-------+------------+---------------------------+----------+-------------+-------------
82110 | orders | idx_orders_created_at | building | 120331 | 612840
Meaning: Progress view shows where you are; blocks_done/blocks_total gives rough completion.
Decision: If it’s too slow or starving the system, consider throttling at the infrastructure layer (I/O limits) or rescheduling.
Task 12 (PostgreSQL): measure replication replay lag during heavy DDL/backfill
cr0x@server:~$ psql -d prod -c "SELECT application_name, state, pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS byte_lag FROM pg_stat_replication;"
application_name | state | byte_lag
------------------+-----------+----------
standby-a | streaming | 3.1 GB
Meaning: Your standby is multiple GB behind in WAL replay terms; that may be minutes, it may be hours, depending on throughput.
Decision: If you need a hot failover, stop or slow the operation until lag is acceptable.
Task 13 (MySQL): use pt-online-schema-change as a safer alternative
cr0x@server:~$ pt-online-schema-change --alter "ADD COLUMN promo_code varchar(32)" D=prod,t=orders --execute
No slaves found. See --recursion-method if host has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Creating new table...
Created new table prod._orders_new OK.
Altering new table...
Altered `prod`.`_orders_new` OK.
Creating triggers...
Created triggers OK.
Copying rows...
Copying `prod`.`orders`: 12% 00:03 remain
Meaning: Tool is creating a shadow table, copying rows, and using triggers to keep it in sync. This avoids long blocking on the original table at the cost of extra load and complexity.
Decision: Use when direct ALTER risks blocking or rebuilds. Make sure you understand triggers, replication, and cutover behavior.
Task 14 (MySQL): detect MDL contention from performance_schema
cr0x@server:~$ mysql -uroot -p -e "SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID FROM performance_schema.metadata_locks WHERE OBJECT_SCHEMA='prod' AND OBJECT_NAME='orders' LIMIT 5;"
+---------------+-------------+-------------+-------------+-----------------+
| OBJECT_SCHEMA | OBJECT_NAME | LOCK_TYPE   | LOCK_STATUS | OWNER_THREAD_ID |
+---------------+-------------+-------------+-------------+-----------------+
| prod          | orders      | SHARED_READ | GRANTED     |           88121 |
| prod          | orders      | EXCLUSIVE   | PENDING     |           88210 |
+---------------+-------------+-------------+-------------+-----------------+
Meaning: A pending exclusive lock (your DDL) is stuck behind granted shared locks (reads).
Decision: Kill or drain long readers, enforce max execution time, or move schema changes to a dedicated maintenance window.
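One guardrail for the “enforce max execution time” option (MySQL 5.7+; the thresholds are assumptions, and it only applies to SELECT statements):

-- Server-wide cap on read queries, in milliseconds (0 = unlimited).
SET GLOBAL max_execution_time = 60000;
-- Or per statement, for reports that are allowed to run longer:
SELECT /*+ MAX_EXECUTION_TIME(300000) */ *
FROM orders
WHERE created_at > NOW() - INTERVAL 30 DAY;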
Three corporate mini-stories from the trenches
Mini-story 1: the incident caused by a wrong assumption
At a mid-sized marketplace company, the team scheduled an “online” index creation on MySQL. The change was blessed by a senior dev who remembered that InnoDB can build indexes without blocking writes. Everyone went home.
Ten minutes later, the API error rate spiked. Latency went vertical. The MySQL CPU was fine. Disk wasn’t even that busy. The graphs looked like a networking incident, which is how you know it’s a database incident.
The culprit was metadata lock contention. A long-lived reporting query held a shared metadata lock on the table. The ALTER TABLE ... ADD INDEX waited for an exclusive metadata lock. Once it was waiting, incoming requests queued behind it: granting their metadata lock requests would have starved the pending exclusive lock, so MySQL made them wait too. The table didn’t just slow down; it effectively stopped accepting new work.
The wrong assumption was “reads don’t block schema changes.” In MySQL, they can. The fix wasn’t heroic: they killed the reporting session, the DDL ran, traffic recovered. The follow-up was the real win: timeouts for reporting queries, offloading BI to replicas with guardrails, and a preflight step that checks for long-running queries on target tables.
Mini-story 2: the optimization that backfired
A fintech team on PostgreSQL wanted to reduce query latency by changing a money column from numeric to bigint cents. It’s a sound optimization: integer arithmetic is cheaper, indexes are smaller, and you avoid weird decimal edge cases.
They tried the direct path: ALTER TABLE ... ALTER COLUMN ... TYPE bigint USING .... On staging, it completed quickly. In production, it took an ACCESS EXCLUSIVE lock and started rewriting a large hot table.
The app didn’t go down immediately. It stalled. Connections piled up. Because the change was in a transaction (as Postgres encourages), the lock persisted until commit. Meanwhile, WAL volume jumped, replicas lagged, and the team lost their safety net: they could no longer fail over without losing a chunk of acknowledged writes.
The “optimization” was correct in principle and disastrous in execution. The eventual fix was a phased migration: add a new column, dual-write at the application layer, backfill in small batches, validate, switch reads, then drop the old column later. The performance benefit still arrived—just not as a single heroic ALTER.
Mini-story 3: the boring but correct practice that saved the day
An enterprise SaaS team had a strict migration runbook: every schema change required a preflight, a canary on a replica, explicit lock/algorithm settings (for MySQL) or staged constraints (for Postgres), and a rollback plan that didn’t involve restoring from backup.
They needed to add a foreign key and make a column NOT NULL on PostgreSQL. That’s exactly the kind of change that can trigger a lock party. Instead of doing it directly, they added the foreign key as NOT VALID, fixed orphaned rows with a controlled cleanup job, then validated during a low-traffic period.
For the NOT NULL change, they first added a CHECK (col IS NOT NULL) NOT VALID, backfilled nulls in batches, validated the check, and only then flipped to SET NOT NULL in a quick operation.
Nothing exciting happened. No incident. No applause. But the best migrations are the ones nobody hears about, including you.
Common mistakes: symptom → root cause → fix
1) Symptom: MySQL traffic suddenly stalls; CPU is low; many connections “waiting”
Root cause: MDL pileup. An ALTER is waiting for metadata lock; new queries queue behind it.
Fix: Identify blockers, kill/limit long-running readers, and run DDL with ALGORITHM/LOCK clauses. Use online schema change tools for hot tables.
2) Symptom: PostgreSQL ALTER TABLE appears “stuck” and blocks everything on the table
Root cause: ACCESS EXCLUSIVE lock waiting on a long-running transaction or query.
Fix: Find blocker with pg_locks/pg_stat_activity, cancel it, or reschedule. For type changes, avoid direct ALTER; use expand/backfill/contract.
3) Symptom: Disk usage spikes during Postgres migration, then replicas fall behind
Root cause: Table rewrite generates heavy heap writes and WAL; standby replay can’t keep up.
Fix: Staged migration, throttle backfill, monitor WAL lag, ensure disk headroom, and consider temporarily relaxing failover expectations.
4) Symptom: MySQL replica lag grows after “online” index creation
Root cause: DDL and/or huge transactions serialize replication apply; index build competes with apply thread.
Fix: Monitor Seconds_Behind_Master and SQL state. Use throttled online schema change tooling; reduce write rate; schedule during low traffic.
5) Symptom: Postgres CONCURRENTLY index build fails mid-way
Root cause: CREATE INDEX CONCURRENTLY is more fragile: it can be cancelled, hit a deadlock, or trip over a uniqueness violation, and a failed run leaves an invalid index behind.
Fix: Check for invalid indexes and drop/recreate. Run with operational supervision and timeouts; avoid doing it during unstable periods.
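Finding the leftovers is a quick catalog query:

-- A failed CREATE INDEX CONCURRENTLY leaves the index behind, marked invalid.
SELECT c.relname AS index_name, t.relname AS table_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
WHERE NOT i.indisvalid;
-- Drop it (DROP INDEX CONCURRENTLY) and retry the build under supervision.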
6) Symptom: “ALTER succeeded” but application errors keep happening
Root cause: Deploy choreography mismatch: app expects column not yet present / already removed; ORMs caching schema; prepared statements.
Fix: Two-phase deploys: expand schema first, deploy code that uses it optionally, then contract later. Keep backward compatibility between app versions.
7) Symptom: Migration causes severe query slowdown weeks later
Root cause: Postgres bloat/vacuum debt or InnoDB change buffer/purge debt from large backfills.
Fix: Batch backfills, tune vacuum, monitor bloat and history list length, and schedule post-migration maintenance.
Checklists / step-by-step plan
Plan A: expand → backfill → switch → contract (works on both)
- Expand: add new columns/tables/indexes in a way that is safe online (nullable columns, concurrent indexes, shadow-table tooling).
- Dual-write: if changing semantics (type change, denormalization), have the app write both old and new fields.
- Backfill: fill historical rows in batches with small transactions; sleep between batches; monitor lag and locks.
- Validate: constraints in Postgres using NOT VALID then VALIDATE; in MySQL, validate via queries and/or application checks.
- Switch reads: flip application reads to the new field behind a feature flag; watch error rates.
- Contract: drop old columns/indexes after a full deploy cycle and after confirming no readers remain.
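The backfill step above, sketched minimally (Postgres syntax; batch size, pause, and column names are assumptions to tune against your latency and lag budgets):

-- One batch: bounded range, small transaction, commits immediately.
UPDATE orders SET promo_code = upper(promo_code_raw)
WHERE id >= 1 AND id < 10001 AND promo_code IS NULL;
-- Pause so vacuum/purge and replicas can keep up, then advance the range:
SELECT pg_sleep(2);
-- Drive the ranges from a script and stop automatically if replica lag
-- or lock waits cross your threshold.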
MySQL-specific checklist
- Always specify ALGORITHM and LOCK. If MySQL can’t comply, let it error.
- Before running DDL, check for long-running queries on the table and kill or drain them.
- Ensure disk headroom for rebuilds even if you “expect inplace.” Expect to be wrong at least once.
- Decide your replication posture: can you tolerate lag? If not, throttle or use tooling with replica-lag checks.
- Prefer gh-ost/pt-osc for large hot tables where MDL risk is unacceptable.
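A preflight probe for the “check for long-running queries” item above (the 60-second threshold is an assumption):

-- Anything long-lived here can hold a shared MDL and stall your ALTER.
SELECT id, user, time AS seconds, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE db = 'prod'
  AND command NOT IN ('Sleep', 'Binlog Dump')
  AND time > 60
ORDER BY time DESC;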
PostgreSQL-specific checklist
- Assume type changes and some constraint operations will require strong locks or rewrites.
- Use CREATE INDEX CONCURRENTLY for production writes; monitor progress and plan for failure recovery.
- Use NOT VALID for foreign keys/check constraints, then validate later.
- Watch WAL volume and standby replay lag; treat it as part of availability.
- Keep transactions short during backfills; long transactions sabotage vacuum and lock management.
“Do we run it now?” decision gate
- Green: operation is metadata-only or concurrent/instant, table size manageable, no long transactions, replica lag low.
- Yellow: operation is heavy but throttleable; you have a backout plan; you can accept temporary lag.
- Red: requires table rewrite + strong locks during peak; replication is already laggy; disk headroom is tight; you don’t have a safe rollback path.
FAQ
1) Is PostgreSQL better than MySQL for schema changes?
Postgres is often more predictable: you can usually reason from lock modes and whether an operation rewrites. MySQL can be smoother for some “instant” operations, but MDL contention can create sudden outages.
2) What’s the single most dangerous MySQL DDL behavior?
Metadata lock pileups. A waiting DDL can cause new queries to queue behind it, turning a mild issue into a total stall.
3) What’s the single most dangerous PostgreSQL DDL behavior?
Accidentally triggering a table rewrite under an ACCESS EXCLUSIVE lock on a hot table. It blocks and it writes a lot—worst of both worlds.
4) Can I safely add a NOT NULL constraint in Postgres without downtime?
Often yes, with a staged approach: add a CHECK (...) NOT VALID, backfill, validate, then SET NOT NULL. The final step can be quick if the data is already compliant.
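A sketch of that staged sequence, assuming a hypothetical orders.email column:

-- Step 1: brief lock, no scan of existing rows.
ALTER TABLE orders ADD CONSTRAINT orders_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: backfill the NULLs in batches.
-- Step 3: scans the table without blocking writes for the duration.
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null;
-- Step 4: on PostgreSQL 12+, SET NOT NULL sees the validated check and skips the scan.
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_email_not_null;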
5) Why did my Postgres migration “finish” but performance got worse later?
Backfills generate dead tuples and bloat; vacuum then has more work. Also, caches get churned by large sequential scans and writes. Plan for the aftershock.
6) Do online schema change tools in MySQL guarantee no impact?
No. They trade lock time for extra write load (triggers), longer runtimes, and more moving parts. They are safer for uptime, not magically free.
7) Should I run DDL inside a transaction in Postgres?
It depends. Transactional DDL is powerful, but it can also hold locks longer than expected. Operations like CREATE INDEX CONCURRENTLY can’t run inside a transaction block anyway. Use transactions when rollback safety matters and lock time is controlled.
8) What’s the safest way to change a column type on either database?
Expand/backfill/contract. Add new column with desired type, dual-write, backfill in batches, switch reads, then drop the old column later.
9) How do I stop an ALTER that’s causing trouble?
MySQL: find the session and KILL it, but be aware the server may still be cleaning up. Postgres: pg_cancel_backend first; pg_terminate_backend if needed. Always confirm what you’re killing.
10) Which database is “less nightmare” overall?
If you value introspection and staged correctness primitives, Postgres tends to be calmer. If you rely on “instant” metadata changes and external tooling, MySQL can be excellent—until MDL turns your day into interpretive dance.
Next steps you can do this week
- Inventory your top 10 largest tables and label which ones are “hot” (high QPS) vs “cold.” This alone changes how you plan DDL.
- Add preflight checks to migrations: block if long-running queries exist on the target table (MySQL MDL risk) or if blockers exist (Postgres lock chain).
- Standardize a migration pattern: expand/backfill/contract for risky changes, and require explicit “online” settings (ALGORITHM/LOCK or CONCURRENTLY).
- Define replication SLOs during maintenance: how much lag is acceptable, and what action triggers a pause.
- Practice the abort path: rehearse how to cancel/kill safely, how to detect cleanup, and how to restore service if your change stalls.
When someone asks “MySQL vs PostgreSQL: who makes ALTER TABLE a nightmare,” the honest answer is: the one you treat casually. The second-most honest answer is: MySQL will surprise you with locks you didn’t see; Postgres will surprise you with work you didn’t budget. Choose your surprises—then design them out.