Replication lag is the silent tax on “scale.” You add a read replica, dashboards look calmer, and then the CEO’s report shows yesterday’s numbers. Or worse: your app reads from a replica, sees stale state, and does something enthusiastic and irreversible.
Lag is rarely “a database problem.” It’s a queueing problem with sharp edges: CPU, I/O, lock contention, network, configuration, schema, and workload shape all take turns being the villain. The trick is finding which one is currently holding the knife—fast—and applying a fix that doesn’t boomerang next week.
What replication lag really is (and what it isn’t)
Replication lag is a gap between commit on the primary and visibility on the replica. That sounds obvious until you realize there are at least three “gaps” hiding under the same word:
- Transport lag: primary generated changes, but the replica hasn’t received them yet (network, throttling, burst).
- Replay/apply lag: replica received changes but hasn’t applied them (CPU, I/O, contention, single-threaded apply, conflict).
- Visibility lag: changes are applied but not visible to a particular query/session due to snapshot rules (common with long-running queries, especially in Postgres).
If you treat these as one metric, you’ll “fix” the wrong thing. Lag is always a queue. Queues have two cures: increase service rate, or reduce arrival rate. Everything else is decoration.
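A back-of-the-envelope that makes the queue framing concrete (numbers are purely illustrative):
catch_up_time ≈ backlog / (apply_rate - arrival_rate)
e.g. 6 GB of backlog, 60 MB/s apply rate, 40 MB/s still arriving: 6000 MB / (60 - 40) MB/s = 300 seconds.
If arrivals meet or exceed the apply rate, the denominator hits zero and the honest answer is “never, until the workload changes.”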
Two definitions you should enforce internally
Operational lag: “How long until a committed write on the primary becomes queryable on the replica?” This is what apps care about.
Replication pipeline health: “How close is the replica to ingesting and replaying the primary’s change stream?” This is what SREs watch.
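One blunt way to measure operational lag, rather than trusting built-in estimates: write a heartbeat timestamp on the primary on a schedule and diff it on the replica. A minimal MySQL sketch, assuming a dedicated ops.heartbeat table and hosts named primary and replica-1 (all illustrative; pt-heartbeat is the polished version of the same idea, and clocks must be NTP-synced for the number to mean anything):
cr0x@server:~$ mysql -h primary -e "CREATE TABLE IF NOT EXISTS ops.heartbeat (id INT PRIMARY KEY, ts TIMESTAMP(6) NOT NULL);"
cr0x@server:~$ mysql -h primary -e "REPLACE INTO ops.heartbeat VALUES (1, NOW(6));"
cr0x@server:~$ mysql -h replica-1 -e "SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1000000 AS lag_seconds FROM ops.heartbeat WHERE id = 1;"
+-------------+
| lag_seconds |
+-------------+
|    187.2241 |
+-------------+
Run the REPLACE every second from cron or a systemd timer. The same trick works on Postgres with a one-row table and now(), and it measures the thing applications actually feel: end-to-end visibility.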
Interesting facts and historical context (because history is how we stop repeating it)
- MySQL replication started as statement-based, which made “deterministic” a prayer, not a guarantee. Row-based logging became the sane default later.
- PostgreSQL’s built-in streaming replication (physical WAL) landed in the 9.x era and changed “standby” from “restore from archive” to “near-real-time.”
- MySQL’s semi-synchronous replication was introduced to reduce data loss risk, but it can trade latency for durability—your SLOs will feel it.
- GTID in MySQL made failover and topology changes less error-prone, but it also made it easier to build dangerously confident automation.
- Postgres hot standby enabled read-only queries on replicas, but it also introduced query conflict behavior that looks like “random cancellations” if you don’t plan for it.
- Logical replication in Postgres arrived later than physical streaming replication and is not “physical but better”; it’s a different tool with different failure modes.
- MySQL parallel replication evolved over versions: early implementations were limited, later ones got better at parallelizing apply. Your version matters.
- Replica lag metrics lie by omission: “seconds behind master” in MySQL is not a universal truth; it’s a timestamp-based estimate with blind spots.
One quote worth keeping on your wall, because it prevents heroic nonsense during incidents:
“Hope is not a strategy.” — often attributed to Vince Lombardi, and a staple of SRE folklore either way.
Replication lag is where hope goes to die. Good. Now you can engineer.
How MySQL replication creates lag (and how it clears it)
Classic MySQL replication is deceptively simple: the primary writes binlogs, replicas fetch them and apply them. The devil is in how they apply them.
The MySQL replication pipeline (practical view)
- Primary generates binlog events (statement or row based; in production you usually want row-based).
- I/O thread on replica reads binlog events from the primary and writes them to relay logs.
- SQL thread(s) apply relay log events to the replica’s data.
Lag happens when relay logs grow faster than the SQL thread(s) can apply. That’s it. Everything else is a reason why that happened.
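A quick way to see which side of that queue is growing, assuming classic binlog-position replication (with GTID the idea is the same); the file names and positions below are illustrative:
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Master_Log_File|Read_Master_Log_Pos|Relay_Master_Log_File|Exec_Master_Log_Pos"
              Master_Log_File: mysql-bin.000412
          Read_Master_Log_Pos: 734003200
        Relay_Master_Log_File: mysql-bin.000409
          Exec_Master_Log_Pos: 52428800
The I/O thread has read up to file .000412, but apply has only executed through .000409: transport is keeping up and the SQL thread is the queue. If Read_Master_Log_Pos itself stalls while the primary keeps writing binlogs, look at the network and the I/O thread instead.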
Common MySQL lag shapes
- Single big transaction: replica looks “stuck” then suddenly catches up. Your binlog is fine; apply is busy chewing.
- Many small transactions: replica slowly drifts behind during peak and never recovers. You need more apply throughput or fewer writes.
- Lock contention on replica: read queries on replica block apply (or vice versa), producing a sawtooth lag graph.
One dry truth: in MySQL, the replica is not your passive mirror. It’s a second database server doing real work, competing for CPU, I/O, and buffer pool.
How PostgreSQL replication creates lag (and how it clears it)
PostgreSQL streaming replication is WAL shipping over a persistent connection. Replicas (standbys) receive WAL records and replay them. With hot standby, they also serve read-only queries.
The Postgres replication pipeline (physical streaming)
- Primary writes WAL records for changes.
- WAL sender streams WAL to replicas.
- WAL receiver on replica writes received WAL to disk.
- Startup/replay process replays WAL to make data files consistent with the primary’s timeline.
Lag happens when the replay process can’t keep up, or when it’s forced to pause/slow due to conflicts or resource constraints.
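You can also watch this pipeline from the primary’s side: pg_stat_replication breaks lag into write, flush, and replay stages per standby (the lag columns exist from PostgreSQL 10 onward). Illustrative output:
cr0x@server:~$ psql -x -c "SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
-[ RECORD 1 ]----+-----------------
application_name | replica_1
state            | streaming
sent_lsn         | 3A/9F1200A0
write_lsn        | 3A/9F1200A0
flush_lsn        | 3A/9F1200A0
replay_lsn       | 3A/9E8803D8
write_lag        | 00:00:00.000912
flush_lag        | 00:00:00.001644
replay_lag       | 00:03:18.211004
If write_lag and flush_lag stay small while replay_lag grows, transport is healthy and the standby’s replay is the queue, which is the same conclusion as the receive-vs-replay check in Task 6 below.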
Postgres-specific lag realities
- Replay can be blocked by recovery conflicts (hot standby): depending on settings, long-running standby queries either get canceled or force replay to wait; with hot_standby_feedback enabled they instead hold back vacuum cleanup on the primary.
- WAL volume spikes can be caused by bulk updates, index rebuilds, or full-table rewrites; this is more “physics” than “tuning.”
- Replication slots prevent WAL removal; if a replica is down or slow, WAL accumulates on primary until your disk panics.
First joke (of two): replication lag is like a corporate meeting—if one person insists on talking forever, everyone else gets behind schedule.
Fast diagnosis playbook
This is the order that finds the bottleneck fastest in real systems. Don’t freestyle it. Your gut is not observable.
First: classify the lag type
- Transport lag? Primary is generating changes, replica isn’t receiving fast enough.
- Apply/replay lag? Replica has the data (relay/WAL received), but replay is behind.
- Visibility lag? It’s applied, but queries still see old snapshots (long transactions, repeatable read semantics, etc.).
Second: decide where the queue is growing
- MySQL: relay log size, SQL thread state, replication worker states.
- Postgres: received LSN vs replay LSN, replay delay, conflicts, WAL receiver status.
Third: find the resource constraint
- I/O bound: high await, low IOPS headroom, dirty page flushing, WAL/fsync pressure.
- CPU bound: high CPU, single core saturated (often apply thread), compression/decompression overhead.
- Lock/contended: apply waiting on locks, DDL, or hot standby conflicts.
- Network bound: packet loss, low throughput, mis-sized TCP buffers, cross-region reality.
Fourth: choose the least dangerous lever
- Prefer workload changes (batch size, transaction size, indexes, query patterns) over “magic knobs.”
- Prefer adding parallelism where safe (MySQL parallel apply, more IO capacity) over “turn off durability.”
- Prefer moving reads away from the replica during catch-up (feature flags, circuit breakers) over letting replicas drown.
Practical tasks: commands, outputs, decisions
These are real, runnable checks you can do during an incident. Each includes what the output means and what decision to make. Do them in this order unless you have hard evidence otherwise.
Task 1 (MySQL): Check basic replica status and identify the bottleneck thread
cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last_SQL_Error|Last_IO_Error|Relay_Log_Space"
Seconds_Behind_Master: 187
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Relay_Log_Space: 4294967296
Last_SQL_Error:
Last_IO_Error:
Meaning: I/O thread is healthy (transport likely fine). SQL thread is running but blocked on commit dependency; relay logs are huge. Apply is the queue.
Decision: Inspect parallel replication settings and worker status. If dependency waits persist, look for big transactions or commit order bottlenecks. Consider scaling IO/CPU on replica and reducing long-running reads.
Task 2 (MySQL): Inspect replication workers (parallel apply health)
cr0x@server:~$ mysql -e "SELECT WORKER_ID, THREAD_ID, SERVICE_STATE, LAST_APPLIED_TRANSACTION, LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP, APPLYING_TRANSACTION FROM performance_schema.replication_applier_status_by_worker\G"
*************************** 1. row ***************************
WORKER_ID: 1
THREAD_ID: 52
SERVICE_STATE: ON
LAST_APPLIED_TRANSACTION: 0f3a2b1c-11aa-11ee-9d5b-0800272b3f7a:912334
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 2025-12-31 11:58:01.123456
APPLYING_TRANSACTION: 0f3a2b1c-11aa-11ee-9d5b-0800272b3f7a:912390
*************************** 2. row ***************************
WORKER_ID: 2
THREAD_ID: 53
SERVICE_STATE: ON
LAST_APPLIED_TRANSACTION: 0f3a2b1c-11aa-11ee-9d5b-0800272b3f7a:912333
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 2025-12-31 11:58:01.120000
APPLYING_TRANSACTION:
Meaning: Some workers are idle while one is applying. Parallelism exists but isn’t effectively used (workload not parallelizable or blocked by commit order).
Decision: If workload is single-hot-row or single-schema-object, parallel apply won’t save you. Focus on schema/workload changes (shard hot spots, reduce contention) or accept lag and route reads differently.
Task 3 (MySQL): Identify the SQL thread’s current wait (locks vs I/O vs commit)
cr0x@server:~$ mysql -e "SHOW PROCESSLIST" | awk 'NR==1 || $4 ~ /Connect|Binlog Dump|SQL/ {print}'
Id User Host db Command Time State Info
14 system user replication Connect 187 Waiting for relay log NULL
15 system user replication Connect 187 Waiting for dependent transaction to commit NULL
102 app_ro 10.0.2.55:61234 prod Query 32 Sending data SELECT ...
Meaning: Replica SQL apply is waiting on commit dependencies; one read query is running for 32s and may be causing buffer churn or lock pressure.
Decision: If read queries are heavy, throttle/kill them during catch-up, or move them to a dedicated analytics replica. If commit dependency waits dominate, look for group commit limits and parallel apply tuning.
Task 4 (MySQL): Confirm binlog format and row image (WAL/binlog volume control)
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('binlog_format','binlog_row_image','sync_binlog','innodb_flush_log_at_trx_commit')"
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| binlog_format                  | ROW   |
| binlog_row_image               | FULL  |
| innodb_flush_log_at_trx_commit | 1     |
| sync_binlog                    | 1     |
+--------------------------------+-------+
Meaning: Durable settings (good) but potentially heavy binlog volume if FULL row image on wide tables.
Decision: Consider binlog_row_image=MINIMAL only if you’ve validated tooling and replication safety for your workload; otherwise optimize schema and transaction size first.
Task 5 (MySQL): Check InnoDB flush pressure on the replica
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; SHOW GLOBAL STATUS LIKE 'Innodb_data_pending_fsyncs'; SHOW GLOBAL STATUS LIKE 'Innodb_os_log_pending_fsyncs';"
+--------------------------------+--------+
| Variable_name                  | Value  |
+--------------------------------+--------+
| Innodb_buffer_pool_pages_dirty | 184223 |
+--------------------------------+--------+
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Innodb_data_pending_fsyncs | 67    |
+----------------------------+-------+
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_os_log_pending_fsyncs | 29    |
+------------------------------+-------+
Meaning: Dirty pages and pending fsyncs imply storage is the limiter; apply can’t safely flush fast enough.
Decision: Add IOPS (faster disks, better provisioned storage), tune flushing, or reduce write amplification (indexes, transaction batching). Don’t “fix” this by turning off fsync unless you like data-loss tabletop exercises.
Task 6 (Postgres): Check received vs replayed LSN (transport vs replay)
cr0x@server:~$ psql -x -c "SELECT now() AS ts, pg_last_wal_receive_lsn() AS receive_lsn, pg_last_wal_replay_lsn() AS replay_lsn, pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS bytes_pending;"
-[ RECORD 1 ]--+------------------------------
ts | 2025-12-31 12:00:12.1122+00
receive_lsn | 3A/9F1200A0
replay_lsn | 3A/9E8803D8
bytes_pending | 58982400
Meaning: Replica is receiving WAL but is ~56MB behind in replay. Transport is fine; replay is the queue.
Decision: Check CPU/I/O on the replica and hot standby conflicts. If bytes_pending grows during peak, you need more replay throughput or fewer WAL-heavy writes.
Task 7 (Postgres): Measure time-based lag as seen by the system
cr0x@server:~$ psql -x -c "SELECT now() AS ts, pg_last_xact_replay_timestamp() AS last_replay_ts, now() - pg_last_xact_replay_timestamp() AS replay_delay;"
-[ RECORD 1 ]---+------------------------------
ts | 2025-12-31 12:00:20.004+00
last_replay_ts | 2025-12-31 11:56:59.771+00
replay_delay | 00:03:20.233
Meaning: Approx 3m20s of replay delay. This is closer to “what applications feel” than bytes.
Decision: If your business can’t tolerate this, route critical reads to primary, implement read-your-writes, or invest in synchronous/quorum approaches (with eyes open about latency).
Task 8 (Postgres): Check WAL receiver status and any network symptoms
cr0x@server:~$ psql -x -c "SELECT status, receive_start_lsn, received_lsn, latest_end_lsn, latest_end_time, conninfo FROM pg_stat_wal_receiver;"
-[ RECORD 1 ]---+---------------------------------------------
status | streaming
receive_start_lsn | 3A/9B000000
received_lsn | 3A/9F1200A0
latest_end_lsn | 3A/9F1200A0
latest_end_time | 2025-12-31 12:00:18.882+00
conninfo | host=10.0.1.10 port=5432 user=replicator ...
Meaning: Streaming is healthy; latest_end_time is current. Transport is not the bottleneck.
Decision: Stop blaming “the network.” Look at replay constraints: storage, CPU, conflicts, and checkpoints.
Task 9 (Postgres): Detect hot standby conflicts and cancellations
cr0x@server:~$ psql -c "SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock FROM pg_stat_database_conflicts;"
  datname  | confl_snapshot | confl_lock | confl_bufferpin | confl_deadlock
-----------+----------------+------------+-----------------+----------------
 postgres  |              0 |          0 |               0 |              0
 prod      |            144 |          3 |              12 |              0
Meaning: Conflicts are real on prod; snapshot conflicts are common when read queries run long while replay needs to remove old row versions.
Decision: If you value fresh data more than long reads, tune to cancel queries sooner. If you value long reads, expect lag and build app logic accordingly. Trying to have both usually ends with pager noise.
Task 10 (Postgres): Identify long-running queries on the replica causing conflicts or bloat
cr0x@server:~$ psql -x -c "SELECT pid, usename, state, now() - xact_start AS xact_age, wait_event_type, wait_event, query FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start ASC LIMIT 5;"
-[ RECORD 1 ]----+---------------------------------------------
pid | 21844
usename | app_ro
state | idle in transaction
xact_age | 00:18:41.772
wait_event_type | Client
wait_event | ClientRead
query | SELECT ... FROM big_table JOIN ...
Meaning: An 18-minute transaction on a hot standby is a conflict magnet and a replay slow-down risk.
Decision: Kill it during incident response, then fix the pattern: enforce statement timeouts, move analytics queries off hot standby, or use a warehouse.
Task 11 (System): Check disk latency that makes apply/replay crawl
cr0x@server:~$ iostat -x 1 3
Linux 6.1.0 (db-replica-1) 12/31/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.11 0.00 7.88 28.42 0.00 51.59
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 120.0 980.0 6400.0 51200.0 97.4 22.8 21.8 0.9 99.2
Meaning: 99% utilization and 20ms await is not “fine.” Your replica is storage-bound; replay/apply will lag.
Decision: Scale the storage (better disks, more IOPS), reduce write amplification (indexes, batching), or move replica to separate volumes for WAL/relay logs vs data where applicable.
Task 12 (System): Validate network health (because sometimes it actually is the network)
cr0x@server:~$ mtr -r -c 20 10.0.1.10
Start: 2025-12-31T12:02:00+0000
HOST: db-replica-1 Loss% Snt Last Avg Best Wrst StDev
1.|-- 10.0.2.1 0.0% 20 0.4 0.5 0.3 1.2 0.2
2.|-- 10.0.10.5 0.0% 20 1.1 1.3 1.0 2.4 0.3
3.|-- 10.0.1.10 2.0% 20 2.2 2.4 2.0 6.8 1.1
Meaning: 2% loss and jitter spikes can absolutely create transport lag and retransmits, especially under sustained streams.
Decision: If Postgres/MySQL shows receive gaps, fix the path (NIC errors, oversubscribed link, noisy neighbor). Don’t tune the database to compensate for packet loss. That’s like lowering your standards in hiring because your chairs squeak.
Task 13 (Postgres): Check replication slots and WAL retention risk
cr0x@server:~$ psql -x -c "SELECT slot_name, slot_type, active, restart_lsn, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots;"
-[ RECORD 1 ]---+------------------------------
slot_name | replica_1
slot_type | physical
active | f
restart_lsn | 3A/12000000
retained_bytes | 107374182400
Meaning: Slot is inactive and retaining ~100GB of WAL. Your primary disk is now a countdown timer.
Decision: Decide whether that replica is coming back soon. If not, drop the slot (after confirming it won’t be needed) or bring the replica back and let it catch up.
Task 14 (MySQL): Check relay log growth rate and disk usage (avoid “replica died” as a side quest)
cr0x@server:~$ du -sh /var/lib/mysql/*relay* 2>/dev/null; df -h /var/lib/mysql
4.2G /var/lib/mysql/mysql-relay-bin.000123
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 400G 362G 18G 96% /var/lib/mysql
Meaning: Relay logs are big and disk is near full. If it hits 100%, MySQL will have a very bad day and take your incident with it.
Decision: Free space immediately (add volume, purge safely if possible, or temporarily stop heavy replication consumers). Then fix the root cause: apply throughput and disk sizing.
Task 15 (Postgres): See if checkpoints are too frequent (WAL pressure and I/O spikes)
cr0x@server:~$ psql -c "SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time, buffers_checkpoint FROM pg_stat_bgwriter;"
 checkpoints_timed | checkpoints_req | checkpoint_write_time | checkpoint_sync_time | buffers_checkpoint
-------------------+-----------------+-----------------------+----------------------+--------------------
               112 |             389 |               9823412 |              4412231 |           18442201
Meaning: Requested checkpoints far outnumber timed ones, which suggests WAL volume or an undersized max_wal_size is forcing frequent checkpoints; the resulting I/O storms can starve replay.
Decision: Tune checkpoint settings and storage; reduce WAL burst sources (bulk updates, index rebuild cadence). Validate with measured I/O and replay behavior, not vibes.
That’s more than a dozen tasks. Use them like a checklist, not like a buffet.
Lag causes that actually matter (by subsystem)
1) Transaction shape: big commits, many commits, and the tyranny of “one hot row”
Replication is fundamentally serial in key places. Even when you have parallel apply, commit order and dependency tracking can force serialization. Your workload decides whether parallelism is real or cosmetic.
- Big transactions create bursty lag: the replica must apply a large chunk before “lag” improves.
- Many small transactions create constant overhead: fsync, lock acquisition, btree updates, and commit bookkeeping dominate.
- Hot rows (counters, last_seen, single account) create dependency chains that parallel apply can’t break.
What to do: batch writes intentionally, avoid per-row “chatty” updates, and redesign hotspots (append-only events, sharded counters, periodic rollups).
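A minimal sketch of the sharded-counter idea in MySQL terms; the table, columns, and shard count are hypothetical:
-- The hot-row version: every increment depends on the previous one,
-- so parallel apply on the replica degrades to a serial chain.
-- UPDATE page_views SET views = views + 1 WHERE page_id = 42;
-- Sharded version: pre-create rows (page_id, shard 0..15, views 0),
-- then spread increments across independent rows.
UPDATE page_view_shards
   SET views = views + 1
 WHERE page_id = 42
   AND shard = CONNECTION_ID() % 16;
-- Reads sum a handful of rows, which stays cheap.
SELECT SUM(views) AS views
  FROM page_view_shards
 WHERE page_id = 42;
The point is not the exact modulo; it is that transactions now touch independent rows, so dependency tracking on the replica stops forcing them into a single file.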
2) Write amplification: indexes, foreign keys, and “helpful” constraints
Every write is not one write. It’s a small parade: table heap/page updates, index updates, WAL/binlog, page flushes, metadata, and sometimes extra reads to check constraints.
If replicas lag during heavy write periods, your first suspect should be write amplification, not replication knobs.
3) Storage physics: latency beats bandwidth
Replication apply/replay is a write-heavy workload with nasty sync points. Low latency matters more than impressive sequential throughput numbers.
- High await in iostat often correlates directly with lag growth.
- Cloud volumes with burst credits can “look fine” until they don’t, then lag climbs and never recovers.
- Mixed workloads (replication + analytics reads) can trash caches and cause random I/O amplification.
4) Read traffic on replicas: the fake “free lunch”
Serving reads from replicas sounds like free capacity. In practice it’s a resource competition problem:
- Heavy reads cause cache churn, forcing more disk reads for apply/replay.
- Long reads in Postgres hot standby can cause conflicts with replay.
- In MySQL, reads can compete for buffer pool and I/O; also metadata locks and DDL interactions can get spicy.
Rule: a replica intended for freshness should not also be your ad-hoc analytics playground.
5) Network: latency, loss, and cross-region fantasies
Cross-region replication is not a tuning exercise; it’s physics and probability. Loss causes retransmits; latency stretches feedback loops; jitter creates microbursts.
If you’re replicating over the public internet (or an overloaded VPN), you’re not running a database; you’re running a networking experiment.
MySQL-specific fixes and tradeoffs
Parallel replication: helpful, not magical
Modern MySQL supports parallel apply, but the degree depends on version and settings. The workload also must allow parallelism (multiple schemas, independent transactions).
Do: enable and validate parallel workers; monitor worker utilization.
Avoid: “Just crank workers to 64” without measuring lock waits and commit dependency chains. You’ll get overhead and the same lag.
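A quick way to see what you actually have configured, assuming MySQL 8.0.26 or newer (older versions use the slave_ prefixed names); the values shown are illustrative, not a recommendation:
cr0x@server:~$ mysql -e "SHOW VARIABLES WHERE Variable_name IN ('replica_parallel_type','replica_parallel_workers','replica_preserve_commit_order')"
+-------------------------------+---------------+
| Variable_name                 | Value         |
+-------------------------------+---------------+
| replica_parallel_type         | LOGICAL_CLOCK |
| replica_parallel_workers      | 4             |
| replica_preserve_commit_order | ON            |
+-------------------------------+---------------+
LOGICAL_CLOCK with a modest worker count is the usual starting point; raise workers only if the worker view from Task 2 shows them all busy, otherwise you are buying coordination overhead and the same lag.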
Binlog format and row image: control volume carefully
Row-based logging is operationally safer for most production workloads, but it can generate a lot of binlog data—especially with wide rows and FULL row images.
Do: reduce unnecessary updates, avoid updating unchanged columns, and keep rows narrow when it matters.
Consider: binlog_row_image=MINIMAL only with careful compatibility checks (downstream tooling, auditing, CDC, and your own debugging needs).
Replica-friendly schema and query patterns
- Keep secondary indexes intentional. Each extra index is extra apply work on every replica.
- Prefer idempotent writes and avoid read-modify-write loops that create hotspots.
- Batch background jobs and avoid “tiny commits forever.”
Durability knobs are not a performance plan
You can reduce fsync pressure with durability settings, but you are purchasing performance with data loss risk. Sometimes it’s acceptable in a cache-like store. For core data? No.
Second joke (and last): turning off durability to fix lag is like removing the smoke detector because it keeps waking you up.
PostgreSQL-specific fixes and tradeoffs
Hot standby conflicts: pick a side
On a hot standby, replay needs to apply changes that may remove row versions or lock resources. Long-running queries can conflict with that replay.
Postgres gives you choices, not miracles:
- Prioritize replay freshness: cancel conflicting queries sooner. Users see query failures, but data stays fresher.
- Prioritize read stability: allow long queries, accept replay delays. Users get results, but data goes stale.
What to do: set clear expectations for each replica role. “This replica is for dashboards, up to 5 minutes stale” is a valid product decision. Pretending it’s real-time is how incidents become meetings.
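The settings behind that choice, queried on the standby; the values shown here are the defaults, not advice:
cr0x@server:~$ psql -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('hot_standby_feedback','max_standby_streaming_delay','max_standby_archive_delay','statement_timeout') ORDER BY name;"
            name             | setting | unit
-----------------------------+---------+------
 hot_standby_feedback        | off     |
 max_standby_archive_delay   | 30000   | ms
 max_standby_streaming_delay | 30000   | ms
 statement_timeout           | 0       | ms
Shrinking max_standby_streaming_delay buys freshness at the price of more canceled queries; hot_standby_feedback shifts the cost to the primary as delayed vacuum and bloat. A per-role statement_timeout (set with ALTER ROLE on the primary, since the standby is read-only, for example ALTER ROLE app_ro SET statement_timeout = '5s') keeps long reads off a freshness-focused standby in the first place.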
WAL volume management: the unglamorous win
WAL is the bill for your write behavior. Reduce the bill:
- Avoid full-table rewrites during peak hours.
- Be careful with mass updates that touch many rows; often an INSERT-into-new-table + swap is operationally better, but it changes WAL patterns too.
- Consider autovacuum tuning and table maintenance to reduce bloat (which otherwise increases I/O during replay).
Replication slots: guardrails with teeth
Slots prevent WAL from being removed while a consumer still needs it. Great for logical replication and robust standbys. Also great at filling disks when a consumer disappears.
Operational rule: if you use slots, you must have alerts on retained WAL size and slot activity, and a runbook for “is it safe to drop this slot?”
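Since PostgreSQL 13 there is also a hard guardrail: max_slot_wal_keep_size caps how much WAL a slot may retain before the slot is invalidated, so a dead consumer costs you a replica rebuild instead of the primary’s disk. A sketch, with an arbitrary cap:
cr0x@server:~$ psql -c "ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';"
ALTER SYSTEM
cr0x@server:~$ psql -c "SELECT pg_reload_conf();"
 pg_reload_conf
----------------
 t
cr0x@server:~$ psql -c "SELECT slot_name, active, wal_status, safe_wal_size FROM pg_replication_slots;"
 slot_name | active | wal_status | safe_wal_size
-----------+--------+------------+---------------
 replica_1 | t      | reserved   |   96636764160
Alert when wal_status leaves reserved and when safe_wal_size shrinks toward zero; once a slot is invalidated (lost), its consumer has to be re-seeded.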
Synchronous replication: when you must, and what you pay
Synchronous replication can bound lag by making commits wait for replica acknowledgment. That’s not “free correctness”; it is a latency tax. Use it for a small set of critical data paths, or for regional HA where consistency is paramount.
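In Postgres terms, that looks roughly like this (the standby names and quorum rule are illustrative):
cr0x@server:~$ psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('synchronous_commit','synchronous_standby_names');"
           name            |           setting
---------------------------+------------------------------
 synchronous_commit        | on
 synchronous_standby_names | ANY 1 (replica_1, replica_2)
The ANY quorum syntax (PostgreSQL 10+) means one slow standby can’t freeze every commit. synchronous_commit = remote_apply goes furthest: the committing session waits until the standby has replayed the change, which bounds visibility lag for that session at the cost of write latency. MySQL’s semi-synchronous plugin occupies similar territory, but it waits for receipt of the change, not for apply.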
Common mistakes: symptom → root cause → fix
1) “Seconds behind master” is flat, but users still see stale reads
Symptom: MySQL reports low lag, but the app reads old data.
Root cause: You’re measuring the wrong thing. Timestamp-based lag can be misleading during idle periods, or your app uses read-after-write without stickiness.
Fix: Implement read-your-writes: route to primary after a write, or use session stickiness, or track GTID/LSN and wait until replica catches up.
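A sketch of the token approach for both engines. Hostnames, the GTID set, and the LSN below are illustrative, and in a real application this logic lives in the data access layer, not a shell:
cr0x@server:~$ mysql -h primary -e "SELECT @@global.gtid_executed;"
cr0x@server:~$ mysql -h replica-1 -e "SELECT WAIT_FOR_EXECUTED_GTID_SET('0f3a2b1c-11aa-11ee-9d5b-0800272b3f7a:1-912390', 1);"
cr0x@server:~$ psql -h primary -c "SELECT pg_current_wal_insert_lsn();"
cr0x@server:~$ psql -h replica-1 -c "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '3A/9F1200A0'::pg_lsn) >= 0 AS caught_up;"
Capture the executed GTID set (MySQL) or the insert LSN (Postgres) on the primary immediately after the write. On the replica, WAIT_FOR_EXECUTED_GTID_SET returns 0 once that set has been applied and 1 on timeout (here, one second); the Postgres comparison turns true once replay has passed the captured LSN. On timeout, serve the read from the primary instead of guessing.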
2) Lag grows steadily every day, then “randomly” resets after maintenance
Symptom: Replicas get slower over time; restarts or vacations “fix” it.
Root cause: Cache churn, bloat, growing indexes, or increasing write amplification. Sometimes it’s also storage burst credits running out.
Fix: Measure I/O latency and buffer behavior, track table/index growth, and plan storage with headroom. Treat bloat and indexing as capacity work, not a personal failing.
3) Postgres replica cancels queries, and dashboards scream
Symptom: “canceling statement due to conflict with recovery” appears; users complain.
Root cause: Hot standby conflict: long-running snapshot reads conflict with replay cleanup/locks.
Fix: Move long queries elsewhere; add statement_timeout on replica; tune hot standby conflict settings according to role. Don’t put BI workloads on a freshness-focused standby.
4) Postgres primary disk fills with WAL unexpectedly
Symptom: WAL directory grows fast; disk pressure; replication slot shows huge retained bytes.
Root cause: Inactive/slow consumer with a replication slot; WAL cannot be recycled.
Fix: Restore consumer or drop the slot after verification. Add monitoring and a policy for slot lifecycle.
5) MySQL replica is “running” but apply never catches up after peak
Symptom: I/O thread ok, SQL thread ok, lag rises and stays high.
Root cause: Apply throughput below sustained write rate: storage latency, insufficient CPU, too many indexes, single-thread apply limitations, or contention.
Fix: Increase replica resources (IOPS, CPU), tune parallel replication where meaningful, reduce write amplification, and split workloads across replicas with clear roles.
6) You scale replicas and lag gets worse
Symptom: Adding more replicas increases primary load and replica lag.
Root cause: Each replica adds replication sender overhead (network, WAL sender, binlog dump threads) and can increase fsync/log pressure depending on configuration.
Fix: Use cascading replication (where appropriate), ensure network is provisioned, and validate primary CPU and I/O headroom before adding replicas.
Three corporate mini-stories from the lag trenches
Mini-story #1: The incident caused by a wrong assumption (read replicas are “safe”)
A mid-sized SaaS company had a clean architecture slide: writes to the primary, reads to the replicas. They also had a feature called “instant export” that users could run right after changing settings.
Someone assumed the settings read could come from a replica because “it’s just configuration.” The export job wrote a “started” record on the primary, then read settings from a replica, then produced output based on what it found.
During a marketing campaign, writes spiked. Replication lag climbed from seconds to minutes. Exports started using old settings—wrong filters, wrong formats, wrong destinations. Nothing crashed. The system simply produced confident, incorrect results, which is the most expensive type of correct-looking failure.
The fix was not a tuning knob. They implemented read-your-writes: after updating settings, the user’s session was pinned to the primary for a short window. They also tagged replicas by role: “freshness” vs “analytics.” The export feature stopped pretending it could live on “cheap” reads.
Mini-story #2: The optimization that backfired (bigger batches, bigger pain)
A payments platform ran MySQL with replicas. A nightly job updated statuses in batches. Lag was annoying but manageable. Someone noticed the job spent too long committing and decided to “optimize” by increasing batch size dramatically.
The primary loved it at first: fewer commits, less overhead. Then replicas started lagging in ugly bursts. The apply threads would hit a massive transaction, spend a long time applying, and keep relay logs growing. During that window, replicas served increasingly stale reads.
Worse: when the big transaction finally committed on the replica, it caused a heavy flush wave. Storage latency spiked. Apply slowed further. The team’s graphs turned into a modern art exhibit.
The boring truth: they optimized the primary’s commit overhead and ignored the replica’s apply mechanics. The fix was to tune batch size to a measured sweet spot: small enough to keep apply smooth, big enough to avoid death by commit. They also added guardrails: maximum transaction size, plus a circuit breaker that paused the job when replica lag exceeded a threshold.
Mini-story #3: The boring but correct practice that saved the day (role-based replicas and runbooks)
A retail company ran Postgres with two standbys. One was designated “fresh standby” with aggressive cancellation of conflicting queries; the other was “reporting standby” where long reads were allowed and lag was expected.
During peak season, a developer ran an expensive report. It landed on the reporting standby as intended. That standby fell minutes behind, but nobody panicked because its SLO allowed it. The fresh standby stayed close to real time because it wasn’t hosting the report workload.
Then a deployment introduced a write-heavy migration. Replay on the fresh standby started to drift. On-call followed a runbook: verify receive vs replay LSN, check iostat, confirm no query conflicts, then temporarily route “stale-sensitive” reads to primary. They did not debate it in a channel for 45 minutes. They acted.
The business impact was minimal. Not because they had perfect replication, but because they had a defined replica purpose, measured expectations, and a practiced plan. Boring saved the day. Again.
Checklists / step-by-step plan
Checklist A: When lag spikes right now (incident response)
- Confirm scope: one replica or all? MySQL or Postgres? single region or cross-region?
- Classify lag: transport vs apply vs visibility (use the tasks above).
- Protect correctness: route critical reads to primary; enable read-your-writes paths.
- Reduce load on replica: stop analytics queries, pause background jobs, throttle batch writers.
- Check storage latency: if await is high and %util is pinned, fix infra or reduce writes.
- Look for one giant offender: big transaction, index rebuild, migration, or vacuum conflicts.
- Make the smallest safe change: temporary routing change, job pause, query kill, capacity add.
- Document the timeline: when lag started, what correlated (deploy, job, traffic, storage events).
Checklist B: Prevent lag from becoming your personality (engineering plan)
- Define replica roles: freshness replica vs reporting replica vs DR replica.
- Set lag budgets: acceptable seconds/minutes per role; communicate it to product.
- Instrument the pipeline:
- MySQL: relay log growth, worker states, apply rate.
- Postgres: LSN receive vs replay, replay timestamp delay, conflict counters, slot retention.
- Build app safety: read-your-writes, consistency tokens (GTID/LSN), fallback to primary.
- Control write amplification: index hygiene, avoid wide rows, avoid useless updates.
- Plan maintenance windows: bulk operations off peak; rate-limit migrations and backfills.
- Capacity with headroom: IOPS and CPU sized for peak + catch-up, not average.
- Practice failover: lag interacts with failover in unpleasant ways; rehearse it.
Checklist C: Schema and workload patterns that reduce lag by design
- Replace “update same row constantly” with append-only events + periodic compaction.
- Batch writes to a stable size; avoid both extremes (tiny commits, monster commits).
- Keep secondary indexes intentional; remove ones that are only “maybe useful someday.”
- On Postgres, avoid long transactions on hot standbys; they cost freshness or reliability.
FAQ
1) Is MySQL replication lag measurement reliable?
It’s useful, not gospel. Seconds_Behind_Master is an estimate based on timestamps and can be misleading during idle periods or with certain workloads. Measure relay log growth and apply states too.
2) Is Postgres replication lag easier to reason about?
Usually, yes. LSN-based metrics let you separate receive vs replay lag. Time-based lag via pg_last_xact_replay_timestamp() is also practical, though it’s NULL if no transactions replayed yet.
3) Should I run analytics on replicas?
Yes, but not on the same replica you rely on for freshness. Give analytics its own replica (or system). Otherwise you’re trading “fewer queries on primary” for “replication falls behind and correctness gets weird.”
4) Does adding more replicas reduce lag?
No. Replicas reduce read load on the primary (sometimes). Lag is about the replica’s ability to receive and apply. More replicas can increase primary overhead and network use.
5) What’s the fastest safe way to reduce user impact during lag?
Route stale-sensitive reads to primary and keep replicas for non-critical reads. Add read-your-writes behavior after writes. This is an application-level safety net that works regardless of database brand.
6) Can parallel replication “solve” lag in MySQL?
It can help a lot when transactions are independent. It won’t help much with dependency chains (hot rows, single-table contention, strict commit order). Treat it as a multiplier on a workload that already parallelizes.
7) Why does Postgres sometimes cancel queries on a replica?
Hot standby conflicts: replay needs to apply changes that conflict with a query’s snapshot or required locks. Depending on settings, Postgres either delays replay (more lag) or cancels queries (more query failures). Choose intentionally.
8) Are replication slots in Postgres required?
They’re required for logical replication and useful for preventing WAL removal before consumers have received it. They are also a disk-filling machine if consumers go away. Use them with monitoring and a lifecycle policy.
9) Should I use synchronous replication to eliminate lag?
Synchronous replication can bound lag by making commits wait. You pay in write latency and reduced availability if the synchronous standby is unhealthy. Use it for workloads where bounded staleness is worth the cost.
10) What’s the most common root cause you see in production?
Storage latency under write load, amplified by schema choices (too many indexes) and mixed workloads on replicas (analytics + freshness). The databases get blamed; the I/O graphs quietly tell the truth.
Conclusion: next steps that won’t embarrass you
If you remember one thing: replication lag is a queue. You don’t “tune it away” with wishful thinking. You either increase apply/replay capacity, reduce write volume, or change what you expect replicas to be.
Practical next steps:
- Implement read-your-writes for user-facing correctness. Stop pretending replicas are instantly consistent.
- Separate replica roles: one for freshness, one for reporting, one for DR. Give each a lag budget and enforce it.
- Instrument the replication pipeline with receive vs apply/replay metrics, not just a single lag number.
- Fix storage headroom: if your replica disks are pegged, nothing else matters.
- Audit write amplification: indexes, batch jobs, and hot rows. Most “replication problems” start in the schema and workload, not in replication.
- Write the runbook and rehearse it. The worst time to learn which metric is lying is 3 a.m. on a holiday.
MySQL and PostgreSQL both replicate well—when you treat replication as a first-class production system, not as a checkbox. The lag you tolerate is a product decision. The lag you don’t measure is a career decision.