You added read replicas to stop the “CPU 95%” pager. The graphs improved. Then support shows up with screenshots: “I just paid, but the order says unpaid.”
Replicas didn’t just fail to help. They helped in the worst possible way: by being confidently wrong.
Read replicas are a legitimate scaling tool. They’re also an excellent way to outsource your correctness to physics and hope. This is a practical guide to
how MySQL and PostgreSQL replicas behave in production, when they buy you headroom, and when they lie to your application.
When replicas help vs when they lie
Replicas help when reads are real reads
Read replicas shine when your workload contains heavy, repeatable reads that don’t need the newest data:
reporting, analytics, search indexing, “list of products” browsing, timelines with acceptable staleness, and
long-running queries you’d rather not run on the primary. You can carve off those reads and buy room for writes.
They also help when you need operational flexibility: online schema changes, vacuum pressure relief (Postgres),
secondary environments for verification, and safer maintenance windows. A replica can be the place you run “weird”
queries without gambling the primary’s cache and latency.
Replicas lie when you use them like a load balancer for correctness-sensitive traffic
Replication is usually asynchronous. That means the replica is behind by definition. The only question is “how far”
and “for how long.” The lie is subtle: the replica answers quickly and confidently, but with history.
If you send “read my own write” traffic to a replica without safeguards, you’re creating user-visible inconsistency.
The user does something; you accept it; they read back; you show old state. It looks like data loss, even when it’s
“just lag.”
First rule: if a read changes what the user will do next (money, auth, inventory, permissions),
don’t let it come from an async replica unless you have explicit session consistency logic.
Second rule: replicas do not fix slow queries. They just let you run more of them. If your read
path is unindexed garbage, you’ve built a larger garbage-disposal system—not a cleaner kitchen.
Joke #1: Replication lag is like a meeting you weren’t invited to—you only find out it happened after the decisions are already made.
What “help” looks like in metrics
- Primary CPU drops and stays down during peak read bursts.
- Primary buffer pool / shared buffers hit rate improves (less churn from read-heavy queries).
- p95 and p99 latencies on write transactions improve because lock pressure and cache churn drop.
- Replica lag stays bounded under expected peak (you know the budget and you enforce it).
What “lie” looks like in incidents
- “I just changed my password but it still accepts the old one.”
- “Inventory shows 3 left, checkout fails.”
- “Support can see it; customer can’t.” (support hits primary; customers hit replicas)
- Failover happens and your “replica” becomes primary with missing writes you thought were committed.
MySQL vs PostgreSQL replication: what’s actually different
Both systems can do read replicas. Both offer asynchronous and synchronous-ish options. Both can burn you if you
pretend replication is a magic mirror. But the failure modes differ because the mechanics differ.
MySQL replication: binlogs, GTIDs, and choices you can’t ignore
Classic MySQL replication is based on the primary writing changes to the binary log, and replicas fetching and
applying those changes. The core knobs that matter in production:
- Row-based vs statement-based binlogging: row-based is usually safer and more deterministic; statement-based is smaller but can be “creative.”
- GTID: global transaction IDs make failover and position tracking dramatically less awful.
- Single-threaded vs parallel apply: a single replication applier (SQL) thread can bottleneck and create lag even when the network is fine.
- Semi-sync: reduces data loss risk at commit time by requiring at least one replica to acknowledge receipt (not always apply) before commit returns.
In MySQL, “replica caught up” is often expressed in terms of the executed GTID set or Seconds_Behind_Source, but you
must understand what that “seconds” metric actually means: it’s derived from event timestamps, not a precise,
monotonic measure of freshness.
PostgreSQL replication: WAL, streaming, and the reality of “hot standby”
Postgres replicas stream WAL (write-ahead log). Replicas can be hot standbys: readable while applying WAL. This is
operationally lovely, until your read queries collide with the vacuum and row-version cleanup being replayed from the primary.
- Replication slots: protect WAL needed by replicas, but can fill disks if a replica falls behind.
- WAL receiver/replay: lag can be at receive, flush, or replay stages—each points to a different bottleneck.
- Hot standby conflicts: long reads can block cleanup; cleanup can cancel reads, depending on settings.
- Synchronous replication: can enforce commit waits on standbys; you choose how many must acknowledge.
Postgres provides rich, stage-specific lag metrics, which is great because you can finally stop arguing with your
app team about whether “seconds behind” is a vibe or a number.
The big operational difference: what “durable” means under replication
If you acknowledge a write to a client, the client will treat it as truth. If the primary crashes immediately after,
and your failover promotes a replica that never got that write, your system has performed accidental time travel.
MySQL semi-sync and Postgres synchronous replication both mitigate that, but they have different semantics. MySQL
semi-sync usually means “at least one replica received the event.” Postgres synchronous replication can mean “one or
more standbys confirmed flush” (or in newer configurations, more nuanced choices), which is closer to “it’s on disk elsewhere.”
The practical advice: if you are using replicas for failover safety, define your required commit durability explicitly.
If you are using replicas for read scaling, define your acceptable staleness explicitly. These are different goals.
Mixing them without acknowledging tradeoffs is how you get the worst of both.
Consistency models you can implement (and what they cost)
1) “Replica reads are best effort” (cheap, dangerous)
You spray GET requests across replicas and accept that sometimes users see old data. This is fine for non-critical
content. It is not fine for state transitions. If you can’t easily classify reads, you will accidentally use this model for everything.
2) Read-your-writes per session (the grown-up default)
After a user writes something, subsequent reads in that session should see it. You can implement this in a few ways:
- Primary pinning: after a write, route that user to primary for N seconds.
- GTID/WAL position fencing: record the commit position (GTID or LSN) returned by the primary; only read from replicas that have replayed past it (a sketch follows below).
- Application-level cache with versioning: keep recent write results in cache and merge with replica reads (harder than it sounds).
Cost: complexity in routing and instrumentation. Benefit: your product stops gaslighting your customers.
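A minimal sketch of the fencing option above, Postgres-flavored. It assumes psycopg2-style DB-API connections from whatever pool you already run; the table, column, and routing details are illustrative, not a prescription. pg_current_wal_lsn(), pg_last_wal_replay_lsn(), and pg_wal_lsn_diff() are the standard building blocks.

# Read-your-writes via LSN fencing (Postgres flavor).
# Assumption: psycopg2-style DB-API connections; the UPDATE is illustrative.

def write_and_capture_fence(primary_conn, subscription_id):
    with primary_conn.cursor() as cur:
        cur.execute(
            "UPDATE subscriptions SET status = 'cancelled' WHERE id = %s",
            (subscription_id,),
        )
    primary_conn.commit()
    with primary_conn.cursor() as cur:
        # Captured after COMMIT, so the fence position covers our commit record.
        cur.execute("SELECT pg_current_wal_lsn()")
        return cur.fetchone()[0]  # store per session: cookie, token, or cache

def replica_has_replayed(replica_conn, fence_lsn):
    with replica_conn.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0",
            (fence_lsn,),
        )
        return bool(cur.fetchone()[0])

def connection_for_read(primary_conn, replica_conn, fence_lsn):
    # Use the replica only when it is provably past the fence; otherwise fall
    # back to the primary instead of serving stale state.
    if fence_lsn is None or replica_has_replayed(replica_conn, fence_lsn):
        return replica_conn
    return primary_conn

The MySQL equivalent swaps LSNs for GTIDs; Task 2 below shows the raw comparison, and a fencing sketch follows it.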
3) Synchronous replication for correctness-sensitive writes (expensive, sometimes necessary)
If you must not lose committed transactions under a single-node failure, you need to wait for replica acknowledgments
before returning success. That increases commit latency and decreases throughput—especially across availability zones.
Use it for money movement, irreversible workflows, and credentials. Don’t blanket-enable it because someone heard
the phrase “strong consistency” and got excited.
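If only a handful of workflows need that guarantee, Postgres lets you request it per transaction instead of globally. A sketch under the assumption that synchronous_standby_names is already configured on the primary and the driver is psycopg2-style (non-autocommit, so these statements run in one transaction and SET LOCAL applies to it); the table and columns are illustrative.

# Selective durability: ask for synchronous commit only where it matters.
# Assumptions: synchronous_standby_names is configured, and the driver wraps
# these statements in a single transaction so SET LOCAL applies to it.

def transfer_funds(conn, from_id, to_id, amount):
    with conn.cursor() as cur:
        # 'on' waits for the configured standby(s) to flush the commit WAL;
        # 'remote_apply' also waits until it is visible to standby reads.
        cur.execute("SET LOCAL synchronous_commit = 'on'")
        cur.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_id),
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_id),
        )
    # Returns only after standby acknowledgment. If the named standby is
    # unreachable, this commit stalls rather than silently degrading.
    conn.commit()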
4) Logical replication / change streams for read models (often the best scaling story)
Sometimes the right answer is: stop making your primary database serve every read shape. Use replication (logical
decoding in Postgres; binlog-based CDC in MySQL) to build read-optimized stores: search indexes, caches, denormalized
tables, OLAP engines. That’s scaling. Replicas are a halfway house.
Interesting facts and historical context
- Fact 1: MySQL replication predates InnoDB becoming the default; early replication assumptions were shaped by simpler storage engines and looser durability expectations.
- Fact 2: Statement-based replication was popular partly because it reduced binlog size, but it exposed users to nondeterminism (time functions, random order without ORDER BY, etc.).
- Fact 3: MySQL GTIDs were a big operational shift: they turned “find the right binlog file and position” into “check transaction identity,” which is far less error-prone under failover.
- Fact 4: PostgreSQL’s WAL-based replication evolved from file shipping to streaming; streaming replication made “replica as a service” realistic instead of “restore jobs and prayer.”
- Fact 5: Postgres hot standby introduced a new class of production drama: read queries on replicas can conflict with vacuum cleanup, leading to query cancellations or bloat.
- Fact 6: Replication slots in Postgres are both a safety belt and a foot-gun: they prevent WAL loss for lagging replicas, but they can fill disks fast if a replica goes missing.
- Fact 7: “Seconds behind master/source” metrics have misled engineers for decades because they aren’t a universal measure of freshness; they’re a heuristic tied to event timestamps.
- Fact 8: Semi-synchronous replication in MySQL emerged because async replication plus automatic failover created unacceptable data loss windows for many businesses.
- Fact 9: Postgres synchronous replication allows choosing how many standbys must acknowledge, enabling quorum-style durability at the cost of tail latency.
Fast diagnosis playbook
You’re on call. Read traffic is spiking. Someone says “just add another replica.” Before you do, find the bottleneck.
This playbook assumes you want an answer in under 15 minutes, not a philosophical debate.
First: decide if the primary is actually read-bound
- Check primary CPU and I/O wait: If CPU is pegged but I/O wait is low, you might be compute-bound (bad queries, poor indexes, bad plan).
- Check buffer/cache hit rate: If your working set doesn’t fit memory, replicas may just multiply I/O pain.
- Check top queries: One query can eat a replica farm for breakfast.
Second: measure replication lag correctly
- MySQL: compare GTID sets, not vibes. Seconds behind source is a hint, not a promise.
- Postgres: separate receive lag from replay lag. Network vs disk vs CPU show up differently.
Third: check for replica-specific throttles
- MySQL: replication SQL thread not keeping up, parallel workers misconfigured, long-running transactions on primary creating giant events.
- Postgres: WAL replay stuck due to I/O, replication slot backlog, hot standby conflicts cancelling queries or delaying cleanup.
Fourth: validate routing and read-after-write behavior
- Confirm what percentage of traffic hits replicas vs primary.
- Confirm whether correctness-sensitive reads are pinned or fenced.
- Confirm whether your load balancer is silently failing open to replicas during primary overload.
Practical tasks: commands, outputs, decisions
Below are production-grade tasks. Each includes a command, sample output, what it means, and what decision you make.
Run them from a safe admin host. Adjust usernames, sockets, and hostnames to your environment.
Task 1 (MySQL): Check replication state and lag
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G"
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: mysql-primary-1
Source_Log_File: binlog.003921
Read_Source_Log_Pos: 812993441
Relay_Log_File: relay.000778
Relay_Log_Pos: 813002118
Replica_SQL_Running: Yes
Replica_IO_Running: Yes
Seconds_Behind_Source: 7
Retrieved_Gtid_Set: 2f1c9b2a-...:1-91833122
Executed_Gtid_Set: 2f1c9b2a-...:1-91833098
What it means: I/O and SQL threads are running; replica reports 7s behind, but GTID shows it’s missing 24 transactions.
Decision: If your SLA for staleness is < 1s (common for user state), this replica cannot serve post-write reads. Either pin to primary after writes or fence on GTID.
Task 2 (MySQL): Compare GTID execution between primary and replica
cr0x@server:~$ mysql -h mysql-primary-1 -e "SELECT @@global.gtid_executed\G"
*************************** 1. row ***************************
@@global.gtid_executed: 2f1c9b2a-...:1-91833122
cr0x@server:~$ mysql -h mysql-replica-1 -e "SELECT @@global.gtid_executed\G"
*************************** 1. row ***************************
@@global.gtid_executed: 2f1c9b2a-...:1-91833098
What it means: Replica is behind by a specific GTID range; you can quantify freshness without guessing.
Decision: Use this for automation: route session to replicas only after they reach a target GTID (or route to primary until then).
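A minimal sketch of that automation, assuming DB-API connections from a PyMySQL/mysqlclient-style driver and an app that stores the fence per session. WAIT_FOR_EXECUTED_GTID_SET() is a built-in MySQL function; using the primary’s full gtid_executed set as the fence is deliberately conservative.

# GTID fencing (MySQL flavor). Assumption: DB-API connections; routing logic
# is illustrative.

def capture_fence_gtid(primary_conn):
    # Call right after the write transaction commits on the primary. Waiting
    # for the full executed set guarantees our own transaction is included.
    with primary_conn.cursor() as cur:
        cur.execute("SELECT @@global.gtid_executed")
        return cur.fetchone()[0]

def replica_has_applied(replica_conn, fence_gtid_set, timeout_s=1):
    # WAIT_FOR_EXECUTED_GTID_SET returns 0 once the set has been applied,
    # 1 if the timeout expires first.
    with replica_conn.cursor() as cur:
        cur.execute(
            "SELECT WAIT_FOR_EXECUTED_GTID_SET(%s, %s)",
            (fence_gtid_set, timeout_s),
        )
        return cur.fetchone()[0] == 0

def connection_for_read(primary_conn, replica_conn, fence_gtid_set):
    if replica_has_applied(replica_conn, fence_gtid_set):
        return replica_conn
    return primary_conn  # never wait past the budget; fall back instead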
Task 3 (MySQL): Find whether replication is bottlenecked by single-threaded apply
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'replica_parallel_workers'; SHOW VARIABLES LIKE 'replica_parallel_type';"
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| replica_parallel_workers | 0 |
+--------------------------+-------+
+------------------------+----------+
| Variable_name | Value |
+------------------------+----------+
| replica_parallel_type | DATABASE |
+------------------------+----------+
What it means: Parallel workers are disabled. One SQL thread applies everything.
Decision: If the primary write rate is high and lag grows under load, enable parallel apply (carefully, with commit order preserved), then re-measure lag.
Task 4 (MySQL): Check for large transactions causing bursty lag
cr0x@server:~$ mysql -e "SHOW PROCESSLIST;"
+----+-----------+-----------------+------+---------+------+------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------+-----------------+------+---------+------+------------------------+------------------+
| 12 | repl | 10.0.2.15:45512 | NULL | Binlog Dump GTID | 0 | Source has sent all binlog to replica | NULL |
| 33 | app | 10.0.3.21:60433 | prod | Query | 218 | updating | UPDATE orders ... |
+----+-----------+-----------------+------+---------+------+------------------------+------------------+
What it means: A long-running UPDATE on primary can create huge binlog events; replicas apply later, then “catch up” in ugly jumps.
Decision: Break massive updates into chunks or run them off-peak. If you can’t, stop serving consistency-sensitive reads from replicas during the operation.
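If the big change has to happen anyway, chunking it keeps every transaction (and binlog event) small. A rough sketch, assuming a DB-API connection, row-based binlogging, and an illustrative predicate that stops matching rows once they have been updated.

import time

# Chunked update: many small transactions instead of one giant binlog event.

def chunked_archive(conn, batch_size=1000, pause_s=0.1):
    while True:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE orders SET status = 'archived' "
                "WHERE status = 'closed' AND created_at < '2024-01-01' "
                "LIMIT %s",
                (batch_size,),
            )
            affected = cur.rowcount
        conn.commit()        # each batch is its own (small) replicated transaction
        if affected == 0:
            break            # nothing left to touch
        time.sleep(pause_s)  # breathing room so replicas can keep up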
Task 5 (Postgres): Measure receive, flush, and replay lag
cr0x@server:~$ psql -h pg-replica-1 -d postgres -c "SELECT now() AS ts, pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
ts | pg_last_wal_receive_lsn | pg_last_wal_replay_lsn | pg_last_xact_replay_timestamp
-----------------------------+--------------------------+-------------------------+-------------------------------
2025-12-29 14:03:11.120+00 | 3A/9F12A7B0 | 3A/9F0C1028 | 2025-12-29 14:03:05.004+00
(1 row)
What it means: Replica is receiving WAL ahead of what it has replayed. Replay is behind by a chunk; last replayed transaction is ~6 seconds old.
Decision: If receive is ahead but replay lags, focus on replica CPU/disk and replay settings. If receive also lags, focus on network or primary WAL sender.
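To put a number on “replay is behind,” ask the standby how many bytes it has received but not yet replayed. A small sketch, assuming a psycopg2-style connection to the standby:

# Bytes received by the standby but not yet replayed. Run on the standby.

def replay_backlog_bytes(replica_conn):
    with replica_conn.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())"
        )
        return cur.fetchone()[0]  # large and growing: replica disk/CPU problem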
Task 6 (Postgres primary): Identify lagging standbys and their states
cr0x@server:~$ psql -h pg-primary-1 -d postgres -c "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
application_name | state | sync_state | write_lag | flush_lag | replay_lag
------------------+-----------+------------+-----------+-----------+------------
pg-replica-1 | streaming | async | 00:00:01 | 00:00:01 | 00:00:06
pg-replica-2 | streaming | async | 00:00:00 | 00:00:00 | 00:00:01
(2 rows)
What it means: Replica-1 is replaying 6s behind; replica-2 is close to real-time.
Decision: If your router can choose, prefer replica-2 for freshness-sensitive reads. Also investigate why replica-1 replays slower (I/O, CPU, conflicts).
Task 7 (Postgres): Detect hot standby query cancellations
cr0x@server:~$ psql -h pg-replica-1 -d postgres -c "SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock FROM pg_stat_database_conflicts;"
datname | confl_lock | confl_snapshot | confl_bufferpin | confl_deadlock
---------+------------+----------------+-----------------+----------------
prod | 0 | 17 | 0 | 0
(1 row)
What it means: Snapshot conflicts occurred 17 times—queries were canceled because replay needed to remove row versions that a long query depended on.
Decision: Either move those long queries elsewhere, tune max_standby_streaming_delay, or accept cancellations and add retry logic. Don’t pretend it’s “random.”
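If you choose the “accept cancellations and retry” route, the retry has to live in the application. A rough sketch, assuming a psycopg2-style connection; matching on the error text is a simplification, since your driver may expose a dedicated exception class or SQLSTATE instead.

import time

# Retry reads cancelled by recovery conflicts on the standby.

def run_with_conflict_retry(replica_conn, sql, params=None, retries=3, backoff_s=0.5):
    for attempt in range(retries + 1):
        try:
            with replica_conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except Exception as exc:
            replica_conn.rollback()
            if "conflict with recovery" not in str(exc) or attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))  # linear backoff, then give up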
Task 8 (Postgres): Check replication slots for runaway WAL retention
cr0x@server:~$ psql -h pg-primary-1 -d postgres -c "SELECT slot_name, active, restart_lsn, wal_status FROM pg_replication_slots;"
slot_name | active | restart_lsn | wal_status
------------------+--------+-------------+------------
replica1_slot | f | 39/FF0012A0 | extended
replica2_slot | t | 3A/9E1000A0 | reserved
(2 rows)
What it means: replica1_slot is inactive; Postgres may retain WAL from restart_lsn onward, potentially filling disk.
Decision: If replica-1 is gone for good, drop the slot. If it’s temporarily down, budget disk and fix the replica fast. This is not “later” work.
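A small monitor turns this into a page before it becomes a full disk. A sketch, assuming a psycopg2-style connection to the primary and a placeholder threshold you should replace with your own disk budget:

# Page on runaway WAL retention before the disk fills. Run against the primary.

RETAINED_WAL_LIMIT_BYTES = 50 * 1024**3  # example budget, not a recommendation

def slots_over_budget(primary_conn):
    with primary_conn.cursor() as cur:
        cur.execute(
            "SELECT slot_name, active, "
            "       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes "
            "FROM pg_replication_slots"
        )
        return [
            (name, active, retained)
            for name, active, retained in cur.fetchall()
            if retained is not None and retained > RETAINED_WAL_LIMIT_BYTES
        ]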
Task 9 (System): Verify disk latency on the replica (common replay bottleneck)
cr0x@server:~$ iostat -x 1 3
Linux 6.5.0 (pg-replica-1) 12/29/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.30 0.00 3.90 24.10 0.00 59.70
Device r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
nvme0n1 120.0 380.0 8120.0 48000.0 279.0 9.80 21.40 1.20 60.00
What it means: iowait is high; average wait is ~21ms. WAL replay is likely I/O constrained.
Decision: Stop blaming “Postgres replication.” Fix storage: faster disk, less contention, separate WAL, tune checkpointing, or reduce competing workloads on the replica.
Task 10 (MySQL): Check if semi-sync is enabled (and what it implies)
cr0x@server:~$ mysql -h mysql-primary-1 -e "SHOW VARIABLES LIKE 'rpl_semi_sync_master_enabled'; SHOW STATUS LIKE 'Rpl_semi_sync_master_status';"
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| rpl_semi_sync_master_enabled | ON |
+-------------------------------+-------+
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Rpl_semi_sync_master_status| ON |
+----------------------------+-------+
What it means: Semi-sync is enabled and active. Commits should wait for at least one replica to acknowledge receiving events.
Decision: Good for reducing data loss on failover, but it doesn’t guarantee replica is query-fresh. Do not treat this as “safe to read from replicas immediately.”
Task 11 (Postgres): Confirm synchronous replication configuration
cr0x@server:~$ psql -h pg-primary-1 -d postgres -c "SHOW synchronous_commit; SHOW synchronous_standby_names;"
synchronous_commit
-------------------
on
(1 row)
synchronous_standby_names
--------------------------
FIRST 1 (pg-replica-2)
(1 row)
What it means: The primary waits for one named standby (replica-2) for synchronous commit semantics.
Decision: If you’re counting on this for durability, monitor that the named standby is healthy. If it falls out, your system may stall or drop back to async depending on settings and orchestration.
Task 12 (Postgres): Identify slow queries on a replica (because replicas can be overloaded too)
cr0x@server:~$ psql -h pg-replica-1 -d prod -c "SELECT pid, now()-query_start AS age, state, wait_event_type, wait_event, left(query,120) AS q FROM pg_stat_activity WHERE state <> 'idle' ORDER BY age DESC LIMIT 5;"
pid | age | state | wait_event_type | wait_event | q
------+----------+--------+-----------------+--------------+----------------------------------------------------------
8412 | 00:03:41 | active | IO | DataFileRead | SELECT ... FROM events WHERE ... ORDER BY created_at DESC
 9120 | 00:01:12 | active |                 |              | SELECT ... FROM orders JOIN order_items ...
(2 rows)
What it means: Replica is busy. One query is waiting on I/O reads; the other shows no wait event, which on a busy backend usually means it’s burning CPU.
Decision: If replicas are overloaded, adding more replicas might help—but only if you fix the query pattern or indexing too. Otherwise you’re scaling pain horizontally.
Task 13 (MySQL): Observe read distribution (proxy/load balancer sanity)
cr0x@server:~$ mysql -h mysql-primary-1 -e "SHOW GLOBAL STATUS LIKE 'Questions';"
+---------------+-----------+
| Variable_name | Value |
+---------------+-----------+
| Questions | 819332199 |
+---------------+-----------+
cr0x@server:~$ mysql -h mysql-replica-1 -e "SHOW GLOBAL STATUS LIKE 'Questions';"
+---------------+-----------+
| Variable_name | Value |
+---------------+-----------+
| Questions | 83322188 |
+---------------+-----------+
What it means: These counters are cumulative since startup, so compare deltas over an interval; even so, the primary is clearly taking the vast majority of queries. Either routing isn’t working, or your app pins too aggressively.
Decision: Before adding replicas, fix routing. If you can’t prove read traffic moved, you’re just buying hardware to decorate dashboards.
Task 14 (System): Check network retransmits (streaming replication hates packet loss)
cr0x@server:~$ ss -ti dst 10.0.1.10:5432 | head -n 20
ESTAB 0 0 10.0.2.20:48932 10.0.1.10:5432
cubic wscale:7,7 rto:204 rtt:2.1/0.9 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:9231123 bytes_acked:9231000 bytes_received:1840012 segs_out:6123 segs_in:5981 data_segs_out:5123 data_segs_in:4981 send 55.2Mbps lastsnd:8 lastrcv:7 lastack:7 pacing_rate 110Mbps retrans:14/231
What it means: There are retransmits. A few are normal; a rising count that correlates with lag points at network issues causing receive lag.
Decision: If receive lag grows with retransmits, don’t “tune Postgres.” Fix the network path or move replication traffic to a less congested link.
Three mini-stories from corporate life
Mini-story 1: An incident caused by a wrong assumption
A mid-sized subscription business had a clean architecture on paper: primary database for writes, two read replicas for “scale,” and a proxy that routed SELECTs.
The feature team added a “cancel subscription” flow. It wrote to the primary, then immediately read the subscription state to decide what UI to show next.
Under light load it looked fine. Under peak load, it became a customer support generator.
The wrong assumption was simple: “the replica is basically real-time.” The proxy had no concept of session consistency. It didn’t know a user just wrote.
So sometimes the post-cancel read hit a replica still showing “active.” The UI would offer the wrong actions. Customers would retry. Retries created more writes.
Lag worsened. You can guess the rest.
The fun part: the dashboards looked “healthy.” Replica CPU was low. Primary CPU was moderate. Latency was fine. The only screaming metric was support tickets.
Engineers initially chased front-end caching, then “eventual consistency” narratives, then they tried adding a third replica. That improved nothing because the issue wasn’t capacity. It was semantics.
The fix was boring and immediate: after a write in that workflow, they pinned reads to the primary for 30 seconds for that user/session token.
They also added a “replica freshness” metric to the proxy so it could avoid a lagging replica entirely.
The incident ended, and everyone pretended they always knew it was a routing problem.
Mini-story 2: An optimization that backfired
An enterprise team tried to reduce primary load by moving “all reporting queries” to a Postgres hot standby. Reasonable. Except the reporting queries were not reports;
they were ad-hoc dashboards with multi-minute scans and a habit of exporting CSVs at quarter end.
On paper, this was safe because the standby wasn’t serving writes. In reality, those long-held snapshots collided with the cleanup records replaying from the primary and triggered standby conflicts.
The ops team adjusted settings to avoid canceling queries—because executives hate partial dashboards. That reduced cancellations but increased WAL replay delay.
Then a failover test happened. The promoted standby took longer than expected to become consistent, and a chunk of time-sensitive reads were now served from a node that was minutes behind.
Nothing was “corrupted.” It just wasn’t current. Meanwhile, primary disk usage climbed because WAL was being retained longer than anticipated.
The backfiring optimization was mixing “human-time reporting” workloads with “operational standby” responsibilities. The correct fix was to separate concerns:
a dedicated analytics replica with aggressive cancellation allowed, and a separate standby configured for fast replay and predictable failover behavior.
Mini-story 3: A boring but correct practice that saved the day
A payments-adjacent platform had both MySQL and Postgres in different domains. They had a rule that sounded like bureaucracy:
any replica used for reads must publish a freshness signal, and the router must refuse to use it if it exceeds the lag budget.
People complained about it in planning meetings. “We’ll just add more replicas.” “We don’t need the complexity.”
The SRE answer was consistent: “If we can’t measure freshness, we can’t sell correctness.”
So they implemented fencing: MySQL workflows stored the last GTID from the write transaction; Postgres workflows stored the last LSN.
A year later, a storage controller on one replica started intermittently stalling writes. The replica looked “up,” accepted connections, and served reads.
But WAL/relay apply slowed down and lag drifted beyond the budget.
The router automatically drained that replica from the pool. No customer impact. The on-call got a clean alert: “replica freshness violated,” not “customers are angry.”
They replaced the failing hardware during business hours, because boring safeguards buy you boring incidents.
Common mistakes: symptom → root cause → fix
1) Users don’t see their updates
Symptom: After an update, the UI shows the old value for a few seconds; refresh sometimes fixes it.
Root cause: Read-after-write routed to async replicas; no session consistency.
Fix: Pin session to primary after writes, or fence using GTID/LSN. Add replica lag budget enforcement in the router.
2) “Seconds behind source” is low but data is still missing
Symptom: MySQL shows 0–1s behind, but a just-committed row isn’t on the replica.
Root cause: The metric is timestamp-based and can mislead; the SQL applier may also be momentarily stalled, or the event timestamps may not reflect real freshness.
Fix: Compare GTID sets; fence based on executed GTID. Treat Seconds_Behind_Source as a smoke alarm, not a ruler.
3) Replica lag grows linearly during peak, then “catches up” in bursts
Symptom: Lag increases steadily and then drops suddenly; reads are stale intermittently.
Root cause: Large transactions or single-threaded apply; replica can’t parallelize work.
Fix: Chunk writes; enable/verify parallel apply; avoid giant multi-table updates in peak windows.
4) Postgres standby cancels queries
Symptom: Reports fail with “canceling statement due to conflict with recovery.”
Root cause: Hot standby conflicts between replay (cleanup) and long-running queries.
Fix: Make queries faster, move reporting off standby, or accept cancellations with retries. Tune standby delay thresholds intentionally.
5) Primary disk fills unexpectedly (Postgres)
Symptom: WAL directory grows rapidly; disk alerts; replication otherwise “fine.”
Root cause: Inactive replication slot retains WAL for a missing or stalled replica.
Fix: Drop unused slots; monitor slot lag; treat inactive slots as a production risk, not a curiosity.
6) Adding replicas doesn’t reduce primary load
Symptom: You add two replicas; primary CPU barely changes.
Root cause: Routing isn’t sending meaningful traffic, or the traffic that matters is write-bound, or app pins too much to primary.
Fix: Measure query counts per node; validate router rules; classify reads by correctness needs; fix the slow queries first.
7) Replicas are fast but the application gets slower
Symptom: DB nodes look fine; app p95 rises; timeouts increase.
Root cause: Connection storms, pool mis-sizing, or load balancer retries causing thundering herds across replicas.
Fix: Cap connection pools; use transaction pooling where possible; implement circuit breakers when replicas exceed lag or error rate.
8) Failover causes “missing” data (really: lost acknowledged writes)
Symptom: After primary crash and promotion, last few seconds of transactions are gone.
Root cause: Async replication; failover promoted a replica that hadn’t received/applied those writes.
Fix: Use synchronous replication for those transactions, or accept the RPO explicitly. Align failover automation with your durability model.
Joke #2: A read replica is an honest database with a poor memory. Unfortunately, your users don’t care about its backstory.
Checklists / step-by-step plan
Step-by-step: deciding whether replicas will help your workload
- Classify reads: mark endpoints/queries as “fresh required” vs “stale acceptable.” If you can’t, assume fresh required.
- Measure primary bottleneck: CPU, iowait, lock time, buffer hit rate, top queries.
- Establish a lag budget: pick a number (e.g., 250ms, 1s, 5s) per read class. Write it down.
- Choose routing policy: primary pinning after writes or GTID/LSN fencing. Don’t ship replicas without one.
- Instrument replica freshness: export lag metrics that reflect your chosen fencing mechanism.
- Load test with replication: include realistic write rates; measure replica apply throughput, not just query throughput.
- Plan for replica failure: what happens if a replica is slow, down, or stale? Your router should degrade gracefully to primary.
Step-by-step: building a “replicas don’t lie” read path
- On write: capture GTID (MySQL) or LSN (Postgres) from the transaction context.
- Store it per session/request: in a cookie, token, or server-side session store.
- Before replica read: pick a replica that has executed/replayed at least that position.
- If none qualify: read from primary, or wait up to a small timeout (dangerous; can amplify load).
- Enforce a global lag ceiling: if a replica exceeds budget, drain it automatically (see the sketch after this checklist).
- Audit endpoints quarterly: new features tend to sneak correctness-sensitive reads into “read-scaled” pools.
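A sketch of the lag-ceiling gate from the list above, in the Postgres flavor. It assumes a psycopg2-style connection to the standby and a placeholder budget; it is the kind of check a router or load-balancer health probe would poll.

# Freshness gate: report the standby unhealthy when it exceeds the lag budget
# so it gets drained instead of serving stale reads.

LAG_BUDGET_SECONDS = 1.0  # placeholder; pick your own per read class

def replica_is_fresh(replica_conn):
    with replica_conn.cursor() as cur:
        # Age of the last replayed transaction. Caveat: on a write-idle primary
        # this grows even though nothing is missing, so pair it with an LSN
        # comparison or a periodic heartbeat write if your write rate is low.
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
        )
        staleness = cur.fetchone()[0]
    return staleness is not None and float(staleness) <= LAG_BUDGET_SECONDS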
Operational checklist: replicas as an SRE feature, not an architecture slide
- Replica lag alerts that page at the point of user-visible inconsistency, not at “infinite patience.”
- Capacity plan for replicas: CPU and disk I/O sized for replay and query load simultaneously.
- Runbooks for: re-seeding replicas, recovering from slot/WAL backlog, handling broken relay logs, and controlled failover.
- Chaos test: kill a replica, introduce 1% packet loss, throttle disk—verify routing drains and fencing still works.
- Backups still exist and are tested. Replication is not backup; it is duplication of your mistakes at network speed.
FAQ
1) Can I use read replicas to scale writes?
No. Replicas scale reads (and some operational tasks). Writes still bottleneck on the primary unless you shard or redesign.
If your primary is write-bound, replicas won’t fix it; they might even worsen it by adding replication overhead.
2) Is PostgreSQL replication “more consistent” than MySQL replication?
Both can be configured for strong durability guarantees with synchronous replication, and both are commonly run async for performance.
The difference is not moral; it’s mechanical and observable. Postgres gives more granular lag visibility; MySQL’s GTID tooling can be excellent if used properly.
3) Does MySQL semi-sync mean my reads are fresh on replicas?
No. Semi-sync typically means a replica acknowledged receipt of the transaction’s events before commit returns. It does not guarantee the replica applied them
and made them visible for reads.
4) What’s the simplest safe routing strategy?
Primary pinning after writes for a short TTL is the simplest. It’s not perfect, but it’s usually good enough and easy to reason about.
If you need tighter correctness with fewer primary reads, move to GTID/LSN fencing.
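A sketch of pin-after-write with a TTL. The in-process dict is purely for illustration; in a real deployment the pin would live wherever your session state lives (cookie, token, or shared cache).

import time

# Pin-after-write with a TTL: recent writers read from the primary.

PIN_TTL_SECONDS = 30
_pinned_until = {}  # session_id -> unix timestamp when the pin expires

def note_write(session_id):
    _pinned_until[session_id] = time.time() + PIN_TTL_SECONDS

def pool_for_read(session_id):
    if _pinned_until.get(session_id, 0) > time.time():
        return "primary"   # recent write: avoid async replicas for this session
    return "replica"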
5) How many replicas should I have?
“Enough to handle peak read load with one replica down” is a practical starting point. Then add requirements:
a separate standby for failover, a separate replica for reporting, and maybe another for maintenance/testing.
Mixing all roles on one replica is how you get role conflict during incidents.
6) Why does replica lag spike during backups or maintenance?
Backups can saturate disk and evict cache. Maintenance jobs can create heavy I/O and CPU load. Replication apply competes for the same resources.
The fix is isolation: dedicate replicas for backup/reporting, throttle maintenance, and monitor replay/SQL thread utilization.
7) Are replicas a substitute for caching?
Not really. Replicas still execute queries and still hit storage. Caches absorb repeated reads cheaply; replicas spread query execution.
Use caches for hot key/value or computed responses; use replicas for offloading complex reads and isolating workloads.
8) What’s the fastest way to tell whether I should add a replica or fix queries?
If the top few queries dominate time and are slow because of missing indexes or bad plans, fix queries first.
If you have many moderately expensive reads and your primary is CPU-bound, replicas can help.
If you’re I/O-bound due to working set not fitting memory, replicas often just multiply disk contention.
9) Can I do multi-region read replicas for low latency?
Yes, but you are buying latency for staleness. Cross-region async replication will lag under any meaningful write rate.
For correctness-sensitive reads, you’ll still need primary pinning or fencing, and those tend to pull traffic back to the write region.
10) What quote should I remember when I’m tempted to “just add replicas”?
“Hope is not a strategy.” — paraphrased idea attributed to engineering and operations culture.
The operational version is: measure lag, enforce budgets, and design for the consistency you actually need.
Next steps you can do this week
If you want replicas that help instead of lying, do these in order:
- Write down your staleness budget: per endpoint or per workflow. If you can’t, assume 0 for anything involving money, auth, or permissions.
- Measure lag with the right primitive: GTID sets in MySQL; receive/replay lag and LSN in Postgres.
- Add a routing rule: pin-to-primary after writes, or implement GTID/LSN fencing for sessions.
- Drain stale replicas automatically: a replica outside budget should be treated like a partial outage, not a minor inconvenience.
- Separate roles: don’t run quarter-end reporting on the same standby you intend to promote during a primary failure.
- Fix the worst queries: replicas are not an excuse to keep a 90-second query in production. Make it fast or isolate it.
Replicas are a tool. Good tools still cut people when used carelessly. If you design your read path with explicit freshness rules and enforce them,
replicas become boring—and boring is what you want at 3 a.m.