Read replicas are supposed to be the easy button. Add a node, point analytics at it, reduce load on the primary, go home on time. In production, replicas often do the opposite: they amplify lock contention, introduce stale reads that break business logic, and quietly turn “scale reads” into “debug replication lag at 2 a.m.”
This piece is for people who operate databases, not people who collect vendor PDFs. We’ll compare PostgreSQL and Percona Server (MySQL-compatible) through the lens that matters: how replicas behave under real load, how they fail, and which knobs are actually worth touching.
A working mental model: what “scaling reads” really means
Scaling reads with replicas is not one problem. It’s three:
- Capacity: Can replicas execute your read workload fast enough without starving the primary or each other?
- Freshness: How stale can results be before your product becomes “creatively wrong”?
- Correctness under failover: Can you promote a replica without losing writes or corrupting application assumptions?
PostgreSQL and Percona Server approach these tradeoffs differently.
PostgreSQL physical replication (streaming WAL) is conservative: correctness and simplicity first, and it’s willing to cancel read queries on replicas to keep recovery moving. Percona Server inherits MySQL replication DNA: versatile, widely understood, and very capable—plus a history of “it worked in staging” incidents caused by subtle consistency details.
Read scaling also lives above the database. If you don’t have routing discipline—read/write split, session stickiness, “read your writes” handling—you’re just distributing confusion to more hosts.
One operational truth that keeps showing up: if your replicas aren’t measurably faster for the target queries than the primary, replicas are a tax, not a feature. They cost hardware, operational attention, and failure modes. Make them earn it.
Interesting facts and historical context (that explains today’s sharp edges)
- PostgreSQL “hot standby” arrived in 9.0 (2010): before that, physical replicas were mostly for failover, not serving live reads.
- MySQL replication started as asynchronous binlog shipping: the cultural default became “replicas can lag,” and many application patterns silently assume that.
- MySQL semisynchronous replication was introduced to narrow the gap: it improves durability guarantees, but every commit now waits on a network round trip, which can cap write throughput.
- PostgreSQL replication is WAL-based and page-oriented: it’s efficient and deterministic, but it couples replica progress to replay and vacuum interactions.
- MySQL/Percona GTID made failover saner: it reduces the “which binlog position are we at?” chaos that used to ruin weekends.
- Percona Server added operational instrumentation early: features like extra metrics and performance schema friendliness made it popular with SRE teams that like proof.
- PostgreSQL logical replication matured later (10+): it opened up selective replication and zero-downtime upgrades, but it’s not a silver bullet for read scaling.
- MySQL has long supported multi-source and advanced topologies: you can do clever things, and clever things come with clever failures.
PostgreSQL replicas: why they usually behave, and when they bite
How PostgreSQL read replicas work in practice
In classic PostgreSQL streaming replication, the primary generates WAL (write-ahead log). Replicas receive WAL and replay it. Reads on a replica happen against a consistent snapshot of the database as of the replayed WAL position.
That replay step matters. If replay is slow—IO, CPU, locks on the standby, or conflict handling—replication lag grows. The replica is alive and answering queries, but it’s answering from the past.
The “replicas that hurt” failure mode: recovery conflicts
PostgreSQL has an opinion: recovery must proceed. If a long-running query on the standby conflicts with WAL replay (commonly due to vacuum cleanup on the primary), PostgreSQL may cancel the standby query to keep replay moving. You’ll see errors like “canceling statement due to conflict with recovery”.
This is not PostgreSQL being mean; it’s PostgreSQL being honest. You’re asking a replica to be both a time machine and a data warehouse. Pick one, or engineer around it:
- Use hot_standby_feedback carefully to reduce cancellations (but accept bloat risk on the primary).
- Set max_standby_streaming_delay to trade replay lag for query completion.
- Route long analytics to a logical replica or a dedicated system, not your HA standby.
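If you decide to tune rather than re-route, both knobs live on the standby. A minimal sketch, assuming you are connected to the standby with sufficient privileges and that roughly five minutes of replay delay is acceptable for your product (the value is illustrative):

-- On the standby: ask the primary to retain row versions that running standby queries still need.
ALTER SYSTEM SET hot_standby_feedback = on;
-- On the standby: let replay wait up to 5 minutes for conflicting queries before canceling them.
ALTER SYSTEM SET max_standby_streaming_delay = '5min';
-- Both settings are reload-safe; no restart required.
SELECT pg_reload_conf();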
Freshness controls that matter
PostgreSQL gives you levers to bound staleness. The best one operationally is synchronous_commit combined with synchronous_standby_names: depending on the level (remote_write, on, remote_apply), a commit waits until at least one listed standby has received, flushed, or applied the WAL before it returns.
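As a sketch of what that contract looks like on the primary (the standby names are placeholders; remote_apply is the strictest level and the most expensive):

-- On the primary: commits wait for any one of the listed standbys.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby_a, standby_b)';
SELECT pg_reload_conf();
-- Per session, role, or transaction: wait until the standby has applied the WAL,
-- which makes the committed row visible to reads on that standby.
SET synchronous_commit = remote_apply;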
The hidden bill: network latency becomes part of your write path. If your product is latency-sensitive and your standby is cross-region, you’re effectively paying an international postage fee for every commit.
Replica performance in Postgres: don’t ignore the basics
Standby nodes need the same love as primaries: shared buffers sized reasonably, effective_cache_size set realistically, and storage that can keep up with random reads plus WAL replay writes. A standby with slow disks can lag even if it’s barely serving queries.
Also, Postgres replicas aren’t “free reads.” Every query needs CPU, memory, and IO. If you point a dashboard fleet at a single standby without connection pooling, you can DDoS your own database politely.
Joke #1: A read replica is like an intern taking notes—helpful until you ask them to predict the future, then they panic and delete your spreadsheet.
Percona Server replicas: fast, familiar, and full of footguns
What you’re really running when you run Percona Server
Percona Server is MySQL-compatible, typically used as a drop-in replacement with extra instrumentation and performance features. For read scaling, the core behaviors come from MySQL replication: the primary writes binlogs, replicas fetch and apply them. Historically, apply has been single-threaded; newer versions offer parallel replication, but only if you configure it and your workload cooperates.
The classic pain: replication lag isn’t a bug, it’s a default
Many MySQL/Percona setups treat replicas as “eventually consistent caches.” That’s fine until someone routes business-critical reads to replicas without a freshness contract. If your application does “write, then read, then charge money,” lag turns into support tickets and angry finance people.
Semisync, GTID, and the illusion of safety
Semisynchronous replication helps by making the primary wait for at least one replica to acknowledge receipt of the transaction before committing (depending on configuration). It reduces the window of loss on primary failure, but it doesn’t magically ensure replicas are caught up for reads. Receipt is not apply.
GTID-based replication makes failover and topology management less error-prone, but it doesn’t protect you from unsafe routing. GTID solves “what transaction are we on,” not “what does the app assume.”
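Where GTID does help reads is as a freshness fence. A minimal sketch, assuming the application captures the primary’s executed GTID set right after its commit and is willing to wait up to two seconds on the replica (the GTID value is a placeholder):

-- On the primary, immediately after the write transaction commits:
SELECT @@GLOBAL.gtid_executed;
-- On the replica, before the dependent read: block until that set has been applied.
-- Returns 0 once applied, 1 if the two-second timeout expires.
SELECT WAIT_FOR_EXECUTED_GTID_SET('source_uuid:1-984433', 2);

If the wait times out, fall back to the primary for that read instead of serving stale state.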
Replica performance is often an apply problem, not a query problem
When a Percona replica lags, teams often add CPU or increase buffer pool and hope. Sometimes that works. Often the replica is bottlenecked on SQL thread apply: a big transaction, a schema change, a single hot table, or insufficient parallelism settings.
Parallel replication can be a win, but it has prerequisites. If your workload is mostly single-table contention or large serialized transactions, parallel workers will sit around waiting their turn—like a meeting where everyone is “aligned” but nothing ships.
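If the prerequisites do hold, enabling it is unspectacular. A sketch using MySQL 8.0-era variable names (older releases use the slave_* equivalents; the applier must be stopped for the changes to take effect):

-- On the replica:
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';  -- schedule by commit dependencies
SET GLOBAL replica_parallel_workers = 4;             -- illustrative; size to cores and workload
SET GLOBAL replica_preserve_commit_order = ON;       -- keep commit order identical to the source
START REPLICA SQL_THREAD;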
Replica correctness landmines: statement vs row, and DDL behavior
Modern best practice is row-based replication for safety. Statement-based replication can drift with non-deterministic functions or timezone differences. Mixed mode is a compromise, and compromises are where incidents go to breed.
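Verifying and pinning the format on the source is cheap insurance. A sketch, assuming you are not in the middle of a migration that depends on the current format:

-- On the source: confirm the current format.
SELECT @@GLOBAL.binlog_format;
-- Pin it to ROW (SET PERSIST survives restarts on MySQL 8.0).
SET PERSIST binlog_format = 'ROW';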
DDL can also block replication apply or create long stalls. Online schema changes help, but they introduce their own operational complexity and can backfire if your tooling isn’t tuned for your workload.
Joke #2: Parallel replication is like hiring more movers—great until you realize the couch only fits through the door one way.
Read/write split: routing patterns that don’t set you on fire
Pattern 1: “Read your writes” with session stickiness
If a user just wrote data and expects to read it immediately, you have two sane options:
- Route that session to primary for a window (sticky primary after write), then allow replica reads later.
- Use a freshness fence: only read from replicas that have confirmed applying at least the write’s LSN/GTID (more complex, but precise).
Everything else is hope-based engineering.
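For PostgreSQL, the freshness fence is an LSN comparison. A minimal sketch, assuming the application stores the LSN it captured on the primary right after its commit (the literal LSN below is a placeholder):

-- On the primary, in the writing session, immediately after COMMIT:
SELECT pg_current_wal_lsn();
-- On the replica, before serving the dependent read:
-- true once the standby has replayed at least that far.
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/1A2B3C4D') >= 0 AS caught_up;

If the check is false, wait briefly and retry or route that read to the primary; either beats serving a stale row into a money flow.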
Pattern 2: Split by workload, not by endpoint
Don’t route “all SELECTs” to replicas. Route specific query classes. Analytics, reporting, and dashboard queries are good candidates—if they tolerate staleness and you protect the replicas from runaway concurrency.
Conversely, “SELECT … FOR UPDATE” belongs on the primary. So do reads that must see immediately-consistent state (inventory checks, idempotency keys, fraud logic).
Pattern 3: Designate replicas by purpose
In PostgreSQL, a hot standby for HA is not the same as a read pool node for BI. In Percona, a replica used for backups and ETL is not the same as the one you’d promote during an incident. Mixing purposes creates priority inversions: your backups throttle your failover candidate, or your dashboards either get canceled by recovery conflicts or, with hot_standby_feedback on, inflate primary bloat.
Pattern 4: Put a governor in front of replicas
Connection poolers, query timeouts, and per-role resource limits pay for themselves. The replica is a shared resource. Treat it like one. “Unlimited dashboard connections” is how you learn what file descriptor exhaustion feels like.
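Proxy-level limits are the main defense, but database-side caps catch what leaks through. A sketch with hypothetical account names (dashboards on a Postgres read pool node, dash on a Percona replica):

-- PostgreSQL: cap connections and runtime for the dashboard role.
ALTER ROLE dashboards CONNECTION LIMIT 20;
ALTER ROLE dashboards SET statement_timeout = '30s';
-- MySQL/Percona: cap simultaneous connections for the dashboard account.
ALTER USER 'dash'@'%' WITH MAX_USER_CONNECTIONS 20;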
Paraphrased idea (attributed): John Allspaw argues that reliability comes from learning and improving systems, not from blaming individuals after failure.
Practical tasks: commands, outputs, and decisions (do these in production)
These are the tasks I actually run when replicas misbehave. Each one includes: the command, what the output means, and what decision you make next.
PostgreSQL: replica health and lag
Task 1: Check replication state from the primary
cr0x@server:~$ psql -X -c "SELECT application_name, client_addr, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
application_name | client_addr | state | sync_state | write_lag | flush_lag | replay_lag
-----------------+--------------+-----------+------------+-----------+-----------+-----------
standby-a | 10.0.2.11 | streaming | async | 00:00:00 | 00:00:00 | 00:00:02
standby-b | 10.0.2.12 | streaming | async | 00:00:00 | 00:00:01 | 00:00:15
(2 rows)
Meaning: Lag is broken down by stage. Replay lag is what your reads feel. Standby-b is 15s behind on replay.
Decision: If replay lag correlates with read staleness incidents, route sensitive reads away from standby-b or fix replay bottlenecks.
Task 2: Measure byte lag precisely (LSN diff)
cr0x@server:~$ psql -X -c "SELECT application_name, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind FROM pg_stat_replication;"
application_name | bytes_behind
-----------------+--------------
standby-a | 524288
standby-b | 73400320
(2 rows)
Meaning: Standby-b is ~70MB behind. Time lag can look small until a burst hits; bytes show backlog.
Decision: If bytes behind grows steadily, you have sustained under-capacity (network, disk, CPU, or apply conflicts).
Task 3: On the standby, check replay timestamp lag
cr0x@server:~$ psql -X -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"
replay_delay
--------------
00:00:14.832
(1 row)
Meaning: “How far in the past is the latest applied transaction?” Very close to user-experienced staleness.
Decision: If replay_delay exceeds your product tolerance, stop using that standby for freshness-critical reads.
Task 4: See if standby queries are being canceled
cr0x@server:~$ sudo journalctl -u postgresql -n 50 --no-pager
Dec 30 10:12:01 standby-a postgresql[1832]: ERROR: canceling statement due to conflict with recovery
Dec 30 10:12:01 standby-a postgresql[1832]: DETAIL: User query might have needed to see row versions that must be removed.
Dec 30 10:12:01 standby-a postgresql[1832]: STATEMENT: SELECT ... FROM events WHERE ...
Meaning: Hot standby conflicts. WAL replay is cleaning up row versions; your query wants them.
Decision: Either move long queries off the standby, or accept the tradeoff of hot_standby_feedback=on (with a bloat plan), or adjust standby delay settings.
Task 5: Check if hot_standby_feedback is enabled
cr0x@server:~$ psql -X -c "SHOW hot_standby_feedback;"
hot_standby_feedback
---------------------
off
(1 row)
Meaning: Standby is not telling the primary to retain old row versions longer.
Decision: If you’re canceling important reads, consider turning it on—but only after checking primary bloat and vacuum headroom.
Task 6: Verify synchronous replication settings (freshness contract)
cr0x@server:~$ psql -X -c "SHOW synchronous_commit; SHOW synchronous_standby_names;"
synchronous_commit
-------------------
on
(1 row)
synchronous_standby_names
--------------------------
standby-a
(1 row)
Meaning: Commits wait for standby-a (depending on sync mode). This can cap write latency but tightens RPO.
Decision: If your write latency budget can’t afford it, don’t pretend you have synchronous durability. Switch to async and design for it, or keep sync but ensure standby-a is low-latency and well-provisioned.
Task 7: Find top read queries on the standby
cr0x@server:~$ psql -X -c "SELECT query, calls, mean_exec_time, rows FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
query | calls | mean_exec_time | rows
------------------------------------------------+-------+----------------+------
SELECT * FROM report_rollups WHERE ... | 120 | 842.123 | 1200
SELECT ... JOIN ... WHERE ... | 5400 | 120.512 | 40
(2 rows)
Meaning: Your replica is slow because it’s being asked to do expensive things, repeatedly.
Decision: Add indexes, reduce result sets, cache, or move that workload to a dedicated analytics store. “Replica” is not a magic word that turns bad queries into good ones.
Percona Server: replica health and apply lag
Task 8: On a replica, check replication threads and lag
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G"
*************************** 1. row ***************************
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 27
Retrieved_Gtid_Set: source_uuid:1-984433
Executed_Gtid_Set: source_uuid:1-984120
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Meaning: IO is fine, SQL apply is behind. GTID sets show how far. The SQL state hints at dependency/serialization.
Decision: Focus on apply throughput: parallel settings, transaction size, hot rows/tables, and disk performance for redo/undo.
Task 9: Check parallel replication configuration
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'replica_parallel%'; SHOW VARIABLES LIKE 'slave_parallel%';"
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| replica_parallel_workers | 8 |
| replica_parallel_type | LOGICAL_CLOCK |
+-----------------------------------+-------+
+-----------------------------------+-------+
| Variable_name | Value |
+-----------------------------------+-------+
| slave_parallel_workers | 8 |
| slave_parallel_type | LOGICAL_CLOCK |
+-----------------------------------+-------+
Meaning: Parallel workers enabled. If lag persists, your workload may not parallelize well or you’re bottlenecked elsewhere.
Decision: If workers are 0, enable them carefully; if enabled already, investigate big transactions or disk stalls rather than “add more workers.”
Task 10: Find the largest recent transactions (binlog apply pain)
cr0x@server:~$ mysql -e "SELECT THREAD_ID, EVENT_NAME, TIMER_WAIT/1000000000000 AS seconds, SQL_TEXT FROM performance_schema.events_statements_history_long ORDER BY TIMER_WAIT DESC LIMIT 3;"
+-----------+--------------------------+---------+--------------------------------------+
| THREAD_ID | EVENT_NAME | seconds | SQL_TEXT |
+-----------+--------------------------+---------+--------------------------------------+
| 2213 | statement/sql/insert | 12.421 | INSERT INTO audit_log SELECT ... |
| 1987 | statement/sql/update | 9.812 | UPDATE orders SET status='...' WHERE...|
+-----------+--------------------------+---------+--------------------------------------+
Meaning: Big statements can serialize apply and create lag spikes.
Decision: Break up monster transactions, avoid “INSERT INTO … SELECT” in peak hours, and consider batching patterns that respect replication.
Task 11: Check InnoDB pressure on the replica (buffer pool and IO)
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| Innodb_buffer_pool_reads | 18399231 |
+--------------------------+----------+
+----------------------------------+------------+
| Variable_name | Value |
+----------------------------------+------------+
| Innodb_buffer_pool_read_requests | 9912837712 |
+----------------------------------+------------+
Meaning: If buffer pool reads climb fast relative to requests, you’re missing cache and hitting disk. Replicas doing analytics tend to do that.
Decision: Increase buffer pool (if memory allows), reduce working set, add indexes, or move analytics off the replica.
Task 12: Check if the replica is throttled by fsync/IO
cr0x@server:~$ iostat -xm 1 3
Linux 6.5.0 (replica-1) 12/30/2025 _x86_64_ (8 CPU)
Device r/s w/s rMB/s wMB/s await %util
nvme0n1 120.0 410.0 8.1 52.3 18.4 98.7
Meaning: Disk is saturated (%util ~99%). Await is high. Apply and reads are fighting for IO.
Decision: Move redo/binlog to faster storage, reduce read load, tune flush settings carefully, or add replicas to split the read workload.
Task 13: Verify binlog/GTID mode from the source (failover readiness)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'gtid_mode'; SHOW VARIABLES LIKE 'enforce_gtid_consistency'; SHOW MASTER STATUS\G"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode | ON |
+---------------+-------+
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| enforce_gtid_consistency | ON |
+--------------------------+-------+
*************************** 1. row ***************************
File: binlog.000812
Position: 94822133
Executed_Gtid_Set: source_uuid:1-984433
Meaning: GTID is on and consistent. That’s your baseline for sane failover tooling.
Decision: If GTID is off in 2025, budget time to fix it. You can still run without it, but you’re choosing manual error-prone operations.
Task 14: Spot read routing mistakes via processlist
cr0x@server:~$ mysql -e "SHOW PROCESSLIST;"
+-----+------+-----------------+------+---------+------+------------------------+------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+------+-----------------+------+---------+------+------------------------+------------------------------+
| 101 | app | 10.9.1.25:42110 | prod | Query | 12 | Sending data | SELECT ... FOR UPDATE ... |
| 142 | app | 10.9.1.25:42144 | prod | Query | 8 | Waiting for table lock | UPDATE users SET ... |
+-----+------+-----------------+------+---------+------+------------------------+------------------------------+
Meaning: Writes and locking reads on a replica indicate broken routing. Some proxies misclassify queries; some apps ignore read/write intent.
Decision: Fix routing rules and enforce read-only users on replicas. If the replica is writable, someone will write to it—accidentally or “temporarily.”
Cross-cutting tasks: OS, network, and capacity reality
Task 15: Check network retransmits (replication stream quality)
cr0x@server:~$ ss -ti dst :5432
ESTAB 0 0 10.0.2.11:50824 10.0.1.10:5432
cubic wscale:7,7 rto:214 rtt:3.1/1.2 ato:40 mss:1448 pmtu:1500 rcvmss:1448 advmss:1448 cwnd:10 bytes_sent:248120122 bytes_retrans:81234 bytes_acked:248038888
Meaning: Retransmits exist; if they’re high and growing, your “replication lag” is partly a network problem.
Decision: Fix packet loss before tuning database knobs. Databases are great at many things; packet loss therapy is not one of them.
Task 16: Confirm a replica is actually read-only at the database level
cr0x@server:~$ psql -X -c "SHOW transaction_read_only;"
transaction_read_only
-----------------------
off
(1 row)
Meaning: This session is not read-only. A genuine standby reports on because it is in recovery; off means you are connected to a promoted or misconfigured node, which is dangerous if the application assumes it is a replica.
Decision: Enforce read-only roles and proxy-side restrictions. Relying on “it’s a standby” is not a policy.
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'read_only'; SHOW VARIABLES LIKE 'super_read_only';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | ON |
+---------------+-------+
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| super_read_only | ON |
+-----------------+-------+
Meaning: The replica is protected even from privileged users (with caveats).
Decision: Turn on super_read_only on replicas that should never accept writes.
Fast diagnosis playbook
When someone pings you “replicas are slow” or “reads are inconsistent,” you don’t have time for a philosophical debate about CAP theorem. Do this.
First: decide whether it’s freshness or performance
- PostgreSQL: check now() - pg_last_xact_replay_timestamp() on the replica and pg_stat_replication on the primary.
- Percona: check Seconds_Behind_Source and the SQL thread state, plus GTID executed vs. retrieved sets.
If lag is high, the “slow reads” complaint may be stale-read bugs, not slow queries.
Second: find the bottleneck category
- Apply/replay bottleneck: IO saturated, SQL thread serialized, WAL replay stuck, conflicts/cancellations.
- Query bottleneck: expensive queries, missing indexes, too much concurrency, bad plans.
- System bottleneck: CPU steal, memory pressure, network loss, noisy neighbor.
Third: make a routing decision immediately
- If lag > tolerance: route sensitive reads to primary or to a healthier replica.
- If replica is overloaded: cap concurrency (pooling, timeouts) and shed non-critical read traffic.
- If apply is behind: stop sending heavy analytics to the HA candidate. Let it catch up.
Fourth: fix with the smallest lever that changes the outcome
Replica problems are often solved by boring actions: one index, one query rewrite, one reduction in connection counts, one IO upgrade. Resist the urge to redesign replication topology during an incident.
Common mistakes: symptoms → root cause → fix
Mistake 1: “We can send all SELECTs to replicas”
Symptoms: Random stale reads, “missing” rows, inconsistent UI state after updates.
Root cause: No freshness contract; async replication lag is normal under load.
Fix: Implement “read your writes” (sticky primary window or LSN/GTID fences). Route only stale-tolerant query classes to replicas.
Mistake 2: Postgres standby cancels BI queries
Symptoms: Errors about conflict with recovery, dashboards time out during vacuum-heavy periods.
Root cause: Hot standby conflict between WAL replay and long-running snapshots.
Fix: Move BI to dedicated replica/logical replication; or enable hot_standby_feedback with a bloat plan; or bound query runtimes aggressively.
Mistake 3: Percona replica “is healthy” because IO thread is running
Symptoms: IO thread Yes, SQL thread Yes, but lag keeps growing; reads become increasingly stale.
Root cause: Apply throughput is insufficient (SQL thread serialization, large transactions, disk saturation).
Fix: Tune parallel replication, reduce transaction size, fix hot-spot tables, and provision IO for apply as well as reads.
Mistake 4: Replicas used for backups and reporting and failover
Symptoms: Failover candidate is always behind; promotion causes data loss window to widen; backups slow down the cluster.
Root cause: Multi-purpose replicas create competing priorities and unpredictable lag.
Fix: Assign roles: at least one “clean” HA replica, separate reporting replicas, separate backup/ETL node if needed.
Mistake 5: Over-optimizing durability on the wrong tier
Symptoms: Writes get slower after enabling sync/semisync; p99 latency spikes; incident load increases.
Root cause: Latency is now bound to replica RTT and fsync. The replica wasn’t provisioned for that responsibility.
Fix: If you need stronger durability, keep synchronous targets close and fast. Otherwise accept async and engineer around RPO/RTO explicitly.
Mistake 6: Read pool melts down from connection storms
Symptoms: Replicas show high CPU context switching, many idle connections, memory pressure, sudden latency spikes.
Root cause: No pooling or concurrency limits; dashboards and batch jobs open many connections.
Fix: Use PgBouncer or ProxySQL (or equivalent), set pool sizes, enforce timeouts, and isolate batch jobs.
Three mini-stories from corporate life (anonymized, painfully familiar)
Incident caused by a wrong assumption: “Reads can’t break money flows”
A mid-sized subscription platform moved “most reads” to MySQL-compatible replicas to protect the primary. It was a reasonable goal: reduce primary CPU, keep write latency stable, and scale horizontally. A proxy did the split based on whether the query started with SELECT.
Two weeks later, support noticed a pattern: cancellations that should have been immediate sometimes looked “pending” for minutes. Finance noticed something worse: occasional double-charges when users retried actions. Nobody had changed the billing logic. The database graphs looked fine—until they looked at replica lag during peak traffic.
The wrong assumption was simple: “billing reads are just reads.” But the billing flow did write, then read to verify state before calling an external payment provider. When that verification hit a lagging replica, it saw the old state and retried. The system behaved consistently, just not correctly.
The fix wasn’t exotic. They added a “primary after write” stickiness window per user session, and they tagged certain query paths as primary-only. Lag still happened sometimes, but it stopped being a correctness bug. They also started alerting on “lag above business tolerance,” not “lag above some random number.”
Optimization that backfired: “Let’s make the standby stop canceling queries”
A different company ran PostgreSQL with a hot standby serving dashboards. The BI team complained about canceled queries during heavy write periods. The database team flipped hot_standby_feedback=on so the standby would stop losing long-running queries.
Dashboards became stable. Everyone celebrated. Then disk usage on the primary began creeping upward in a way that didn’t match data growth. Autovacuum activity increased, but table bloat still grew. Index scans slowed. The primary was working harder to do the same work.
What happened: hot_standby_feedback prevented vacuum cleanup of dead tuples that the standby might still need. Long queries on the standby effectively told the primary, “keep old versions around,” and the primary obliged until it became a storage and performance problem.
They backed out the change and split workloads: an HA standby optimized for replay and failover, and a separate reporting replica with controlled query timeouts and a schedule for heavy reports. The BI team didn’t love the constraints, but the primary stopped slowly suffocating.
Boring but correct practice that saved the day: “One replica is sacred”
An enterprise SaaS company had a rule that sounded bureaucratic: one replica in each cluster is the promotion candidate, and it is not used for ad-hoc reads, ETL, backups, or experiments. People complained it was “wasted hardware.” It wasn’t.
During a storage incident on a primary, they had to fail over under pressure. The promotion candidate replica was consistently within a tight lag bound, because it was protected from noisy workloads and its IO budget was reserved for apply. Promotion was quick and predictable.
Meanwhile, another replica that served reporting was minutes behind due to a batch job doing large scans. If that node had been promoted, the write loss window would have been materially worse, and the recovery path would have involved business-level data reconciliation. Instead, it was a standard failover and a standard postmortem.
The practice wasn’t glamorous. It didn’t require new tools. It saved the day by reducing surprise. In reliability work, reducing surprise is basically the entire job.
Checklists / step-by-step plan: replicas that work
Step 1: Define the freshness contract in business terms
- List flows that must “read your writes” (payments, auth, inventory, idempotency, admin actions).
- Define maximum acceptable staleness for everything else (seconds, not vibes).
- Turn those into routing rules and alerts.
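One way to turn “seconds, not vibes” into an alert: a probe your monitoring can run against each Postgres replica, assuming a five-second tolerance (illustrative):

-- Returns true when this replica is too stale for freshness-sensitive routing.
-- NULL (no transaction replayed yet) is treated as fresh here; adjust to taste.
SELECT COALESCE(
         EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())),
         0) > 5 AS too_stale;

On Percona, the equivalent alert combines Seconds_Behind_Source with the executed-vs-retrieved GTID backlog from SHOW REPLICA STATUS.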
Step 2: Separate replica roles
- HA replica: minimal read traffic, tuned for replay/apply, monitored for promotion readiness.
- Read pool replicas: serve application reads that tolerate lag; protected by pooling and timeouts.
- Reporting/ETL: separate node if the workload is heavy or long-running.
Step 3: Make routing enforceable
- Use distinct users/roles for primary vs replica.
- On Percona replicas: enable super_read_only.
- In Postgres: set default_transaction_read_only for replica roles (and enforce at the proxy/app layer too).
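A sketch of the enforcement side, assuming a replica-only Postgres role named replica_reader (hypothetical):

-- Percona replica: block writes even from privileged accounts.
SET PERSIST super_read_only = ON;  -- implicitly enables read_only as well
-- PostgreSQL: make the replica role read-only by default, wherever it connects.
ALTER ROLE replica_reader SET default_transaction_read_only = on;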
Step 4: Cap concurrency and runtime
- Install connection pooling; set hard limits per service.
- Set statement timeouts for dashboards and ad-hoc tools.
- Keep an eye on long transactions; they wreck both systems differently, but they wreck them.
Step 5: Ensure replicas have enough IO for two jobs
- Replicas must apply changes and serve reads. That’s write + read IO.
- Measure disk saturation during peak traffic, not at noon on a Tuesday.
- If storage is the bottleneck, tuning SQL is a hobby, not a fix.
Step 6: Test failover like you mean it
- Measure: promotion time, client reconnection time, and data loss window (or confirmation of none, if synchronous).
- Verify the promoted node’s settings: read-only flags off, correct server_id, correct replication config, correct backups.
- Practice restoring a new replica from backup + catch-up. If you can’t do that, you don’t have a replication system; you have pets.
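For the PostgreSQL half of that drill, promotion plus the post-promotion sanity check fits in two statements; a sketch, run on the standby you are promoting:

-- Promote this standby; waits (up to 60s by default) for promotion to complete.
SELECT pg_promote();
-- Must now return false: the node has left recovery and accepts writes.
SELECT pg_is_in_recovery();

On the Percona side, the drill is the mirror image: stop replication on the candidate, disable read_only and super_read_only, and repoint the remaining replicas with GTID auto-positioning.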
FAQ
1) Are PostgreSQL replicas “more consistent” than Percona replicas?
By default, both are typically asynchronous, which means both can be stale. PostgreSQL tends to be more explicit about conflicts (canceling queries) rather than silently lagging forever, but staleness is still real.
2) What’s the simplest safe read-scaling approach?
Route only stale-tolerant reads to replicas, keep “read your writes” on primary via session stickiness, and cap replica concurrency with pooling.
3) Why is my Postgres standby canceling queries?
Hot standby conflicts happen when WAL replay needs to remove old row versions (often due to vacuum) but a standby query still needs them. Fix by moving long queries elsewhere, tuning standby delay, or using hot_standby_feedback with care.
4) Why does Seconds_Behind_Source lie sometimes?
In MySQL/Percona, it can be NULL or misleading during certain states, and it’s time-based rather than backlog-based. Compare GTID sets and watch apply thread state for a clearer picture.
5) Should I use synchronous replication to make replicas safe for reads?
Synchronous settings help durability and reduce data loss on failover. They do not automatically guarantee “safe reads” unless you also ensure reads go to nodes that have applied the needed transactions. It’s still a routing problem.
6) Can I run analytics on the HA replica?
You can, until the day you need to fail over and discover your HA node is behind or resource-starved. If you care about predictable failover, keep one replica boring and protected.
7) Is logical replication better for read scaling?
Sometimes. Logical replication can reduce some replay conflict issues and let you replicate subsets, but it adds operational complexity and doesn’t remove the need for routing discipline and performance tuning.
8) How many replicas do I need?
Enough to handle peak read concurrency with headroom, plus at least one node reserved for HA (depending on your failover design). If one replica is always pegged, you don’t have enough, or you’re routing the wrong workload.
9) What’s the best alert for replica health?
Alert on lag relative to business tolerance, plus disk saturation on replicas, plus query cancellations (Postgres) or apply stalls (Percona). Alerts that don’t map to user impact create alert fatigue.
10) Can proxies fully automate read/write split safely?
They help, but they can’t read your product requirements. A proxy can classify queries; it can’t know which reads must be fresh. You must encode that logic in application routing or session policy.
Practical next steps
- Write down the freshness contract for your top 10 user flows. If you can’t state it, you can’t enforce it.
- Pick replica roles: one HA candidate kept clean, one or more read pool replicas, and (if needed) a reporting/ETL node.
- Implement routing guardrails: sticky primary after write, replica-only users, and read-only enforcement on replicas.
- Run the commands above weekly and during peak: lag in time and bytes, apply state, IO saturation, top queries.
- Fix the biggest bottleneck first: saturated disks, runaway concurrency, or one pathological query. Don’t tune knobs as a substitute for capacity.
- Practice failover and measure it. If failover is an untested theory, it will be tested for you later, at an inconvenient time.
Replicas can absolutely scale reads. They can also scale confusion. The difference is whether you treat replication as an engineering system with contracts, budgets, and observability—or as a magic trick you hope nobody asks you to explain.