You don’t buy “high availability” on a slide deck. You run it. At 2 a.m. When a rack trips, a router reboots,
a deploy goes sideways, and your on-call phone starts doing cardio.
PostgreSQL can be boringly reliable in production—if you accept that HA is mostly a system you build around it.
CockroachDB offers HA as a first-class feature—if you accept that distributed systems don’t remove pain, they relocate it.
The real question: what kind of failure do you want?
The most common mistake in HA database selection is thinking the question is “Which database is more available?”
Availability is not a brand attribute. It’s an end-to-end property that includes your client drivers, DNS, load balancers,
schema migration habits, backup discipline, observability, and the kind of humans you hire.
The actual question is: what failure are you comfortable diagnosing under stress?
- With PostgreSQL, the HA story typically involves streaming replication, a failover manager, and clear separation
between “the one writable node” and “replicas that might lag.” Your pain is usually around failover orchestration,
read/write routing, replication lag, and operational sharp edges.
- With CockroachDB, the HA story is built in: consensus replication, automatic rebalancing, and SQL that looks familiar.
Your pain shifts to distributed debugging: contention, range splits, hotspots, cross-region latency amplification,
and learning the cluster’s internal state model.
Pick the pain you can pay for. Not the pain that looks better in a diagram.
Quick positioning: when each database is the right hammer
Choose PostgreSQL HA when you can centralize writes and control complexity
PostgreSQL is a workhorse. You get mature SQL semantics, a massive ecosystem, predictable performance characteristics,
and a failure model that most SRE teams can reason about. In practice, HA PostgreSQL is usually:
one primary, one or more replicas, plus an orchestrator (Patroni, repmgr, Pacemaker/Corosync),
and a routing layer (HAProxy, PgBouncer, cloud load balancer, or application-level logic).
PostgreSQL HA is the right choice when:
- Your write workload is heavy and latency-sensitive, and you can keep the primary in one region/zone.
- You want full Postgres feature depth (extensions, advanced indexing, rich tooling) without “mostly compatible” caveats.
- You can tolerate that cross-region HA is usually disaster recovery (DR), not “active-active writes.”
- You have a team that can treat failover as an engineered subsystem with drills.
Choose CockroachDB when you need writes to survive node loss without a bespoke HA stack
CockroachDB is distributed SQL: a consensus-replicated key-value store with SQL on top. The sales pitch is compelling:
“It keeps running when machines die.” Often true. But you pay with new failure modes and more moving parts per query.
Your operational posture shifts from “protect the primary” to “keep the cluster balanced and the hotspots cooled.”
CockroachDB is the right choice when:
- You need automatic failover of writes without promoting a replica.
- You have a multi-zone footprint and you want the database to handle replica placement and quorum automatically.
- Your org is ready to learn distributed performance profiling and accept higher baseline latency for safety.
- You can design data locality (partitioning/geo) intentionally, not by wishful thinking.
Dry-funny joke #1: Building Postgres HA is like assembling IKEA furniture—stable and useful, unless you “freestyle” a missing screw.
History and interesting facts you can weaponize
Context matters because database design is mostly the fossil record of old outages.
Here are facts that actually help you make decisions:
- PostgreSQL descends from POSTGRES (UC Berkeley, mid-1980s), designed when correctness mattered more than cloud buzzwords.
That DNA shows in its conservative, predictable behavior.
- Write-ahead logging (WAL) is older than your CI system. Postgres relies on WAL to guarantee durability and enable replication.
If you don’t respect WAL and checkpoints, you will eventually have a slow-motion incident.
- Streaming replication in Postgres arrived in 9.0 (2010), which made “real” physical replication mainstream and fast.
Before that, HA was clunkier and more manual.
- Postgres hot standby (readable replicas) also landed in 9.0. That enabled a standard pattern: scale reads, protect the primary, and plan failovers.
- CockroachDB was inspired by Google Spanner (publicly described in 2012). Spanner’s thesis: global-scale SQL with strong consistency,
at the cost of coordination.
- CockroachDB uses Raft consensus for replication. Raft is designed to be understandable (relative to Paxos), but “understandable”
still involves terms like quorums, leases, and leadership transfers.
- “Distributed SQL” became a category because NoSQL couldn’t pretend forever. Many teams wanted SQL transactions back
after learning that eventual consistency is a lifestyle, not a feature toggle.
- CAP is not a product comparison tool. In real systems, partition tolerance is not optional, and “consistency” has multiple meanings.
The useful question is: what tradeoff does the system make during partitions, and how visible is it to your app?
- Postgres extensions are a superpower and a trap. They let you build specialized systems fast (PostGIS, pgcrypto, Timescale-style tooling),
but they anchor you to Postgres semantics in ways “compatible” systems may not match.
How HA actually works: Postgres replication vs Cockroach consensus
PostgreSQL HA: single-writer with replicas, plus orchestration glue
PostgreSQL’s core architecture is single-writer. You can scale reads with replicas and you can fail over by promoting a replica.
But “HA Postgres” is not a single feature; it’s a choreography:
- Streaming replication copies WAL from primary to replicas.
- Synchronous replication can ensure commits wait for replica acknowledgement (lower RPO, higher latency).
- Failover management decides which node becomes primary and rewires clients.
- Split-brain prevention is your job. You need fencing, quorum, or a reliable distributed lock.
The Postgres win is clarity: there is a primary. If writes fail, you look for the primary or promote one.
The Postgres risk is that your HA stack is “batteries not included.” That stack can be excellent, but it must be engineered.
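For orientation, here is a minimal sketch of the primary-side settings that choreography leans on. The standby names and numbers are illustrative assumptions, not a recommendation:
-- Primary-side replication settings (illustrative values; wal_level and max_wal_senders need a restart).
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1, replica2)';  -- omit for async replication
SELECT pg_reload_conf();  -- picks up the reloadable parameters; plan a restart for the rest
-- On each standby: primary_conninfo plus a standby.signal file (pg_basebackup -R writes both).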
CockroachDB HA: consensus replication everywhere, and SQL rides shotgun
CockroachDB distributes data into ranges (shards). Each range is replicated (often 3 replicas by default),
and one replica is the Raft leader handling writes for that range. Transactions coordinate across ranges.
Failover is internal: if a node dies, another replica becomes leader for affected ranges, assuming quorum remains.
The Cockroach win is operational: you don’t manually “promote” a new primary. The system self-heals within quorum constraints.
The Cockroach risk is performance predictability: a simple transaction can involve multiple ranges and multiple consensus groups,
especially as your dataset and schema evolve.
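To make that concrete, two statements worth knowing; the database and table names here are placeholders:
-- See how a table is split into ranges and where the leaseholders currently sit.
SHOW RANGES FROM TABLE orders;
-- Ask for five replicas per range when surviving two simultaneous failures matters more than write latency.
ALTER DATABASE app CONFIGURE ZONE USING num_replicas = 5;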
RPO/RTO in real terms
If you’re buying HA, you’re buying RPO and RTO, not vibes.
- Postgres async replication: RPO can be non-zero (you can lose the last few seconds of commits on failover);
RTO depends on orchestration and client routing.
- Postgres sync replication: RPO can approach zero if configured correctly, but commit latency rises and you can stall
if synchronous standbys are unavailable (see the per-transaction sketch below).
- CockroachDB: RPO is typically near zero within quorum; RTO is often fast for node failures, but partitions and
overloaded clusters can lead to “available but sad” behavior: timeouts, retries, and degraded throughput.
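On the Postgres side there is a useful middle ground: synchronous_commit can be set per transaction, so you pay the synchronous-replication latency only where the RPO actually matters. A sketch, with hypothetical table names and values:
-- Durable path: wait for the synchronous standby before reporting success.
BEGIN;
SET LOCAL synchronous_commit = 'on';
INSERT INTO payments (order_id, amount) VALUES (42, 19.99);
COMMIT;

-- Low-value path: accept a small RPO for this write and skip the standby round trip.
BEGIN;
SET LOCAL synchronous_commit = 'local';
INSERT INTO page_views (url) VALUES ('/pricing');
COMMIT;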
One quote to keep you honest, usually attributed to Werner Vogels: “Everything fails, all the time.”
Latency, tail latency, and why “multi-region” is a trap word
HA conversations get romantic about geography. “Active-active across regions.” “Global writes.”
Then physics shows up, uninvited, like the auditor you forgot to CC.
With Postgres, the write path is local to the primary. If you put the primary in one region and run app servers in another,
your p99 latency will make a mockery of your roadmap. That’s not Postgres’ fault; it’s yours.
HA in Postgres usually means “keep the primary close to the writers; keep replicas close to readers; fail over when needed.”
With CockroachDB, multi-region is part of the design, but it’s not magic. A strongly consistent write needs a quorum.
If your replicas are spread across distant regions, the quorum round-trip time becomes your baseline write latency.
The system can place leaders and replicas to optimize locality, but you still have to choose your constraints:
latency vs survivability vs consistency model.
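CockroachDB’s multi-region SQL makes those constraints explicit. A sketch, assuming nodes were started with matching --locality flags; the database, tables, and region names are illustrative:
-- Pin the database to a home region, then add the regions it must survive in.
ALTER DATABASE app SET PRIMARY REGION "us-east1";
ALTER DATABASE app ADD REGION "europe-west1";
-- Rows follow a hidden region column; reads and writes stay local when the app respects locality.
ALTER TABLE user_profiles SET LOCALITY REGIONAL BY ROW;
-- Read-mostly reference data: fast reads everywhere, slower writes.
ALTER TABLE countries SET LOCALITY GLOBAL;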
Dry-funny joke #2: Multi-region strong consistency is the only way to turn “speed of light” into a recurring monthly bill.
Tail latency is where databases go to die
Median latency is nice for dashboards. Tail latency is what users experience.
Distributed systems tend to widen the tail because more components participate in each operation.
CockroachDB can be excellent, but if your transaction touches multiple ranges across nodes,
you have more opportunities for queueing, contention, and slow replicas.
Postgres has its own tail issues—autovacuum stalls, lock queues, checkpoint spikes—but the scope is often easier to bound:
one primary’s resource constraints and one WAL stream.
Failure modes: the stuff that wakes you up
PostgreSQL failure modes (and what they feel like)
- Split brain: two nodes believe they are primary. Symptoms: divergent timelines, conflicting writes, “cannot connect” flapping.
Root causes: bad fencing, misconfigured failover manager, network partitions, overly clever scripts.
- Replication lag: replicas fall behind. Symptoms: stale reads, read-after-write surprises, failover would lose data.
Root causes: slow I/O, network throttling, long transactions, WAL bursts during batch jobs, undersized replicas.
- Checkpoint / WAL pressure: latency spikes. Symptoms: periodic write stalls, I/O saturation, rising fsync times.
Root causes: tiny shared_buffers with high churn, aggressive checkpoint settings, slow storage, too many dirty buffers.
- Autovacuum debt: everything gets slower, then falls off a cliff. Symptoms: bloat, rising query times, wraparound warnings.
Root causes: vacuum thresholds not tuned for high churn, long-running transactions, insufficient autovacuum workers.
- Failover with incompatible state: the promoted replica is missing extensions, configs, or has different parameters.
Symptoms: app errors post-failover, unexpected query plans, missing roles.
Root cause: snowflake servers and config drift.
CockroachDB failure modes (and what they feel like)
- Hot ranges / hotspots: a keyspace becomes a traffic magnet. Symptoms: high latency for a subset of operations, CPU spikes on a few nodes.
Root causes: monotonically increasing keys, single-row contention, poor schema design for distribution.
- Transaction retries: the app sees serialization errors or retryable failures. Symptoms: increased latency, bursty timeouts, rising retries.
Root causes: high contention, long transactions, conflicting writes, insufficient backoff or retry logic in clients.
- Replica under-replication / quorum loss: a range cannot reach quorum. Symptoms: unavailability for some data, stalled writes.
Root causes: too many node failures, a mis-sized cluster, maintenance done like a demolition derby.
- Cross-region quorum latency: writes slow globally. Symptoms: p95/p99 jumps correlated with region RTT.
Root cause: replica placement policies that force quorums across distant regions.
- Background rebalancing interference: the cluster is “healing” while you’re trying to serve traffic.
Symptoms: sustained disk and network load, higher latency, more variance.
Root causes: decommissioning nodes under load, sudden topology changes, insufficient headroom.
Notice the theme: Postgres failures are often about orchestration and single-node resource ceilings.
Cockroach failures are often about coordination and distribution side effects.
Practical tasks: commands, outputs, and decisions (12+)
Below are real operational tasks with commands you can run, what the output means, and the decision you make.
They’re split across Postgres and CockroachDB, with a few OS-level sanity checks. No fluff; this is what on-call actually does.
PostgreSQL: identify the primary and replication health
Task 1: Is this node a primary or a standby?
cr0x@server:~$ psql -XAtc "select pg_is_in_recovery();"
f
Meaning: f means this node is not in recovery: it’s acting as primary. t would mean standby.
Decision: If you expected a standby but got f, stop and check failover status; you might be in split-brain territory.
Task 2: Check replication lag in bytes on the primary
cr0x@server:~$ psql -Xc "select application_name, client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, pg_wal_lsn_diff(sent_lsn,replay_lsn) as byte_lag from pg_stat_replication;"
application_name | client_addr | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | byte_lag
------------------+-------------+---------+-----------+-----------+-----------+------------+-----------
replica1 | 10.0.1.21 | streaming | 3/9A4F2C8 | 3/9A4F2C8 | 3/9A4F2C8 | 3/9A4F2C8 | 0
 replica2         | 10.0.2.37   | streaming | 3/9A4F2C8 | 3/9A4F2C8 | 3/9A4F2C8 | 3/9A4E100 | 4552
Meaning: byte_lag is how far behind each replica is relative to what the primary has sent.
Small lag is normal; persistent growth means the replica can’t keep up.
Decision: If lag is growing, avoid failing over to the lagging replica; investigate replica I/O, CPU, and network.
Task 3: On a standby, measure replay delay in time terms
cr0x@server:~$ psql -Xc "select now() - pg_last_xact_replay_timestamp() as replay_delay;"
replay_delay
--------------
00:00:02.184
Meaning: If this grows into minutes under normal load, your “read scaling” is producing stale reads.
Decision: Either stop sending latency-sensitive reads here, or fix the bottleneck (often disk).
Task 4: Check for long-running transactions blocking vacuum and bloat control
cr0x@server:~$ psql -Xc "select pid, usename, state, now()-xact_start as xact_age, left(query,80) as query from pg_stat_activity where xact_start is not null order by xact_age desc limit 5;"
pid | usename | state | xact_age | query
------+--------+--------+------------+--------------------------------------------------------------------------------
8421 | app | active | 00:42:11 | update orders set status='paid' where id=$1 returning id
9173 | app | idle in transaction | 01:13:02 | select * from customers where id=$1
Meaning: “idle in transaction” for an hour is a red flag; it can prevent cleanup and cause bloat.
Decision: Fix the app connection handling; consider idle_in_transaction_session_timeout and kill offenders during incident.
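Two concrete follow-ups, using the pid from the output above; the timeout value is an assumption, not a recommendation:
-- Cap how long a session may sit idle inside a transaction.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
-- During the incident, terminate the specific offender (pid 9173 from the output above).
SELECT pg_terminate_backend(9173);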
Task 5: Find lock contention quickly
cr0x@server:~$ psql -Xc "select a.pid, now()-a.query_start as age, a.state, left(a.query,70) as query, l.mode, l.granted from pg_locks l join pg_stat_activity a on a.pid=l.pid where a.datname=current_database() order by l.granted, age desc limit 12;"
pid | age | state | query | mode | granted
------+----------+--------+-------------------------------------------------------------------+---------------------+---------
5211 | 00:01:12 | active | alter table invoices add column tax_region text | AccessExclusiveLock | f
4182 | 00:01:08 | active | select * from invoices where account_id=$1 order by created_at | AccessShareLock | t
Meaning: A waiting AccessExclusiveLock blocks basically everything on that table.
Decision: Abort the DDL if it’s unsafe; reschedule with CONCURRENTLY options or online migration patterns.
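Two habits that keep this particular queue from forming, sketched against the invoices table from the output above (the index name is made up):
-- Give DDL a short lock_timeout so a blocked ALTER fails fast instead of queueing everyone behind it.
SET lock_timeout = '2s';
ALTER TABLE invoices ADD COLUMN tax_region text;
-- Build the supporting index without holding the table hostage (cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY IF NOT EXISTS invoices_tax_region_idx ON invoices (tax_region);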
Task 6: Check checkpoint pressure and buffer behavior
cr0x@server:~$ psql -Xc "select checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend, maxwritten_clean, checkpoint_write_time, checkpoint_sync_time from pg_stat_bgwriter;"
checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_backend | maxwritten_clean | checkpoint_write_time | checkpoint_sync_time
-------------------+-----------------+--------------------+-----------------+------------------+-----------------------+----------------------
102 | 47 | 9231142 | 312991 | 25013 | 1882210 | 402113
Meaning: High checkpoints_req relative to timed checkpoints suggests WAL pressure forcing extra checkpoints.
Decision: Tune checkpoint settings and evaluate storage throughput; consider spreading write bursts and increasing max_wal_size.
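Illustrative starting points only; the right numbers depend on your WAL volume and storage, so treat these as assumptions to validate:
ALTER SYSTEM SET max_wal_size = '16GB';               -- more WAL headroom before a forced checkpoint
ALTER SYSTEM SET checkpoint_timeout = '15min';        -- fewer, larger checkpoints
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint writes across the interval
SELECT pg_reload_conf();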
Task 7: Validate WAL archiving is actually working (for PITR)
cr0x@server:~$ psql -Xc "select archived_count, failed_count, last_archived_wal, last_archived_time, last_failed_wal, last_failed_time from pg_stat_archiver;"
archived_count | failed_count | last_archived_wal | last_archived_time | last_failed_wal | last_failed_time
---------------+--------------+-------------------+----------------------------+-----------------+-----------------
91822 | 0 | 00000001000000030000009A | 2025-12-30 02:18:12.104+00 | |
Meaning: failed_count=0 is what you want. Failed archiving means your recovery story is fiction.
Decision: If failures exist, stop pretending you have PITR; fix permissions, storage, or archiver command before the next incident.
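If archiving is not set up at all, the switches are small; the archive script path below is a placeholder for whatever ships WAL to durable storage (pgBackRest and WAL-G are common real-world choices):
ALTER SYSTEM SET archive_mode = 'on';                               -- requires a restart
ALTER SYSTEM SET archive_command = '/usr/local/bin/ship_wal %p %f'; -- placeholder script, not a real tool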
PostgreSQL: failover posture and client routing
Task 8: Check synchronous replication configuration
cr0x@server:~$ psql -Xc "show synchronous_commit; show synchronous_standby_names;"
synchronous_commit
--------------------
on
synchronous_standby_names
--------------------------
FIRST 1 (replica1,replica2)
Meaning: Commits wait for one standby. If both standbys are unhealthy, writes can stall.
Decision: If you’re in an outage and need writes to proceed, you may temporarily relax sync settings—but document it and revert.
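The emergency lever, if you consciously accept the RPO risk; the standby names match the output above:
-- Let commits proceed without waiting on standbys. Log it, time-box it, revert it.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
-- Revert once the standbys are healthy again.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1,replica2)';
SELECT pg_reload_conf();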
Task 9: Confirm your app is pointing at the right endpoint (PgBouncer example)
cr0x@server:~$ psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -Xc "show clients;"
type | user | database | state | addr | port | local_addr | local_port | connect_time
------+------+----------+--------+------------+-------+------------+------------+----------------------------
C | app | prod | active | 10.4.7.19 | 49212 | 10.4.2.10 | 6432 | 2025-12-30 02:18:55.911+00
Meaning: You see active clients and their source addresses. Useful during failover: are clients even reaching the proxy?
Decision: If clients aren’t connecting, the DB may be fine and the routing layer is broken (or blocked by firewall/DNS).
CockroachDB: cluster health and distribution sanity
Task 10: Check node status and liveness
cr0x@server:~$ cockroach node status --host localhost:26257
id | address | build | started_at | is_live | replicas | cpu | mem | ssd | version
-----+----------------+---------------+-----------------------+---------+----------+------+-----+------+---------
1 | 10.0.1.10:26257| v24.1.3 | 2025-12-30 00:11:02 | true | 2210 | 0.42 | 64G | 30% | 24.1
2 | 10.0.2.10:26257| v24.1.3 | 2025-12-30 00:11:08 | true | 2198 | 0.55 | 64G | 29% | 24.1
3 | 10.0.3.10:26257| v24.1.3 | 2025-12-30 00:10:59 | false | 2175 | 0.00 | 64G | 31% | 24.1
Meaning: Node 3 is not live. With a 3-replica setup, losing one node is usually survivable, but you’re now one failure from trouble.
Decision: Pause any maintenance; restore node health or add capacity before doing anything “clever.”
Task 11: Verify replication health quickly
cr0x@server:~$ cockroach node status --ranges --host localhost:26257
id | ranges | underreplicated | unavailable | leader_ranges
-----+--------+-----------------+-------------+--------------
1 | 5600 | 12 | 0 | 1850
2 | 5578 | 15 | 0 | 1902
3 | 5489 | 301 | 2 | 0
Meaning: Underreplicated and unavailable ranges indicate data placement issues and potential unavailability.
Decision: If unavailable is non-zero, treat it as a user-facing incident. Restore quorum (bring node back, or reconfigure).
Task 12: Identify expensive statements and retries (SQL introspection)
cr0x@server:~$ cockroach sql --host localhost:26257 -e "select app_name, query, count(*) as execs, sum(retries) as retries from crdb_internal.statement_statistics where aggregated_ts > now() - interval '10 minutes' group by app_name, query order by retries desc limit 5;"
app_name | query | execs | retries
----------+---------------------------------------------+-------+---------
api | update accounts set balance = balance + $1 | 1200 | 340
Meaning: High retries for an update suggests contention or hot keys.
Decision: Reduce contention: redesign the write pattern (shard counters, avoid single-row hot spots), shorten transactions, add retry backoff.
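If the write pattern cannot change today, at least make the retries orderly. CockroachDB documents a SQL-level retry protocol built around a specially named savepoint; a minimal sketch using the statement from the output above (the literal values are made up):
BEGIN;
SAVEPOINT cockroach_restart;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
RELEASE SAVEPOINT cockroach_restart;
COMMIT;
-- On a retryable error (SQLSTATE 40001), the client issues
-- ROLLBACK TO SAVEPOINT cockroach_restart; then re-runs the statements, with bounded backoff.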
Task 13: Check range distribution and suspect hotspots
cr0x@server:~$ cockroach sql --host localhost:26257 -e "select range_id, replicas, lease_holder, queries_per_second from crdb_internal.ranges_no_leases order by queries_per_second desc limit 5;"
range_id | replicas | lease_holder | queries_per_second
----------+----------+--------------+-------------------
90211 | {1,2} | 1 | 892.2
90197 | {1,2,3} | 2 | 521.7
Meaning: A range with very high QPS can become a bottleneck; also note ranges with only two replicas.
Decision: If hotspots persist, revisit schema and key distribution; if replica count is low, address node loss and re-replication.
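When the heat comes from a monotonically increasing key, the usual cures are non-sequential keys or hash-sharded indexes (the latter needs a reasonably recent CockroachDB version). A sketch with made-up table and column names:
-- Option 1: spread inserts across ranges by not generating ordered primary keys.
CREATE TABLE cart_events (
    id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload JSONB
);
-- Option 2: keep the ordered column, but shard its index across buckets.
CREATE INDEX ON cart_events (created_at) USING HASH;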
OS and storage checks that affect both systems
Task 14: Check disk latency and saturation (iostat)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db01) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.33 0.00 5.41 9.88 0.00 72.38
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 120.0 980.0 6400.0 51200.0 18.40 0.78 92.10
Meaning: %util near 100% plus high await means the disk is a bottleneck.
Decision: Stop blaming “the database.” Reduce write pressure, add IOPS, move to faster storage, or add nodes (Cockroach) / tune checkpoints (Postgres).
Task 15: Confirm filesystem free space (add df -i if you suspect inode exhaustion)
cr0x@server:~$ df -h /var/lib/postgresql /var/lib/cockroach
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p2 900G 812G 43G 95% /var/lib/postgresql
/dev/nvme1n1p1 900G 701G 153G 83% /var/lib/cockroach
Meaning: 95% usage is flirting with disaster; Postgres especially hates running out of disk mid-WAL.
Decision: If above ~85–90% sustained, treat as urgent: expand volume, prune, or move data before it becomes downtime.
Fast diagnosis playbook
When the database “is slow,” you don’t have time for philosophy. You need a fast, repeatable triage that narrows the bottleneck.
Here’s a pragmatic first/second/third sequence for both systems.
First: Is it the database, or the path to it?
- Check client connection errors/timeouts in app logs. If it’s DNS/LB/firewall, the DB can be innocent.
- Verify the endpoint resolves to the expected node(s). Misrouted traffic is a classic after failover.
- Check basic OS health: CPU steal, disk full, disk latency.
Second: Is the system blocked (locks/contended) or saturated (I/O/CPU)?
- Postgres: look at locks, long transactions, and I/O wait; check replication lag if reads are stale.
- CockroachDB: look at retries, hotspots (ranges), and under-replication; check node liveness and range availability.
Third: Is this a topology/HA event?
- Postgres: confirm exactly one primary, confirm replicas are connected, confirm failover manager state.
- CockroachDB: confirm quorum health, under-replication, and whether rebalancing is fighting your workload.
The discipline is to avoid chasing query plans before you’ve confirmed the cluster isn’t simply starving for disk or stuck behind a lock.
Most “database performance” incidents are actually “coordination and storage incidents wearing a SQL mask.”
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption (Postgres failover)
A mid-sized SaaS company ran Postgres with a primary in one zone and a streaming replica in another.
They had a failover manager and a VIP. Everyone felt safe. They even wrote “RPO near zero” in an internal doc, which is how you tempt fate.
One afternoon, the primary’s storage started timing out. The failover manager promoted the replica. The VIP moved.
The app recovered quickly—until customer support noticed “missing” recent orders. Not many, but enough to matter.
The team assumed synchronous replication was enabled because “we set it up months ago.”
It turned out the sync settings were applied on the old primary but never rolled out consistently.
Worse, the app used read replicas for some “confirmation screens,” and those were now showing a timeline that didn’t match the new primary.
The failover itself was correct. The assumption about RPO was not.
Recovery required painful triage: compare order IDs and timestamps, reconcile from upstream logs, and explain to the business why “highly available”
didn’t mean “no data loss.” The real fix was boring: configuration management, enforced parameter parity,
and a clear policy of which queries may go to replicas.
After the incident, they changed their playbook: failover is not “done” until they verify replication mode, replica lag, and application routing.
They also started writing explicit RPO/RTO targets per feature, not per database.
Mini-story 2: The optimization that backfired (CockroachDB hotspot)
An e-commerce team migrated a write-heavy cart service to CockroachDB for automatic failover across three zones.
Early results were fine. Latency looked stable. Then peak season arrived and the checkout path started producing retries and timeouts.
The team had “optimized” primary keys to be sequential for index locality, reasoning from years of Postgres experience.
In CockroachDB, that created a write hotspot: inserts hammered a small set of ranges because the keys were monotonically increasing.
Range splits helped, but leadership for the hottest ranges concentrated on a couple of nodes and became CPU-bound.
The on-call saw that nodes were live and replication was healthy. Yet latency climbed and the app retried aggressively,
turning contention into a self-inflicted denial of service. The system was available, technically. Users still couldn’t check out reliably.
The fix was counterintuitive for a Postgres brain: change key distribution (use random or hash-sharded prefixes),
and redesign “single-row counters” into sharded counters. They also implemented sane retry backoff and capped concurrency in the hottest code paths.
They kept CockroachDB, but they stopped trying to make it behave like a single-node database. That was the actual migration.
Mini-story 3: The boring but correct practice that saved the day (Postgres PITR)
A B2B platform ran Postgres with streaming replication and a strict PITR regimen: daily base backups,
WAL archiving validated every day, and quarterly restore drills. Nobody loved doing the drills. That was the point.
An engineer ran a migration that dropped a column in the wrong schema. The app didn’t crash immediately; it failed gradually as certain jobs ran.
The replica faithfully replicated the mistake, because replicas are obedient like that.
The team faced a familiar fork: try to patch forward under pressure, or roll back to a known-good point in time.
Because they had reliable WAL archives and rehearsed steps, they restored to a timestamp minutes before the migration and replayed safe changes.
The outage was measured in tens of minutes, not hours. The postmortem was mercifully short.
The heroes were not the people who wrote clever scripts; it was the people who tested restores when nothing was on fire.
HA reduced downtime from node failures. PITR saved them from themselves. Different tools. Both mandatory.
Common mistakes: symptoms → root cause → fix
1) Symptom: after failover, writes succeed but reads show old data
Root cause: application still reads from a lagging replica, or your read/write routing didn’t update cleanly.
Fix: enforce routing through a single endpoint with role awareness; add read-after-write routing rules; monitor replica replay delay.
2) Symptom: Postgres primary becomes “slow every few minutes”
Root cause: checkpoint spikes or WAL flush pressure due to storage limits and aggressive checkpoint configuration.
Fix: increase max_wal_size, tune checkpoint settings, and move WAL/data to faster storage; verify fsync latencies.
3) Symptom: Postgres replicas fall behind during batch jobs
Root cause: WAL bursts + replica I/O bottleneck, often compounded by long-running transactions delaying cleanup.
Fix: throttle batch writes, increase replica resources, and eliminate long transactions; consider logical partitioning or scheduling jobs off-peak.
4) Symptom: CockroachDB shows “healthy nodes” but app times out
Root cause: transaction contention and retries, frequently from hot keys or overly chatty transactions.
Fix: redesign hot spots (sharded keys, randomization), shorten transactions, implement bounded retries with backoff, and reduce lock contention patterns.
5) Symptom: CockroachDB becomes unstable during node maintenance
Root cause: insufficient headroom; rebalancing and re-replication compete with production workload.
Fix: add capacity, drain/decommission slowly, and avoid taking down multiple nodes; schedule maintenance windows with traffic-aware limits.
6) Symptom: Postgres HA “works,” but you occasionally get two primaries
Root cause: split brain due to unreliable leader election, no fencing, or network partition behavior not modeled.
Fix: implement proper quorum/lock store (etcd/consul), reliable fencing (STONITH where appropriate), and test partition scenarios.
7) Symptom: “We can’t restore quickly” despite having backups
Root cause: backups were never restored in practice; WAL archives missing; credentials/permissions stale.
Fix: schedule restore drills, verify archiving daily, and automate validation; treat restores as a production feature.
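For what it’s worth, the recovery settings at the heart of a restore drill are short. A sketch of a point-in-time restore on a scratch server; the fetch script and the timestamp are placeholders:
# In postgresql.conf (or postgresql.auto.conf) on the scratch server, after restoring the base backup:
restore_command = '/usr/local/bin/fetch_wal %f %p'   # placeholder: copies archived WAL segment %f to %p
recovery_target_time = '2025-12-30 01:55:00+00'
recovery_target_action = 'promote'                   # or 'pause' to inspect before promoting
# Then create an empty recovery.signal file in the data directory and start the server.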
Checklists / step-by-step plan
Decision checklist: Postgres HA vs CockroachDB
- Define RPO/RTO per service. If you can’t write it down, you’re guessing.
- Measure your write locality needs. Where do writes happen? If “everywhere,” prove it with traces.
- Classify workload: high contention OLTP, append-heavy, mixed read/write, long transactions, background analytics.
- List must-have features: extensions, specific isolation behavior, large objects, specialized indexes, logical decoding, etc.
- Staffing reality: do you have operators who can debug consensus and contention, or a team fluent in Postgres internals?
- Failure drills: can you rehearse region loss and prove recovery in staging with production-like load?
Step-by-step: building “boring HA” on PostgreSQL
- Pick a topology: primary + 2 replicas across zones; decide which are read-only endpoints.
- Implement replication: streaming replication; decide async vs sync based on latency budget.
- Add orchestration: a real failover manager; avoid DIY scripts unless you enjoy archaeology.
- Add routing: stable endpoints for apps; explicit read/write separation if you use replicas.
- Prevent split brain: quorum-based leader election; fencing strategy; test partitions.
- Backups and PITR: base backups + WAL archiving; validate daily; drill restores.
- Observability: replication lag, lock waits, WAL rate, checkpoint timing, disk latency.
- Runbooks: promote, demote, rewind, rebuild replica; document “what not to do” during incident.
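For the “rebuild replica” runbook entry, the core of the job is usually one pg_basebackup invocation; the hostname, role, slot name, and data directory here are placeholders:
cr0x@server:~$ pg_basebackup -h primary.internal -U replicator -D /var/lib/postgresql/16/main \
    -X stream -C -S replica1_slot -R -P
The -R flag writes primary_conninfo and standby.signal into the new data directory, and -C with -S creates and uses a replication slot so the primary keeps the WAL the new replica will need.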
Step-by-step: operating CockroachDB without self-harm
- Start with 3+ nodes across 3 zones (or more for headroom). Don’t run a “just enough” cluster in production.
- Define locality and constraints before you need them. Decide where leases/leaders should live for critical tables.
- Model contention early: identify counters, single-row updates, sequential keys, and “latest record” patterns.
- Implement retries correctly in the application with bounded backoff. Unbounded retries are a distributed denial-of-service feature.
- Watch range health: under-replication, unavailable ranges, leaseholder concentration.
- Plan maintenance around rebalancing: decommission slowly, keep headroom, avoid multi-node drains under load.
- Benchmark with realistic concurrency, not a polite synthetic script that never contends.
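And for the “decommission slowly” item above, the command itself is deliberately unexciting; the node ID and address are placeholders:
cr0x@server:~$ cockroach node decommission 3 --host=localhost:26257
Watch the reported replica count for that node drain to zero before shutting it down; decommissioning one node at a time keeps the rebalancing load bounded.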
FAQ
1) Can Postgres be “active-active” for writes?
Not in the core single-primary model. You can approximate with sharding, application-level routing, or specialized multi-master solutions,
but you’re signing up for conflict management and operational complexity. If you truly need multi-writer with strong consistency,
a consensus-based system may fit better—if you can handle the tradeoffs.
2) Does CockroachDB guarantee zero downtime?
It can survive many node failures without manual promotion, but “zero downtime” depends on quorum, capacity headroom,
and whether your workload triggers contention/retries. It’s resilient, not invincible.
3) What’s the simplest HA Postgres setup that doesn’t hate me later?
Primary + two replicas across zones, a proven failover manager (not cron scripts), a single stable endpoint for writers,
and tested PITR. Keep it boring, automate config parity, and rehearse failovers.
4) Is synchronous replication in Postgres always better?
It reduces RPO, but it increases latency and can stall writes if synchronous standbys are unavailable.
Use it when you can afford the latency and have reliable standby connectivity. Otherwise use async plus strong PITR.
5) Why do CockroachDB apps need retry logic?
Because serializable transactions under contention may be forced to retry to preserve correctness.
If you don’t handle retries properly, you’ll convert transient contention into user-visible errors.
6) Which is easier to debug at 2 a.m.?
Postgres is usually easier if your incident is “one node is slow” or “replica lag,” because the moving parts are fewer.
CockroachDB can be easier for straightforward node failures, but harder for performance pathologies and contention.
7) Can I lift-and-shift a Postgres schema into CockroachDB?
You can migrate a lot of SQL, but don’t assume identical behavior around locks, isolation edge cases, sequences/serial patterns,
and extension availability. The biggest risk is not syntax—it’s workload characteristics like contention and key distribution.
8) What about backups—does CockroachDB eliminate the need for PITR thinking?
No. HA handles node failures; backups handle human failures and logical corruption. You still need tested restore procedures.
Different database, same reality.
9) Should I use read replicas with Postgres if I care about correctness?
Yes, but be explicit. Route only safe read traffic to replicas, or implement read-after-write rules.
If you treat replicas as fully consistent, you’ll eventually ship a bug as a feature.
10) Is CockroachDB always slower because it’s distributed?
Not always, but distributed coordination adds baseline cost, and tail latency can widen under contention.
For some workloads it’s excellent; for single-region low-latency OLTP with heavy writes, Postgres often wins on raw speed.
Conclusion: next steps you can execute this week
PostgreSQL HA is a system you build; CockroachDB HA is a system you join. Both can run production workloads reliably.
Both can ruin your weekend if you treat them as magic.
Here’s what to do next, in order:
- Write down RPO/RTO per service and get business sign-off. You want fewer surprises than your auditors.
- Run a failure drill: kill a node, partition a network path, and measure recovery time end-to-end (including clients).
- Implement the fast diagnosis playbook as a runbook and rehearse it. Speed comes from repetition, not genius.
- Pick one pain to eliminate first: Postgres split-brain prevention, or Cockroach contention/hotspot mitigation.
- Test restores. Not “we have backups,” but “we restored last Tuesday’s backup to a clean environment and verified correctness.”
If you want HA without drama, choose the architecture that matches your organization’s habits.
If you want HA with new kinds of pain, choose the one that matches your organization’s curiosity.
Either way, keep it boring where it counts: backups, drills, and clear ownership.