PostgreSQL vs OpenSearch: the hybrid search setup that actually works

You’ve got a product search box that’s “fine” until it isn’t. Suddenly, customers can’t find what they just created.
Support files tickets. Leadership asks why search results are “random.” Engineering insists it’s “eventual consistency.”
And SRE is stuck in the middle, holding the pager and a half-written postmortem.

The real problem is usually not choosing PostgreSQL or OpenSearch. It’s shipping a hybrid setup with the wrong
boundaries: mixing correctness with relevance, bolting on indexing without backpressure, and treating “search” like a single
feature instead of two systems with two different contracts.

What “hybrid search” actually means (and what it doesn’t)

“Hybrid search” is an overloaded phrase. In vendor slides it often means mixing vector + keyword search. In production
systems it more often means: PostgreSQL remains the system of record, OpenSearch serves search queries, and you
run a pipeline to keep them close enough that users trust the results.

That “close enough” part is the whole job. If you can’t define it—by product expectation and operational budget—you’ll
end up arguing about correctness during incidents, which is a terrible time to discover your definitions were vibes.

Hybrid search is two problems, not one

  • Correctness: “Is this object allowed to be returned? Does it exist? Is the user authorized?”
  • Relevance: “Given allowed objects, which are the best matches and how are they ranked?”

PostgreSQL is very good at correctness and relational constraints. OpenSearch is very good at relevance and retrieval
across large text fields with flexible ranking. When teams try to make one system do both, they pay for it in latency,
complexity, or trust. Usually all three.

Two jokes, because production needs humor (the second one arrives later, when it hurts more)

Joke 1: Search is just a database query with opinions. Unfortunately, those opinions tend to have uptime requirements.

PostgreSQL vs OpenSearch: different contracts, different failure modes

PostgreSQL’s contract

PostgreSQL promises transactional integrity, predictable semantics, and a query planner that will try its best—until you
hand it a query shaped like a cursed artifact. It is your source of truth. It’s where data is written, validated,
deduplicated, and governed.

PostgreSQL can do full-text search. For many workloads it’s enough. But it has limits: stemming and ranking are
less flexible; horizontal scaling is more work; relevance tuning is possible but fiddly; and you’ll feel pain when your
product starts demanding “search like a consumer app” on top of relational schemas.

OpenSearch’s contract

OpenSearch (and its Elasticsearch lineage) is a distributed search engine designed for retrieval and ranking at scale.
It will gladly accept denormalized documents, tokenize text, compute relevance scores, and answer complex search queries
quickly—assuming you feed it stable mappings, control your shard strategy, and keep it from being accidentally treated as
a transactional database.

OpenSearch does not promise transactional consistency with your primary database. It promises near-real-time
indexing and cluster-level durability based on replication, segment merging, and write-ahead logs in its own world.
If you need “read-after-write” correctness across systems, you must design for it.

Interesting facts and context (because history predicts outages)

  1. PostgreSQL’s lineage goes back to the POSTGRES project at UC Berkeley in the 1980s; its bias toward correctness and
    extensibility is not an accident.
  2. Full-text search became a core PostgreSQL feature with the 8.3 release in 2008, and it has matured steadily—but it’s
    still a relational database feature, not a search engine’s core identity.
  3. Lucene (the underlying library behind Elasticsearch and OpenSearch) started around 1999; its design assumptions are
    “documents and inverted indexes,” not “tables and joins.”
  4. Near-real-time search indexing is a deliberate compromise: documents become searchable after refresh, not immediately
    after write, unless you pay for more frequent refreshes.
  5. “Type” removal in Elasticsearch (7.x era) forced many teams to confront schema/mapping design; OpenSearch inherits
    the same lesson: mappings are a contract, and breaking them is expensive.
  6. The common practice of using “application-level soft deletes” (a boolean flag) can silently poison search relevance and
    correctness if filters aren’t enforced consistently in both systems.
  7. The outbox pattern gained popularity as teams got burned by dual-write problems; it’s now a default answer for “keep
    two systems in sync” when you care about not losing writes.
  8. Search clusters fail in weirdly physical ways: disk watermark triggers read-only mode, segment merges saturate IO, and a
    “simple” mapping change can cause massive reindex costs.

If you remember nothing else: PostgreSQL is your ledger. OpenSearch is your catalog. Don’t try to pay taxes with your
catalog.

Decision rules: when to query Postgres, when to query OpenSearch

Use PostgreSQL when

  • You need strict correctness: billing, permissions, compliance, deduplication, idempotency.
  • You need relational constraints and joins that are hard to flatten without corruption risk.
  • You’re filtering and sorting on structured fields with selective indexes.
  • Your “search” is really a filtered list with light text matching and predictable volume.

Use OpenSearch when

  • You need relevance ranking, fuzziness, synonyms, stemming, boosting, “did you mean,” or fielded text queries.
  • You need fast retrieval across large text fields for many users concurrently.
  • You can accept eventual consistency, or you can design UX/flows around it.
  • You need aggregations over huge result sets where a search engine shines.

The hybrid rule that keeps you out of trouble

Use OpenSearch to produce candidate IDs and scores; use PostgreSQL to enforce truth and authorization.
That means your search API often becomes a two-step: OpenSearch query → list of IDs → Postgres fetch (and filter).
When performance matters, you optimize that join carefully. When correctness matters, you never skip it casually.

If you’re tempted to store permissions in OpenSearch and never check Postgres: you are building a security incident with
a nice UI.
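
A minimal sketch of that two-step, shown as raw requests rather than full curl commands. It assumes the items_current read alias and items table that appear in the tasks below, plus a deleted_at column for soft deletes (an assumption); the IDs in step two stand in for whatever step one returned.

Step 1: ask OpenSearch for candidate IDs and scores only, no _source:

POST /items_current/_search
{
  "size": 50,
  "_source": false,
  "query": {
    "bool": {
      "filter": [ { "term": { "tenant_id": "t-123" } } ],
      "must": [ { "match": { "title": "graph api" } } ]
    }
  }
}

Step 2: hydrate from Postgres and enforce truth; anything deleted, hidden, or unauthorized gets dropped here:

select id, title, updated_at
from items
where tenant_id = 't-123'
  and id = any(array[987654, 987321])   -- candidate IDs from step 1
  and deleted_at is null;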

A reference architecture that survives production

The moving parts

  • PostgreSQL: source of truth. Writes happen here. Transactions mean something here.
  • Outbox table: durable record of “things that must be indexed” written in the same transaction as the business write.
  • Indexer worker: reads outbox, fetches full entity state (or a projection), writes to OpenSearch, marks outbox done.
  • OpenSearch: denormalized searchable documents. Tuned mappings. Controlled shard count.
  • Search API: queries OpenSearch for candidates; hydrates from Postgres; returns results.
  • Backfill/reindex job: bulk rebuild index from Postgres snapshots or consistent reads.
  • Observability: lag metrics, error budgets, slow query logs, indexing throughput, and cluster health.

Why outbox beats “just publish an event”

The outbox pattern is the unglamorous friend who shows up on time. You write your row(s) to the business table and an
outbox row in the same database transaction. If the transaction commits, the outbox row exists. If it rolls back,
it doesn’t. That’s the entire point: no lost updates due to “DB commit succeeded but Kafka publish failed,” or the reverse.

A CDC system (logical replication, Debezium, etc.) can also work. But you still need to manage ordering, deletes, schema
changes, and replay semantics. Outbox is simpler to reason about for many teams, particularly when the indexer needs to
compute a projection anyway.
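
A minimal sketch of both sides, assuming the search_outbox columns used in the tasks below (entity_id, attempts, processed_at, created_at) plus the entity_type and operation columns named in the checklist near the end; the items columns are placeholders.

The write side, in one transaction:

begin;
with new_item as (
  insert into items (tenant_id, title)
  values ('t-123', 'Graph API Gateway')
  returning id
)
insert into search_outbox (entity_type, entity_id, operation, created_at)
select 'item', id, 'upsert', now()
from new_item;
commit;

The indexer’s claim query, so two workers never grab the same batch:

begin;
select id, entity_type, entity_id, operation
from search_outbox
where processed_at is null
  and attempts < 10
order by created_at
limit 500
for update skip locked;
-- build projections, bulk-index to OpenSearch, then mark the claimed rows:
-- update search_outbox set processed_at = now() where id = any($1);
commit;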

How to handle deletes without lying

Deletes are where hybrid systems go to die quietly. You must decide:

  • Hard delete in Postgres + delete document in OpenSearch.
  • Soft delete in Postgres + filter in both systems + eventual purge job.

The safest operational posture is: treat Postgres as authoritative, always. If OpenSearch returns a candidate ID that no
longer exists or is deleted, your hydration step drops it. This is less “efficient” but far more “sleepable.”

One reliability quote (paraphrased idea)

Paraphrased idea — John Allspaw: reliability comes from how systems behave under stress, not from how they look on diagrams.

Data modeling: source of truth, denormalization, and why joins don’t belong in search

Design the OpenSearch document as a projection

Your OpenSearch document should be a stable projection of the entity as the user searches it. This usually means:
flattened fields, repeated data (denormalized), and a few computed fields for relevance (like “popularity_score”,
“last_activity_at”, or “title_exact”).

The projection should be generated by code you can version and test. If your indexer is “SELECT * and hope,” you’ll ship
mapping explosions and quality regressions.
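
As a sketch, “code you can version and test” can be as plain as a single projection query; the organizations join, description, and view_count columns here are assumptions standing in for whatever your entity actually denormalizes.

select
  i.id,
  i.tenant_id,
  i.title,
  i.title as title_exact,                        -- indexed as keyword for exact matching
  i.description,
  o.name as organization_name,                   -- denormalized: reindex items when it changes
  i.updated_at as last_activity_at,
  coalesce(i.view_count, 0) as popularity_score
from items i
join organizations o on o.id = i.organization_id
where i.id = any($1);                            -- the batch of entity_ids claimed from the outbox

Every column is an explicit decision, which is exactly what makes the projection reviewable and testable.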

Do not model relational joins in OpenSearch unless you like performance surprises

OpenSearch has join-like features (nested, parent-child). They can be valid. They are also easy to misuse and hard to
operate at scale. In most product search workloads, you’ll be happier denormalizing into a single document per searchable
entity and paying the indexing cost once.

When a related object changes (e.g., organization name), you reindex the impacted documents. This is a known cost. You
budget for it and throttle it. You do not pretend it won’t happen.

Mapping stability is an SRE concern

Treat mappings like database migrations. A sloppy mapping change can trigger:

  • Unexpected field type conflicts (index rejects documents).
  • Implicit dynamic fields that balloon index size and heap pressure.
  • Costly reindexing that competes with production traffic.

Set dynamic mapping carefully. Most production teams should not allow a free-for-all. “But it’s convenient” is how you
end up with an index containing 40 versions of the same field spelled differently.
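
A hedged example of locking that down at index-creation time; the items_v13 name and field list are illustrative, not a schema recommendation. With dynamic set to strict, documents that carry unknown fields are rejected instead of silently mutating the mapping.

PUT /items_v13
{
  "settings": { "number_of_shards": 6, "number_of_replicas": 1 },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "tenant_id": { "type": "keyword" },
      "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
      "title_exact": { "type": "keyword" },
      "last_activity_at": { "type": "date" },
      "popularity_score": { "type": "float" }
    }
  }
}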

Consistency, latency, and the only SLA that matters

Define “search freshness” as a measurable SLO

Your real SLA is: how long after a write until a user can find it via search. This is not the same as
API p99 latency. It’s a pipeline SLO.

You measure it as lag: write timestamp vs indexed timestamp vs searchable timestamp. Then you decide what’s acceptable
(e.g., 5 seconds, 30 seconds, 2 minutes). If product needs “instant,” build the UX to cover the gap: show the new object
directly after creation, bypass search for that flow, and avoid teaching users that search is the only truth.

Refresh intervals: the knob that costs money

Lower refresh interval means documents become searchable faster, but it increases segment churn and IO overhead.
Increasing refresh interval improves indexing throughput and reduces load but makes search less fresh. Set it based on
user expectation and write volume. Also: don’t tune it during an incident unless you understand the downstream effects.
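
For reference, the knob itself is a single settings call (the default is 1s, and -1 disables refresh entirely, which is a backfill trick rather than a steady-state setting); the items_v12 name comes from the tasks below.

PUT /items_v12/_settings
{ "index": { "refresh_interval": "30s" } }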

Correctness guardrails

  • Authorization filtering: enforce in Postgres hydration, or replicate auth rules carefully and test them.
  • Deleted/hidden items: filter everywhere; treat Postgres as final truth.
  • Multi-tenant boundaries: tenant_id must be a first-class filter in OpenSearch queries and a key in Postgres fetch.

Joke 2: Eventual consistency is great until your CEO searches for the thing they created five seconds ago. Then it becomes “an outage.”

Practical tasks (commands + outputs + decisions)

These are not toy commands. They’re the kind you run when search is slow, stale, or lying—and you need to decide what to
do next. Each task includes: command, what the output means, and the decision you make.

Task 1: Check Postgres replication / write pressure (are we falling behind before indexing even starts?)

cr0x@server:~$ psql -d appdb -c "select now(), xact_commit, xact_rollback, blks_read, blks_hit from pg_stat_database where datname='appdb';"
              now              | xact_commit | xact_rollback | blks_read | blks_hit 
-------------------------------+-------------+---------------+-----------+----------
 2025-12-30 18:41:12.482911+00 |    19403822 |         12031 |   3451201 | 98234410
(1 row)

Meaning: If commits spike but cache hits drop (blks_read rises), you’re doing more physical reads—often a sign of IO pressure.

Decision: If IO is hot, pause reindex/backfill jobs, verify indexes, and consider read replicas for hydration workloads.

Task 2: Find Postgres queries dominating time (hydration step often does this)

cr0x@server:~$ psql -d appdb -c "select query, calls, total_time, mean_time, rows from pg_stat_statements order by total_time desc limit 5;"
                       query                        | calls  | total_time | mean_time |  rows  
----------------------------------------------------+--------+------------+-----------+--------
 select * from items where id = $1 and tenant_id=$2  | 982144 |  842123.11 |     0.857 | 982144
 select id from items where tenant_id=$1 and ...     |   1022 |  221000.44 |   216.243 | 450112
 ...
(5 rows)

Meaning: Hydration by primary key should be fast. If it’s dominating total time, you might be doing it too often, or missing an index on (tenant_id, id). (On PostgreSQL 13 and newer, pg_stat_statements exposes these columns as total_exec_time and mean_exec_time.)

Decision: Batch hydration (WHERE id = ANY($1)) and add composite indexes aligned with tenant boundaries.

Task 3: Confirm the critical Postgres index exists and is used

cr0x@server:~$ psql -d appdb -c "explain (analyze, buffers) select * from items where tenant_id='t-123' and id=987654;"
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using items_tenant_id_id_idx on items  (cost=0.42..8.44 rows=1 width=512) (actual time=0.041..0.042 rows=1 loops=1)
   Index Cond: ((tenant_id = 't-123'::text) AND (id = 987654))
   Buffers: shared hit=5
 Planning Time: 0.228 ms
 Execution Time: 0.072 ms
(5 rows)

Meaning: Index Scan + low execution time + buffers hit means hydration is healthy for that path.

Decision: If you see Seq Scan, fix indexing or reduce row width fetched (select needed columns only).

Task 4: Inspect outbox lag (the “freshness SLO” in one query)

cr0x@server:~$ psql -d appdb -c "select count(*) as pending, max(now()-created_at) as max_lag from search_outbox where processed_at is null;"
 pending |   max_lag   
---------+-------------
   18234 | 00:07:41.110
(1 row)

Meaning: 18k pending and max lag ~8 minutes: indexing pipeline is behind. Users will notice.

Decision: Scale indexer workers, check OpenSearch ingest latency, and ensure outbox consumers aren’t stuck on poison pills.

Task 5: Detect poison-pill outbox records (stuck retries)

cr0x@server:~$ psql -d appdb -c "select id, entity_id, attempts, last_error, updated_at from search_outbox where processed_at is null and attempts >= 10 order by updated_at asc limit 10;"
  id  | entity_id | attempts |           last_error            |         updated_at         
------+-----------+----------+---------------------------------+----------------------------
 9981 |  7712331  |       14 | mapping conflict on field tags  | 2025-12-30 18:20:01+00
 ...
(10 rows)

Meaning: Repeated mapping conflict errors are not “transient.” They’re schema bugs.

Decision: Quarantine these records, patch the projection/mapping, and reprocess after a controlled fix.
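
One low-ceremony way to quarantine is a column that the claim query ignores; the quarantined_at column here is an assumption, not something the earlier queries rely on.

-- park the poison pills so healthy records keep flowing
update search_outbox
set quarantined_at = now()
where processed_at is null
  and attempts >= 10
  and last_error like 'mapping conflict%';

-- after the mapping/projection fix, release them for one more pass
update search_outbox
set quarantined_at = null, attempts = 0
where quarantined_at is not null
  and processed_at is null;

The indexer’s claim query then needs an extra “and quarantined_at is null” predicate so parked rows stay parked.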

Task 6: Check OpenSearch cluster health (red/yellow is not “just a color”)

cr0x@server:~$ curl -s http://opensearch.service:9200/_cluster/health?pretty
{
  "cluster_name" : "search-prod",
  "status" : "yellow",
  "number_of_nodes" : 6,
  "active_primary_shards" : 120,
  "active_shards" : 232,
  "unassigned_shards" : 8,
  "initializing_shards" : 0,
  "relocating_shards" : 2
}

Meaning: Yellow means primary shards are allocated but replicas are not fully allocated. You have reduced redundancy; performance may be affected during recovery.

Decision: If yellow persists, investigate shard allocation, disk watermarks, and node capacity before scaling query traffic.

Task 7: Identify indices with too many shards (a classic performance tax)

cr0x@server:~$ curl -s http://opensearch.service:9200/_cat/indices?v
health status index               uuid                   pri rep docs.count store.size
green  open   items_v12           a1b2c3d4e5             24  1   88441211   310gb
green  open   audit_v03           f6g7h8i9j0              6  1   120011223  540gb

Meaning: 24 primary shards for one index might be fine—or might be a shard explosion depending on node count and query patterns.

Decision: If shard count > (nodes * 2–4) for a single hot index, plan a shrink/reindex with sane shard sizing.
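
If shrink is the chosen path, the rough flow is: block writes, co-locate a copy of every shard on one node, then shrink into a factor of the original shard count (24 to 6 works). A sketch using the items_v12 index and the search-n1 node from the outputs above; reindexing into a fresh index behind an alias is the alternative when mappings must change too.

PUT /items_v12/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "search-n1",
    "index.blocks.write": true
  }
}

POST /items_v12/_shrink/items_v12_shrunk
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

Validate counts, repoint the read alias at the shrunk index, then delete the old one.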

Task 8: Check OpenSearch indexing pressure (are merges and refreshes killing ingest?)

cr0x@server:~$ curl -s http://opensearch.service:9200/_nodes/stats/indices?pretty | head -n 30
{
  "cluster_name" : "search-prod",
  "nodes" : {
    "n1" : {
      "name" : "search-n1",
      "indices" : {
        "refresh" : { "total" : 992112, "total_time_in_millis" : 31212344 },
        "merges" : { "current" : 14, "current_docs" : 8812333, "total_time_in_millis" : 92233111 }
      }
    }

Meaning: High merge concurrency and a large current_docs count mean the cluster is spending serious time merging segments—often IO-bound.

Decision: Throttle bulk indexing, increase refresh interval temporarily (if acceptable), and verify disk throughput/queue depth on data nodes.

Task 9: Measure search latency at the engine (is it OpenSearch or your app?)

cr0x@server:~$ curl -s -H 'Content-Type: application/json' http://opensearch.service:9200/items_v12/_search -d '{
  "profile": true,
  "size": 10,
  "query": { "bool": { "filter": [ { "term": { "tenant_id": "t-123" } } ], "must": [ { "match": { "title": "graph api" } } ] } }
}' | head -n 25
{
  "took" : 38,
  "timed_out" : false,
  "hits" : {
    "total" : { "value" : 129, "relation" : "eq" },
    "max_score" : 7.1123,
    "hits" : [
      { "_id" : "987654", "_score" : 7.1123, "_source" : { "title" : "Graph API Gateway" } }
    ]
  }

Meaning: “took: 38” is engine time in milliseconds. If your API p99 is 800ms, your bottleneck is likely hydration, network, serialization, or downstream calls.

Decision: Use this to stop blame ping-pong. Optimize the actual slow tier.

Task 10: Verify mapping for a risky field (avoid silent keyword/text mismatches)

cr0x@server:~$ curl -s http://opensearch.service:9200/items_v12/_mapping?pretty | grep -n -A 5 '"title"' | head
132:         "title" : {
133-           "type" : "text",
134-           "fields" : {
135-             "keyword" : { "type" : "keyword", "ignore_above" : 256 }
136-           }
137-         },

Meaning: title is text with a keyword subfield. That’s standard: use title for full-text, title.keyword for exact matches/sorting (within reason).

Decision: If you’re sorting on title (text), fix your queries; sort on title.keyword or a normalized sort field.
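
A corrected sort, as a sketch, reusing the query shape from Task 9:

POST /items_v12/_search
{
  "size": 10,
  "query": { "bool": { "filter": [ { "term": { "tenant_id": "t-123" } } ], "must": [ { "match": { "title": "graph api" } } ] } },
  "sort": [ { "title.keyword": "asc" } ]
}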

Task 11: Check disk watermarks (search clusters go read-only when storage is tight)

cr0x@server:~$ curl -s "http://opensearch.service:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep -n watermark | head -n 20
412:         "cluster.routing.allocation.disk.watermark.low" : "85%",
413:         "cluster.routing.allocation.disk.watermark.high" : "90%",
414:         "cluster.routing.allocation.disk.watermark.flood_stage" : "95%",

Meaning: At flood_stage, indices may be marked read-only to protect the cluster. Indexing failures will follow.

Decision: If you’re near flood_stage, scale storage, delete old indices, or reduce retention. Don’t “just retry.”

Task 12: Confirm your alias points to the intended index (reindexing mistakes look like relevance bugs)

cr0x@server:~$ curl -s http://opensearch.service:9200/_cat/aliases?v | grep -E 'alias|items'
alias         index     filter routing.index routing.search is_write_index
items_current items_v12 -      -            -              true

Meaning: The write alias and read alias should be intentional. If your app reads items_v11 while indexers write to items_v12, you will “lose” documents in search.

Decision: Fix alias choreography: write to the new index, backfill, then atomically switch read alias.
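
The flip itself should be one atomic _aliases call, not two requests with a gap between them; the index names match the _cat output above.

POST /_aliases
{
  "actions": [
    { "remove": { "index": "items_v11", "alias": "items_current" } },
    { "add": { "index": "items_v12", "alias": "items_current" } }
  ]
}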

Task 13: Inspect OpenSearch thread pools (are you saturating search or write threads?)

cr0x@server:~$ curl -s http://opensearch.service:9200/_cat/thread_pool/search?v
node_name name   active queue rejected completed
search-n1 search      18   120     3421  91822311
search-n2 search      17   110     3302  91011210

Meaning: A high queue and rising rejected counts indicate overload; OpenSearch is refusing work.

Decision: Reduce query cost (filters, fewer shards), add nodes, or implement client-side rate limiting and fallbacks.

Task 14: Validate bulk indexing behavior (are your bulk batches too big?)

cr0x@server:~$ curl -s http://opensearch.service:9200/_cat/nodes?v
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.2.11           92          78  86   12.11   10.42     9.88 dimr      -      search-n1
10.0.2.12           89          76  80   11.22    9.90     9.31 dimr      -      search-n2

Meaning: Heap > ~85–90% sustained is a warning. Bulk ingestion can trigger GC thrash and latency spikes.

Decision: Reduce bulk request size, cap concurrency, and ensure your shard count isn’t forcing too much per-node overhead.

Fast diagnosis playbook

When “search is broken,” you need a path that avoids endless debate. Here’s the order that finds the bottleneck fast.
It’s optimized for hybrid setups where Postgres is truth and OpenSearch is retrieval.

First: is it stale, wrong, or slow?

  • Stale: New/updated objects not appearing.
  • Wrong: Unauthorized/deleted objects appearing, or missing items that should match.
  • Slow: Latency spikes or timeouts.

These are different failure classes. Don’t troubleshoot relevance tuning when your outbox is 20 minutes behind.

Second: establish which tier is slow

  1. Measure engine time (OpenSearch “took”) for the same query.
  2. Measure API time for the request.
  3. Measure hydration query time in Postgres (EXPLAIN ANALYZE).

If OpenSearch is fast but API is slow, the culprit is usually hydration (N+1), permission checks, or serialization.
If OpenSearch is slow but Postgres is fine, you’re in shard/merge/heap territory.

Third: check pipeline lag and error rates

  1. Outbox pending count and max lag.
  2. Indexer error logs (mapping conflicts, 429 rejections, timeouts).
  3. OpenSearch cluster health, disk watermarks, thread pool rejections.

Fourth: decide on a safe mitigation

  • Stale: scale indexers, throttle backfills, fix poison pills, avoid manual refresh spam.
  • Wrong: enforce Postgres truth in hydration; tighten filters; audit alias; validate delete propagation.
  • Slow: reduce query cost, reduce shards hit (routing/filters), cap concurrency, add nodes as last resort.

Common mistakes: symptom → root cause → fix

1) “I created it but search can’t find it.”

Symptom: Newly created items are missing for minutes; direct link works.

Root cause: Indexing lag (outbox backlog), long refresh interval, or indexer throttled by OpenSearch rejections.

Fix: Measure outbox lag; scale consumers; reduce bulk size; ensure OpenSearch isn’t in disk flood-stage; set refresh interval intentionally and communicate freshness SLO.

2) “Search shows items from other tenants.”

Symptom: Cross-tenant leakage in results—usually discovered by an angry customer.

Root cause: Missing tenant_id filter in OpenSearch query or hydration query; alias points to mixed index; caching layer not keyed by tenant.

Fix: Require tenant_id in every query (API contract); add query tests; verify index partitioning strategy; fix caches to include tenant keys.

3) “OpenSearch is fast but the API is slow.”

Symptom: OpenSearch “took” is low (<50ms) but API p99 is high.

Root cause: Hydration N+1 queries, wide SELECT *, permission checks per row, or slow network serialization of giant payloads.

Fix: Batch hydration with WHERE id = ANY($1); fetch minimal columns; precompute permission flags; set hard caps on result sizes and fields returned.

4) “Indexing is failing randomly.”

Symptom: Some documents never index; retries loop; errors look inconsistent.

Root cause: Mapping conflicts due to dynamic fields or inconsistent types (string vs array, integer vs keyword).

Fix: Lock down mappings; validate projection output; quarantine poison pills; reindex with corrected mapping and strict input validation.

5) “We tuned relevance and it got worse.”

Symptom: Result quality degrades after adding boosting/synonyms; support complains about “nonsense” results.

Root cause: Changing analyzers without reindexing, boosting on noisy fields, synonyms too broad, or mixing filter logic into scoring queries.

Fix: Treat analyzer changes as reindex events; validate with offline judgment sets; separate filters (bool.filter) from scoring (bool.must/should).

6) “Cluster turned read-only and indexing died.”

Symptom: Bulk requests start failing; logs mention read-only blocks.

Root cause: Disk flood-stage watermark triggered.

Fix: Free disk (delete old indices), add capacity, reduce retention; then clear read-only blocks after resolving capacity—not before.

7) “Reindex took down search.”

Symptom: Latency spikes and timeouts during backfill; cluster CPU/IO pegged.

Root cause: Reindex competing with live traffic; too much concurrency; refresh interval too low; shard count too high.

Fix: Throttle bulk ingestion; increase refresh interval during backfill; schedule off-peak; isolate indexing nodes if you can; keep shard sizing sane.

Three corporate-world mini-stories from the trenches

Mini-story 1: The incident caused by a wrong assumption

A mid-size B2B platform shipped “instant search” for newly created records. The team assumed the search engine would
behave like the database: once the indexing call returned 200, the document would be searchable. That assumption lived in
a comment, then in a customer promise, then in a sales demo.

On a busy Monday, support reported that users couldn’t find the records they just created. Engineers checked the API logs:
indexing calls were successful. OpenSearch looked green. So the team blamed “client caching” and shipped a cache-busting
patch that did exactly nothing.

The actual issue was refresh behavior combined with load. Under heavy indexing, refreshes were delayed and merges were
expensive. The indexing pipeline was correct, but “searchability” was not immediate. Users were hitting the search box
within seconds of creation, and the system had no UX path to show newly created records outside search.

The fix wasn’t heroic. They defined a freshness SLO and implemented a post-create flow that shows the created record
directly, plus a banner that search may take a short time to catch up. They also adjusted refresh interval and created a
metric for “time to searchable.” The incident ended when everyone agreed on the actual contract: search is near-real-time,
not transactional.

Mini-story 2: The optimization that backfired

Another company tried to save latency by skipping Postgres hydration. The logic: “OpenSearch already has all the fields we
need. Why fetch from Postgres? That’s extra network and extra load.” On paper, it looked great. They even celebrated the
drop in database read QPS.

Then came the subtle bugs. A permissions change in Postgres took time to reach OpenSearch, so users briefly saw results
they shouldn’t. Soft-deleted records lingered. A “hidden” flag wasn’t applied consistently. The product team got reports
of “ghost items” and “private items in search.”

The worst part: the system became hard to reason about. Correctness now depended on OpenSearch being perfectly synced and
the projection being perfect. When it wasn’t, the failure mode was data exposure, not just stale search.

They rolled back to hydration for protected entities and kept a limited “OpenSearch-only” mode for public content. Latency
went up slightly. Incident risk went down a lot. The optimization backfired because it optimized the wrong metric: p95,
not trust.

Mini-story 3: The boring but correct practice that saved the day

A different org ran quarterly reindexing drills. Not because it was fun—because it wasn’t. They treated it like a fire
drill: build a new index version, backfill from Postgres, dual-write, validate samples, then flip the read alias.

One day a mapping change shipped with a bad field type. Indexing started failing for a subset of documents, but the
pipeline didn’t collapse because failures were isolated to specific outbox records. Alerts fired on “outbox poison pill
rate” and “indexing error rate.” The oncall could see exactly what broke and where.

They rolled forward by creating a corrected index version and replaying the outbox from a known offset. Because alias
flipping and backfill were practiced, the team avoided a long outage and avoided manual data surgery. The customers mostly
noticed nothing.

It wasn’t glamorous engineering. It was a checklist, a runbook, and the discipline to test the scary thing before it’s
urgent. That’s the kind of practice that doesn’t get applause—until it saves a weekend.

Checklists / step-by-step plan

Step-by-step: ship a hybrid search that won’t betray you

  1. Define the contract: freshness SLO (“write to searchable”), correctness rules (auth, delete, tenant boundaries),
    and acceptable staleness per feature.
  2. Make Postgres the source of truth: all writes happen there; no exceptions “for performance” without a written risk review.
  3. Create an outbox table: business write transaction also writes an outbox record containing entity_id, entity_type, operation, and timestamps.
  4. Build an indexer worker with idempotency: safe retries, dedupe keys, bounded concurrency, and poison-pill quarantine.
  5. Design the OpenSearch document projection: deterministic, versioned, tested; avoid dynamic mapping surprises.
  6. Use aliases from day one: read alias + write alias, so you can reindex without changing application code.
  7. Implement search query pattern: OpenSearch returns candidate IDs; Postgres hydrates authoritative records and filters.
  8. Build a backfill job: bulk index from Postgres, throttled, resumable, observable. Plan for it; you will need it.
  9. Instrument pipeline lag: outbox lag, indexing throughput, error rates; page on sustained freshness SLO violations.
  10. Load test realistically: include indexing, merges, refresh, and production-like query mixes—especially tenant filters and aggregations.
  11. Practice reindex drills: at least quarterly, and after any analyzer/mapping changes that require reindex.
  12. Write the “break-glass” mode: if OpenSearch is degraded, fall back to Postgres FTS for limited functionality or return partial results with clear UX.

Checklist: before you blame OpenSearch for slow search

  • Compare OpenSearch “took” to API latency.
  • Check Postgres hydration query patterns (N+1?) and indexes.
  • Verify you’re not selecting huge payloads unnecessarily.
  • Confirm tenant filters and permission logic aren’t doing per-row calls.

Checklist: before you run a reindex in production

  • Is shard sizing reasonable for the new index?
  • Do you have enough disk headroom for parallel indices?
  • Is refresh interval tuned for backfill?
  • Do you have rate limits and backpressure from OpenSearch rejections?
  • Do you have a validation plan (samples, counts, spot checks)?
  • Is the alias flip atomic and rehearsed?

FAQ

1) Can PostgreSQL full-text search replace OpenSearch entirely?

Sometimes. If your search is mostly structured filtering with modest text matching, Postgres FTS is simpler and more
correct by default. If you need sophisticated relevance, fuzziness, multi-field boosting, or very high concurrency on
large corpora, OpenSearch earns its keep.

2) Why not store everything in OpenSearch and stop using Postgres for reads?

Because you’ll eventually rediscover transactions, constraints, and auditing the hard way. OpenSearch is not built to be
your ledger. If you skip Postgres checks for authorization and deletions, you’re betting your security posture on
eventual consistency and perfect projections.

3) Outbox vs CDC: which is better?

Outbox is simpler to reason about and test in application code. CDC is powerful when you need broad capture across many
tables and services. For search indexing, outbox often wins because you typically need a projection anyway, plus you want
explicit control over what gets indexed and when.

4) Should I hydrate results from Postgres every time?

For protected or mutable entities, yes—at least for a correctness gate. For public, low-risk content you might return
OpenSearch sources directly, but you need tight discipline around deletions, hiding, and tenant boundaries.

5) How do I handle “search right after create” UX?

Don’t promise immediate searchability unless you’re prepared to pay for it (and even then, distributed systems have
edges). After creating an object, send the user to the object page, show it in “recently created,” or pin it in UI until
it becomes searchable.

6) What’s the most common cause of indexing failures?

Mapping conflicts from inconsistent field types, usually caused by dynamic mapping and sloppy projections. Second place:
disk watermarks turning indices read-only. Third: bulk request sizing and concurrency triggering rejections.

7) How do I avoid shard-related performance problems?

Keep shard count aligned with your node count and expected data size. Avoid tiny shards (overhead) and giant shards
(recovery pain). Use aliases and reindex when you outgrow the initial guess. Shards are not free; they cost heap and
operational complexity.

8) Do I need separate OpenSearch clusters for indexing and querying?

Not always, but separation helps when you have heavy backfills, frequent reindexing, or sharp traffic spikes. At minimum,
control indexing throughput with backpressure so merges don’t starve query latency. If your business depends on search,
consider architecture that isolates blast radius.

9) How do I test relevance without breaking production?

Build an offline judgment set (queries + expected good results). Run A/B evaluations on snapshots. Deploy relevance
changes behind feature flags or index versions. Most relevance “bugs” are changes without baselines.

10) What’s the best fallback when OpenSearch is degraded?

A limited Postgres-based search for essential workflows (by exact match, prefix, or constrained FTS) is often good
enough. The key is to define what “degraded mode” means and keep it within the database’s capacity.
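
A sketch of that degraded mode in plain Postgres, assuming a deleted_at column and a GIN index on the same tsvector expression (without the index, this will not survive real traffic):

-- create index concurrently items_title_fts_idx
--   on items using gin (to_tsvector('english', title));

select id, title, updated_at
from items
where tenant_id = $1
  and to_tsvector('english', title) @@ plainto_tsquery('english', $2)
  and deleted_at is null
order by updated_at desc
limit 25;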

Next steps you can actually do this week

If you already have Postgres and OpenSearch in production, your goal isn’t “perfect search.” It’s predictable search.
Here are practical moves that pay back quickly:

  1. Add freshness metrics: outbox lag, indexer throughput, “time to searchable,” and indexing error rates.
  2. Audit your tenant/auth filters: enforce them in one place (preferably hydration) and test them like security controls.
  3. Find and kill N+1 hydration: batch fetch by IDs and fetch only required columns.
  4. Lock down mappings: stop accidental dynamic fields from ballooning and causing conflicts.
  5. Practice alias flipping: rehearse a reindex, even if you don’t need one today. You will.
  6. Write a one-page contract: what’s fresh, what’s correct, and what happens during degradation.

Hybrid search works when you stop trying to make one system be two things. Let Postgres be correct. Let OpenSearch be
relevant. Then build a pipeline and an API that admit reality and still serve users well.
