OpenSearch vs PostgreSQL — Hybrid Search Without the Pain


You didn’t set out to run “Search Platform Engineering.” You just wanted your product’s search box to stop embarrassing you:
users can’t find things they know exist, latency spikes during batch updates, and every “quick relevance tweak” turns into a weekend.

The usual fork in the road is deceptively simple: “Can we keep search in PostgreSQL?” versus “Do we need OpenSearch?”
The real answer is almost always: “Use both, but only if you’re disciplined about where truth lives and how you operate the pipeline.”

The decision you actually need to make

“OpenSearch vs PostgreSQL” is the wrong framing. You’re really deciding:

  • Where does the source of truth live? (Almost always PostgreSQL.)
  • What kind of queries must be fast? Keyword, phrase, prefix, fuzzy, faceting, vector similarity, hybrid ranking, geo, aggregations.
  • What failure is acceptable? “Search is stale for 5 minutes” is often acceptable. “Checkout breaks” is not.
  • How many operational knobs can your team afford? PostgreSQL can do search, but the moment you need serious relevance tuning and hybrid retrieval at scale, OpenSearch earns its keep.

Here’s the blunt guidance:

  • Use PostgreSQL-only when: dataset is modest, query patterns are simple, and search correctness matters more than fancy relevance. You want fewer moving parts.
  • Use OpenSearch-only when: your product can tolerate eventual consistency, and search is your product. You’ll still keep a database somewhere, but you won’t pretend it’s the search engine.
  • Use a hybrid (PostgreSQL + OpenSearch) when: you need rich ranking, aggregations, or vector search at scale, but you also need transactional truth and sane auditing.

One idea worth keeping on a sticky note when people ask for “just one system”: Werner Vogels, Amazon’s CTO, has long argued that you buy availability by accepting eventual consistency wherever the domain can tolerate it.
Search is one of the few places where that trade is usually rational.

Historical facts that matter more than vendor slides

These aren’t trivia. They explain why each tool behaves the way it does under pressure.

  1. PostgreSQL full-text search (tsvector/tsquery) has been in core since version 8.3 in 2008, and it’s optimized for transactional workloads first, search second.
  2. Lucene’s inverted index model (the basis for Elasticsearch/OpenSearch) was designed for fast retrieval and scoring, not ACID updates per row.
  3. GIN indexes in Postgres are brilliant for set membership (like token presence) but can become maintenance-heavy when your text fields churn constantly.
  4. Elasticsearch fork history matters operationally: OpenSearch was forked from Elasticsearch 7.10 in 2021 after Elastic’s license change; many teams inherited Elasticsearch operational habits and applied them unchanged—sometimes correctly, sometimes disastrously.
  5. Search engines treat updates as “reindex” operations internally: a “document update” is often a delete + add. That’s why frequent small updates can create segment churn and merge pressure.
  6. BM25 became the default relevance workhorse for many modern search stacks; PostgreSQL’s ranking functions are useful but less flexible in real-world tuning workflows.
  7. Vector search went mainstream fast: pgvector and kNN in OpenSearch emerged because “keyword search” stopped being enough once embeddings became cheap.
  8. Distributed shards are a tax, not a feature: OpenSearch scaling via shards gives you throughput, but also failure modes (hot shards, skew, relocation storms) that a single Postgres instance doesn’t invent.

What each system is good at (and bad at)

PostgreSQL search: what it nails

  • Transactional correctness: search results reflect committed data when you query the same database. No sync lag unless you add it.
  • Simple deployment: one engine, one backup chain, one set of permissions.
  • Join-heavy filtering: when “search” is mostly structured filters with a bit of text matching, Postgres is often faster and simpler.
  • Auditing and compliance: point-in-time recovery, WAL archiving, and tight access control are normal day-to-day tools.

PostgreSQL search: where it hurts

  • Relevance tuning gets awkward: you can do ranking, weights, dictionaries, synonyms, but the workflow is less ergonomic than dedicated search tooling.
  • Faceting and aggregations at scale: yes, SQL can aggregate; no, you may not like the latency under high cardinality with many concurrent queries.
  • Hybrid retrieval (keyword + vector + rerank): possible, but your query plans become Rube Goldberg machines.
  • Write amplification: GIN maintenance plus frequent updates plus vacuum realities can become your performance ceiling.

OpenSearch: what it nails

  • Fast full-text retrieval: inverted indices are built for this.
  • Facets/aggregations: terms aggregations, range buckets, and analytics-style queries are first-class citizens.
  • Relevance tooling: analyzers, synonyms, boosting, function scores, and explainability are what the product is for.
  • Horizontal scale: you can add nodes and redistribute shards. It’s not free, but it’s possible.
  • Vector + hybrid patterns: kNN search and hybrid retrieval are now common operational patterns.

OpenSearch: where it hurts

  • Operational complexity: JVM tuning, shard sizing, segment merging, circuit breakers, heap pressure, cluster coordination.
  • Eventual consistency: refresh intervals, ingestion pipelines, and retries mean “now” is a fuzzy concept.
  • Reindexing is a lifestyle: mappings change, analyzers change, synonyms change. You will reindex. Plan for it.

Joke #1: A search cluster is like a pet rabbit—quiet, cute, and then suddenly you have 400 of them and no one remembers how.

Storage-engine reality check

PostgreSQL stores rows. OpenSearch stores documents. That sounds like marketing, but it matters:

  • Partial updates are cheap in Postgres and often expensive in OpenSearch (update = reindex document).
  • Large text fields can bloat Postgres TOAST and increase IO; in OpenSearch they increase segment size and merge costs.
  • Compression and merge behavior in OpenSearch is a constant background process. In Postgres, vacuum and autovacuum are your background processes. Neither is optional.

Hybrid search architectures that don’t punish you later

Pattern A: PostgreSQL is truth, OpenSearch is the read-optimized index

This is the classic. You write to Postgres. You stream changes to OpenSearch. The app queries OpenSearch for IDs and ranking, then fetches canonical rows from Postgres.
It’s boring. Boring scales.

  • Pros: correctness lives in one place; search outages degrade gracefully (“stale results” or “limited search”).
  • Cons: you must operate a sync pipeline and handle backfills.
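The read path in Pattern A is easy to get subtly wrong: SQL’s “WHERE id IN (…)” returns rows in whatever order the database pleases, not in the ranked order OpenSearch gave you. A minimal sketch of hydration that preserves rank (plain Python; the batched database fetch is stubbed out as a dict, and the names here are illustrative):

```python
def hydrate(ranked_ids, fetch_rows):
    """Fetch canonical rows for OpenSearch hits, preserving rank order.

    ranked_ids: IDs in the order OpenSearch returned them.
    fetch_rows: callable taking a list of IDs and returning {id: row};
                in a real system this is ONE batched SQL query
                (e.g. WHERE id = ANY(...)), never N round trips.
    """
    rows = fetch_rows(ranked_ids)
    # Keep OpenSearch's ranking; silently drop IDs the database no longer
    # has (rows deleted in Postgres but still visible in a stale index).
    return [rows[i] for i in ranked_ids if i in rows]

# Stand-in for the batched Postgres query.
_fake_db = {1: "alpha", 2: "beta", 3: "gamma"}

def fake_fetch(ids):
    return {i: _fake_db[i] for i in ids if i in _fake_db}

print(hydrate([3, 1, 99, 2], fake_fetch))  # id 99 was deleted from Postgres
```

The “drop missing IDs” behavior is also your graceful-degradation story: a stale index returns a ghost ID, and hydration quietly filters it instead of rendering a broken result.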

Pattern B: Dual-read with feature flag fallback

When migrating from Postgres FTS to OpenSearch, you run both in parallel. You compare results and latency.
You gate rollout per tenant, per feature, or per query type.

  • Pros: safer migrations; gives you relevance A/B testing.
  • Cons: twice the query cost; requires careful caching and rate limits.
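If you run Pattern B, quantify the comparison instead of eyeballing dashboards. A sketch of a top-k overlap score between the two engines (plain Python; the helper name and the Jaccard choice are mine, not from any library):

```python
def topk_overlap(pg_ids, os_ids, k=10):
    """Jaccard overlap of the top-k result IDs from both engines.

    1.0 means identical sets (order ignored); low values flag queries
    worth inspecting before you flip traffic to OpenSearch.
    """
    a, b = set(pg_ids[:k]), set(os_ids[:k])
    if not a and not b:
        return 1.0  # both engines returned nothing: they agree
    return len(a & b) / len(a | b)

# Example: 3 shared IDs out of 5 distinct IDs across both top-4 lists.
print(topk_overlap([1, 2, 3, 4], [2, 3, 4, 5], k=4))
```

Run it over your golden queries nightly and alert when overlap drops; a sudden dip usually means an analyzer, mapping, or sync bug rather than a genuine relevance improvement.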

Pattern C: Postgres for filters + OpenSearch for retrieval, then rerank

When structured constraints are complex (permissions, entitlements, time windows), Postgres can compute the candidate set or constraints.
OpenSearch retrieves candidates; a reranker (possibly ML) reorders.

  • Pros: strong correctness on access control; keeps “who can see what” in SQL where auditors live.
  • Cons: can become expensive if candidate sets are huge; needs careful pagination semantics.
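One common way to merge a keyword candidate list with a vector candidate list in Pattern C is reciprocal rank fusion (RRF), which only needs ranks, not comparable scores. A minimal sketch (plain Python; k=60 is the conventional damping constant from the RRF literature, and the tie-break by ID is my addition for determinism):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked ID lists.

    Each document scores sum(1 / (k + rank)) across every list it
    appears in, so agreement between retrievers is rewarded without
    having to normalize BM25 scores against cosine similarities.
    Returns IDs sorted best-first, with ID as a deterministic tiebreak.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["a", "b", "c"]   # BM25 order from OpenSearch
vector_hits = ["c", "a", "d"]    # kNN order from the embedding index
print(rrf([keyword_hits, vector_hits]))
```

Document “a” wins because both retrievers rank it highly; “d” trails because only one list knows about it. An ML reranker can then reorder just this fused shortlist, which keeps inference cost bounded.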

Sync strategies: choose one and make it boring

Most pain comes from “we’ll just keep them in sync” hand-waving. Pick a real mechanism:

  • Transactional outbox: write a row to an outbox table in the same transaction; a worker publishes to OpenSearch. High correctness, simple replay.
  • Logical replication / CDC: stream WAL changes and transform into documents. Powerful, but you’re now operating a data pipeline.
  • Batch rebuild nightly: acceptable for “catalog search” in some businesses; unacceptable for security/alerts/time-sensitive search.

The one rule: design for replay. If you can’t rewind and rebuild the OpenSearch index deterministically, you don’t have a system—you have a vibe.
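The transactional outbox is mostly a discipline about using one transaction. A minimal sketch using Python’s sqlite3 as a stand-in for Postgres (same idea: the business write and the outbox row commit or roll back together; table and column names mirror this article’s search_outbox example but are illustrative):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE search_outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    doc TEXT NOT NULL,          -- document to index, serialized
    processed_at TEXT           -- NULL = pending
);
""")

def save_product(pid, title):
    # Business write and outbox row in ONE transaction: if either fails,
    # neither happens, so the search index can never learn about data
    # Postgres never committed.
    with db:
        db.execute("INSERT INTO products VALUES (?, ?)", (pid, title))
        db.execute("INSERT INTO search_outbox (doc) VALUES (?)",
                   (json.dumps({"_id": pid, "title": title}),))

def drain_outbox(index_fn, batch=100):
    # Worker: read pending rows in commit order, push each to the search
    # engine, then mark it processed. index_fn must be idempotent so a
    # crash-and-replay indexes a document twice without harm.
    rows = db.execute("SELECT id, doc FROM search_outbox "
                      "WHERE processed_at IS NULL ORDER BY id LIMIT ?",
                      (batch,)).fetchall()
    for row_id, doc in rows:
        index_fn(json.loads(doc))
        db.execute("UPDATE search_outbox SET processed_at = datetime('now') "
                   "WHERE id = ?", (row_id,))
    db.commit()
    return len(rows)

indexed = []
save_product(1, "wireless headphones")
save_product(2, "usb hub")
drain_outbox(indexed.append)
print([d["_id"] for d in indexed])
```

Replay falls out for free: reset processed_at for a time range and the worker rebuilds that slice of the index deterministically.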

Operational realities: SLOs, failure modes, and on-call pain

Define SLOs that match the architecture

A hybrid system has at least three latency buckets:

  • Ingest lag: time from Postgres commit to OpenSearch searchable.
  • Query latency: P50/P95/P99 for search queries, plus time to hydrate results from Postgres.
  • Freshness correctness: how often users see stale or missing results for newly written data.
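Ingest lag is worth reporting as a percentile, not an average; the tail is what users file tickets about. A sketch over (committed_at, searchable_at) timestamp pairs (plain Python, nearest-rank percentile; in practice you’d sample these by writing a probe row and polling the index for it):

```python
import math

def lag_percentile(pairs, pct=0.95):
    """P-th percentile of ingest lag in seconds.

    pairs: (committed_at, searchable_at) unix timestamps per sampled
    document. Uses the nearest-rank method, so the result is always an
    actually observed lag, never an interpolated one.
    """
    lags = sorted(searchable - committed for committed, searchable in pairs)
    if not lags:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(lags)))  # nearest-rank index (1-based)
    return lags[rank - 1]

samples = [(100, 101), (100, 102), (100, 103), (100, 104),
           (100, 105), (100, 106), (100, 107), (100, 108),
           (100, 109), (100, 130)]              # one ugly outlier
print(lag_percentile(samples))                  # p95 lands on the outlier
```

Note how the median lag here is 5 seconds while the p95 is 30: an average would have told you everything is fine.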

OpenSearch failure modes you should expect

  • Heap pressure leads to GC storms, which look like random latency spikes until the cluster goes yellow/red.
  • Hot shards from skewed routing cause one node to melt while others idle.
  • Merge pressure during high update rates causes IO spikes and query latency inflation.
  • Mapping explosions (dynamic fields) bloat cluster state and break everything in a very democratic way.

PostgreSQL failure modes you should expect

  • Autovacuum debt leading to bloated tables/indexes, slow queries, and eventually transaction ID wraparound risks.
  • GIN bloat especially for frequently updated documents; the index grows faster than your patience.
  • Bad query plans from stale statistics or parameter sniffing-like issues (generic vs custom plans).
  • Locking surprises from schema changes or heavy concurrent updates.

Joke #2: The fastest way to improve search latency is to rename the dashboard from “P95” to “P50” and go to lunch.

What to avoid like a pager at 3 a.m.

  • Letting OpenSearch become the system of record for anything compliance cares about.
  • Dynamic mappings in production unless you enjoy discovering “string vs number” conflicts mid-incident.
  • Reindexing without capacity headroom. Reindexing is not a background hobby; it’s a load test you run on yourself.
  • Permission filtering only in OpenSearch if your authorization model is non-trivial. Keep entitlements in Postgres and treat search as a suggestion engine.

Fast diagnosis playbook

Your search is slow or wrong. Don’t debate architecture in Slack. Check the bottleneck in minutes.

First: is it ingestion lag or query-time pain?

  • Symptom: users can’t find new items, but old items search fine → likely ingestion lag.
  • Symptom: everything searches, but slow/unreliable → likely query-time (OpenSearch/PG) or hydration bottleneck.

Second: isolate which hop is failing

  1. App → OpenSearch: measure raw search call latency and error rate.
  2. OpenSearch internal: CPU, heap, GC, thread pools, queue rejections, segment merges.
  3. Hydration → Postgres: slow SQL, connection pool saturation, index misses, locking, IO.
  4. Sync pipeline: consumer lag, dead-letter queues, retries, bulk request errors.

Third: choose the quickest safe mitigation

  • Ingestion lag: pause non-critical updates; increase bulk size carefully; add pipeline consumers; temporarily increase refresh interval.
  • OpenSearch query overload: reduce query complexity (fewer aggregations), add caching, adjust shard count only with a plan, or temporarily route heavy tenants elsewhere.
  • Postgres overload: cap hydration fanout; batch fetch IDs; add missing indexes; increase connection pool only if DB can handle it.

Practical tasks: commands, outputs, and decisions

These are real tasks you can run during build-out or an incident. Each includes what the output means and what decision to make next.

1) Check OpenSearch cluster health and what “yellow” actually means

cr0x@server:~$ curl -s localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "search-prod",
  "status" : "yellow",
  "number_of_nodes" : 3,
  "active_primary_shards" : 48,
  "active_shards" : 48,
  "unassigned_shards" : 48
}

Meaning: primaries are allocated, replicas aren’t. Queries may work, but you have no redundancy and rebalancing risk.
Decision: do not start a reindex; first fix allocation (node availability, disk watermarks, shard limits) or reduce replica count temporarily.

2) Find unassigned shard reasons before you guess

cr0x@server:~$ curl -s localhost:9200/_cat/shards?v
index             shard prirep state      docs store ip         node
products_v12      0     p      STARTED   9812  42mb  10.0.0.12  os-1
products_v12      0     r      UNASSIGNED

Meaning: replica shard can’t be placed.
Decision: inspect allocation explain; if it’s disk watermark, free disk or add nodes; if it’s shard limit, reduce shards or increase limits deliberately.

3) Explain shard allocation with the cluster’s own words

cr0x@server:~$ curl -s -H 'Content-Type: application/json' localhost:9200/_cluster/allocation/explain?pretty -d '{"index":"products_v12","shard":0,"primary":false}'
{
  "index" : "products_v12",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_name" : "os-2",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the high watermark"
        }
      ]
    }
  ]
}

Meaning: disk thresholds are blocking replica allocation.
Decision: free disk, add storage, or move shards; don’t “just disable watermarks” unless you enjoy sudden read-only indices.

4) Check OpenSearch node resource pressure from the outside (Linux)

cr0x@server:~$ ps -o pid,cmd,%cpu,%mem --sort=-%mem | head
  PID CMD                         %CPU %MEM
 2178 /usr/share/opensearch/jdk   312.0 48.6
 1042 /usr/bin/node app-server     64.2  3.1

Meaning: OpenSearch JVM is consuming major memory and CPU.
Decision: check heap vs RSS; if RSS is huge, you may be swapping or the page cache is under pressure. Next step: verify GC and heap usage.

5) Inspect JVM heap usage via nodes stats

cr0x@server:~$ curl -s localhost:9200/_nodes/stats/jvm?pretty | head -n 25
{
  "nodes" : {
    "n1" : {
      "jvm" : {
        "mem" : {
          "heap_used_in_bytes" : 15234567890,
          "heap_max_in_bytes" : 17179869184
        },
        "gc" : {
          "collectors" : {
            "old" : {
              "collection_count" : 92,
              "collection_time_in_millis" : 184322
            }
          }
        }
      }
    }
  }
}

Meaning: heap is ~88% used with significant old GC time.
Decision: reduce query/aggregation load, check fielddata usage, and consider heap tuning only after you confirm it’s not a mapping/aggregation problem.

6) Detect thread pool rejections (a quiet killer)

cr0x@server:~$ curl -s localhost:9200/_nodes/stats/thread_pool?pretty | grep -A3 rejected | head
          "search" : {
            "threads" : 13,
            "queue" : 1000,
            "rejected" : 421

Meaning: search requests are being rejected because the queue is full.
Decision: slow down clients (circuit breaker), reduce fanout, lower expensive aggregations, or add capacity. Do not just increase queue size and call it a fix.

7) Check segment/merge pressure (why updates feel like sandpaper)

cr0x@server:~$ curl -s localhost:9200/_nodes/stats/indices/segments,merge?pretty | head -n 35
{
  "nodes" : {
    "n1" : {
      "indices" : {
        "segments" : {
          "count" : 2345,
          "memory_in_bytes" : 987654321
        },
        "merge" : {
          "current" : 12,
          "current_docs" : 1900000,
          "total_throttled_time_in_millis" : 842112
        }
      }
    }
  }
}

Meaning: many segments and merges, with throttling time piling up.
Decision: adjust refresh interval, bulk ingest strategy, and update patterns. Consider moving frequently changing fields out of the document or using denormalization more carefully.

8) Check index settings that affect freshness and ingest cost

cr0x@server:~$ curl -s localhost:9200/products_v12/_settings?pretty | head -n 40
{
  "products_v12" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "1s",
        "number_of_shards" : "12",
        "number_of_replicas" : "1"
      }
    }
  }
}

Meaning: 1s refresh interval is great for freshness, costly for ingest.
Decision: if ingest is falling behind, temporarily raise refresh_interval (e.g., 10s–30s) and communicate freshness expectations; revert after backlog clears.

9) Verify PostgreSQL bloat and vacuum health at a glance

cr0x@server:~$ psql -d app -c "SELECT relname,n_live_tup,n_dead_tup,round(100.0*n_dead_tup/nullif(n_live_tup+n_dead_tup,0),2) AS dead_pct FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
  relname   | n_live_tup | n_dead_tup | dead_pct
------------+------------+------------+---------
products    |   12400123 |    3120099 |   20.11
events      |    9021123 |    2019921 |   18.29

Meaning: dead tuples are high; vacuum may be behind.
Decision: check autovacuum settings, long transactions, and IO. Consider manual VACUUM (not FULL) during a quiet window if you’re in trouble.

10) Confirm whether Postgres is missing the index you think it has

cr0x@server:~$ psql -d app -c "\d+ products"
                                                  Table "public.products"
   Column   |           Type           | Collation | Nullable | Default | Storage  | Stats target | Description
------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 id         | bigint                   |           | not null |         | plain    |              |
 title      | text                     |           | not null |         | extended |              |
 body       | text                     |           |          |         | extended |              |
 search_tsv | tsvector                 |           |          |         | extended |              |
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)

Meaning: no GIN index on search_tsv.
Decision: add the right index (e.g. CREATE INDEX CONCURRENTLY products_search_tsv_gin ON products USING gin (search_tsv);) before blaming hardware. Also verify how search_tsv is maintained (trigger vs generated column vs batch).

11) Explain a slow Postgres search query properly

cr0x@server:~$ psql -d app -c "EXPLAIN (ANALYZE,BUFFERS) SELECT id FROM products WHERE search_tsv @@ plainto_tsquery('english','wireless headphones') ORDER BY ts_rank(search_tsv, plainto_tsquery('english','wireless headphones')) DESC LIMIT 20;"
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4812.93..4812.98 rows=20 width=8) (actual time=812.221..812.260 rows=20 loops=1)
   Buffers: shared hit=120 read=9812
   ->  Sort  (cost=4812.93..4816.16 rows=1290 width=8) (actual time=812.219..812.248 rows=20 loops=1)
         Sort Key: (ts_rank(search_tsv, plainto_tsquery('english'::regconfig, 'wireless headphones'::text))) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Bitmap Heap Scan on products  (cost=64.43..4781.21 rows=1290 width=8) (actual time=9.410..810.530 rows=1290 loops=1)
               Recheck Cond: (search_tsv @@ plainto_tsquery('english'::regconfig, 'wireless headphones'::text))
               Heap Blocks: exact=9700
               ->  Bitmap Index Scan on products_search_tsv_gin  (cost=0.00..64.11 rows=1290 width=0) (actual time=3.210..3.211 rows=1290 loops=1)
                     Index Cond: (search_tsv @@ plainto_tsquery('english'::regconfig, 'wireless headphones'::text))
 Planning Time: 2.012 ms
 Execution Time: 812.312 ms

Meaning: heavy disk reads (read=9812) dominate; index exists but data isn’t in cache.
Decision: consider warming cache for hot queries, add RAM, reduce working set, or move this workload to OpenSearch if concurrency will grow.

12) Check Postgres slow queries in real time

cr0x@server:~$ psql -d app -c "SELECT pid,now()-query_start AS age,wait_event_type,wait_event,substr(query,1,80) AS q FROM pg_stat_activity WHERE state='active' ORDER BY age DESC LIMIT 8;"
 pid  |   age   | wait_event_type |  wait_event   |                                        q
------+---------+-----------------+---------------+--------------------------------------------------------------------------------
 8123 | 00:01:12| IO              | DataFileRead  | SELECT id FROM products WHERE search_tsv @@ plainto_tsquery('english','wirel...
 8344 | 00:00:44| Lock            | tuple         | UPDATE products SET body = $1 WHERE id = $2

Meaning: IO waits and row-level locks are both present.
Decision: separate read/write hotspots; consider moving heavy writes away from tables backing search, or decouple search indexing from OLTP paths.

13) Validate replication lag (if hydration reads a replica)

cr0x@server:~$ psql -d app -c "SELECT application_name,state,write_lag,flush_lag,replay_lag FROM pg_stat_replication;"
 application_name |   state   | write_lag | flush_lag | replay_lag
------------------+-----------+-----------+-----------+-----------
 app-read-replica  | streaming | 00:00:00  | 00:00:01  | 00:00:08

Meaning: replica replay lag is 8 seconds.
Decision: if users complain about “missing items after creation,” check whether hydration reads replicas. Route critical freshness reads to primary or accept staleness explicitly.

14) Check your sync pipeline backlog (outbox table)

cr0x@server:~$ psql -d app -c "SELECT count(*) AS pending, min(created_at) AS oldest FROM search_outbox WHERE processed_at IS NULL;"
 pending |         oldest
---------+------------------------
  184221 | 2026-02-04 08:11:03+00

Meaning: you have a large backlog and it’s old.
Decision: pause non-essential producers, scale consumers, and verify OpenSearch bulk errors. If the backlog grows during normal load, you sized the pipeline for optimism, not reality.

15) Verify OpenSearch bulk ingest errors (don’t trust “200 OK”)

cr0x@server:~$ curl -s -H 'Content-Type: application/json' localhost:9200/_bulk -d $'{"index":{"_index":"products_v12","_id":"42"}}\n{"title":"x","price":"not-a-number"}\n' | head
{"took":7,"errors":true,"items":[{"index":{"_index":"products_v12","_id":"42","status":400,"error":{"type":"mapper_parsing_exception","reason":"failed to parse field [price] of type [float]"}}}]}

Meaning: the bulk API accepted the request but failed an item due to mapping conflicts.
Decision: send failed items to a dead-letter path; fix the mapping and reindex. If you ignore this, your index becomes silently incomplete.
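Because the bulk endpoint returns HTTP 200 even when individual items fail, every ingest worker needs a response parser that splits items into retryable and permanent failures. A sketch over the already-decoded JSON body (plain Python; treating 429/503 as retryable and other non-2xx statuses as dead-letter is my assumption, tune it to your cluster):

```python
RETRYABLE = {429, 503}  # queue rejection / temporarily unavailable

def split_bulk_results(response):
    """Partition a decoded _bulk response into (ok, retry, dead) lists."""
    ok, retry, dead = [], [], []
    for item in response["items"]:
        # Each item wraps its action: {"index": {...}}, {"create": {...}}, etc.
        result = next(iter(item.values()))
        status = result["status"]
        if status < 300:
            ok.append(result)
        elif status in RETRYABLE:
            retry.append(result)   # re-enqueue with backoff
        else:
            dead.append(result)    # dead-letter: needs a mapping fix, not a retry
    return ok, retry, dead

resp = {"took": 7, "errors": True, "items": [
    {"index": {"_id": "41", "status": 201}},
    {"index": {"_id": "42", "status": 400,
               "error": {"type": "mapper_parsing_exception"}}},
    {"index": {"_id": "43", "status": 429}},
]}
ok, retry, dead = split_bulk_results(resp)
print(len(ok), len(retry), len(dead))
```

Retrying a 400 mapping error forever just grinds your pipeline; dead-lettering it preserves the document and pages a human instead.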

16) Confirm OpenSearch field mappings to prevent “dynamic mapping surprises”

cr0x@server:~$ curl -s localhost:9200/products_v12/_mapping?pretty | head -n 40
{
  "products_v12" : {
    "mappings" : {
      "properties" : {
        "title" : { "type" : "text" },
        "price" : { "type" : "float" }
      }
    }
  }
}

Meaning: price is a float; ingestion must respect it.
Decision: lock down mappings and validate data at the producer. “Let OpenSearch figure it out” works until it doesn’t, and it never breaks politely.

Corporate-world mini-stories (anonymized)

Mini-story 1: The incident caused by a wrong assumption

A mid-sized B2B SaaS company moved “global search” from PostgreSQL FTS to OpenSearch to support faceting and faster prefix queries.
They did the migration the way people do: new service, new index, a CDC pipeline, and a triumphant dashboard showing query latency dropping.

Then support tickets arrived: “I created a record and search can’t find it.” The engineering manager assumed it was user error or caching.
The team halved the OpenSearch refresh interval to 500ms and called it done. Tickets continued.

The wrong assumption was subtle: they thought the CDC pipeline was “near real-time” because it processed messages quickly when the queue was empty.
During peak hours, the outbox consumer lagged behind by minutes. Refresh intervals don’t fix missing documents that haven’t been indexed.

The fix was not fancy. They added lag monitoring, scaled consumers, and made the UI honest: “New items may take up to N minutes to appear.”
For a handful of workflows where freshness mattered, the app used PostgreSQL queries directly for a short “just created” window.

The lesson: refresh interval is not a consistency guarantee. It’s a visibility control after ingest. Your pipeline is the consistency boundary now.

Mini-story 2: The optimization that backfired

Another org had a multi-tenant marketplace. Search was “fine” until they added semantic search with embeddings.
Someone proposed an optimization: put everything into a single OpenSearch index with aggressive dynamic mapping, and route by tenant ID for locality.
It worked beautifully in staging. Staging is a magical place where data never argues.

In production, tenant behavior diverged. One tenant shipped documents with a “metadata” object containing hundreds of keys that varied per record.
Dynamic mapping created fields like it was being paid per field. Cluster state grew, heap usage climbed, and rolling restarts started failing because nodes took too long to join.

They tried to “optimize” further by increasing heap. That bought time, then increased GC pauses.
Meanwhile, relevance debugging became impossible because different tenants’ analyzer needs collided inside one mapping.

The eventual fix: move to a controlled schema. Flatten metadata into a single keyword field when needed, and explicitly whitelist indexable keys.
Some high-volume tenants got dedicated indices with separate lifecycle policies. The system became slightly less elegant and dramatically more reliable.

The lesson: dynamic mapping + multi-tenant variability is a scaling bug, not a feature. You pay for flexibility with cluster stability.

Mini-story 3: The boring practice that saved the day

A payments-adjacent company ran PostgreSQL as truth and OpenSearch for search. They had one habit that looked like overkill: every index version was immutable.
products_v10, products_v11, products_v12. The application queried an alias called products_current.

A developer needed to add synonyms and change analyzers for better relevance. That requires reindexing.
Instead of “updating the index in place,” they built products_v13 in parallel, validated document counts, ran query comparisons, and then atomically flipped the alias.

During rollout, they discovered an ingestion bug: a price field sometimes arrived as “N/A”. The new index rejected those documents. Old index still served traffic.
They fixed the producer, replayed failed outbox entries, rebuilt v13, and flipped the alias again. Users barely noticed anything besides gradually improving relevance.

The lesson: immutable index versions + alias cutovers aren’t glamorous, but they turn “reindex” from a crisis into a routine.

Common mistakes: symptoms → root cause → fix

1) “Search is slow only when we add aggregations”

  • Symptoms: P95 explodes on faceted queries; CPU spikes; OpenSearch search thread pool rejections.
  • Root cause: high-cardinality terms aggregations on unanalyzed fields, too many shards, or insufficient heap for aggregation structures.
  • Fix: reduce aggregation cardinality, use keyword fields, add filter context, precompute facets, or split indices by tenant/category. Validate with thread pool stats and heap.

2) “New records don’t show up, but OpenSearch is green”

  • Symptoms: missing fresh data; cluster health fine; no query errors.
  • Root cause: ingestion lag, bulk errors, dead-letter accumulation, or refresh interval too high for expectations.
  • Fix: monitor outbox/CDC lag, inspect bulk responses for errors, implement replay. Set and communicate freshness SLOs.

3) “Postgres FTS was fast, then it got slow over months”

  • Symptoms: increasing latency; high IO reads; tables and GIN indexes grow disproportionately.
  • Root cause: bloat from updates/deletes + vacuum debt; GIN pending list growth; long transactions blocking vacuum.
  • Fix: tune autovacuum, break up long transactions, consider REINDEX or rebuilding the GIN index during maintenance, and keep search_tsv maintenance efficient.

4) “OpenSearch node restarts cause a cascade of timeouts”

  • Symptoms: rolling restart causes cluster yellow/red; requests time out; shard relocations saturate network/disk.
  • Root cause: shard count too high, insufficient disk headroom, relocation throttles misconfigured, or replicas unavailable.
  • Fix: reduce shard count (bigger shards, fewer of them), ensure disk headroom, stage rolling restarts, and keep replicas healthy.

5) “We tuned relevance and now results are wrong”

  • Symptoms: exact matches buried; brand terms ignored; unpredictable ranking changes.
  • Root cause: analyzer changes without reindex, synonyms applied at query time incorrectly, multi-field mapping mismatches.
  • Fix: version analyzers and indices, test with explain, maintain golden queries, and deploy with alias flips and comparison runs.

6) “Our hybrid query duplicates or misses results when paginating”

  • Symptoms: users see repeated items on page 2; missing items; inconsistent ordering.
  • Root cause: unstable sort keys between OpenSearch and Postgres hydration, using offset-based pagination under concurrent writes.
  • Fix: use search_after or cursor pagination in OpenSearch; hydrate by stable IDs; maintain deterministic secondary sort (e.g., _score + id).
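The fix is easier to see in miniature. A sketch of search_after-style cursors with a deterministic (score desc, id asc) sort key (plain Python, where the “index” is just a sorted list; note how a concurrent insert shifts offsets but not cursors):

```python
def sort_key(doc):
    # Deterministic total order: score descending, then id ascending as
    # the tiebreaker, mirroring "_score desc, _id asc" in a real query.
    return (-doc["score"], doc["id"])

def search_page(index, size, after=None):
    """Return (page, cursor). Pass cursor back as `after` for the next page."""
    docs = sorted(index, key=sort_key)
    if after is not None:
        docs = [d for d in docs if sort_key(d) > after]  # strictly past cursor
    page = docs[:size]
    cursor = sort_key(page[-1]) if page else None
    return page, cursor

index = [{"id": i, "score": 1.0} for i in range(1, 6)]  # 5 equal-score docs
page1, cur = search_page(index, size=2)
index.append({"id": 0, "score": 9.0})   # concurrent write lands on "page 1"
page2, _ = search_page(index, size=2, after=cur)
print([d["id"] for d in page1], [d["id"] for d in page2])
```

With offset pagination, the new high-scoring document would have pushed id 2 onto page 2 as a duplicate; the cursor keeps page 2 at [3, 4] because it remembers where page 1 ended, not how many rows preceded it.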

Checklists / step-by-step plan

Step-by-step: choosing Postgres-only vs OpenSearch vs hybrid

  1. Write down query types: keyword, phrase, prefix, fuzzy, facets, geo, vector, hybrid ranking.
  2. Write down freshness requirements: “seconds,” “minutes,” “hourly.” Be explicit.
  3. Define failure behavior: what happens if search is down or stale?
  4. Estimate data churn: are documents updated frequently or mostly append-only?
  5. Prototype the hardest query in both systems and measure P95 under concurrency, not just single-user timing.
  6. Pick the simplest architecture that meets SLOs. Not the coolest.

Step-by-step: building a safe hybrid pipeline

  1. Keep Postgres as truth: don’t let OpenSearch become the only copy of critical fields.
  2. Implement an outbox table with a durable cursor and replay support.
  3. Make indexing idempotent: document ID derived from primary key; updates overwrite.
  4. Use bulk indexing with backoff and a dead-letter mechanism.
  5. Version indices and use aliases for cutover.
  6. Monitor lag: outbox backlog age, bulk error rate, OpenSearch thread pool rejections, Postgres replication lag if relevant.
  7. Load test ingest + query together: the cluster that indexes fine may query poorly when merges start.
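Step 4’s “backoff” deserves to be explicit. A sketch of exponential backoff with a retry cap, after which the batch goes to the dead-letter path (plain Python; the send and sleep callables are injected so the logic is testable without a cluster, and the delay schedule is an assumption to tune):

```python
def index_with_backoff(batch, send, sleep, max_attempts=5, base_delay=0.5):
    """Try to index `batch`; back off exponentially on transient failure.

    send(batch) -> True on success, False on a retryable failure
                   (e.g. bulk queue rejections).
    sleep(secs) -> injected so tests and schedulers control time.
    Returns True if indexed, False if the batch should be dead-lettered.
    """
    for attempt in range(max_attempts):
        if send(batch):
            return True
        # 0.5s, 1s, 2s, 4s, ... adding full jitter is a common refinement
        # to avoid synchronized retry stampedes across workers.
        sleep(base_delay * (2 ** attempt))
    return False

# Fake transport that fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(batch):
    calls["n"] += 1
    return calls["n"] >= 3

delays = []
ok = index_with_backoff(["doc1", "doc2"], flaky_send, delays.append)
print(ok, calls["n"], delays)
```

The important property: backoff is for transient failures only. Permanent failures (mapping conflicts) should short-circuit to the dead-letter path immediately, as in the bulk-response parsing task above.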

Step-by-step: reindexing without drama

  1. Create a new index version with explicit mappings and settings.
  2. Backfill from Postgres in controlled batches.
  3. Dual-write new changes (or replay outbox) into both old and new indices.
  4. Validate document counts, sample queries, and bulk error logs.
  5. Flip alias atomically to the new index.
  6. Keep old index for rollback until confidence is high; then delete.
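The atomic flip in step 5 is a single _aliases call with paired remove/add actions; both apply in one cluster-state update, so there is no window where the alias points nowhere. A sketch building the request body (the index and alias names follow this article’s example; you’d POST the result to /_aliases):

```python
import json

def alias_flip_body(alias, old_index, new_index):
    """Build the _aliases request that atomically moves `alias`."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}

body = alias_flip_body("products_current", "products_v12", "products_v13")
print(json.dumps(body, indent=2))
```

Rollback is the same call with the indices swapped, which is exactly why keeping the old index around until confidence is high costs you disk and buys you a one-request undo.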

FAQ

1) Can PostgreSQL do “good enough” search for a product catalog?

Yes, if your needs are mostly exact-ish keyword search plus filters, and your scale/churn are reasonable.
When you need heavy faceting, fuzzy matching, synonyms at scale, or hybrid vector + keyword ranking, OpenSearch becomes the sane choice.

2) If I add pgvector, do I still need OpenSearch?

Sometimes no. pgvector is great when you want embeddings near your transactional data and can tolerate the query cost.
But OpenSearch tends to win when you need combined relevance, aggregations, and retrieval performance under high concurrency.

3) Why not store everything in OpenSearch and skip Postgres?

Because you’ll reinvent transactional guarantees with duct tape. OpenSearch is not an OLTP database.
Keep truth in Postgres (or another database) unless the domain truly tolerates eventual consistency for everything you care about.

4) What’s the biggest operational trap with OpenSearch?

Shard sprawl and uncontrolled mappings. Too many shards create coordination overhead; dynamic fields inflate cluster state and memory.
Both fail slowly, then all at once.

5) What’s the biggest operational trap with Postgres-only search?

Vacuum and bloat management, especially with GIN indexes on frequently updated text.
Postgres can do it, but you must treat vacuum as a first-class production concern, not a background fairy.

6) How do I make hybrid search consistent enough for users?

Define freshness SLOs, measure pipeline lag, and provide graceful fallbacks.
For workflows requiring immediate visibility, query Postgres directly for recently created records or show “pending indexing” state.

7) How should I paginate in a hybrid system?

Prefer OpenSearch cursor-style pagination (search_after) with stable sorting, then hydrate those IDs in Postgres using a single batched query.
Avoid offset pagination across changing datasets unless you like duplicates.

8) Do I need a separate OpenSearch index per tenant?

Not always. Per-tenant indices can explode shard counts. A shared index with routing can work, but only with strict schema control.
High-volume or high-variability tenants often justify dedicated indices.

9) How do I know if I should raise OpenSearch refresh_interval?

If ingest is behind, merges are throttling, and users can tolerate some staleness, raise it temporarily.
If your product promise is “instant search,” raising it just moves the complaint from ops to support.

10) Should search results be hydrated from Postgres every time?

Often yes for correctness and access control, but not always for performance. Many systems store enough denormalized fields in OpenSearch for display.
The rule is: if a field must be correct and current, hydrate or verify against Postgres.

Next steps you can do this week

  • Decide the truth source: write it down. If it’s not Postgres, be explicit about what you’re risking.
  • Add lag metrics for your indexing pipeline (outbox backlog age, consumer throughput, bulk error rate).
  • Run the fast diagnosis playbook once on a calm day, not during an incident. Capture baselines.
  • Version your index and implement alias cutover if you haven’t. This is the difference between “change” and “incident.”
  • Pick three golden queries and track them in CI or a scheduled job with latency and top-result checks.
  • Reduce surprise: disable dynamic mapping where possible, explicitly map fields, and validate producer data types.

If you’re stuck deciding: start with Postgres-only if you can meet relevance and latency targets today. Move to hybrid when you can name the specific query patterns that demand it.
Don’t adopt OpenSearch just because someone wants synonyms. Adopt it because your system needs a search engine, not because you want another cluster to babysit.
