You shipped search. Users loved it for a week. Then “search is slow” tickets arrived, followed by “search results are wrong,” and finally the executive classic: “Can we just make it Google-y?”
That’s when finance asks the most operationally dangerous question: “Is Elasticsearch cheaper than just doing it in Postgres?”
The honest answer is that “cheaper” depends less on list price and more on what you’re signing up to operate for the next three years: data movement, index hygiene, failure domains, and the human cost of running one more distributed system.
The decision frame: what you’re really buying
PostgreSQL full-text search (FTS) and Elasticsearch are not rivals in the “feature checkbox” sense. They’re rivals in the “how many 3 a.m. pages do you want per quarter” sense.
Both can answer “find documents matching these terms, ranked reasonably.” The costs diverge over time because the operational models diverge.
What stays cheap in Postgres
- Fewer moving parts. Same backup system, same HA strategy, same observability stack, same upgrade calendar.
- Transactional consistency by default. Search results reflect committed writes without a CDC pipeline or dual-write.
- Small-to-medium corpora with predictable queries. Product catalogs, tickets, notes, CRM records, internal docs.
- Teams that are already good at Postgres. Skill reuse is not a soft benefit; it’s a budget line item.
What stays cheap in Elasticsearch
- Complex relevance and analytics-style queries. Facets, aggregations, fuzziness, synonyms, per-field boosting, “did you mean,” and heavy multi-field ranking experiments.
- High query fan-out with caching wins. If your read traffic dwarfs writes, ES can be a cost-effective query workhorse—when tuned.
- Large text corpora and multi-tenant search when you’ve accepted the operational overhead and sized shards sanely.
- Organisations already running Elastic well. If you have a platform team with proven playbooks, the marginal cost drops.
The long-term “cheap” question is really three questions
- Do you want one system of record or two? Search clusters are rarely systems of record, which means ingest pipelines, backfills, and reconciliation.
- Can you tolerate eventual consistency in search results? If not, you’ll pay somewhere—usually in complexity.
- Who will own relevance quality? If it’s “nobody,” Postgres FTS often wins by virtue of being “good enough” and stable.
Paraphrased idea from Werner Vogels (Amazon CTO): Everything fails, all the time; design systems assuming failure.
That’s not poetry. It’s a budget forecast.
Interesting facts and short history (because this stuff didn’t appear yesterday)
- PostgreSQL FTS isn’t new. Core full-text search landed in PostgreSQL 8.3 (2008), building on earlier tsearch modules.
- GIN indexes were a game changer. Generalized Inverted Index (GIN) support made token-to-row lookups practical at scale for tsvector data.
- Elasticsearch rode the Lucene wave. Lucene predates Elasticsearch by a decade; ES packaged Lucene into a distributed service with REST APIs and cluster features.
- “Near real-time” is literal. Lucene-based systems refresh segments periodically; newly indexed docs become searchable after refresh, not instantly at commit time.
- Postgres ranking is rooted in IR. ts_rank/ts_rank_cd implement classic information retrieval ideas; it’s not magic, but it’s not naïve substring matching either.
- Elasticsearch’s default shard count has burned many teams. Shards are not free; too many shards increase overhead and recovery time.
- Vacuum is a cost lever. Postgres bloat and index churn can turn “cheap search” into “why is our disk 90% full.”
- Mapping changes can be operationally expensive. In ES, many mapping changes require reindexing; you pay in IO and time.
- Synonyms are an organisational problem. Whether Postgres or ES, synonyms lists become product policy—someone has to own the arguments.
Joke #1: Search relevance is like office coffee—everyone has opinions, and nobody wants to maintain the machine.
Long-term cost model: compute, storage, people, and risk
Cost bucket 1: compute and memory
Postgres FTS usually burns CPU during query time (ranking, filtering) and during writes (maintaining GIN indexes, triggers, generated columns). It also benefits massively from cache: hot index pages in memory matter.
If you run Postgres on “just enough RAM,” you will discover that full-text queries are excellent at turning random IO into a lifestyle.
Elasticsearch burns memory for heap (cluster metadata, segment readers, caches) and uses OS page cache heavily for Lucene segments. Underprovision heap and you get GC drama; overprovision heap and you starve the page cache. Either way, you’ll have a spreadsheet.
Cost bucket 2: storage amplification
Storage is where the long-term math often flips. Search indexes are not compact. They are deliberate redundancy to make queries fast.
- Postgres: one copy of the data (plus WAL), and one or more indexes. FTS adds tsvector storage and a GIN (or GiST) index. Bloat is the silent multiplier.
- Elasticsearch: one copy of source (unless you disable it), inverted index structures, doc values for aggregations, and typically at least one replica. That’s 2× before you blink. Snapshots add another layer.
If you’re comparing a single Postgres instance to an ES cluster with replicas, you’re not comparing software. You’re comparing risk tolerance.
Cost bucket 3: data movement and backfills
Postgres FTS: data is already there. You add a column, build an index, and you’re done—until you change your tokenization config or add a new field and need to recompute vectors.
Elasticsearch: you must ingest. That means CDC (logical decoding), an outbox pattern, a streaming pipeline, or dual-write. It also means backfills. Backfills happen at the worst time: after you already depend on search for revenue.
Cost bucket 4: people and process
Elasticsearch in production is not “install and search.” It’s index lifecycle policies, shard sizing, refresh intervals, mappings, analyzers, and cluster upgrades that can’t be treated like a library bump.
Postgres FTS isn’t free either, but you’re paying a smaller “new system tax.”
The cheapest system is the one your on-call can explain under pressure. If your team has never done a rolling restart of an ES cluster while shards are relocating and user traffic is spiking, you are not done budgeting.
Cost bucket 5: risk and blast radius
Elasticsearch isolates search load from your primary database. That can reduce risk if search queries are heavy and unpredictable.
Postgres keeps everything together: fewer systems, but a greater chance that a bad search query becomes a database incident.
Long-term cheap is not “lowest monthly bill.” Long-term cheap is “fewest multi-team incidents and emergency reindex weekends.”
PostgreSQL full-text search: what it does well, and what it punishes
The good parts (and why they stay good)
Postgres FTS shines when search is an attribute of your transactional data, not a separate product.
You can keep your data model simple: a table, a tsvector column, a GIN index, and queries with to_tsquery or plainto_tsquery.
- Consistency: Within the same transaction, you can update content and search vector atomically (generated columns or triggers).
- Operational simplicity: One backup, one restore. One place to run security policies. One set of access controls.
- Great for “search inside an app”: where the UX is mostly “type words, get records,” with light ranking needs.
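To make the cheap path concrete, here’s a minimal sketch. It assumes a documents table with title and body text columns (adjust names to your schema); the fts column and documents_fts_gin index line up with the task examples later in this article.
cr0x@server:~$ psql -d appdb <<'SQL'
-- Generated column keeps the vector in sync transactionally (PostgreSQL 12+).
ALTER TABLE documents
  ADD COLUMN fts tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body, '')), 'B')
  ) STORED;
-- GIN index for token-to-row lookups.
CREATE INDEX documents_fts_gin ON documents USING gin (fts);
-- Typical query: parse user input with websearch_to_tsquery, rank, and always LIMIT.
SELECT id, ts_rank(fts, q) AS rank
FROM documents, websearch_to_tsquery('english', 'backup policy') AS q
WHERE fts @@ q
ORDER BY rank DESC
LIMIT 20;
SQL
On PostgreSQL versions before 12, a trigger that maintains the column gives the same effect; nothing else in this article changes.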
The punishments (where costs creep in)
Postgres will make you pay for three sins: oversized rows, high write churn, and unbounded query patterns.
- Write amplification: Updating a tsvector and GIN index can be expensive under high write volume.
- Bloat: Frequent updates/deletes to indexed columns can bloat tables and GIN indexes, increasing IO and slowing VACUUM.
- Relevance ceiling: You can do weighting, dictionaries, and configs, but you won’t get ES-level tooling for synonyms, per-field analyzers at scale, and relevance experiments without building it.
- Multi-tenant pitfalls: If you do “tenant_id AND tsquery” for thousands of tenants, you may need partial indexes, partitioning, or both.
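If tenant filtering dominates and a plain GIN index plus a btree filter isn’t enough, one option is a composite GIN index via the btree_gin contrib extension so tenant_id and the text search share one index. This is a sketch, assuming btree_gin is available on your build and tenant_id is your tenant key:
cr0x@server:~$ psql -d appdb <<'SQL'
-- btree_gin lets scalar columns like tenant_id participate in a GIN index.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX documents_tenant_fts_gin ON documents USING gin (tenant_id, fts);
SQL
Measure before and after with EXPLAIN (ANALYZE, BUFFERS); for many workloads the plain GIN index is already fine.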
When Postgres is the cheaper long-term choice
- Your search corpus is under tens of millions of rows, and you can keep documents reasonably small.
- You can constrain queries (no leading wildcards, no “OR everything” queries that explode).
- You need strong consistency and simple operations more than bleeding-edge relevance.
- Your team is already staffed for Postgres, not distributed search.
When Postgres becomes the expensive choice
- Search traffic is large enough to compete with OLTP queries and you can’t isolate it with replicas.
- You need heavy faceting/aggregations over many fields at low latency.
- Relevance tuning becomes a product differentiator and you need faster iteration loops than SQL plus custom code can provide.
Elasticsearch: what it does well, and what it punishes
The good parts
Elasticsearch is built to be searched. It’s not shy about precomputing index structures to keep query latency low.
It’s also designed for horizontal scaling: add nodes, rebalance shards, keep going. In practice, “keep going” is where your runbooks earn their salary.
- Relevance and UX features: analyzers, token filters, synonyms, fuzziness, highlighting, per-field boosting, “more like this.”
- Aggregations: facets, histograms, cardinality estimates, and analytics-ish queries that Postgres can do but often at different cost profiles.
- Isolation: You can keep search load away from your transactional database.
- Scale-out story: With correct shard sizing, ES can scale reads and storage across nodes cleanly.
The punishments
Elasticsearch punishes teams that treat it like a black box and then act surprised when it behaves like a distributed system.
Shards are where the bodies are buried.
- Shard overhead: Too many shards waste heap, increase file handles, slow cluster state updates, and prolong recovery.
- Reindexing tax: Mapping mistakes or analyzer changes often require a full reindex. That’s time, IO, and operational risk.
- Eventual consistency: You must manage ingest lag, refresh intervals, and “why isn’t my write searchable yet” support tickets.
- Upgrade choreography: Rolling upgrades are doable, but versions, plugins, and breaking changes require discipline.
- Hidden coupling: Your app, ingest pipeline, index templates, ILM, and cluster settings become one big organism.
Joke #2: Elasticsearch is easy until you need it to be reliable—then it becomes a distributed systems course you didn’t enroll in.
When Elasticsearch is the cheaper long-term choice
- Search is a primary product feature and you need to iterate on relevance quickly.
- Your query patterns include aggregations/facets across many fields with low latency requirements.
- Your dataset is large enough that a dedicated search tier avoids hammering the OLTP database.
- You have (or will fund) the operational maturity: sizing, monitoring, ILM, snapshots, and a tested restore.
When Elasticsearch becomes the expensive choice
- You don’t have a clean ingest story and end up dual-writing with inconsistent behavior.
- You run too many shards “just in case” and pay heap and CPU forever.
- You treat reindexing as a rare event and then do it during peak season.
Architectural patterns that keep you out of trouble
Pattern A: “Postgres only” with sane constraints
Do this if search is secondary. Use a generated tsvector column, a GIN index, and accept that you’re building “good internal search,” not a search company.
Put guardrails in your API so users can’t generate pathological queries.
- Use websearch_to_tsquery for user input (better UX, fewer surprises).
- Use weights and a small number of fields; don’t stuff a whole JSON blob into one vector unless you mean it.
- Consider read replicas for search traffic isolation.
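A guardrailed query might look like the sketch below. The tenant filter, the 200-character cap, and the title column are placeholders for your own constraints; in application code the search term is a bound parameter.
cr0x@server:~$ psql -d appdb <<'SQL'
-- Guardrails: required tenant filter, capped input length, ranking, and a hard LIMIT.
SELECT id, title
FROM documents,
     websearch_to_tsquery('english', left('backup policy', 200)) AS q
WHERE tenant_id = 42
  AND fts @@ q
ORDER BY ts_rank(fts, q) DESC
LIMIT 20;
SQL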
Pattern B: Postgres as source of truth + Elasticsearch as projection
This is the common “grown-up” pattern: OLTP in Postgres, search in ES. The cost is the pipeline.
Do it only when you can answer “how do we rebuild ES from Postgres” confidently.
- Use an outbox table and a consumer to index changes (a minimal table sketch follows this list).
- Design idempotent indexing operations.
- Plan for backfills and schema evolution (versioned documents or index aliases).
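Here is that minimal outbox sketch. The table and column names are illustrative, not a standard schema; the consumer reads rows in id order, indexes into Elasticsearch, and deletes (or marks) what it has processed.
cr0x@server:~$ psql -d appdb <<'SQL'
-- Written in the same transaction as the document change, so nothing is lost if the
-- indexer is down; replaying rows is safe because indexing is idempotent
-- (keyed by doc_id, with stale doc_version values ignored).
CREATE TABLE search_outbox (
  id          bigserial PRIMARY KEY,
  doc_id      bigint      NOT NULL,
  doc_version bigint      NOT NULL,
  op          text        NOT NULL CHECK (op IN ('upsert', 'delete')),
  enqueued_at timestamptz NOT NULL DEFAULT now()
);
SQL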
Pattern C: Two-tier search—cheap default, expensive advanced
Keep most search in Postgres. Route only advanced queries (facets, fuzzy matching, heavy ranking) to Elasticsearch.
This reduces ES load and keeps the pipeline smaller. It also creates a “two sources of truth for search behavior,” so be deliberate.
Hard rule: don’t let search become your write path
If user-facing writes depend on ES being healthy, you’ve turned your search cluster into a critical transactional dependency. That’s how “search incident” becomes “revenue incident.”
Keep the write path in Postgres; let ES lag rather than block.
Practical tasks (commands), what the output means, and the decision you make
These are the kinds of checks that turn “I think it’s slow” into “it’s slow because we’re doing X, and we can fix it by Y.”
The commands are runnable examples. Replace DB names and paths to match your environment.
Task 1: Check Postgres index sizes (is FTS eating your disk?)
cr0x@server:~$ psql -d appdb -c "\di+ public.*"
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+--------------------+-------+----------+-------------+-------------+---------------+--------+-------------
public | documents_fts_gin | index | app | documents | permanent | gin | 12 GB |
public | documents_pkey | index | app | documents | permanent | btree | 2 GB |
(2 rows)
What it means: Your GIN index is the big one. That’s normal—until it isn’t.
Decision: If the FTS index is dominating disk, evaluate whether you’re indexing too many fields, too much text, or suffering bloat from churn. If churn is high, plan VACUUM/REINDEX strategy.
Task 2: Inspect table and index bloat indicators (quick approximation)
cr0x@server:~$ psql -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum
------------+------------+------------+--------------------+-------------------------
documents | 42000000 | 9800000 | | 2025-12-29 03:12:01+00
events | 180000000 | 1200000 | 2025-12-28 01:02:11| 2025-12-29 02:44:09+00
(2 rows)
What it means: Dead tuples are high. Autovacuum is running, but it may not be keeping up.
Decision: Tune autovacuum for the hot tables or reduce update churn on indexed text columns. If dead tuples keep growing, expect performance cliffs and disk surprises.
Task 3: Validate your FTS query uses the GIN index (EXPLAIN ANALYZE)
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM documents WHERE fts @@ websearch_to_tsquery('english','backup policy');"
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=1234.00..56789.00 rows=1200 width=8) (actual time=45.210..62.118 rows=980 loops=1)
Recheck Cond: (fts @@ websearch_to_tsquery('english'::regconfig, 'backup policy'::text))
Heap Blocks: exact=8412
Buffers: shared hit=120 read=8410
-> Bitmap Index Scan on documents_fts_gin (cost=0.00..1233.70 rows=1200 width=0) (actual time=40.901..40.902 rows=980 loops=1)
Index Cond: (fts @@ websearch_to_tsquery('english'::regconfig, 'backup policy'::text))
Buffers: shared hit=10 read=2100
Planning Time: 0.322 ms
Execution Time: 62.543 ms
(10 rows)
What it means: It is using the GIN index, but it’s reading a lot of heap blocks from disk.
Decision: If reads dominate, add RAM (cache), reduce result set size with filters, or consider a covering strategy (store fewer large columns, use TOAST wisely, consider denormalization for just the search path).
Task 4: Check Postgres cache effectiveness (are you IO-bound?)
cr0x@server:~$ psql -d appdb -c "SELECT datname, blks_hit, blks_read, round(100.0*blks_hit/nullif(blks_hit+blks_read,0),2) AS hit_pct FROM pg_stat_database WHERE datname='appdb';"
datname | blks_hit | blks_read | hit_pct
--------+-----------+-----------+---------
appdb | 982341210 | 92341234 | 91.41
(1 row)
What it means: 91% hit ratio is okay-ish, not great for a DB that wants low latency. For FTS-heavy workloads, you usually want higher.
Decision: If hit ratio drops during peak search, you’re paying cloud-IO prices in latency. Consider more RAM, better indexing, or moving search read load off the primary (replica or ES).
Task 5: Find the slowest FTS queries (pg_stat_statements)
cr0x@server:~$ psql -d appdb -c "SELECT mean_exec_time, calls, rows, query FROM pg_stat_statements WHERE query ILIKE '%tsquery%' ORDER BY mean_exec_time DESC LIMIT 3;"
mean_exec_time | calls | rows | query
----------------+-------+------+-----------------------------------------------------------
812.44 | 1200 | 100 | SELECT ... WHERE fts @@ websearch_to_tsquery($1,$2) ORDER BY ...
244.10 | 8400 | 20 | SELECT ... WHERE fts @@ plainto_tsquery($1,$2) AND tenant_id=$3
(2 rows)
What it means: Your slowest searches are obvious now, not mythical.
Decision: Add LIMITs, narrow filters, adjust ranking/sorting strategy, or precompute ranking fields. If the slow queries are “global search across everything,” consider ES.
Task 6: Check autovacuum settings for a hot table (are you vacuuming too late?)
cr0x@server:~$ psql -d appdb -c "SELECT relname, reloptions FROM pg_class JOIN pg_namespace n ON n.oid=relnamespace WHERE n.nspname='public' AND relname='documents';"
relname | reloptions
-----------+---------------------------------------------
documents | {autovacuum_vacuum_scale_factor=0.02}
(1 row)
What it means: Someone already lowered the vacuum scale factor (good).
Decision: If bloat is still high, increase autovacuum workers, adjust cost limits, or schedule periodic REINDEX/pg_repack (with change control).
Task 7: Measure WAL volume (is search indexing inflating write costs?)
cr0x@server:~$ psql -d appdb -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),'0/0')) AS wal_since_boot;"
wal_since_boot
----------------
684 GB
(1 row)
What it means: WAL volume is huge; could be normal for a busy system, or a sign your indexed text is churning.
Decision: If WAL growth correlates with text updates, reduce update frequency, avoid rewriting whole documents, or move search projection to ES.
Task 8: Check Elasticsearch cluster health (baseline triage)
cr0x@server:~$ curl -s http://localhost:9200/_cluster/health?pretty
{
"cluster_name" : "search-prod",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 4,
"active_primary_shards" : 128,
"active_shards" : 256,
"unassigned_shards" : 12
}
What it means: Yellow means primaries are allocated but replicas aren’t. That’s reduced redundancy and can become a performance issue during recovery.
Decision: If unassigned replicas persist, fix allocation issues before you scale traffic. Don’t accept yellow as “fine” unless you have an explicit risk waiver.
Task 9: Count shards per node (are you paying the shard tax?)
cr0x@server:~$ curl -s http://localhost:9200/_cat/shards?v
index shard prirep state docs store ip node
docs-v7 0 p STARTED 912341 18gb 10.0.0.21 data-1
docs-v7 0 r STARTED 912341 18gb 10.0.0.22 data-2
docs-v7 1 p STARTED 901122 17gb 10.0.0.23 data-3
docs-v7 1 r STARTED 901122 17gb 10.0.0.24 data-4
...output truncated...
What it means: You can see per-shard store sizes and placement.
Decision: If you see hundreds or thousands of tiny shards (sub-GB), consolidate (fewer primary shards per index, rollover policies, reindex). Shards are a fixed overhead; pay it once, not forever.
Task 10: Inspect JVM heap pressure (is GC your hidden latency?)
cr0x@server:~$ curl -s http://localhost:9200/_nodes/stats/jvm?pretty | head -n 25
{
"cluster_name" : "search-prod",
"nodes" : {
"q1w2e3" : {
"name" : "data-1",
"jvm" : {
"mem" : {
"heap_used_in_bytes" : 21474836480,
"heap_max_in_bytes" : 25769803776
},
"gc" : {
"collectors" : {
"young" : { "collection_count" : 124234, "collection_time_in_millis" : 982341 },
"old" : { "collection_count" : 231, "collection_time_in_millis" : 412341 }
}
}
What it means: Heap usage is ~83% of max, and old GC counts are non-trivial.
Decision: If heap sits high with frequent old GC, reduce shard count, adjust caches, or resize heap (carefully). “Add heap” is not always the fix; it can reduce OS cache and hurt.
Task 11: Check indexing pressure via refresh and merge activity (are you IO-bound from writes?)
cr0x@server:~$ curl -s http://localhost:9200/_nodes/stats/indices/refresh,merges?pretty | head -n 40
{
"nodes" : {
"q1w2e3" : {
"indices" : {
"refresh" : {
"total" : 882341,
"total_time_in_millis" : 9123412
},
"merges" : {
"current" : 12,
"current_docs" : 402341,
"total_time_in_millis" : 22123412
}
What it means: High merge time and current merges suggest heavy write/segment activity, often IO-bound.
Decision: If indexing competes with search latency, tune refresh intervals, throttle ingest, or separate hot ingest indexes. Don’t “optimize” by disabling refresh blindly; you’ll just move pain around.
Task 12: Verify replica and snapshot posture (what’s your restore story?)
cr0x@server:~$ curl -s http://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count store.size
yellow open docs-v7 xYz 16 1 18000000 320gb
What it means: Yellow plus rep=1 suggests replicas exist but aren’t fully allocated (or you’re missing nodes).
Decision: If you rely on replicas for HA, make it green or reduce replica count with an explicit tradeoff. Also ensure you have snapshots and you’ve tested restore; replicas are not backups.
Task 13: Measure ingest lag (is ES “behind” Postgres?)
cr0x@server:~$ psql -d appdb -c "SELECT now() - max(updated_at) AS db_freshness FROM documents;"
db_freshness
--------------
00:00:03.421
(1 row)
cr0x@server:~$ curl -s http://localhost:9200/docs-v7/_search -H 'Content-Type: application/json' -d '{"size":1,"sort":[{"updated_at":"desc"}],"_source":["updated_at"]}' | jq -r '.hits.hits[0]._source.updated_at'
2025-12-30T10:41:12Z
What it means: If Postgres max(updated_at) is newer than ES’s newest document timestamp, your pipeline is lagging.
Decision: Decide whether eventual consistency is acceptable. If not, fix ingest throughput, add backpressure, or route “freshness-critical” queries to Postgres.
Task 14: Postgres vs ES latency sampling (stop guessing)
cr0x@server:~$ time psql -d appdb -c "SELECT count(*) FROM documents WHERE fts @@ websearch_to_tsquery('english','error budget');"
count
-------
1242
(1 row)
real 0m0.219s
user 0m0.010s
sys 0m0.005s
cr0x@server:~$ time curl -s http://localhost:9200/docs-v7/_search -H 'Content-Type: application/json' -d '{"query":{"match":{"body":"error budget"}},"size":0}' | jq '.took'
37
real 0m0.061s
user 0m0.012s
sys 0m0.004s
What it means: ES query “took” is 37ms and end-to-end ~61ms; Postgres is ~219ms here.
Decision: If ES is consistently faster and your pipeline is healthy, ES may be cheaper in user experience terms—while costing more in ops. If Postgres is “fast enough,” don’t buy an extra cluster to shave 150ms.
Task 15: Check Linux IO saturation (the universal villain)
cr0x@server:~$ iostat -xz 1 3
Linux 6.5.0 (db-1) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.1 0.0 4.3 18.7 0.0 64.9
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 820.0 410.0 81200 32100 14.2 0.9 98.7
What it means: %util ~99% and await ~14ms: storage is saturated.
Decision: Before rewriting queries, address IO: faster volumes, more memory, less bloat, better caching, or move search load elsewhere. Saturated disks turn “fine” systems into pager factories.
Fast diagnosis playbook: find the bottleneck in minutes
First: decide if it’s compute, IO, or coordination
- Check end-to-end latency. App metrics: p50/p95/p99 for search endpoint. If p99 is bad, focus there.
- Check system IO saturation. Run iostat -xz on DB/search nodes. High await and %util is your “stop everything” signal.
- Check CPU and memory. If CPU is pegged and IO is fine, you’re compute-bound (ranking, merges, GC, or too much JSON parsing).
Second: isolate where time is spent
- Postgres: use EXPLAIN (ANALYZE, BUFFERS). If buffers show many reads, it’s IO/cache. If it’s CPU, you’ll see high time with mostly hits.
- Elasticsearch: compare took vs client time. If “took” is low but client time is high, you have network, load balancer, TLS, or threadpool queuing issues.
- Pipeline: check ingest lag. “Search is wrong” is often “ES is behind.”
Third: pick the smallest safe lever
- Constrain queries. Add LIMIT, add filters, remove pathological query features.
- Fix bloat/shards. Vacuum/reindex (Postgres) or consolidate shards / tune ILM (ES).
- Add hardware only after. RAM helps both systems, but it’s the most expensive band-aid when the underlying design is broken.
Three corporate mini-stories from the field
Incident caused by a wrong assumption: “Search is eventually consistent, but users won’t notice”
A mid-size B2B SaaS added Elasticsearch to improve search across customer tickets. They wired a CDC stream from Postgres into an indexer service and shipped.
Early metrics looked great: lower p95 latency, happier support team, fewer database spikes.
Then the first compliance audit week hit. Support staff were searching for newly created tickets and not seeing them. They re-created tickets, attached files twice, and escalated to engineering.
The engineering team assumed “refresh interval” was a minor tuning knob. They had increased it to reduce indexing overhead during a previous load test.
During audit week, ticket volume spiked. Ingest lag increased, the indexer fell behind, and the refresh interval hid new documents longer.
The failure mode wasn’t “Elasticsearch is down.” It was worse: the system was up and returning plausible-but-wrong results.
The fix was operational, not philosophical: tighten SLOs around freshness, measure lag explicitly, and route “just created” views to Postgres for a short window.
They also added a “search may take up to N seconds to reflect changes” banner for internal workflows—boring, honest UX that prevented duplicate work.
Optimization that backfired: “Let’s reduce Postgres load by indexing everything once”
Another company leaned into Postgres FTS for a knowledge base. It worked fine until the product team requested richer ranking.
An engineer decided to precompute a monster tsvector including title, body, tags, comments, and extracted PDF text—everything.
Query latency improved at first because fewer joins and fewer computations were needed per request. The team celebrated and moved on.
Two months later, the database started growing faster than expected. Autovacuum fell behind. IO went up. Read replicas started lagging.
The “optimization” increased write amplification. Any edit to any part of the document rewrote a larger row and updated a larger GIN index entry set.
The system didn’t fail loudly. It failed slowly: higher p95, occasional timeouts, then a weekend of emergency vacuum tuning and disk expansion.
They recovered by shrinking what was indexed, splitting rarely-changed extracted text into a separate table, and only updating its vector when that text changed.
The lesson: in Postgres, you can buy query speed with write cost. If you don’t measure the write cost, it will invoice you later.
Boring but correct practice that saved the day: “Rebuildable search and tested restores”
A fintech ran both Postgres and Elasticsearch. Search wasn’t a system of record, but it was customer-facing and revenue-adjacent.
Their platform team insisted on a quarterly “search rebuild” game day: drop an index, rebuild from source, measure time, and validate counts and spot-check results.
Nobody loved it. It wasn’t glamorous, and it never produced a roadmap slide.
What it produced was confidence: they knew how long a reindex took, what it cost, and where the bottlenecks lived.
One day, a mapping mistake slipped into production, causing certain queries to behave oddly. The team didn’t panic.
They rolled forward to a new index version behind an alias, reindexed in the background, and switched traffic when validation passed.
Customers saw a minor relevance wobble for a short period, not a full outage. The company didn’t need a war room with twelve people and one exhausted incident commander.
Boring practice saved the day, which is the highest compliment in operations.
Common mistakes: symptom → root cause → fix
1) “Postgres search got slow over time”
Symptom: p95 latency increases month by month; disk usage climbs; autovacuum runs constantly.
Root cause: Table and GIN index bloat from high update/delete churn on indexed text fields; autovacuum not tuned for large tables.
Fix: Lower autovacuum scale factors on hot tables, increase vacuum resources, reduce update churn, and consider periodic REINDEX/pg_repack during maintenance windows.
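The per-table knobs look like this; the values are examples to adapt, not recommendations (this is the same reloption you’d see in Task 6):
cr0x@server:~$ psql -d appdb -c "ALTER TABLE documents SET (autovacuum_vacuum_scale_factor = 0.02, autovacuum_vacuum_cost_limit = 2000);"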
2) “Elasticsearch is fast in tests but slow in production”
Symptom: Lab benchmarks are great; production has tail latency spikes and timeouts.
Root cause: Too many shards, heap pressure and GC, or merges competing with queries under real ingest load.
Fix: Reduce shard count, correct shard sizing, tune refresh interval and ingest rate, verify heap/page cache balance, and monitor merge backlog.
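Refresh interval is a dynamic index setting, so relaxing it on a hot index is one line; 30s is an example, not a recommendation, and it directly lengthens the “why isn’t my write searchable yet” window:
cr0x@server:~$ curl -s -X PUT http://localhost:9200/docs-v7/_settings -H 'Content-Type: application/json' -d '{"index":{"refresh_interval":"30s"}}'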
3) “Search results are missing fresh data”
Symptom: Users can’t find newly created/updated records, but the record exists in Postgres.
Root cause: Ingest pipeline lag, refresh interval too high, or failed indexing events without alerting.
Fix: Instrument lag, alert on backlog, add idempotent retries, and provide a freshness strategy (read-your-writes via Postgres or sticky sessions).
4) “Search caused a database incident”
Symptom: CPU and IO spike on primary; unrelated transactional endpoints slow down.
Root cause: Search endpoint runs expensive queries on primary without LIMITs; no read isolation; poor query constraints.
Fix: Route search to read replicas, enforce query budgets, add LIMIT, require filters, and cache common queries.
5) “We can’t change the analyzer/mapping without downtime”
Symptom: Any relevance change becomes a scary reindex project.
Root cause: No index versioning/aliases; no rehearsal of reindex workflows.
Fix: Adopt versioned indices with aliases, automate reindex pipelines, and practice a rebuild regularly.
6) “Elasticsearch is green but queries still time out”
Symptom: Cluster health is green; users see intermittent timeouts.
Root cause: Threadpool saturation, slow queries, or coordination overhead (e.g., heavy aggregations) despite healthy shard allocation.
Fix: Identify slow queries, add query timeouts/circuit breakers at the application layer, reduce aggregation cardinality, and precompute fields.
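A cheap starting point is the request-level timeout in the search body; it’s best-effort (shards that hit it return partial results and set timed_out), so the application still needs its own deadline:
cr0x@server:~$ curl -s http://localhost:9200/docs-v7/_search -H 'Content-Type: application/json' -d '{"timeout":"200ms","query":{"match":{"body":"error budget"}},"size":20}'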
7) “Postgres FTS ranking feels ‘off’”
Symptom: Results contain matches but ordering seems wrong to humans.
Root cause: Missing weights, wrong config/language dictionary, or mixing fields without structure.
Fix: Use weighted vectors per field, choose correct regconfig, consider phrase queries, and validate with a curated test set.
Checklists / step-by-step plan
Step-by-step: choose Postgres FTS (and keep it cheap)
- Define query constraints. Decide what users can search: fields, operators, max length, max tokens.
- Model the vector. Keep it intentional: title + body + a few metadata fields, weighted.
- Index with GIN. Build the GIN index and validate with EXPLAIN that it’s used.
- Budget for churn. If you update documents frequently, tune autovacuum early, not after bloat shows up.
- Protect the primary. Put search on replicas when traffic grows. The cheapest database incident is the one you don’t have.
- Measure. Track query latency, buffer reads, dead tuples, and disk growth monthly.
Step-by-step: choose Elasticsearch (and avoid the expensive traps)
- Design the ingest pipeline first. Outbox/CDC, retries, idempotency, backfill.
- Define index versioning. Use aliases so you can reindex without downtime (a reindex-and-alias-swap sketch follows this list).
- Size shards intentionally. Pick shard sizes that keep recovery reasonable and overhead low; avoid tiny shards.
- Plan ILM/retention. Hot/warm/cold if needed; at least have rollover and delete policies.
- Set SLOs for freshness. Measure lag, not vibes.
- Test restores. Snapshot/restore drills are not optional if search matters.
- Write runbooks. Node failure, cluster yellow/red, reindex, mapping change, heap pressure response.
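Here is that reindex-and-swap sketch, using the built-in _reindex and _aliases APIs. The docs-v7/docs-v8 indices and the docs alias are illustrative names, and docs-v8 is assumed to already exist with the corrected mapping (_reindex copies documents, not mappings):
cr0x@server:~$ curl -s -X POST http://localhost:9200/_reindex -H 'Content-Type: application/json' -d '{"source":{"index":"docs-v7"},"dest":{"index":"docs-v8"}}'
cr0x@server:~$ curl -s -X POST http://localhost:9200/_aliases -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"docs-v8","alias":"docs"}},{"remove":{"index":"docs-v7","alias":"docs"}}]}'
The alias actions apply atomically, so clients querying the docs alias never see a half-migrated view.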
Step-by-step: hybrid approach (most teams should start here)
- Start with Postgres FTS for core flows and learn real query patterns.
- Instrument search behavior. Log queries (safely), measure latency, capture “no results” cases.
- Graduate to Elasticsearch only for features that truly need it (facets, fuzzy, heavy relevance iteration).
- Keep “read-your-writes” in Postgres for freshness-critical UI elements.
- Make ES rebuildable. If you can’t rebuild it, you don’t own it; it owns you.
FAQ
1) Is Postgres full-text search “good enough” for customer-facing search?
Often, yes—especially for B2B apps where users know what they’re looking for. If you need heavy faceting, fuzzy matching, synonyms at scale, and rapid relevance iteration, ES is usually a better fit.
2) What is the biggest hidden long-term cost of Elasticsearch?
The pipeline and the operational discipline: index versioning, shard sizing, reindex workflows, and monitoring freshness. The cluster is the easy part; keeping it correct is the cost.
3) What is the biggest hidden long-term cost of Postgres FTS?
Write amplification and bloat. If you index large mutable text fields and update them often, your GIN index and table can grow and slow in ways that look like “mysterious degradation.”
4) Can read replicas make Postgres search “as good as” a search cluster?
Replicas help isolate load, but they don’t turn Postgres into a relevance engine. They make “cheap and stable” scale further. They don’t give you ES-style analyzers and aggregation ergonomics.
5) Is it safe to dual-write to Postgres and Elasticsearch?
It can be, but it’s rarely the simplest safe option. Dual-write introduces consistency problems during partial failures. Prefer outbox/CDC so Postgres remains the authoritative write and ES is a projection.
6) How do I decide based on data size alone?
Data size is a weak predictor without query patterns and update rate. Tens of millions of rows can be fine in Postgres FTS with good constraints and hardware. A few million docs can be painful in ES if you create thousands of tiny shards.
7) Are Elasticsearch replicas a backup?
No. Replicas protect against node loss, not operator mistakes, bad mappings, accidental deletions, or corruption propagated across copies. You still need snapshots and tested restores.
8) What about using Postgres trigram search instead of full-text?
Trigrams are great for substring and fuzzy-ish matching on shorter fields (names, identifiers) and can complement FTS. They can also be expensive if misused on large blobs of text. Use the right index for the question.
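A pg_trgm sketch for a short field; the customers table and name column are placeholders:
cr0x@server:~$ psql -d appdb <<'SQL'
-- Trigram GIN index accelerates ILIKE '%term%' and similarity() on short text.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX customers_name_trgm ON customers USING gin (name gin_trgm_ops);
SELECT id, name FROM customers WHERE name ILIKE '%acme%' LIMIT 20;
SQL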
9) What if we need both search and analytics?
If you’re pushing heavy aggregations and dashboards into ES, you’re effectively running an analytics workload on a search cluster. That can work, but it increases resource contention. Separate concerns if it starts to fight itself.
10) How do we keep long-term costs predictable?
Measure the drivers: Postgres bloat/WAL volume and ES shard count/heap pressure/ingest lag. Costs become predictable when growth rates are visible and tied to capacity plans, not surprise incidents.
Practical next steps
If you want the cheapest long-term answer for most product teams: start with Postgres FTS, make it disciplined, and only add Elasticsearch when you can name the feature gap in one sentence.
“Because everyone uses Elasticsearch” is not a gap; it’s a career-limiting move with a monthly invoice.
- Inventory your real requirements. Facets? Fuzzy? Synonyms? Multi-language? Freshness constraints?
- Run the tasks above on your current system. Get index sizes, bloat signals, shard counts, heap pressure, and ingest lag.
- Decide which cost you’re paying: database contention (Postgres) or pipeline + cluster operations (ES).
- If choosing Postgres: enforce query budgets, isolate with replicas, tune autovacuum, and keep vectors small and meaningful.
- If choosing Elasticsearch: build a rebuildable projection with aliases, sane shard sizing, freshness SLOs, and practiced restore/reindex procedures.
Cheap long-term search is not a brand name. It’s a set of habits: constrain queries, measure the right counters, and treat index maintenance like the production workload it is.