MariaDB vs Elasticsearch for Site Search: When the Search Cluster Is Mandatory

Your site search is “fine” until it isn’t: a product launch, a viral link, or one enthusiastic customer typing
“wireless noise cancelling…” and your database CPU hits the ceiling. Then it’s pages timing out, relevance turning
into randomness, and someone suggests caching everything “until we fix it.” That’s how outages get commemorative mugs.

This is the practical line between “MariaDB is enough” and “you need a search cluster.” Not as ideology. As operations:
tail latency, reindexing, ranking control, failure domains, and how to diagnose what’s actually slow.

The real question: what are you optimizing for?

“MariaDB vs Elasticsearch” is a misleading framing. In production, you don’t choose a technology; you choose which
failure modes you’re willing to own.

MariaDB is great when search is a filtered lookup problem. Elasticsearch is great when search is a language and
relevance problem. The moment you need both at scale—fast filters, fuzziness, synonyms, autocomplete, “did you mean,”
ranking tuning, multi-field scoring—you’ve moved from “querying records” to “running a retrieval system.”
Retrieval systems want their own infrastructure, because the work they do is different.

Here’s the production-minded rule: use MariaDB for search when the query is predictable, bounded, and can be indexed
with tight selectivity. Use Elasticsearch when you need to serve intent, not exact matches, and you need to control
relevance without rewriting your SQL every week.

The trap is thinking Elasticsearch is “for scale” and MariaDB is “for small.” That’s backwards. Both can scale.
Elasticsearch is for search behaviors that relational engines aren’t designed to do cheaply or consistently.

Interesting facts and historical context (short, useful)

  • Lucene predates Elasticsearch. The core indexing/search library (Lucene) started in 1999; Elasticsearch (built on Lucene) arrived around 2010 and packaged it for distributed use.
  • BM25 became the default relevance baseline. Lucene moved from classic TF/IDF to BM25 as a more modern, tunable scoring model. This matters because “relevance” is math, not vibes.
  • MySQL/MariaDB FULLTEXT was built for documents, but with constraints. It’s useful, but language analysis, scoring controls, and query composition are limited compared to Lucene-based engines.
  • InnoDB FULLTEXT came later than many people remember. Historically, FULLTEXT was associated with MyISAM; InnoDB support showed up later, and operational expectations lagged behind reality.
  • Elasticsearch made “near real-time” indexing popular. The refresh interval concept (segments becoming searchable after refresh) is a deliberate design tradeoff between throughput and freshness.
  • Distributed search is a coordination problem. The hard part isn’t indexing; it’s shards, replicas, routing, merges, and the very human desire to change mappings after you shipped.
  • SQL LIKE became the original sin of site search. People still ship it because it works on day one. Day thirty comes with a CPU invoice.
  • Search appliances came before cloud “managed search.” Enterprises used proprietary search boxes long before today’s hosted options; the operational pain is not new, just repackaged.

What MariaDB can do for site search (and where it cracks)

MariaDB search is great when it’s really an index lookup

MariaDB shines when your “search” is mostly:

  • Exact matches (SKU, ID, email, username).
  • Prefix matches that can use an index (e.g., normalized keyword table).
  • Filtering/sorting across structured columns (category, price, status, permissions).
  • Small-ish datasets where full scans are still cheap and predictable.

The database gives you strong consistency, easy joins, transactions, and a single operational surface.
If your product needs “find products with color=blue and price < 50 and in stock,” MariaDB is the adult in the room.
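
A minimal sketch of that query, assuming a hypothetical products table with a composite index whose column order matches the predicate:

# Assumes an index like (category_id, in_stock, price): equality columns first, the range column last.
mysql -e "
  SELECT id, name, price
  FROM products
  WHERE category_id = 12 AND in_stock = 1 AND price < 50
  ORDER BY price
  LIMIT 20;"

The same index serves both the filter and the sort, which is exactly the predictable, bounded behavior MariaDB rewards.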

FULLTEXT: useful, but don’t pretend it’s a search engine

MariaDB FULLTEXT can absolutely support a basic site search. It’s often the right choice when:

  • You have one language, limited morphology needs, and mostly literal matching.
  • You can accept coarse relevance and you’re not tuning ranking weekly.
  • The corpus is not huge, and query concurrency is moderate.
  • You don’t need fancy analyzers (synonyms, stemming rules per field, n-grams for autocomplete).
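
A basic version really is basic; a sketch, assuming a FULLTEXT index already exists on (title, body) of a hypothetical articles table:

# MATCH() in the SELECT list reuses the same expression as the WHERE clause and exposes the relevance score.
mysql -e "
  SELECT id, title,
         MATCH(title, body) AGAINST ('wireless headphones' IN NATURAL LANGUAGE MODE) AS score
  FROM articles
  WHERE MATCH(title, body) AGAINST ('wireless headphones' IN NATURAL LANGUAGE MODE)
  ORDER BY score DESC
  LIMIT 20;"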

But FULLTEXT is where optimism goes to get budget approval and then dies in production.
The cracks show up in four places:

  1. Ranking control: you get fewer levers, and the levers are less composable. If you need “title matches win unless body matches strongly and recency matters,” you’ll end up building custom scoring logic outside SQL.
  2. Language analysis: tokenization, stopwords, stemming, synonyms—this is not where relational engines focus. You can patch it, but you’ll be patching forever.
  3. Operational isolation: heavy search load competes with OLTP writes. When search spikes, your checkout traffic shouldn’t learn about it.
  4. Query flexibility: fuzziness, typo tolerance, phrase queries, proximity, and multi-field boosting are possible in some form but painful to evolve without rewriting.

The big production smell: your DB becomes the search box

If you run search inside MariaDB, you’re tying user typing behavior to your primary datastore. That means:

  • Every keystroke on an autocomplete endpoint can become a DB query.
  • Search can trigger fulltext lookups that don’t cache well.
  • Slow queries steal buffer pool and CPU from transactions.

This is where teams end up “solving” search by throwing read replicas at it. It works until it doesn’t, and it shifts
failure into replication lag and inconsistent results. It’s not a strategy; it’s a delay tactic.

Joke #1: If you use LIKE '%term%' in production, the database will eventually match you with an incident call.

What Elasticsearch is actually buying you

It’s a purpose-built retrieval engine with opinionated tradeoffs

Elasticsearch is not “a database replacement.” It is an indexing and retrieval system optimized for:

  • Fast full-text queries over large corpora.
  • Flexible query DSL: boolean logic, phrase/proximity, fuzziness, boosting, field weights.
  • Text analysis pipelines: tokenizers, filters, synonyms, stemming, n-grams, per-field analyzers.
  • Horizontal scale with shards and replicas.
  • Near real-time updates (refresh cycles) rather than strict transactional consistency.
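
For flavor, a sketch of that query DSL in a single request (the index name, fields, and boosts are assumptions, not recommendations):

# Typo-tolerant, field-weighted text matching plus cheap structured filters in one call.
curl -s -X POST 'http://localhost:9200/products_current/_search' \
  -H 'Content-Type: application/json' -d '{
  "size": 20,
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "wireles headphones",
          "fields": ["title^3", "body"],
          "fuzziness": "AUTO"
        }
      },
      "filter": [
        { "term":  { "in_stock": true } },
        { "range": { "price": { "lt": 50 } } }
      ]
    }
  }
}'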

Operationally, you’re buying isolation and different knobs

The production reason to run Elasticsearch is isolation: search load belongs in a separate cluster with its own scaling,
its own performance envelope, and its own failure blast radius.

Elasticsearch gives you knobs MariaDB doesn’t:

  • Shard-level parallelism: fan-out queries across shards, then merge results. You pay coordination overhead; you gain throughput.
  • Index lifecycle management: rollover, retention, segment merges, and tiering can be planned, not guessed.
  • Reindexing as a first-class operation: schema changes are painful, but the ecosystem expects reindexing and provides workflows (see the sketch after this list).
  • Relevance iteration speed: changing analyzers, boosting, synonyms, and query logic can happen without redesigning the relational schema.
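
On the reindexing point, the copy itself is one API call; what you copy into is your decision (the version names here are assumptions):

# Rebuild the corpus into a new index created with the corrected mapping and analyzers.
curl -s -X POST 'http://localhost:9200/_reindex' \
  -H 'Content-Type: application/json' -d '{
  "source": { "index": "products_v41" },
  "dest":   { "index": "products_v42" }
}'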

What you also buy: new failure modes

Elasticsearch will happily let you create a cluster that looks healthy and performs terribly. Typical culprits:

  • Shard explosion: too many small shards, too much overhead.
  • Segment merge pressure: indexing throughput collapses under merges.
  • Heap pressure and GC pauses: tail latency becomes a sawtooth graph.
  • Mapping mistakes: fields indexed incorrectly, leading to expensive queries or broken relevance.
  • Refresh interval too low: “why is indexing so slow?” because you made every second a commit party.

And yes, it’s another cluster. Another set of backups/snapshots, upgrades, and on-call surprises. But if search is
a product feature, not a checkbox, the cluster is not optional. It’s the cost of doing business.

When the search cluster is mandatory (decision triggers)

“Mandatory” is a strong word. So here are strong, operational triggers. If you have two or more, stop debating and
start planning the search cluster.

1) Autocomplete and typeahead at real traffic levels

Autocomplete changes the query profile: instead of one search per session, you get several per user interaction.
The queries are short, low-selectivity, and hard to cache. MariaDB can do it, but it will do it loudly.
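
The usual shape of the fix is a dedicated suggest index: edge n-grams at index time, a plain analyzer at query time. A sketch, with assumed names and gram sizes you would tune:

# Index-time edge n-grams ("he", "hea", "head", ...) so prefixes hit the index; the typed query itself is not n-grammed.
curl -s -X PUT 'http://localhost:9200/suggest_v1' \
  -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "edge_2_15": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_2_15"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}'

Pair it with frontend debouncing, a minimum query length, and a hard result cap; the analyzer alone does not save you from keystroke storms.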

2) Relevance is a product KPI, not a nice-to-have

If stakeholders talk about “quality of results,” “merchandising,” “boost this category,” “demote out-of-stock,”
“synonyms,” or “typo tolerance,” you’re in search-engine territory. FULLTEXT relevance controls are not where you want
to spend your career.

3) Multi-language, morphology, or domain language

Once you need analyzers per language, custom tokenization (part numbers, chemical names, legal citations),
or synonyms that change without deployments, Elasticsearch becomes the practical choice.

4) You need hybrid ranking signals

Search results increasingly depend on signals like popularity, recency, click-through, personalization, availability,
and business rules. Elasticsearch can blend these with function scores and structured filters without turning
SQL into a performance art piece.
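
A sketch of that blending with function_score (field names, scales, and weights are assumptions you would tune against real behavior):

# Text relevance first, then multiply in popularity, recency decay, and a demotion for out-of-stock items.
curl -s -X POST 'http://localhost:9200/products_current/_search' \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "function_score": {
      "query": { "match": { "title": "wireless headphones" } },
      "functions": [
        { "field_value_factor": { "field": "popularity", "modifier": "log2p", "missing": 0 } },
        { "gauss": { "updated_at": { "origin": "now", "scale": "30d", "decay": 0.5 } } },
        { "filter": { "term": { "in_stock": false } }, "weight": 0.3 }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}'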

5) Your OLTP database is a sacred resource

If MariaDB is also handling money, orders, auth, inventory, or anything that wakes executives up, you want search
outside it. Search spikes should not be able to starve transactional throughput.

6) You require “good enough” freshness, not strict consistency

Elasticsearch is near real-time by design. If “items become searchable within ~1–30 seconds” is acceptable, you can
decouple indexing from writes. If you truly need read-after-write consistency for search results, you either design for
it explicitly (hybrid read path, UI hints, or synchronous indexing) or stay in the database and accept the constraints.

7) You need zero-downtime schema evolution of search

You will change mappings/analyzers. If your organization can’t tolerate search downtime while you migrate indexes,
you want an engine with built-in reindex workflows and alias cutovers.

Joke #2: A search cluster is like a second espresso machine—nobody wants to maintain it, but everyone notices when it’s gone.

The sane default: hybrid architecture

Pattern: MariaDB as source of truth, Elasticsearch as read-optimized index

In most real companies, the best answer is both:

  • MariaDB remains the source of truth for entities, permissions, transactions, and referential integrity.
  • Elasticsearch stores a denormalized, query-optimized document per entity (product, article, ticket, listing).
  • Search requests hit Elasticsearch first; results return IDs; the app optionally hydrates details from MariaDB.

The working agreement in practice: accept eventual consistency, but design the gap.
You build an indexing pipeline, you monitor lag, and you decide how the UI behaves when the index is behind.

Indexing pipeline: your reliability is here, not in the query

The pipeline can be:

  • CDC (change data capture) from MariaDB binlogs → stream → indexer → Elasticsearch.
  • Outbox pattern: application writes entity + outbox event in the same DB transaction; a worker consumes and indexes.
  • Periodic batch reindexing for simpler systems, with incremental updates where possible.

The outbox pattern is boring. That’s why it works. It turns “did we index this?” into a retriable job with a durable ledger.
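
A minimal sketch of that ledger in MariaDB (names are placeholders; the properties that matter are the same-transaction insert and durable retry bookkeeping):

# The entity change and its outbox event commit or roll back together; a worker polls unprocessed rows and indexes them.
mysql -e "
  CREATE TABLE search_outbox (
    id           BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    entity_type  VARCHAR(32) NOT NULL,
    entity_id    BIGINT UNSIGNED NOT NULL,
    op           ENUM('upsert','delete') NOT NULL,
    created_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP NULL,
    attempts     INT NOT NULL DEFAULT 0,
    KEY idx_unprocessed (processed_at, id)
  ) ENGINE=InnoDB;

  -- Application write path: both statements succeed or neither does.
  START TRANSACTION;
  UPDATE products SET price = 39.99 WHERE id = 1234;
  INSERT INTO search_outbox (entity_type, entity_id, op) VALUES ('product', 1234, 'upsert');
  COMMIT;"

Failures just increment attempts and stay visible; that visibility is the whole point.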

Key design choices that prevent future pain

  • Immutable index versions: create products_v42, then switch an alias products_current (the one-call cutover is sketched after this list). Never “edit in place” when analyzers change.
  • Store source fields needed for rendering: avoid “hydration storms” where every search result becomes N database queries.
  • Separate query types: autocomplete index vs full search index vs category browsing index. Different analyzers, different shard strategies.
  • Explicit freshness SLO: “99% of updates searchable within 30s.” Then measure it.
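
The cutover from the first bullet is a single atomic aliases call; a sketch using those version names:

# Readers keep querying products_current; the swap from v41 to v42 is atomic from their point of view.
curl -s -X POST 'http://localhost:9200/_aliases' \
  -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "products_v41", "alias": "products_current" } },
    { "add":    { "index": "products_v42", "alias": "products_current" } }
  ]
}'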

One quote, because operations is philosophy with pager duty

“Hope is not a strategy.” — attribution varies, but operations lives by it.

Production tasks: commands, outputs, decisions (12+)

These are the kinds of checks you run when someone says “search is slow” and everyone stares at the database.
Each task includes a runnable command, a realistic output snippet, what it means, and the decision you take.

Task 1: Identify top MariaDB queries by time (Performance Schema)

cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, ROUND(AVG_TIMER_WAIT/1e12,4) AS avg_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 3\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT * FROM products WHERE name LIKE ? ORDER BY updated_at DESC LIMIT ?
 COUNT_STAR: 248901
   total_s: 9123.55
     avg_s: 0.0367
*************************** 2. row ***************************
DIGEST_TEXT: SELECT id FROM products WHERE MATCH(title,body) AGAINST (? IN BOOLEAN MODE) LIMIT ?
 COUNT_STAR: 55432
   total_s: 2101.22
     avg_s: 0.0379

What it means: You’re spending hours of CPU time on LIKE and FULLTEXT queries. Average latency looks tolerable, but volume makes it dominant.

Decision: If these are user-facing search endpoints, isolate them. At minimum: stop LIKE wildcard searches; consider Elasticsearch if relevance/autocomplete are required.

Task 2: Confirm whether LIKE is using an index (EXPLAIN)

cr0x@server:~$ mysql -e "EXPLAIN SELECT * FROM products WHERE name LIKE '%headphones%' LIMIT 20\G"
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
         type: ALL
possible_keys: idx_products_name
          key: NULL
      key_len: NULL
          rows: 1843921
         Extra: Using where

What it means: Full table scan. Your index on name is useless with a leading wildcard.

Decision: Replace with FULLTEXT where acceptable, or move to Elasticsearch for substring/typo-tolerant matching.
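
If FULLTEXT is acceptable, the replacement looks like this (hypothetical column; note that FULLTEXT matches whole words, optionally with a trailing * for word prefixes, not arbitrary substrings):

# Word and word-prefix matching via the fulltext index instead of a full table scan.
mysql -e "
  ALTER TABLE products ADD FULLTEXT INDEX ft_products_name (name);
  SELECT id, name
  FROM products
  WHERE MATCH(name) AGAINST ('+headphone*' IN BOOLEAN MODE)
  LIMIT 20;"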

Task 3: Check MariaDB buffer pool and read pressure

cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| Innodb_buffer_pool_reads| 18499231 |
+-------------------------+----------+
+----------------------------------+-------------+
| Variable_name                    | Value       |
+----------------------------------+-------------+
| Innodb_buffer_pool_read_requests | 83199231121 |
+----------------------------------+-------------+

What it means: The ratio is fine-ish, but absolute disk reads are large. Search queries may be blowing cache locality.

Decision: If OLTP is impacted, isolate search. Increasing buffer pool helps until it doesn’t; better to move search IO patterns off the primary DB.

Task 4: Spot replication lag if you pushed search to replicas

cr0x@server:~$ mysql -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 47

What it means: Replica is behind by ~47 seconds. Your search results may be stale, and reads can pile up.

Decision: Either accept staleness explicitly (UX hint, eventual model) or stop using replicas as a search crutch. This is a strong signal for a search index.

Task 5: Measure MariaDB FULLTEXT behavior (and whether it’s selective)

cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM products WHERE MATCH(title,body) AGAINST ('wireless headphones' IN NATURAL LANGUAGE MODE) LIMIT 20\G"
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: products
         type: fulltext
possible_keys: ft_title_body
          key: ft_title_body
      key_len: 0
          rows: 120000
         Extra: Using where

What it means: FULLTEXT is used, but it expects to examine ~120k rows for that query. Not terrible once; terrible at scale.

Decision: If you need high QPS, autocomplete, or multiple fields with boosting, switch to Elasticsearch and design shard sizing accordingly.

Task 6: Check Elasticsearch cluster health quickly

cr0x@server:~$ curl -s http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "search-prod",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 48,
  "active_shards" : 48,
  "unassigned_shards" : 48
}

What it means: Yellow with unassigned replicas. You can serve reads, but you have no redundancy for many shards.

Decision: Treat as a reliability incident. Fix allocation (disk watermarks, node loss, replica counts) before tuning performance; otherwise you’re tuning a limping cluster.
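
To see why a specific replica is unassigned, ask the cluster directly (the index and shard below are examples):

# Reports which allocation decider is blocking this shard copy: disk watermarks, awareness rules, missing nodes, etc.
curl -s -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' -d '{
  "index": "products_v42",
  "shard": 0,
  "primary": false
}'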

Task 7: Identify shard explosion (shard count and sizes)

cr0x@server:~$ curl -s http://localhost:9200/_cat/shards?v | head
index           shard prirep state   docs   store ip         node
products_v42    0     p      STARTED 812341 3.2gb 10.0.1.21  es-data-1
products_v42    0     r      UNASSIGNED
products_v42    1     p      STARTED 799002 3.1gb 10.0.1.22  es-data-2
products_v42    1     r      UNASSIGNED

What it means: You have replicas unassigned, and shard sizes are small-ish. If there are hundreds/thousands of shards like this, overhead will dominate.

Decision: Consolidate shards (fewer primaries), fix replica allocation, and stop creating one index per tenant unless you like pain.

Task 8: Check search latency distribution (not averages)

cr0x@server:~$ curl -s http://localhost:9200/_nodes/stats/indices/search?pretty | egrep -A3 "query_total|query_time_in_millis"
"query_total" : 9012331,
"query_time_in_millis" : 81233122,
"query_current" : 37

What it means: Aggregate stats show load, not tail latency. But a persistently high query_current under load hints at saturation.

Decision: Correlate with application p95/p99 and threadpool rejection metrics next. Don’t declare victory from a mean.

Task 9: Check for threadpool rejections (classic “it’s slow” symptom)

cr0x@server:~$ curl -s http://localhost:9200/_nodes/stats/thread_pool?pretty | egrep -A6 "\"search\"|rejected"
"search" : {
  "threads" : 13,
  "queue" : 1000,
  "active" : 13,
  "rejected" : 21844,
  "largest" : 13,
  "completed" : 99123312
}

What it means: Search requests are being rejected. Users see timeouts and intermittent failures.

Decision: Scale out, reduce query cost (filters, fielddata, deep pagination), add caching, or throttle. Rejections mean you’re past capacity right now.

Task 10: Detect slow Elasticsearch queries via the slowlog

cr0x@server:~$ sudo tail -n 5 /var/log/elasticsearch/search-prod_index_search_slowlog.log
[2025-12-30T10:21:41,902][WARN ][index.search.slowlog.query] [es-data-2] [products_v42][1] took[2.8s], took_millis[2803], total_hits[200000], search_type[QUERY_THEN_FETCH], source[{"query":{"match":{"body":{"query":"wireless headphones","operator":"and"}}},"sort":["_score",{"updated_at":"desc"}],"from":0,"size":10000}]

What it means: Deep pagination (size:10000) plus sorting is expensive and often unnecessary.

Decision: Implement search_after or cap size, return fewer results, or use a separate browsing endpoint with precomputed sorts.
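
A sketch of the search_after replacement (the sort values come from the last hit of the previous page; the tiebreaker field id is an assumed keyword copy of the document ID):

# Cursor-style pagination: no from/size window growing with page depth.
curl -s -X POST 'http://localhost:9200/products_v42/_search' \
  -H 'Content-Type: application/json' -d '{
  "size": 50,
  "query": { "match": { "body": "wireless headphones" } },
  "sort": [ { "updated_at": "desc" }, { "id": "asc" } ],
  "search_after": [1735554101000, "prod-18211"]
}'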

Task 11: Check segment count and merge pressure

cr0x@server:~$ curl -s http://localhost:9200/_cat/segments/products_v42?v | head
index        shard prirep segment generation docs.count size  committed searchable version compound
products_v42 0     p      _0      0          5123      9.8mb true      true       9.11.1  true
products_v42 0     p      _1      1          4981      9.5mb true      true       9.11.1  true

What it means: Many small segments can imply frequent refreshes and heavy merges later.

Decision: For heavy indexing, increase refresh interval, batch updates, or use bulk indexing. Let merges breathe.

Task 12: Validate index refresh interval (freshness vs throughput)

cr0x@server:~$ curl -s http://localhost:9200/products_v42/_settings?pretty | egrep -A3 "refresh_interval"
        "refresh_interval" : "1s",
        "number_of_shards" : "12",
        "number_of_replicas" : "1"

What it means: 1s refresh is aggressive. Great for freshness, bad for indexing throughput and segment churn.

Decision: If you don’t need 1-second freshness, move to 5–30s and measure ingestion + query latency improvements.
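
Relaxing it is a dynamic setting change, no restart required (15s here is an example; measure before and after):

# For a full rebuild you can go further: set refresh_interval to -1, bulk index, then restore it.
curl -s -X PUT 'http://localhost:9200/products_v42/_settings' \
  -H 'Content-Type: application/json' -d '{
  "index": { "refresh_interval": "15s" }
}'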

Task 13: Confirm mapping issues that cause expensive queries

cr0x@server:~$ curl -s http://localhost:9200/products_v42/_mapping?pretty | egrep -n -A2 "\"name\"|\"category\"" | head
34:        "name" : {
35-          "type" : "text"
36-        },
--
78:        "category" : {
79-          "type" : "text"
80-        }

What it means: category is text only; filtering/aggregations on it will be slow or will require fielddata (bad news).

Decision: Add keyword subfields for exact filters/aggregations. Plan a reindex with an alias cutover.
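
The target mapping for the new index version might look like this (names assumed); exact filters and aggregations then use category.raw:

# text for full-text matching, keyword subfield for term filters and aggregations.
curl -s -X PUT 'http://localhost:9200/products_v43' \
  -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "name":     { "type": "text" },
      "category": {
        "type": "text",
        "fields": { "raw": { "type": "keyword" } }
      }
    }
  }
}'

Then reindex from products_v42 and move the alias.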

Task 14: Measure indexing lag (pipeline health)

cr0x@server:~$ mysql -e "SELECT MAX(updated_at) AS db_latest, (SELECT MAX(indexed_at) FROM search_index_audit) AS indexed_latest\G"
*************************** 1. row ***************************
     db_latest: 2025-12-30 10:27:12
indexed_latest: 2025-12-30 10:26:02

What it means: Index is ~70 seconds behind DB writes (assuming indexed_at tracks successful index updates).

Decision: If your freshness SLO is 30s, this is an incident. Investigate indexer backlog, bulk failures, or ES ingestion throttling.

Task 15: Check filesystem and disk latency on data nodes

cr0x@server:~$ iostat -x 1 3
Device            r/s     w/s   rkB/s   wkB/s  await  svctm  %util
nvme0n1         210.3   98.2  8123.4  4312.8  18.42   0.62  97.8

What it means: Disk is near saturated, with elevated await. Search and merges are IO-hungry.

Decision: Scale out storage throughput (more nodes, faster disks), reduce shard count, and avoid heavy indexing during peak query windows.

Fast diagnosis playbook

When “search is slow,” you want a three-pass triage: user-facing symptoms, backend saturation, then query/index design.
Don’t start by guessing. Start by narrowing.

First: confirm where time is spent

  1. Application metrics: p50/p95/p99 latency for the search endpoint, and error rates/timeouts.
  2. Dependency breakdown: DB time vs ES time vs application rendering time. If you don’t have tracing, add it. Your guesses are expensive.
  3. Concurrency: how many in-flight requests are piling up? Queueing is the silent killer.

If p50 is fine and p99 is awful, you have saturation, GC pauses, lock contention, or a small percent of pathological queries.
If everything is uniformly slow, you have systemic constraints: disk, CPU, or network.

Second: check the obvious saturation signals

  1. MariaDB: slow query log, CPU, InnoDB buffer pool misses, lock waits, replication lag (if using replicas for search).
  2. Elasticsearch: threadpool rejections, GC, heap usage, disk utilization/latency, unassigned shards, hot nodes.
  3. Infrastructure: node IO (iostat), CPU steal (virtualized), network drops/latency between app and search nodes.

Third: isolate the expensive query shapes

  1. Deep pagination (from/size) and sorting on unanalyzed fields.
  2. Wildcard queries and leading wildcards in SQL.
  3. High-cardinality aggregations (ES) or unselective predicates (SQL).
  4. Bad mappings forcing fielddata or scripting.
  5. Overly aggressive freshness (ES refresh interval too low) while indexing heavily.

Fourth: decide whether this is a capacity problem or a design problem

If threadpool rejections, disk saturation, or replica lag are present, you have a capacity problem now.
If queries are fundamentally expensive (wildcards, deep pagination, low selectivity), you have a design problem that
capacity will only mask.

Three corporate mini-stories (realistic, anonymized)

Mini-story 1: the incident caused by a wrong assumption

A mid-sized marketplace launched “search suggestions” because product wanted the UI to feel fast. Engineering took the
obvious shortcut: a MariaDB query on the products table, WHERE name LIKE 'term%', with an index on name.
It worked in staging. It worked in production, too—at 9 a.m.

The wrong assumption was that “prefix search uses an index, so it’s cheap.” That’s only half true. In production,
users typed short prefixes: “a”, “b”, “s”. Those prefixes matched huge ranges. The query used the index, yes, but it still
walked massive portions of it, did sorting, and returned top-N. Latency climbed. CPU climbed. Connections piled up.
Checkout started timing out because it shared the same database.

The incident response was classic: scale the DB vertically, add a read replica, add caching. Caching helped for the most
popular prefixes, until promotions changed what users typed and the cache missed again. The replica lagged, suggestions
became inconsistent, and support got tickets about “missing products.”

The fix wasn’t heroic. They separated the search use case. Autocomplete moved into Elasticsearch with an n-gram analyzer.
They rate-limited suggestions per user, debounced on the frontend, and put a strict cap on result counts. MariaDB went
back to being good at transactions. The lesson wasn’t “Elasticsearch is faster.” It was “user input shape matters more
than your index definition.”

Mini-story 2: the optimization that backfired

An internal docs portal ran Elasticsearch and had decent latency—until an engineer “optimized relevance.” The plan:
increase fields in the query (title, headings, body, tags), add fuzziness, add synonyms, and boost recent documents.
All good goals. The query DSL grew into a small novella.

They shipped it right before a company-wide onboarding wave. Search traffic spiked. Suddenly p99 latencies became seconds,
not milliseconds. Nodes were “healthy,” but search threadpool rejections climbed. GC pauses started showing up. The cluster
didn’t fall over; it just became politely useless.

The backfire came from a few compounded costs: fuzziness across multiple fields, a low minimum-should-match, and a large
size because the UI wanted to show “lots of results” without pagination. Each query expanded into a lot of term
combinations and pulled large result windows. CPU and heap went up. Query caches were ineffective because every user query
was unique.

The recovery was to treat relevance features as a budget. They constrained fuzziness to short fields, tightened minimum
should match, reduced size, removed expensive sorts, and introduced a two-phase approach: first a cheap retrieval,
then optional reranking for the top-N. Relevance improved and the cluster survived onboarding.

The moral: “better relevance” can be a denial-of-service attack you write yourself. Make it measurable and bounded.

Mini-story 3: the boring but correct practice that saved the day

A retail site ran hybrid search: MariaDB for truth, Elasticsearch for search. Nothing fancy. What they did have was
an outbox table in MariaDB, processed by a small worker fleet. Every index update was a job. Every job had retries.
Every failure ended up in a dead-letter bucket with enough context to replay.

One afternoon, a deployment introduced a mapping change that rejected documents for a subset of products—only those with
a weird attribute. The worker started failing those jobs repeatedly. Without guardrails, this would have become an
unbounded backlog, then a “search is missing items” incident, then a long night.

Instead, the boring controls kicked in. The worker had a circuit breaker: after N failures for a job, it quarantined
it. Index lag alerts fired because freshness SLO breached. On-call could immediately see which documents failed and why,
because the outbox entry stored the entity ID and operation type.

They rolled back the mapping change, replayed quarantined jobs, and were back to steady-state without guessing.
No heroics, no manual SQL archaeology. Just a ledger and a retry loop.

The lesson: your indexing pipeline is a distributed system. Treat it like one, even if it’s “just search.”

Common mistakes: symptoms → root cause → fix

1) Symptom: search intermittently times out during traffic spikes

Root cause: Search shares MariaDB resources with OLTP; connection pools saturate; slow searches queue behind writes.

Fix: Isolate search workload (Elasticsearch or separate DB cluster), enforce query timeouts, and implement rate limits for autocomplete.

2) Symptom: “we added a read replica and it got worse”

Root cause: Replica lag causes inconsistent results; application retries amplify load; read-after-write expectations break.

Fix: Stop using replicas as search engines. If you must, explicitly accept staleness and show it in UX; otherwise build an index.

3) Symptom: Elasticsearch is “healthy” but queries are slow

Root cause: Threadpool saturation, hot shards, or GC pauses; cluster health doesn’t capture performance.

Fix: Check thread_pool rejections, heap pressure, disk IO, and shard distribution; scale or reduce query cost.

4) Symptom: filters/aggregations are painfully slow in Elasticsearch

Root cause: Filtering on text fields, missing keyword mapping; fielddata enabled to “fix” it, exploding heap.

Fix: Add keyword subfields, reindex, and remove fielddata-based hacks.

5) Symptom: indexing slows down exactly when traffic peaks

Root cause: Refresh interval too low, heavy merges, and simultaneous query load; IO becomes contested.

Fix: Increase refresh interval, batch bulk indexing, schedule heavy reindexing off-peak, and ensure data nodes have IO headroom.

6) Symptom: “search results are missing new items”

Root cause: Index pipeline lag, failed jobs, or silent mapping rejects.

Fix: Track freshness lag explicitly, implement dead-letter queues/quarantine, alert on rejected documents, and provide replay tooling.

7) Symptom: relevance changes break unexpectedly after a deploy

Root cause: Analyzer/mapping changes require reindex; partial changes create mixed behavior across segments and fields.

Fix: Use versioned indexes + alias cutover. Never “edit the tires while driving” by modifying analyzers in place.

8) Symptom: MariaDB CPU spikes from search, even with FULLTEXT

Root cause: Low selectivity queries, high concurrency, or mixed workload; FULLTEXT still costs CPU and memory.

Fix: Move search out, or constrain search (min query length, filters), add caching for frequent queries, and cap concurrency.

Checklists / step-by-step plan

Step-by-step: decide if Elasticsearch is mandatory

  1. List required search behaviors: autocomplete, typo tolerance, synonyms, multi-language, phrase queries, boosting, custom ranking signals.
  2. Measure query profile: QPS, p95/p99, concurrency, and whether users generate multiple queries per interaction.
  3. Check DB impact: do search queries appear in top time consumers? Are OLTP endpoints impacted during search spikes?
  4. Define freshness requirement: is 1–30 seconds acceptable? If yes, decouple indexing. If no, plan a hybrid read path or reconsider scope.
  5. Decide isolation: if OLTP is business-critical, isolate search by default.
  6. Estimate operational budget: on-call skill, upgrade cadence, snapshot strategy, and capacity planning for shards/heap/disk.

Step-by-step: implement a hybrid search architecture without making it fragile

  1. Define the document model: denormalize the fields needed for search and for rendering the results page.
  2. Choose an indexing mechanism: outbox pattern or CDC. Prefer outbox if you want controllable failure handling.
  3. Build idempotency: indexing the same entity update twice must be safe (see the sketch after this list).
  4. Add audit tables/metrics: track latest indexed timestamp and failure counts.
  5. Use versioned indexes: items_v1, items_v2 with an alias items_current.
  6. Reindex plan: background build, dual-write (optional), then alias cutover, then decommission old index.
  7. Guardrails: cap result sizes, throttle autocomplete, block deep pagination, and set query timeouts.
  8. Failure UX: if ES is down, decide whether to show fallback (limited DB search) or a friendly error. Pick one and test it.
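
For the idempotency step, the simplest lever is indexing with the entity’s own ID so replays overwrite instead of duplicating. A sketch (index name and fields are assumptions):

# Re-running the same outbox job writes the same document at the same _id: an overwrite, not a duplicate.
curl -s -X PUT 'http://localhost:9200/items_v2/_doc/product-1234' \
  -H 'Content-Type: application/json' -d '{
  "name": "Wireless headphones",
  "price": 39.99,
  "in_stock": true,
  "updated_at": "2025-12-30T10:27:12Z"
}'

If out-of-order updates are a concern, external versioning (version_type=external) lets Elasticsearch reject stale writes.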

Operational checklist: keep Elasticsearch from becoming a haunted house

  • Shard sizing policy (avoid tiny shards; avoid too many).
  • Dashboards for: threadpool rejections, heap usage, GC, disk latency, indexing rate, refresh/merge time.
  • Snapshot strategy and restore test schedule.
  • Explicit SLOs: query p95/p99 and indexing freshness lag.
  • Runbooks for: unassigned shards, hot nodes, rejected documents, reindex cutover.

FAQ

1) Can MariaDB FULLTEXT replace Elasticsearch for a small site?

Yes, if “search” is a single-language, low-QPS feature with modest relevance expectations and no autocomplete.
The moment product asks for synonyms, fuzziness, or multi-field boosting, plan the migration.

2) Is Elasticsearch only for “big data”?

No. Elasticsearch is for search behavior complexity and workload isolation. Small datasets can still justify it if the UX depends on relevance and typeahead.

3) What’s the biggest operational difference between MariaDB search and Elasticsearch?

MariaDB search competes with writes and transactions; Elasticsearch search competes with indexing and merges.
In both cases, you manage contention—but in Elasticsearch you can isolate it from OLTP.

4) Why not just add more MariaDB replicas for search?

Replicas trade CPU for replication lag and operational complexity. It’s fine for read scaling of predictable queries.
Search queries are often unpredictable and low-selectivity, and replicas won’t fix relevance limitations.

5) How do I handle permissions (ACLs) in Elasticsearch?

Either filter by permission fields at query time (document contains tenant/org/visibility fields), or use per-tenant indexes if tenant count is small.
Avoid per-user indexes unless you enjoy long weekends. For complex ACLs, consider returning candidate IDs then enforcing permissions in the app layer.
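
A sketch of the query-time approach, assuming documents carry tenant and visibility fields:

# Permission constraints live in the filter clause: cached, cheap, and they never affect scoring.
curl -s -X POST 'http://localhost:9200/documents_current/_search' \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must":   { "match": { "body": "quarterly report" } },
      "filter": [
        { "term":  { "tenant_id": "acme" } },
        { "terms": { "visibility": ["public", "org"] } }
      ]
    }
  }
}'

The hard part is keeping those permission fields in sync when ACLs change, which is an indexing-pipeline problem, not a query problem.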

6) How do I reindex without downtime?

Use versioned indexes plus an alias. Build items_v2, backfill, then atomically move items_current alias from v1 to v2.
Keep v1 for rollback until confidence is high.

7) What’s the most common Elasticsearch scaling mistake?

Too many shards. It increases coordination cost, memory overhead, and recovery time. Fewer, right-sized shards beat many tiny shards in most production setups.

8) Should search results be hydrated from MariaDB or served fully from Elasticsearch?

If you hydrate every result row from MariaDB, you risk turning a search into N+1 queries under load.
Store enough fields in Elasticsearch to render the results page. Hydrate only when the user clicks a detail page.

9) Can I use Elasticsearch as the system of record?

You can, but you usually shouldn’t. Elasticsearch is not optimized for relational integrity and transactional workflows.
Use it as a derived index, and keep your truth in MariaDB (or another transactional store).

10) What freshness can I realistically expect with Elasticsearch?

Typically seconds to tens of seconds, depending on refresh interval, indexing load, and pipeline design.
If you need “immediately searchable,” you’ll either accept higher indexing cost or implement a hybrid read path.

Conclusion: next steps you can execute

If your site search is essentially structured filtering with occasional keyword matching, MariaDB can carry it—especially
if you keep queries selective and avoid wildcard fantasies. But if search is a product feature with autocomplete, typo
tolerance, relevance tuning, and multi-field ranking, the search cluster isn’t a luxury. It’s the correct boundary.

Practical next steps:

  1. Run the diagnostics: identify whether MariaDB is paying the search bill today (slow queries, CPU, buffer pressure).
  2. Decide on isolation: if OLTP and search share resources, separate them before the next traffic spike separates them for you.
  3. Design the pipeline: pick outbox or CDC, define freshness SLO, and build retries + quarantine.
  4. Implement versioned indexes: alias cutovers are how you avoid reindexing drama becoming downtime.
  5. Set guardrails: cap result windows, throttle autocomplete, and ban deep pagination in your API contracts.

Build search like you expect it to be used by humans at scale: impatiently, creatively, and all at once. That’s not pessimism. That’s operations.
