Your product team says “reports are slow,” and what they mean is: dashboards time out, CSV exports take minutes,
and the MariaDB primary looks like it’s mining cryptocurrency. Meanwhile checkout latency creeps up and you start
getting that special kind of pager noise that ruins weekends.
This isn’t a mystery. You’re trying to do analytics on an OLTP engine, under OLTP constraints, with OLTP data models,
while pretending it’s fine because “it’s just SQL.” It’s not fine. It’s expensive, risky, and usually avoidable.
Why “reports” kill MariaDB (and why indexes aren’t a religion)
MariaDB (and MySQL) are fantastic at transactional workloads: small reads and writes, tight indexes,
predictable access paths, and concurrency patterns shaped by user requests. Your application does lots of
“get one order,” “insert one payment,” “update one session.” The engine is built to make that fast and safe.
Reports are different. Reports do wide scans, large joins, grouped aggregations, and window-ish patterns
that churn through millions of rows to summarize “everything.” They also show up at the worst possible time:
end-of-month, end-of-quarter, Monday morning, right when traffic spikes.
OLTP vs analytics: it’s not just “bigger queries”
OLTP performance is about latency and contention. Analytics performance is about throughput and bandwidth:
how fast can you scan, decode, filter, and aggregate data. These are different physics.
- Row store vs column store: MariaDB stores rows. Analytics wants columns. Scanning 2 columns out of 50 should not force you to read all 50 (see the example after this list).
- Index reliance: OLTP leans on indexes. Analytics often prefers scans with vectorized execution and compressed columnar storage.
- Concurrency patterns: OLTP is many short queries; analytics can be fewer long queries that tie up CPU, memory, and IO.
- Data model reality: normalized schemas are clean for writes; denormalized schemas are faster for aggregates and joins-at-query-time.
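To make the contrast concrete, here is a minimal sketch against a hypothetical orders table (table and column names are illustrative, not your schema). The first query is what MariaDB is built for; the second is what reports actually do.
-- OLTP: point lookup by primary key; touches one row and a few columns
SELECT id, status, total
FROM orders
WHERE id = 1234567;
-- Analytics: reads 3 columns out of dozens across months of rows, then aggregates
SELECT DATE(created_at) AS day, country, SUM(total) AS revenue
FROM orders
WHERE created_at >= '2025-10-01'
GROUP BY day, country;
A row store has to pull whole rows off disk to answer the second query; a column store reads only created_at, country, and total, and reads them compressed.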
The real damage: reports don’t just run slowly, they destabilize the database
“Slow” is annoying. “Slow and destabilizing” is an incident. A report query can:
- Evict your hot OLTP working set from the buffer pool.
- Flood the disk with reads (and extra writes when sorts and temp tables spill to disk).
- Create replication lag (writes still happen, but the replica can’t keep up applying them).
- Trigger metadata locks in awkward ways during schema changes.
- Cause CPU saturation that makes every query slow, including the ones that matter.
If you’ve been living on “just add an index,” here’s the catch: indexes speed up selective queries.
Reports are often not selective. Or they’re selective on a low-cardinality field that makes indexes less useful.
Or they join large tables in ways that make the “right” index impossible to pick because there is no right index.
Joke #1: Adding indexes to fix reporting is like adding more lanes to a highway to fix rush hour—congratulations, you now have a wider traffic jam.
Interesting facts and a bit of history (so you stop arguing with gravity)
- MariaDB forked from MySQL after concerns about Oracle’s acquisition of Sun; it kept MySQL compatibility while diverging in storage engines and features.
- ClickHouse originated at Yandex for web-scale analytics; its design assumes huge scans and fast aggregates are the norm, not the exception.
- Columnar storage became mainstream in analytics because reading fewer bytes matters more than “having the right index” once datasets get large.
- Compression is a feature in column stores: better compression means less IO, and less IO means more speed. It’s not just about saving disk.
- MergeTree (ClickHouse’s core family of engines) relies on background merges and immutable parts—great for ingestion + reads, different failure modes than B-trees.
- Materialized views in ClickHouse are often used as “pre-aggregations,” a pattern popularized by OLAP systems to trade storage for predictable query latency.
- Star schemas and denormalized fact tables became common because joining normalized transactional tables at analytics scale is expensive and fragile.
- Replication lag as a symptom predates modern analytics stacks: “run reports on a replica” has been a hack since early MySQL days, and it still bites.
What ClickHouse is good at (and what it’s bad at)
ClickHouse is an OLAP database built for fast reads over large datasets. It reads columnar data,
executes queries in a vectorized way, compresses aggressively, and can scale horizontally.
It’s designed for “give me aggregates over a month of events,” not “update one row by primary key 5,000 times a second.”
ClickHouse strengths
- Fast scans and aggregates: group-bys, time buckets, percentile-ish computations, top-N lists.
- High compression: you often get surprisingly small storage footprints for event-like data.
- Append-friendly ingestion: insert batches; let merges clean it up later.
- Pre-aggregation options: materialized views can build summary tables as data arrives.
- Distributed execution: scale reads and storage across shards/replicas with the right design.
ClickHouse weaknesses (aka where people hurt themselves)
- Updates/deletes are not OLTP: yes, you can do mutations, but they’re heavy and not the happy path.
- Joins can be expensive if you treat it like a normalized OLTP schema and expect miracles.
- Eventual consistency patterns: you’ll design for late-arriving data, deduplication, and idempotency.
- Operational nuance: merges, parts, disk watermarks, and memory limits are not optional knowledge.
Here’s the core recommendation: keep MariaDB for transactions and source-of-truth state.
Use ClickHouse for reporting, dashboards, and exploration. Don’t make them fight over the same resources.
Decision framework: keep MariaDB, optimize, or offload
Not every slow report requires a new database. Some require a better query, a summary table,
or admitting the dashboard doesn’t need second-by-second freshness. But once the workload is fundamentally analytical,
MariaDB will keep charging you interest in the form of bigger boxes, more replicas, and more incidents.
When to stay in MariaDB (for now)
- Dataset is small (fits in memory comfortably) and queries are simple.
- Reports can be precomputed nightly without business pain.
- Most slowness is from obvious query bugs (missing predicates, N+1 joins at report layer, accidental cross joins).
- You don’t have operational capacity to run another system yet.
When to offload to ClickHouse
- Reports scan millions of rows routinely and still need interactive latency.
- Dashboards run constantly and compete with production traffic.
- You need flexible slicing/dicing on event-like data (time series, logs, clicks, orders, inventory movements).
- “Run it on a replica” is causing replication lag, or the replica has become production-critical anyway.
When to rethink the product instead
Sometimes the honest fix is not technical. It’s setting expectations.
If finance wants “all-time cohort analysis” on demand over five years of data, the answer is:
build the right pipeline and storage, and accept that it costs money and time.
Fast diagnosis playbook
You want the bottleneck in minutes, not after a week of vibes-based tuning. Here’s the order that works in real life.
First: is MariaDB choking on CPU, IO, locks, or memory?
- CPU pinned: likely bad plans, large joins, no selectivity, heavy sorting/grouping.
- IO bound: buffer pool misses, table scans, temp table spills, poor locality, too many random reads.
- Lock bound: metadata locks, long transactions, or reports holding locks longer than expected.
- Memory pressure: sorts, temp tables, or buffer pool too small causing churn.
Second: is the workload mismatch the real issue?
- Queries require scanning “most of the table” even with indexes.
- Business asks for high-cardinality group-bys and flexible filters.
- Reports are frequent and run during peak hours.
Third: is the data pipeline your hidden problem?
- Replica lag makes “run reports on replica” useless.
- CDC or ETL introduces duplicates or out-of-order events.
- Schema changes break the pipeline and you discover it a week later.
Fourth: confirm by measuring one representative query end-to-end
Pick one query that hurts. Get its plan in MariaDB, run it with tracing, and quantify rows examined, temp tables,
sort passes, and time spent waiting. Then decide: optimize in MariaDB or stop torturing it and offload.
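A minimal sketch of that measurement in MariaDB, reusing the orders aggregation from Task 3 below (your table and predicate will differ): ANALYZE, the statement form rather than ANALYZE TABLE, executes the query and reports actual row counts next to the optimizer's estimates.
cr0x@server:~$ mariadb -e "ANALYZE FORMAT=JSON SELECT customer_id, count(*) FROM orders WHERE created_at >= '2025-10-01' GROUP BY customer_id\G"
Compare the estimated rows with r_rows (what was actually read), and look for filesort and temporary-table nodes in the JSON. If both numbers sit in the tens of millions for a frequent report, you are past query tuning.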
One quote to keep you honest: “Hope is not a strategy.” — Gene Kranz
Practical tasks: commands, outputs, and decisions (12+)
These are not “toy” commands. They’re the day-to-day moves you use to decide what to do next.
For each task: run the command, interpret the output, then make a decision.
Task 1: Find the top slow queries in MariaDB (slow query log)
cr0x@server:~$ sudo pt-query-digest /var/log/mysql/mariadb-slow.log | head -n 30
# 120s user time, 3s system time, 1.23M rss, 2.45G vsz
# Current date: Mon Dec 30 10:12:44 2025
# Hostname: db-primary
# Files: mariadb-slow.log
# Overall: 8.42k total, 97 unique, 0.23 QPS, 0.01x concurrency
# Time range: 2025-12-29T09:00:00 to 2025-12-30T09:00:00
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 13200s 100ms 95s 1.6s 12s 4.3s 450ms
# Rows examine 2.1e11 0 9.2e8 2.5e7 1.1e8 6.3e7 8.2e6
# Query 1: 38% RPS, 61% time, 80% rows examined
What it means: Rows examined is enormous relative to rows returned. That’s classic “analytics query on OLTP tables.”
Decision: If the worst offenders are grouping/aggregating across large time ranges, plan an offload.
If it’s one query with a missing predicate, fix the query first.
Task 2: Check current MariaDB threads and whether they’re stuck
cr0x@server:~$ mariadb -e "SHOW PROCESSLIST\G" | sed -n '1,60p'
*************************** 1. row ***************************
Id: 8123
User: report_user
Host: 10.2.4.19:52311
db: app
Command: Query
Time: 184
State: Sending data
Info: SELECT customer_id, count(*) FROM orders WHERE created_at >= '2025-10-01' GROUP BY customer_id
*************************** 2. row ***************************
Id: 8130
User: app_user
Host: 10.2.7.11:49821
db: app
Command: Query
Time: 2
State: Updating
Info: UPDATE inventory SET qty=qty-1 WHERE sku='X'
What it means: “Sending data” for hundreds of seconds is often scan/aggregation, not network transfer.
Decision: If report threads are long-lived and numerous, throttle them, move them to a separate system, or both.
Task 3: Confirm whether the query plan is doing a scan
cr0x@server:~$ mariadb -e "EXPLAIN SELECT customer_id, count(*) FROM orders WHERE created_at >= '2025-10-01' GROUP BY customer_id\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
type: range
possible_keys: idx_orders_created_at
key: idx_orders_created_at
rows: 48219321
Extra: Using index condition; Using temporary; Using filesort
What it means: Even with a range index, you’re scanning ~48M rows, then temp + filesort for group-by.
Decision: If “rows” is in the tens of millions and the report is frequent, stop optimizing around the edges. Offload.
Task 4: Check InnoDB buffer pool behavior (is it thrashing?)
cr0x@server:~$ mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
+----------------------------------+------------+
| Variable_name                    | Value      |
+----------------------------------+------------+
| Innodb_buffer_pool_read_requests | 9923412331 |
| Innodb_buffer_pool_reads         | 183492211  |
+----------------------------------+------------+
What it means: The ratio indicates many reads are satisfied from memory, but 183M physical reads is still huge.
Watch the trend during reporting windows.
Decision: If buffer pool reads spike when reports run and app latency follows, isolate analytics workload.
Task 5: Check for temp table spills (a silent killer)
cr0x@server:~$ mariadb -e "SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';"
+-------------------------+--------+
| Variable_name           | Value  |
+-------------------------+--------+
| Created_tmp_disk_tables | 421193 |
| Created_tmp_tables      | 702881 |
+-------------------------+--------+
What it means: Disk temp tables are expensive: extra IO, extra latency, more contention.
Decision: If disk temp tables grow rapidly during reports, either redesign queries or stop running them on MariaDB.
Task 6: Measure replication lag if you’re running reports on replicas
cr0x@server:~$ mariadb -h db-replica -e "SHOW SLAVE STATUS\G" | egrep 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running'
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 1870
What it means: 31 minutes behind. Your “reporting replica” is now a time machine.
Decision: Stop treating replicas as analytics engines. Offload to ClickHouse or build dedicated analytical replicas with different guarantees.
Task 7: Identify whether the host is IO-bound (Linux)
cr0x@server:~$ iostat -xz 1 3
Linux 6.1.0 (db-primary) 12/30/2025
avg-cpu: %user %nice %system %iowait %steal %idle
52.21 0.00 6.12 18.44 0.00 23.23
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await %util
nvme0n1 2100.0 268800.0 0.0 0.00 8.10 128.0 980.0 105600.0 12.40 98.70
What it means: %util near 99% and high iowait: storage is saturated. Reports are reading too much.
Decision: You can scale up storage, but that’s buying time. Better: reduce scan volume by moving analytics to columnar storage.
Task 8: Check MariaDB table sizes and growth (capacity pressure)
cr0x@server:~$ mariadb -e "SELECT table_name, round((data_length+index_length)/1024/1024/1024,2) AS gb FROM information_schema.tables WHERE table_schema='app' ORDER BY (data_length+index_length) DESC LIMIT 10;"
+-------------+--------+
| table_name  | gb     |
+-------------+--------+
| events      | 612.45 |
| orders      | 189.12 |
| order_items | 141.77 |
| sessions    |  88.04 |
+-------------+--------+
What it means: Big tables with mixed OLTP + analytics use (like events) are prime offload candidates.
Decision: Start with the largest scan-heavy tables; keep transactional tables in MariaDB.
Task 9: Baseline ClickHouse server health
cr0x@server:~$ clickhouse-client -q "SELECT version(), uptime(), round(memory_usage/1024/1024) AS mem_mb FROM system.metrics WHERE metric IN ('MemoryTracking')"
24.9.2.42
86400
512
What it means: You can track memory tracking and uptime quickly; if this is already high at idle, you have a workload or config issue.
Decision: Set sane memory limits and investigate runaway queries before onboarding dashboards.
Task 10: Check ClickHouse parts/merges (are you drowning in tiny parts?)
cr0x@server:~$ clickhouse-client -q "SELECT table, count() AS parts, sum(rows) AS rows, round(sum(bytes_on_disk)/1024/1024/1024,2) AS gb FROM system.parts WHERE active GROUP BY table ORDER BY parts DESC LIMIT 5"
events 18234 891233112 122.31
orders 1820 83211299 18.44
What it means: 18k parts for events suggests too many small inserts or poor partitioning.
Decision: Batch inserts, tune partitioning, and ensure merges can keep up; otherwise query latency will be spiky.
Task 11: Find slow queries in ClickHouse (system.query_log)
cr0x@server:~$ clickhouse-client -q "SELECT query_duration_ms, read_rows, read_bytes, memory_usage, query FROM system.query_log WHERE type='QueryFinish' AND event_time > now()-3600 ORDER BY query_duration_ms DESC LIMIT 5"
12034 981233112 44120341212 2147483648 SELECT customer_id, count() FROM events WHERE ts > now()-86400 GROUP BY customer_id
What it means: Read rows/bytes tell you whether the query is scanning too much; memory usage hints at big group-by state.
Decision: If reads are huge, revisit primary key/order by and partitioning; if memory is huge, consider pre-aggregation.
Task 12: Verify disk watermarks and free space on ClickHouse nodes
cr0x@server:~$ df -h /var/lib/clickhouse
Filesystem Size Used Avail Use% Mounted on
/dev/md0 3.6T 3.1T 420G 89% /var/lib/clickhouse
What it means: ClickHouse needs headroom for merges; running close to full makes merges slow and failures weird.
Decision: Keep meaningful free space; if you’re above ~80–85% consistently, plan capacity or lifecycle policies now.
Task 13: Verify CDC lag (binlog position tracking via a connector)
cr0x@server:~$ sudo journalctl -u mariadb-to-clickhouse-cdc --since "10 min ago" | tail -n 8
Dec 30 10:01:11 cdc1 cdc[1842]: last_binlog=mysql-bin.003211:918233112
Dec 30 10:01:21 cdc1 cdc[1842]: last_binlog=mysql-bin.003211:918244901
Dec 30 10:01:31 cdc1 cdc[1842]: lag_seconds=3
What it means: Low and stable lag is what you want. Spikes mean downstream pressure or connector issues.
Decision: If lag spikes during report windows, your offload is incomplete: either dashboards still hit MariaDB or ClickHouse can’t ingest fast enough.
Task 14: Confirm ClickHouse is actually using the index (data skipping) via EXPLAIN
cr0x@server:~$ clickhouse-client -q "EXPLAIN indexes=1 SELECT count() FROM events WHERE ts >= now()-3600 AND customer_id=12345"
Indexes:
  PrimaryKey
    Keys:
      ts
      customer_id
    Condition: (ts >= (now() - 3600)) AND (customer_id = 12345)
What it means: You want conditions that align with the ORDER BY so ClickHouse can skip granules.
Decision: If your common filters don’t appear in the primary key, your schema is fighting your queries.
Designing the offload: data flow, latency, and correctness
“Offload analytics” sounds clean until you ask the annoying questions: How fresh must the data be?
What happens when a row gets updated? How do you model deletes? What’s the source of truth?
Pick your offload pattern
- Batch ETL: nightly/hourly exports from MariaDB into ClickHouse. Simple, cheap, not real-time.
- CDC (change data capture): stream binlog changes into ClickHouse continuously. More moving parts, lower latency.
- Dual-write: application writes to both MariaDB and ClickHouse. Fast path to regret unless you have strong discipline.
For most companies: batch ETL for low-stakes reporting, CDC for dashboards that drive operations, and avoid dual-write unless you enjoy debugging consistency at 2 a.m.
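A minimal batch-ETL sketch, assuming an orders table on the MariaDB side and a matching analytics.orders table in ClickHouse (names, columns, and the one-day window are illustrative):
cr0x@server:~$ mariadb -B -N -e "SELECT id, customer_id, total, created_at FROM app.orders WHERE created_at >= CURDATE() - INTERVAL 1 DAY AND created_at < CURDATE()" > /tmp/orders_yesterday.tsv
cr0x@server:~$ clickhouse-client --query "INSERT INTO analytics.orders FORMAT TSV" < /tmp/orders_yesterday.tsv
CDC replaces the export step with a binlog reader, but the load side still wants batches. Mind the details: NULL and datetime escaping differ between the MariaDB client's tab-separated output and ClickHouse's TSV expectations, so normalize in the SELECT or switch to a stricter format before trusting the numbers.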
Define “correctness” like an adult
Analytics correctness is rarely “exactly matches OLTP at every millisecond.” Usually it’s:
- Data is complete up to T minus N minutes.
- Late events are handled (backfill window).
- Deduplication is deterministic.
- Business metrics are defined once, not re-invented per dashboard.
How to handle updates and deletes from MariaDB
In MariaDB, a row update is a normal day. In ClickHouse, you typically model change over time:
- Immutable events: append-only facts (order created, payment settled). Best case.
- ReplacingMergeTree: keep latest version by a version column; duplicates merge away eventually.
- CollapsingMergeTree: model sign (+1/-1) for “insert/delete” semantics, good for some event streams.
- Mutations: heavy; use sparingly for true corrections, not routine updates.
If your OLTP table is “current state” (like current inventory), you can replicate it into ClickHouse,
but don’t expect it to behave like a transactionally updated table. For analytics, state tables are often derived from events anyway.
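A minimal sketch of the ReplacingMergeTree pattern for a "current state" table, assuming your CDC feed can supply a monotonically increasing version (binlog position or an updated_at-derived number); all names here are illustrative:
CREATE TABLE analytics.inventory_current
(
    sku        String,
    warehouse  LowCardinality(String),
    qty        Int64,
    updated_at DateTime,
    _version   UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY (sku, warehouse);
-- Duplicates survive until a background merge runs, so reads must tolerate them:
SELECT sku, warehouse, argMax(qty, _version) AS qty
FROM analytics.inventory_current
GROUP BY sku, warehouse;
FINAL on the SELECT is the lazier alternative to the argMax pattern: simpler to write, more expensive to run.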
Schema modeling that doesn’t age badly
The fastest way to make ClickHouse slow is to import your normalized MariaDB schema as-is and then act surprised.
The second fastest way is to pick an ORDER BY that doesn’t match your actual filters.
Start with the questions, not the tables
List your top 10 dashboard queries. Real ones, not aspirational ones. For each, capture:
- Time range filters (last hour, last day, arbitrary date range)
- High-selectivity dimensions (customer_id, org_id, tenant_id)
- Group-by dimensions (country, plan, product_id)
- Expected cardinality (how many groups?)
- Refresh frequency and SLO
Order key strategy: the “skip data” superpower
ClickHouse’s primary key (the ORDER BY expression in MergeTree tables) is not a B-tree index.
It’s the physical sort order of the data, and it lets ClickHouse skip granules when predicates align with that order.
If every dashboard filters by time and tenant, your order key often starts with those:
(tenant_id, ts) or (ts, tenant_id). Which comes first depends on access patterns.
- Mostly “one tenant, many time ranges”: (tenant_id, ts)
- Mostly “global time slices across all tenants”: (ts, tenant_id)
Partitioning: keep it boring
Partitioning helps manage data lifecycle and merge scope. Over-partitioning creates too many parts and operational pain.
A common pattern: partition by month for event data, by day for extremely high volume, or by tenant only if tenants are huge and isolated.
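Putting the order key and partitioning advice together, a minimal event-table sketch for the "one tenant, many time ranges" case (column names and types are placeholders, not a recommended schema):
CREATE TABLE analytics.events
(
    tenant_id   UInt64,
    ts          DateTime,
    event_type  LowCardinality(String),
    customer_id UInt64,
    amount      Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (tenant_id, ts);
If your dashboards mostly slice globally by time, flip the key to (ts, tenant_id) as described above; the point is that the leading key columns should match your most common filters.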
Materialized views: pre-aggregate for predictable dashboards
The “dashboard is slow” complaint is often “group-by over 90 days is slow.” Pre-aggregate the hot metrics.
Build summary tables keyed by the dimensions you actually filter on. Then dashboards hit summaries, not raw facts.
Joke #2: Materialized views are like meal prep—you spend a little time upfront so you don’t cry into takeout at 11 p.m.
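A minimal pre-aggregation sketch on top of the events table sketched above, assuming the hot dashboard tile is "daily revenue and order count per tenant" (names are illustrative):
CREATE TABLE analytics.daily_revenue
(
    tenant_id UInt64,
    day       Date,
    revenue   Decimal(38, 2),
    orders    UInt64
)
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day);
CREATE MATERIALIZED VIEW analytics.daily_revenue_mv TO analytics.daily_revenue AS
SELECT tenant_id, toDate(ts) AS day, sum(amount) AS revenue, count() AS orders
FROM analytics.events
GROUP BY tenant_id, day;
-- Dashboards read the summary and still GROUP BY, because rows in unmerged parts may not be collapsed yet:
SELECT tenant_id, day, sum(revenue) AS revenue, sum(orders) AS orders
FROM analytics.daily_revenue
WHERE day >= today() - 90
GROUP BY tenant_id, day;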
Ops and reliability: running ClickHouse like you mean it
Offloading analytics is not “install ClickHouse and vibe.” You are introducing a new production system.
It needs monitoring, backup, capacity planning, schema change discipline, and incident response.
What to monitor (minimum viable SRE hygiene)
- Ingestion lag: from MariaDB binlog to ClickHouse table availability.
- Merge backlog: parts count, merges in progress, bytes to merge (see the sketch after this list).
- Disk usage and free space: merges need headroom; near-full disks cause cascading failures.
- Query latency percentiles: for dashboard queries and ad-hoc exploration.
- Memory usage and OOM kills: group-bys can allocate aggressively without limits.
- Replica health: if using replication, monitor replication queue and errors.
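For the merge and parts items, the system tables already expose what you need; a minimal sketch (thresholds and alerting wiring are up to you):
cr0x@server:~$ clickhouse-client -q "SELECT count() AS active_merges, round(max(elapsed)) AS longest_seconds FROM system.merges"
cr0x@server:~$ clickhouse-client -q "SELECT database, table, count() AS parts FROM system.parts WHERE active GROUP BY database, table ORDER BY parts DESC LIMIT 5"
Graph both over time; a steadily climbing parts count with long-running merges is the "flaky dashboards" story from earlier in the making.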
Backups: decide what you’re protecting
ClickHouse data is often derived. That doesn’t mean you don’t back it up.
It means your backup strategy can be different:
- If you can rehydrate from raw logs/object storage, backups are about faster recovery, not existential survival.
- If ClickHouse contains curated business truth (metrics tables, corrected data), back it up like it matters—because it does.
Resource isolation: keep dashboards from becoming denial-of-service
Use users/profiles/quotas. Set memory limits per query and per user. Use queueing if needed.
The goal: one heavy analyst query should not starve operational dashboards.
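A minimal sketch of those guardrails (the numbers are placeholders, not recommendations, and access-control syntax varies across versions, so check your docs before copying):
-- Per-query limits an analyst can attach to ad-hoc work:
SELECT customer_id, count()
FROM analytics.events
WHERE ts >= now() - INTERVAL 1 DAY
GROUP BY customer_id
SETTINGS max_memory_usage = 8000000000, max_execution_time = 30;
-- Server-side: a settings profile and a quota bound to the dashboard user.
CREATE SETTINGS PROFILE dashboards SETTINGS max_memory_usage = 8000000000, max_execution_time = 30 TO dashboard_user;
CREATE QUOTA dashboards_hourly FOR INTERVAL 1 hour MAX queries = 2000 TO dashboard_user;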
Schema changes: treat them like production deploys
ClickHouse schema changes are often easier than OLTP migrations, but they still have blast radius:
materialized views depend on columns; CDC pipelines depend on schema; dashboards depend on semantics.
Plan changes with compatibility in mind, and version your transformations.
Three corporate mini-stories (pain included)
Mini-story 1: The incident caused by a wrong assumption
A mid-sized B2B SaaS company had a familiar setup: MariaDB primary, one replica labeled “reporting,” and a BI tool
pointed at that replica. Everyone felt clever. Writes stayed on primary, reads stayed on replica, world peace achieved.
The wrong assumption was subtle: they believed replication lag would be “small enough” for dashboards, and if it got big,
it would be obvious. It wasn’t. The BI tool cached results, users exported CSVs, and nobody noticed that the “current month revenue”
number was sometimes thirty minutes stale during peak hours. The data was wrong, but consistent enough to look believable.
Then came a product launch. Write volume spiked, and so did heavy reporting. Replica lag jumped, and the BI tool got slower.
Someone “fixed” it by pointing the BI tool at the primary for “just a day.” That day included a multi-hour table scan and a few
expensive group-bys. Checkout latency increased; the app retried; writes increased more; the primary started to saturate disk.
The incident wasn’t one catastrophic error. It was the accumulation of a bad architecture decision: using replication as an analytics system.
The postmortem had the usual plot twist: the replica was already production-critical because it fed customer-facing dashboards.
So it wasn’t “nice to have.” It was a dependency with no SLOs and no guardrails.
The fix was boring and structural: move analytics to ClickHouse, keep the replica for failover, define freshness expectations, and add lag alarms.
Replication went back to being what it should be: a resilience tool, not a reporting engine.
Mini-story 2: The optimization that backfired
Another company offloaded to ClickHouse and celebrated. Dashboards dropped from minutes to seconds.
Then they “optimized” ingestion by streaming every row as a single-row insert because “real time” sounded good in meetings.
Within weeks, query latency became erratic. Sometimes fast, sometimes awful. The ClickHouse cluster looked healthy until it didn’t.
Parts count climbed. Merges ran constantly. Disks got busy with background work. The cluster wasn’t down; it was just perpetually annoyed.
The backfire was classic: ClickHouse is append-friendly, but it wants batches.
Single-row inserts create many small parts, and many small parts create merge pressure, and merge pressure steals IO from queries.
The users experienced this as “dashboards are flaky” which is arguably worse than “dashboards are slow,” because it’s harder to trust.
The remediation was not magical tuning. They batched inserts (time-based or size-based), reduced partition granularity,
and set operational guardrails: alerts on parts-per-partition thresholds. They also accepted that “real time”
meant “within a minute” for most metrics, not “within a second.”
It was a good lesson: the fastest system is the one you don’t force into pathological behavior.
Mini-story 3: The boring but correct practice that saved the day
A payments-adjacent company ran both MariaDB and ClickHouse. Nothing flashy.
They had a written data contract: a small document describing each metric table, the source fields,
and the transformation logic, plus a “freshness window” and a test query that validated row counts.
Every schema change in MariaDB that touched replicated tables required a lightweight review:
does this break CDC? does ClickHouse need a new column? does a materialized view need updating?
The process was not beloved, but it was consistent. They also ran a daily reconciliation job:
compare counts and sums between MariaDB and ClickHouse for a moving time window.
One Friday, a seemingly harmless app change started writing NULLs into a dimension field that dashboards grouped by.
MariaDB didn’t care. ClickHouse didn’t care. The dashboard, however, showed a sudden drop in a key metric.
The reconciliation check caught it within an hour because group cardinality changed sharply.
The on-call didn’t need to guess. They rolled back the app change, backfilled the missing dimension in ClickHouse from source events,
and the Monday morning executive dashboard did not become a courtroom drama.
Nobody got a trophy for that process. That’s why it worked.
Common mistakes: symptom → root cause → fix
1) Symptom: “Reports are slow” only during business hours
Root cause: reports compete with OLTP workload for CPU and IO; buffer pool churn amplifies it.
Fix: move dashboards to ClickHouse; if you must stay on MariaDB, schedule heavy reports off-peak and add query timeouts.
2) Symptom: MariaDB replica lag grows whenever BI runs
Root cause: replica is doing heavy reads and can’t apply writes fast enough; or IO is saturated by scans.
Fix: stop using a replication replica for analytics; offload. If interim, add a dedicated “analytics replica” with isolated resources and accept it’s still a stopgap.
3) Symptom: ClickHouse dashboards are fast sometimes, slow other times
Root cause: too many small parts and heavy merges; ingestion pattern is pathological.
Fix: batch inserts; reduce partition count; monitor parts; ensure disks have headroom; tune merge settings cautiously.
4) Symptom: ClickHouse query reads billions of rows for a narrow filter
Root cause: ORDER BY doesn’t align with typical predicates; poor data skipping.
Fix: redesign table order key; consider projections or materialized rollups; don’t replicate OLTP schema blindly.
5) Symptom: Data in ClickHouse doesn’t match MariaDB
Root cause: CDC duplicates, out-of-order events, missing deletes/updates semantics.
Fix: use idempotent keys, version columns, and deduplication patterns (e.g., ReplacingMergeTree); implement reconciliation checks and backfill workflows.
6) Symptom: ClickHouse runs out of disk unexpectedly
Root cause: merges need temporary space; retention not enforced; high compression assumptions failed due to data type choices.
Fix: enforce TTL/retention; keep disk usage below safe thresholds; use appropriate types (LowCardinality where it fits); and don’t make raw JSON blobs your primary storage plan.
7) Symptom: MariaDB query “optimized” with a new index but app got slower
Root cause: extra index increased write amplification; cache churn; optimizer chose a worse plan for OLTP queries.
Fix: rollback index; use covering indexes only where selective; separate analytics workload instead of piling on indexes.
Checklists / step-by-step plan
Step-by-step: offload analytics the right way
1) Inventory your reporting workload. Collect the top 20 queries, their frequency, and the dashboards/exports that trigger them.
2) Classify queries: OLTP-friendly vs scan-heavy analytics. If a query examines tens of millions of rows routinely, it’s analytics.
3) Define freshness targets. “Within 5 minutes” is a real requirement; “real time” is a meeting word.
4) Select an ingestion approach. Batch ETL for hourly/daily reporting; CDC for near-real-time dashboards.
5) Design ClickHouse tables from queries. Pick partitioning and ORDER BY based on filter patterns.
6) Start with one domain. Example: events/orders analytics, not the entire database.
7) Build reconciliation. Row counts, sums, and “known good” metrics over rolling windows. Alert on drift (see the sketch after this list).
8) Move dashboards gradually. Shadow-run dashboards against ClickHouse and compare results before cutover.
9) Throttle and protect. Enforce timeouts and concurrency limits on MariaDB reports during migration.
10) Operationalize ClickHouse. Monitoring, backups, capacity planning, and runbooks before you declare victory.
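A minimal reconciliation sketch for step 7: compare counts and sums for the same closed window on both sides (table names, the one-day window, and the revenue column are illustrative):
cr0x@server:~$ mariadb -B -N -e "SELECT COUNT(*), ROUND(SUM(total), 2) FROM app.orders WHERE created_at >= CURDATE() - INTERVAL 1 DAY AND created_at < CURDATE()"
cr0x@server:~$ clickhouse-client -q "SELECT count(), round(sum(total), 2) FROM analytics.orders WHERE created_at >= yesterday() AND created_at < today()"
Wrap both in a scheduled job, diff the results, and alert when they drift beyond your tolerance. Mini-story 3 is what this buys you.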
Checklist: what to set up on day one (minimum)
- MariaDB slow query log enabled and parsed daily.
- ClickHouse query log retention and dashboards for top queries.
- Disk usage alerts with conservative thresholds.
- Ingestion lag metric + alert.
- Merge/parts monitoring.
- A policy for who can run ad-hoc heavy queries and how they’re limited.
FAQ
1) Can’t I just add more MariaDB replicas and run reports there?
You can, and many teams do. It’s a stopgap. You’re still using a row-store OLTP engine for scan-heavy work,
and you’ll still fight lag, cache churn, and expensive hardware. If reporting is strategic, build an analytical system.
2) Is ClickHouse a replacement for MariaDB?
Not for OLTP. ClickHouse can store lots of data and answer questions fast, but it’s not designed for transactional updates,
strict constraints, and high-frequency point updates the way MariaDB is.
3) What’s the biggest schema mistake when moving to ClickHouse?
Importing normalized OLTP tables and expecting joins to behave like a warehouse. Design fact tables and pick an ORDER BY
aligned with your filters. Make your common queries cheap by construction.
4) How fresh can my dashboards be with ClickHouse?
With CDC and reasonable batching, “within a minute” is common. “Within a few seconds” is possible but increases operational risk,
especially if you turn ingestion into a small-part factory.
5) What about deletes and GDPR-style erasure?
Plan it early. For true erasure, you may need targeted mutations or TTL policies depending on your data model.
If you have legal deletion requirements, don’t hand-wave them—design the pipeline and storage with that in mind.
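A minimal sketch of both mechanisms, assuming an events table with ts and user_id columns (names and retention periods are illustrative):
-- Retention: age data out in the background.
ALTER TABLE analytics.events MODIFY TTL ts + INTERVAL 2 YEAR;
-- Targeted erasure: a mutation. Heavy and asynchronous; use sparingly.
ALTER TABLE analytics.events DELETE WHERE user_id = 123456;
-- Check progress before telling legal the data is gone.
SELECT command, is_done FROM system.mutations WHERE table = 'events' ORDER BY create_time DESC LIMIT 5;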
6) Should we pre-aggregate everything with materialized views?
No. Pre-aggregate the few metrics that are hot and expensive, especially dashboard tiles with strict latency expectations.
Keep raw facts for flexibility, but don’t make every dashboard query re-scan raw data for the same metrics.
7) How do I keep analysts from blowing up ClickHouse with bad queries?
Use profiles/quotas, limit memory per query, and provide curated tables for common exploration.
Also: teach people to filter by time first. Culture is a performance feature.
8) Is “run reports on MariaDB after hours” a valid long-term plan?
Only if the business accepts the latency and the dataset growth remains modest. In practice, reporting expands,
“after hours” disappears across time zones, and you end up back here. Offload early if reporting matters.
9) What if our “reports” need exact transactional consistency?
Then define which reports truly require it. Most don’t. For the few that do, keep them in MariaDB or generate them from
a consistent snapshot and accept a reporting delay. Don’t force every dashboard to meet the strictest requirement.
10) What’s the simplest first win with ClickHouse?
Offload event-style tables (page views, audits, order lifecycle events) and build a couple of summary tables for top dashboards.
You’ll see immediate relief on MariaDB and quick wins for dashboard latency.
Practical next steps
If reports are slow, treat it as a production problem, not a BI inconvenience. Your OLTP database is telling you it’s doing the wrong job.
Measure the pain, identify the worst offenders, and decide whether you’re optimizing a query or changing an architecture.
Do this next, in order:
- Pull the slow query log and rank by total time and rows examined.
- Pick one representative “killer report” and confirm it’s a scan-heavy aggregation.
- Set a freshness target and choose batch ETL or CDC accordingly.
- Build a ClickHouse fact table designed around the top dashboard filters and group-bys.
- Move one dashboard, validate results with reconciliation, then iterate.
- Add guardrails (quotas, memory limits, monitoring) before you invite the whole company.
The goal isn’t to worship ClickHouse. It’s to stop punishing MariaDB for being good at transactions.
Put each system in its lane, and you’ll get faster reports, fewer incidents, and a calmer on-call rotation.