Somewhere in your company, a well-meaning analyst just refreshed a dashboard. Now checkout is slow, the API is timing out, and the on-call channel is becoming a group therapy session.
This isn’t a “bad query” problem. It’s an architecture problem: OLTP and analytics are different animals, and putting them in the same cage ends predictably. The fix is a clean split—MySQL does transactions, ClickHouse does analytics, and you stop letting curiosity DDoS your revenue path.
The actual problem: OLTP and analytics fight at the storage layer
OLTP is about latency, correctness, and predictable concurrency. You optimize for thousands of small reads/writes per second, tight indexes, and hot working sets that fit in memory. The cost of a single slow request is paid immediately—in customer experience, timeouts, and retries that amplify load.
Analytics is about throughput, wide scans, and aggregation. You optimize for reading lots of data, compressing it well, and using vectorized execution to turn CPU into answers. Analytics queries are often “embarrassingly parallel” and don’t mind being a few seconds slower—until they’re pointed at your transactional database and become a denial-of-service with a pivot table attached.
The punchline: OLTP and analytics compete for the same finite resources—CPU cycles, disk I/O, page cache, buffer pools, locks/latches, and background maintenance (flushing, checkpoints, merges). Even if you add a read replica, you’re still frequently sharing the same fundamental pain: replication lag, I/O saturation, and inconsistent performance caused by unpredictable scans.
Where the knife goes in: resource contention in MySQL
- Buffer pool pollution: A big reporting query reads a cold slice of history, evicts hot pages, and suddenly your primary workload becomes disk-bound.
- InnoDB background pressure: Long scans + temp tables + sorts can increase dirty pages and redo pressure. Flush storms are not polite.
- Locks and metadata locks: Some reporting patterns trigger ugly interactions (think “ALTER TABLE during business hours” meets “SELECT …” holding MDL).
- Replication lag: Heavy reads on a replica steal I/O and CPU from the SQL thread applying changes.
Where ClickHouse fits
ClickHouse is built for analytics: columnar storage, compression, vectorized execution, and aggressive parallelism. It expects you to read lots of rows, but only a few columns, and it rewards you for grouping work into partitions and sorted keys.
The discipline is simple: treat MySQL as the system of record for transactions. Treat ClickHouse as the system of truth for analytics—truth meaning “derived from the record, reproducible, and queryable at scale.”
Paraphrased idea from Werner Vogels: “Everything fails; design for failure.” It applies to data too: design for failure modes like query storms, lag, and backfills.
MySQL vs ClickHouse: the real differences that matter in production
Storage layout: row vs column
MySQL/InnoDB is row-oriented. Great for fetching a row by primary key, updating a couple columns, maintaining secondary indexes, and enforcing constraints. But scanning a billion rows to compute aggregates means dragging entire rows through the engine, touching pages you didn’t need, and burning cache.
ClickHouse is column-oriented. It reads only the columns you ask for, compresses them well (often dramatically), and processes them in vectors. You pay upfront with different modeling constraints—denormalization, careful ordering keys, and a merge process that you must respect.
Concurrency model: transactional vs analytical parallelism
MySQL handles many concurrent short transactions well—up to the limits of your schema, indexes, and hardware. ClickHouse handles many concurrent reads too, but the magic is in parallelizing big reads and aggregations efficiently. If you point a BI tool at ClickHouse and allow unlimited concurrency with no query limits, it will try to set your CPU on fire. You can and should govern it.
Consistency and correctness
MySQL is ACID (with the usual caveats, but yes, it’s your transactional anchor). ClickHouse is typically eventually consistent for ingested data and append-oriented. You can model updates/deletes, but you do it on ClickHouse’s terms (ReplacingMergeTree, CollapsingMergeTree, version columns, or asynchronous deletes). That’s fine: analytics usually wants the current truth and a time series of changes, not per-row transactional semantics.
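If you do need a “current state” view in ClickHouse, the usual pattern is a version column plus ReplacingMergeTree. A minimal sketch, assuming an illustrative orders_current table fed by CDC (the column names are placeholders, not your schema):

CREATE TABLE analytics.orders_current
(
    order_id    UInt64,
    customer_id UInt64,
    status      LowCardinality(String),
    total       Decimal(18, 2),
    updated_at  DateTime,
    _version    UInt64,          -- monotonically increasing version from the CDC stream
    _deleted    UInt8 DEFAULT 0  -- soft-delete flag
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;

-- Merges deduplicate eventually, not immediately; pick the latest version at query time:
SELECT
    order_id,
    argMax(status, _version) AS status,
    argMax(total, _version)  AS total
FROM analytics.orders_current
GROUP BY order_id
HAVING argMax(_deleted, _version) = 0;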
Indexing and query patterns
MySQL indexes are B-trees that support point lookups and range scans. ClickHouse uses primary key ordering and sparse indexes, plus data skipping indexes (like bloom filters) where it helps. The best ClickHouse query is one that can skip big chunks of data because your partitioning and ordering match the access patterns.
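If a common filter column isn’t in the ordering key, a data skipping index can still let ClickHouse skip granules, provided the values cluster with how data lands on disk. A hedged sketch against the orders_events table used later, with an illustrative coupon_code column:

-- Bloom filter skip index: granules that cannot contain the value are skipped
ALTER TABLE analytics.orders_events
    ADD INDEX idx_coupon coupon_code TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for existing parts (new parts get it automatically)
ALTER TABLE analytics.orders_events MATERIALIZE INDEX idx_coupon;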
Operational posture
MySQL operations revolve around replication health, backups, schema migrations, and query stability. ClickHouse operations revolve around merges, disk utilization, part counts, TTL, and query governance. In other words: you trade one set of dragons for a different set. The deal is still worth it because you stop letting analytics wreck your checkout flow.
Joke #1: A dashboard refresh is the only kind of “user engagement” that can increase error rates and churn simultaneously.
Facts and historical context (useful, not trivia)
- MySQL’s InnoDB became default in MySQL 5.5 (2010 era), cementing row-store OLTP behavior for most deployments.
- ClickHouse started at Yandex to power analytics workloads at scale; it grew up in a world where scanning big data fast was the whole job.
- Column stores took off because CPU got faster than disks, and compression + vectorized execution let you spend CPU to avoid I/O.
- InnoDB buffer pool “pollution” is a classic failure mode when long scans blow away hot pages; the engine isn’t “broken,” it’s doing what you asked.
- Replication-based analytics has existed for decades: people have been shipping OLTP changes into data warehouses since before “data lake” was a résumé keyword.
- MySQL query cache was removed in MySQL 8.0 because it caused contention and didn’t scale well; caching isn’t free, and global locks are expensive.
- ClickHouse’s MergeTree family stores data in parts and merges them in the background—great for writes and compression, but it creates operational signals (part counts, merge backlog) you must monitor.
- “Star schema” and dimensional modeling predate modern tools; ClickHouse often pushes teams back toward denormalized, query-friendly shapes because joins at scale have real costs.
The clean split plan: patterns that don’t melt prod
Principle 1: MySQL is for serving users, not curiosity
Make it policy: production MySQL is not a reporting database. Not “usually.” Not “except for a quick query.” Never. If someone needs a one-off, run it against ClickHouse or a controlled snapshot environment.
You’ll get pushback. That’s normal. The trick is to replace “no” with “here’s the safe way.” Provide the safe path: ClickHouse access, curated datasets, and a workflow that doesn’t involve begging on-call for permission to run a JOIN across a year of orders.
Principle 2: Choose a data movement strategy that matches your failure tolerance
There are three common ways to feed ClickHouse from MySQL. Each has sharp edges.
Option A: Batch ETL (dump and load)
You extract hourly/daily snapshots (mysqldump, CSV exports, Spark jobs), load into ClickHouse, and accept staleness. This is simplest operationally but can be painful when you need near-real-time metrics, and backfills can be heavy.
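A minimal dump-and-load sketch, pulling yesterday’s orders from a replica into ClickHouse over a pipe. Host names, credentials handling, and the target column list are assumptions; the point is the shape, not the specifics:

cr0x@server:~$ mysql -h mysql-replica01 -N -B -e "SELECT DATE(created_at), created_at, id, customer_id, total FROM prod.orders WHERE created_at >= CURDATE() - INTERVAL 1 DAY AND created_at < CURDATE()" \
  | clickhouse-client -h ch01 -q "INSERT INTO analytics.orders_events (event_date, event_time, order_id, customer_id, total) FORMAT TabSeparated"

Run it against a replica or snapshot, never the primary, and wrap it in retries and row-count checks before you trust it.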
Option B: Replication-driven ingestion (CDC)
Capture changes from MySQL binlog and stream them into ClickHouse. This gets you near-real-time analytics while keeping MySQL insulated from query load. But it introduces pipeline health as a first-class production concern: lag, schema drift, and reprocessing become your new hobby.
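One common shape (not the only one): a CDC tool such as Debezium writes row changes to Kafka, and ClickHouse consumes them with a Kafka engine table plus a materialized view. A sketch under those assumptions; broker, topic, and column names are illustrative, and the target table matches the one sketched under Principle 3 below:

-- Consumer table: rows appear here as they arrive on the topic
CREATE TABLE analytics.orders_cdc_queue
(
    order_id    UInt64,
    customer_id UInt64,
    total       Decimal(18, 2),
    op          LowCardinality(String),  -- c/u/d from the CDC tool (dropped below for brevity)
    ts_ms       UInt64
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka01:9092',
         kafka_topic_list  = 'prod.orders',
         kafka_group_name  = 'clickhouse_orders',
         kafka_format      = 'JSONEachRow';

-- Materialized view lands rows in the durable event table
CREATE MATERIALIZED VIEW analytics.orders_cdc_mv TO analytics.orders_events AS
SELECT
    toDateTime(intDiv(ts_ms, 1000)) AS event_time,
    toDate(event_time)              AS event_date,
    order_id,
    customer_id,
    total,
    now()                           AS ingested_at
FROM analytics.orders_cdc_queue;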
Option C: Dual-write (application writes to both)
Don’t. Or, if you absolutely must, do it only with robust idempotency, asynchronous delivery, and a reconciliation job that assumes the dual-write will lie to you occasionally.
The clean split plan usually means CDC plus curated data models in ClickHouse. Batch ETL is acceptable when you can tolerate staleness. Dual-write is a trap unless you enjoy explaining data mismatches during incident postmortems.
Principle 3: Model ClickHouse for your questions, not your schema
Most OLTP schemas are normalized. Analytics wants fewer joins, stable keys, and event-style tables. Your job is to build an analytics representation that’s easy to query and hard to misuse.
- Prefer event tables: orders_events, sessions, payments, shipments, support_tickets. Append events. Derive facts.
- Partition by time: usually by day or month. This gives you predictable pruning and manageable TTL.
- Order by query dimensions: put the most common filter/group-by keys early in ORDER BY (after the time key if you always filter by time).
- Pre-aggregate where it’s stable: materialized views can produce rollups so dashboards don’t repeatedly scan raw data (see the sketch after this list).
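To make that concrete, here’s a minimal sketch of an event table plus a daily rollup, using the column names that appear in the tasks below (event_date, customer_id, total). Treat the types and partitioning as starting points, not a blessed schema:

CREATE TABLE analytics.orders_events
(
    event_date  Date,
    event_time  DateTime,
    order_id    UInt64,
    customer_id UInt64,
    total       Decimal(18, 2),
    ingested_at DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY event_date          -- daily here; monthly (toYYYYMM) is also common
ORDER BY (event_date, customer_id);

-- Rollup so dashboards stop rescanning raw events
CREATE MATERIALIZED VIEW analytics.orders_daily_mv
ENGINE = SummingMergeTree
PARTITION BY event_date
ORDER BY (event_date, customer_id)
AS SELECT
    event_date,
    customer_id,
    sum(total) AS revenue,
    count()    AS orders
FROM analytics.orders_events
GROUP BY event_date, customer_id;

Because merges are eventual, dashboards should still query the rollup with sum() and GROUP BY rather than assuming one row per key.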
Principle 4: Governance beats heroics
ClickHouse can answer questions fast enough that people will ask worse questions more frequently. You need guardrails:
- Separate users and quotas: BI users get timeouts and max memory. ETL gets a different profile.
- Set max threads and concurrency: avoid a “thundering herd” of parallel queries.
- Use dedicated “gold” datasets: stable views or tables that dashboards depend on, versioned if needed.
- Define SLOs: MySQL latency SLO is sacred. ClickHouse freshness SLO is negotiable but measurable.
Hands-on tasks (commands, outputs, decisions)
These are the moves you actually make at 02:13. Each task includes a command, sample output, what it means, and the decision you make from it.
Task 1: Confirm MySQL is suffering from analytic scans (top digests)
cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1e12 AS total_s FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN ? AND ? GROUP BY customer_id
COUNT_STAR: 9421
total_s: 18873.214
*************************** 2. row ***************************
DIGEST_TEXT: SELECT * FROM orders WHERE created_at > ? ORDER BY created_at DESC LIMIT ?
COUNT_STAR: 110233
total_s: 8211.532
What it means: Your worst time is coming from a classic reporting aggregate across a date range. It’s not “one slow query,” it’s repeated pain.
Decision: Block or reroute the analytic query pattern. Don’t tune MySQL into an OLAP engine. Start by moving that dashboard to ClickHouse or a rollup table.
Task 2: Check current MySQL thread activity (is it a dogpile?)
cr0x@server:~$ mysql -e "SHOW PROCESSLIST;" | head
Id User Host db Command Time State Info
31 app 10.0.2.14:51234 prod Query 2 Sending data SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id
44 app 10.0.2.14:51239 prod Query 2 Sending data SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id
57 app 10.0.2.14:51241 prod Query 1 Sending data SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id
What it means: Many identical queries are running concurrently. This is a dashboard or a fleet of workers doing the same expensive work.
Decision: Throttle at the app/BI layer and introduce caching or pre-aggregation in ClickHouse. Also consider MySQL connection limits and per-user resource controls.
Task 3: Validate InnoDB buffer pool pressure (hot pages getting evicted)
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';"
Variable_name Value
Innodb_buffer_pool_read_requests 987654321
Innodb_buffer_pool_reads 12345678
What it means: A high number of physical reads (Innodb_buffer_pool_reads) relative to logical reads suggests your working set isn’t staying in memory—often due to big scans.
Decision: Stop the scans (move analytics off), and only then consider increasing buffer pool or adjusting workload. Hardware cannot outvote bad workload mix forever.
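If you want this as one number instead of eyeballing two counters, a quick sketch against performance_schema (MySQL 5.7/8.0). Remember the counters are cumulative since startup, so compare deltas during the incident, not lifetime averages:

SELECT ROUND(
    100 *
    MAX(CASE WHEN variable_name = 'Innodb_buffer_pool_reads' THEN variable_value END) /
    MAX(CASE WHEN variable_name = 'Innodb_buffer_pool_read_requests' THEN variable_value END),
  3) AS pct_requests_hitting_disk
FROM performance_schema.global_status
WHERE variable_name IN ('Innodb_buffer_pool_reads', 'Innodb_buffer_pool_read_requests');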
Task 4: Catch disk I/O saturation on the MySQL host
cr0x@server:~$ iostat -xz 1 3
Linux 6.2.0 (mysql01) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.12 0.00 6.44 31.55 0.00 43.89
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 820.0 64200.0 0.0 0.0 12.4 78.3 410.0 18800.0 9.8 18.2 98.7
What it means: %util near 100% and high iowait means the disk is the bottleneck. Analytics scans love this outcome.
Decision: Immediate: reduce query concurrency, kill worst offenders, shift analytics to ClickHouse. Long-term: separate storage and workloads; don’t bank on “faster NVMe” as a strategy.
Task 5: Identify MySQL replication lag (your “read replica” isn’t helping)
cr0x@server:~$ mysql -h mysql-replica01 -e "SHOW SLAVE STATUS\G" | egrep "Seconds_Behind_Master|Slave_SQL_Running|Slave_IO_Running"
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 487
What it means: The replica is ~8 minutes behind. Dashboards reading it are lying. Worse: if you fail over, you might lose recent transactions.
Decision: Don’t use the replica as an analytics sink. Use CDC to ClickHouse, or at least a dedicated replica with controlled query access and guaranteed resources.
Task 6: Show the actual expensive query plan (stop guessing)
cr0x@server:~$ mysql -e "EXPLAIN SELECT customer_id, sum(total) FROM orders WHERE created_at BETWEEN '2025-12-01' AND '2025-12-30' GROUP BY customer_id\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: orders
type: range
possible_keys: idx_created_at
key: idx_created_at
rows: 98234123
Extra: Using where; Using temporary; Using filesort
What it means: Even with an index, you’re scanning ~98M rows and using temp/filesort. That’s not an OLTP query; it’s an OLAP job.
Decision: Move it. If you must keep some aggregates in MySQL, use summary tables updated incrementally, not ad hoc GROUP BY over raw facts.
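If a handful of aggregates truly must stay in MySQL, maintain them incrementally instead of re-grouping raw orders. A hedged sketch; table and column names are illustrative, and idempotency under retries and replays is your problem to solve:

-- One row per customer per day, bumped as orders arrive
CREATE TABLE orders_daily_summary (
  summary_date DATE          NOT NULL,
  customer_id  BIGINT        NOT NULL,
  order_count  INT           NOT NULL DEFAULT 0,
  revenue      DECIMAL(18,2) NOT NULL DEFAULT 0,
  PRIMARY KEY (summary_date, customer_id)
) ENGINE=InnoDB;

-- Called per new order from the app or a small worker
INSERT INTO orders_daily_summary (summary_date, customer_id, order_count, revenue)
VALUES (DATE(NOW()), 42, 1, 129.90)
ON DUPLICATE KEY UPDATE
  order_count = order_count + VALUES(order_count),
  revenue     = revenue + VALUES(revenue);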
Task 7: Confirm ClickHouse health basics (are merges or disk the issue?)
cr0x@server:~$ clickhouse-client -q "SELECT hostName(), uptime()"
ch01
345678
What it means: You can connect and the server is alive long enough to be useful.
Decision: Proceed to deeper checks: parts/merges, query load, and disk.
Task 8: Check ClickHouse active queries and their resource usage
cr0x@server:~$ clickhouse-client -q "SELECT user, query_id, elapsed, read_rows, formatReadableSize(memory_usage) AS mem, left(query, 80) AS q FROM system.processes ORDER BY memory_usage DESC LIMIT 5 FORMAT TabSeparated"
bi_user 0f2a... 12.4 184001234 6.31 GiB SELECT customer_id, sum(total) FROM orders_events WHERE event_date >= toDate('2025-12-01')
etl 9b10... 3.1 0 512.00 MiB INSERT INTO orders_events FORMAT JSONEachRow
What it means: BI is consuming memory. That’s fine if it’s budgeted. It’s a problem if it starves merges or triggers OOM.
Decision: Set per-user max_memory_usage, max_threads, and possibly max_concurrent_queries_for_user. Keep ETL reliable.
Task 9: Check ClickHouse merges backlog (parts growing like weeds)
cr0x@server:~$ clickhouse-client -q "SELECT database, table, sum(parts) AS parts, formatReadableSize(sum(bytes_on_disk)) AS disk FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(parts) DESC LIMIT 10 FORMAT TabSeparated"
analytics orders_events 1842 1.27 TiB
analytics sessions 936 640.12 GiB
What it means: Thousands of parts can mean heavy insert fragmentation or merges falling behind. Query performance will degrade, and startup/metadata gets heavier.
Decision: Adjust insert batching, tune merge settings conservatively, and consider partition strategy. If parts keep climbing, treat it as an incident in slow motion.
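To see whether merges are actually running or just piling up behind resources, check the in-flight view; a small sketch to run via clickhouse-client:

SELECT
    table,
    round(elapsed, 1)        AS elapsed_s,
    round(progress * 100, 1) AS pct_done,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
ORDER BY elapsed DESC
LIMIT 10;

An empty result while parts keep climbing means merges aren’t keeping up; look at insert batching, disk headroom, and merge settings before blaming the hardware.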
Task 10: Validate partition pruning (if it scans everything, you modeled it wrong)
cr0x@server:~$ clickhouse-client -q "EXPLAIN indexes=1 SELECT customer_id, sum(total) FROM analytics.orders_events WHERE event_date BETWEEN toDate('2025-12-01') AND toDate('2025-12-30') GROUP BY customer_id"
Expression ((Projection + Before ORDER BY))
Aggregating
Filter (WHERE)
ReadFromMergeTree (analytics.orders_events)
Indexes:
MinMax
Keys: event_date
Condition: (event_date in [2025-12-01, 2025-12-30])
Parts: 30/365
Granules: 8123/104220
What it means: It’s reading 30/365 parts thanks to the date filter. That’s what “works as designed” looks like.
Decision: If Parts reads close to total, change partitioning and/or require time filters in dashboards.
Task 11: Monitor ClickHouse disk usage and predict capacity trouble
cr0x@server:~$ df -h /var/lib/clickhouse
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 3.5T 3.1T 330G 91% /var/lib/clickhouse
What it means: 91% used. You are one backfill away from a bad day, and merges need headroom.
Decision: Stop non-essential backfills, extend storage, enforce TTL, and compress/optimize data model. ClickHouse under disk pressure becomes unpredictably slow and risky.
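TTL is the lever that keeps this from recurring. A hedged sketch, assuming stakeholders have agreed that raw events older than 180 days can go:

-- Expire raw events after 180 days; merges apply this in the background
ALTER TABLE analytics.orders_events
    MODIFY TTL event_date + INTERVAL 180 DAY;

-- Optionally apply TTL to existing parts now (I/O heavy; schedule it off-peak)
ALTER TABLE analytics.orders_events MATERIALIZE TTL;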
Task 12: Verify CDC pipeline lag at the consumer (is analytics stale?)
cr0x@server:~$ clickhouse-client -q "SELECT max(ingested_at) AS last_ingest, now() AS now, dateDiff('second', max(ingested_at), now()) AS lag_s FROM analytics.orders_events"
2025-12-30 19:03:12 2025-12-30 19:03:29 17
What it means: ~17 seconds lag. That’s healthy for “near real time” analytics.
Decision: If lag climbs, pause heavy queries, check pipeline throughput, and decide whether to degrade dashboards or risk OLTP.
Task 13: Check MySQL binary log format for CDC correctness
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'binlog_format';"
Variable_name Value
binlog_format ROW
What it means: ROW format is typically what CDC tools want for correctness. STATEMENT can be ambiguous for non-deterministic queries.
Decision: If you’re not on ROW, plan a change window. CDC correctness is not something you “hope for.”
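A minimal sketch of the change on MySQL 8.0, assuming you’ve confirmed nothing depends on statement-based logging; sessions opened before the change keep their old format until they reconnect, which is exactly why you want the window:

-- Persist across restarts (MySQL 8.0+); mirror it in my.cnf if config is managed there
SET PERSIST binlog_format = 'ROW';

-- Most CDC tools also want full before/after row images
SET PERSIST binlog_row_image = 'FULL';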
Task 14: Confirm MySQL has sane slow query logging (so you can prove causality)
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'slow_query_log%'; SHOW VARIABLES LIKE 'long_query_time';"
Variable_name Value
slow_query_log ON
slow_query_log_file /var/log/mysql/mysql-slow.log
Variable_name Value
long_query_time 0.500000
What it means: You’ll capture queries slower than 500ms. That’s aggressive, but useful during a noisy period.
Decision: During incidents, lower long_query_time briefly and sample. Afterward, set it to a stable threshold and use digest summaries.
Task 15: Verify ClickHouse user limits (prevent a BI “parallelism party”)
cr0x@server:~$ clickhouse-client -q "SHOW CREATE USER bi_user"
CREATE USER bi_user IDENTIFIED WITH sha256_password SETTINGS max_memory_usage = 4000000000, max_threads = 8, max_execution_time = 60, max_concurrent_queries_for_user = 5
What it means: BI is fenced: 4GB memory, 8 threads, 60s runtime, 5 concurrent queries. That’s the difference between a dashboard and a stress test.
Decision: If you can’t set limits because “business needs,” you’re not running analytics, you’re running roulette.
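If you prefer managing limits as reusable objects instead of per-user SETTINGS, here is a sketch with a settings profile and a quota; the numbers are starting points to tune, not recommendations:

CREATE SETTINGS PROFILE IF NOT EXISTS bi_profile SETTINGS
    max_memory_usage = 4000000000,
    max_threads = 8,
    max_execution_time = 60
TO bi_user;

-- Cap total work per hour so one enthusiastic dashboard can't monopolize the cluster
CREATE QUOTA IF NOT EXISTS bi_hourly
    FOR INTERVAL 1 hour MAX queries = 600, read_rows = 20000000000
TO bi_user;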
Fast diagnosis playbook
This is the order that finds the bottleneck quickly, without turning the incident into a philosophy debate.
First: Is MySQL overloaded by reads, writes, locks, or I/O?
- Top query digests (performance_schema digests or slow log): identify the query families eating time.
- Thread states (SHOW PROCESSLIST): “Sending data” suggests scan/aggregation; “Locked” suggests contention; “Waiting for table metadata lock” suggests DDL collision.
- Disk I/O (iostat): if iowait is high and disk %util is pegged, stop scans before tuning anything else.
Second: Is the “solution” (replica) actually making it worse?
- Replication lag (SHOW SLAVE STATUS, or SHOW REPLICA STATUS on MySQL 8.0.22+): if lag is minutes, analytics users are making decisions on stale data and blaming you for it.
- Replica resource contention: heavy queries can starve the SQL thread and increase lag further.
Third: If ClickHouse exists, is it healthy and governed?
- system.processes: identify runaway BI queries and memory hogs.
- Parts and merges (system.parts): too many parts means ingestion shape or merge backlog problems.
- Disk headroom (df): merges and TTL need space; 90% full is operational debt with interest.
Fourth: Is data freshness the real complaint?
- CDC lag (max ingested_at): quantify the staleness.
- Communicate a fallback: if freshness degrades, degrade dashboards—not checkout.
Three corporate mini-stories from the trenches
Incident caused by a wrong assumption: “Read replicas are for reporting”
A mid-size subscription business had a primary MySQL cluster and two read replicas. Their BI tool was pointed at a replica because “reads don’t affect writes.” That phrase has caused more incidents than caffeine has prevented.
During month-end, finance ran a set of cohort and revenue reports. The replica’s disk hit saturation: heavy scans plus temp tables. Replication lag rose from seconds to tens of minutes. Nobody noticed at first because application traffic was fine; the primary wasn’t directly impacted.
Then someone made the second assumption: “If the primary fails, we can fail over to a replica.” Right when the lag was worst, the primary had an unrelated host issue and went into an unhealthy state. Automation tried to promote the “best” replica—except “best” was 20 minutes behind.
They didn’t lose the entire database. They lost enough recent transactions to create a customer support nightmare: payments that “succeeded” externally but didn’t exist internally, and sessions that didn’t match billing. Recovery was a careful mix of binlog spelunking and reconciling against the payment provider.
The fix wasn’t heroic. They separated concerns: a dedicated replica for failover with strict query blocking, and analytics moved to ClickHouse via CDC. Reporting became fast, and failover became trustworthy because the replica was no longer being used as a punching bag.
Optimization that backfired: “Let’s just add an index”
An e-commerce team had a slow reporting query on orders: time range filter plus group-by. Someone added an index on created_at and another composite index on (created_at, customer_id). The query got faster in isolation, so they shipped it and celebrated.
Two weeks later, write latency started spiking. Inserts into orders slowed, and the background flush rate climbed. The new indexes increased write amplification—every insert now maintained more B-tree structures. At peak traffic, they were paying an index tax on every transaction to make a handful of reports cheaper.
Then the BI tool got a new dashboard that ran the same query every minute. The query was faster, so concurrency increased (humans love pressing refresh when refresh is quick). The system traded one slow query for many medium-fast queries and still ended up I/O bound.
The actual solution was to remove the index bloat, keep OLTP lean, and build a ClickHouse rollup table updated continuously. Dashboards hit ClickHouse. Transactions stayed smooth. The team learned the hard lesson: indexing is not “free speed,” it’s a write-time bill you pay forever.
Boring but correct practice that saved the day: quotas and staged backfills
A B2B SaaS company ran ClickHouse for analytics with strict user profiles. BI users had max_execution_time, max_memory_usage, and concurrency limits. ETL had different limits and ran in a controlled queue. Nobody loved those constraints. Everyone benefited from them.
One afternoon, an analyst attempted to run a wide query across two years of raw events without a date filter. ClickHouse started scanning, hit the execution time limit, and killed the query. The analyst complained. On-call did not get paged. That’s a good trade.
Later that month, the data team needed a backfill due to a schema change in the upstream CDC. They staged it: a day at a time, verifying part counts, disk headroom, and lag after each chunk. Slow, careful, measurable. The backfill finished without threatening production dashboards.
The boring practice wasn’t a fancy algorithm. It was governance and operational discipline: limits, queues, and incremental backfills. It saved them because the system behaved predictably when humans behaved unpredictably.
Joke #2: The only thing more permanent than a temporary dashboard is the incident channel it creates.
Common mistakes: symptom → root cause → fix
- Symptom: MySQL p95 latency spikes during “business reporting hours”
  Root cause: Long scans and GROUP BY queries competing with OLTP for buffer pool and I/O
  Fix: Move reporting to ClickHouse; enforce policy; add curated rollups; block BI users from MySQL.
- Symptom: Read replica lag increases when analysts run reports
  Root cause: Replica I/O and CPU saturated; the SQL thread can’t apply the binlog fast enough
  Fix: Remove analytic access from failover replicas; use CDC to ClickHouse; cap query concurrency.
- Symptom: ClickHouse queries get slower over time with no change in data size
  Root cause: Parts explosion; merges falling behind due to fragmented inserts or disk pressure
  Fix: Batch inserts; tune merge-related settings carefully; monitor parts; ensure disk headroom; consider repartitioning.
- Symptom: Dashboards are “fast sometimes” and time out randomly on ClickHouse
  Root cause: Unbounded BI concurrency; memory pressure; noisy-neighbor queries
  Fix: Set per-user limits (memory, threads, execution time, concurrent queries); create pre-aggregated tables; add query routing.
- Symptom: Analytics data has duplicates or “wrong latest state”
  Root cause: CDC applied as append-only without dedup/versioning; updates and deletes not modeled correctly
  Fix: Use version columns and ReplacingMergeTree where appropriate; store events and derive current state via materialized views.
- Symptom: ClickHouse disk keeps climbing until it’s an emergency
  Root cause: No TTL; storing raw data forever; heavy backfills; no capacity guardrails
  Fix: Apply TTL for cold data; downsample; compress; archive; enforce quotas and backfill procedures.
- Symptom: “We moved to ClickHouse but MySQL is still slow”
  Root cause: The CDC pipeline still reads MySQL heavily (full-table extracts, frequent snapshots), or the app still runs reports on MySQL
  Fix: Use binlog-based CDC; review MySQL query sources; lock down reporting accounts (credentials, network rules); validate with digest data.
- Symptom: ClickHouse freshness lags during peaks
  Root cause: Ingestion bottleneck (pipeline throughput), merges, or disk pressure; sometimes too many small inserts
  Fix: Batch inserts; scale ingestion; monitor lag; temporarily reduce BI concurrency; prioritize ETL resources.
Checklists / step-by-step plan
Step-by-step: the clean split implementation plan
- Declare the boundary: production MySQL is OLTP only. Write it down. Enforce it with accounts and network policy.
- Inventory analytic queries: use MySQL digest tables and slow log summaries to list the top 20 query families.
- Pick the ingestion method: CDC for near-real-time; batch for daily/hourly; avoid dual-write.
- Define analytics tables in ClickHouse: start with event tables, time partitioning, and ORDER BY keys aligned to filters.
- Build “gold” datasets: materialized views or rollup tables for dashboards; keep raw data for deep dives.
- Set governance from day one: user profiles, quotas, max_execution_time, max_memory_usage, max_concurrent_queries.
- Measure freshness: track ingestion lag and publish the SLO to stakeholders. People tolerate staleness when it’s explicit.
- Cut over dashboards: migrate the highest-impact dashboards first (the ones that page on-call indirectly).
- Block the old path: remove BI credentials from MySQL; firewall if needed; prevent regression.
- Backfill safely: incremental, measurable, with disk headroom checks; no “just run it overnight” fantasies (see the sketch after this list).
- Load test analytics: simulate dashboard concurrency. ClickHouse will happily accept your optimism and then punish it.
- Operationalize: alerts on ClickHouse parts count, disk usage, query failures, ingestion lag; and on MySQL latency/IO.
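For the backfill step, one partition or one day at a time keeps parts, disk, and lag observable between chunks. A hedged sketch using ClickHouse’s mysql() table function against a dedicated replica or snapshot host, never the primary; host, credentials, and column names are placeholders:

-- Backfill exactly one day, then re-check parts, disk headroom, and CDC lag
INSERT INTO analytics.orders_events (event_date, event_time, order_id, customer_id, total)
SELECT
    toDate(created_at),
    created_at,
    id,
    customer_id,
    total
FROM mysql('mysql-replica01:3306', 'prod', 'orders', 'backfill_user', 'REDACTED')
WHERE created_at >= '2025-11-01' AND created_at < '2025-11-02';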
Release checklist: moving one dashboard from MySQL to ClickHouse
- Does the dashboard query include a time filter that matches partitioning?
- Is there a rollup table/materialized view to avoid scanning raw events repeatedly?
- Is the ClickHouse user limited (memory, threads, execution time, concurrency)?
- Is the CDC lag metric visible to the dashboard users?
- Is the old MySQL query blocked or at least removed from the app/BI tool?
- Did you validate results for a known time window (spot-check totals and counts, as in the sketch below)?
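A simple spot-check sketch: compare one day’s totals from both sides and expect only small, explainable drift (CDC lag, timezone handling). The first query runs on MySQL, the second on ClickHouse:

-- MySQL (system of record)
SELECT COUNT(*) AS orders, SUM(total) AS revenue
FROM orders
WHERE created_at >= '2025-12-01' AND created_at < '2025-12-02';

-- ClickHouse (analytics copy)
SELECT count() AS orders, sum(total) AS revenue
FROM analytics.orders_events
WHERE event_date = toDate('2025-12-01');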
Operational checklist: weekly hygiene that prevents slow disasters
- Review ClickHouse active parts by table; investigate fast growth.
- Review ClickHouse disk headroom; keep enough free space for merges and backfills.
- Review top BI queries by read_rows and memory usage; optimize or pre-aggregate.
- Review MySQL top digests to ensure analytics didn’t creep back in.
- Test restore paths: MySQL backups, ClickHouse metadata and data recovery expectations.
FAQ
1) Can’t I just scale MySQL vertically and call it a day?
You can, and you’ll get temporary relief. The failure mode returns when the next dashboard or cohort query appears. The issue is workload mismatch, not just horsepower.
2) What if I already have MySQL read replicas—should I point BI at them?
Only if you’re comfortable with lag and you’re not using those replicas for failover. Even then, cap concurrency and treat it as a temporary bridge, not the end state.
3) Is ClickHouse “real time” enough for operational dashboards?
Often yes, with CDC. Measure ingestion lag explicitly and design dashboards to tolerate small delays. If you need sub-second transactional truth, that’s MySQL territory.
4) How do I handle updates and deletes from MySQL in ClickHouse?
Prefer event modeling (append changes). If you need “current state,” use versioned rows with engines like ReplacingMergeTree and design queries/materialized views accordingly.
5) Will ClickHouse replace my data warehouse?
Sometimes. For many companies it becomes the primary analytics store. But if you need heavy transformations, governance, or cross-system modeling, you may still keep a warehouse layer. Don’t force a religious conversion.
6) What’s the fastest win if we’re on fire today?
Stop running analytics on MySQL immediately: kill the worst queries, remove BI access, and ship the dashboard to ClickHouse or a cached rollup. Then fix it properly.
7) What’s the biggest ClickHouse operational surprise for MySQL teams?
Merges and parts. Row-store folks expect “I inserted it, it’s done.” ClickHouse continues working in the background, and you must monitor that work.
8) How do I prevent analysts from writing expensive ClickHouse queries?
Use user profiles with quotas and timeouts, provide curated “gold” tables, and teach people that missing time filters is not “exploration,” it’s arson.
9) Do materialized views solve everything?
No. They’re great for stable rollups and common aggregates. But they can add complexity and storage cost. Use them where they reduce repeated work measurably.
10) What if my analytics queries require complex joins across many tables?
Denormalize for the common paths, precompute dimensions, and keep joins limited. ClickHouse can join, but the best production analytics systems avoid doing it repeatedly at query time.
Conclusion: practical next steps
If you take one action this week, make it this: remove analytics load from MySQL. Not by pleading with users to “be careful,” but by providing a better place to ask questions.
- Lock down MySQL: separate accounts, block BI networks, and enforce that production MySQL serves users first.
- Stand up ClickHouse governance: limits, quotas, and curated datasets before you invite the whole company.
- Move the top 5 worst queries: replicate the needed data via CDC or batch, then build rollups so dashboards stay cheap.
- Operationalize freshness: publish ingestion lag and treat it like a product requirement. It’s better to be honestly 60 seconds behind than unknowably wrong.
- Practice backfills: staged, measurable, reversible. Your future self will appreciate your current self’s restraint.
The clean split isn’t glamorous. It’s just the difference between a database that serves customers and a database that hosts a daily analytics cage match. Pick the calmer life.