You add a “simple” realtime dashboard. Product loves it. Execs love it. Then checkout gets weird: p95 latency doubles, CPU pegs, and MySQL starts
coughing up “too many connections” like it’s auditioning for a medical drama.
The dashboard isn’t evil. The architecture is. OLTP databases and analytics workloads have different physics, and ignoring that is how you end up
explaining to finance why “conversion rate” briefly converted into “zero”.
The real problem: mixed workloads and shared failure domains
Realtime dashboards fail in a surprisingly predictable way: they ask OLTP systems to behave like OLAP systems. You can sometimes brute-force it
with indexes, replicas, and cache. But if the dashboard matters, it will grow. It will acquire more filters, more joins, more “just one more
dimension.” Then one day it runs during a campaign spike and your payment flow gets a front-row seat to your analytics ambitions.
The core issue isn’t query speed. It’s interference. OLTP and analytics compete for the same scarce resources:
- CPU (analytics loves it; OLTP needs it for concurrency)
- Buffer pool / page cache (analytics churns it; OLTP benefits from stability)
- I/O (analytics scans; OLTP needs low-latency random reads/writes)
- Locks and latches (not always obvious, but they show up when you least want them)
- Connection slots (dashboards are often “chatty”)
Dashboards are also socially privileged workloads. Nobody says “turn off checkout for an hour.” People absolutely say “why is the dashboard not
realtime?” That’s how you end up optimizing the wrong thing.
One dry truth: “realtime” usually means “fresh enough to make decisions.” That might be 2 seconds. It might be 2 minutes. If you can’t define it,
you’ll implement the worst kind: expensive realtime.
MySQL vs ClickHouse: what each engine is really good at
MySQL: the checkout workhorse
MySQL is an OLTP database with a mature ecosystem, strong operational tooling, and predictable behavior under write-heavy transactional workloads.
InnoDB is optimized for:
- Many small reads/writes
- Point lookups by primary key or well-chosen secondary indexes
- Transactional guarantees and consistent reads
- High concurrency with short queries
MySQL can do analytics-ish queries. It can even do them well if your data set is modest and your queries are disciplined. But when dashboards start
doing big scans, heavy group-bys, wide joins, and “last 90 days by X and Y and Z,” MySQL pays in I/O, buffer pool churn, and CPU. The price shows
up in the one place you don’t want: tail latency.
ClickHouse: the dashboard engine that eats scans for breakfast
ClickHouse is a columnar OLAP database designed for high-throughput analytical queries on large data sets. It shines when you need:
- Fast aggregations over billions of rows
- High-cardinality dimensions (within reason and with proper design)
- Compression and efficient storage for append-heavy event data
- Parallel query execution
ClickHouse’s strengths come with a different set of expectations: you model for reads, you accept eventual consistency patterns, and you learn the
quirks of merges, parts, and background work. If you try to use it like MySQL (row-by-row updates, transactional workflows, lots of tiny mutations),
it will remind you—politely at first—that you’re holding it wrong.
Here’s the decision line I use in production: if the query shape is “scan + filter + group-by + time window,” that belongs in ClickHouse.
If it’s “read/update a small number of rows, enforce invariants, take money,” keep it in MySQL.
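To make that line concrete, here are the two shapes side by side. This is a sketch; the table and column names are illustrative, not from any real schema.

-- Dashboard shape: scan + filter + group-by + time window (ClickHouse territory)
SELECT toStartOfHour(event_time) AS h, count() AS orders
FROM analytics.events
WHERE event_date >= today() - 7 AND event_type = 'order_created'
GROUP BY h
ORDER BY h;

-- OLTP shape: touch a few rows, enforce an invariant, take money (MySQL territory)
UPDATE inventory SET reserved = reserved + 1
WHERE sku = 'A17-44' AND reserved < stock_on_hand;   -- stock_on_hand is an illustrative column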
Joke #1: Real-time dashboards on MySQL are like towing a boat with a scooter. You can do it briefly, but everyone learns something unpleasant.
Interesting facts and historical context (the stuff that explains today’s tradeoffs)
- Column stores became mainstream when analytics datasets outgrew row-store caching assumptions; compression and vectorized execution changed the game.
- ClickHouse started at Yandex to power large-scale web analytics; it was built to answer “what happened?” quickly, not “did this purchase succeed?”
- InnoDB’s buffer pool design is excellent for hot working sets; dashboard scans can evict the very pages checkout needs.
- MySQL replication was historically used to offload reads; it helps, but analytics read patterns can still saturate replicas and lag grows teeth.
- CDC (change data capture) became common as teams wanted to decouple OLTP from OLAP without building bespoke “export jobs” every week.
- Materialized views aren’t new; what changed is how cheaply we can maintain them with modern storage and compute, especially in columnar engines.
- High-cardinality dimensions were once treated as a tax; ClickHouse made them feasible, but not free—dictionary encoding and careful ordering still matter.
- “Realtime” used to mean batch every hour for many businesses; now expectations are minutes or seconds because alerting and marketing loops got tighter.
Architectures that work (and why)
Pattern A: MySQL for OLTP, ClickHouse for dashboards (recommended for serious realtime)
This is the grown-up architecture. You keep transactional truth in MySQL, stream changes out, and query ClickHouse for dashboards. The key benefit
isn’t speed. It’s blast-radius control. Dashboards can misbehave without stealing oxygen from checkout.
Core components:
- MySQL primary for writes
- MySQL replica(s) for operational read traffic (optional)
- CDC pipeline (binlog-based) into ClickHouse
- ClickHouse tables optimized for queries (MergeTree family)
- Pre-aggregations where they pay off (materialized views or rollup tables)
The hard part: deciding what to ship and how to model it. If you naively replicate normalized OLTP tables into ClickHouse and expect
the same join patterns, you’ll end up with slow queries and a confused team. Denormalize strategically.
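For illustration, here is a minimal sketch of what that ClickHouse landing table might look like, assuming a denormalized order-event stream. The table name matches the examples later in this article; the columns are illustrative.

CREATE TABLE analytics.events
(
    event_time      DateTime,
    event_date      Date DEFAULT toDate(event_time),
    event_type      LowCardinality(String),   -- order_created, payment_captured, refund_issued, ...
    merchant_id     UInt64,
    order_id        UInt64,
    amount          Decimal(18, 2),
    country         LowCardinality(String),
    payment_method  LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, merchant_id, event_type, event_time);

Note what is not there: no joins to merchant, country, or payment tables. Those dimensions ride along on every event, which is exactly the strategic denormalization the OLTP side would never tolerate.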
Pattern B: “Just use a MySQL read replica for dashboards” (works until it doesn’t)
A replica is tempting because it feels like a free lunch. It’s not. Replication lag under heavy reads is real, and if dashboards hit the replica with
big scans, you’re basically building an I/O furnace. That might be fine if:
- The dataset is small-ish
- Dashboards are limited and stable
- Freshness requirements are loose
- You can tolerate occasional lag spikes
This pattern fails when the dashboard usage grows (it will) or when your most important business events correlate with traffic spikes (they do).
Pattern C: Dual-write (application writes to MySQL and ClickHouse)
Dual-write can achieve low latency. It also introduces correctness risk: partial writes, ordering differences, retries, and the “works in staging”
phenomenon. If you do it, you need idempotency, backfill, and a reconciliation story. Otherwise the dashboard becomes a creative writing project.
Pattern D: Batch ETL into ClickHouse (boring, cheap, sometimes perfect)
If your business can live with 5–15 minute freshness, batch wins on simplicity. You can run periodic extracts from MySQL (ideally from a replica),
load into ClickHouse, and keep the pipeline understandable. People underestimate how valuable “understandable” is at 3 a.m.
Data modeling: the part everyone skips, then pays for
MySQL schema choices that sabotage dashboards
In OLTP, normalization helps with correctness and write performance. In analytics, normalized schemas push cost into joins and repeated lookups.
A dashboard query that joins orders, order_items, users, payments, shipments, and promotions is a great way to burn CPU and I/O.
The trick is not “denormalize everything.” The trick is to create an event or fact model that matches the questions you ask:
- Facts: order_created, payment_captured, shipment_delivered, refund_issued
- Dimensions: merchant, country, device type, campaign, payment method
- Time: always first-class; dashboards are time-series with opinions
ClickHouse table design: ORDER BY is your index, and it’s a commitment
In ClickHouse, the ORDER BY clause in MergeTree tables is the main lever for query speed. It defines how data is physically organized.
Choose it based on your most common filters. For realtime dashboards, that’s usually:
- Time window first (e.g., event_date, event_time)
- A small set of common dimensions (e.g., merchant_id, region, event_type)
If you pick ORDER BY user_id because it feels “primary-key-ish,” you’ll regret it when every dashboard filters by time and scans
everything anyway.
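Before blaming a query, check what the table's keys actually are. A quick sanity check, with database and table names as in the examples elsewhere in this article:

SELECT name, engine, partition_key, sorting_key
FROM system.tables
WHERE database = 'analytics' AND name = 'events';

If sorting_key starts with user_id while every dashboard filters by time, you have found the scan.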
Pre-aggregation: when it helps, when it lies
Pre-aggregations (materialized views, rollups) are powerful. They also create a subtle risk: people trust the dashboard numbers as ground truth while
the pre-aggregation logic quietly drifts from the source semantics. This is especially common with refunds, partial captures, and “final state”
concepts.
My rule: pre-aggregate only when you can define the metric precisely, including late-arriving events and corrections. Otherwise keep raw facts and
accept higher query cost (in ClickHouse, you often can).
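When a rollup does pay off, the usual shape in ClickHouse is a materialized view feeding a SummingMergeTree table. A minimal sketch, assuming the illustrative events table from earlier; the order_created filter and all names are assumptions. Two caveats are baked into the comments: the view only sees new inserts (so refunds and late corrections need their own handling), and you must re-aggregate at read time.

CREATE TABLE analytics.orders_by_minute
(
    minute       DateTime,
    merchant_id  UInt64,
    orders       UInt64,
    revenue      Decimal(38, 2)
)
ENGINE = SummingMergeTree
ORDER BY (minute, merchant_id);

-- Fires on every insert into analytics.events; it never re-reads history
CREATE MATERIALIZED VIEW analytics.orders_by_minute_mv
TO analytics.orders_by_minute
AS SELECT
    toStartOfMinute(event_time) AS minute,
    merchant_id,
    count() AS orders,
    sum(amount) AS revenue
FROM analytics.events
WHERE event_type = 'order_created'
GROUP BY minute, merchant_id;

-- Always re-aggregate when reading; merges collapse duplicate keys lazily
SELECT minute, sum(orders) AS orders, sum(revenue) AS revenue
FROM analytics.orders_by_minute
WHERE minute >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;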
Ingestion pipelines: CDC, batch, streaming, and the lie of “realtime”
CDC from MySQL: what you’re actually signing up for
CDC is the usual choice for realtime-ish dashboards because it gives low latency without hammering MySQL with frequent queries. But CDC isn’t magic;
it’s a distributed system. You need to answer:
- Ordering: do events arrive in commit order? per table? per partition?
- Schema changes: how are DDL and type changes handled?
- Deletes/updates: do you model them as new events, or mutations?
- Backfills: how do you re-ingest historical data safely?
- Idempotency: can you replay without double-counting?
For ClickHouse, many teams prefer an append-only event stream even when the source is row-updated. Instead of “update row,” you
write “state changed” events. It’s not always possible, but it is often cleaner.
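With append-only events, “current state” becomes a query rather than a mutation. A hedged sketch against the illustrative events table:

-- Latest known state per order, reconstructed from append-only change events
SELECT
    order_id,
    argMax(event_type, event_time) AS last_event,
    max(event_time)                AS last_seen
FROM analytics.events
WHERE event_date >= today() - 30
GROUP BY order_id;

If you truly need mirrored row state, ReplacingMergeTree with a version column is the usual compromise, but appends stay simpler to reason about.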
Freshness budgets: define them or suffer
“Realtime” should be a budget with SLOs: p95 ingest lag < 30 seconds, p99 < 2 minutes, for example. Without that, teams chase the wrong
bottleneck. Also: you’ll eventually need a “data is delayed” banner in the dashboard UI. It’s not optional; it’s honesty.
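The banner can be driven by the same measurement as the SLO. A sketch, with a hypothetical 120-second threshold:

SELECT
    dateDiff('second', max(event_time), now()) AS lag_s,
    lag_s > 120 AS show_delay_banner   -- 1 means the UI should admit the data is late
FROM analytics.events;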
Quote (paraphrased): Werner Vogels has long argued that consistency models should be chosen intentionally; reliability comes from explicit tradeoffs.
Fast diagnosis playbook (find the bottleneck in minutes)
When dashboards are slow or checkout is suffering, don’t start with opinions. Start with where time is being spent and
which system is saturating. This order works well under pressure.
1) Is the pain in MySQL, ClickHouse, or the app layer?
- Check MySQL p95 query latency and CPU/iowait
- Check ClickHouse query latency and merges/background load
- Check dashboard service saturation (threads, pool limits, GC if relevant)
2) If checkout is slow: confirm MySQL interference first
- Look for long-running SELECTs from dashboard users
- Look for buffer pool churn and disk reads
- Look for connection spikes and thread contention
3) If dashboards are slow: determine if it’s data freshness, query shape, or storage layout
- Is ingest lag growing? Then it’s pipeline or ClickHouse insert/merge capacity.
- If ingest is fine but query is slow: it’s ORDER BY / partitioning / too-wide scans.
- If queries are fast but UI is slow: it’s the application and caching strategy.
4) Decide the immediate mitigation
- Throttle dashboard queries (rate limit, cached snapshots)
- Move dashboards off MySQL now (even to a replica) if checkout is burning
- Reduce query time window and dimensions temporarily
- Scale ClickHouse or tune merges if it’s the bottleneck
Practical tasks: commands, outputs, and what decision to make
Below are hands-on tasks I actually run during incidents and tuning cycles. Each includes a command, sample output, what it means, and the decision
it drives. You can run most of these from a bastion or directly on the DB hosts with appropriate access.
Task 1: Spot dashboard queries hurting MySQL right now
cr0x@server:~$ mysql -e "SHOW FULL PROCESSLIST\G" | sed -n '1,120p'
*************************** 1. row ***************************
Id: 83421
User: dashboard_ro
Host: 10.22.8.14:51122
db: checkout
Command: Query
Time: 37
State: Sending data
Info: SELECT date(created_at) d, count(*) FROM orders WHERE created_at > now() - interval 7 day GROUP BY d
*************************** 2. row ***************************
Id: 83455
User: checkout_app
Host: 10.22.3.7:60318
db: checkout
Command: Query
Time: 1
State: updating
Info: UPDATE inventory SET reserved = reserved + 1 WHERE sku = 'A17-44'
Meaning: you have a long-running scan/group-by on the primary, competing with transactional updates.
Decision: kill or throttle the dashboard query; move dashboards to ClickHouse or at least a replica; add query guards immediately.
Task 2: Kill the specific offender (surgical, not heroic)
cr0x@server:~$ mysql -e "KILL 83421;"
Query OK, 0 rows affected (0.00 sec)
Meaning: the offending query is terminated.
Decision: follow up with an access policy: the dashboard user should not query the primary, period.
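What “should not query the primary, period” can look like in MySQL 8.0, beyond good intentions. A sketch: the user name comes from the processlist above, and the limits are illustrative.

-- Per-user throttles (blunt, but they survive redeploys)
ALTER USER 'dashboard_ro'@'%'
    WITH MAX_USER_CONNECTIONS 10
         MAX_QUERIES_PER_HOUR 2000;

-- Global SELECT timeout in milliseconds; note it applies to every account's SELECTs
SET PERSIST max_execution_time = 15000;

-- The real fix once dashboards have somewhere else to go:
-- REVOKE ALL PRIVILEGES ON checkout.* FROM 'dashboard_ro'@'%';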
Task 3: Check MySQL current load and slowest statements via Performance Schema
cr0x@server:~$ mysql -e "SELECT DIGEST_TEXT, COUNT_STAR, ROUND(SUM_TIMER_WAIT/1e12,2) AS total_s, ROUND(AVG_TIMER_WAIT/1e9,2) AS avg_ms FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 3\G"
*************************** 1. row ***************************
DIGEST_TEXT: SELECT date(created_at) d, count(*) FROM orders WHERE created_at > ? GROUP BY d
COUNT_STAR: 1821
total_s: 614.35
avg_ms: 337.41
*************************** 2. row ***************************
DIGEST_TEXT: SELECT * FROM cart_items WHERE cart_id = ?
COUNT_STAR: 92144
total_s: 211.02
avg_ms: 2.29
*************************** 3. row ***************************
DIGEST_TEXT: UPDATE inventory SET reserved = reserved + ? WHERE sku = ?
COUNT_STAR: 40122
total_s: 88.44
avg_ms: 2.20
Meaning: the dashboard digest dominates total DB time.
Decision: prioritize moving this metric to ClickHouse or rewriting it to use an index/summary table, and enforce query limits.
Task 4: Confirm whether buffer pool churn is happening
cr0x@server:~$ mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
+-------------------------+----------+
| Variable_name | Value |
+-------------------------+----------+
| Innodb_buffer_pool_reads| 19433211 |
+-------------------------+----------+
+----------------------------------+------------+
| Variable_name | Value |
+----------------------------------+------------+
| Innodb_buffer_pool_read_requests | 8821132441 |
+----------------------------------+------------+
Meaning: disk reads are non-trivial; if the first number is climbing fast during dashboard usage, scans are evicting hot pages.
Decision: isolate analytics off the primary; consider increasing buffer pool only after isolation (bigger furnace is still a furnace).
Task 5: Check MySQL replication lag before blaming “the dashboard server”
cr0x@server:~$ mysql -h mysql-replica-1 -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Replica_IO_Running|Replica_SQL_Running"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 187
Meaning: replica is ~3 minutes behind; dashboards reading from it will be stale and can trigger “why don’t numbers match?” tickets.
Decision: either accept explicit freshness SLAs, add more replicas, or move analytics to ClickHouse with controlled ingest.
Task 6: Verify MySQL binary log format is compatible with CDC needs
cr0x@server:~$ mysql -e "SHOW VARIABLES LIKE 'binlog_format'; SHOW VARIABLES LIKE 'gtid_mode';"
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| binlog_format | ROW |
+---------------+-------+
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode | ON |
+---------------+-------+
Meaning: ROW binlogs and GTIDs simplify CDC correctness and resuming.
Decision: if not ROW, plan a migration; CDC on STATEMENT is a trap unless you enjoy edge cases.
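If you do need to move to ROW, the knobs are dynamic in MySQL 8.0, but coordinate with replicas and the CDC tool first. This is a sketch, not a migration plan:

SET PERSIST binlog_format = 'ROW';        -- affects new sessions only
SET PERSIST binlog_row_image = 'FULL';    -- full before/after images simplify downstream mapping
-- gtid_mode cannot jump straight from OFF to ON; step through:
-- enforce_gtid_consistency = ON, then gtid_mode = OFF_PERMISSIVE, ON_PERMISSIVE, ON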
Task 7: Inspect ClickHouse running queries (find the dashboard hog)
cr0x@server:~$ clickhouse-client --query "SELECT query_id, user, elapsed, read_rows, formatReadableSize(read_bytes) rb, query FROM system.processes ORDER BY elapsed DESC LIMIT 2"
d8e2c6c1-8d1b-4c3f-bc45-91b34a6c12de dashboard 12.941 812334112 34.21GiB SELECT merchant_id, count() FROM events WHERE event_time > now() - INTERVAL 30 DAY GROUP BY merchant_id
a1a01f2a-9d72-4d85-9b43-423a1c91a8f1 internal 1.120 182991 17.02MiB INSERT INTO events FORMAT JSONEachRow
Meaning: a dashboard query is reading 34 GiB to answer a question that likely needs a narrower time window or better ordering.
Decision: fix ORDER BY/partitioning, add rollups, and cap dashboard lookback defaults. Also consider query quotas.
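Quotas and settings profiles are the native way to cap dashboard appetite in ClickHouse. A sketch assuming a user or role named dashboard (as in the processes output above); the limits are illustrative.

CREATE SETTINGS PROFILE IF NOT EXISTS dashboard_profile
    SETTINGS max_execution_time = 10, max_rows_to_read = 2000000000
    TO dashboard;

CREATE QUOTA IF NOT EXISTS dashboard_quota
    FOR INTERVAL 1 hour MAX queries = 5000, read_rows = 50000000000
    TO dashboard;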
Task 8: Check ClickHouse merges pressure (background work stealing your lunch)
cr0x@server:~$ clickhouse-client --query "SELECT database, table, sum(merge_type = 'Regular') AS regular_merges, sum(merge_type = 'TTL') AS ttl_merges, round(sum(elapsed),1) AS total_s FROM system.merges GROUP BY database, table ORDER BY total_s DESC LIMIT 5"
analytics events 4 0 912.4
analytics sessions 1 2 211.9
Meaning: merges are active and long-running on the hot events table.
Decision: tune insert batch sizes, partitioning, and merge settings; add capacity if merges constantly backlog.
Task 9: Verify ClickHouse parts explosion (classic “too many small inserts”)
cr0x@server:~$ clickhouse-client --query "SELECT table, count() parts, formatReadableSize(sum(bytes_on_disk)) disk FROM system.parts WHERE active AND database='analytics' GROUP BY table ORDER BY parts DESC LIMIT 3"
events 12844 1.92TiB
sessions 2211 204.11GiB
rollup_minute 144 9.87GiB
Meaning: 12k active parts suggests inserts are too granular, causing merge debt and query overhead.
Decision: batch inserts, use buffering, or adjust your ingestion tool; reduce parts before scaling hardware blindly.
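The first fix is upstream batching: thousands of rows per INSERT. If the ingestion tool cannot batch, recent ClickHouse versions can coalesce small inserts server-side via async_insert. A sketch, with the row content purely illustrative:

INSERT INTO analytics.events
SETTINGS async_insert = 1, wait_for_async_insert = 1
FORMAT JSONEachRow
{"event_time":"2025-12-30 12:00:01","event_type":"order_created","merchant_id":42,"order_id":1088231,"amount":59.90,"country":"DE","payment_method":"card"}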
Task 10: Measure ingest lag as a first-class metric (don’t guess)
cr0x@server:~$ clickhouse-client --query "SELECT max(event_time) AS max_event, now() AS now_ts, dateDiff('second', max_event, now_ts) AS lag_s FROM analytics.events"
2025-12-30 11:58:29 2025-12-30 12:00:03 94
Meaning: ClickHouse is about 94 seconds behind “now” for this table.
Decision: if your SLO is 30 seconds, focus on pipeline and merge capacity; if SLO is 2 minutes, you’re fine and should stop panicking.
Task 11: Validate ClickHouse pruning efficiency with EXPLAIN (are you scanning everything?)
cr0x@server:~$ clickhouse-client --query "EXPLAIN indexes=1 SELECT count() FROM analytics.events WHERE event_date >= today()-1 AND merchant_id=42"
Expression ((Projection + Before ORDER BY))
ReadFromMergeTree (analytics.events)
Indexes:
MinMax
Keys: event_date
Condition: (event_date in [today() - 1, +Inf))
PrimaryKey
Keys: (event_date, merchant_id)
Condition: (event_date in [today() - 1, +Inf)) AND (merchant_id in [42, 42])
Meaning: both MinMax and primary key conditions apply; pruning should be decent.
Decision: if you don’t see useful index conditions, revisit ORDER BY and partition key.
Task 12: Confirm the dashboard is not DDoS’ing your DB with short polling
cr0x@server:~$ sudo ss -tnp | awk '$4 ~ /:3306$/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head
422 10.22.8.14
38 10.22.8.15
21 10.22.3.7
Meaning: one dashboard host has 422 TCP connections to MySQL.
Decision: add connection pooling, enforce max connections, cache results, and stop “refresh every 1s” at the source.
Task 13: Check host-level I/O wait on MySQL box (quick reality check)
cr0x@server:~$ iostat -x 1 3
Linux 6.8.0 (mysql-primary) 12/30/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
18.12 0.00 6.33 24.51 0.00 51.04
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await aqu-sz %util
nvme0n1 812.0 98304.0 0.0 0.00 8.12 121.1 244.0 16384.0 3.91 6.41 92.3
Meaning: high iowait and ~92% disk utilization; MySQL is I/O bound under current load.
Decision: stop scans on the primary, then revisit storage/IOPS; throwing faster disks at mixed workloads is an expensive way to stay wrong.
Task 14: Check ClickHouse disk usage and whether merges are blocked by space
cr0x@server:~$ df -h /var/lib/clickhouse
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1p1 3.5T 3.3T 140G 96% /var/lib/clickhouse
Meaning: ClickHouse is at 96% usage; merges may fail or throttle, increasing parts and slowing everything.
Decision: free space now (retention/TTL), add disk, or move cold partitions. Don’t “wait until the weekend”; the weekend is when merges decide to panic.
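Freeing space usually means retention. A hedged sketch against the illustrative events table (the 180-day window and the partition name are examples); note that changing a TTL triggers background work of its own, so do this before the disk is truly full.

-- Going forward: age out rows older than 180 days
ALTER TABLE analytics.events MODIFY TTL event_date + INTERVAL 180 DAY;

-- Right now: drop a whole cold partition (cheap and near-instant)
ALTER TABLE analytics.events DROP PARTITION '202406';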
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
A mid-market retailer wanted “live” sales dashboards during a holiday push. The team assumed, reasonably on paper, that a read replica would isolate
analytics from checkout. They created a replica, pointed the dashboard at it, and moved on.
The first problem was freshness. During the first big traffic wave, replication lag jumped from seconds to minutes. Execs saw the dashboard “stall”
and demanded a fix. Someone suggested, with admirable confidence and no sense of danger, that the dashboard should read from the primary “just for
the campaign.”
Five minutes later, checkout latency climbed. Not because SELECTs take locks in a dramatic way, but because those scans churned the buffer pool and
saturated I/O. Orders were still committing, but the tail latency got ugly and timeouts started.
The wrong assumption was subtle: “reads are safe.” Reads are not safe when they’re big, frequent, and uncached. Reads can
absolutely take down a write-heavy system by starving it of the resources it needs.
The fix was not heroic query tuning. The fix was architectural: they pulled dashboards off MySQL entirely, accepted 60–120 seconds freshness during
the campaign, and shipped a CDC-to-ClickHouse pipeline afterward. The incident ended with a policy: dashboards do not query the primary, even when
someone senior is yelling.
Mini-story 2: The optimization that backfired
A SaaS company moved analytics to ClickHouse and got excited. Queries were fast. Everyone loved the flexibility. Then someone optimized ingestion to
“reduce latency” by sending each event as its own insert. It worked in testing. It also created a parts explosion in production.
Within a few days, background merges were running constantly. CPU looked “fine” at a glance, but disk was hot, and query latency became spiky.
Dashboards would be fast, then randomly slow. The team started adding nodes. It helped briefly, then merges caught up like a tax auditor.
They tried tuning merge settings and even considered switching table engines. The real issue was upstream: too many tiny inserts. ClickHouse can
ingest a lot, but it wants batches. Small parts cost metadata overhead, merge overhead, and query overhead. It’s death by a thousand commits.
The fix was boring: buffer events for a few seconds and insert in chunks. Latency increased slightly, but the system became stable and cheaper. The
optimization that backfired was chasing a few seconds of freshness at the cost of sustained performance.
Mini-story 3: The boring but correct practice that saved the day
Another team ran both MySQL and ClickHouse for dashboards. They also had something rare: a written freshness SLO, an ingest-lag metric on the
dashboard itself, and an on-call runbook that began with “check lag before you debug queries.”
One afternoon, dashboards started showing “flat” graphs. Sales were clearly happening, but the chart looked dead. Slack filled with panic. The
on-call followed the runbook and checked ClickHouse max event time versus now. Lag was 18 minutes. That’s not a query problem; that’s an ingestion
problem.
They checked the CDC consumer and found it stuck on a single bad row due to an unexpected schema change: a column went from nullable to non-nullable,
and the ingest mapping was rejecting records. Because they had explicit metrics, they didn’t waste an hour tuning queries that were already fast.
They applied the schema mapping fix, replayed the backlog, and the dashboards recovered. No one praised the runbook. No one writes poems about “lag
alerts.” But the boring practice saved an afternoon of confusion and prevented someone from “fixing” it by pointing dashboards at MySQL.
Joke #2: A dashboard that queries production MySQL is basically a performance test, except the results are graded by angry customers.
Common mistakes: symptom → root cause → fix
1) Checkout p95 latency spikes when a dashboard launches
- Symptom: MySQL CPU and iowait rise; slow queries show analytics-like SELECTs.
- Root cause: dashboards querying primary or a shared replica that also serves application reads; buffer pool churn and I/O saturation.
- Fix: isolate dashboards to ClickHouse; enforce user-level permissions; add query limits and timeouts; cache dashboard responses.
2) Dashboards are “realtime” but numbers don’t match MySQL
- Symptom: ClickHouse shows fewer orders than MySQL for the last few minutes; backfills “fix” the past.
- Root cause: CDC lag, out-of-order events, or missing update/delete semantics.
- Fix: track ingest lag; design idempotent event keys; model state changes explicitly; add reconciliation jobs for critical metrics.
3) ClickHouse queries randomly slow down even though hardware is fine
- Symptom: same dashboard sometimes 200ms, sometimes 8s; merges show activity.
- Root cause: too many parts, merge backlog, or disk near-full causing throttling.
- Fix: batch inserts; monitor parts count; increase disk headroom; tune partitioning; add nodes only after reducing ingestion fragmentation.
4) Replica-based dashboards show stale data during peaks
- Symptom: “Seconds_Behind_Source” jumps; dashboard freshness is inconsistent.
- Root cause: replica cannot apply relay log fast enough under read pressure; replication is not a free analytics lane.
- Fix: move dashboards to ClickHouse; add dedicated analytics replica; reduce query load; scale replica I/O and CPU.
5) ClickHouse storage grows faster than expected
- Symptom: disk usage climbs; retention policies not working; merges churn.
- Root cause: missing TTL/partition drops; storing too many verbose columns; duplicating events due to non-idempotent ingestion.
- Fix: enforce TTL; compact schema; add dedup keys and ingest exactly-once semantics (or at least effectively-once via idempotency).
6) “We can just add an index” becomes a weekly ritual
- Symptom: MySQL has dozens of indexes; writes slow down; still can’t satisfy dashboard queries.
- Root cause: trying to make OLTP indexing solve OLAP scans; indexes help point lookups, not unlimited dimensional analysis.
- Fix: stop indexing your way into sadness; move analytics to ClickHouse and model for it.
Checklists / step-by-step plan
Step-by-step plan: build realtime dashboards without destroying checkout
1) Define freshness SLOs.
- Write down p95 and p99 ingest lag targets.
- Decide what happens when you miss them (banner, fallback cache, degraded mode).
2) Classify dashboard queries by shape.
- Scan + group-by + time window → ClickHouse.
- Point lookups for drill-down details → can remain in MySQL (or a separate service).
3) Put guardrails on MySQL now.
- No dashboard access to the primary.
- Set timeouts and max execution time for analytics users.
- Add rate limiting at the API layer.
4) Choose your ingestion approach.
- CDC for low latency and decent simplicity.
- Batch for simplicity when freshness can be minutes.
- Avoid dual-write unless you have strong operational maturity.
5) Model a fact table in ClickHouse.
- Prefer append-only events with stable IDs.
- Pick ORDER BY based on time + common dimensions.
6) Implement pre-aggregations only for proven hot queries.
- Start with raw facts; measure query cost; then roll up where it pays.
- Define late-event and correction behavior.
7) Operationalize it.
- Alert on ingest lag, parts count, disk headroom, and query latency.
- Create on-call runbooks that start with “is data late?”
8) Test failure modes.
- CDC consumer down.
- ClickHouse disk nearly full.
- A dashboard query accidentally widened to 365 days.
Safety checklist: protect checkout from dashboard ambition
- Dashboards cannot connect to MySQL primary (network ACL + credentials enforcement).
- MySQL users for analytics have strict privileges and low resource limits.
- Dashboard API has caching and rate limiting.
- Default dashboard time window is small; expanding requires explicit user action.
- ClickHouse has disk headroom policy (target < 80–85% used).
- Ingest lag is displayed and alerting is wired.
FAQ
1) Can I keep dashboards in MySQL if I optimize queries?
For a small dataset and a small set of stable dashboards, yes. The moment dashboards become exploratory (many filters, longer windows),
you’re fighting the engine’s design goals. If checkout matters, isolate analytics sooner.
2) Is a MySQL read replica “safe enough” for dashboards?
It’s safer than hitting the primary, but not “safe” by default. Big scans can saturate the replica, increase lag, and still cause operational pain.
A dedicated analytics replica with strict query controls can work as a transitional step.
3) How realtime can ClickHouse be?
Seconds to minutes, depending on ingestion batching, merge pressure, and query patterns. If you try to force sub-second ingest with tiny inserts,
you’ll pay for it elsewhere (parts, merges, disk).
4) What’s the biggest ClickHouse modeling mistake for dashboards?
Choosing ORDER BY without thinking about common filters, especially time. If your dashboards filter by time and merchant, your ORDER BY should reflect that.
5) Do I need materialized views in ClickHouse?
Not on day one. Start with raw facts and measure. Use materialized views when a small number of queries dominate cost and the metric semantics are stable.
6) How do I handle updates and deletes from MySQL in ClickHouse?
Prefer event modeling (append “state change” events) over frequent mutations. If you must mirror row state, use versioning patterns and deduplication keys,
and accept that heavy mutations can be expensive.
7) Why do numbers differ between MySQL and ClickHouse during incidents?
Usually ingest lag, out-of-order events, or duplicate ingestion. Treat “data freshness” as a metric, not a feeling, and make the dashboard show it.
8) What metrics should I alert on for this stack?
MySQL: query latency (p95/p99), active threads, buffer pool hit rate, disk util/iowait, replication lag.
ClickHouse: query latency, parts count, merge backlog, disk usage, insert rate, ingest lag.
9) Should I cache dashboard results?
Yes, unless the dashboard genuinely needs second-by-second accuracy. Cache at the API layer with short TTLs and clear cache-busting rules.
It’s cheaper than “more database.”
10) What’s a sane definition of “realtime” for business dashboards?
Whatever your operators can support reliably. Many orgs thrive with 30–120 seconds freshness. Make it explicit, measure it, and display it.
Conclusion: what to do next, Monday morning
If your dashboard queries touch MySQL primary, fix that first. Not tomorrow. Today. Add a dedicated path: either a replica with strict limits as a
stopgap, or (preferably) ClickHouse as the analytics plane.
Then do the work that actually prevents repeat incidents:
- Define freshness SLOs and show ingest lag in the dashboard UI.
- Model facts for ClickHouse with an ORDER BY that matches real filters.
- Batch ingestion to avoid parts explosions and merge debt.
- Operationalize: alert on lag, merges, disk headroom, and worst queries.
- Write the runbook and make it the first thing people reach for under stress.
The goal isn’t “dashboards that are fast.” The goal is “dashboards that are fast without turning checkout into a science experiment.”
Separate the workloads, measure the lag, and let each database do the job it was built to do.