Data chaos rarely announces itself with fireworks. It arrives quietly: dashboards stop matching invoices, a “simple” backfill turns into a weekend, and your analytics cluster starts burning CPU like it’s trying to heat the office.
PostgreSQL and ClickHouse can both power excellent pipelines. They can also both power an exquisite mess if you treat them like interchangeable buckets. The trick is to pick ETL patterns that match how each system actually behaves under load, under failure, and under the eternal reality of late and duplicate data.
Interesting facts and context
- PostgreSQL started as POSTGRES at UC Berkeley in the 1980s, explicitly exploring extensibility (types, indexes, rules) long before “pluggable” became a product pitch.
- MVCC in PostgreSQL (multi-version concurrency control) is why reads don’t block writes, but it’s also why vacuum is not optional if you update a lot.
- ClickHouse came out of Yandex in the 2010s for high-scale web analytics; its biases are unapologetically “scan fast, aggregate faster.”
- ClickHouse is columnar: it stores each column separately, which is why it can read only the columns needed for a query, but it also means lots of tiny columns can increase overhead.
- MergeTree engines are not “append-only” in the naïve sense; they are “append then merge in the background,” which shapes how you handle updates and deduplication.
- PostgreSQL replication evolved in stages: physical streaming replication, then logical replication for row-level change streams, enabling ETL patterns that don’t require constant full extracts.
- ClickHouse materialized views are insert-triggered: they transform data at insert time, great for ingestion pipelines, but they don’t retroactively recompute unless you reinsert/backfill.
- Both systems have “it depends” compression stories: Postgres compresses TOASTed values; ClickHouse compresses columnar data extremely well, especially low-cardinality columns.
- JSON became a first-class citizen differently: Postgres built robust JSONB indexing and operators; ClickHouse supports JSON but typically expects you to normalize the hot query paths into typed columns.
Two mental models: OLTP truth vs analytical truth
PostgreSQL is where you keep the truth. Not “truthy.” Not “close enough.” Truth, with constraints, transactions, and the uncomfortable job of saying “no” to bad data. ClickHouse is where you keep analytical truth: highly queryable, usually denormalized, optimized for group-bys over billions of rows.
If you treat PostgreSQL like a warehouse, you’ll end up fighting bloat, lock contention, and a schema that tries to be both normalized and denormalized (so it ends up being neither). If you treat ClickHouse like a transactional database, you’ll eventually discover that “update” is a philosophical question, and the answer is “not like that.”
The operational implication
The ETL patterns that don’t create chaos are the ones that respect each engine’s constraints:
- In PostgreSQL: enforce constraints early, keep writes efficient, use incremental extraction, and avoid “analytics workloads disguised as reports” on production primaries.
- In ClickHouse: embrace append-and-merge semantics, design partitioning for pruning, and explicitly model deduplication and late-arriving data.
Paraphrased idea from Werner Vogels (Amazon CTO): you should plan for failure as the normal state, not the exceptional one. ETL that only works when nothing goes wrong is a demo, not a pipeline.
What “no data chaos” actually means
“No data chaos” isn’t a vibe; it’s a set of invariants you can test. Here are the ones that matter in production:
- Idempotence: rerunning a job does not create duplicates or drift. You can restart without prayer.
- Determinism: the same input window yields the same output rows (or the same final state), even with retries.
- Observable freshness: you can answer “how late is the data?” with a query, not a guess.
- Reconciliation: you can compare source and sink counts/checksums at meaningful boundaries and detect silent loss.
- Controlled backfills: reprocessing historical windows doesn’t melt the cluster or change metrics without a trace.
- Schema contract: when upstream adds a column or changes semantics, the pipeline fails loudly or adapts safely—no silent nonsense.
Joke #1: ETL stands for “Eventually Things Lag.” If yours doesn’t, it’s probably skipping work.
The hidden villain: time
Most chaos is time-shaped: late events, clock skew, reorders, multi-region replication delay, and backfills that replay old data into “current” partitions. A pipeline that treats event time, ingestion time, and update time as the same thing will create the kind of dashboard arguments that end friendships.
PostgreSQL ETL patterns that stay sane
Pattern 1: Treat Postgres as the contract boundary (constraints, not hope)
If you can validate a thing once, validate it where it enters your system. Use NOT NULL, foreign keys where they are operationally reasonable, check constraints for domain rules, and unique indexes for idempotency keys. This is the cheapest place to catch garbage.
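A minimal sketch of "constraints, not hope" in DDL; the table and column names here are hypothetical:
-- Hypothetical landing table: say "no" to garbage at the boundary.
CREATE TABLE orders_inbound (
    order_id        bigint      NOT NULL,
    customer_id     bigint      NOT NULL,
    amount_cents    bigint      NOT NULL CHECK (amount_cents >= 0),
    currency        text        NOT NULL CHECK (currency ~ '^[A-Z]{3}$'),
    idempotency_key text        NOT NULL,
    updated_at      timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (order_id)
);
-- A unique index on the idempotency key makes producer retries safe to replay.
CREATE UNIQUE INDEX orders_inbound_idem_uq ON orders_inbound (idempotency_key);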
For ETL extraction, you want stable identifiers and a monotonic cursor. That can be:
- a sequence-based primary key (good for inserts, not for updates),
- a last_updated timestamp (good if correctly maintained),
- or a logical replication slot / CDC stream (best when you need all changes).
Pattern 2: Incremental loads with watermarks (and a lookback window)
“Load everything every time” is a valid plan when everything is small. It stops being valid right around the time someone says “we should keep more history.” Incremental loads are the default. But incremental loads based on timestamps need a lookback window to handle late commits and clock drift.
What you do:
- Maintain a watermark per table/pipeline (e.g., last successful updated_at seen).
- On each run, extract updated_at > watermark - lookback.
- Upsert into staging/warehouse with a deterministic key.
What you avoid: using now() as a boundary without recording it, or assuming timestamps are strictly increasing. They are not, especially with retries and multi-writer systems.
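A minimal sketch of that loop, assuming a small bookkeeping table and a 15-minute lookback; the names and the interval are illustrative:
-- One row per pipeline: the recorded boundary, not a guess.
CREATE TABLE etl_watermarks (
    pipeline     text        PRIMARY KEY,
    watermark_ts timestamptz NOT NULL
);
-- Extract against the stored watermark minus the lookback window.
SELECT o.*
FROM orders o
WHERE o.updated_at > (
    SELECT watermark_ts - interval '15 minutes'
    FROM etl_watermarks
    WHERE pipeline = 'orders_to_clickhouse'
);
-- After a successful load, advance the watermark to the max updated_at
-- the job actually extracted (a value computed by the job, never now()).
UPDATE etl_watermarks
SET watermark_ts = '2025-12-30 08:41:12+00'
WHERE pipeline = 'orders_to_clickhouse';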
Pattern 3: Use CDC when “updates matter”
If your source rows change and you need those changes reflected downstream, CDC is the clean approach. Logical replication (or a decoding plugin) gives you ordered change events with LSN positions. The operational win is you stop guessing what changed.
The operational cost: you now run stateful consumers and must monitor replication slot lag. But you already run stateful systems; you just call them “pipelines” and pretend they’re stateless.
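On the Postgres side the setup is small. A sketch assuming the built-in pgoutput plugin and wal_level = logical, with the slot name matching the one checked in Task 4 below; the consumer itself is a separate, stateful component:
-- Publish the tables whose changes you want downstream.
CREATE PUBLICATION ch_cdc_pub FOR TABLE orders, subscriptions;
-- Create the slot the consumer reads from; it advances confirmed_flush_lsn
-- as it commits changes downstream.
SELECT pg_create_logical_replication_slot('ch_cdc_slot', 'pgoutput');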
Pattern 4: Separate OLTP from extraction load
Run heavy extracts against a replica, not the primary. If you must extract from the primary, use carefully indexed predicates, small batches, and explicit statement timeouts. Analytics queries on primaries are how “data engineering” becomes an incident category.
Pattern 5: Stage in Postgres only if you can keep it tidy
Staging tables in Postgres are fine when they’re short-lived and vacuumed, and when you don’t turn the database into a landing zone for unbounded raw events. Use UNLOGGED tables for transient staging if you can tolerate data loss on crash and you want speed. For durable raw ingestion at scale, Postgres is usually not the best landing pad.
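A sketch of the short-lived, crash-tolerant flavor of staging; the table name is illustrative:
-- UNLOGGED skips WAL: fast writes, but the table is emptied after a crash.
-- Acceptable for staging you can always re-extract.
CREATE UNLOGGED TABLE staging_orders (LIKE orders INCLUDING DEFAULTS);
-- Clear it between runs with TRUNCATE; unlike DELETE, it leaves no dead tuples behind.
TRUNCATE staging_orders;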
Pattern 6: Upsert with intent (not with wishful thinking)
INSERT ... ON CONFLICT is a gift. It’s also a loaded weapon. If you upsert large batches into hot tables without thinking about indexes and bloat, you’ll discover what “autovacuum scale factor” means at 2 a.m.
For ETL staging, use deterministic keys and include a source-version column (like source LSN, or updated_at, or a hash) so you can detect no-op updates and avoid rewriting rows unnecessarily.
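A sketch of an upsert that refuses to rewrite unchanged rows, assuming a hypothetical orders_current table with order_id as its primary key and a numeric source_version column:
-- The WHERE clause turns replays and no-op updates into actual no-ops,
-- instead of dead tuples and autovacuum homework.
INSERT INTO orders_current AS c (order_id, status, amount_cents, source_version)
SELECT order_id, status, amount_cents, source_version
FROM staging_orders
ON CONFLICT (order_id) DO UPDATE
SET status         = EXCLUDED.status,
    amount_cents   = EXCLUDED.amount_cents,
    source_version = EXCLUDED.source_version
WHERE c.source_version < EXCLUDED.source_version;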
ClickHouse ETL patterns that stay sane
Pattern 1: Insert in blocks, not row-by-row
ClickHouse loves large inserts. Your goal is fewer, bigger blocks that compress well and produce fewer parts. Many small inserts create too many parts, and background merges will spend their life cleaning up after you.
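Two hedged sketches: client-side batching (always safe) and server-side async inserts, which exist in recent ClickHouse versions but whose defaults vary, so verify behavior on your version. Table and column names are illustrative:
-- Preferred: one INSERT carrying many rows per block.
INSERT INTO analytics.events_raw (event_id, event_time, user_id, payload) VALUES
    ('e_000001', now(), 42, '{}'),
    ('e_000002', now(), 43, '{}');
-- If producers are too chatty to batch, async inserts let the server buffer
-- and flush blocks for you (setting names as of recent versions).
INSERT INTO analytics.events_raw (event_id, event_time, user_id, payload)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('e_000003', now(), 44, '{}');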
Pattern 2: Design MergeTree tables around queries, not around “source schema”
ClickHouse table design starts with:
- partition key (what you prune),
- order by (how you cluster for range scans and dedup),
- primary key (sparse index over the ordered data).
If you choose a partition key that causes tiny partitions (like by user_id), you create a zoo of partitions and merges. If you partition too coarsely, you lose pruning and backfills become expensive. Time-based partitions (day/week/month) are common because they map to late-data handling and TTL.
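A sketch of what that looks like in DDL; the schema is hypothetical, and the column names simply mirror examples used elsewhere in this article:
CREATE TABLE analytics.events_raw
(
    event_id     String,
    event_date   Date,
    event_time   DateTime,
    customer_id  UInt64,
    user_id      UInt64,
    payload      String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)              -- monthly here; daily also works for very high volume
ORDER BY (customer_id, event_date, event_id);  -- match the hot filters and group-bys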
Pattern 3: Handle duplicates explicitly with ReplacingMergeTree (and understand the trade)
ReplacingMergeTree is the “I get duplicates and I want a last-write-wins view” engine. It works by collapsing duplicates during merges based on the sorting key (and optional version column). The non-obvious bit: deduplication is eventual. Queries may see duplicates until merges run, unless you force a FINAL query (expensive) or design around it.
Use it when:
- you can tolerate temporary duplicates,
- you have a stable dedup key,
- and you can provide a version column (e.g., source updated_at or ingest sequence).
Avoid it when you need strict transactional uniqueness at read time. That’s not ClickHouse’s job description.
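A sketch of the "latest version eventually wins" shape, using source_updated_at as the version column to match the orders_dim table queried in Task 12; the rest of the schema is illustrative:
CREATE TABLE analytics.orders_dim
(
    order_id           UInt64,
    status             LowCardinality(String),
    amount_cents       Int64,
    source_updated_at  DateTime64(3)
)
ENGINE = ReplacingMergeTree(source_updated_at)  -- highest version survives merges
ORDER BY order_id;                              -- the dedup key is the sorting key
-- Until merges catch up, reads can see several versions. One way to get the
-- latest row without FINAL:
SELECT order_id, argMax(status, source_updated_at) AS status
FROM analytics.orders_dim
GROUP BY order_id;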
Pattern 4: For “append-only facts,” keep it boring: MergeTree + immutable rows
If your data is naturally immutable (events, logs, page views), don’t invent updates. Use plain MergeTree or ReplicatedMergeTree, partition by time, order by the query keys, and let it rip.
Pattern 5: Use materialized views for ingestion-time transforms, not as a magic wand
Materialized views are great for:
- pre-aggregations (e.g., hourly rollups),
- denormalizing reference attributes at ingestion time,
- splitting a raw stream into multiple shaped tables.
But they only process what gets inserted after they exist. If you create a view and expect historic data to appear, you’ll learn a valuable lesson about causality. Insert triggers don’t time travel.
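A sketch of an insert-time hourly rollup; the names are illustrative, and backfilling the target means reinserting history or running a one-off INSERT ... SELECT into it:
-- Target table for the rollup.
CREATE TABLE analytics.events_hourly
(
    event_hour   DateTime,
    customer_id  UInt64,
    events       UInt64
)
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(event_hour)
ORDER BY (customer_id, event_hour);
-- The view fires on every insert into events_raw and writes into events_hourly.
CREATE MATERIALIZED VIEW analytics.events_hourly_mv
TO analytics.events_hourly AS
SELECT
    toStartOfHour(event_time) AS event_hour,
    customer_id,
    count() AS events
FROM analytics.events_raw
GROUP BY event_hour, customer_id;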
Pattern 6: Late-arriving data: partition strategy + “grace window” merges
Late data is normal. Your job is to make it cheap. Use time partitions that align with the maximum lateness you expect, and keep recent partitions “merge-friendly.” A common approach:
- Keep daily partitions for the last N days.
- Merge aggressively on those partitions.
- After N days, consider compacting or moving older partitions to cheaper storage via TTL.
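A sketch of the tiering part with TTL; the 'cold' volume is hypothetical and requires a storage policy configured on the server:
-- Keep 30 days on fast storage, move older data to a cheaper volume, drop after a year.
ALTER TABLE analytics.events_raw
    MODIFY TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold',
               event_date + INTERVAL 365 DAY DELETE;
Once a partition is safely past its late-data window, a targeted OPTIMIZE TABLE ... PARTITION during off-peak hours compacts it so background merges stop revisiting it.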
Pattern 7: Aggregations: AggregatingMergeTree is powerful, but it is not a free lunch
If you use AggregatingMergeTree, you’re storing aggregate states and merging them. This can be extremely efficient for rollups, but it complicates ad-hoc queries and debugging because the stored data isn’t raw facts. Use it for stable, well-defined metrics. Keep a raw facts table for reconciliation and reprocessing.
Joke #2: ClickHouse merges are like laundry—ignore them long enough and everything piles up, smells odd, and costs your weekend.
The handoff: staging, contracts, and reconciliation
Stop thinking “ETL,” start thinking “contracts”
The biggest leap in pipeline reliability is to define the contract at the boundary: what constitutes a valid row, what keys make it unique, what “update” means, and which timestamp is authoritative.
A useful contract for event-like data typically includes:
- event_id or deterministic idempotency key,
- event_time (when it happened),
- ingest_time (when you saw it),
- source_version (LSN, updated_at, or similar),
- schema_version (so you can evolve safely).
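Spelled out as a ClickHouse landing table, the contract might look like this; every name here is illustrative:
CREATE TABLE analytics.events_landing
(
    event_id        String,                  -- deterministic idempotency key
    event_time      DateTime64(3),           -- when it happened (authoritative)
    ingest_time     DateTime DEFAULT now(),  -- when we saw it
    source_version  UInt64,                  -- LSN, updated_at as epoch, or similar
    schema_version  UInt16,                  -- bump on breaking changes
    payload         String                   -- raw body for the cold fields
)
ENGINE = MergeTree
PARTITION BY toDate(event_time)
ORDER BY (event_time, event_id);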
Staging strategy: land raw, then shape
There are two broadly sane staging approaches:
- Raw in ClickHouse, shaped in ClickHouse: land a raw table (typed columns for hot fields, plus a blob/JSON for the rest), then materialized views build shaped tables. This is common when ClickHouse is the primary analytics store.
- Raw outside both, load both: land in object storage or a queue, then load Postgres (for operational uses) and ClickHouse (for analytics). This reduces coupling and makes replays safer.
When Postgres is the source of truth, a common pattern is CDC from Postgres into ClickHouse. It works well if you treat the CDC stream as your raw contract, and you capture deletes/updates explicitly.
Reconciliation: counts are not enough, but they’re a start
Row counts are the smoke alarm. They won’t tell you which room is on fire, but they’ll tell you the building is not fine. Better: reconcile by partitions/windows using:
- counts per day/hour,
- distinct counts of keys,
- checksums/hashes over stable projections,
- late-data rate (events arriving > X hours late).
If you can’t reconcile, you’re not running a pipeline; you’re running a rumor mill.
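A minimal sketch of the per-day comparison, with illustrative names; checksums are left as an exercise in discipline, because they only mean something if both sides hash an identically formatted canonical string:
-- Postgres side: rows and distinct keys per day.
SELECT date_trunc('day', updated_at) AS d,
       count(*)                      AS row_count,
       count(DISTINCT order_id)      AS key_count
FROM orders
GROUP BY 1 ORDER BY 1 DESC LIMIT 7;
-- ClickHouse side: same grain, same shape, compared day by day.
SELECT toDate(source_updated_at) AS d,
       count()                   AS row_count,
       uniqExact(order_id)       AS key_count
FROM analytics.orders_dim
GROUP BY d ORDER BY d DESC LIMIT 7;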
12+ practical tasks with commands, outputs, and decisions
These are the kinds of commands you run when you’re on-call for data. Each includes what the output means and what decision you make next.
Task 1: Check Postgres table bloat symptoms via dead tuples
cr0x@server:~$ psql -h pg-prod-replica -U app -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, round(100.0*n_dead_tup/nullif(n_live_tup+n_dead_tup,0),2) AS dead_pct FROM pg_stat_user_tables ORDER BY dead_pct DESC LIMIT 10;"
relname | n_live_tup | n_dead_tup | dead_pct
------------------+------------+------------+----------
events | 92034112 | 18010233 | 16.37
orders | 1203321 | 220331 | 15.48
...
Meaning: High n_dead_tup suggests frequent updates/deletes without enough vacuum progress.
Decision: If dead_pct is high on ETL-upserted tables, reduce update churn (avoid no-op updates), adjust autovacuum settings, or switch to append + periodic rebuild.
Task 2: See if autovacuum is keeping up (per table)
cr0x@server:~$ psql -h pg-prod-replica -U app -d appdb -c "SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count FROM pg_stat_user_tables ORDER BY last_autovacuum NULLS FIRST LIMIT 10;"
relname | last_autovacuum | last_autoanalyze | autovacuum_count | autoanalyze_count
------------+--------------------------+--------------------------+------------------+-------------------
events | 2025-12-30 02:11:41+00 | 2025-12-30 02:15:03+00 | 182 | 210
...
Meaning: Stale last_autovacuum on hot tables implies vacuum isn’t running enough or is blocked.
Decision: If ETL tables are starving autovacuum, lower scale factors for those tables or change ETL to insert-new + swap.
Task 3: Validate index usage for incremental extract predicate
cr0x@server:~$ psql -h pg-prod-replica -U app -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE updated_at > now() - interval '2 hours';"
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using orders_updated_at_idx on orders (cost=0.43..12842.11 rows=24000 width=312)
Index Cond: (updated_at > (now() - '02:00:00'::interval))
Buffers: shared hit=802 read=19
Execution Time: 48.221 ms
Meaning: You’re getting an index scan with low read buffers. Good.
Decision: If it’s a sequential scan, add/adjust the index or change the predicate (e.g., use a numeric cursor, avoid functions on the column).
Task 4: Check replication slot lag (CDC health)
cr0x@server:~$ psql -h pg-prod-primary -U repl -d postgres -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS restart_lag, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS flush_lag FROM pg_replication_slots;"
slot_name | active | restart_lag | flush_lag
----------------+--------+-------------+-----------
ch_cdc_slot | t | 12 GB | 420 MB
Meaning: WAL is retained because the consumer hasn’t advanced. Restart lag affects disk usage; flush lag is “how far behind the consumer is.”
Decision: If lag grows, throttle producers, fix the consumer, or expand WAL disk. Do not drop the slot casually; that’s how you lose change history.
Task 5: Inspect Postgres locks that may block ETL
cr0x@server:~$ psql -h pg-prod-primary -U app -d appdb -c "SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state <> 'idle' AND wait_event IS NOT NULL ORDER BY pid LIMIT 5;"
pid | wait_event_type | wait_event | query
------+-----------------+---------------+--------------------------------------------------
48211 | Lock | relation | ALTER TABLE orders ADD COLUMN promo_code text;
...
Meaning: Your ETL might be queued behind a DDL lock or vice versa.
Decision: Move DDL to maintenance windows, use concurrent index builds, and set statement timeouts for ETL sessions.
Task 6: Check ClickHouse parts explosion (too many small inserts)
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT table, sum(parts) AS parts, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active AND database='analytics' GROUP BY table ORDER BY parts DESC LIMIT 5;"
events_raw 18234 1.21 TiB
sessions 3921 211.33 GiB
Meaning: High active part count indicates many small data parts; merges will struggle and queries slow down.
Decision: Batch inserts, tune async inserts, or redesign ingestion to reduce part creation. Consider partitioning adjustments if partitions are too granular.
Task 7: See if ClickHouse merges are falling behind
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT database, table, sum(rows) AS rows_merging, count() AS merge_jobs FROM system.merges GROUP BY database, table ORDER BY rows_merging DESC LIMIT 5;"
analytics events_raw 1289031123 7
analytics sessions 210331119 3
Meaning: Many rows merging means background work is heavy; write amplification is happening.
Decision: Reduce insert part count, avoid backfills during peak, increase merge resources carefully, or temporarily stop heavy optimization jobs.
Task 8: Check ClickHouse query hotspots and read volume
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT query_duration_ms, read_rows, formatReadableSize(read_bytes) AS read, query FROM system.query_log WHERE type='QueryFinish' AND event_time > now() - INTERVAL 10 MINUTE ORDER BY read_bytes DESC LIMIT 3;"
8421 1290341122 78.12 GiB SELECT * FROM events_raw WHERE ...
2103 230113112 12.03 GiB SELECT user_id, count() FROM events_raw ...
Meaning: Someone is doing a wide SELECT * scan or missing partition pruning.
Decision: Fix queries, add projections/materialized rollups, or enforce query limits. Also: stop selecting * from raw fact tables unless you enjoy paging.
Task 9: Verify partition pruning is working in ClickHouse
cr0x@server:~$ clickhouse-client -h ch01 -q "EXPLAIN indexes=1 SELECT count() FROM events_raw WHERE event_date = toDate('2025-12-29') AND customer_id = 42;"
Expression (Projection)
Aggregating
ReadFromMergeTree (analytics.events_raw)
Indexes:
MinMax
Keys: event_date
Condition: (event_date in [2025-12-29, 2025-12-29])
Parts: 3/120
Granules: 512/20480
Meaning: Only 3 of 120 parts are read. That’s pruning. Your partition key is doing real work.
Decision: If it reads most parts, revisit partition key, query predicates, or consider adding a secondary data-skipping index for common filters.
Task 10: Detect duplicate keys in ClickHouse raw ingestion (before merges hide it)
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT event_id, count() AS c FROM events_raw WHERE event_date >= today()-1 GROUP BY event_id HAVING c > 1 ORDER BY c DESC LIMIT 5;"
e_9f3b2 4
e_1a77c 3
Meaning: Duplicates are arriving. That might be expected (retries) or a bug (non-idempotent producer).
Decision: If duplicates are expected, ensure dedup strategy (ReplacingMergeTree/version) matches semantics. If not expected, fix producer idempotency and add ingestion-level dedup.
Task 11: Check ClickHouse replication queue (if using ReplicatedMergeTree)
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT database, table, count() AS queue, sum(num_tries) AS tries FROM system.replication_queue GROUP BY database, table ORDER BY queue DESC LIMIT 5;"
analytics events_raw 23 41
Meaning: Replication tasks are pending; retries suggest network/storage issues or overload.
Decision: If queue grows, check ZooKeeper/ClickHouse Keeper health, network, disk latency, and whether merges are saturating I/O.
Task 12: Validate “freshness” in Postgres and ClickHouse with explicit watermarks
cr0x@server:~$ psql -h pg-prod-replica -U app -d appdb -c "SELECT max(updated_at) AS pg_max_updated_at FROM orders;"
pg_max_updated_at
-----------------------------
2025-12-30 08:41:12.912+00
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT max(source_updated_at) AS ch_max_updated_at FROM orders_dim;"
2025-12-30 08:39:58
Meaning: ClickHouse is ~74 seconds behind the source for that dataset.
Decision: If lag exceeds SLA, inspect CDC lag, ingestion queues, merges, and part counts. If it’s within SLA, stop touching it.
Task 13: Track ClickHouse disk pressure and bytes moved by merges
cr0x@server:~$ clickhouse-client -h ch01 -q "SELECT name, value FROM system.asynchronous_metrics WHERE name IN ('DiskUsed_default','DiskAvailable_default','MergesBytesWritten') ORDER BY name;"
DiskAvailable_default 2.31e+12
DiskUsed_default 5.84e+12
MergesBytesWritten 1.19e+12
Meaning: Disk usage and merge write volume are high; you may be in write amplification territory.
Decision: Pause big backfills, reduce parts, and ensure TTL moves/deletes aren’t causing constant churn.
Task 14: Sanity-check a CDC consumer position against Postgres LSN
cr0x@server:~$ psql -h pg-prod-primary -U repl -d postgres -c "SELECT pg_current_wal_lsn() AS current_lsn;"
current_lsn
-------------
5A3/1F2B9C0
cr0x@server:~$ psql -h pg-prod-primary -U repl -d postgres -c "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='ch_cdc_slot';"
confirmed_flush_lsn
---------------------
5A3/1A90D10
Meaning: The consumer is behind by a measurable WAL distance.
Decision: If it’s growing, scale the consumer, fix downstream backpressure (often ClickHouse merges/disk), or temporarily reduce upstream change volume for non-critical tables.
Fast diagnosis playbook
When ETL slows down or metrics drift, you don’t have time to become a philosopher. You need a fast triage path that narrows the failure domain in minutes.
First: is it freshness, correctness, or cost?
- Freshness problem: data is late; max timestamps lag.
- Correctness problem: duplicates, missing rows, wrong joins, or “numbers changed after backfill.”
- Cost problem: CPU/I/O spikes, runaway merges, queries that time out.
Second: locate the bottleneck boundary
- Source (Postgres) extraction: slow queries, lock contention, replication slot lag.
- Transport: queue backlog, consumer errors, batch sizing issues.
- Sink (ClickHouse) ingestion: parts explosion, merges behind, replication queue.
- Serving queries: bad query patterns, missing pruning, bad table design.
Third: the minimal checks that usually find it
- Watermark lag: compare max updated/event times source vs sink (Task 12).
- CDC slot lag: if using CDC, check replication slot lag (Task 4/14).
- ClickHouse parts and merges: parts count and merges backlog (Task 6/7).
- Query log top read bytes: find the query eating the cluster (Task 8).
- Partition pruning check: confirm the hot queries prune parts (Task 9).
If you do those five things, you’ll usually know whether you’re dealing with slow extraction, slow load, or expensive reads. Then you can fix the right thing instead of “tuning everything,” which is just panic with a spreadsheet.
Common mistakes: symptoms → root cause → fix
1) Metrics jump up and down after reruns
Symptoms: rerunning a job changes daily totals; duplicate sessions; “same window, different result.”
Root cause: non-idempotent loads (append without dedup), or upsert keys don’t represent real uniqueness.
Fix: define a deterministic idempotency key, load into a staging table, then merge into final tables with explicit dedup (Postgres: ON CONFLICT; ClickHouse: ReplacingMergeTree with version, or explicit dedup in query layer for raw tables).
2) ClickHouse ingestion starts fast, then slows to a crawl
Symptoms: inserts time out; CPU and disk busy; active parts balloon.
Root cause: too many small inserts causing part explosion; merges can’t keep up.
Fix: batch inserts into larger blocks, tune insert buffering, and avoid per-row inserts. Consider reducing partition granularity and revisiting ORDER BY to improve merge locality.
3) Backfill “worked” but queries became slower
Symptoms: after a backfill, scan queries read far more data; caches miss; response times degrade.
Root cause: backfill inserted out-of-order data relative to ORDER BY, producing many parts and poor locality; or it hit old partitions with TTL/merges and caused churn.
Fix: backfill in partition order, use controlled batch sizes, and avoid interleaving old and new partitions. Validate pruning with EXPLAIN before and after.
4) Postgres primary gets slow during extracts
Symptoms: application latency increases; connection pool saturates; slow queries show extract statements.
Root cause: ETL reads competing with OLTP writes; sequential scans due to missing indexes; long-running transactions holding back vacuum.
Fix: run extracts on replicas, add proper indexes for incremental predicates, limit batch size, and enforce statement timeouts. Keep ETL transactions short.
5) “We use updated_at” but rows still go missing
Symptoms: source has rows not present in sink; usually small percentages; hard to reproduce.
Root cause: updated_at not reliably updated on all changes; timezone confusion; extraction window boundary too tight; clock skew from app servers.
Fix: add a database-triggered updated_at if needed, include a lookback window, store the extraction boundary explicitly, or switch to CDC.
6) ClickHouse shows duplicates “sometimes”
Symptoms: duplicates appear in raw or even dedup tables; later they “go away.”
Root cause: relying on ReplacingMergeTree merges for dedup; dedup is eventual.
Fix: accept eventual dedup and query with care (avoid FINAL unless necessary), or dedup at ingestion. For strict needs, keep a “current state” table built by periodic compaction jobs.
7) Pipeline correctness breaks after a schema change
Symptoms: columns shifted, nulls where they shouldn’t be, dashboards quietly wrong.
Root cause: schema evolution without contracts; using SELECT *; relying on positional CSV imports; weak typing at ingestion.
Fix: version schemas, pin column lists, validate with checks (null rate, cardinality), and fail fast on incompatible changes.
Three corporate mini-stories (painfully familiar)
Mini-story 1: The incident caused by a wrong assumption
Company A ran a subscription product. Postgres was the billing system; ClickHouse was analytics. The data team built an incremental extractor keyed on updated_at from the subscriptions table. It looked perfect in dev, fine in staging, and “mostly fine” in production.
Then finance noticed churn numbers were consistently off by a small amount. Not huge. Just enough to cause weekly meetings where everyone brought their own spreadsheet like it was a comfort blanket.
The wrong assumption was simple: “updated_at changes whenever the row’s business meaning changes.” In reality, a background job toggled a boolean flag via a stored procedure that didn’t touch updated_at. Those rows never got extracted. They weren’t late. They were invisible.
They tried widening the extraction window. It helped a bit, then plateaued. They tried a full reload weekly. It “fixed” the numbers and turned Saturday into an unpleasant ritual.
The fix was boring and correct: they added a database trigger to enforce updated_at updates on any row change that mattered, and they introduced a secondary reconciliation check that compared daily counts and a checksum of key fields between Postgres and ClickHouse. Once they did that, missing updates turned into alerts instead of mysteries.
Mini-story 2: The optimization that backfired
Company B had a high-volume events pipeline into ClickHouse. Inserts were coming from multiple microservices, each sending small batches every few seconds. Someone noticed ingestion latency and “optimized” by increasing concurrency: more threads, more parallel inserts, smaller retry intervals. The graphs looked exciting. Like a rocket launch.
Two days later, query latency spiked. CPU wasn’t the issue. Disk was. Active parts per table went from “healthy” to “is this a bug?” Background merges started occupying the cluster’s life, and replication queues grew because replicas couldn’t keep up with part fetches and merges.
They had optimized the wrong thing. They reduced per-request latency for producers at the expense of cluster stability. ClickHouse can ingest very fast, but it wants big blocks. Lots of tiny blocks are like feeding a wood chipper one toothpick at a time: technically possible, emotionally draining.
The rollback was not glamorous: they forced upstream batching (bigger blocks, fewer inserts), introduced a buffer/queue to smooth bursts, and limited backfills to low-traffic windows. Ingestion latency improved anyway because merges became manageable. The “optimization” had been a tax, not a feature.
Mini-story 3: The boring but correct practice that saved the day
Company C ran a marketplace. Their analytics fed pricing decisions, so correctness mattered. They had a rule: every pipeline must publish a daily reconciliation report into a small Postgres table—counts, distinct keys, and a checksum per partition. It wasn’t fancy. It wasn’t machine-learning. It was arithmetic.
One Tuesday, a deploy changed how a service generated event IDs. It wasn’t malicious. It wasn’t even obviously wrong. But it changed the idempotency semantics: retries now produced new IDs instead of reusing the same one.
In ClickHouse, duplicates started piling up. Dashboards drifted upward. Most teams wouldn’t notice for a week, because the drift was gradual and everyone assumes “traffic is up.”
The reconciliation job noticed in hours: distinct event_id per day dropped relative to total rows, and the checksum diverged. The alert went to the right channel, and the on-call had enough evidence to point directly at “ID generation changed” instead of chasing phantom merge issues or blaming ClickHouse.
They fixed the producer, replayed a small window, and moved on. Nothing heroic happened. That’s the point: boring checks prevent exciting incidents.
Checklists / step-by-step plan
Step-by-step: choosing the right pattern for each dataset
- Classify the dataset: immutable events, slowly changing dimensions, mutable entities, or aggregated metrics.
- Define the unique key: event_id, natural key, or synthetic key + version.
- Pick the authoritative time: event_time vs updated_at vs CDC LSN; document it.
- Decide how you handle updates/deletes: ignore, append new version, or apply state changes via dedup/merge.
- Pick the extraction mechanism: incremental timestamp with lookback, numeric cursor, or CDC.
- Pick the ClickHouse table engine: MergeTree for immutable, ReplacingMergeTree for “latest version,” AggregatingMergeTree for rollups.
- Choose partitioning: usually time-based; confirm pruning with EXPLAIN.
- Choose ORDER BY: match the most common filters and grouping keys; avoid high-cardinality chaos if it hurts merge locality.
- Build reconciliation: per partition/window; counts + distinct keys + checksum.
- Define backfill procedure: partition-ordered, rate-limited, and observable.
Operational checklist: before shipping a new ETL pipeline
- Can I rerun the job without duplicates or drift?
- Do I have explicit watermarks stored somewhere durable?
- Do I know the max tolerated lateness, and does the design handle it?
- Are heavy extracts off the Postgres primary?
- Does ClickHouse ingestion batch into large blocks?
- Can I prove partition pruning works for the top 3 queries?
- Do I have an alert on CDC slot lag / ingestion lag / parts explosion?
- Can I backfill one day without impacting current-day freshness?
Backfill plan that won’t torch production
- Freeze the definition: version the transformation code used for backfill.
- Pick windows: do not backfill “all history” in one job. Use day/week partitions.
- Throttle: cap ClickHouse insert rate and concurrent merges; avoid peak query hours.
- Write to a shadow table: validate counts/checksums before swapping views or routing queries.
- Reconcile: compare source/sink per partition, not globally.
- Cut over gradually: start with internal dashboards before exec dashboards. Your future self enjoys fewer surprise meetings.
FAQ
1) Should I use PostgreSQL or ClickHouse as my staging area?
If staging is transient and small, Postgres can work. If staging is large, raw, and append-heavy, ClickHouse or an external landing zone is usually better. Staging should not become an unbounded junk drawer.
2) Can ClickHouse replace my Postgres analytics queries directly?
For read-heavy analytics, often yes. But ClickHouse won’t replace transactional semantics, strict constraints, and row-level update patterns without rethinking your data model.
3) What’s the simplest safe incremental load from Postgres?
An incremental extract using a reliable updated_at plus a lookback window, loading into a staging table, then upserting into the final model with deterministic keys.
4) When should I switch from timestamp-based incremental to CDC?
When updates/deletes matter, when you can’t trust updated_at, or when you need exact change capture without scanning large windows. CDC adds operational complexity but removes ambiguity.
5) Is ReplacingMergeTree enough to “do upserts” in ClickHouse?
It’s enough for “eventual last-write-wins” if you provide a stable dedup key and (ideally) a version column. If you need strict uniqueness at query time, you’ll need additional logic or a different approach.
6) Why are FINAL queries discouraged?
FINAL forces merges/deduplication at query time and can turn a fast scan into a costly operation. It’s a debugging tool or last resort, not a default.
7) How do I avoid breaking dashboards during schema evolution?
Version schemas, avoid SELECT *, validate null rates and cardinalities on new columns, and use explicit column lists in ingestion. Fail fast beats silent wrong.
8) What’s the best way to handle deletes from Postgres into ClickHouse?
Either model “tombstones” (a delete event) and filter at query time, or maintain a current-state table where deletes are applied via dedup/versioning logic. ClickHouse can do deletes, but frequent row-level deletes are expensive.
9) How do I know if ClickHouse is slow because of ingestion or because of queries?
Check active parts and merges backlog first (ingestion pressure), then query log read bytes and durations (query pressure). If both are high, your writes created the read problem—usually too many parts or poor partitioning.
10) What’s a reasonable reconciliation baseline?
Per partition/window: row count, distinct key count, and a checksum/hash across a stable projection of key fields. Add late-arrival rate if time matters (it does).
Conclusion: next steps that reduce risk this week
If you want ETL that doesn’t create data chaos, don’t start by tuning knobs. Start by deciding what “correct” means, then make it observable.
- Define idempotency keys for your top 3 datasets and document them where engineers will trip over them.
- Add explicit watermarks (source max time/version and sink max time/version) and alert on lag.
- Run one reconciliation job daily for one critical pipeline: counts, distinct keys, checksum by day. Ship it. Iterate.
- In ClickHouse, measure parts and merges and fix small-insert patterns before they become a lifestyle.
- Move heavy extracts off Postgres primary or switch them to CDC if updates matter and the table is large.
Production systems don’t reward optimism. They reward explicit contracts, controlled failure modes, and pipelines that you can restart on a Tuesday without negotiating with the universe.