You add a JSON column because you “just need flexibility.” Then your dashboards get slow, your replicas lag,
and someone asks for “a quick ad-hoc query” that turns into a table scan across millions of rows. JSON is the
duct tape of data modeling: sometimes it saves the day, sometimes it’s why the day needed saving.
MySQL and PostgreSQL both support JSON, but they reward very different habits. One will let you ship fast and
quietly accumulate debt. The other will let you build powerful indexes and constraints—while also giving you
enough rope to knit a sweater of bloat and lock contention if you’re careless.
The decision in one page: what to choose and when
Use PostgreSQL when…
- You need rich querying (containment, existence, nested filters) and want the optimizer to have options. PostgreSQL’s JSONB + GIN is the grown-up toolset.
- You want constraints around semi-structured data: CHECK constraints, expression indexes, generated columns, and functional indexes are first-class citizens.
- You expect JSON to stick around longer than a quarter. PostgreSQL tends to age better when JSON becomes “core schema.”
- You can operate vacuum competently. PostgreSQL will reward you, but only if you respect MVCC housekeeping.
Use MySQL when…
- Your JSON usage is mostly document storage + retrieval, not heavy analytical filtering. If queries are “get by id, return blob,” MySQL can be totally fine.
- You rely on generated columns to project hot JSON paths into indexed scalars. This is MySQL’s practical path to predictability.
- You’re already standardized on MySQL operationally and JSON is a small corner of the workload. Consistent ops beats theoretical elegance.
What I’d tell a production team
If your JSON columns are a transitional hack (ingest fast, normalize later), pick whichever database your team already
operates well. But if JSON is the interface contract (events, configurations, feature flags, user attributes) and you expect to query
inside it at scale, PostgreSQL is usually the safer long-term bet.
MySQL can perform well with JSON, but it often demands you “declare the important bits” via generated columns and targeted indexes.
If you don’t, you’ll end up explaining to leadership why your flexible schema became inflexible latency.
One quote that belongs on every on-call runbook: "Hope is not a strategy" — a widely repeated operations maxim.
With JSON, hoping the database will figure it out is how you buy yourself a weekend incident.
Facts and history: how we got here
JSON-in-the-database feels modern, but the industry has been circling this idea for decades: “store flexible data near structured data,
and query it without giving up transactional safety.” The details differ, and those details are why you’re reading this instead of sleeping.
8 facts worth keeping in your head
- PostgreSQL added JSON in 9.2 (2012), then introduced JSONB in 9.4 (2014) for binary storage and better indexing.
- MySQL introduced a native JSON type in 5.7 (2015); before that, it was TEXT with a prayer and a regex.
- JSONB normalizes key order and removes duplicate keys (last key wins). That’s great for indexing, surprising for “store exactly what I sent.”
- MySQL stores JSON in a binary format too, and it validates JSON on insert, avoiding some “invalid blob” horror.
- PostgreSQL’s GIN indexes were originally built for full-text search, then became the workhorse for JSONB containment.
- MySQL’s generated columns have existed since 5.7, and they’re the reason many MySQL JSON deployments don’t melt down.
- MVCC in PostgreSQL means updates create new row versions; large JSON updates can amplify bloat unless vacuum keeps up.
- Replication formats matter: MySQL row-based binlog and PostgreSQL logical decoding behave differently under frequent JSON updates and hot rows.
JSON semantics: what the engines really store
MySQL: JSON is a type, but treat it like a document unless you project fields
MySQL’s JSON type is not “TEXT with a label.” It’s validated, stored in a binary representation, and manipulated with JSON functions.
That’s the good news. The operational news is that you rarely get sustained performance unless you do one of two things:
(1) keep JSON mostly write-once/read-by-primary-key, or (2) pull frequently queried paths into generated columns and index those.
MySQL will happily let you write a query that looks selective but isn’t indexable. The optimizer will do what it can, then it will scan.
You can sometimes rescue it with functional indexes (version-dependent) or generated columns, but you have to be intentional.
PostgreSQL: JSONB is for querying; JSON (text) is for preserving exact input
PostgreSQL gives you two different philosophies:
json stores the original text (including whitespace and ordering), and
jsonb stores a decomposed binary format optimized for operators and indexing.
If you want performance, you almost always want JSONB.
PostgreSQL’s operators are expressive: containment (@>), existence (?, ?|, ?&),
path extraction (->, ->>, #>, #>>), and JSON path queries.
That expressiveness can be a trap: people write clever filters that look cheap and end up CPU-bound on decompression or stuck on an index that doesn’t match the predicate.
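To make the operator families concrete, here are a few hedged examples, assuming a table events(id bigint, payload jsonb) like the one used in the tasks later in this article:

```sql
-- Containment: rows whose payload includes these nested pairs (GIN-indexable).
SELECT id FROM events WHERE payload @> '{"customer": {"id": "12345"}}';

-- Key existence: rows whose payload has a top-level "refund" key.
SELECT id FROM events WHERE payload ? 'refund';

-- Path extraction: -> returns jsonb, ->> returns text.
SELECT payload->'customer'->>'id' AS customer_id FROM events LIMIT 10;
```

Note that each operator shape wants a matching index; an index built for containment will not rescue an equality filter on an extracted text value.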
Joke 1/2: JSON is like a junk drawer—everything fits until you actually need to find the scissors.
Indexing JSON: where performance is made or lost
MySQL indexing: generated columns are the adult move
In MySQL, indexing arbitrary JSON expressions has improved over time, but the most reliable operational pattern is still:
define generated columns for the few JSON paths you query all the time, cast them to stable scalar types, and index them.
This does three things:
- Gives the optimizer a normal B-tree index it understands.
- Avoids repeated JSON extraction at runtime.
- Forces you to admit which fields are actually part of the “real schema.”
The catch: schema changes become slower and more political, because now the JSON blob has tentacles into DDL and migrations.
That’s not a bug. That’s the price of pretending semi-structured data has structure (because it does, once you rely on it).
When MySQL JSON indexing fails in practice
- Overly dynamic predicates (different JSON paths depending on user input) push you toward scans.
- Comparing JSON strings to numbers causes implicit casts and breaks index usage.
- Using functions in WHERE without an indexable expression makes the optimizer shrug and do work the slow way.
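The implicit-cast failure mode is worth seeing once. A sketch, assuming an events table with a JSON payload column (column and index names are illustrative):

```sql
-- JSON_EXTRACT returns a JSON value; comparing it to a bare number means
-- per-row extraction plus casting, so the optimizer has nothing indexable.
SELECT id FROM events WHERE JSON_EXTRACT(payload, '$.amount') > 100;

-- Safer: project the path with an explicit, stable scalar type and index it.
ALTER TABLE events
  ADD COLUMN amount DECIMAL(12,2)
    GENERATED ALWAYS AS (CAST(payload->>'$.amount' AS DECIMAL(12,2))) STORED,
  ADD INDEX idx_events_amount (amount);

SELECT id FROM events WHERE amount > 100;
```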
PostgreSQL indexing: GIN is powerful, but you must choose the operator class
PostgreSQL’s JSONB indexing story is stronger, but it’s not magic. GIN indexes can accelerate containment and key-existence queries,
but they have different operator classes:
- jsonb_ops: indexes more kinds of operations but can be larger.
- jsonb_path_ops: more compact and faster for containment, but supports fewer operators.
If your workload is “find rows where JSON contains these pairs,” jsonb_path_ops is often the right call.
If you need flexible existence and more operator support, jsonb_ops.
Pick wrong, and you’ll have an index that exists purely to make VACUUM sad.
Expression indexes: the practical bridge between JSON and relational
If you frequently filter on one extracted field (say, payload->>'customer_id'), an expression index can beat a broad GIN
in size and predictability. It’s also easier to reason about selectivity.
Joke 2/2: A GIN index is like caffeine—amazing when targeted, regret when you overdo it.
Query patterns that separate “fine” from “on fire”
Pattern 1: “Fetch by id and return JSON” (safe-ish)
Both MySQL and PostgreSQL handle this well. The dominant cost is I/O and row size, not JSON functions.
Where teams get hurt is the slow creep: JSON grows, row size grows, cache efficiency drops, and suddenly “simple reads” become disk reads.
Pattern 2: “Filter by JSON keys with high cardinality” (index or die)
If you filter by user_id, tenant_id, or order_id inside JSON, you are effectively filtering by a relational key.
Don’t pretend it’s flexible. Promote it: generated column + index in MySQL, expression index in Postgres, or just make it a real column.
This is not ideology. It’s about avoiding full scans and unstable query plans.
Pattern 3: “Ad-hoc analytics over JSON” (beware the slow creep)
JSON is attractive for analytics because it’s self-describing. In production OLTP databases, that’s a trap.
Ad-hoc analytics tends to:
- Use functions on many rows, causing CPU burn.
- Force sequential scans because predicates don’t match indexes.
- Serialize your workload on one big table and one hot disk subsystem.
If the business wants analytics, either carve out a reporting replica with stricter guardrails, or stream events elsewhere.
“Just run it on prod” is a budget decision disguised as an engineering decision.
Pattern 4: partial updates to JSON (hot rows, heavy logs)
Both databases can update paths inside JSON, but the performance characteristics differ and the operational impact is similar:
frequent updates to big JSON documents mean more bytes written, more index churn, more replication work, and more cache invalidation.
The practical rule: if a JSON field is updated frequently and read frequently, it deserves a real column or a separate table.
JSON is not a free pass on normalization; it’s a delayed invoice.
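For reference, the partial-update syntax in each engine — a sketch using the events table from this article; the in-place look of these statements does not make them cheap at the storage level:

```sql
-- PostgreSQL: rewrite one path inside a jsonb value (still a full new row version under MVCC).
UPDATE events SET payload = jsonb_set(payload, '{customer,tier}', '"gold"') WHERE id = 42;

-- MySQL: same idea with JSON_SET.
UPDATE events SET payload = JSON_SET(payload, '$.customer.tier', 'gold') WHERE id = 42;
```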
Updates, WAL/binlog, and replication lag
MySQL: binlog volume and row-based replication realities
In MySQL, large JSON updates can produce big binlog events—especially with row-based replication. If you update many rows or update large
documents, your replicas pay the price. Replication lag is rarely “a replica problem.” It’s an application write amplification problem.
Also watch for transaction size and commit frequency. A workload that updates JSON in bursts can create nasty spikes: fsync pressure,
binlog flush stalls, and replica SQL thread backlog.
PostgreSQL: WAL pressure + MVCC churn
PostgreSQL writes WAL for changes, and MVCC means updates create new row versions. Update a big JSONB field frequently and you’ll get:
more WAL, more dead tuples, more vacuum work, and potentially more index bloat.
Replication lag shows up as WAL sender backlog or replay delay. The key is to distinguish two cases:
- the replica can't apply changes fast enough (CPU- or I/O-bound on apply), versus
- the primary producing too much WAL (write amplification).
Operational guidance
- Measure WAL/binlog bytes per second during peak. It’s the closest thing to “truth” about write amplification.
- Partition or split hot JSON fields if update rates are high.
- On PostgreSQL, tune autovacuum for tables with heavy JSON updates, or vacuum debt will show up as latency debt.
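Per-table autovacuum settings can be tightened without touching the global config. The numbers below are illustrative starting points for a high-churn JSONB table, not recommendations:

```sql
-- Trigger vacuum/analyze on events after ~2% dead tuples instead of the default 20%.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.02
);
```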
Storage and I/O reality: bloat, page churn, and cache behavior
Row size and cache: your invisible tax
JSON columns make rows bigger. Bigger rows mean fewer rows per page. Fewer rows per page means more page reads for the same number of logical rows.
This shows up as:
- Higher buffer pool churn in MySQL (InnoDB).
- More shared_buffers churn in PostgreSQL.
- More pressure on the OS page cache.
Most “mysterious performance regressions” after adding JSON are actually “we doubled row size and no one adjusted memory or access patterns.”
PostgreSQL bloat: MVCC means you owe the vacuum collector
PostgreSQL doesn’t update in place; it creates new row versions. If JSONB is big and frequently updated, dead tuples accumulate and indexes
churn. Autovacuum can handle a lot, but it needs the right thresholds. Default settings are designed to be safe for beginners, not optimal for your mess.
MySQL: secondary indexes and undo/redo pressure
MySQL’s InnoDB has its own write amplification: redo logs, undo logs, doublewrite buffer, secondary index maintenance.
Big JSON updates increase the bytes touched and can push you into log flush stalls. You’ll see it as intermittent latency spikes,
“suddenly slow commits,” and replicas falling behind.
Practical tasks: 14 commands you can run today
These are the kinds of commands I run during an incident or a performance review. Each task includes:
the command, what the output means, and what decision you make next.
Hostnames and paths are deliberately boring; boring is repeatable.
Task 1 (MySQL): confirm JSON usage and size pressure
cr0x@server:~$ mysql -e "SELECT table_schema, table_name, column_name, data_type FROM information_schema.columns WHERE data_type='json' ORDER BY table_schema, table_name;"
+--------------+------------+-------------+-----------+
| table_schema | table_name | column_name | data_type |
+--------------+------------+-------------+-----------+
| app | events | payload | json |
| app | users | attrs | json |
+--------------+------------+-------------+-----------+
Meaning: you now know which tables are candidates for JSON-related pain.
Decision: shortlist the top 1–3 tables by row count and update rate. Those are where indexing and schema choices matter.
Task 2 (MySQL): check table sizes and index footprint
cr0x@server:~$ mysql -e "SELECT table_name, table_rows, ROUND(data_length/1024/1024,1) AS data_mb, ROUND(index_length/1024/1024,1) AS index_mb FROM information_schema.tables WHERE table_schema='app' ORDER BY data_length DESC LIMIT 10;"
+------------+------------+---------+----------+
| table_name | table_rows | data_mb | index_mb |
+------------+------------+---------+----------+
| events | 4821031 | 8120.4 | 2104.7 |
| users | 820114 | 1190.8 | 412.2 |
+------------+------------+---------+----------+
Meaning: JSON-heavy tables tend to balloon data_mb.
Decision: if data_mb is growing faster than business growth, you need to cap payload size, compress upstream, or normalize hot fields.
Task 3 (MySQL): identify slow JSON predicates in the slow log
cr0x@server:~$ sudo pt-query-digest /var/log/mysql/mysql-slow.log --limit 5
# 1.2s user time, 40ms system time, 27.31M rss, 190.55M vsz
# Query 1: 0.68 QPS, 0.31x concurrency, ID 0xA1B2C3D4 at byte 91234
# Time range: 2025-12-28T00:00:00 to 2025-12-28T01:00:00
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Exec time 62 180s 120ms 12s 540ms 3s 900ms 300ms
# Rows examine 90 1200M 10 2.5M 360k 1.1M 500k 200k
# Query: SELECT ... WHERE JSON_EXTRACT(payload,'$.customer.id') = ?
Meaning: rows examined is your “scan tax.” JSON_EXTRACT in WHERE without an index is a usual suspect.
Decision: create a generated column for that path (or a functional index if appropriate) and rewrite the query to use it.
Task 4 (MySQL): verify whether a query uses an index
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM app.events WHERE JSON_UNQUOTE(JSON_EXTRACT(payload,'$.customer.id'))='12345' LIMIT 10\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: events
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 4821031
filtered: 10.00
Extra: Using where
Meaning: type: ALL and no key means full table scan.
Decision: don’t tune buffers first. Fix the schema/query: generated column + index, or redesign.
Task 5 (MySQL): add a generated column for a hot JSON path
cr0x@server:~$ mysql -e "ALTER TABLE app.events ADD COLUMN customer_id VARCHAR(64) GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(payload,'$.customer.id'))) STORED, ADD INDEX idx_events_customer_id (customer_id);"
Query OK, 0 rows affected (2 min 41 sec)
Records: 0 Duplicates: 0 Warnings: 0
Meaning: STORED generated column materializes the value, index becomes usable.
Decision: rewrite application queries to filter by customer_id instead of JSON_EXTRACT in WHERE. Then re-check EXPLAIN.
Task 6 (MySQL): validate optimizer now uses the new index
cr0x@server:~$ mysql -e "EXPLAIN SELECT id FROM app.events WHERE customer_id='12345' LIMIT 10\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: events
type: ref
possible_keys: idx_events_customer_id
key: idx_events_customer_id
key_len: 258
ref: const
rows: 120
Extra: Using index
Meaning: you went from scanning millions to touching ~120 rows.
Decision: ship it, then watch write latency: maintaining the new index increases write cost.
Task 7 (MySQL): check replication lag and apply pressure
cr0x@server:~$ mysql -e "SHOW REPLICA STATUS\G" | egrep "Seconds_Behind_Source|Replica_SQL_Running|Replica_IO_Running|Last_SQL_Error"
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 87
Last_SQL_Error:
Meaning: lag exists even though threads run. Usually apply can’t keep up with writes.
Decision: measure binlog rate and transaction size; reduce JSON update volume or batch behavior before blaming the replica.
Task 8 (PostgreSQL): list JSON/JSONB columns and their tables
cr0x@server:~$ psql -d appdb -c "SELECT table_schema, table_name, column_name, data_type FROM information_schema.columns WHERE data_type IN ('json','jsonb') ORDER BY 1,2,3;"
table_schema | table_name | column_name | data_type
--------------+------------+-------------+-----------
public | events | payload | jsonb
public | users | attrs | jsonb
(2 rows)
Meaning: scope. Same as MySQL: identify the few tables that matter most.
Decision: focus on tables with high update rates and customer-facing queries first.
Task 9 (PostgreSQL): find the worst JSON queries by total time
cr0x@server:~$ psql -d appdb -c "SELECT calls, total_exec_time::bigint AS total_ms, mean_exec_time::numeric(10,2) AS mean_ms, rows, query FROM pg_stat_statements WHERE query ILIKE '%jsonb%' OR query ILIKE '%->%' OR query ILIKE '%@>%' ORDER BY total_exec_time DESC LIMIT 5;"
calls | total_ms | mean_ms | rows | query
-------+----------+---------+------+-------------------------------------------
18211 | 932144 | 51.20 | 0 | SELECT ... WHERE payload @> $1
4102 | 512030 | 124.82 | 0 | SELECT ... WHERE (payload->>'customer')= $1
(2 rows)
Meaning: you have hot queries, not theories.
Decision: run EXPLAIN (ANALYZE, BUFFERS) on the top offenders and build the right index for the predicate shape.
Task 10 (PostgreSQL): inspect a JSONB query plan with buffers
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM events WHERE payload @> '{\"customer\":{\"id\":\"12345\"}}'::jsonb LIMIT 10;"
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..8.44 rows=10 width=8) (actual time=0.088..0.146 rows=10 loops=1)
Buffers: shared hit=42
-> Index Scan using idx_events_payload_gin on events (cost=0.42..22134.77 rows=26235 width=8) (actual time=0.086..0.141 rows=10 loops=1)
Index Cond: (payload @> '{"customer": {"id": "12345"}}'::jsonb)
Buffers: shared hit=42
Planning Time: 0.412 ms
Execution Time: 0.182 ms
(7 rows)
Meaning: index scan + mostly buffer hits = healthy.
Decision: keep this index if it supports core product paths. If it’s only for ad-hoc queries, don’t pay the write tax.
Task 11 (PostgreSQL): create a targeted GIN index (choose operator class)
cr0x@server:~$ psql -d appdb -c "CREATE INDEX CONCURRENTLY idx_events_payload_pathops ON events USING gin (payload jsonb_path_ops);"
CREATE INDEX
Meaning: concurrent build avoids blocking writes (at the cost of time and extra work).
Decision: use jsonb_path_ops when containment dominates; otherwise consider jsonb_ops or expression indexes.
Task 12 (PostgreSQL): build an expression index for a single hot path
cr0x@server:~$ psql -d appdb -c "CREATE INDEX CONCURRENTLY idx_events_customer_id_expr ON events ((payload->'customer'->>'id'));"
CREATE INDEX
Meaning: this makes equality predicates on that extracted value predictable and cheap.
Decision: if this path is stable and widely used, consider promoting it to a real column to reduce JSON processing altogether.
Task 13 (PostgreSQL): check bloat signals and autovacuum effectiveness
cr0x@server:~$ psql -d appdb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_live_tup | n_dead_tup | last_autovacuum | last_vacuum
---------+------------+------------+------------------------+------------------------
events | 4809123 | 912044 | 2025-12-28 00:41:12+00 | 2025-12-22 03:11:02+00
users | 820104 | 12033 | 2025-12-28 00:38:01+00 | 2025-12-23 02:08:40+00
(2 rows)
Meaning: dead tuples on events are high; autovacuum is running, but may be under-provisioned for update churn.
Decision: tune autovacuum thresholds for that table, reduce update frequency to large JSONB, or split hot mutable fields out.
Task 14 (System-level): identify whether you’re I/O bound or CPU bound
cr0x@server:~$ iostat -x 1 5
Linux 6.1.0 (db01) 12/29/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
22.11 0.00 6.34 18.90 0.00 52.65
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s w_await wareq-sz aqu-sz %util
nvme0n1 320.0 18240.0 0.0 0.00 4.20 57.00 410.0 24576.0 9.80 59.95 6.10 92.0
Meaning: high %util and significant iowait points to storage saturation. JSON workloads often inflate I/O due to larger rows and index churn.
Decision: fix query/index patterns first; if still saturated, scale IOPS (better disks) or reduce write amplification (schema/design changes).
Fast diagnosis playbook
When JSON queries get slow, people waste hours arguing about “database choice” instead of finding the actual bottleneck.
This playbook is the order I’d run in an incident—because it converges quickly.
First: prove whether it’s a scan, an index miss, or raw I/O
- MySQL: run EXPLAIN on the slow query. If you see type: ALL, stop and fix the predicate/index.
- PostgreSQL: run EXPLAIN (ANALYZE, BUFFERS). If you see sequential scans on big tables, you need a matching index or query rewrite.
- System: check iostat -x. If storage is pegged, scans and bloat will be your prime suspects.
Second: quantify write amplification and replication pressure
- MySQL: inspect replication lag and binlog growth patterns; large JSON updates often correlate with lag spikes.
- PostgreSQL: check WAL generation and dead tuples; heavy JSON updates can turn vacuum into a permanent background crisis.
Third: check cache effectiveness and row size creep
- Is your hot working set still in memory, or did JSON growth evict it?
- Did you add a broad GIN index that doubled write cost?
- Did someone start doing ad-hoc filters on unindexed JSON keys?
Fourth: fix the smallest thing that changes the curve
- Promote hot keys to real columns (best) or generated/expression columns (next best).
- Add the right index for the predicate shape, then validate with EXPLAIN.
- If updates are the problem, split mutable fields out of the JSON blob.
Common mistakes: symptoms → root cause → fix
Mistake 1: “Query looks selective but is slow”
Symptoms: latency grows with table size; EXPLAIN shows full scan; CPU spikes during peak.
Root cause: JSON extraction in WHERE without an indexable expression (MySQL), or mismatch between operator and index (PostgreSQL).
Fix: MySQL: STORED generated column + B-tree index; PostgreSQL: expression index or correct GIN operator class; rewrite predicate to match index.
Mistake 2: “We added a GIN index and writes got slower”
Symptoms: insert/update latency increases; WAL/binlog rate spikes; replication lag worsens after index creation.
Root cause: broad GIN index on large JSONB with frequent updates; high index maintenance cost.
Fix: replace with narrower expression indexes; use jsonb_path_ops if containment-only; split mutable fields out; reconsider whether you need that query on OLTP.
Mistake 3: “Postgres is slow over time; vacuum can’t keep up”
Symptoms: table and index sizes grow; queries slow; autovacuum runs constantly; dead tuples high.
Root cause: frequent updates to large JSONB fields create many dead tuples; autovacuum thresholds not tuned for the table’s churn.
Fix: tune per-table autovacuum settings; reduce update frequency/size; move mutable data into separate table; consider partitioning for event-like tables.
Mistake 4: “MySQL replication lag after adding JSON features”
Symptoms: Seconds_Behind_Source climbs during bursts; replicas recover slowly; commits are spiky.
Root cause: large row-based binlog events from JSON updates; oversized transactions; too many secondary indexes on projected JSON fields.
Fix: reduce JSON update volume; batch differently; limit indexed projections to truly hot paths; verify binlog/redo log settings and commit patterns.
Mistake 5: “We stored everything in JSON and now we need constraints”
Symptoms: inconsistent values in JSON; app-level validations drift; queries must handle missing keys and wrong types.
Root cause: schema outsourced to application code; no enforced constraints; migrations avoided until too late.
Fix: promote key fields to columns; add CHECK constraints (Postgres) or enforce via generated columns + NOT NULL (MySQL); introduce versioned payloads.
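A minimal sketch of the PostgreSQL side of that fix, assuming a users table with an attrs jsonb column and a hypothetical verification_status field:

```sql
-- Promote the field to a real column, backfill it, then constrain it.
ALTER TABLE users ADD COLUMN verification_status text;
UPDATE users SET verification_status = attrs->>'verification_status';
ALTER TABLE users
  ADD CONSTRAINT users_verification_status_chk
  CHECK (verification_status IN ('pending', 'verified', 'rejected'));
```

Run the backfill in batches on a large table; a single UPDATE across millions of rows creates exactly the MVCC churn this article warns about.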
Three corporate mini-stories from the JSON trenches
1) Incident caused by a wrong assumption: “JSON is basically free to query”
A mid-size SaaS company shipped an “activity feed” backed by a table of events. Each event had a JSON payload.
The product team wanted filtering: “show only events where payload.actor.role = ‘admin’.” Easy, they thought.
The backend used MySQL, and the first implementation used JSON_EXTRACT in the WHERE clause.
In staging it was fine. In production it was a slow-motion disaster: the events table was large, and the filter was popular.
The query looked selective, but it did a full scan, touching millions of rows per request during peak.
CPU pegged, I/O saturated, and the whole cluster developed the “everything is slow” symptom that makes executives join the incident channel.
The wrong assumption wasn’t “MySQL can’t do JSON.” It was: “if the predicate is narrow, the database will optimize it.”
Databases optimize what you index, not what you hope. JSON extraction without an indexable expression isn’t narrow; it’s expensive math repeated across many rows.
The fix was painfully straightforward: add a STORED generated column for actor_role, index it, and change the query.
The postmortem added a rule: any JSON key used in a hot WHERE clause must be projected and indexed, or moved to a real column.
Flexible schema remained, but only where it wasn’t on the critical path.
2) Optimization that backfired: “Just add a big GIN index”
Another company ran PostgreSQL and had a single massive events table with JSONB payloads.
They wanted faster ad-hoc search for customer support, so someone added a broad GIN index on the entire payload using the default operator class.
Query speed improved instantly. Everyone high-fived and moved on.
Two weeks later, write latency started creeping up. Autovacuum activity became constant. Replication delay appeared during peak.
The GIN index was expensive to maintain because the payloads were large and frequently updated with enrichment fields.
The index also grew quickly, increasing checkpoint and I/O pressure. The “support search” win became an “every API endpoint is slower” problem.
The backfire was not that GIN is bad. It was that they indexed everything, for a query workload that wasn’t actually core.
The index turned the database into a search engine. PostgreSQL can do that, but you pay in write amplification and bloat.
The eventual fix: remove the broad index, add two expression indexes for the handful of keys used in support filters,
and move full-text-ish searching out of the OLTP path. Support still got their workflow, but production stopped paying the tax on every write.
3) Boring but correct practice that saved the day: “Make JSON a contract, version it, and test it”
A fintech-ish team stored customer verification metadata in JSONB in PostgreSQL. It included nested fields, optional keys, and vendor-specific blocks.
They knew this data would evolve, and they also knew they’d need to query a few fields reliably for compliance reports.
So they did something that feels unsexy: they added a schema_version integer column and wrote explicit migrations for payload shape changes.
They also promoted a few critical fields to real columns: customer_id, verification_status, and vendor_name.
Everything else lived in JSONB. On top of that, they had CHECK constraints ensuring the status column matched a known set,
and application tests that validated JSON schema compatibility per version.
Months later, a vendor changed their payload format in a subtle way (a field moved deeper).
Teams that store raw JSON without a contract usually discover this when reports break at 2 a.m.
This team discovered it in CI, because a schema validation test failed and the migration tooling forced an explicit transform.
The “boring practice” wasn’t a fancy index. It was treating JSON as a versioned contract, not an unbounded junk drawer.
Production benefited: query performance stayed stable, and incident frequency stayed low—the kind of win that never gets a celebratory email.
Checklists / step-by-step plan
If you are starting a new JSON-heavy feature
- Write down the top 5 query patterns you expect in the next six months (not just launch week).
- Classify fields: immutable vs mutable; frequently filtered vs rarely filtered; high cardinality vs low cardinality.
- Promote the “frequently filtered, high-cardinality” fields to real columns (preferred) or generated/expression columns.
- Choose database-specific indexing strategy:
- MySQL: STORED generated columns + B-tree indexes; avoid JSON_EXTRACT in hot WHERE clauses.
- PostgreSQL: expression indexes for hot paths; GIN for containment/existence; select operator class intentionally.
- Set payload size budgets (soft and hard limits). JSON growth is silent until it isn’t.
- Plan for evolution: add schema_version, document transforms, and make migrations routine.
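Size budgets can be enforced in the database rather than just documented. Both forms below are sketches with an arbitrary 64 KB hard limit:

```sql
-- PostgreSQL: reject oversized payloads at write time.
ALTER TABLE events ADD CONSTRAINT events_payload_size_chk
  CHECK (pg_column_size(payload) < 65536);

-- MySQL (CHECK constraints are enforced from 8.0.16): same idea with JSON_STORAGE_SIZE.
ALTER TABLE events ADD CONSTRAINT events_payload_size_chk
  CHECK (JSON_STORAGE_SIZE(payload) < 65536);
```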
If you already shipped and it’s slow
- Find the top 3 queries by total time (slow log / pg_stat_statements).
- Run EXPLAIN with reality (MySQL EXPLAIN, Postgres EXPLAIN (ANALYZE, BUFFERS)). Don't guess.
- Add the smallest index that matches the predicate (generated column index or expression index) and verify plan changes.
- Measure write-side cost after indexing (commit latency, WAL/binlog rate, replication lag).
- If updates are heavy, split mutable fields out of JSON and into a separate table with a proper key.
- Put guardrails on ad-hoc queries (timeouts, read replicas, or a dedicated reporting path).
If you’re deciding between MySQL and PostgreSQL for JSON today
- Pick PostgreSQL if JSON querying is a product feature, not an implementation detail.
- Pick MySQL if JSON is mostly storage and you’re willing to project the hot keys into indexed generated columns.
- Pick the database your team can operate under incident conditions. A theoretically superior feature set doesn’t page in your on-call’s brain at 3 a.m.
FAQ
1) Is PostgreSQL always better for JSON than MySQL?
No. PostgreSQL is usually better for complex querying and flexible indexing. MySQL can be excellent when you keep JSON usage simple
or you project hot paths into generated columns. “Always” is how outages start.
2) Should I store JSON as TEXT/VARCHAR instead?
Usually not. You lose validation and many JSON operators. If you truly never query inside the JSON and just store and retrieve it,
TEXT can work—but you’re taking on data hygiene risk. Native JSON types are safer for correctness.
3) When should a JSON key become a real column?
If it’s used in joins, used in hot WHERE clauses, used for sorting, or needed for constraints, it’s a column. If it’s updated frequently,
it’s probably a column or a separate table. JSON is for variability, not for core identity.
4) Do GIN indexes solve JSONB performance in PostgreSQL?
They solve some problems. They can also create others (write cost, bloat, maintenance).
Use GIN when your predicates align with containment/existence and the indexed data is stable enough to justify the write tax.
5) What’s the MySQL equivalent of a Postgres GIN index on JSONB?
There isn’t a direct equivalent. In MySQL, you typically create generated columns that extract scalar values and index those.
That’s a different philosophy: you decide what matters up front.
6) How do I prevent “random keys everywhere” in JSON?
Treat JSON as a contract: version it, validate it, and document allowed shapes.
Enforce critical invariants with database constraints (Postgres) or generated columns + NOT NULL/type casts (MySQL).
7) Why do partial JSON updates still feel expensive?
Because “partial update” at the SQL level can still mean substantial rewrite and index churn at the storage level,
plus WAL/binlog volume. Big documents updated frequently are expensive, regardless of how pretty the SQL looks.
8) Can I use JSON for multi-tenant data and just filter on tenant_id inside JSON?
You can, but you shouldn’t. Tenant isolation belongs in a real column with an index.
Putting it in JSON makes it easier to accidentally scan across tenants and harder to enforce constraints and performance boundaries.
9) What’s the safest “hybrid model” pattern?
Store core fields as columns (ids, status, timestamps, foreign keys), store optional/vendor-specific fields in JSON/JSONB,
and index only the small subset of JSON paths you actually query. Everything else stays flexible without driving core query cost.
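A hedged sketch of that hybrid shape in PostgreSQL (table and index names are illustrative):

```sql
CREATE TABLE events (
  id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  tenant_id  bigint      NOT NULL,
  status     text        NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  payload    jsonb       NOT NULL  -- optional / vendor-specific fields only
);

CREATE INDEX idx_events_tenant ON events (tenant_id, created_at);
-- Index only the one JSON path that product queries actually filter on.
CREATE INDEX idx_events_customer_id ON events ((payload->'customer'->>'id'));
```

The same shape in MySQL swaps the expression index for a STORED generated column plus a B-tree index, as shown in Task 5.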
Conclusion: next steps that won’t embarrass you later
JSON in MySQL and PostgreSQL isn’t a novelty anymore. It’s a production tool—and like all production tools, it rewards discipline.
MySQL tends to want you to project structure out of JSON and index it explicitly. PostgreSQL gives you more expressive querying and indexing,
but it will bill you in WAL, bloat, and maintenance if you index too broadly or update large JSONB fields too often.
Practical next steps:
- Identify the top 3 JSON queries by total time and run EXPLAIN with real execution stats.
- Promote the top 3 JSON keys used for filtering/joining into columns or generated/expression columns and index them.
- Measure write amplification (WAL/binlog rate) before and after indexing; keep an eye on replication lag.
- Put a payload size budget in place and enforce it at ingestion.
- Version your JSON payloads. Future-you will otherwise spend a weekend decoding “why does this key sometimes exist.”
Pick the database that matches your team’s operational strengths, then design JSON usage like you expect it to become permanent—because it usually does.