You have a “slow query.” Everyone says it’s the database. Someone posts an EXPLAIN screenshot in chat with a red circle around “Seq Scan” or “Using temporary,” and the room collectively decides: “Add an index.” Two hours later the index is built, write latency spikes, and the query is still slow. Now you have a slow query and a slow system.
Query plans are not lies, but they are not the whole truth either. They’re a story the optimizer tells itself. Your job—especially in production—is to figure out what part of that story maps to reality: CPU, I/O, memory pressure, lock waits, stats drift, parameter weirdness, or the truly boring culprit: storage latency.
A practical mental model: plans are hypotheses
Treat an execution plan like you treat an incident timeline. It’s evidence, not a verdict. Both MariaDB and PostgreSQL optimizers choose plans based on a cost model. Cost models depend on statistics, configuration assumptions, and implementation details. They’re often right. Sometimes they’re hilariously wrong, and they fail in the most predictable ways: data distribution changes, parameters change, or concurrency changes.
The workflow that actually works in production has three layers:
- Confirm the symptom: time, frequency, concurrency, and what “slow” means (p95? p99? wall time? CPU time?).
- Find the resource bottleneck: CPU, I/O, memory, locks, network, or something outside the database.
- Use the plan to explain the bottleneck: not the other way around.
If you start with the plan and hunt for a line item to “fix,” you’ll often treat symptoms. You’ll also ship a “performance improvement” that increases risk: new index bloat, worse write amplification, or a different plan regression on Monday.
To paraphrase Werner Vogels’ long-standing point: everything fails, so design for it. The same attitude applies to query plans: assume they’ll be wrong sometimes, and build guardrails.
Interesting facts and history (because it explains today’s weirdness)
Some performance mysteries make more sense once you know where the engines came from. Here are concrete bits of context that matter when you’re reading plans under pressure:
- PostgreSQL’s optimizer is heavily statistics-driven, and it will happily pick a “correct” plan that’s wrong for your parameter values unless you force better selectivity estimates.
- MariaDB inherited MySQL’s EXPLAIN vocabulary (“Using temporary”, “Using filesort”), which describes implementation tactics more than root causes. It’s useful, but it’s not a complete diagnosis.
- InnoDB became the default MySQL engine years ago, and MariaDB kept that lineage. A lot of “query tuning” is actually “InnoDB buffer pool behavior” in disguise.
- Postgres has MVCC at its core; “bloat” and vacuum behavior can change I/O patterns dramatically without any query text changing.
- Postgres supports multiple index types (B-tree, GIN, GiST, BRIN, hash). That means “add an index” is not a single decision; it’s a design choice.
- MariaDB has optimizer trace and switch flags that can be toggled per session, which is great for experimentation and terrible for consistency if abused.
- Postgres introduced parallel query over time; if your plan suddenly shows workers, you may be measuring CPU scheduling and memory pressure, not “SQL speed.”
- MySQL/MariaDB historically leaned on nested loop joins as the bread-and-butter, while Postgres has a wider range of join strategies and will choose them more readily when stats suggest it.
- Postgres can cache a generic plan for prepared statements, which can be fantastic until your parameter selectivity varies wildly.
Joke #1: The optimizer is like a confident coworker: it always has a plan, and it’s always sure it’s right.
Fast diagnosis playbook (first/second/third checks)
When production is hot and the CEO’s dashboard is cold, you don’t have time for philosophical plan analysis. You need a short sequence that finds the real bottleneck quickly and prevents “fixes” that make things worse.
First: determine if the query is waiting or working
- Waiting: lock waits, I/O waits, network waits, CPU run queue, checkpoint stalls.
- Working: burning CPU, reading a lot of pages, sorting, hashing, materializing intermediate results.
Second: check concurrency and contention
- How many sessions are running the same query?
- Is there a thundering herd after a deploy or cache expiry?
- Are writes blocking reads (or vice versa) due to transaction scope?
Third: confirm the plan matches reality
- Postgres: compare EXPLAIN (ANALYZE, BUFFERS) estimated rows vs actual rows and buffer reads.
- MariaDB: use EXPLAIN FORMAT=JSON, check chosen indexes, and validate handler read counts.
Fourth: decide your lever
- Query rewrite (reduce rows early, avoid implicit casts, avoid functions on indexed columns).
- Index change (correct key order, partial/functional index in Postgres, composite index vs single).
- Stats and maintenance (ANALYZE, histograms in MariaDB, VACUUM in Postgres).
- Operational limiters (connection pool, work_mem/tmp_table_size, IOPS cap, noisy neighbor).
MariaDB vs PostgreSQL: what a plan does (and doesn’t) tell you
PostgreSQL plans: explicit, measurable, and occasionally misleading
Postgres gives you an execution tree with costs, row estimates, and (with ANALYZE) actual timing and row counts per node. This is gold. It also tempts people into treating the plan as deterministic truth. It’s not.
In Postgres, the most valuable comparison is estimated rows vs actual rows. When those diverge sharply, the optimizer is making decisions with bad inputs: stale stats, correlations it can’t model, parameter selectivity, or data skew. That’s where the plan becomes a breadcrumb trail to the real bottleneck.
Another Postgres superpower is BUFFERS. It tells you whether the query is hitting cached pages or pulling from storage. “Slow query” plus lots of shared read buffers is an I/O story. “Slow query” with few reads but high CPU is a compute story (or a lock story hiding as compute).
MariaDB plans: pragmatic, with optimizer internals available if you ask
MariaDB’s traditional EXPLAIN output is compact: join order, chosen index, and flags like “Using where,” “Using temporary,” “Using filesort.” Those flags are not moral judgments. They’re implementation hints. “Using filesort” does not mean “it’s sorting on disk.” It means it’s using the filesort algorithm, which may or may not spill.
For serious work, prefer EXPLAIN FORMAT=JSON and—when you need to know why a plan was chosen—OPTIMIZER_TRACE. MariaDB can show you the considered alternatives and the cost decisions. That’s how you catch cases where the engine picked a plan because it believed a filter was selective when it wasn’t.
What both engines share: the usual suspects
In both MariaDB and Postgres, most “plan problems” fall into a short list:
- Bad cardinality estimates: skew, correlation, stale stats, parameter-dependent selectivity.
- Wrong access path: index scan when it should be a sequential scan (yes, really), or vice versa.
- Join strategy mismatch: nested loops into a big table, hash join spilling, merge join sorting too much.
- Sort/hash spill: memory settings too low or concurrency too high.
- Locking: “slow query” that’s not slow—just blocked.
- Storage latency: plans look fine; the disk is not.
Practical tasks: commands, outputs, and decisions (12+)
These are the field-tested moves. Each task includes a runnable command, an example output snippet, what it means, and what decision you make next. Use them like a toolbox, not a religion.
Task 1 (Postgres): find the sessions and whether they’re waiting
cr0x@server:~$ psql -d appdb -c "SELECT pid, usename, state, wait_event_type, wait_event, now()-query_start AS age, left(query,120) AS q FROM pg_stat_activity WHERE datname='appdb' AND state<>'idle' ORDER BY age DESC LIMIT 10;"
pid | usename | state | wait_event_type | wait_event | age | q
------+--------+--------+-----------------+------------+----------+----------------------------------------------------------
8421 | app | active | Lock | relation | 00:01:12 | UPDATE orders SET status='paid' WHERE id=$1
9110 | app | active | IO | DataFileRead | 00:00:43 | SELECT ... FROM order_items WHERE order_id=$1
Meaning: One query is blocked on a relation lock, another is waiting on a read.
Decision: If it’s Lock, stop tuning the query and find the blocker. If it’s IO, validate storage latency and buffer reads before indexing.
Task 2 (Postgres): identify blockers and victims
cr0x@server:~$ psql -d appdb -c "SELECT blocked.pid AS blocked_pid, blocking.pid AS blocking_pid, blocked.query AS blocked_query, blocking.query AS blocking_query FROM pg_locks blocked_locks JOIN pg_stat_activity blocked ON blocked.pid = blocked_locks.pid JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking ON blocking.pid = blocking_locks.pid WHERE NOT blocked_locks.granted;"
blocked_pid | blocking_pid | blocked_query | blocking_query
------------+--------------+-------------------------------+----------------------------------
8421 | 8333 | UPDATE orders SET status=... | ALTER TABLE orders ADD COLUMN...
Meaning: DDL is blocking DML. This is not a query-plan issue.
Decision: Stop the DDL or move it to a safer method/window. Then revisit the original query if it’s still slow.
Task 3 (Postgres): get a measured plan with buffer activity
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM order_items WHERE order_id=12345;"
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using order_items_order_id_idx on public.order_items (cost=0.43..42.10 rows=12 width=128) (actual time=0.085..12.611 rows=9800 loops=1)
Output: id, order_id, sku, qty, price
Index Cond: (order_items.order_id = 12345)
Buffers: shared hit=12 read=8102
Planning Time: 0.214 ms
Execution Time: 13.004 ms
Meaning: The index is used, but the estimate (12 rows) is nowhere near the actual (9,800), and the scan reads thousands of pages from disk. The predicate is not very selective, or the table is bloated/unclustered.
Decision: Consider a covering index (include columns), or reduce selected columns, or address table bloat. Also check I/O latency.
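A minimal sketch of the covering-index option, assuming Postgres 11+ and the order_items columns shown above (the index name is hypothetical). Index-only scans also depend on the visibility map, so vacuum health still matters:
-- Hypothetical covering index so the lookup can return sku/qty/price without heap fetches.
CREATE INDEX CONCURRENTLY order_items_order_id_covering
    ON order_items (order_id) INCLUDE (sku, qty, price);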
Task 4 (Postgres): compare estimated vs actual rows to spot stats problems
cr0x@server:~$ psql -d appdb -c "EXPLAIN (ANALYZE) SELECT * FROM users WHERE email LIKE '%@example.com';"
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..18250.00 rows=100 width=256) (actual time=0.040..210.221 rows=48210 loops=1)
Filter: (email ~~ '%@example.com'::text)
Rows Removed by Filter: 151790
Planning Time: 0.251 ms
Execution Time: 213.904 ms
Meaning: The estimate (100) is wildly wrong (48k). Also, a leading wildcard disables a normal B-tree index.
Decision: If this query matters, redesign the predicate (store domain separately) or use a trigram index (Postgres extension-based). Don’t blame “seq scan” here; blame the predicate.
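If the predicate really must stay a leading-wildcard LIKE, a trigram index is one hedged option. This sketch assumes the pg_trgm extension is available and that the extra index size and write cost are acceptable (the index name is hypothetical):
-- pg_trgm GIN index; it can serve LIKE '%...%' predicates that a plain B-tree cannot.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY users_email_trgm
    ON users USING gin (email gin_trgm_ops);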
Task 5 (Postgres): check whether temp files (spills) are happening
cr0x@server:~$ psql -d appdb -c "SELECT datname, temp_files, temp_bytes FROM pg_stat_database WHERE datname='appdb';"
datname | temp_files | temp_bytes
---------+------------+------------
appdb | 1821 | 987654321
Meaning: The database is spilling to disk for sorts/hashes, or large intermediate results. Often a work_mem and concurrency story.
Decision: Find the queries doing big sorts/hashes; tune memory per session carefully, or rewrite to reduce rows earlier. Don’t just crank work_mem globally and pray.
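One way to tune memory narrowly instead of globally is to scope a higher work_mem to the single transaction that needs it. A sketch, assuming you can wrap the offending query in its own transaction; the value is illustrative, not a recommendation:
BEGIN;
SET LOCAL work_mem = '256MB';  -- applies only inside this transaction; each sort/hash node may use up to this much
-- run the one big sort/aggregate query here
COMMIT;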
Task 6 (Postgres): see the biggest time consumers (requires pg_stat_statements)
cr0x@server:~$ psql -d appdb -c "SELECT left(query,100) AS q, calls, total_exec_time, mean_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
q | calls | total_exec_time | mean_exec_time | rows
---------------------------------------------------+-------+-----------------+----------------+-------
SELECT ... FROM order_items WHERE order_id=$1 | 92000 | 812345.12 | 8.83 | 901234
SELECT ... FROM users WHERE email LIKE $1 | 5000 | 401122.88 | 80.22 | 48210
Meaning: Not all slow queries matter. Total time shows you what burns your budget.
Decision: Start with top total time (and p95/p99 from your APM). Optimize where it moves the needle, not where it looks embarrassing.
Task 7 (MariaDB): find active threads and what they’re waiting on
cr0x@server:~$ mariadb -e "SHOW FULL PROCESSLIST;"
Id User Host db Command Time State Info
231 app 10.0.2.41:51012 appdb Query 65 Waiting for table metadata lock ALTER TABLE orders ADD COLUMN note TEXT
244 app 10.0.2.77:52910 appdb Query 42 Statistics SELECT ... FROM order_items WHERE order_id=12345
Meaning: Metadata lock is blocking others. Also note “Statistics” or other states that can indicate optimizer work or storage engine behavior.
Decision: Resolve the metadata lock first. Then measure query execution; plan tweaks won’t fix blocked threads.
Task 8 (MariaDB): show the plan in JSON (more detail than tabular EXPLAIN)
cr0x@server:~$ mariadb -e "EXPLAIN FORMAT=JSON SELECT * FROM order_items WHERE order_id=12345\G"
EXPLAIN
{
"query_block": {
"select_id": 1,
"table": {
"table_name": "order_items",
"access_type": "ref",
"possible_keys": ["order_id_idx"],
"key": "order_id_idx",
"key_length": "8",
"rows": 12000,
"filtered": 100,
"attached_condition": "(`order_items`.`order_id` = 12345)"
}
}
}
Meaning: MariaDB expects ~12k rows. If actual rows are 9.8k, fine. If actual rows are 900k, stats are lying or the predicate isn’t what you think.
Decision: Validate row counts with a targeted query and refresh stats if needed.
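A quick sketch of that validation and refresh, reusing the order_items example above; COUNT confirms reality, ANALYZE TABLE refreshes the statistics the optimizer relied on:
SELECT COUNT(*) FROM order_items WHERE order_id = 12345;  -- ground truth for the estimate above
ANALYZE TABLE order_items;                                -- refresh table/index statistics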
Task 9 (MariaDB): enable optimizer trace to see why the plan was chosen
cr0x@server:~$ mariadb -e "SET optimizer_trace='enabled=on'; SELECT * FROM order_items WHERE order_id=12345; SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G"
TRACE
{
"steps": [
{
"join_optimization": {
"table_dependencies": [
{"table": "order_items", "row_may_be_null": false}
],
"rows_estimation": [
{"table": "order_items", "range_analysis": {"chosen_range_access_summary": "ref on order_id_idx"}}
]
}
}
]
}
Meaning: You can see alternatives considered. This is how you prove “the optimizer believed X.”
Decision: If the optimizer’s choice is driven by a wrong estimate, fix stats/distribution (histograms) or rewrite. Don’t start force-indexing as a first move.
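If the bad estimate comes from a skewed column, MariaDB’s engine-independent statistics (including histograms) are the usual lever. A sketch, assuming a recent MariaDB where histogram collection is governed by histogram_size/histogram_type and use_stat_tables lets the optimizer read them; verify those settings for your version first:
-- Collect engine-independent statistics (histograms included) for the table.
ANALYZE TABLE order_items PERSISTENT FOR ALL;
-- On large tables you can restrict collection to specific columns/indexes instead of ALL.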
Task 10 (MariaDB): check handler reads to spot table scans and random I/O
cr0x@server:~$ mariadb -e "SHOW SESSION STATUS LIKE 'Handler_read%';"
Variable_name Value
Handler_read_first 0
Handler_read_key 12045
Handler_read_next 980000
Handler_read_rnd_next 4500000
Meaning: Lots of Handler_read_rnd_next often implies full table scans. Lots of read_next can indicate range scans walking many rows.
Decision: If scans are large, reduce rows earlier (better predicates, better index), or accept the scan if it’s cheaper than random I/O on a bad index path.
Task 11 (System): confirm whether the host is I/O-bound
cr0x@server:~$ iostat -x 1 3
Linux 6.1.0 (db01) 12/30/2025 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 6.20 31.50 0.00 50.20
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 820.0 240.0 65440.0 18200.0 18.2 0.8 98.0
Meaning: High %iowait, high await, and %util near 100%: storage is saturated. Your plan may be “fine” and still slow.
Decision: Reduce reads (indexes/covering indexes, fewer columns, better filters), fix bloat, or add IOPS. If you keep tuning SQL while the disk is pegged, you’re rearranging furniture during a fire drill.
Task 12 (System): check CPU run queue and saturation
cr0x@server:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 512000 12000 980000 0 0 200 500 1200 2400 55 10 30 5 0
9 0 0 508000 12000 978000 0 0 180 420 1180 2500 58 12 25 5 0
Meaning: Run queue r near or above CPU count and high us means CPU saturation. Low wa means it’s not primarily I/O wait.
Decision: Look for expensive functions, too many parallel workers (Postgres), hot loops, or missing indexes causing CPU-heavy filtering. Also check for compression/crypto overhead and connection storms.
Task 13 (Postgres): check autovacuum/vacuum health and bloat signals
cr0x@server:~$ psql -d appdb -c "SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5;"
relname | n_dead_tup | n_live_tup | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
------------+------------+------------+-------------+----------------------+---------------------+------------------------
order_items | 812345 | 9023410 | | 2025-12-30 09:12:01 | 2025-12-30 08:05:10 | 2025-12-30 09:12:10
Meaning: Lots of dead tuples: bloat risk, more pages to read, worse cache behavior, slower index scans.
Decision: Tune autovacuum for hot tables, consider VACUUM (or rewrite strategy) if you’re falling behind. Query plans won’t rescue you from a table that’s effectively twice its useful size.
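For a hot table, per-table autovacuum settings are usually safer than global changes. A sketch, with thresholds that are illustrative rather than a recommendation, assuming the order_items table above:
ALTER TABLE order_items SET (
    autovacuum_vacuum_scale_factor  = 0.02,  -- vacuum after ~2% dead tuples instead of the default 20%
    autovacuum_analyze_scale_factor = 0.01   -- analyze more often so estimates track churn
);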
Task 14 (MariaDB/InnoDB): check buffer pool pressure and reads
cr0x@server:~$ mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'; SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';"
Variable_name Value
Innodb_buffer_pool_read_requests 987654321
Innodb_buffer_pool_reads 12345678
Variable_name Value
Innodb_buffer_pool_pages_total 1048576
Innodb_buffer_pool_pages_free 1024
Meaning: Very low free pages and non-trivial physical reads can mean the working set doesn’t fit. Reads will hit disk under load.
Decision: Reduce working set (drop unused indexes, narrow selects), increase buffer pool if safe, or fix the query to stop touching so many pages.
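If you do conclude the buffer pool is undersized and the host has headroom, recent MariaDB versions allow an online resize; a sketch, assuming the new size fits in RAM alongside everything else (resizing happens in chunks and can briefly stall):
-- Online resize; also persist the new value in server configuration so a restart keeps it.
SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024;  -- 48 GiB, illustrative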
Task 15 (Postgres): test whether prepared statement plan caching is hurting you
cr0x@server:~$ psql -d appdb -c "SHOW plan_cache_mode;"
plan_cache_mode
-----------------
auto
Meaning: In auto, Postgres may pick a generic plan for prepared statements when it thinks it’s cheaper overall.
Decision: If you suspect parameter sensitivity, test with SET plan_cache_mode = force_custom_plan; in a session and compare performance. If it helps, fix at the app/driver level or query structure.
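A sketch of that comparison in a scratch session, reusing the order_items example; under auto, Postgres only considers a generic plan after a few executions, so forcing each mode makes the difference visible immediately:
PREPARE get_items (bigint) AS
    SELECT * FROM order_items WHERE order_id = $1;

SET plan_cache_mode = force_custom_plan;    -- replan for each parameter value
EXPLAIN (ANALYZE, BUFFERS) EXECUTE get_items(12345);

SET plan_cache_mode = force_generic_plan;   -- one generic plan regardless of parameter
EXPLAIN (ANALYZE, BUFFERS) EXECUTE get_items(12345);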
Three corporate mini-stories from the trenches
1) Incident caused by a wrong assumption: “Seq Scan is always bad”
A mid-sized company ran a Postgres-backed analytics page. The on-call got a page: the endpoint’s p95 went from “fine” to “a problem,” and the database CPU spiked. The developer posted an EXPLAIN showing a sequential scan over a moderately sized table and declared victory: “We need an index.”
They added a composite index matching the WHERE clause and ORDER BY. Writes got slower immediately. The query got faster in a quiet staging run. In production, the p95 barely improved, and CPU stayed hot. Meanwhile, the index build itself competed for I/O and made the incident uglier.
The real issue wasn’t the scan. It was concurrency: after a deploy, the endpoint started firing twice per page view due to a frontend regression. The query was doing a perfectly reasonable sequential scan because the table was small enough, the predicate wasn’t selective, and the cached pages made scanning cheap—until it wasn’t cached anymore under the doubled load.
Fixing the frontend bug cut QPS in half. The “performance fix” index became technical debt: it increased write amplification and autovacuum work, and it wasn’t necessary. They eventually dropped it once the incident postmortem stopped being emotionally attached to the seq scan.
2) Optimization that backfired: forcing the plan to behave
Another team ran MariaDB for a customer-facing API. A query occasionally went pathological after large imports. Someone used FORCE INDEX to coerce the optimizer into using what looked like the “right” index in EXPLAIN. Benchmarks improved. Everybody relaxed.
Two months later, a new feature added a predicate, and the forced index became the wrong index. The query now performed a huge range scan, then filtered most rows. The optimizer wanted to switch to a different key, but couldn’t. Latency spiked, and the API started timing out under peak traffic.
The incident was extra fun because the plan looked “stable.” Stability can be a trap: you locked in a decision that was only correct for the old distribution and old query shape. In postmortem, the team removed the forced index hint and instead fixed the stats and index design. They also added a regression test on query timing using representative data, because “works on last month’s data” is not a strategy.
3) Boring but correct practice that saved the day: measured rollouts + baselines
A large internal platform had both Postgres and MariaDB clusters. They’d been burned by plan regressions before, so they did something aggressively unglamorous: for every performance-sensitive query, they kept a baseline of (a) normalized query text, (b) expected row counts, (c) the plan shape, and (d) latency under a fixed dataset snapshot.
During a routine Postgres upgrade, a query changed from an index-based nested loop to a hash join. Latency doubled in staging. Nobody panicked, because they had a diff: the new plan spilled to disk because memory behaved differently under higher parallelism.
They adjusted work_mem for that role, constrained parallel workers for the session, and validated with EXPLAIN (ANALYZE, BUFFERS) that spills disappeared. The upgrade rolled out without drama. The best part: nobody had to argue from vibes. The plan regression was caught before users did.
Joke #2: Nothing says “enterprise” like spending a week to avoid spending a weekend.
Bottleneck patterns: how to spot the real limiter
1) Lock waits masquerading as slow queries
If a query is slow because it’s blocked, your plan is mostly irrelevant. In Postgres, look at wait_event_type. In MariaDB, check processlist states like metadata locks. The classic trap is tuning an UPDATE that’s “slow” because it’s waiting behind a long transaction, a DDL, or a batch job that forgot to commit.
Root cause is often transaction scope. The fix is usually not an index. It’s smaller transactions, different locking behavior, or scheduling DDL safely.
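One small guardrail for the DDL case, assuming Postgres: cap how long the DDL will wait for its lock so it fails fast instead of queueing and blocking everything behind it. The column is from the earlier example; retry off-peak if it times out:
SET lock_timeout = '2s';                    -- give up quickly instead of queueing behind a long transaction
ALTER TABLE orders ADD COLUMN note text;    -- still needs a brief exclusive lock, but won't wait forever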
2) I/O-bound execution: the plan is fine, the storage is not
You can have a perfect index scan that still reads thousands of pages because the data is scattered and not in cache. Plans show what should happen; BUFFERS and system metrics show what did happen.
I/O-bound patterns:
- Postgres plan shows high shared read buffers, modest CPU time.
- Host shows high iowait, high await, high device util.
- Latency worsens with concurrency (queueing on the device).
Fixes include covering indexes, reducing selected columns, partitioning for locality, vacuuming bloat, or admitting you need more IOPS. Sometimes the best optimization is buying the storage you thought you already had.
3) CPU-bound execution: expensive predicates and row-by-row pain
CPU-bound queries often involve:
- Functions on columns in WHERE: they prevent normal index use, and in Postgres only immutable functions can be backed by an expression index.
- JSON parsing, regex, LIKE with wildcards, collation-heavy comparisons.
- Join explosion due to wrong join order or underestimated cardinality.
Plans help you identify where the rows blow up. But don’t miss the non-obvious CPU consumers: collation rules, decompression, or per-row casts from mismatched types.
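A typical example of that per-row cast cost, written so it works in both engines; the orders table is from earlier examples, while the created_at column and the date literal are hypothetical:
-- Non-sargable: the cast runs per row and defeats a plain index on created_at.
SELECT id FROM orders WHERE CAST(created_at AS DATE) = '2025-12-30';

-- Sargable rewrite: a range predicate on the raw column can use a B-tree index.
SELECT id FROM orders
WHERE created_at >= '2025-12-30' AND created_at < '2025-12-31';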
4) Memory pressure: sorts/hashes spill, then everything looks “randomly slow”
In Postgres, spills show up as temp files and in node-level details for sorts/hashes in ANALYZE output. In MariaDB, “Using temporary” and “Using filesort” are hints, but you still need to confirm whether it spilled. Memory settings are per-thread/per-query; concurrency multiplies them. This is where naïve tuning causes outages.
5) Cardinality lies: when estimates are fiction
If estimated rows differ from actual by 10× or 100×, treat that as a primary diagnosis. It leads to bad join choices, bad index choices, and unpredictable performance.
Common causes:
- Stale stats (bulk loads, churn, skew).
- Correlation between columns (optimizer assumes independence).
- Parameter selectivity (prepared statements, ORMs).
- Non-sargable predicates (functions, casts).
Fixing stats is not glamorous, but it’s often the correct first move because it improves every query that depends on them.
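For the column-correlation case above, Postgres extended statistics are the boring-but-correct lever. A sketch, assuming a hypothetical pair of correlated columns (region, status) on orders:
-- Tell the planner these columns are correlated instead of letting it assume independence.
CREATE STATISTICS orders_region_status (dependencies, ndistinct)
    ON region, status FROM orders;
ANALYZE orders;  -- extended statistics are populated by ANALYZE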
Common mistakes (symptom → root cause → fix)
“We see Seq Scan, so it must be the problem”
Symptom: Postgres shows Seq Scan; someone wants an index.
Root cause: Predicate is not selective, or index would cause random I/O worse than scanning, or query returns a large portion of the table.
Fix: Measure with EXPLAIN (ANALYZE, BUFFERS). If most pages are touched anyway, accept the seq scan or redesign the query. Indexing is not a virtue.
“Using filesort means it’s sorting on disk”
Symptom: MariaDB EXPLAIN shows Using filesort, query slow.
Root cause: It’s using the filesort algorithm; it may be in-memory, or it may spill because memory is insufficient or result set is large.
Fix: Confirm with server status and tmp table metrics, reduce result set, add a composite index matching ORDER BY + WHERE where appropriate, and avoid sorting huge intermediate results.
“We added the index and it got worse”
Symptom: Reads improved slightly, writes and replication lag worsened.
Root cause: Write amplification and extra maintenance (vacuum/autovacuum, page splits, redo/undo) outweighed the read gain; also possible plan regression using the new index incorrectly.
Fix: Validate with workload-aware testing. Prefer narrower, covering indexes; drop unused indexes; consider partial indexes (Postgres) for hot subsets.
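A sketch of the partial-index option, assuming Postgres and that the hot subset really is a small, stable slice of the table (the index name, column, and status values are illustrative):
-- Index only the rows the hot query touches; writes to other rows skip this index entirely.
CREATE INDEX CONCURRENTLY orders_open_created_idx
    ON orders (created_at)
    WHERE status IN ('new', 'paid');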
“The plan didn’t change, but it’s slower today”
Symptom: Same plan shape, higher latency.
Root cause: Storage latency changed, cache churn increased, bloat increased, concurrency increased, or checkpoint pressure.
Fix: Check iostat/vmstat, buffer reads vs hits, autovacuum and dead tuples, and concurrency. Treat it as a system issue first.
“It’s fast in staging but slow in prod”
Symptom: Benchmark looks great; production screams.
Root cause: Different data distribution, missing skew, different cache warmth, different concurrency, different settings (work_mem/tmp_table_size), different storage.
Fix: Use representative data snapshots, run with concurrency, capture real plans with ANALYZE/JSON, and compare buffer reads and temp usage.
“We force the optimizer and now it’s stable”
Symptom: Index hint / optimizer switch “fixes” performance.
Root cause: You froze a choice that depends on data distribution; future changes make it wrong.
Fix: Fix stats and schema design; keep hints as last resort with monitoring and a plan to remove them.
“Adding RAM will fix it”
Symptom: High latency, someone blames cache.
Root cause: Sometimes true; often it’s bad queries doing unnecessary work, or lock contention, or IOPS saturation.
Fix: Confirm with buffer metrics and I/O wait. Add RAM only when the working set is proven to not fit and the access pattern benefits.
Checklists / step-by-step plan
Step-by-step triage (15–30 minutes)
- Pick the right target: identify top total time or top p95/p99 queries (pg_stat_statements / slow log).
- Check waiting vs working: Postgres pg_stat_activity; MariaDB SHOW FULL PROCESSLIST.
- Check system saturation: iostat -x, vmstat.
- Capture a measured plan: Postgres EXPLAIN (ANALYZE, BUFFERS); MariaDB EXPLAIN FORMAT=JSON plus handler reads and trace if needed.
- Compare estimates: if estimates are wrong, fix stats first (ANALYZE, histograms).
- Check for spills: Postgres temp files; MariaDB tmp tables and sort behavior via status counters.
- Decide lever: query rewrite, index design, stats/maintenance, or infrastructure/concurrency.
- Validate with a rollback plan: measure before/after, confirm no regression on writes, replication, and storage.
Index decision checklist (don’t guess)
- Does the predicate reduce rows significantly? If not, index may hurt more than help.
- Does the query need ordering/limit? Composite indexes can turn sorts into index order scans.
- Are you selecting many columns? Consider covering indexes (Postgres INCLUDE; MariaDB secondary index includes PK in InnoDB, but not all columns).
- Is the query parameter-sensitive? Avoid assumptions based on one “typical” value.
- What’s the write cost? Consider insert/update rate, not just read latency.
Stats and maintenance checklist
- Postgres: are autovacuum and analyze keeping up on the hottest tables?
- Postgres: do you have high dead tuples and low last_autovacuum recency?
- MariaDB: do you need histograms for skewed columns?
- Both: did data distribution change (imports, backfills, tenant growth)?
FAQ
1) Which is easier to read: MariaDB EXPLAIN or Postgres EXPLAIN?
Postgres, because it gives you actual timing and row counts per node with ANALYZE. MariaDB is readable too, but you often need JSON format and additional status counters to confirm reality.
2) Should I always run EXPLAIN ANALYZE in production?
No. It executes the query. For safe production debugging, use it on a replica or a controlled sample. When you must run it, wrap with limits and tight predicates, and do it off-peak.
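If the statement you need to measure is a write, one common precaution (a sketch, assuming Postgres) is to run EXPLAIN ANALYZE inside a transaction you roll back; the work still happens, including locks and WAL, but the data change is discarded:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
    UPDATE orders SET status = 'paid' WHERE id = 12345;
ROLLBACK;  -- the UPDATE executed for measurement, but its changes are thrown away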
3) Why does Postgres choose a seq scan even when there’s an index?
Because scanning can be cheaper than random I/O when a large fraction of the table matches, or when the table is small, or when cached pages make scanning fast. The optimizer is often right.
4) In MariaDB, what does “Using temporary” actually mean?
It means an internal temporary table is used for grouping, sorting, or intermediate results. It might be in memory or on disk depending on size and settings. Treat it as “there’s an intermediate result” and measure whether it spills.
5) How do I detect a plan regression after a deploy?
Compare normalized query fingerprints and measured plans before/after. In Postgres, capture EXPLAIN (ANALYZE, BUFFERS) and check estimate accuracy. In MariaDB, compare JSON plan plus handler reads and execution time under representative load.
6) When is “add an index” the right answer?
When the predicate is selective, the query is frequent enough to justify the write cost, and the index supports an access pattern you can’t get otherwise (filtering, ordering, join key). If it’s a batch query run once a day, your “index fix” might be a tax forever.
7) Why did my optimization backfire under concurrency?
Because per-query memory multiplies across sessions, because cache warms differently, and because I/O queueing is non-linear. A plan that looks fast at concurrency 1 can collapse at concurrency 200.
8) What’s the fastest way to tell if it’s storage?
Look at host metrics: iostat -x await/util, plus database-level evidence: Postgres buffer reads and wait events; InnoDB physical reads and buffer pool pressure. If the device is saturated, query tuning is limited without reducing reads.
9) MariaDB or Postgres: which optimizer is “better”?
Neither is magically better; both are strong and both have blind spots. Postgres is easier to validate because it exposes more runtime truth in the plan. MariaDB gives you optimizer trace that can be very revealing when you need the “why.”
Conclusion: practical next steps
If you take one operational habit from this: stop treating plans as a to-do list. Start treating them as a hypothesis you confirm with waits, buffers, and system saturation. The fastest way to fix performance is to fix the right thing.
Next steps you can do this week:
- Enable and use query-level baselining (pg_stat_statements for Postgres; slow log + digesting for MariaDB) so you optimize what costs you the most.
- Adopt the fast diagnosis playbook: waiting vs working, then contention, then measured plans.
- Add a “no unreviewed indexes” policy: every index needs a measured win and a write-cost check.
- For your top 5 queries, capture a known-good plan and a measured runtime profile (buffers/temp usage). Re-check after schema changes and upgrades.
Your future self, on-call at 02:00, will appreciate it. Not emotionally. Practically.