Deadlocks are the kind of database problem that shows up exactly when you’re already juggling a pager, a CEO “just checking in,”
and a dashboard full of red. Queries stall, workers pile up, and suddenly your “highly available” system is highly available for
taking complaints.
The practical question isn’t “what is a deadlock.” You already know. The question is: when production is on fire, which engine gives you
better clues faster—MySQL or MariaDB—and what should you do first, second, and third to stop the bleeding.
The blunt verdict: who’s easier to debug
If you’re running modern MySQL (8.0+), MySQL is usually easier to debug for deadlocks and lock waits in the moment,
because Performance Schema is a more consistent, richer source of truth: who holds which lock, who waits, which SQL text,
which indexes, which transactions. It’s not perfect, but it’s operationally cohesive. You can build muscle memory and reliable runbooks.
MariaDB can be perfectly debuggable too—especially if you’re disciplined about enabling the right instrumentation and logs—but
in practice it’s easier to end up with partial observability: enough to know you’re stuck, not enough to know why.
Also: MariaDB’s ecosystem has more variance (different forks/engines/replication layers like Galera are more common),
which means your “standard” deadlock story is less standard.
Here’s the practical rule I use:
- If you already have Performance Schema-based dashboards and runbooks, MySQL 8.0 tends to win on speed-to-root-cause.
- If you’re on MariaDB with Galera, you’re not only debugging InnoDB deadlocks—you’re often debugging certification conflicts, which look like deadlocks to the app and feel worse.
- If you rely on “SHOW ENGINE INNODB STATUS” only, both will betray you eventually. It shows you the latest deadlock after the fact; that’s an incident artifact, not a monitoring strategy.
Opinionated guidance: for teams that don’t already have mature database observability, choose the platform that makes it easiest to get
answers with the fewest bespoke hacks. That’s usually MySQL 8.0 with Performance Schema, plus sane logging.
One quote that belongs in every ops brain:
“Hope is not a strategy.”
—General Gordon R. Sullivan
Joke #1: A deadlock is just two transactions having a meeting and agreeing to never compromise.
Interesting facts and historical context (the bits that matter at 03:00)
- MariaDB was forked from MySQL in 2009, when Oracle’s acquisition of Sun (and therefore MySQL) was announced. That fork decision still echoes in tooling and defaults.
- InnoDB became the default engine in MySQL 5.5. Older systems that “grew up” pre-5.5 often carry MyISAM-era habits that cause modern lock pain.
- MySQL 8.0 made Performance Schema the center of gravity for instrumentation. That’s why it’s the first place modern MySQL runbooks land.
- MariaDB historically shipped XtraDB in place of the upstream InnoDB (and switched back to InnoDB in 10.2). Depending on version, you’ll see different variable names and behaviors around instrumentation.
- Deadlocks are a sign of correctness, not failure: InnoDB detects cycles and rolls back a victim to keep the system moving. The failure is when your app can’t retry safely.
- Gap locks and next-key locks are a frequent “surprise deadlock” source under REPEATABLE READ. Many teams only learn they exist after an incident.
- Galera-style replication (common with MariaDB deployments) can return conflict errors that mimic deadlocks at the application layer, even without a classic InnoDB deadlock cycle.
- MySQL’s sys schema (a helper view layer over Performance Schema) made “what’s happening right now” queries easier for humans; it changed how operators work under pressure.
What deadlocks look like in real production systems
The two failure modes operators confuse (and shouldn’t)
In incident chat, you’ll hear “deadlock” used for two different things:
- Deadlock (cycle): InnoDB detects a cycle and aborts one transaction quickly. You see error 1213: Deadlock found when trying to get lock.
- Lock wait timeout (queue): There’s no cycle, just a long wait. Eventually the waiter gives up with error 1205: Lock wait timeout exceeded.
Operationally, these require different instincts. A true deadlock is often a pattern (two code paths taking locks in different orders).
A lock wait timeout is often a hot row / hot index / long transaction problem, sometimes combined with missing indexes.
Both can happen at the same time during an outage, which is how you end up yelling at the wrong graph.
Why “but it’s just two UPDATEs” is a trap
InnoDB doesn’t lock “rows” the way developers imagine. It locks index records, may lock gaps between them,
and does so depending on isolation level and access path. If you’re scanning an index range, you can lock far more than you meant.
If you’re updating a secondary index, you might take locks you didn’t realize existed.
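A quick way to see this for yourself on MySQL 8.0 (a minimal sketch; the orders table, the column names, and the date range are illustrative, and the exact locks depend on version, isolation level, and access path):
-- session 1: hold a range UPDATE open under REPEATABLE READ
START TRANSACTION;
UPDATE orders SET status = 'review' WHERE created_at BETWEEN '2025-12-01' AND '2025-12-02';
-- session 2: count what session 1 actually locked (filter by ENGINE_TRANSACTION_ID on a busy server)
SELECT INDEX_NAME, LOCK_TYPE, LOCK_MODE, COUNT(*) AS lock_count
FROM performance_schema.data_locks
GROUP BY INDEX_NAME, LOCK_TYPE, LOCK_MODE;
-- expect RECORD locks, often in GAP / next-key modes, frequently on far more entries than the rows you meant to change
The same experiment with an equality lookup by primary key usually shows a much smaller, more predictable lock footprint.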
What makes deadlocks painful under load
- Retry storms: your app retries deadlocked transactions aggressively, making lock contention worse.
- Long transactions: a single slow transaction holds locks longer, raising the probability of cycles.
- Queueing collapse: threads pile up waiting for locks; latency goes nonlinear; pool exhaustion follows.
- Replication lag: on async replicas, long-running transactions and lock waits blow up apply time; on semi-sync, they can slow commits.
Joke #2: If you want two teams to align on lock ordering, tell them it’s mandatory and schedule a meeting—deadlock guaranteed.
Telemetry and tooling: MySQL vs MariaDB under pressure
MySQL’s edge: one instrumentation story to rule them all
MySQL 8.0’s Performance Schema is where you go to answer “what is blocking what” without guessing. The key operational advantage
is consistency: same tables, same join paths, same mental model across hosts. When the site is burning, consistency is speed.
You can usually get:
- current waiters and blockers (performance_schema.data_lock_waits)
- lock inventory (performance_schema.data_locks)
- transaction metadata (information_schema.innodb_trx)
- SQL text and statement digests (events_statements_current, digest summaries)
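If the sys schema is installed (it ships with MySQL 8.0), there is also a human-friendly shortcut over the same data. A minimal sketch; column names follow the stock 8.0 sys schema:
-- one row per blocked statement, with the blocker and a ready-made KILL suggestion
SELECT waiting_pid, waiting_query, blocking_pid, blocking_query, wait_age, sql_kill_blocking_query
FROM sys.innodb_lock_waits;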
MariaDB: can be good, but watch the version and the engine path
MariaDB has Performance Schema too, but operationally it’s more common to find it disabled, partially enabled, or less relied on.
MariaDB also leans on Information Schema views like INFORMATION_SCHEMA.INNODB_LOCK_WAITS in many shops (where available),
and of course the old warhorse: SHOW ENGINE INNODB STATUS.
If your MariaDB estate includes Galera, you need an extra mental model: not every “transaction aborted” is an InnoDB deadlock.
Some are replication-level write set conflicts. Debuggability depends on whether you collect those stats and map them to application behavior.
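If you suspect certification conflicts rather than InnoDB deadlocks, check the wsrep counters first (a sketch; these status variables assume a Galera-enabled build):
SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_cert_failures', 'wsrep_local_bf_aborts');
If those counters climb during the “deadlock” storm while InnoDB’s own deadlock reports stay quiet, you are looking at replication-level conflicts, not lock cycles.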
Logging differences that decide whether you find the culprit in minutes or hours
Deadlock logs are only as good as your configuration. In both MySQL and MariaDB, you want:
- deadlock logging enabled (InnoDB prints details to the error log)
- timestamps + thread IDs aligned with your app logs
- slow log and/or statement digest data so you can connect the deadlock to the query shape, not just a one-off SQL snippet
Real talk: a deadlock report without identifying the code path is just an expensive fortune cookie.
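A minimal starting point, using stock variable names (adjust thresholds to your environment, and persist the settings in your config so a restart doesn’t silently undo them):
-- log every deadlock to the error log, not just the latest one
SET GLOBAL innodb_print_all_deadlocks = ON;
-- capture slow statements so deadlock reports can be tied to a query shape
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;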
Fast diagnosis playbook
This is the “stop guessing” sequence. It’s intentionally short. Under load, the first priority is identifying:
(1) whether you have a deadlock storm or a lock-wait queue, (2) which tables/indexes are hot, (3) which transaction is the bully.
First: confirm what kind of pain you’re in
- Check error rates in the app: are you seeing 1213 (deadlock) or 1205 (timeout)?
- Check InnoDB status counters: are lock waits rising fast? Is history list length growing (long transactions)?
- Confirm thread pool / connection pool health: are threads stuck in “Waiting for table metadata lock” or row lock waits?
Second: find the blocker(s) and the hot objects
- List lock waits and identify blocking transaction IDs.
- Map blockers to SQL text and client/user/host.
- Identify the table and index involved. If it’s a secondary index range scan, assume the access path is the bug.
Third: choose a containment action
- Kill the single worst blocker transaction if it’s safe (often a long batch job or a stuck maintenance query).
- Temporarily reduce concurrency of the offending endpoint/worker type.
- Adjust retry behavior: exponential backoff with jitter, and cap retries. Don’t create a retry DDoS (see the shell sketch below).
Only after containment do you do the “pretty” work: indexing, query rewrite, lock ordering, isolation level review.
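For the retry piece, the shape matters more than the tool. A minimal shell sketch of “retry only lock errors, back off, give up”, assuming the statement is idempotent and credentials come from ~/.my.cnf (the SQL itself is a placeholder):
cr0x@server:~$ cat retry_update.sh
#!/usr/bin/env bash
# Retry only on deadlock (1213) / lock wait timeout (1205), with capped exponential backoff plus jitter.
# Assumes credentials come from ~/.my.cnf and that the statement is safe to re-run.
SQL="UPDATE orders SET status='paid' WHERE id=778812"   # placeholder statement
trap 'rm -f /tmp/retry_err.$$' EXIT
for attempt in 1 2 3 4 5; do
  if mysql appdb -e "$SQL" 2>/tmp/retry_err.$$; then exit 0; fi
  grep -Eq 'ERROR (1213|1205)' /tmp/retry_err.$$ || { cat /tmp/retry_err.$$ >&2; exit 1; }  # don't retry other errors
  sleep "$(( (2 ** attempt) + RANDOM % 3 ))"   # backoff plus jitter
done
exit 1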
Hands-on tasks: commands, outputs, decisions (12+)
The point of these tasks isn’t to collect trivia. It’s to turn mystery into a decision: “kill this,” “throttle that,” “rewrite this query,”
“add this index,” or “change this transaction boundary.” Each task includes: command, what the output means, and what you do next.
Task 1: confirm engine/version and whether you’re even playing the same game
cr0x@server:~$ mysql -uroot -p -e "SELECT VERSION() AS version, @@version_comment AS comment, @@innodb_version AS innodb_version\G"
*************************** 1. row ***************************
version: 8.0.36
comment: MySQL Community Server - GPL
innodb_version: 8.0.36
Meaning: Version tells you which instrumentation tables exist and which defaults apply.
Decision: If you’re on MySQL 5.7 or an older MariaDB, plan for fewer options for live lock introspection; rely more on logs and SHOW ENGINE INNODB STATUS.
Task 2: check whether you’re dealing with deadlocks or timeouts at the server level
cr0x@server:~$ mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'Innodb_deadlocks'; SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_timeouts';"
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| Innodb_deadlocks | 482   |
+------------------+-------+
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| Innodb_row_lock_timeouts | 91    |
+--------------------------+-------+
Meaning: Deadlocks rising rapidly points to conflicting lock order; timeouts point to long waits and hotspots.
Decision: If Innodb_deadlocks is spiking, prioritize identifying the two SQL shapes involved. If timeouts spike, hunt the longest-running transaction and missing indexes.
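Heads-up: these two counters are exposed as status variables on MariaDB (and some other builds); if your MySQL server doesn’t show them, the equivalent counters usually live in information_schema.INNODB_METRICS:
cr0x@server:~$ mysql -uroot -p -e "SELECT NAME, COUNT, STATUS FROM information_schema.INNODB_METRICS WHERE NAME IN ('lock_deadlocks','lock_timeouts');"
If STATUS comes back disabled, the counter can be switched on with SET GLOBAL innodb_monitor_enable='lock_deadlocks';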
Task 3: capture the latest deadlock narrative (fast, but not comprehensive)
cr0x@server:~$ mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/LATEST DETECTED DEADLOCK/,+80p'
LATEST DETECTED DEADLOCK
------------------------
2025-12-29 10:41:12 0x7f1c4c1fe700
*** (1) TRANSACTION:
TRANSACTION 123456789, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 9012, OS thread handle 139759, query id 882199 appdb 10.2.3.44 appuser updating
UPDATE orders SET status='paid' WHERE id=778812
*** (2) TRANSACTION:
TRANSACTION 123456790, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
5 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 9013, OS thread handle 139760, query id 882200 appdb 10.2.3.45 appuser updating
UPDATE orders SET status='shipped' WHERE id=778812
*** WE ROLL BACK TRANSACTION (1)
Meaning: You get the victim SQL and the conflicting peer, plus index/lock details further down.
Decision: If the two statements are from the same code path, check for non-deterministic ordering or hidden reads (foreign key checks, triggers). If from different paths, align lock ordering or split the row ownership model.
Task 4 (MySQL 8.0): list current blockers and waiters via Performance Schema
cr0x@server:~$ mysql -uroot -p -e "
SELECT
w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx,
dlw.OBJECT_SCHEMA, dlw.OBJECT_NAME, dlw.INDEX_NAME,
dlw.LOCK_TYPE, dlw.LOCK_MODE, dlw.LOCK_STATUS
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks dlw
ON w.REQUESTING_ENGINE_LOCK_ID = dlw.ENGINE_LOCK_ID
ORDER BY dlw.OBJECT_SCHEMA, dlw.OBJECT_NAME\G"
*************************** 1. row ***************************
waiting_trx: 123456799
blocking_trx: 123456650
OBJECT_SCHEMA: appdb
OBJECT_NAME: order_items
INDEX_NAME: idx_order_id
LOCK_TYPE: RECORD
LOCK_MODE: X
LOCK_STATUS: WAITING
Meaning: This points to the hot table/index and the blocking transaction ID.
Decision: Immediately fetch both transactions’ SQL and duration. If the blocker is “obviously wrong” (long batch), kill it. If it’s normal traffic, you have a structural contention issue.
Task 5 (MySQL 8.0): map transaction IDs to sessions and SQL text
cr0x@server:~$ mysql -uroot -p -e "
SELECT
t.trx_id, t.trx_started, TIMESTAMPDIFF(SECOND, t.trx_started, NOW()) AS trx_age_s,
p.ID AS processlist_id, p.USER, p.HOST, p.DB, p.COMMAND, p.TIME, p.STATE,
LEFT(p.INFO, 200) AS sql_sample
FROM information_schema.innodb_trx t
JOIN information_schema.PROCESSLIST p
ON p.ID = t.trx_mysql_thread_id
ORDER BY trx_age_s DESC
LIMIT 10\G"
*************************** 1. row ***************************
trx_id: 123456650
trx_started: 2025-12-29 10:38:01
trx_age_s: 191
processlist_id: 8441
USER: batch
HOST: 10.2.8.19:55312
DB: appdb
COMMAND: Query
TIME: 187
STATE: Updating
sql_sample: UPDATE order_items SET price=price*0.98 WHERE order_id IN (...)
Meaning: You found the bully: a long-running transaction holding locks.
Decision: If it’s non-critical, kill it. If it’s critical, throttle concurrent writers and reduce lock footprint (indexing, chunking, narrower predicates) after containment.
Task 6: inspect lock wait timeout setting (and stop cargo-culting it)
cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50    |
+--------------------------+-------+
Meaning: This is how long a transaction waits before giving up. It does not prevent deadlocks.
Decision: Don’t “fix” deadlocks by setting this to 1 or 500. Tune it based on user-facing latency budgets and background job tolerance, then fix the contention pattern.
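If you genuinely need different budgets for different workloads, remember the variable is also settable per session, so a tolerant batch job doesn’t force a tolerant global default (a sketch):
-- inside the batch job's connection only; user-facing sessions keep the global value
SET SESSION innodb_lock_wait_timeout = 120;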
Task 7: check isolation level (deadlocks are sensitive to it)
cr0x@server:~$ mysql -uroot -p -e "SELECT @@transaction_isolation AS isolation;"
+-----------------+
| isolation       |
+-----------------+
| REPEATABLE-READ |
+-----------------+
Meaning: REPEATABLE READ can introduce next-key/gap locking for range scans and index lookups.
Decision: If deadlocks involve range predicates (e.g., WHERE created_at BETWEEN...), consider READ COMMITTED for the workload, but only with an explicit consistency review.
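If that review concludes READ COMMITTED is safe for the specific workload, scope the change to the offending sessions (or a single transaction) instead of flipping it globally. A sketch:
-- affects only this connection
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- or only the next transaction on this connection
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;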
Task 8: detect metadata lock pileups (not deadlocks, but the outage looks similar)
cr0x@server:~$ mysql -uroot -p -e "SHOW PROCESSLIST;" | sed -n '1,20p'
Id User Host db Command Time State Info
8123 appuser 10.2.3.12:51221 appdb Query 45 Waiting for table metadata lock ALTER TABLE orders ADD COLUMN x INT
8124 appuser 10.2.3.13:51222 appdb Query 44 Waiting for table metadata lock SELECT * FROM orders WHERE id=...
8125 appuser 10.2.3.14:51223 appdb Query 44 Waiting for table metadata lock UPDATE orders SET ...
Meaning: DDL is blocking reads/writes via MDL. This is not an InnoDB deadlock; it’s a schema change incident.
Decision: Pause/kill the DDL if safe, or move it to an online schema change strategy. Don’t waste time “debugging deadlocks” here.
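On MySQL 8.0 (where the wait/lock/metadata/sql/mdl instrument is on by default; older versions and some MariaDB builds may need it enabled), performance_schema.metadata_locks shows who owns the MDL that everyone is queued behind. A sketch:
SELECT OBJECT_TYPE, OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_TYPE = 'TABLE' AND LOCK_STATUS IN ('PENDING','GRANTED')
ORDER BY OBJECT_SCHEMA, OBJECT_NAME, LOCK_STATUS;
-- map OWNER_THREAD_ID to a connection id via performance_schema.threads (PROCESSLIST_ID column)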
Task 9: check for long history list length (undo pressure, long transactions)
cr0x@server:~$ mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" | sed -n '/History list length/,+3p'
History list length 987654
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 123456650, ACTIVE 191 sec
...
Meaning: Massive history list length often means long-running transactions preventing purge. It correlates with lock churn and performance collapse.
Decision: Find and fix long transactions: batch jobs, “read in transaction” patterns, or app code holding transactions open while calling external services (yes, people do that).
Task 10: confirm indexes used by the deadlocking queries (the access path is often the bug)
cr0x@server:~$ mysql -uroot -p -e "EXPLAIN UPDATE order_items SET price=price*0.98 WHERE order_id IN (101,102,103)\G"
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: order_items
partitions: NULL
type: range
possible_keys: idx_order_id
key: idx_order_id
key_len: 8
ref: NULL
rows: 45000
filtered: 100.00
Extra: Using where
Meaning: A range scan locking many records is a classic deadlock contributor.
Decision: Change the query shape (chunk by primary key, join to a temp table of IDs, or ensure equality access). If it must touch many rows, serialize that workload type.
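The chunking pattern is easy to sketch. Assuming order_items has an auto-increment primary key named id (illustrative), walk the table in primary-key ranges and commit between chunks so lock lifetimes stay short:
-- one small chunk per transaction; the driver (app code or a script) advances the range
START TRANSACTION;
UPDATE order_items
SET price = price * 0.98
WHERE order_id IN (101,102,103)
  AND id BETWEEN 1 AND 10000;   -- next chunk: 10001..20000, and so on
COMMIT;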
Task 11 (MySQL 8.0): get statement digests to spot the top locking offenders
cr0x@server:~$ mysql -uroot -p -e "
SELECT
DIGEST_TEXT,
COUNT_STAR,
SUM_LOCK_TIME/1000000000000 AS sum_lock_time_s,
SUM_TIMER_WAIT/1000000000000 AS sum_total_time_s
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_LOCK_TIME DESC
LIMIT 5\G"
*************************** 1. row ***************************
DIGEST_TEXT: UPDATE `order_items` SET `price` = `price` * ? WHERE `order_id` IN ( ... )
COUNT_STAR: 11922
sum_lock_time_s: 843.1200
sum_total_time_s: 910.5523
Meaning: This gives you the repeat offender query family, not one random sample.
Decision: Fix the highest aggregate lock-time digest first. That’s where you buy down incident frequency.
Task 12: verify deadlock logging configuration
cr0x@server:~$ mysql -uroot -p -e "SHOW VARIABLES LIKE 'innodb_print_all_deadlocks';"
+-----------------------------+-------+
| Variable_name               | Value |
+-----------------------------+-------+
| innodb_print_all_deadlocks  | OFF   |
+-----------------------------+-------+
Meaning: If OFF, you only get the latest deadlock in InnoDB status output; you may lose history under storm conditions.
Decision: Turn it ON in environments where deadlock forensics matter and log volume is acceptable. If you can’t afford the error log noise, rely on Performance Schema history where available.
Task 13: identify and kill a blocking session (surgical, not panicked)
cr0x@server:~$ mysql -uroot -p -e "KILL 8441;"
Query OK, 0 rows affected (0.00 sec)
Meaning: Session 8441 is terminated; its transaction rolls back, releasing locks.
Decision: Only do this when you’ve confirmed it’s the blocker and rollback cost is acceptable. If the transaction modified millions of rows, killing it might worsen I/O and undo churn temporarily.
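Before you kill, a quick look at the transaction’s footprint tells you roughly what the rollback will cost (a sketch against information_schema.innodb_trx; 8441 is the session from Task 5):
SELECT trx_id, trx_mysql_thread_id, trx_rows_modified, trx_rows_locked,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS trx_age_s
FROM information_schema.innodb_trx
WHERE trx_mysql_thread_id = 8441;
-- millions of rows modified means a long, I/O-heavy rollback after the KILL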
Task 14: check replication lag and apply thread state (deadlocks can be upstream symptoms)
cr0x@server:~$ mysql -uroot -p -e "SHOW REPLICA STATUS\G" | sed -n '1,35p'
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 420
Last_SQL_Error:
Replica_SQL_Running_State: Waiting for dependent transaction to commit
Meaning: Lag suggests transactions are slow to apply, often due to lock waits or large transactions.
Decision: If lag is rising during the deadlock storm, expect stale reads on replicas and consider temporarily routing fewer reads there, or stopping non-essential write-heavy jobs.
Task 15: check OS-level saturation (because the “deadlock” might be a slow disk pretending)
cr0x@server:~$ iostat -x 1 3
Linux 6.1.0 (db01) 12/29/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.31 0.00 6.88 34.22 0.00 46.59
Device r/s w/s rkB/s wkB/s await svctm %util
nvme0n1 120.0 980.0 4096.0 32768.0 48.10 0.85 92.30
Meaning: High iowait and high await indicate storage latency. Slow I/O stretches transaction time, increasing lock hold time and deadlock probability.
Decision: If storage is the bottleneck, your “deadlock problem” is partly an infrastructure problem: reduce write amplification (batching, index review) and fix the underlying I/O contention.
Three corporate-world mini-stories
Incident #1: the outage caused by a wrong assumption
A mid-sized commerce platform migrated from a legacy MySQL deployment to a MariaDB cluster because “it’s basically the same and we like open source governance.”
The migration went fine. The performance tests looked decent. Everyone celebrated responsibly by scheduling the cutover for a weekday morning.
The wrong assumption was subtle: the team assumed their retry logic—written years earlier—handled deadlocks “robustly” because it retried on
generic database errors. It did retry. Immediately. With no backoff. And it retried transactions that weren’t actually idempotent when partial work had
already happened in the application tier.
On cutover day, they hit a write conflict pattern they hadn’t seen before. A couple of hot rows (inventory counters and order state rows) became contention magnets.
The database did what databases do: detect deadlocks and abort victims. The app did what poorly supervised apps do: retry everything, fast, in parallel, without considering user-visible side effects.
The site didn’t crash instantly. It got slower. Then the worker pools saturated. Then the connection pool filled. Then health checks started failing.
From the outside it looked like “the database is deadlocking constantly.” From the inside it was “the application is amplifying a normal safety mechanism into a denial of service.”
The fix wasn’t “switch back to MySQL.” The fix was boring engineering: make retries conditional, use exponential backoff with jitter, cap retries,
and make the transaction truly idempotent (or design a compensating action). They also introduced a deadlock budget per endpoint: if the budget is exceeded,
the endpoint throttles before the database is forced to.
Incident #2: the optimization that backfired
A SaaS analytics company had a slow report generation job. It updated a “report_status” table for millions of users.
Someone spotted the classic inefficiency: “We’re doing too many commits. Let’s wrap the whole job in a single transaction for speed.”
On a quiet staging database, it looked brilliant. One transaction, one commit, less overhead. The graphs nodded politely.
In production, that “optimization” turned into a lock-holding monster. The transaction ran for minutes, sometimes longer.
During that time it held locks on secondary index entries and caused other workflows—especially user-facing updates—to queue behind it.
Deadlocks increased because every other transaction now had more time to collide with the long runner.
The incident pattern was nasty: not a single dramatic failure, but an escalating latency spiral. Customer requests slowed.
Background tasks retried. Replication lag grew. Operators killed the job; rollback took time; the system felt worse before it felt better.
The team then tried the worst possible “fix”: raising innodb_lock_wait_timeout. Now transactions waited longer to fail, so concurrency stayed high while the system made less progress.
The real fix was operationally unsexy and technically correct: chunk the job into small deterministic batches, commit every N rows,
and order updates by primary key to keep lock acquisition consistent. They accepted a bit more commit overhead in exchange for predictable lock lifetimes.
The job became slightly slower in isolation and dramatically faster for the business because the site stayed responsive.
The lesson: any optimization that increases transaction duration is an optimization against your own uptime.
Incident #3: the boring practice that saved the day
A fintech team ran MySQL 8.0 with strict operational hygiene. Nothing exotic: Performance Schema enabled, error logs shipped and searchable,
dashboards for lock waits, and a standard “incident query pack” in a shared repo. Every on-call had run the pack in a game day at least once.
One afternoon, an innocuous feature release caused a sudden increase in deadlocks on a table that tracked user verification state.
Customers couldn’t complete onboarding. Support tickets started stacking like bricks. On-call jumped in.
They didn’t start by debating isolation levels or blaming the ORM. They ran the lock-wait query against Performance Schema, identified the blocking digest,
and immediately saw a new query shape: an UPDATE with a range predicate on a timestamp column, missing an index, running inside a transaction that also read from another table.
That query wasn’t just slow; it was locking ranges.
Containment was simple: feature flag off that code path and throttle a background worker that was hammering the same table.
With pressure down, they added the missing composite index and rewrote the transaction to acquire locks in a consistent order.
The incident ended quickly, and the postmortem was refreshingly short.
The practice that saved them wasn’t heroics. It was routine instrumentation, routine runbooks, and the discipline to keep transaction scope small.
Boring is good. Boring scales.
Preventing deadlocks without lying to yourself
1) Make deadlocks a handled condition, not a surprise exception
Deadlocks will happen in any sufficiently concurrent system. Your job is to ensure they are:
(a) rare, (b) cheap, and (c) harmless. That means retries must be safe.
If you can’t retry safely, your data model and transaction boundaries are wrong for concurrency.
2) Keep transactions short, deterministic, and local
The most common deadlock amplifier is time. Every millisecond you hold locks is another lottery ticket for collision.
Shrink the transaction scope. Don’t keep a transaction open while you:
- call external services
- render templates
- wait on a queue
- perform unindexed reads you “plan to update later”
3) Use consistent lock ordering across code paths
If two transactions touch the same set of rows but lock them in different orders, you have a deadlock factory.
Fixing this is often cheaper than trying to “out-index” the problem. Common patterns:
- Always update parent before child (or vice versa), but choose one and enforce it.
- When updating multiple rows, sort IDs and update in that order (sketched below).
- Prefer single-row updates by primary key where possible.
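A minimal sketch of the sort-then-lock pattern (table and column names are illustrative):
-- every code path locks the same rows in the same (primary key) order,
-- so transactions can wait on each other but cannot form a cycle
START TRANSACTION;
SELECT id FROM accounts WHERE id IN (42, 7, 19) ORDER BY id FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 7;
UPDATE accounts SET balance = balance + 10 WHERE id = 42;
COMMIT;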
4) Design away hotspots
Some tables are born hot: counters, “current state” rows, leaderboards, session stores.
Hotspots cause both deadlocks and timeouts because they concentrate writes.
Options:
- shard by key (even within one database via partitioning or application-level bucketing)
- use append-only event tables and compute aggregates asynchronously (sketched after this list)
- avoid “global” counters; use per-entity counters and roll up
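A sketch of the append-only approach (table names are illustrative; a real rollup job would also track a high-water mark so each event is counted exactly once):
-- events table: writers only ever append, so there is no hot row to fight over
CREATE TABLE page_view_events (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  page_id BIGINT NOT NULL,
  delta INT NOT NULL DEFAULT 1,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_page_created (page_id, created_at)
) ENGINE=InnoDB;
-- hot path: pure append instead of UPDATE counters SET n = n + 1 on a shared row
INSERT INTO page_view_events (page_id) VALUES (42);
-- async rollup (for example, once a minute) computes the aggregate out of band
SELECT page_id, SUM(delta) AS views
FROM page_view_events
WHERE created_at >= NOW() - INTERVAL 1 MINUTE
GROUP BY page_id;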
5) Don’t “fix” deadlocks by lowering isolation as a reflex
Switching to READ COMMITTED can reduce some gap-lock related deadlocks, yes. It can also change application semantics.
If your business logic assumes consistent reads within a transaction, you just bought a new class of bugs.
Do it intentionally, with a correctness review and tests.
Common mistakes: symptom → root cause → fix
1) Symptom: deadlocks spike right after a deployment
Root cause: new query shape touches rows in a different order, or adds a range scan due to missing index.
Fix: compare statement digests/top lock time before and after; revert or feature-flag; add/adjust indexes; enforce lock ordering in code.
2) Symptom: many sessions “Waiting for table metadata lock”
Root cause: DDL blocking via MDL, not an InnoDB deadlock.
Fix: stop/kill the DDL if safe; use online schema change methods; schedule DDL with guardrails and MDL monitoring.
3) Symptom: lock wait timeouts, but few deadlock errors
Root cause: long transaction or heavy batch job holding locks; hot rows; storage latency stretching transaction time.
Fix: identify longest transaction, chunk the job, add missing indexes, reduce I/O contention, and ensure the app doesn’t hold transactions open unnecessarily.
4) Symptom: deadlocks happen “randomly” under load
Root cause: non-deterministic locking order (unordered IN lists, parallel workers updating overlapping sets), or foreign key cascades locking additional rows.
Fix: sort keys before updates; serialize workers by partition key; review foreign key cascades and triggers; consider explicit SELECT … FOR UPDATE ordering.
5) Symptom: raising innodb_lock_wait_timeout makes things worse
Root cause: you increased the amount of time threads sit blocked, consuming concurrency and memory, without increasing throughput.
Fix: lower it to match your latency budget; fix the blocker; reduce contention; implement backpressure at the app tier.
6) Symptom: deadlocks disappear after you “add an index,” but performance degrades
Root cause: index reduced contention but increased write amplification; you created a new hot index or made updates heavier.
Fix: validate write cost; remove unused indexes; consider composite indexes that match access patterns; keep the minimal set that supports critical reads/writes.
7) Symptom: in Galera setups, “deadlock-like” aborts without clear InnoDB deadlock traces
Root cause: replication certification conflicts (two nodes commit conflicting writes).
Fix: reduce multi-master write overlap; route writes for a given entity to one node; review retry logic and conflict rate monitoring.
Checklists / step-by-step plan
During the incident (first 15 minutes)
- Classify the error: 1213 deadlock vs 1205 timeout vs MDL wait. Don’t mix them.
- Capture evidence fast: InnoDB status excerpt, top lock-wait query, and the worst offender sessions.
- Contain: kill the single biggest blocker if safe; throttle offending workers/endpoints; disable the new feature if correlated.
- Stabilize retries: cap retries, add backoff, and stop retrying non-idempotent workflows.
- Check infra constraints: storage latency, CPU saturation, buffer pool pressure. Deadlocks are easier to trigger on a slow system.
After containment (same day)
- Identify top offender digests (not one-off SQL). Fix the biggest contributor.
- Reduce transaction scope and remove “transaction while doing work” patterns.
- Enforce lock ordering in code and tests (yes, tests).
- Index review: ensure predicates match indexes; avoid unintended range scans.
- Validate rollback cost: if you kill blockers routinely, make sure you understand undo churn and I/O behavior.
Hardening (this is how you stop repeat pages)
- Enable durable deadlock observability: logs shipped, Performance Schema configured, dashboards for lock waits and transaction age.
- Create a deadlock budget per service: if it exceeds threshold, auto-throttle or shed load.
- Game day: practice using the exact queries and commands above under a synthetic contention scenario.
- Review schema patterns: hotspots, counters, state machines. Redesign where needed.
FAQ
1) Is a deadlock a database bug?
Usually no. It’s the database doing the correct thing when two transactions create a cycle. The bug is typically in concurrency design or retry handling.
2) Which is easier to debug: MySQL or MariaDB?
With modern defaults and mature Performance Schema usage, MySQL 8.0 tends to be easier to debug quickly. MariaDB can be equally workable, but observability varies more by deployment.
3) Should I just enable innodb_print_all_deadlocks?
If you can handle the log volume and you need forensic history, yes. But don’t treat it as your only tool. Under heavy churn, logs can become noise and still miss the bigger pattern.
4) Why do deadlocks happen when we only update by primary key?
Because secondary indexes and foreign key checks can still introduce additional locks. Also, “by primary key” is sometimes not actually by primary key once the ORM gets involved.
5) What’s the difference between deadlock and lock wait timeout for the app?
Deadlock errors are immediate aborts to break a cycle; timeouts are the system giving up after waiting. Both should be handled, but timeouts often indicate a long blocker or missing indexes.
6) Does READ COMMITTED fix deadlocks?
It can reduce gap-lock related deadlocks, but it does not “fix deadlocks” as a category. It also changes what your transactions can assume. Treat it as an engineering change, not a toggle.
7) Can I prevent deadlocks by increasing innodb_lock_wait_timeout?
No. That affects timeouts, not deadlock detection. Increasing it often worsens outages by letting threads wait longer and pile up.
8) What’s the safest immediate mitigation during a storm?
Throttle concurrency of the offending code path and kill the single worst blocking transaction if rollback cost is acceptable. Then fix the query shape and transaction scope.
9) How do I know if it’s a schema change issue instead?
If sessions show “Waiting for table metadata lock,” that’s MDL. Deadlock tooling won’t help; you need to manage DDL execution strategy.
Conclusion: next steps that actually reduce pages
If you want fewer deadlock incidents, stop treating deadlocks as spooky database lore and start treating them as
an observable, classifiable concurrency cost. MySQL 8.0 makes that easier out of the box with Performance Schema.
MariaDB can do it too, but you’ll need to be stricter about enabling and standardizing the right telemetry—especially in mixed topologies.
Practical next steps:
- Build a two-path runbook: deadlock (1213) vs timeout (1205) vs MDL. Different tools, different fixes.
- Enable and verify your observability: deadlock logging, Performance Schema tables you actually query, and searchable logs.
- Fix the top offender digest by lock time, not the scariest single query you happen to notice.
- Make retries sane: idempotent transactions, backoff with jitter, and caps. Stop retry storms before they start.
- Shorten transactions and enforce lock ordering. That’s the work. That’s also the win.