The team resolves a production incident using everything they know
Locks, deadlocks, bulk migration gone wrong — a real end-to-end incident resolution
This is the capstone chapter. Rahul, Priya, and Arjun have a real production incident to resolve together. Every skill from the previous 17 chapters is needed.
THE INCIDENT
Monday 11am. The order placement API started timing out. 30 seconds to place an order. Customer support was receiving complaints. The CTO messaged: what is happening?
STEP 1: ORIENT — WHAT IS HAPPENING RIGHT NOW
SHOW FULL PROCESSLIST;-- Priya sees 80 connections in state "Waiting for lock"
-- All waiting for the same table: orders
-- One query has been running for 145 secondsSELECT * FROM information_schema.INNODB_TRX;
-- Shows a transaction open for 148 seconds: updating orders-- Find the blocking thread:
SELECT r.trx_id waiting_trx,
r.trx_query waiting_query,
b.trx_id blocking_trx,
b.trx_query blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;STEP 2: IMMEDIATE FIX
KILL 4521; -- kill the blocking transaction-- Verify the backlog clears:
SHOW PROCESSLIST; -- connections dropping from 80 to normalSTEP 3: UNDERSTAND WHY IT HAPPENED
-- What was that long-running query?
-- Check the binary log around 11am:
mysqlbinlog --start-datetime="2026-03-16 10:55:00" --stop-datetime="2026-03-16 11:05:00" /var/log/mysql/mysql-bin.000245 | head -100-- Found: a data migration script that started a transaction,
-- updated 2 million rows, but forgot to COMMITSTEP 4: VERIFY NOTHING IS CORRUPTED
-- Check that order counts are correct:
SELECT status, COUNT(*) FROM orders GROUP BY status;-- Compare with expected from the app logs-- Check that transactions from the blocked period completed or rolled back:
SELECT COUNT(*) FROM orders WHERE ordered_at BETWEEN '11:00:00' AND '11:30:00';STEP 5: PREVENT RECURRENCE
-- Add max transaction time monitoring:
SELECT trx_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30;
-- Run this every 60 seconds via cron and alert if any transaction is over 30 seconds old-- Set transaction timeout:
SET GLOBAL innodb_lock_wait_timeout = 30; -- queries wait max 30 seconds for a lock-- Add to my.cnf for permanence:
innodb_lock_wait_timeout = 30-- For bulk data operations: do them in small batches
-- Instead of UPDATE orders SET ... WHERE status = 'paid' (2 million rows in one transaction):
-- Process 1000 rows at a time:SET @done = 0;
REPEAT
UPDATE orders SET migrated = 1
WHERE migrated = 0 AND status = 'paid'
LIMIT 1000;
SET @done = ROW_COUNT();
SELECT SLEEP(0.1); -- brief pause between batches
UNTIL @done = 0 END REPEAT;THE COMPLETE MYSQL DEBUGGING REFERENCE
When something goes wrong, use this order:
1. SHOW FULL PROCESSLIST -- what is happening now
2. SHOW ENGINE INNODB STATUS -- locks and transactions
3. sudo tail -100 /var/log/mysql/error.log -- server errors
4. EXPLAIN query -- query execution plan
5. SHOW SLAVE STATUS -- replication health
6. SELECT ... FROM performance_schema... -- historical stats
7. mysqlbinlog ... -- binary log for data recoveryCommon incident → First check:
App cannot connect → max_connections, disk space, MySQL running
Queries are slow → PROCESSLIST, EXPLAIN, slow query log
Locks and waits → INNODB_TRX, INNODB_LOCK_WAITS, kill blocking thread
Data was accidentally deleted → binary log, last backup
Replication broken → SHOW SLAVE STATUS, Last_Error
MySQL won't start → error log, disk space, port conflictPriya closed the incident in 12 minutes. The postmortem had three action items: cap migration transaction size, add transaction age monitoring, and require peer review for any bulk UPDATE or DELETE.
Rahul wrote in the postmortem: MySQL does not hide problems. Every slowdown, every lock, every error has a trace. Learn to read the traces. The database tells you exactly what went wrong if you know where to look.
SHOW FULL PROCESSLIST then information_schema.INNODB_TRX then INNODB_LOCK_WAITS — three queries to find any lock crisis
A bulk UPDATE without batching holds a lock on every affected row for the entire duration — always batch large operations in chunks of 1000
innodb_lock_wait_timeout = 30 means queries wait max 30 seconds for a lock then fail with an error rather than hanging forever
Binary logs are your undo button for data accidents — mysqlbinlog lets you replay or skip specific time windows
The complete debugging checklist: PROCESSLIST, INNODB STATUS, error log, EXPLAIN, SLAVE STATUS, performance schema, binary log