Learn › MySQL › The team resolves a production incident using everything they know

MySQL Ch 18 / 18 Expert

🏆

The team resolves a production incident using everything they know

Locks, deadlocks, bulk migration gone wrong — a real end-to-end incident resolution

⏱ 15 min 6 commands 5 takeaways

🏆

In this chapter

Rahul, Priya, Arjun

Full team, handling a real production incident

The story

This is the capstone chapter. Rahul, Priya, and Arjun have a real production incident to resolve together. Every skill from the previous 17 chapters is needed.

THE INCIDENT

Monday 11am. The order placement API started timing out. 30 seconds to place an order. Customer support was receiving complaints. The CTO messaged: what is happening?

STEP 1: ORIENT — WHAT IS HAPPENING RIGHT NOW

SHOW FULL PROCESSLIST;

-- Priya sees 80 connections in state "Waiting for lock"
-- All waiting for the same table: orders
-- One query has been running for 145 seconds

SELECT * FROM information_schema.INNODB_TRX;
-- Shows a transaction open for 148 seconds: updating orders

-- Find the blocking thread:
SELECT r.trx_id waiting_trx,
       r.trx_query waiting_query,
       b.trx_id blocking_trx,
       b.trx_query blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;

STEP 2: IMMEDIATE FIX

KILL 4521;  -- kill the blocking transaction

-- Verify the backlog clears:
SHOW PROCESSLIST;  -- connections dropping from 80 to normal

STEP 3: UNDERSTAND WHY IT HAPPENED

-- What was that long-running query?
-- Check the binary log around 11am:
mysqlbinlog --start-datetime="2026-03-16 10:55:00"                 --stop-datetime="2026-03-16 11:05:00"                 /var/log/mysql/mysql-bin.000245 | head -100

-- Found: a data migration script that started a transaction,
-- updated 2 million rows, but forgot to COMMIT

STEP 4: VERIFY NOTHING IS CORRUPTED

-- Check that order counts are correct:
SELECT status, COUNT(*) FROM orders GROUP BY status;

-- Compare with expected from the app logs

-- Check that transactions from the blocked period completed or rolled back:
SELECT COUNT(*) FROM orders WHERE ordered_at BETWEEN '11:00:00' AND '11:30:00';

STEP 5: PREVENT RECURRENCE

-- Add max transaction time monitoring:
SELECT trx_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30;
-- Run this every 60 seconds via cron and alert if any transaction is over 30 seconds old

-- Set transaction timeout:
SET GLOBAL innodb_lock_wait_timeout = 30;  -- queries wait max 30 seconds for a lock

-- Add to my.cnf for permanence:
innodb_lock_wait_timeout = 30

-- For bulk data operations: do them in small batches
-- Instead of UPDATE orders SET ... WHERE status = 'paid' (2 million rows in one transaction):
-- Process 1000 rows at a time:

SET @done = 0;
REPEAT
    UPDATE orders SET migrated = 1
    WHERE migrated = 0 AND status = 'paid'
    LIMIT 1000;
    SET @done = ROW_COUNT();
    SELECT SLEEP(0.1);  -- brief pause between batches
UNTIL @done = 0 END REPEAT;

THE COMPLETE MYSQL DEBUGGING REFERENCE

When something goes wrong, use this order:

1. SHOW FULL PROCESSLIST                           -- what is happening now
2. SHOW ENGINE INNODB STATUS                       -- locks and transactions
3. sudo tail -100 /var/log/mysql/error.log         -- server errors
4. EXPLAIN query                                   -- query execution plan
5. SHOW SLAVE STATUS                               -- replication health
6. SELECT ... FROM performance_schema...           -- historical stats
7. mysqlbinlog ...                                 -- binary log for data recovery

Common incident → First check:
App cannot connect        → max_connections, disk space, MySQL running
Queries are slow          → PROCESSLIST, EXPLAIN, slow query log
Locks and waits           → INNODB_TRX, INNODB_LOCK_WAITS, kill blocking thread
Data was accidentally deleted → binary log, last backup
Replication broken        → SHOW SLAVE STATUS, Last_Error
MySQL won't start         → error log, disk space, port conflict

Priya closed the incident in 12 minutes. The postmortem had three action items: cap migration transaction size, add transaction age monitoring, and require peer review for any bulk UPDATE or DELETE.

Rahul wrote in the postmortem: MySQL does not hide problems. Every slowdown, every lock, every error has a trace. Learn to read the traces. The database tells you exactly what went wrong if you know where to look.

Key takeaways

SHOW FULL PROCESSLIST then information_schema.INNODB_TRX then INNODB_LOCK_WAITS — three queries to find any lock crisis

A bulk UPDATE without batching holds a lock on every affected row for the entire duration — always batch large operations in chunks of 1000

innodb_lock_wait_timeout = 30 means queries wait max 30 seconds for a lock then fail with an error rather than hanging forever

Binary logs are your undo button for data accidents — mysqlbinlog lets you replay or skip specific time windows

The complete debugging checklist: PROCESSLIST, INNODB STATUS, error log, EXPLAIN, SLAVE STATUS, performance schema, binary log

Commands from this chapter

$ SELECT r.trx_id, r.trx_query, b.trx_id, b.trx_query FROM information_schema.INNODB_LOCK_WAITS w JOIN information_schema.INNODB_TRX b ON b.trx_id=w.blocking_trx_id JOIN information_schema.INNODB_TRX r ON r.trx_id=w.requesting_trx_id;

Find which transaction is blocking which

$ SELECT trx_id, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age FROM information_schema.INNODB_TRX ORDER BY age DESC;

Find long-running open transactions

$ SET GLOBAL innodb_lock_wait_timeout = 30;

Set max lock wait time to 30 seconds globally

$ UPDATE orders SET migrated=1 WHERE migrated=0 LIMIT 1000;

Batch large updates — run in a loop instead of updating 2M rows in one transaction

$ SHOW ENGINE INNODB STATUS\G

Full InnoDB status — contains deadlock history and lock details

$ mysqlbinlog --start-datetime='2026-03-16 11:00' --stop-datetime='2026-03-16 11:45' /var/log/mysql/mysql-bin.000245 | grep -A5 'UPDATE\|DELETE'

Find the problematic query in binary log