Learn › MySQL › Rahul teaches the thinking behind good database design

MySQL Ch 17 / 18 Expert

🏛️

Rahul teaches the thinking behind good database design

Normalisation, when to denormalise, migrations, the optimisation decision tree

⏱ 14 min 4 commands 5 takeaways

🏛️

In this chapter

Rahul

Database team lead, training new engineers

The story

Rahul now leads the database team. A new engineer joined and asked: how do I know when a query is fast enough? When should I denormalise? When should I add a cache layer? What is the right architecture for 10 million users?

Rahul said: let me show you how to think about it.

DATABASE DESIGN PRINCIPLES

NORMALISATION -- REMOVING REDUNDANCY

Normalisation is organising tables so each piece of information lives in exactly one place.

Third Normal Form (3NF) -- what most production databases aim for:

- Each table has a single primary key

- Every column depends on the primary key, not on another non-key column

- No redundant data

Bad (denormalised):

orders table: id, user_name, user_email, user_city, product, amount
-- user_name and user_email are repeated in every order row
-- If a user changes email, you update it in orders too? Or not?

Good (normalised):

users table: id, name, email, city
orders table: id, user_id, product, amount
-- user data lives once. Change email once in users table.

WHEN TO DENORMALISE

Denormalisation is intentionally adding redundant data to improve read performance.

Rule: normalise first, denormalise only when you have measured that it is too slow.

When it makes sense:

- A column is read in 95% of queries but requires an expensive JOIN to get

- The column changes rarely (city on a user's profile changes once a year)

-- Adding city to orders avoids JOIN with users for city-based reports:
ALTER TABLE orders ADD COLUMN user_city VARCHAR(50);
UPDATE orders o JOIN users u ON o.user_id = u.id SET o.user_city = u.city;
-- Now add a trigger to keep it updated when users.city changes

SCHEMA VERSIONING WITH MIGRATIONS

Every change to your database schema should be a versioned migration file.

-- migrations/0001_create_users.sql
CREATE TABLE users (...);

-- migrations/0002_add_phone_to_users.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(15);

-- migrations/0003_add_orders.sql
CREATE TABLE orders (...);

Tools: Flyway, Liquibase, Alembic (Python), Knex (Node.js) all manage this.

The rule: never run ALTER TABLE directly in production. Always through a migration.

MEASURING QUERY PERFORMANCE

-- Response time histogram for a query (run 100 times):
SET @start = UNIX_TIMESTAMP(NOW(6));
SELECT * FROM orders WHERE status = 'paid' LIMIT 100;
SELECT UNIX_TIMESTAMP(NOW(6)) - @start AS seconds;

-- Performance schema for historical stats:
SELECT
    SUBSTR(DIGEST_TEXT, 1, 80) AS query,
    COUNT_STAR AS executions,
    ROUND(AVG_TIMER_WAIT/1e12, 3) AS avg_sec,
    ROUND(MAX_TIMER_WAIT/1e12, 3) AS max_sec,
    ROUND(SUM_ROWS_EXAMINED/COUNT_STAR) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'myapp_db'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

THE QUERY OPTIMISATION DECISION TREE

Query is slow. What do I do?

Step 1: EXPLAIN it. What type? How many rows?
-- type=ALL and rows=millions?  → Add an index on the WHERE column
-- type=ref but rows still high? → Make the WHERE more selective
-- Extra=Using filesort?         → Consider an index that covers ORDER BY
-- Extra=Using temporary?        → Consider an index that covers GROUP BY

Step 2: Is the index being used but query is still slow?
-- Check rows examined vs rows returned in EXPLAIN ANALYZE
-- Large difference? Index is not selective enough
-- Consider a composite index that covers all WHERE conditions

Step 3: Is the query itself correct?
-- Functions on indexed columns? Rewrite as a range.
-- SELECT * when you need 3 columns? Select only what you need.
-- IN with subquery? Try EXISTS or JOIN instead.

Step 4: Is the schema correct?
-- Are column types appropriate? (DECIMAL for money, not FLOAT)
-- Is the table fragmented? OPTIMIZE TABLE.
-- Is the buffer pool big enough? Check hit ratio.

Step 5: Is it a caching problem?
-- Same query called 1000x/second? Add Redis cache.
-- Report query on live data? Use a replica for reads.

USEFUL INFORMATION_SCHEMA QUERIES

-- Tables without a primary key (performance risk):
SELECT table_name
FROM information_schema.TABLES t
WHERE table_schema = 'myapp_db'
AND table_name NOT IN (
    SELECT DISTINCT table_name
    FROM information_schema.TABLE_CONSTRAINTS
    WHERE constraint_type = 'PRIMARY KEY'
    AND table_schema = 'myapp_db'
);

-- Foreign keys without corresponding indexes:
SELECT kcu.table_name, kcu.column_name, kcu.referenced_table_name
FROM information_schema.KEY_COLUMN_USAGE kcu
LEFT JOIN information_schema.STATISTICS idx
    ON kcu.table_name = idx.table_name
    AND kcu.column_name = idx.column_name
    AND kcu.table_schema = idx.table_schema
WHERE kcu.table_schema = 'myapp_db'
AND kcu.referenced_table_name IS NOT NULL
AND idx.index_name IS NULL;

Rahul's final lesson to the new engineer: the best database optimisation is one you never had to do because the schema and indexes were right from the beginning. Design thoughtfully, measure before you change, and keep transactions short.

Key takeaways

Normalise first — each piece of data should live in exactly one place, then denormalise only when you have measured the performance problem

Every schema change should be a versioned migration file — never ALTER TABLE directly in production without version control

Tables without a primary key have no internal ordering — InnoDB performance degrades significantly without a PRIMARY KEY

Foreign keys without indexes cause full table scans on every JOIN — query information_schema to find these automatically

The optimisation decision tree: type=ALL means add index, Using filesort means redesign index for ORDER BY, slow despite index means rewrite the query itself

Commands from this chapter

$ SELECT table_name FROM information_schema.TABLES WHERE table_schema='myapp_db' AND table_name NOT IN (SELECT table_name FROM information_schema.TABLE_CONSTRAINTS WHERE constraint_type='PRIMARY KEY' AND table_schema='myapp_db');

Find tables missing a primary key

$ SELECT SUBSTR(DIGEST_TEXT,1,80), COUNT_STAR, ROUND(AVG_TIMER_WAIT/1e12,3) avg_sec, ROUND(SUM_ROWS_EXAMINED/COUNT_STAR) avg_rows FROM performance_schema.events_statements_summary_by_digest WHERE SCHEMA_NAME='myapp_db' ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;

Top 10 slowest queries with average rows examined

$ EXPLAIN ANALYZE SELECT * FROM orders WHERE status='paid' ORDER BY ordered_at DESC LIMIT 10;

Real execution stats — actual vs estimated rows

$ SELECT kcu.table_name, kcu.column_name FROM information_schema.KEY_COLUMN_USAGE kcu LEFT JOIN information_schema.STATISTICS idx ON kcu.table_name=idx.table_name AND kcu.column_name=idx.column_name WHERE kcu.table_schema='myapp_db' AND kcu.referenced_table_name IS NOT NULL AND idx.index_name IS NULL;

Find foreign keys missing an index