Rahul teaches the thinking behind good database design
Normalisation, when to denormalise, migrations, the optimisation decision tree
Rahul now leads the database team. A new engineer joined and asked: how do I know when a query is fast enough? When should I denormalise? When should I add a cache layer? What is the right architecture for 10 million users?
Rahul said: let me show you how to think about it.
DATABASE DESIGN PRINCIPLES
NORMALISATION -- REMOVING REDUNDANCY
Normalisation is organising tables so each piece of information lives in exactly one place.
Third Normal Form (3NF) -- what most production databases aim for:
- Each table has a single primary key
- Every column depends on the primary key, not on another non-key column
- No redundant data
Bad (denormalised):
orders table: id, user_name, user_email, user_city, product, amount
-- user_name and user_email are repeated in every order row
-- If a user changes email, you update it in orders too? Or not?Good (normalised):
users table: id, name, email, city
orders table: id, user_id, product, amount
-- user data lives once. Change email once in users table.WHEN TO DENORMALISE
Denormalisation is intentionally adding redundant data to improve read performance.
Rule: normalise first, denormalise only when you have measured that it is too slow.
When it makes sense:
- A column is read in 95% of queries but requires an expensive JOIN to get
- The column changes rarely (city on a user's profile changes once a year)
-- Adding city to orders avoids JOIN with users for city-based reports:
ALTER TABLE orders ADD COLUMN user_city VARCHAR(50);
UPDATE orders o JOIN users u ON o.user_id = u.id SET o.user_city = u.city;
-- Now add a trigger to keep it updated when users.city changesSCHEMA VERSIONING WITH MIGRATIONS
Every change to your database schema should be a versioned migration file.
-- migrations/0001_create_users.sql
CREATE TABLE users (...);-- migrations/0002_add_phone_to_users.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(15);-- migrations/0003_add_orders.sql
CREATE TABLE orders (...);Tools: Flyway, Liquibase, Alembic (Python), Knex (Node.js) all manage this.
The rule: never run ALTER TABLE directly in production. Always through a migration.
MEASURING QUERY PERFORMANCE
-- Response time histogram for a query (run 100 times):
SET @start = UNIX_TIMESTAMP(NOW(6));
SELECT * FROM orders WHERE status = 'paid' LIMIT 100;
SELECT UNIX_TIMESTAMP(NOW(6)) - @start AS seconds;-- Performance schema for historical stats:
SELECT
SUBSTR(DIGEST_TEXT, 1, 80) AS query,
COUNT_STAR AS executions,
ROUND(AVG_TIMER_WAIT/1e12, 3) AS avg_sec,
ROUND(MAX_TIMER_WAIT/1e12, 3) AS max_sec,
ROUND(SUM_ROWS_EXAMINED/COUNT_STAR) AS avg_rows_examined
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = 'myapp_db'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;THE QUERY OPTIMISATION DECISION TREE
Query is slow. What do I do?Step 1: EXPLAIN it. What type? How many rows?
-- type=ALL and rows=millions? → Add an index on the WHERE column
-- type=ref but rows still high? → Make the WHERE more selective
-- Extra=Using filesort? → Consider an index that covers ORDER BY
-- Extra=Using temporary? → Consider an index that covers GROUP BYStep 2: Is the index being used but query is still slow?
-- Check rows examined vs rows returned in EXPLAIN ANALYZE
-- Large difference? Index is not selective enough
-- Consider a composite index that covers all WHERE conditionsStep 3: Is the query itself correct?
-- Functions on indexed columns? Rewrite as a range.
-- SELECT * when you need 3 columns? Select only what you need.
-- IN with subquery? Try EXISTS or JOIN instead.Step 4: Is the schema correct?
-- Are column types appropriate? (DECIMAL for money, not FLOAT)
-- Is the table fragmented? OPTIMIZE TABLE.
-- Is the buffer pool big enough? Check hit ratio.Step 5: Is it a caching problem?
-- Same query called 1000x/second? Add Redis cache.
-- Report query on live data? Use a replica for reads.USEFUL INFORMATION_SCHEMA QUERIES
-- Tables without a primary key (performance risk):
SELECT table_name
FROM information_schema.TABLES t
WHERE table_schema = 'myapp_db'
AND table_name NOT IN (
SELECT DISTINCT table_name
FROM information_schema.TABLE_CONSTRAINTS
WHERE constraint_type = 'PRIMARY KEY'
AND table_schema = 'myapp_db'
);-- Foreign keys without corresponding indexes:
SELECT kcu.table_name, kcu.column_name, kcu.referenced_table_name
FROM information_schema.KEY_COLUMN_USAGE kcu
LEFT JOIN information_schema.STATISTICS idx
ON kcu.table_name = idx.table_name
AND kcu.column_name = idx.column_name
AND kcu.table_schema = idx.table_schema
WHERE kcu.table_schema = 'myapp_db'
AND kcu.referenced_table_name IS NOT NULL
AND idx.index_name IS NULL;Rahul's final lesson to the new engineer: the best database optimisation is one you never had to do because the schema and indexes were right from the beginning. Design thoughtfully, measure before you change, and keep transactions short.
Normalise first — each piece of data should live in exactly one place, then denormalise only when you have measured the performance problem
Every schema change should be a versioned migration file — never ALTER TABLE directly in production without version control
Tables without a primary key have no internal ordering — InnoDB performance degrades significantly without a PRIMARY KEY
Foreign keys without indexes cause full table scans on every JOIN — query information_schema to find these automatically
The optimisation decision tree: type=ALL means add index, Using filesort means redesign index for ORDER BY, slow despite index means rewrite the query itself