System Design/
Lesson

NormalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. is the process of structuring a relational database to reduce redundancy and prevent data anomalies. It sounds academic, but it solves a very practical problem: when the same piece of data lives in multiple places, it will eventually get out of sync. Normalization gives you a systematic way to prevent that.

Why normalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. matters

Consider this table tracking orders at a coffee shop:

| order_id | customer_name | customer_email      | drink     | size   | price |
|----------|---------------|---------------------|-----------|--------|-------|
| 1        | Alice         | alice@mail.com      | Latte     | Large  | 5.50  |
| 2        | Alice         | alice@mail.com      | Espresso  | Small  | 3.00  |
| 3        | Bob           | bob@mail.com        | Latte     | Large  | 5.50  |
| 4        | Alice         | alice_new@mail.com  | Cappuccino| Medium | 4.50  |

Spot the problem? Alice's email appears three times, and in row 4 it's different from rows 1 and 2. Which one is correct? Nobody knows. This is an update anomaly: when you change data in one place but not everywhere, your database becomes inconsistent.

Other anomalies:

  • Insertion anomaly: You can't add a new drink to the menu without creating a fake order.
  • Deletion anomaly: If you delete Bob's only order, you lose the fact that Bob exists as a customer.

Normalization systematically eliminates these problems.

02

First normal form (1NF)

Rule: Every column holds a single, atomic value. No repeating groups, no arrays, no comma-separated lists.

Violating 1NF:

-- BAD: phone_numbers contains multiple values
CREATE TABLE customers (
  id INTEGER PRIMARY KEY,
  name TEXT,
  phone_numbers TEXT  -- stores "555-0101, 555-0202"
);

This looks fine until you need to find all customers with a specific phone number. You'd have to use string matching (LIKE '%555-0101%'), which is slow, error-prone, and can't use indexes.

Fixed in 1NF:

-- GOOD: separate table for phone numbers
CREATE TABLE customers (
  id INTEGER PRIMARY KEY,
  name TEXT
);

CREATE TABLE customer_phones (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(id),
  phone TEXT NOT NULL
);

-- Now you can query efficiently
SELECT c.name FROM customers c
JOIN customer_phones cp ON cp.customer_id = c.id
WHERE cp.phone = '555-0101';
03

Second normal form (2NF)

Rule: Must be in 1NF, and every non-key column depends on the entire primary key, not just part of it.

2NF only matters when you have a composite primary key (a primary key made of multiple columns). If your primary key is a single column, you're already in 2NF by definition.

Violating 2NF (composite key: student_id + course_id):

-- BAD: student_name depends only on student_id, not on (student_id, course_id)
CREATE TABLE enrollments (
  student_id INTEGER,
  course_id INTEGER,
  student_name TEXT,      -- depends only on student_id (partial dependency)
  course_name TEXT,       -- depends only on course_id (partial dependency)
  grade CHAR(1),          -- depends on both (full dependency)
  PRIMARY KEY (student_id, course_id)
);

The problem: if a student changes their name, you'd need to update every row where they're enrolled. Miss one, and you have inconsistent data.

Fixed in 2NF:

-- GOOD: each fact lives in one place
CREATE TABLE students (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE courses (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE enrollments (
  student_id INTEGER REFERENCES students(id),
  course_id INTEGER REFERENCES courses(id),
  grade CHAR(1),
  PRIMARY KEY (student_id, course_id)
);

Now student_name lives only in the students table. Update it once, and it's correct everywhere.

04

Third normal form (3NF)

Rule: Must be in 2NF, and no non-key column depends on another non-key column (no transitive dependencies).

Violating 3NF:

-- BAD: department_name depends on department_id, not directly on employee id
CREATE TABLE employees (
  id INTEGER PRIMARY KEY,
  name TEXT,
  department_id INTEGER,
  department_name TEXT     -- depends on department_id, not on id
);

The department_name depends on department_id, which depends on the employee id. That's a transitive dependencyWhat is dependency?A piece of code written by someone else that your project needs to work. Think of it as a building block you import instead of writing yourself.: id -> department_id -> department_name. If the Engineering department renames to "Platform Engineering," you need to update every employee row.

Fixed in 3NF:

-- GOOD: department info lives in its own table
CREATE TABLE departments (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE employees (
  id INTEGER PRIMARY KEY,
  name TEXT,
  department_id INTEGER REFERENCES departments(id)
);

-- To get an employee with their department name:
SELECT e.name, d.name AS department
FROM employees e
JOIN departments d ON d.id = e.department_id;
05

NormalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. example: step by step

Here's raw, unnormalized data for a bookstore:

order_idcustomeremailbook_titleauthorauthor_countryqtyprice
1Alice[email protected]Clean CodeRobert MartinUSA135
1Alice[email protected]RefactoringMartin FowlerUK245
2Bob[email protected]Clean CodeRobert MartinUSA135

After 1NF: already there (all values are atomic, no lists).

After 2NF: split out partial dependencies. The composite key is (order_id, book_title). customer and email depend only on order_id. author and author_country depend only on book_title.

orders
order_id (PK)customeremail
1Alice[email protected]
2Bob[email protected]
books
title (PK)authorauthor_country
Clean CodeRobert MartinUSA
RefactoringMartin FowlerUK
order_items
order_id (FK)book_title (FK)qtyprice
1Clean Code135
1Refactoring245
2Clean Code135

After 3NF: author_country depends on author, not on book_title (transitive). Extract authors.

authors
name (PK)country
Robert MartinUSA
Martin FowlerUK
books (updated)
title (PK)author (FK)
Clean CodeRobert Martin
RefactoringMartin Fowler

Now every fact lives in exactly one place. Change an author's country? One row. Rename a customer? One row. No anomalies.

06

When normalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. hurts

Normalization is not free. Every JOINWhat is join?A SQL operation that combines rows from two or more tables based on a shared column, letting you query related data in one request. adds latencyWhat is latency?The time delay between sending a request and receiving the first byte of the response, usually measured in milliseconds.. In read-heavy systems, the cost of joining five tables on every page load can become significant.

ScenarioNormalization impactBetter approach
OLTP (transactional)Great, prevents anomalies, writes are cleanNormalize to 3NF
OLAP (analytics/reporting)Painful, JOINs across millions of rows are slowDenormalize or use star schema
Read-heavy APIsJOINs add latency to every requestCache or denormalize hot paths
Write-heavy loggingFine, writes are simple insertsNormalize is fine
MicroservicesEach service owns its data, normalize within a serviceDenormalize across service boundaries

The rule of thumb: normalize by default, denormalize with intention. Start with 3NF and only break it when you have measured performance data showing that JOINs are the bottleneck. The next lesson covers exactly when and how to do that.

07

Quick reference: normal forms

Normal FormRuleEliminates
1NFAtomic values, no repeating groupsRepeated data in cells
2NFNo partial dependencies on composite keysRedundancy from partial key matches
3NFNo transitive dependenciesRedundancy from indirect dependencies
BCNFEvery determinant is a candidate keyEdge cases 3NF misses (rare)

In practice, reaching 3NF is enough for the vast majority of applications. BCNF (Boyce-Codd Normal Form) handles edge cases you'll rarely encounter.

AI pitfall
AI-generated schemas tend to over-normalize. Ask it to design a database for an e-commerce app and it will create separate lookup tables for statuses, countries, and currencies, each adding a JOIN to your queries. For small, fixed sets of values, a PostgreSQL ENUM or a simple VARCHAR column is simpler and faster.
Good to know
Price columns in order tables should NOT reference the product's current price. Store the price at the time of purchase in the order_items table. If you normalize this away, changing a product's price retroactively changes every historical order. This is one case where intentional data duplication is correct.
Edge case
2NF violations are rare in practice because most modern schemas use a single auto-increment ID as the primary key rather than composite keys. But they show up in many-to-many junction tables. If your enrollment table has (student_id, course_id) as the primary key AND stores the instructor name, that instructor name depends only on course_id, a 2NF violation.