Database Normalization

Create a free account to save your progress

Earn XP, track streaks, and sync your dashboard across devices.

Lesson

NormalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. is the process of structuring a relational database to reduce redundancy and prevent data anomalies. It sounds academic, but it solves a very practical problem: when the same piece of data lives in multiple places, it will eventually get out of sync. Normalization gives you a systematic way to prevent that.

Why normalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. matters

Consider this table tracking orders at a coffee shop:

| order_id | customer_name | customer_email      | drink     | size   | price |
|----------|---------------|---------------------|-----------|--------|-------|
| 1        | Alice         | alice@mail.com      | Latte     | Large  | 5.50  |
| 2        | Alice         | alice@mail.com      | Espresso  | Small  | 3.00  |
| 3        | Bob           | bob@mail.com        | Latte     | Large  | 5.50  |
| 4        | Alice         | alice_new@mail.com  | Cappuccino| Medium | 4.50  |

Spot the problem? Alice's email appears three times, and in row 4 it's different from rows 1 and 2. Which one is correct? Nobody knows. This is an update anomaly: when you change data in one place but not everywhere, your database becomes inconsistent.

Other anomalies:

Insertion anomaly: You can't add a new drink to the menu without creating a fake order.
Deletion anomaly: If you delete Bob's only order, you lose the fact that Bob exists as a customer.

Normalization systematically eliminates these problems.

First normal form (1NF)

Rule: Every column holds a single, atomic value. No repeating groups, no arrays, no comma-separated lists.

Violating 1NF:

-- BAD: phone_numbers contains multiple values
CREATE TABLE customers (
  id INTEGER PRIMARY KEY,
  name TEXT,
  phone_numbers TEXT  -- stores "555-0101, 555-0202"
);

This looks fine until you need to find all customers with a specific phone number. You'd have to use string matching (LIKE '%555-0101%'), which is slow, error-prone, and can't use indexes.

Fixed in 1NF:

-- GOOD: separate table for phone numbers
CREATE TABLE customers (
  id INTEGER PRIMARY KEY,
  name TEXT
);

CREATE TABLE customer_phones (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(id),
  phone TEXT NOT NULL
);

-- Now you can query efficiently
SELECT c.name FROM customers c
JOIN customer_phones cp ON cp.customer_id = c.id
WHERE cp.phone = '555-0101';

Second normal form (2NF)

Rule: Must be in 1NF, and every non-key column depends on the entire primary key, not just part of it.

2NF only matters when you have a composite primary key (a primary key made of multiple columns). If your primary key is a single column, you're already in 2NF by definition.

Violating 2NF (composite key: student_id + course_id):

-- BAD: student_name depends only on student_id, not on (student_id, course_id)
CREATE TABLE enrollments (
  student_id INTEGER,
  course_id INTEGER,
  student_name TEXT,      -- depends only on student_id (partial dependency)
  course_name TEXT,       -- depends only on course_id (partial dependency)
  grade CHAR(1),          -- depends on both (full dependency)
  PRIMARY KEY (student_id, course_id)
);

The problem: if a student changes their name, you'd need to update every row where they're enrolled. Miss one, and you have inconsistent data.

Fixed in 2NF:

-- GOOD: each fact lives in one place
CREATE TABLE students (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE courses (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE enrollments (
  student_id INTEGER REFERENCES students(id),
  course_id INTEGER REFERENCES courses(id),
  grade CHAR(1),
  PRIMARY KEY (student_id, course_id)
);

Now student_name lives only in the students table. Update it once, and it's correct everywhere.

Third normal form (3NF)

Rule: Must be in 2NF, and no non-key column depends on another non-key column (no transitive dependencies).

Violating 3NF:

-- BAD: department_name depends on department_id, not directly on employee id
CREATE TABLE employees (
  id INTEGER PRIMARY KEY,
  name TEXT,
  department_id INTEGER,
  department_name TEXT     -- depends on department_id, not on id
);

The department_name depends on department_id, which depends on the employee id. That's a transitive dependencyWhat is dependency?A piece of code written by someone else that your project needs to work. Think of it as a building block you import instead of writing yourself.: id -> department_id -> department_name. If the Engineering department renames to "Platform Engineering," you need to update every employee row.

Fixed in 3NF:

-- GOOD: department info lives in its own table
CREATE TABLE departments (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);

CREATE TABLE employees (
  id INTEGER PRIMARY KEY,
  name TEXT,
  department_id INTEGER REFERENCES departments(id)
);

-- To get an employee with their department name:
SELECT e.name, d.name AS department
FROM employees e
JOIN departments d ON d.id = e.department_id;

NormalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. example: step by step

Here's raw, unnormalized data for a bookstore:

order_id	customer	email	book_title	author	author_country	qty	price
1	Alice	[email protected]	Clean Code	Robert Martin	USA	1	35
1	Alice	[email protected]	Refactoring	Martin Fowler	UK	2	45
2	Bob	[email protected]	Clean Code	Robert Martin	USA	1	35

After 1NF: already there (all values are atomic, no lists).

After 2NF: split out partial dependencies. The composite key is (order_id, book_title). customer and email depend only on order_id. author and author_country depend only on book_title.

orders
order_id (PK)	customer	email
1	Alice	[email protected]
2	Bob	[email protected]

books
title (PK)	author	author_country
Clean Code	Robert Martin	USA
Refactoring	Martin Fowler	UK

order_items
order_id (FK)	book_title (FK)	qty	price
1	Clean Code	1	35
1	Refactoring	2	45
2	Clean Code	1	35

After 3NF: author_country depends on author, not on book_title (transitive). Extract authors.

authors
name (PK)	country
Robert Martin	USA
Martin Fowler	UK

books (updated)
title (PK)	author (FK)
Clean Code	Robert Martin
Refactoring	Martin Fowler

Now every fact lives in exactly one place. Change an author's country? One row. Rename a customer? One row. No anomalies.

When normalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. hurts

Normalization is not free. Every JOINWhat is join?A SQL operation that combines rows from two or more tables based on a shared column, letting you query related data in one request. adds latencyWhat is latency?The time delay between sending a request and receiving the first byte of the response, usually measured in milliseconds.. In read-heavy systems, the cost of joining five tables on every page load can become significant.

Scenario	Normalization impact	Better approach
OLTP (transactional)	Great, prevents anomalies, writes are clean	Normalize to 3NF
OLAP (analytics/reporting)	Painful, JOINs across millions of rows are slow	Denormalize or use star schema
Read-heavy APIs	JOINs add latency to every request	Cache or denormalize hot paths
Write-heavy logging	Fine, writes are simple inserts	Normalize is fine
Microservices	Each service owns its data, normalize within a service	Denormalize across service boundaries

The rule of thumb: normalize by default, denormalize with intention. Start with 3NF and only break it when you have measured performance data showing that JOINs are the bottleneck. The next lesson covers exactly when and how to do that.

Quick reference: normal forms

Normal Form	Rule	Eliminates
1NF	Atomic values, no repeating groups	Repeated data in cells
2NF	No partial dependencies on composite keys	Redundancy from partial key matches
3NF	No transitive dependencies	Redundancy from indirect dependencies
BCNF	Every determinant is a candidate key	Edge cases 3NF misses (rare)

In practice, reaching 3NF is enough for the vast majority of applications. BCNF (Boyce-Codd Normal Form) handles edge cases you'll rarely encounter.

AI pitfall

AI-generated schemas tend to over-normalize. Ask it to design a database for an e-commerce app and it will create separate lookup tables for statuses, countries, and currencies, each adding a JOIN to your queries. For small, fixed sets of values, a PostgreSQL ENUM or a simple VARCHAR column is simpler and faster.

Good to know

Price columns in order tables should NOT reference the product's current price. Store the price at the time of purchase in the order_items table. If you normalize this away, changing a product's price retroactively changes every historical order. This is one case where intentional data duplication is correct.

Edge case

2NF violations are rare in practice because most modern schemas use a single auto-increment ID as the primary key rather than composite keys. But they show up in many-to-many junction tables. If your enrollment table has (student_id, course_id) as the primary key AND stores the instructor name, that instructor name depends only on course_id, a 2NF violation.

Done

Complete & Next

Create a free account to save your progress

Essential to know

Why normalizationWhat is normalization?Organizing database tables to reduce duplicated data by splitting information into related tables connected by foreign keys. Ask AI for more matters