System Design - Strategic Denormalization

Create a free account to save your progress

Earn XP, track streaks, and sync your dashboard across devices.

Lesson

The previous lesson taught you to normalize: split data into clean, non-redundant tables. This lesson teaches you when to deliberately break those rules. DenormalizationWhat is denormalization?Intentionally duplicating data across tables to speed up read queries by avoiding expensive joins at query time. is the art of trading write complexity for read speed. Done right, it makes your application noticeably faster. Done wrong, it creates data consistency nightmares.

Why denormalize at all?

Imagine your e-commerce product page. To render it, you need data from five tables:

-- Normalized query: 4 JOINs
SELECT p.name, p.description, p.price,
       c.name AS category,
       b.name AS brand,
       AVG(r.rating) AS avg_rating,
       COUNT(r.id) AS review_count
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN brands b ON b.id = p.brand_id
LEFT JOIN reviews r ON r.product_id = p.id
WHERE p.id = 42
GROUP BY p.id, c.name, b.name;

This query is correct, clean, and normalized. It's also running on every single product page view. With 10 million page views per day, those JOINs add up. Each one requires the database to match rows across tables, sort, and aggregate.

DenormalizationWhat is denormalization?Intentionally duplicating data across tables to speed up read queries by avoiding expensive joins at query time. stores pre-computed results so you can skip the JOINs:

-- Denormalized: one table, no JOINs
SELECT name, description, price, category_name,
       brand_name, avg_rating, review_count
FROM products_denormalized
WHERE id = 42;

Same result, but instead of combining five tables at query time, the data is already combined at write time. The read goes from ~5ms to ~0.5ms. That matters at scale.

The read/write trade-off

DenormalizationWhat is denormalization?Intentionally duplicating data across tables to speed up read queries by avoiding expensive joins at query time. shifts work from read time to write time. Here's the fundamental trade-off:

Normalized:
  Read:  Slow (JOINs at query time)
  Write: Fast (update one row in one table)
  Risk:  Low (single source of truth)

Denormalized:
  Read:  Fast (pre-computed, no JOINs)
  Write: Slow (must update multiple places)
  Risk:  Higher (data can go out of sync)

This is only worth it when your application is read-heavy: meaning reads outnumber writes by 10:1, 100:1, or more. Most web applications are read-heavy: for every review someone posts, thousands of people read the product page.

DenormalizationWhat is denormalization?Intentionally duplicating data across tables to speed up read queries by avoiding expensive joins at query time. strategies

Strategy 1: Duplicate columns

Copy frequently-accessed data from related tables into the main table.

-- Instead of JOINing to get the category name every time:
ALTER TABLE products ADD COLUMN category_name TEXT;

-- When a product is created or updated:
UPDATE products
SET category_name = (SELECT name FROM categories WHERE id = products.category_id)
WHERE id = 42;

Danger: If you rename a category, you must update every product in that category. Forget to do it, and your data is inconsistent.

Strategy 2: Pre-computed aggregates

Store calculated values (counts, averages, sums) instead of computing them on every read.

-- Add aggregate columns to the products table
ALTER TABLE products
  ADD COLUMN review_count INTEGER DEFAULT 0,
  ADD COLUMN avg_rating NUMERIC(3,2) DEFAULT 0;

-- Update aggregates when a review is added
CREATE OR REPLACE FUNCTION update_product_stats()
RETURNS TRIGGER AS $
BEGIN
  UPDATE products
  SET review_count = (SELECT COUNT(*) FROM reviews WHERE product_id = NEW.product_id),
      avg_rating = (SELECT AVG(rating) FROM reviews WHERE product_id = NEW.product_id)
  WHERE id = NEW.product_id;
  RETURN NEW;
END;
$ LANGUAGE plpgsql;

CREATE TRIGGER after_review_insert
AFTER INSERT ON reviews
FOR EACH ROW EXECUTE FUNCTION update_product_stats();

Strategy 3: Materialized views

A materialized viewWhat is materialized view?A database object that stores the pre-computed result of a query on disk - only recomputes when you explicitly refresh it. is a query result that the database stores as a physical table. You get denormalized reads without manually maintaining the duplicated data.

-- Create a materialized view
CREATE MATERIALIZED VIEW product_summary AS
SELECT p.id, p.name, p.price,
       c.name AS category_name,
       b.name AS brand_name,
       COALESCE(AVG(r.rating), 0) AS avg_rating,
       COUNT(r.id) AS review_count
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN brands b ON b.id = p.brand_id
LEFT JOIN reviews r ON r.product_id = p.id
GROUP BY p.id, c.name, b.name;

-- Create an index on the materialized view
CREATE UNIQUE INDEX ON product_summary(id);

-- Reads are fast - it's just a table scan
SELECT * FROM product_summary WHERE id = 42;

-- Refresh when data changes (can be scheduled)
REFRESH MATERIALIZED VIEW CONCURRENTLY product_summary;

The trade-off is staleness: materialized views are only as fresh as the last refresh. For product pages, refreshing every 5 minutes is usually fine. For a live stock ticker, it's not.

Strategy 4: Read replicas

Instead of denormalizing your primary database, create a separate read-only copy with a different schemaWhat is schema?A formal definition of the structure your data must follow - which fields exist, what types they have, and which are required. optimized for queries.

Primary DB (normalized)          Read Replica (denormalized)
┌──────────────┐                 ┌──────────────────────┐
│ products     │   replication   │ product_search_view  │
│ categories   │ ─────────────→  │ (flat, indexed)      │
│ brands       │                 │                      │
│ reviews      │                 │ analytics_summary    │
└──────────────┘                 │ (pre-aggregated)     │
      ↑                          └──────────────────────┘
   writes                               ↑
                                      reads

This gives you the best of both worlds: writes go to the clean, normalized primary, and reads come from the optimized replica. The replica can be seconds behind, which is acceptable for most use cases.

Strategy 5: Application-level caching

Instead of denormalizing in the database, cache the assembled result in Redis or an in-memory cache.

async function getProduct(id: number) {
  // Check cache first
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);

  // Cache miss: run the full normalized query
  const product = await db.query(`
    SELECT p.*, c.name AS category_name, b.name AS brand_name,
           AVG(r.rating) AS avg_rating, COUNT(r.id) AS review_count
    FROM products p
    JOIN categories c ON c.id = p.category_id
    JOIN brands b ON b.id = p.brand_id
    LEFT JOIN reviews r ON r.product_id = p.id
    WHERE p.id = CODE_BLOCK
    GROUP BY p.id, c.name, b.name
  `, [id]);

  // Cache for 5 minutes
  await redis.setex(`product:${id}`, 300, JSON.stringify(product));
  return product;
}

This is often the safest starting point for denormalization. Your database stays clean, and cache invalidationWhat is cache invalidation?Removing or updating cached data when the original data changes, so users never see outdated information. is explicit.

Strategy comparison

Strategy	Consistency risk	Implementation effort	Best for
Duplicate columns	High, must update all copies	Low	Rarely-changing reference data
Pre-computed aggregates	Medium, triggers can fail	Medium	Counts, averages, sums
Materialized views	Low, database manages it	Low	Reporting, dashboards
Read replicas	Low, replication lag only	High (ops overhead)	Large-scale read offloading
Application cache	Medium, stale cache risk	Medium	Hot data, API responses

When to denormalize

DenormalizationWhat is denormalization?Intentionally duplicating data across tables to speed up read queries by avoiding expensive joins at query time. is medicine, not vitamins. Take it for a specific symptom, not as a general health practice.

Denormalize when:

Query profiling shows JOINs are the bottleneck (not missing indexes, not bad queries)
The read-to-write ratio is heavily skewed toward reads (100:1 or more)
The data changes infrequently (product catalog, user profiles)
You can accept slight staleness (a review count that's 5 minutes behind is fine)

Don't denormalize when:

You haven't measured the actual performance problem
The data changes frequently and consistency matters (financial transactions)
You're still in the early stages of your product (premature optimization)
A proper indexWhat is index?A data structure the database maintains alongside a table so it can find rows by specific columns quickly instead of scanning everything. would solve the same problem with zero trade-offs

-- Before denormalizing, try adding an index first
CREATE INDEX idx_reviews_product_id ON reviews(product_id);
-- This might make your JOIN fast enough without any denormalization

The mantra: normalize by default, index second, denormalize last.

AI pitfall

Ask AI to "optimize this slow query" and it will often suggest denormalization first. What AI gets wrong: it skips the cheaper fixes. A missing index, a poorly written WHERE clause, or an unnecessary SELECT * are far more common causes of slow queries than normalization overhead. Always run EXPLAIN ANALYZE before denormalizing.

Good to know

Materialized views are the safest form of denormalization because the database manages the duplication. If you forget to refresh them, the data is stale but never inconsistent, the underlying tables remain the source of truth. Start with materialized views before resorting to manual duplication.

Edge case

Database triggers for maintaining denormalized aggregates (like review_count) can cause cascading performance issues. If deleting a user deletes their reviews, which fires triggers to update product stats, which fires triggers to update category stats, one DELETE becomes hundreds of UPDATE operations. Always trace the trigger chain before deploying.

Done

Complete & Next