System Design - Cache Invalidation

There is an old joke in computer science: the two hardest problems are cache invalidation, naming things, and off-by-one errors. The joke is funny because cache invalidation really is that hard. Storing data in a cache is easy. The challenge is knowing when to remove or update it without serving stale data, without killing your database, and without adding so much complexity that you wish you had never cached anything.

Every invalidation strategy makes a tradeoff between freshness, performance, and complexity. There is no single correct answer. The right strategy depends on how stale your data can be, how often it changes, and how painful a stale read is for your users.

TTL-based invalidation

TTL (Time to Live) is the simplest invalidation strategy: every cached entry has an expiration time, and when it expires, it is automatically removed. The next request triggers a cache miss and fetches fresh data from the origin (the database or service that owns the data).

// TTL-based: set it and forget it
await redis.setex('product:42', 300, JSON.stringify(product)); // Expires in 5 minutes

// After 5 minutes, redis.get('product:42') returns null → cache miss → fresh fetch

TTL is the right default when you can tolerate bounded staleness. If your product catalog updates a few times a day and users can live with data that is up to 5 minutes old, a 5-minute TTL is simple, reliable, and requires zero extra infrastructure.

The problem with TTL-only invalidation is the window of staleness. If a product's price changes from $10 to $15, users might see $10 for up to 5 minutes. For some data (blog posts, product descriptions), this is fine. For other data (inventory counts, prices during a flash sale), it is not.
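
To make the staleness window concrete, the sketch below models a TTL cache in memory with an injectable clock. The `TtlCache` class, its `now` parameter, and the `database` object are hypothetical stand-ins for Redis, wall-clock time, and the real data store, used only so the behavior can be observed without a server.

```javascript
// Minimal in-memory model of TTL expiry (a stand-in for Redis SETEX/GET).
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;           // injectable clock, so time can be simulated
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlSeconds) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: behave like a cache miss
      return null;
    }
    return entry.value;
  }
}

// Simulated timeline: the price changes in the database 1 minute after caching.
let clock = 0;
const cache = new TtlCache(() => clock);
const database = { price: 10 };

cache.set('product:42', { ...database }, 300); // cached at T=0 with a 5-minute TTL
clock = 60_000;
database.price = 15;                           // source of truth changes at T=60s

clock = 299_000;
const staleRead = cache.get('product:42');     // still the old price at T=299s
clock = 300_000;
const afterExpiry = cache.get('product:42');   // null at T=300s: next read refetches
```

In this timeline the cache keeps answering with the $10 price for four minutes after the database moved to $15, which is the bounded staleness the TTL trades for simplicity.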

Event-based invalidation

Event-based invalidation removes or updates the cache immediately when the underlying data changes. Instead of waiting for the TTL to expire, you actively tell the cache "this data is no longer valid."

// When a product is updated, invalidate its cache
async function updateProduct(productId, updates) {
  await db.query('UPDATE products SET price = ? WHERE id = ?',
    [updates.price, productId]);

  // Immediately invalidate cache
  await redis.del(`product:${productId}`);

  // Also invalidate any list that contains this product
  await redis.del(`category:${updates.categoryId}:products`);
  await redis.del('featured-products');
}

This gives you much better freshness than TTL alone, but it introduces a coordination problem: every code path that modifies data must also invalidate the corresponding cache keys. Miss one, and you have stale data. As your system grows, tracking which cache keys depend on which data becomes increasingly difficult.
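
One way to contain this coordination problem is to keep the mapping from an entity to its dependent cache keys in a single function, so that every write path calls the same code. The sketch below reuses the key names from the example above; `cacheKeysForProduct` and `invalidateProduct` are hypothetical helpers, and `cache` is assumed to be any client that exposes an async `del(key)` method.

```javascript
// Single source of truth for which cache keys depend on a product.
function cacheKeysForProduct(product) {
  const keys = [`product:${product.id}`, 'featured-products'];
  if (product.categoryId !== undefined) {
    keys.push(`category:${product.categoryId}:products`);
  }
  return keys;
}

// Every write path calls this instead of hand-listing keys.
async function invalidateProduct(cache, product) {
  const keys = cacheKeysForProduct(product);
  await Promise.all(keys.map((key) => cache.del(key)));
  return keys; // returned so callers can log what was invalidated
}
```

When a new list starts caching products, only `cacheKeysForProduct` needs to change, rather than every function that writes to the products table.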

A more scalable approach is to use a publish/subscribe mechanism:

// Publisher: broadcast that a product changed
async function updateProduct(productId, updates) {
  await db.query('UPDATE products SET price = ? WHERE id = ?',
    [updates.price, productId]);

  // Publish event - all subscribers will clear their caches
  await redis.publish('product:updated', JSON.stringify({ productId }));
}

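// Subscriber: listen for changes and invalidate
// A connection in subscriber mode cannot run regular commands,
// so `subscriber` only listens and the separate `redis` connection deletes
const subscriber = new Redis();
subscriber.subscribe('product:updated');
subscriber.on('message', (channel, message) => {
  const { productId } = JSON.parse(message);
  redis.del(`product:${productId}`);

  // DEL does not expand glob patterns, so look up matching keys with SCAN
  // (scanStream is ioredis-specific) and delete them in batches
  const stream = redis.scanStream({ match: 'product-list:*' });
  stream.on('data', (keys) => {
    if (keys.length > 0) redis.del(...keys);
  });
});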
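
The fan-out behavior can be sketched without Redis by using Node's built-in `EventEmitter` as a stand-in for the pub/sub channel. Here each `Map` plays the role of one app server's local cache; the names `bus`, `createServerCache`, `serverA`, and `serverB` are illustrative only.

```javascript
const { EventEmitter } = require('node:events');

const bus = new EventEmitter(); // stand-in for the Redis pub/sub channel

// Each app server keeps its own local cache and subscribes to change events.
function createServerCache() {
  const local = new Map();
  bus.on('product:updated', ({ productId }) => {
    local.delete(`product:${productId}`);
  });
  return local;
}

const serverA = createServerCache();
const serverB = createServerCache();
serverA.set('product:42', { price: 10 });
serverB.set('product:42', { price: 10 });
serverB.set('product:43', { price: 20 });

// One publish reaches every subscriber; unrelated keys are untouched.
bus.emit('product:updated', { productId: 42 });
```

Unlike this in-process emitter, Redis pub/sub does not store messages: a subscriber that is disconnected when an event is published never receives it, so keeping a TTL on the cached entries is still a useful backstop.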

Versioned keys

Versioned keys solve the problem of invalidating groups of related cache entries. Instead of tracking and deleting every individual key, you embed a version number in the key. To invalidate, you increment the version: all old keys become unreachable and eventually expire via TTL.

// Store version number in Redis
// Current version: 7
await redis.set('catalog:version', '7');

// Cache keys include the version
async function getProduct(productId) {
  const version = await redis.get('catalog:version');
  const cacheKey = `catalog:v${version}:product:${productId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const product = await db.query('SELECT * FROM products WHERE id = ?', [productId]);
  await redis.setex(cacheKey, 3600, JSON.stringify(product)); // 1 hour TTL
  return product;
}

// To invalidate the entire catalog: just bump the version
async function invalidateCatalog() {
  await redis.incr('catalog:version');
  // All old v7 keys are now orphaned and will expire via TTL
  // New requests will create v8 keys with fresh data
}

This is elegant for bulk invalidation (one version bump invalidates everything) but wasteful if only a single item changed, because the entire cache is effectively cold after a version bump.
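
One way to soften this tradeoff is to keep a version per group rather than one for the whole catalog, so that a bump only cools the affected group. The sketch below is an in-memory illustration under that assumption; `VersionedKeys`, the `scope` argument, and the `category:7` naming are hypothetical and not part of the example above. In Redis, each scope's version would live in its own counter key.

```javascript
class VersionedKeys {
  constructor() {
    this.versions = new Map(); // scope -> current version number
  }

  current(scope) {
    return this.versions.get(scope) ?? 1;
  }

  // Build the cache key for an item under the scope's current version.
  key(scope, id) {
    return `${scope}:v${this.current(scope)}:product:${id}`;
  }

  // Invalidate one scope; keys built for other scopes are unaffected.
  bump(scope) {
    this.versions.set(scope, this.current(scope) + 1);
  }
}

const keys = new VersionedKeys();
const before = keys.key('category:7', 42); // "category:7:v1:product:42"
keys.bump('category:7');
const after = keys.key('category:7', 42);  // "category:7:v2:product:42"
const other = keys.key('category:9', 99);  // "category:9:v1:product:99"
```

The finer the scope, the fewer entries go cold on each bump, at the cost of one extra version lookup per scope.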

Cache stampede (thundering herd)

Cache stampede is what happens when a popular cache key expires and many concurrent requests all experience a cache miss at the same time. All of them query the database simultaneously, potentially overwhelming it.

Popular key expires at T=300s
T=300.001s: Request A → cache miss → query DB
T=300.002s: Request B → cache miss → query DB
T=300.003s: Request C → cache miss → query DB
... 500 more requests → 500 DB queries for the same data
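
The timeline above can be reproduced in a few lines. This is a simulation only: the `Map` stands in for the cache, `queryDatabase` is a hypothetical slow query that counts how often it is called, and the 20 ms delay is an arbitrary assumption.

```javascript
// Simulate concurrent requests hitting an empty cache at the same moment.
const cache = new Map();
let dbQueries = 0;

async function queryDatabase(id) {
  dbQueries += 1;
  await new Promise((resolve) => setTimeout(resolve, 20)); // pretend the query is slow
  return { id, price: 10 };
}

// Naive cache-aside read: check the cache, fall through to the DB on a miss.
async function getProduct(id) {
  const key = `product:${id}`;
  if (cache.has(key)) return cache.get(key);
  const product = await queryDatabase(id);
  cache.set(key, product);
  return product;
}

// Fire `concurrency` requests at once and report how many reached the DB.
async function simulateStampede(concurrency) {
  cache.clear();
  dbQueries = 0;
  await Promise.all(Array.from({ length: concurrency }, () => getProduct(42)));
  return dbQueries;
}
```

Calling `simulateStampede(50)` resolves to 50: every request sees the empty cache before the first query finishes, so each one issues its own query for the same row.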

This is not a theoretical problem. A single popular key expiring during a traffic spike can bring down your database. There are three main solutions.

Solution 1: Mutex lock (lock and wait)

Only one request is allowed to repopulate the cache. All other requests wait for that one to finish.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function getProductWithLock(productId, retriesLeft = 20) {
  const cacheKey = `product:${productId}`;
  const lockKey = `lock:${cacheKey}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Try to acquire lock (NX = set only if the key does not exist, EX = expire in 5s)
  const acquired = await redis.set(lockKey, '1', 'EX', 5, 'NX');

  if (acquired) {
    try {
      // This request won the lock - fetch from DB and populate cache
      const product = await db.query('SELECT * FROM products WHERE id = ?', [productId]);
      await redis.setex(cacheKey, 300, JSON.stringify(product));
      return product;
    } finally {
      // Release the lock even if the query or the cache write throws
      await redis.del(lockKey);
    }
  }

  if (retriesLeft <= 0) {
    throw new Error(`Timed out waiting for cache fill of ${cacheKey}`);
  }

  // Another request is already fetching - wait and retry (bounded)
  await sleep(50);
  return getProductWithLock(productId, retriesLeft - 1);
}
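
Within a single Node process, the same "one fetcher, everyone else waits" idea can be expressed by sharing one in-flight promise per key, which avoids the sleep-and-retry loop. This is a complementary sketch rather than a replacement for the Redis lock, since it does nothing across processes. `singleFlight`, `inFlight`, and `loader` are hypothetical names.

```javascript
const inFlight = new Map(); // key -> promise for the fetch currently running

// Run `loader` at most once per key at a time; concurrent callers share the result.
function singleFlight(key, loader) {
  if (inFlight.has(key)) return inFlight.get(key);

  const promise = Promise.resolve()
    .then(() => loader())
    .finally(() => inFlight.delete(key)); // allow a fresh fetch next time
  inFlight.set(key, promise);
  return promise;
}
```

A caller would wrap the database fetch, for example `singleFlight('product:42', () => loadProductFromDb(42))`, where `loadProductFromDb` is whatever function performs the query and cache write. If the loader rejects, every waiting caller receives the same rejection and the next call starts a new fetch.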

Solution 2: Stale-while-revalidate

Serve the stale data immediately while refreshing the cache in the background. The user gets a fast response (possibly slightly stale), and the cache is updated for the next request.

async function getProductSWR(productId) {
  const cacheKey = `product:${productId}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    const { data, expiresAt, staleAt } = JSON.parse(cached);

    if (Date.now() < staleAt) {
      return data; // Fresh - serve as-is
    }

    if (Date.now() < expiresAt) {
      // Stale but not expired - serve stale, refresh in background
      refreshCache(productId, cacheKey); // fire-and-forget
      return data;
    }
  }

  // Fully expired or missing - must fetch synchronously
  return fetchAndCache(productId, cacheKey);
}

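// Fetch from the DB, write a fresh entry, and return the product
async function fetchAndCache(productId, cacheKey) {
  const product = await db.query('SELECT * FROM products WHERE id = ?', [productId]);
  const entry = {
    data: product,
    ttl: 300_000,                     // Logical lifetime in ms
    staleAt: Date.now() + 240_000,    // Fresh for 4 minutes
    expiresAt: Date.now() + 300_000,  // Stale-but-servable for 1 more minute
  };
  // The Redis TTL only needs to outlast expiresAt; 10 minutes leaves headroom
  await redis.setex(cacheKey, 600, JSON.stringify(entry));
  return product;
}

// Background refresh: same work, but failures are logged instead of thrown
function refreshCache(productId, cacheKey) {
  return fetchAndCache(productId, cacheKey).catch((err) => console.error(err));
}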
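
The three-way decision in `getProductSWR` depends only on the two timestamps stored with the entry, so it can be isolated as a pure function. The sketch below is an illustration; `entryState` and its return labels are hypothetical names, and the timestamps reuse the 4-minute and 5-minute values from the example above.

```javascript
// Classify a cache entry using its staleAt and expiresAt timestamps (ms).
function entryState(entry, now = Date.now()) {
  if (!entry) return 'missing';
  if (now < entry.staleAt) return 'fresh';   // serve as-is
  if (now < entry.expiresAt) return 'stale'; // serve, then refresh in background
  return 'expired';                          // must fetch before responding
}

// An entry written at T=0: fresh for 4 minutes, servable for 5.
const entry = { data: { price: 10 }, staleAt: 240_000, expiresAt: 300_000 };
const atTwoMinutes = entryState(entry, 120_000);        // 'fresh'
const atFourAndAHalf = entryState(entry, 270_000);      // 'stale'
const atFiveMinutes = entryState(entry, 300_000);       // 'expired'
```

Only the one-minute stale window triggers background work; inside the fresh window the cache behaves like a plain TTL cache.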

Solution 3: Probabilistic early expiration

Each request has a small random chance of refreshing the cache before it expires. The closer to expiration, the higher the probability. This spreads the refresh load over time instead of concentrating it at one moment.

async function getProductProbabilistic(productId) {
  const cacheKey = `product:${productId}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    // Assumes the entry stores its logical TTL in ms; falls back to 5 minutes if absent
    const { data, expiresAt, ttl = 300_000 } = JSON.parse(cached);
    const remaining = expiresAt - Date.now();

    if (remaining > 0) {
      // Approaches 1 as expiry nears; about 37% when 10% of the TTL remains
      const probability = Math.exp(-remaining / (ttl * 0.1));

      if (Math.random() < probability) {
        // "Won the lottery" - proactively refresh in the background
        refreshCache(productId, cacheKey).catch(console.error);
      }
      return data;
    }
  }

  // Expired or missing - fetch synchronously
  return fetchAndCache(productId, cacheKey);
}
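
To see how sharply the refresh probability rises, the formula from the example can be evaluated at a few points. The sketch below assumes a 5-minute TTL; `refreshProbability` is a hypothetical helper that wraps the same expression and treats an already-expired entry as probability 1.

```javascript
// Refresh probability from the example: exp(-remaining / (ttl * 0.1)).
function refreshProbability(remainingMs, ttlMs) {
  if (remainingMs <= 0) return 1; // already expired: always refresh
  return Math.exp(-remainingMs / (ttlMs * 0.1));
}

const ttlMs = 300_000; // 5-minute TTL
const samples = [300_000, 150_000, 60_000, 30_000, 5_000].map((remainingMs) => ({
  remainingSeconds: remainingMs / 1000,
  probability: refreshProbability(remainingMs, ttlMs),
}));
```

With these numbers the per-request probability is roughly 0.005% right after a write, under 1% at the halfway point, about 13.5% with 60 seconds left, and about 37% with 30 seconds left. The `0.1` factor controls how late the curve takes off: a larger factor starts refreshing earlier.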

Invalidation strategies comparison

Strategy	Freshness	Complexity	Stampede protection	Best for
TTL only	Bounded staleness (up to TTL)	Very low	None	Data where brief staleness is acceptable
Event-based (delete on write)	Near-immediate	Medium	None (still need stampede protection)	Data that changes infrequently but must be fresh
Event-based (pub/sub)	Near-immediate	High	None	Distributed systems with multiple cache instances
Versioned keys	Immediate (bulk)	Medium	None	Bulk invalidation of related data
Mutex lock	Fresh (single fetcher)	Medium	Yes	Popular keys with expensive DB queries
Stale-while-revalidate	Slightly stale during refresh	Medium	Yes (serves stale)	High-traffic endpoints where some staleness is okay
Probabilistic early expiry	Mostly fresh	High	Yes (spreads load)	Very high traffic, keys with predictable access patterns

Practical advice

Start with TTL-based invalidation. It covers most cases and adds zero complexity. Layer event-based invalidation on top for data where staleness causes real user-facing problems (prices, inventory). Add stampede protection only for keys that receive enough concurrent traffic to actually cause a stampede; most keys in most systems will never have this problem.

The biggest mistake teams make with invalidation is over-engineering it from day one. A simple 5-minute TTL with delete-on-write for critical keys is more reliable than a complex pub/sub invalidation system that nobody fully understands.

AI pitfall

AI assistants often suggest event-based invalidation as the "correct" approach without considering the complexity it adds. What they get wrong: event-based invalidation requires a reliable pub/sub system, subscriber management, and error handling for missed events. For most apps, TTL-based invalidation with a short TTL (30 seconds to 5 minutes) delivers most of the benefit at a fraction of the complexity.

Edge case

Cache stampede protection (mutex locks) can itself fail under load. If your lock implementation uses Redis SET NX with a 5-second expiry and the database query takes 6 seconds, the lock expires while the query is still running. A second request then acquires the lock and runs the same query, which is exactly the duplication the lock was meant to prevent. Use a lock timeout that is at least 2x your worst-case query time.
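
A related hazard in this scenario is that the first request, on finishing, deletes a lock that by then belongs to the second. A common mitigation is to store a unique token as the lock value and release only if the token still matches. The sketch below models that with an in-memory map and a simulated clock; `InMemoryLock` is hypothetical, and in Redis the compare-and-delete step would need to run atomically (for example in a server-side script).

```javascript
const { randomUUID } = require('node:crypto');

class InMemoryLock {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.locks = new Map(); // key -> { token, expiresAt }
  }

  // Returns a token on success, or null if someone else holds a live lock.
  acquire(key, ttlMs) {
    const held = this.locks.get(key);
    if (held && this.now() < held.expiresAt) return null;
    const token = randomUUID();
    this.locks.set(key, { token, expiresAt: this.now() + ttlMs });
    return token;
  }

  // Release only if the caller still owns the lock.
  release(key, token) {
    const held = this.locks.get(key);
    if (!held || held.token !== token) return false;
    this.locks.delete(key);
    return true;
  }
}

// Timeline from the scenario: 5-second lock, 6-second query.
let clock = 0;
const lock = new InMemoryLock(() => clock);
const first = lock.acquire('lock:product:42', 5000);
clock = 5500;                                         // first lock has expired
const second = lock.acquire('lock:product:42', 5000); // second request takes over
clock = 6000;                                         // first query finally returns
const firstReleased = lock.release('lock:product:42', first);   // false: no longer the owner
const secondReleased = lock.release('lock:product:42', second); // true
```

The token does not stop the duplicate query, which is what the longer timeout is for, but it keeps a slow request from releasing a lock it no longer holds.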

Good to know

"Stale-while-revalidate" is one of the most underused caching strategies. It serves the stale cached value immediately while refreshing the cache in the background. The user gets a fast response, and the data is fresh for the next request. The HTTP header Cache-Control: max-age=300, stale-while-revalidate=60 enables the same behavior in browsers and CDNs that support the directive, with no caching logic in your application.

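As a minimal sketch of the HTTP variant, the handler below sets the header on a response using the `(req, res)` shape of Node's built-in `http` module. The `cacheControl` helper, the `handleProductRequest` name, and the response body are hypothetical; the 300 and 60 second values are examples, not recommendations.

```javascript
// Build a Cache-Control value with a freshness window plus a stale-serving window.
function cacheControl({ maxAge, staleWhileRevalidate }) {
  return `max-age=${maxAge}, stale-while-revalidate=${staleWhileRevalidate}`;
}

// Hypothetical handler: responses are fresh for 5 minutes, then may be served
// stale for up to 60 more seconds while the cache revalidates in the background.
function handleProductRequest(req, res) {
  res.setHeader('Cache-Control', cacheControl({ maxAge: 300, staleWhileRevalidate: 60 }));
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify({ id: 42, price: 10 }));
}
```

Passing `handleProductRequest` to `http.createServer` is enough to serve it; the revalidation itself is then done by whichever browser or CDN sits in front of the server.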