Scaling Apps: Caching, CDN, Load Balancing

Your app works with 100 users. What happens at 1 million? Scaling is how systems handle growth without breaking. It is not just about bigger computers, but about smarter architecture.

Vertical vs horizontal scaling

Vertical scaling: upgrade your computer

Adding more power (CPU, RAM, storage) to a single server. Simple and usually requires no code changes, but it hits hardware limits, gets disproportionately expensive at the high end, and leaves you with a single point of failure.

Best for: Small applications, databases that can't be easily distributed.

Horizontal scaling: add more computers

Adding more servers and distributing work among them. Virtually unlimited scale, cost-effective, and resilient, but more complex, requiring load balancing and data sync across servers.

Best for: High-traffic applications, variable traffic patterns (Black Friday, viral content).

The ceiling problem

Vertical scaling hits a ceiling: you can't buy infinite RAM. That's why major tech companies rely on horizontal scaling.

Caching: saving work for later

Caching stores copies of expensive-to-compute data so you don't recompute it.

Browser cache

Your browser downloads files on first visit, then reuses them. Cache headers control this. For example, Cache-Control: max-age=3600 lets the browser reuse a file for 1 hour. Tradeoff: cached files might be outdated.
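The max-age rule above can be sketched in a few lines. This is only an illustration: the helper names cache_control and is_fresh are made up for this example, and real browsers apply more rules than the single freshness check shown here.

```python
def cache_control(max_age_seconds: int, public: bool = True) -> str:
    """Build a Cache-Control header value such as 'public, max-age=3600'."""
    if max_age_seconds < 0:
        raise ValueError("max_age_seconds must be non-negative")
    scope = "public" if public else "private"
    return f"{scope}, max-age={max_age_seconds}"


def is_fresh(age_seconds: float, max_age_seconds: int) -> bool:
    """A cached copy may be reused while its age is below max-age."""
    return age_seconds < max_age_seconds


# Headers a server might attach to a static file (hypothetical response).
headers = {"Cache-Control": cache_control(3600)}

# 30 minutes after download the copy is still reusable; after an hour it is not.
still_fresh = is_fresh(age_seconds=1800, max_age_seconds=3600)
expired = not is_fresh(age_seconds=3600, max_age_seconds=3600)
```

A longer max-age means fewer downloads but a longer window in which users may see an outdated file.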

CDN cache

A Content Delivery Network is a set of servers spread across the globe. Instead of one server in New York serving everyone, you have edge servers in each region. A Tokyo user gets content from Tokyo, not New York, which reduces latency, origin server load, and bandwidth costs. Common CDNs: Cloudflare, AWS CloudFront, Fastly, Akamai.

Edge computing

Modern CDNs can also run code at the edge. Cloudflare Workers and AWS Lambda@Edge execute logic (personalization, auth, A/B testing) without hitting your main servers.

Application cache

Store frequently accessed data in server memory instead of querying the database every time.

Without cache:
User requests profile → Query database (50ms) → Return data

With cache:
User requests profile → Check cache (1ms) → Return data

Strategy	How it works	Best for
Cache-aside	Check cache first, fetch from DB if missing	Read-heavy workloads
Write-through	Write to cache and DB simultaneously	Data consistency critical
Write-behind	Write to cache, async write to DB	High write throughput
TTL (Time To Live)	Auto-expire cache after set time	Most common approach
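As a rough sketch, cache-aside combined with a TTL might look like the following. TTLCache, get_profile, and load_from_db are hypothetical names invented for this example; a production system would more likely use a shared cache such as Redis or Memcached rather than an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        """Call this after updating the underlying record."""
        self._entries.pop(key, None)


def get_profile(user_id: int, cache: TTLCache,
                load_from_db: Callable[[int], dict]) -> dict:
    """Cache-aside: check the cache first, fall back to the database on a miss."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    profile = load_from_db(user_id)  # the slow path
    cache.set(key, profile)
    return profile
```

The invalidate method is the part that is easy to forget: any code path that updates a profile has to call it, otherwise readers keep getting the old value until the TTL expires.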

Cache invalidation is notoriously hard. Update a user's profile but forget to clear their cache? They see stale data. This is one of the "two hard things in computer science" (along with naming things and off-by-one errors).

Load balancing: distributing the work

With multiple servers, a load balancer decides which one handles each request.

Round Robin: Take turns across servers. Least Connections: Send to the server with fewest active connections. IP Hash: Same user always hits the same server. Geographic: Route to the nearest server.

Most companies use software load balancers (NGINX, HAProxy) or cloud-managed ones (AWS ELB). Good load balancers also run health checks: if a server stops responding, traffic routes around it automatically.
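To make the selection strategies concrete, here is a minimal sketch of round robin, least connections, and IP hash, each skipping servers marked unhealthy. The Server and LoadBalancer classes are illustrative only and do not reflect how NGINX, HAProxy, or AWS ELB are implemented; geographic routing is left out because it needs location data.

```python
import hashlib
import itertools
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    active_connections: int = 0
    healthy: bool = True  # would be updated by periodic health checks


class LoadBalancer:
    def __init__(self, servers: List[Server]) -> None:
        self._servers = servers
        self._counter = itertools.count()

    def _healthy(self) -> List[Server]:
        alive = [s for s in self._servers if s.healthy]
        if not alive:
            raise RuntimeError("no healthy servers available")
        return alive

    def round_robin(self) -> Server:
        """Take turns across the healthy servers."""
        alive = self._healthy()
        return alive[next(self._counter) % len(alive)]

    def least_connections(self) -> Server:
        """Pick the healthy server with the fewest active connections."""
        return min(self._healthy(), key=lambda s: s.active_connections)

    def ip_hash(self, client_ip: str) -> Server:
        """Map a client IP to the same server while the healthy set is unchanged."""
        alive = self._healthy()
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return alive[int.from_bytes(digest[:8], "big") % len(alive)]
```

Note that in this sketch an IP hash mapping shifts whenever a server joins or leaves the healthy set, which is one reason sticky routing is harder than it first looks.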

Database scaling

Databases are usually the bottleneck because data needs to stay consistent.

Read replicas

Most apps read far more than they write. Create copies of the database for reading: writes go to the primary, and reads go to the replicas.

Write: App → Primary DB → Replicates to → Replica 1, Replica 2, Replica 3
Read:   App → Replica 1 or Replica 2 or Replica 3

Tradeoff: Replication lag (often around 100ms-1s, but it varies with load and setup). A comment you just posted might not appear immediately on refresh.
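The routing in the diagram above could be sketched as follows. ReplicatedDatabase and the read_your_writes flag are assumptions made for this example, and the SELECT check is deliberately naive; real drivers and proxies make this decision in more robust ways.

```python
import itertools
from typing import List


class ReplicatedDatabase:
    """Chooses which database node should receive a statement."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, read_your_writes: bool = False) -> str:
        # Naive classification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if not is_read or read_your_writes or self._replica_cycle is None:
            return self.primary
        return next(self._replica_cycle)  # spread reads across replicas


db = ReplicatedDatabase("primary", ["replica-1", "replica-2", "replica-3"])
write_target = db.route("INSERT INTO comments (body) VALUES ('hi')")
read_target = db.route("SELECT body FROM comments")
```

Passing read_your_writes=True sends a read to the primary, which is one way to avoid the missing-comment problem right after a user posts, at the cost of extra load on the primary.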

Sharding

Split data across multiple databases: Users A-M go to Database 1, Users N-Z to Database 2. Each handles less load, but cross-shard queries become complicated and rebalancing is hard.
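A small sketch of the A-M / N-Z split described above. The function names and the database-1 / database-2 labels are placeholders. The hash-based variant is not part of the lesson; it is included only as a common alternative for comparison.

```python
import hashlib


def shard_by_range(username: str) -> str:
    """Range sharding: names starting with A-M go to database-1, the rest to database-2."""
    if not username:
        raise ValueError("username must not be empty")
    first = username[0].upper()
    # Anything outside A-M (including digits and symbols) falls to database-2 here.
    return "database-1" if "A" <= first <= "M" else "database-2"


def shard_by_hash(user_id: str, shard_count: int) -> int:
    """Hash sharding (an alternative): spreads keys evenly regardless of spelling."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


alice_shard = shard_by_range("alice")
nina_shard = shard_by_range("nina")
```

Range sharding keeps related keys together but can produce uneven shards if many names share a few starting letters. With simple modulo hashing, changing shard_count remaps most keys, which illustrates why rebalancing is hard.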

Optimize before you scale

Database indexing: like a book's index, a database index lets the engine jump directly to the right data instead of scanning every row. Query optimization: don't fetch 10,000 rows when you need 10. Connection pooling: reuse database connections instead of opening new ones.
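One way to see the effect of an index is to ask the database for its query plan before and after creating one. The sketch below uses Python's built-in sqlite3 module with a made-up users table. The exact plan wording differs between SQLite versions, so the code only captures the text rather than asserting a specific string.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user500@example.com",)

plan_before = query_plan(conn, lookup, args)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)   # with the index: a direct search

# Query optimization: ask for only the rows you need.
first_ten = conn.execute("SELECT name FROM users ORDER BY id LIMIT 10").fetchall()
```

Printing plan_before and plan_after side by side is a quick check that a hot query actually uses the index you created for it.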

The 80/20 rule

80% of database load often comes from 20% of queries. Fix the hot queries before scaling infrastructure.

Auto-scaling

Auto-scaling automatically adds or removes servers based on demand. It monitors metrics (CPU, memory, request count), triggers at thresholds ("if CPU > 70% for 5 minutes, add a server"), and scales back down when quiet to save money.

Types: Reactive (respond to current load), Predictive (ML-based anticipation), Scheduled (pre-planned for known events).
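A reactive rule like "if CPU > 70% for 5 minutes, add a server" can be sketched as a small decision function. Everything here is an assumption for illustration: the function name, the one-sample-per-minute window, the 30% scale-down threshold, and the server limits. Cloud auto-scalers expose similar ideas through their own configuration rather than code like this.

```python
from typing import Sequence


def desired_server_count(
    current: int,
    cpu_samples: Sequence[float],   # e.g. one CPU % reading per minute over 5 minutes
    scale_up_at: float = 70.0,
    scale_down_at: float = 30.0,    # assumed value, not from the lesson
    min_servers: int = 1,
    max_servers: int = 10,
) -> int:
    """Return how many servers should be running after looking at recent CPU usage."""
    if not cpu_samples:
        return current
    # Scale up only if every sample in the window is above the threshold,
    # so a single brief spike does not trigger a new server.
    if all(sample > scale_up_at for sample in cpu_samples):
        return min(current + 1, max_servers)
    # Scale down only when the whole window was quiet.
    if all(sample < scale_down_at for sample in cpu_samples):
        return max(current - 1, min_servers)
    return current


busy = desired_server_count(2, [75, 80, 90, 71, 88])
quiet = desired_server_count(2, [10, 12, 9, 15, 20])
```

Keeping a gap between the scale-up and scale-down thresholds helps avoid flapping, where servers are added and removed repeatedly around a single threshold.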

Scaling lag

Adding a server typically takes 2-5 minutes. If traffic spikes instantly, auto-scaling might not keep up. That's why "pre-warming" before big events matters.

AI pitfall

AI often recommends caching and horizontal scaling before profiling the actual bottleneck. In practice, a single unindexed query or N+1 fetch loop is often the real cause of a slowdown, and no amount of infrastructure fixes bad queries.
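For readers who have not met the N+1 pattern, here is a minimal sketch using Python's built-in sqlite3 module. The authors and posts tables, the run helper, and the stats counter are invented for this example. The point is only to compare how many queries each approach issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO posts (author_id, title) VALUES
    (1, 'Engines'), (1, 'Notes'), (2, 'Compilers'), (3, 'Kernels');
""")

stats = {"queries": 0}  # counts round trips to the database


def run(sql: str, params: tuple = ()) -> list:
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def titles_n_plus_one() -> dict:
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_join() -> dict:
    """The same data fetched with a single JOIN."""
    result = {}
    rows = run(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result
```

With 3 authors the loop version issues 4 queries and the JOIN issues 1. With 10,000 authors the gap becomes 10,001 versus 1, which is the kind of problem that adding servers only hides.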

Quick reference: Scaling strategies

Strategy	What it is	Best for	Tradeoff
Vertical scaling	Bigger server	Simplicity, databases	Hits hardware limits
Horizontal scaling	More servers	High traffic, resilience	Complexity
Read replicas	Copy database for reads	Read-heavy apps	Replication lag
Sharding	Split data across databases	Massive datasets	Query complexity
Caching	Store frequently used data	Almost everything	Stale data risk
CDN	Global edge servers	Static content, global users	Cost at scale
Load balancing	Distribute traffic	Multiple servers	Single point of failure if not redundant
Auto-scaling	Automatic server adjustment	Variable traffic	Scaling lag
