
Every engineering decision is a tradeoff between cost and performance. Cost here means not just server costs but total cost, including the most expensive resource: engineer time.

The diminishing returns curve

Performance improvements are not linear. Going from 1s to 500ms might require a CDN (Content Delivery Network: a network of servers around the world that caches your files and serves them from the location closest to the user, making pages load faster), which is hours of work. From 500ms to 200ms: rewriting database queries (days). From 200ms to 50ms: a complete architecture overhaul (months).

Performance
(response time)
       |
1000ms |*
       |  *
 500ms |    *
       |       *
 200ms |          *
       |               *
 100ms |                    *
       |                              *
  50ms |                                         *
       +-------------------------------------------->
                   Effort / Cost invested

As a rule of thumb, roughly 80% of performance gains come from 20% of the effort (the Pareto principle). For most web applications, a 200ms response time is fine. Spending three months to reach 50ms pays off mainly for latency-sensitive products such as trading platforms or search engines.
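
To make the curve concrete, the sketch below computes milliseconds saved per engineer-hour for the three steps described earlier. The effort figures in hours are assumptions chosen only to match the rough magnitudes in the text (hours, days, months); they are not measurements.

```python
# Each step: (label, latency before in ms, latency after in ms, assumed effort in hours).
# The effort numbers are illustrative assumptions, not data from a real project.
STEPS = [
    ("Add a CDN", 1000, 500, 4),                 # "hours of work"
    ("Rewrite database queries", 500, 200, 24),  # "days" (assumed 3 working days)
    ("Architecture overhaul", 200, 50, 480),     # "months" (assumed 3 months at 160 h)
]


def ms_saved_per_hour(before_ms: float, after_ms: float, effort_hours: float) -> float:
    """Return latency saved per hour of engineering effort."""
    return (before_ms - after_ms) / effort_hours


for label, before_ms, after_ms, hours in STEPS:
    rate = ms_saved_per_hour(before_ms, after_ms, hours)
    print(f"{label}: {before_ms - after_ms} ms saved, {rate:.2f} ms per engineer-hour")
```

Under these assumed efforts, each step returns at least ten times less improvement per hour than the step before it.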


Know where your bottleneck is

Before optimizing anything, measure. The number one mistake is guessing where the problem is instead of profiling.

Request lifecycle (typical web app):
┌──────────────────────────────────────────┐
│ DNS lookup:           5ms                 │
│ TCP + TLS handshake:  30ms                │
│ Server processing:    150ms               │  <-- people optimize this
│   └── Database query:   120ms             │  <-- this is the real bottleneck
│   └── App logic:        30ms              │
│ Response transfer:    20ms                │
│ Client rendering:     100ms               │
├──────────────────────────────────────────┤
│ Total:                305ms               │
└──────────────────────────────────────────┘

Rewriting your app logic to be twice as fast saves 15ms. Adding a database index (a data structure the database maintains alongside a table so it can find rows by specific columns quickly instead of scanning everything) saves 100ms. Always find the bottleneck first.
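
The comparison can be checked with a few lines of arithmetic. This sketch uses the stage timings from the request lifecycle; the 6x speedup for the database query is not stated in the lesson and is simply what a 100ms saving on a 120ms query implies.

```python
# Stage timings in milliseconds, copied from the request lifecycle in this lesson.
# "Server processing" is split into its two parts so nothing is counted twice.
STAGES_MS = {
    "DNS lookup": 5,
    "TCP + TLS handshake": 30,
    "Database query": 120,
    "App logic": 30,
    "Response transfer": 20,
    "Client rendering": 100,
}


def total_ms(stages: dict[str, float]) -> float:
    """Sum of all stage timings."""
    return sum(stages.values())


def saving_ms(stages: dict[str, float], stage: str, speedup: float) -> float:
    """Time saved if one stage becomes `speedup` times faster."""
    return stages[stage] - stages[stage] / speedup


bottleneck = max(STAGES_MS, key=STAGES_MS.get)
print(f"Total: {total_ms(STAGES_MS)} ms, largest stage: {bottleneck}")
print(f"App logic 2x faster saves {saving_ms(STAGES_MS, 'App logic', 2):.0f} ms")
print(f"Database query 6x faster saves {saving_ms(STAGES_MS, 'Database query', 6):.0f} ms")
```

The saving from any single stage is capped by that stage's share of the total, which is why the largest stage is the first place to look.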

| Layer | Tool | What It Tells You |
|---|---|---|
| Frontend | Chrome DevTools (Performance tab) | Rendering, scripting, and layout time |
| Network | Chrome DevTools (Network tab) | Request waterfalls, slow endpoints |
| Backend | APM tools (Datadog, New Relic) | Endpoint latency breakdown, error rates |
| Database | EXPLAIN ANALYZE (SQL) | Query execution plan, index usage |
| Infrastructure | Cloud monitoring (CloudWatch, etc.) | CPU, memory, I/O utilization |
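
As an illustration of the database row, the sketch below inspects a query plan before and after adding an index. PostgreSQL's EXPLAIN ANALYZE needs a running server, so this example swaps in SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3` module; it shows the chosen plan but not measured timings. The `users` table, its columns, and the index name are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical `users` table (not from the lesson).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

QUERY = "SELECT name FROM users WHERE email = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan description for QUERY as a single string."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("user500@example.com",)
    ).fetchall()
    # The fourth column of each row holds the human-readable plan detail.
    return "; ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # no index yet, so SQLite has to scan the table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn)   # the plan can now use idx_users_email

print("Before index:", plan_before)
print("After index: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan and the second should name the index.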

Total cost of ownership (TCO)

When people say "this costs $50/month," they mean the hosting bill. But TCO includes everything:

Direct costs: Hosting, third-party service fees, licensing.

Indirect costs (usually bigger): Developer time to build and maintain, on-call burden, debugging time, opportunity cost, knowledge-silo risk.

Consider a "free" self-hosted solution: one engineer-week to set up ($6,000) plus two hours/month to maintain ($3,600/year) comes to $9,600 in year one, at an implied rate of $150/hour. A managed service at $200/month ($2,400/year) is cheaper, and the engineer ships features instead.
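
The same comparison can be extended past year one. This is a minimal sketch using only the figures in the paragraph above; the $150 hourly rate is inferred from $6,000 for a 40-hour week, and hosting for the self-hosted option is left out, as it is in the text.

```python
HOURLY_RATE = 150  # USD; inferred from $6,000 for one 40-hour engineer-week


def self_hosted_cost(years: int) -> int:
    """Cumulative engineer-time cost of the "free" self-hosted option."""
    setup = 40 * HOURLY_RATE                      # one engineer-week, one-time
    maintenance_per_year = 2 * 12 * HOURLY_RATE   # two hours per month
    return setup + maintenance_per_year * years


def managed_cost(years: int) -> int:
    """Cumulative cost of the managed service at $200/month."""
    return 200 * 12 * years


for year in (1, 2, 3):
    print(
        f"Year {year}: self-hosted ${self_hosted_cost(year):,} "
        f"vs managed ${managed_cost(year):,}"
    )
```

Because the yearly maintenance time alone ($3,600) costs more than the yearly subscription ($2,400), the gap widens every year under these numbers rather than closing.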


Build vs buy

Default to buy (use a managed service or library) unless:

  1. The feature is your core competitive advantage
  2. No existing solution fits your requirements
  3. Existing solutions have unacceptable limitations (security, compliance, performance)
  4. You have the team to build AND maintain it long-term

| Factor | Build | Buy |
|---|---|---|
| Time to market | Weeks to months | Hours to days |
| Upfront cost | High (engineer time) | Low to medium (subscription) |
| Ongoing maintenance | Your responsibility | Their responsibility |
| Customization | Unlimited | Limited to what they offer |
| Risk: vendor lock-in | None | Medium to high |
| Risk: talent dependency | High (bus factor) | Low |

Buy: Authentication, that is, verifying who a user is, typically through credentials like a password or token (Auth0, Clerk); email (SendGrid, Resend); search (Algolia); payments (Stripe); monitoring (Datadog). These are complex, not your competitive advantage, and a full-time job to maintain.

Build: Your core product logic (e.g., Airbnb's pricing algorithm), highly custom workflows no tool fits, regulated data handling requiring specific compliance controls.
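
The four criteria can be written down as a small checklist function. The lesson does not say how the criteria combine, so this sketch assumes one reading: build only when at least one of the first three criteria applies and the team can also maintain the result. The `Feature` class and `recommend` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Answers to the four build-vs-buy questions for one feature."""
    name: str
    is_core_advantage: bool         # criterion 1
    existing_solution_fits: bool    # criterion 2 (False means nothing fits)
    has_blocking_limitations: bool  # criterion 3 (security, compliance, performance)
    team_can_maintain: bool         # criterion 4


def recommend(feature: Feature) -> str:
    """Return "build" or "buy" under the assumed reading of the criteria."""
    has_reason_to_build = (
        feature.is_core_advantage
        or not feature.existing_solution_fits
        or feature.has_blocking_limitations
    )
    if has_reason_to_build and feature.team_can_maintain:
        return "build"
    return "buy"  # the default


auth = Feature("authentication", False, True, False, True)
pricing = Feature("pricing algorithm", True, False, False, True)
print(auth.name, "->", recommend(auth))
print(pricing.name, "->", recommend(pricing))
```

A real decision involves judgment that booleans cannot capture; the point of the sketch is only that "buy" is the fallback whenever no criterion argues clearly for building.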


The managed services calculation

Self-hosted PostgreSQL:
  EC2 instance:          $100/month
  EBS storage:            $30/month
  Backups (S3):           $10/month
  Engineer time (setup):  40 hours x $150 = $6,000 (one-time)
  Engineer time (maint):  4 hours/month x $150 = $600/month
  On-call burden:         Priceless (but not zero)
  ─────────────────────
  Year 1 total:          ~$14,880

AWS RDS PostgreSQL:
  Instance + storage:    $250/month
  Automated backups:     Included
  Maintenance:           1 hour/month x $150 = $150/month
  ─────────────────────
  Year 1 total:          ~$4,800

The managed service costs about 1.8x more in hosting ($250 vs. $140/month) but about 3x less in total once engineer time is counted.
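
The year-one totals can be reproduced line by line. In this sketch the monthly figures are taken from the breakdown above, with engineer time priced at $150/hour; on-call burden is left out because the breakdown does not put a number on it.

```python
HOURLY_RATE = 150  # USD per engineer-hour

# Monthly costs in USD; engineer time is converted to dollars at HOURLY_RATE.
self_hosted = {
    "hosting": 100 + 30 + 10,           # EC2 + EBS + S3 backups
    "maintenance": 4 * HOURLY_RATE,     # 4 hours/month
    "setup_one_time": 40 * HOURLY_RATE,
}
managed = {
    "hosting": 250,                     # RDS instance + storage
    "maintenance": 1 * HOURLY_RATE,     # 1 hour/month
    "setup_one_time": 0,                # setup time is not itemized for RDS
}


def year_one_total(option: dict[str, int]) -> int:
    """One-time setup plus twelve months of hosting and maintenance."""
    return option["setup_one_time"] + 12 * (option["hosting"] + option["maintenance"])


total_ratio = year_one_total(self_hosted) / year_one_total(managed)
print(f"Hosting per month: self-hosted ${self_hosted['hosting']}, managed ${managed['hosting']}")
print(f"Year 1 total: self-hosted ${year_one_total(self_hosted):,}, managed ${year_one_total(managed):,}")
print(f"Self-hosted costs {total_ratio:.1f}x as much in year one")
```

Changing `HOURLY_RATE` or the maintenance hours shows how sensitive the result is to engineer time, which dominates both totals.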


When to optimize (and when not to)

  1. Is anyone actually complaining? If not, don't optimize. A p99 of 800ms might be fine.
  2. Is this the bottleneck? Profile before you optimize. Don't speed up the fast part.
  3. What is the ROI? Spending 2 weeks to shave 200ms for 500 users: low return. The same effort for 500,000 users: worth it.
  4. Can you throw money at it? A $50/month server upgrade that buys six months of headroom beats a week of code optimization.
  5. One-time or recurring? A regression that worsens with traffic needs a real fix. A monthly slow page can wait.
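
Questions 3 and 4 come down to simple arithmetic, sketched below. The $150 hourly rate and the 20 requests per user per day are assumptions for illustration; the 200ms saving, the user counts, the two-week effort, and the $50/month upgrade come from the list above.

```python
HOURLY_RATE = 150            # USD per engineer-hour; assumption
REQUESTS_PER_USER_DAY = 20   # hypothetical traffic figure


def user_hours_saved_per_day(users: int, saved_ms: float) -> float:
    """Total waiting time removed per day across all users, in hours."""
    return users * REQUESTS_PER_USER_DAY * saved_ms / 1000 / 3600


def engineering_cost(weeks: int) -> int:
    """Cost of engineer time at 40 hours per week."""
    return weeks * 40 * HOURLY_RATE


# Question 3: the same two-week effort at two different scales.
for users in (500, 500_000):
    hours = user_hours_saved_per_day(users, saved_ms=200)
    print(
        f"{users:>7} users: {hours:.1f} user-hours saved per day "
        f"for ${engineering_cost(2):,}"
    )

# Question 4: six months of a bigger server versus one week of optimization.
upgrade_cost = 50 * 6
print(f"Server upgrade: ${upgrade_cost} vs code optimization: ${engineering_cost(1):,}")
```

The engineering cost is identical in both rows of question 3; only the benefit scales with the number of users, which is what changes the answer.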
AI pitfall
When you ask AI to optimize your system, it defaults to the most technically impressive solution: "add Redis caching, set up read replicas, use a CDN." What AI gets wrong is that the cheapest optimization is often just buying a bigger server. A $50/month upgrade to the next instance size can buy you 6 months of headroom without any code changes.
Good to know
The managed services calculation flips at very large scale. At 10 servers, managed services save you money. At 1,000 servers, the managed service markup becomes significant, and hiring a dedicated ops team to run your own infrastructure starts making sense. Most companies never reach that scale.
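
A rough break-even model shows why the calculation flips. Every number in this sketch is hypothetical: a fixed managed-service markup per server and a three-person ops team priced at $150/hour. Real markups and staffing needs vary widely.

```python
import math

# All figures are hypothetical monthly costs in USD.
MANAGED_MARKUP_PER_SERVER = 110   # assumed extra cost of managed over self-hosted
OPS_TEAM_COST = 3 * 160 * 150     # three engineers, 160 hours each, $150/hour


def monthly_saving_from_self_hosting(servers: int) -> int:
    """Markup avoided minus the fixed cost of the ops team (negative means a loss)."""
    return servers * MANAGED_MARKUP_PER_SERVER - OPS_TEAM_COST


break_even_servers = math.ceil(OPS_TEAM_COST / MANAGED_MARKUP_PER_SERVER)

for servers in (10, 1000):
    print(f"{servers:>5} servers: ${monthly_saving_from_self_hosting(servers):,} per month")
print(f"Break-even at about {break_even_servers} servers")
```

The markup grows with the number of servers while the team cost stays roughly fixed, so self-hosting only pays off past the point where the two lines cross.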
Edge case
Build-vs-buy gets complicated with vendor lock-in. Using DynamoDB is "buying", but migrating away from DynamoDB later means rewriting your entire data access layer. Using PostgreSQL on RDS is also "buying" managed services, but you can move to any PostgreSQL host with minimal changes. Factor in switching cost when evaluating "buy" options.