System Design - AI-Powered Monitoring Setup

Lesson

Setting up observability involves a lot of boilerplate: writing PromQL queries, building Grafana dashboard JSON, configuring alert rules, and generating OpenTelemetry instrumentation code. AI assistants can handle much of this mechanical work, but they have blind spots that can create real problems if you do not catch them.

What AI does well vs. poorly

Task	AI quality	Why
Generating PromQL queries	Good	Well-documented syntax with many examples
Writing Grafana dashboard JSON	Good	Structured format, AI follows schemas well
Creating OTel instrumentation boilerplate	Good	Standard patterns, well-represented in training data
Choosing meaningful metric names	Good	Follows Prometheus naming conventions
Setting alert thresholds	Poor	Does not know your traffic patterns or normal baseline
Deciding what to monitor	Poor	Generates generic checklists, misses your specific business metrics
Estimating cardinality impact	Poor	Adds user ID and endpoint labels without considering cost
Tuning sampling rates	Poor	Cannot estimate your trace volume or budget
Writing runbooks for alerts	Mediocre	Generic steps, misses team-specific procedures
Dashboard layout and organization	Mediocre	Creates too many panels, poor information hierarchy

Prompt 1: Designing an observability strategy

Instead of asking "how should I monitor my app?", give AI the context it needs:

Design an observability strategy for this system:

Architecture: 4 microservices (API gateway, user service,
order service, payment service) running on Kubernetes.
Traffic: ~500 requests/second peak, ~50 requests/second baseline.
Stack: Node.js, PostgreSQL, Redis, Stripe API.
Budget: Grafana Cloud free tier (50GB logs, 10k metrics series).
Team: 3 developers, no dedicated SRE.

Requirements:
1. Define the RED metrics for each service
2. Define USE metrics for PostgreSQL and Redis
3. Suggest 3 business metrics specific to an e-commerce system
4. Propose a sampling strategy for traces that fits the budget
5. List the top 5 alerts with specific thresholds

Constraints:
- Keep total metric series under 10,000
- No high-cardinality labels (no user IDs, no order IDs in metrics)
- Prefer histogram over summary for latency

The specificity matters. Without the budget constraint, AI will suggest monitoring everything. Without the cardinality warning, it will add userId labels to every metric.
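To see why the cardinality constraint belongs in the prompt, here is a rough sketch of the arithmetic. The label counts are made-up assumptions, and `estimate_series` is a hypothetical helper, not part of any Prometheus tooling. It computes a worst case in which every combination of label values actually occurs.

```python
from math import prod

SERIES_BUDGET = 10_000  # the limit stated in the prompt above


def estimate_series(label_cardinalities: dict[str, int], buckets: int = 0) -> int:
    """Worst-case number of time series produced by one metric.

    Every combination of label values is its own series. A histogram
    exposes one series per bucket (including +Inf) plus _sum and _count
    for each of those combinations.
    """
    combos = prod(label_cardinalities.values()) if label_cardinalities else 1
    per_combo = buckets + 2 if buckets else 1
    return combos * per_combo


# Assumed cardinalities for http_request_duration_seconds on one service.
latency = estimate_series({"method": 5, "path": 30, "status": 8}, buckets=12)
print(latency, "series; over budget:", latency > SERIES_BUDGET)

# The same histogram with a user ID label (assume 10,000 users).
with_users = estimate_series(
    {"method": 5, "path": 30, "status": 8, "user_id": 10_000}, buckets=12
)
print(with_users, "series with a user_id label")
```

Real counts are usually lower than the worst case because not every combination occurs, but the multiplication is what makes a single extra label expensive.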

Prompt 2: Generating a Grafana dashboard

Generate a Grafana dashboard JSON for a Node.js API service
with these panels:

Row 1: Request Rate (by status code), Error Rate (%), p99 Latency
Row 2: Active DB Connections (gauge), DB Query Duration (p50/p95/p99),
       Slow Queries Count (>1s)
Row 3: Node.js Event Loop Lag, Heap Used (MB), GC Pause Duration

Data source: Prometheus (name: "prometheus")
Use these metric names:
- http_requests_total (labels: method, path, status)
- http_request_duration_seconds (histogram)
- db_pool_active_connections
- db_query_duration_seconds (histogram)
- nodejs_eventloop_lag_seconds
- nodejs_heap_size_used_bytes
- nodejs_gc_duration_seconds

Time range: last 1 hour, refresh every 30s
Use stat panels for single values, time series for trends.

AI generates valid Grafana JSON reliably because it is a well-known schema. But review the output for:

- Missing datasource fields on individual panels
- Incorrect PromQL (especially rate() windows that are too short for your scrape interval)
- Panel sizing that does not fit standard monitor widths
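Part of this review can be automated. The sketch below is a hypothetical checker, not a Grafana tool. It assumes the dashboard has already been loaded into a dict with the usual `panels`, `datasource`, `targets`, and `gridPos` keys, and it uses the common rule of thumb that a rate() window should cover at least four scrape intervals.

```python
import re

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid

# Matches simple range selectors such as [30s], [5m] or [1h].
# Subqueries and variables like $__rate_interval are not handled.
RANGE_RE = re.compile(r"\[(\d+)(s|m|h)\]")
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def review_dashboard(dashboard: dict, scrape_interval_s: int = 15) -> list[str]:
    """Return human-readable problems found in a dashboard dict."""
    problems = []
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            continue  # row headers carry no queries; nested panels are ignored here
        title = panel.get("title", "<untitled>")
        if not panel.get("datasource"):
            problems.append(f"{title}: no datasource set on the panel")
        grid = panel.get("gridPos", {})
        if grid.get("x", 0) + grid.get("w", 0) > GRID_COLUMNS:
            problems.append(f"{title}: extends past the {GRID_COLUMNS}-column grid")
        for target in panel.get("targets", []):
            for amount, unit in RANGE_RE.findall(target.get("expr", "")):
                window_s = int(amount) * UNIT_SECONDS[unit]
                if window_s < 4 * scrape_interval_s:
                    problems.append(
                        f"{title}: [{amount}{unit}] is shorter than 4x the "
                        f"{scrape_interval_s}s scrape interval"
                    )
    return problems
```

Load the generated file with `json.load` and pass the result to `review_dashboard`. An empty list means none of these three checks failed; it does not mean the dashboard is correct.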

Prompt 3: Writing alert rules in PromQL

Write Prometheus alerting rules for these conditions:

1. Error rate above 5% for more than 2 minutes
2. p99 latency above 2 seconds for more than 5 minutes
3. Database connection pool above 80% utilization
4. No requests received for more than 1 minute (service down)
5. Disk usage above 85%

Use these metric names:
- http_requests_total{status=~"5.."}
- http_request_duration_seconds_bucket
- db_pool_active_connections (max 20)
- node_filesystem_avail_bytes, node_filesystem_size_bytes

For each alert, include:
- The PromQL expression
- The "for" duration (how long the condition must hold before firing)
- Severity label (warning or critical)
- A human-readable annotation
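As a reference point for reviewing the response, here is a sketch of what the first rule might look like once parsed, written as a Python dict that mirrors the fields of a Prometheus alerting rule (`alert`, `expr`, `for`, `labels`, `annotations`). The alert name, the 5m rate window, the severity, and the `missing_fields` helper are illustrative assumptions.

```python
REQUIRED_FIELDS = ("alert", "expr", "for", "labels", "annotations")


def missing_fields(rule: dict) -> list[str]:
    """List the fields the prompt asked for that a generated rule lacks."""
    missing = [field for field in REQUIRED_FIELDS if not rule.get(field)]
    if "severity" not in (rule.get("labels") or {}):
        missing.append("labels.severity")
    return missing


# Condition 1 from the prompt: error rate above 5% for more than 2 minutes.
high_error_rate = {
    "alert": "HighErrorRate",  # hypothetical name
    "expr": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m])) > 0.05"
    ),
    "for": "2m",
    "labels": {"severity": "critical"},
    "annotations": {"summary": "More than 5% of requests are failing with 5xx"},
}

print(missing_fields(high_error_rate))
```

Running every generated rule through a check like this catches the most common omission, a rule with an expression but no "for" duration, which fires on a single bad evaluation.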

What to verify in AI-generated observability configs

Noisy alerts. AI-generated thresholds are almost always wrong for your specific system. An error rate of 5% might be normal during a deploy window. A p99 of 2 seconds might be fine for a report endpoint but catastrophic for a login page. Run the alert rules against your historical data before enabling notifications.
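A minimal sketch of that backtest, assuming you have exported the alert expression's historical values as an evenly spaced list (one value per evaluation interval). `count_firings` is a hypothetical helper, and it only approximates how Prometheus treats the "for" clause by requiring a number of consecutive samples above the threshold.

```python
def count_firings(samples: list[float], threshold: float, for_samples: int) -> int:
    """Count how many times the alert would have fired over the history.

    samples:     evenly spaced historical values of the alert expression
    threshold:   the static threshold under test
    for_samples: consecutive samples above the threshold needed to fire
    """
    firings = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == for_samples:
                firings += 1  # fire once per continuous breach
        else:
            streak = 0
    return firings


# Invented error-rate history, one sample per minute.
history = [0.01, 0.06, 0.07, 0.02, 0.08, 0.09, 0.10, 0.01]
print(count_firings(history, threshold=0.05, for_samples=2))
print(count_firings(history, threshold=0.05, for_samples=3))
```

If a week of real history produces dozens of firings, the threshold or the duration is wrong for your system, whatever the AI suggested.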

Too many dashboards. AI loves to generate 15 panels per service. In practice, a team of 3 developers needs one overview dashboard with RED metrics per service, one infrastructure dashboard with USE metrics, and one business dashboard. Three dashboards, not thirty.

Missing business metrics. AI defaults to technical metrics (CPU, memory, request rate). It rarely suggests "orders per minute," "checkout abandonment rate," or "payment success rate" unless you explicitly ask. These business metrics are often the first signal that something is wrong.

Cardinality bombs. Check every label on every metric. If AI added endpoint as a label and you record raw paths (including path parameters like /users/123), 200 unique API paths means 200x more time series than intended, and the count grows with every new ID. Normalize paths to route patterns (/users/:id) or use only the top-level resource.
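Path normalization can be sketched as below. This is an illustrative fallback that assumes IDs are either all digits or UUIDs. Where your web framework exposes the matched route pattern, using that value for the label is more reliable than guessing from the URL.

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path: str) -> str:
    """Collapse ID-like path segments so the label has bounded cardinality."""
    path = path.split("?", 1)[0]  # query strings never belong in a label
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or UUID_RE.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/users/123"))
print(normalize_path("/users/123/orders/456?page=2"))
```

Slugs and other non-numeric identifiers slip through this heuristic, which is one more reason to prefer the framework's route pattern when it is available.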

Sampling that misses errors. If AI suggests 1% head-based sampling, you are throwing away 99% of error traces. Always ask for tail-based sampling that keeps 100% of error traces, or at minimum a priority rule for error responses.
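The decision logic behind tail-based sampling is small enough to sketch. In practice it usually runs in a collector (for example the OpenTelemetry Collector's tail sampling processor) rather than in application code. The `Span` type and `keep_trace` function here are hypothetical and only illustrate the rule: decide after the trace is complete, keep every trace that contains an error, and sample the rest.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    status_code: int  # HTTP status recorded on the span


def keep_trace(spans: list[Span], baseline_rate: float = 0.01) -> bool:
    """Decide whether to keep a finished trace.

    Traces containing any 5xx span are always kept. Healthy traces are
    kept at baseline_rate, using a hash of the trace ID so that every
    collector instance makes the same decision for the same trace.
    """
    if not spans:
        return False
    if any(span.status_code >= 500 for span in spans):
        return True
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < baseline_rate
```

The cost of this approach is that all spans of a trace must be buffered until the trace finishes, which is why the trace volume and budget from your prompt still matter.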

The hybrid workflow

1. Human defines what matters: SLOs (99.9% availability, p99 < 500ms), critical business flows (checkout, signup, payment), and budget constraints
2. AI generates the implementation: PromQL queries, Grafana JSON, OTel config, alerting rules
3. Human reviews for correctness: Check cardinality, verify thresholds, remove noisy panels
4. Deploy to staging first: Run alerts for a week in non-paging mode, review what would have fired
5. Human tunes thresholds: Adjust based on actual traffic patterns, remove alerts that fire on noise
6. AI iterates on feedback: "This alert fires every deploy. Add a 5-minute suppression window after deploys."

The pattern is consistent across all AI-assisted work: AI handles the mechanical generation, humans handle the judgment calls. Observability is particularly sensitive to this because a bad alert that pages you at 3 AM for a false positive erodes trust in the entire system. Better to have 5 high-quality alerts than 50 that nobody trusts.

AI pitfall

AI-generated alert rules almost always use static thresholds ("alert when error rate > 5%"). What AI gets wrong: static thresholds produce false positives during deployments, traffic spikes, and maintenance windows. Better approaches include anomaly detection (alert when the error rate is 3x the historical average) and burn-rate alerts (alert when you are consuming your error budget too fast).
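The burn-rate idea reduces to one division. The sketch below assumes a 99.9% availability SLO over 30 days and uses the paging threshold of 14.4 popularized by Google's SRE Workbook (spending 2% of a 30-day budget in one hour). The function names and the two-window check are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO period; 10.0 means it
    would be gone in a tenth of that time.
    """
    error_budget = 1.0 - slo_target  # 0.001 for a 99.9% SLO
    return error_ratio / error_budget


def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float = 0.999,
    threshold: float = 14.4,
) -> bool:
    """Page only if both a long and a short window exceed the threshold.

    The short window stops the alert from staying active long after the
    problem has ended.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


# 2% of requests failing over the last hour and the last 5 minutes.
print(should_page(long_window_ratio=0.02, short_window_ratio=0.02))
# The hour still looks bad, but the last 5 minutes have recovered.
print(should_page(long_window_ratio=0.02, short_window_ratio=0.001))
```

Unlike a static 5% threshold, this scales with the SLO: the same 2% error rate that pages for a 99.9% target would be a burn rate of only 2 for a 99% target.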

Good to know

The most important observability practice is not a tool or a metric; it is the runbook. When an alert fires at 3 AM, the on-call engineer needs a document that says: "This alert means X. Check Y. If Z, do W." AI can generate draft runbooks for common scenarios, which you then customize with your specific infrastructure details.