Setting up observability involves a lot of boilerplate: writing PromQL queries, building Grafana dashboard JSON, configuring alert rules, and generating OpenTelemetry instrumentation code. AI assistants can handle much of this mechanical work, but they have blind spots that can create real problems if you do not catch them.
What AI does well vs. poorly
| Task | AI quality | Why |
|---|---|---|
| Generating PromQL queries | Good | Well-documented syntax with many examples |
| Writing Grafana dashboard JSON | Good | Structured format, AI follows schemas well |
| Creating OTel instrumentation boilerplate | Good | Standard patterns, well-represented in training data |
| Choosing meaningful metric names | Good | Follows Prometheus naming conventions |
| Setting alert thresholds | Poor | Does not know your traffic patterns or normal baseline |
| Deciding what to monitor | Poor | Generates generic checklists, misses your specific business metrics |
| Estimating cardinality impact | Poor | Adds user ID and endpoint labels without considering cost |
| Tuning sampling rates | Poor | Cannot estimate your trace volume or budget |
| Writing runbooks for alerts | Mediocre | Generic steps, misses team-specific procedures |
| Dashboard layout and organization | Mediocre | Creates too many panels, poor information hierarchy |
Prompt 1: Designing an observability strategy
Instead of asking "how should I monitor my app?", give AI the context it needs:
Design an observability strategy for this system:
Architecture: 4 microservices (API gateway, user service,
order service, payment service) running on Kubernetes.
Traffic: ~500 requests/second peak, ~50 requests/second baseline.
Stack: Node.js, PostgreSQL, Redis, Stripe API.
Budget: Grafana Cloud free tier (50GB logs, 10k metrics series).
Team: 3 developers, no dedicated SRE.
Requirements:
1. Define the RED metrics for each service
2. Define USE metrics for PostgreSQL and Redis
3. Suggest 3 business metrics specific to an e-commerce system
4. Propose a sampling strategy for traces that fits the budget
5. List the top 5 alerts with specific thresholds
Constraints:
- Keep total metric series under 10,000
- No high-cardinality labels (no user IDs, no order IDs in metrics)
- Prefer histogram over summary for latency
The specificity matters. Without the budget constraint, AI will suggest monitoring everything. Without the cardinality warning, it will add userId labels to every metric.
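To see why the series budget and the cardinality constraint interact, it helps to estimate the worst-case series count before generating anything. The sketch below is a rough back-of-the-envelope model: the label counts (5 methods, 30 routes, 6 status codes, 12 histogram buckets, 10,000 users) are illustrative assumptions, not measurements from the system in the prompt.

```python
from math import prod

SERIES_BUDGET = 10_000  # the series limit quoted in the prompt above


def series_count(label_cardinalities: dict[str, int], histogram_buckets: int = 0) -> int:
    """Worst-case series for one metric: the product of its label cardinalities.

    A Prometheus histogram exposes one series per bucket (count the +Inf
    bucket in histogram_buckets) plus _sum and _count, per label combination.
    """
    combos = prod(label_cardinalities.values())
    per_combo = histogram_buckets + 2 if histogram_buckets else 1
    return combos * per_combo


# Assumed label cardinalities for one service (hypothetical numbers).
counter = series_count({"method": 5, "route": 30, "status": 6})
histogram = series_count({"method": 5, "route": 30}, histogram_buckets=12)

services = 4
total = services * (counter + histogram)
print(total, "over budget" if total > SERIES_BUDGET else "within budget")

# The same counter with a user_id label, assuming 10,000 distinct users.
with_user_id = series_count({"method": 5, "route": 30, "status": 6, "user_id": 10_000})
print(with_user_id)
```

Under these assumed numbers, two HTTP metrics across four services already exceed the budget, which is the kind of check worth running on whatever label set the AI proposes.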
Prompt 2: Generating a Grafana dashboard
Generate a Grafana dashboard JSON for a Node.js API service
with these panels:
Row 1: Request Rate (by status code), Error Rate (%), p99 Latency
Row 2: Active DB Connections (gauge), DB Query Duration (p50/p95/p99),
Slow Queries Count (>1s)
Row 3: Node.js Event Loop Lag, Heap Used (MB), GC Pause Duration
Data source: Prometheus (name: "prometheus")
Use these metric names:
- http_requests_total (labels: method, path, status)
- http_request_duration_seconds (histogram)
- db_pool_active_connections
- db_query_duration_seconds (histogram)
- nodejs_eventloop_lag_seconds
- nodejs_heap_size_used_bytes
- nodejs_gc_duration_seconds
Time range: last 1 hour, refresh every 30s
Use stat panels for single values, time series for trends.
AI generates valid Grafana JSON reliably because it is a well-known schema. But review the output for:
- Missing datasource fields on individual panels
- Incorrect PromQL (especially a rate() window that does not match your scrape interval)
- Panel sizing that does not fit standard monitor widths
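Part of this review can be automated. The following is a minimal sketch of a lint pass over a dashboard JSON, assuming panels carry `datasource`, `gridPos`, and `targets[].expr` fields as in common Grafana exports. The 15-second scrape interval and the "window at least 4x the scrape interval" rule of thumb are assumptions to adjust, and `lint_dashboard` is a hypothetical helper, not part of any Grafana tooling.

```python
import json
import re

GRID_COLUMNS = 24        # Grafana's dashboard grid is 24 columns wide
SCRAPE_INTERVAL_S = 15   # assumption: replace with your actual scrape interval
MIN_RATE_WINDOW_S = 4 * SCRAPE_INTERVAL_S  # rule of thumb, not a hard limit

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
# Matches rate(metric{...}[30s]) style windows; variables such as
# $__rate_interval are intentionally not matched and therefore skipped.
_RATE_WINDOW = re.compile(r"rate\([^\[\]]*\[(\d+)([smh])\]")


def iter_panels(dashboard: dict):
    """Yield every panel, including panels nested inside collapsed rows."""
    for panel in dashboard.get("panels", []):
        if panel.get("type") == "row":
            yield from panel.get("panels", [])
        else:
            yield panel


def lint_dashboard(dashboard: dict) -> list[str]:
    """Return human-readable problems found in the dashboard."""
    problems = []
    for panel in iter_panels(dashboard):
        title = panel.get("title", "<untitled>")
        if not panel.get("datasource"):
            problems.append(f"{title}: missing datasource")
        pos = panel.get("gridPos", {})
        if pos.get("x", 0) + pos.get("w", 0) > GRID_COLUMNS:
            problems.append(f"{title}: extends past the {GRID_COLUMNS}-column grid")
        for target in panel.get("targets", []):
            for amount, unit in _RATE_WINDOW.findall(target.get("expr", "")):
                if int(amount) * _UNIT_SECONDS[unit] < MIN_RATE_WINDOW_S:
                    problems.append(
                        f"{title}: rate() window {amount}{unit} is under {MIN_RATE_WINDOW_S}s"
                    )
    return problems


# A tiny hand-written dashboard with one problem of each kind.
dashboard = json.loads("""
{"panels": [
  {"title": "Request Rate", "datasource": "prometheus",
   "gridPos": {"x": 0, "y": 0, "w": 8, "h": 8},
   "targets": [{"expr": "sum by (status) (rate(http_requests_total[30s]))"}]},
  {"title": "p99 Latency",
   "gridPos": {"x": 16, "y": 0, "w": 12, "h": 8},
   "targets": [{"expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"}]}
]}
""")

problems = lint_dashboard(dashboard)
for problem in problems:
    print(problem)
```

Some Grafana versions set the data source per target rather than per panel, so treat a "missing datasource" finding as a prompt to look, not as proof of a bug.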
Prompt 3: Writing alert rules in PromQL
Write Prometheus alerting rules for these conditions:
1. Error rate above 5% for more than 2 minutes
2. p99 latency above 2 seconds for more than 5 minutes
3. Database connection pool above 80% utilization
4. No requests received for more than 1 minute (service down)
5. Disk usage above 85%
Use these metric names:
- http_requests_total{status=~"5.."}
- http_request_duration_seconds_bucket
- db_pool_active_connections (max 20)
- node_filesystem_avail_bytes, node_filesystem_size_bytes
For each alert, include:
- The PromQL expression
- The for duration (how long the condition must hold before firing)
- Severity label (warning or critical)
- A human-readable annotation
What to verify in AI-generated observability configs
Noisy alerts. AI-generated thresholds are almost always wrong for your specific system. An error rate of 5% might be normal during a deploy window. A p99 of 2 seconds might be fine for a report endpoint but catastrophic for a login page. Run the alert rules against your historical data before enabling notifications.
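Replaying a rule against history does not need special tooling. Here is a simplified sketch that counts how often a threshold alert would have fired over a list of past samples. It models the `for:` clause as "N consecutive breaching samples", which only approximates how Prometheus evaluates rules, and the error-rate history is made-up data for illustration.

```python
def count_firings(samples: list[float], threshold: float, for_samples: int) -> int:
    """Count how many times an alert would have fired.

    samples:      historical values at a fixed interval (oldest first)
    threshold:    a sample breaches when value > threshold
    for_samples:  consecutive breaching samples required before firing
    """
    firings = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == for_samples:  # fire once per sustained breach
                firings += 1
        else:
            streak = 0
    return firings


# Hypothetical per-minute error rates (fraction of requests that failed).
history = [0.01, 0.02, 0.08, 0.09, 0.01, 0.06, 0.07, 0.07, 0.08, 0.01, 0.06, 0.01]

# Three consecutive one-minute samples span two minutes of sustained breach.
print(count_firings(history, threshold=0.05, for_samples=3))
# With no waiting period, every short spike fires.
print(count_firings(history, threshold=0.05, for_samples=1))
```

Comparing the two counts shows how much of the alert volume comes from short spikes rather than sustained problems, which is the question to answer before turning on notifications.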
Too many dashboards. AI loves to generate 15 panels per service. In practice, a team of 3 developers needs one overview dashboard with RED metrics per service, one infrastructure dashboard with USE metrics, and one business dashboard. Three dashboards, not thirty.
Missing business metrics. AI defaults to technical metrics (CPU, memory, request rate). It rarely suggests "orders per minute," "checkout abandonment rate," or "payment success rate" unless you explicitly ask. These business metrics are often the first signal that something is wrong.
Cardinality bombs. Check every label on every metric. If AI added endpoint as a label and you have 200 unique API paths, every metric carrying that label can have up to 200x more time series than intended, and raw path parameters like /users/123 make the count grow with every new ID. Normalize paths to route patterns (/users/:id) or use only the top-level resource.
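Path normalization can be as small as one function applied before the label is set. This is a minimal sketch that treats purely numeric segments and UUIDs as IDs; that heuristic is an assumption, and `normalize_path` is a hypothetical helper name. If your web framework exposes the matched route template, using that is usually more reliable than pattern matching.

```python
import re

# Segments that look like identifiers: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Collapse ID-like path segments to ':id' so the label stays low-cardinality."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = path.split("/")
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)


for raw in [
    "/users/123",
    "/users/123/orders/456?page=2",
    "/orders/550e8400-e29b-41d4-a716-446655440000",
    "/health",
]:
    print(raw, "->", normalize_path(raw))
```

Slugs and other non-numeric identifiers (for example /products/blue-shirt) slip through this heuristic, so check the resulting label values in staging before trusting it.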
Sampling that misses errors. Head-based sampling decides whether to keep a trace when the request starts, before anyone knows whether it will fail. If AI suggests 1% head-based sampling, you are throwing away 99% of error traces. Always ask for tail-based sampling that keeps 100% of error traces, or at minimum a priority rule for error responses.
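The difference is easiest to see as a decision function that runs after the trace has finished. The sketch below is illustrative only: the span dictionaries with `error` and `http_status` keys are a hypothetical shape, not an OpenTelemetry data model, and in practice this decision usually lives in a collector (for example a tail-sampling processor) rather than in application code.

```python
import zlib


def keep_trace(trace_id: str, spans: list[dict], baseline_rate: float = 0.01) -> bool:
    """Tail-based decision: runs once all spans of the trace are available."""
    # Rule 1: always keep traces that contain an error or a 5xx response.
    if any(span.get("error") or span.get("http_status", 0) >= 500 for span in spans):
        return True
    # Rule 2: keep a fixed fraction of healthy traces. Hashing the trace ID
    # makes the decision deterministic, so every instance agrees on it.
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < baseline_rate * 10_000


failed = [{"name": "POST /checkout", "http_status": 502}]
healthy = [{"name": "GET /health", "http_status": 200}]

print(keep_trace("trace-a", failed))   # kept regardless of the baseline rate
print(keep_trace("trace-b", healthy))  # kept only if it lands in the 1% bucket
```

The trade-off is that every span must be buffered until the trace completes, which costs memory that head-based sampling avoids.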
The hybrid workflow
- Human defines what matters: SLOs (99.9% availability, p99 < 500ms), critical business flows (checkout, signup, payment), and budget constraints
- AI generates the implementation: PromQL queries, Grafana JSON, OTel config, alerting rules
- Human reviews for correctness: Check cardinality, verify thresholds, remove noisy panels
- Deploy to staging first: Run alerts for a week in non-paging mode, review what would have fired
- Human tunes thresholds: Adjust based on actual traffic patterns, remove alerts that fire on noise
- AI iterates on feedback: "This alert fires every deploy. Add a 5-minute suppression window after deploys."
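When reviewing a week of non-paging staging alerts, it is useful to separate firings that happened right after a deploy from the rest. The following is a small sketch of that filter, assuming you can export alert and deploy timestamps. The 5-minute window comes from the feedback example above, and `is_suppressed` is a hypothetical helper for offline review, not a replacement for silencing or inhibition in your alerting system.

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(minutes=5)


def is_suppressed(
    fired_at: datetime,
    deploy_times: list[datetime],
    window: timedelta = SUPPRESSION_WINDOW,
) -> bool:
    """True if the alert fired within `window` after any deploy."""
    return any(timedelta(0) <= fired_at - deployed < window for deployed in deploy_times)


# Hypothetical timestamps for illustration.
deploys = [datetime(2024, 1, 15, 14, 0)]
firings = [
    datetime(2024, 1, 15, 13, 59),  # before the deploy
    datetime(2024, 1, 15, 14, 3),   # inside the window
    datetime(2024, 1, 15, 14, 7),   # after the window
]

real_signals = [t for t in firings if not is_suppressed(t, deploys)]
print(len(real_signals), "of", len(firings), "firings were outside a deploy window")
```

If most firings fall inside deploy windows, the threshold may be fine and the deploy process is the noisy part; if they do not, the threshold itself needs tuning.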
The pattern is consistent across all AI-assisted work: AI handles the mechanical generation, humans handle the judgment calls. Observability is particularly sensitive to this because a bad alert that pages you at 3 AM for a false positive erodes trust in the entire system. Better to have 5 high-quality alerts than 50 that nobody trusts.