Integration & APIs
Lesson
Good to know
AI's biggest value in observability is not generating dashboards; it is translating natural-language questions into PromQL or Loki queries. "Show me the error rate for Stripe over the last hour" is much easier to type than writing the PromQL from scratch. Use AI as a query translator and you will save hours per week.

Integration observability involves a lot of boilerplate: dashboard configurations, alert rules, PromQL queries, runbook templates, log parsing queries. AI handles this boilerplate well. But the hard parts require your knowledge of the system: choosing the right SLO targets, tuning alert thresholds to avoid fatigue, and understanding which correlation patterns matter.

What AI does well vs. poorly

| AI does well | AI does poorly |
| --- | --- |
| Generating PromQL queries from natural language | Choosing SLO targets for your specific partners |
| Writing Prometheus alert rule YAML | Tuning `for` durations and thresholds to avoid alert fatigue |
| Creating Grafana dashboard JSON | Knowing which metrics matter most for your business |
| Drafting runbook templates | Adding team-specific diagnostic steps and escalation paths |
| Suggesting log query syntax (Loki, CloudWatch) | Understanding your correlation ID propagation chain |
| Generating OpenTelemetry instrumentation boilerplate | Designing sampling strategies for your traffic patterns |
| Listing common failure modes for known APIs (Stripe, Twilio) | Predicting failure modes unique to your integration topology |
AI pitfall
AI-generated Prometheus alert rules almost always set the `for` duration too short; 1 minute is a common default. A 1-minute window catches every transient spike and causes constant flapping. Set `for` to 5 minutes minimum for warnings and 2-3 minutes for critical alerts to filter out noise.
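As a concrete illustration, here is what a sane warning rule looks like. The metric and label names are assumptions matching the prompt templates in this lesson, and the 5% threshold is a placeholder to replace with your own baseline:

```yaml
- alert: StripeHighErrorRate
  expr: |
    sum(rate(integration_requests_total{partner="stripe", status!~"2.."}[5m]))
      /
    sum(rate(integration_requests_total{partner="stripe"}[5m])) > 0.05
  for: 5m          # 5 minutes for a warning; 1m would flap on transient spikes
  labels:
    severity: warning
  annotations:
    summary: "Stripe error rate above 5% for 5 minutes"
```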
02. Prompt templates

1. Designing an integration dashboard

Prompt:

Create a Grafana dashboard JSON for monitoring these integrations:
- Stripe (payments): track success rate, p99 latency, error breakdown by code
- Twilio (SMS): track delivery rate, send latency, rate limit proximity
- Internal fulfillment API: track request rate, error rate, circuit breaker state

Use PromQL with these metric names:
- integration_requests_total (labels: partner, endpoint, status)
- integration_request_duration_seconds (labels: partner, endpoint)
- integration_circuit_breaker_state (labels: partner, state)

Layout: one row per partner, panels for rate/errors/latency in each row.

AI will generate a complete Grafana JSON configuration: rows, panels, PromQL expressions, thresholds. This saves an hour of manual dashboard building. But verify that the PromQL expressions actually match your metric label values, and adjust the Y-axis scales and thresholds for your actual traffic volumes.
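For example, a Stripe success-rate panel built from the metric names in the prompt might use an expression like the one below. The `status=~"2.."` matcher assumes the status label carries HTTP status codes; adjust it if yours holds category strings like "success"/"error":

```promql
# Stripe success rate over the last 5 minutes
sum(rate(integration_requests_total{partner="stripe", status=~"2.."}[5m]))
  /
sum(rate(integration_requests_total{partner="stripe"}[5m]))
```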

2. Writing alert rules for integration SLOs

Prompt:

Write Prometheus alerting rules for these integration SLOs:
- Stripe: 99.9% success rate, p99 < 2s, measured over 30-day rolling window
- Twilio: 99.5% delivery rate, p99 < 5s
- Fulfillment API: 99.0% success rate, p99 < 10s

Use multi-window, multi-burn-rate alerting (Google SRE approach).
Include severity labels and runbook annotations.
Metric names: integration_requests_total, integration_request_duration_seconds.

What to verify in the output:
AI will typically generate syntactically correct Prometheus rules. But check these common issues:

  • Does the `for` duration make sense? AI often sets it too short (1m), causing flapping alerts
  • Are the burn rate thresholds reasonable? A fast-burn critical alert typically uses a high burn rate (Google's SRE Workbook uses 14.4x over a 1-hour window); AI sometimes uses 2x, which is far too sensitive
  • Did it include both fast-burn (short window, high threshold) and slow-burn (long window, lower threshold) rules?
  • Are runbook URLs pointing to your actual wiki, or are they placeholder URLs?
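For reference, a fast-burn rule that passes this checklist might look like the sketch below, for the Stripe 99.9% target (0.1% error budget) using a 14.4x burn rate over paired 1-hour and 5-minute windows. The `status!~"2.."` matcher assumes HTTP status codes, and the runbook URL is a placeholder:

```yaml
- alert: StripeErrorBudgetFastBurn
  expr: |
    (
      sum(rate(integration_requests_total{partner="stripe", status!~"2.."}[1h]))
        /
      sum(rate(integration_requests_total{partner="stripe"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(integration_requests_total{partner="stripe", status!~"2.."}[5m]))
        /
      sum(rate(integration_requests_total{partner="stripe"}[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    runbook_url: https://your-wiki.example.com/runbooks/stripe-fast-burn  # replace
```

A matching slow-burn rule would use a lower burn rate (commonly 6x) over 6-hour and 30-minute windows with a warning severity.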

3. Analyzing logs for integration issues

Prompt:

I am seeing intermittent 502 errors from our fulfillment partner API.
Here are 10 sample log entries:

[paste structured JSON log entries]

Analyze these logs and suggest:
1. The most likely root cause
2. What additional log fields I should check
3. A log query (Loki/CloudWatch syntax) to find all related events
4. Whether this looks like a partner issue or an issue on our side

AI is good at pattern recognition in logs: spotting timing patterns, correlating error codes with specific endpoints, noticing that errors cluster at certain times. What it cannot do is compare against your system's baseline behavior. It does not know that "502 errors from fulfillment spike every day at 3 PM because of their maintenance window." That context is yours to add.
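For step 3 of the prompt, the query AI returns will look something like this LogQL sketch. The label and field names (service, partner, status) are assumptions; match them to your own log schema:

```logql
{service="order-service"} | json
  | partner = "fulfillment"
  | status = 502
```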

4. Generating runbooks

Prompt:

Generate a runbook for this alert:
- Alert name: StripePaymentIntegrationDown
- Trigger: Stripe error rate > 10% for 5 minutes
- Impact: Users cannot complete purchases
- Our Stripe integration: REST API calls from order-service
- Monitoring: Grafana dashboard at /d/stripe-health

Include: what it means, impact, diagnostic steps,
common causes with fixes, and escalation path.
Format as markdown.

AI-generated runbooks are solid 80% drafts. They cover the standard diagnostic flow: check the partner status page, check your logs, check recent deploys. What they miss: your specific escalation contacts, internal tools and shortcuts your team uses, the weird edge cases you have seen before ("If the error message mentions 'idempotency key collision', it means the retry logic is broken; check the dedup cache TTL").
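One way to make the missing 20% visible is to leave explicit team-knowledge markers in the generated draft. A minimal skeleton based on the alert details in the prompt above:

```markdown
## StripePaymentIntegrationDown

**What it means:** Stripe error rate has exceeded 10% for 5 minutes.
**Impact:** Users cannot complete purchases.

### Diagnostic steps
1. Check https://status.stripe.com for an ongoing incident.
2. Open the Grafana dashboard at /d/stripe-health.
3. Check recent deploys of order-service.
4. <!-- TEAM KNOWLEDGE: internal tools, known quirks, past-incident shortcuts -->

### Escalation
- <!-- TEAM KNOWLEDGE: real on-call contacts, not AI placeholders -->
```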

03. What to verify in AI-generated observability code

Every time AI generates observability configuration for you, check for these issues:

1. Overly broad alerts
AI loves generating alerts on every possible failure mode. If it gives you 15 alert rules, you probably need 4. Ask yourself: "If this alert fires, does someone need to take action right now?" If not, it is a dashboard metric, not an alert.

2. Missing correlation ID propagation
When AI generates instrumentation code (logging middleware, tracing setup), it often forgets to propagate the correlation ID to downstream calls. Check that every outgoing HTTP request includes the x-correlation-id header and that the tracing context is injected.
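A minimal sketch of the propagation helper to look for (Python; the function name is illustrative): reuse the caller's ID when present, mint one otherwise, and attach it to every outgoing request.

```python
import uuid

def outgoing_headers(incoming_headers: dict) -> dict:
    """Build headers for a downstream HTTP call, propagating the
    caller's correlation ID or minting a new one if none arrived."""
    correlation_id = incoming_headers.get("x-correlation-id") or str(uuid.uuid4())
    return {"x-correlation-id": correlation_id}

# Usage: merge into whatever headers the outgoing client sends.
headers = {"content-type": "application/json",
           **outgoing_headers({"x-correlation-id": "req-42"})}
```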

3. Placeholder thresholds
AI cannot know your actual traffic patterns, so it picks round numbers: 5% error rate, 2-second latency, 1-minute evaluation window. These might be too aggressive or too lenient for your system. Replace them with thresholds based on your historical data.
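Instead of accepting a round number, query your own history. Assuming the duration metric is a Prometheus histogram (so a `_bucket` series exists), this expression shows the observed p99; graph it over several weeks to pick a realistic threshold:

```promql
histogram_quantile(
  0.99,
  sum(rate(integration_request_duration_seconds_bucket{partner="stripe"}[5m])) by (le)
)
```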

4. High-cardinality labels
AI sometimes adds labels like userId, requestId, or transactionId to Prometheus metrics. These create millions of time series and will crash your metrics system. Ensure metric labels are low-cardinality (partner, endpoint, status category).

Edge case
AI sometimes adds high-cardinality labels like userId or requestId to Prometheus metrics. Each unique label combination creates a new time series. Adding userId to your integration metrics creates one series per user per partner per endpoint, potentially millions of series that will crash your Prometheus server. Keep metric labels low-cardinality (partner, endpoint, status category).
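The arithmetic is worth doing explicitly. With illustrative counts (substitute your own):

```python
# Illustrative counts; substitute your own.
users, partners, endpoints, status_categories = 100_000, 3, 10, 3

# Low-cardinality labels: partner x endpoint x status category.
low_card_series = partners * endpoints * status_categories            # 90 series

# Adding a userId label multiplies every combination by the user count.
high_card_series = users * partners * endpoints * status_categories   # 9,000,000 series
```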
04. The hybrid workflow

Step 1: AI generates the scaffold
Prompt AI for dashboard JSON, alert rules, and runbook templates. This saves hours of YAML and JSON writing and gets the syntax right.

Step 2: You tune the thresholds
Replace AI's placeholder numbers with real thresholds based on your historical data. Set `for` durations based on how long transient issues typically last in your system. Adjust SLO targets based on partner SLAs and business impact.

Step 3: You add tribal knowledge
Fill in the runbooks with your team's actual escalation contacts, internal tool links, known quirks, and past incident learnings. This is the part AI cannot generate.

Step 4: AI reviews the final config
Paste your finalized alert rules back to AI and ask it to review for syntax errors, missing labels, inconsistent naming, and gaps in coverage. AI is excellent at this review pass because it is checking structure, not making judgment calls about thresholds.

The pattern is the same as every other AI-assisted workflow: let AI handle the mechanical parts, keep the judgment calls for yourself, then let AI do a final review pass.