
Observability Q&A

Stop alt-tabbing between Grafana, kubectl, and a Loki tab. Ask "@agentcy is checkout-svc latency healthy?" in Slack — the agent queries the right datasource, renders the right panel, and replies with the chart inline plus a one-line interpretation, with every claim cited back to a Grafana URL.

The trick: Grafana's image renderer. Without it, agents can only hand back raw timeseries numbers (or the handful of simple panel types Agentcy can draw in-process). With it, agents return real charts the human can glance at — the same panels engineers already trust.

[Flow diagram — SRE recipe: Observability Q&A. A question in Slack flows to the Agentcy observability agent, which draws on Sources (Grafana, k8s, Logs) and a Context Graph of tribal knowledge (service ↔ dashboard map · panel tags · prior queries), and returns a rendered panel as Output (chart + refs).]

At a glance

  • Inputs: Grafana (dashboards + datasources via Prometheus, Loki, SQL), Kubernetes (live pod state for context), an image renderer.
  • Trigger: Slack @agentcy mention or chat message.
  • Output: Slack thread with rendered Grafana panels + interpretation + Grafana URL.
  • Gates: read-only — no silencing, no annotations (Grafana connector is read-only in v1).

Stack

Layer        What we use
Channel      Slack bindings: @agentcy mention in #sre, #oncall, #ask-platform
Connectors   Grafana (6 tools) · Kubernetes (live pod state)
Renderer     grafana-image-renderer — sidecar service that turns panels into PNG
Agent        Observability agent with skills for "panel choice" + "metric interpretation"
Realm        infrastructure — scopes which Grafana org and which clusters the agent can see

What you'll build

  1. The agent receives @agentcy is checkout-svc latency healthy? in Slack.
  2. It runs grafana_list_dashboards to find dashboards mentioning checkout-svc.
  3. It picks the right one (e.g. Checkout Service · Latency), pulls it via grafana_get_dashboard, and identifies the relevant panel.
  4. It calls grafana_render_panel for that panel — gets back an SVG (preferred) or PNG (renderer fallback).
  5. It cross-checks live cluster state with kubernetes.list_pods for checkout-svc.
  6. It posts the rendered chart in the Slack thread with a one-line interpretation, citing the dashboard URL and the cluster snapshot timestamp.
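
Under the hood, steps 2–3 are standard Grafana HTTP API calls. A rough sketch with curl — the exact grafana_* tool-to-endpoint mapping is an implementation detail and the UIDs below are placeholders; the render call itself is shown in the renderer section under Prerequisites:

bash
# Step 2 — search dashboards mentioning the service
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.internal/api/search?query=checkout-svc&type=dash-db"

# Step 3 — fetch the chosen dashboard's JSON to pick the relevant panel
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.internal/api/dashboards/uid/<dashboard-uid>"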

Prerequisites

Three things, two of them often missed.

1 · A reachable Grafana

Your own self-hosted Grafana, Grafana Cloud, or the bundled stack Agentcy ships for evaluation:

bash
# From an unpacked Agentcy release
docker compose --profile grafana up -d
# → grafana on :13000, renderer sidecar internal on :8081
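
Once it's up, Grafana's health endpoint confirms it is ready (the renderer itself is only reachable on the internal Compose network):

bash
curl -s http://localhost:13000/api/health
# → {"database": "ok", ...}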

2 · A service-account token (Viewer role is enough)

In Grafana → Administration → Service accounts → Add service account → Add token. Copy the glsa_… (self-hosted) or glc_… (Cloud) value.
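
Prefer scripting it? The same thing via Grafana's service-account API — the admin:admin login and the names below are examples for the bundled stack, adjust for your install:

bash
# Create a Viewer service account…
curl -s -u admin:admin -X POST http://localhost:13000/api/serviceaccounts \
  -H 'content-type: application/json' \
  -d '{"name": "agentcy-readonly", "role": "Viewer"}'

# …then mint a token for it, using the id returned above
curl -s -u admin:admin -X POST "http://localhost:13000/api/serviceaccounts/<id>/tokens" \
  -H 'content-type: application/json' \
  -d '{"name": "agentcy"}'
# → {"key": "glsa_…"}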

3 · ⚠️ The image renderer

This is the prerequisite people miss. Grafana's /render/d-solo/... endpoint is what turns a panel into a PNG, and it requires grafana-image-renderer — either as a plugin installed in Grafana, or as a sidecar HTTP service. Without it:

  • SVG-supported panels (timeseries / stat / gauge / bar / table / text) still render because Agentcy synthesizes those in-process.
  • Heatmaps, custom panel plugins, and complex visualizations return renderer_unavailable and the agent gracefully skips them.

Recommended: run the renderer as a sidecar.

yaml
services:
  grafana:
    image: grafana/grafana-oss:11.3.0
    environment:
      GF_RENDERING_SERVER_URL: http://renderer:8081/render
      GF_RENDERING_CALLBACK_URL: http://grafana:3000/

  renderer:
    image: grafana/grafana-image-renderer:latest
    environment:
      ENABLE_METRICS: "true"

Or the equivalent on Railway:

toml
# infra/grafana/railway.toml — main Grafana service
[deploy]
healthcheckPath = "/api/health"
healthcheckTimeout = 180

# Then add a second Railway service for the renderer:
# image: grafana/grafana-image-renderer:latest
# In the Grafana service env: GF_RENDERING_SERVER_URL=http://${{renderer.RAILWAY_PRIVATE_DOMAIN}}:8081/render

The bundled infra/grafana/ Compose stack ships this preconfigured.
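
To confirm the wiring end to end, hit Grafana's render endpoint directly — any dashboard UID, slug, and panel ID will do (placeholders below). Getting a PNG back means grafana_render_panel will work too:

bash
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:13000/render/d-solo/<dashboard-uid>/<slug>?panelId=1&from=now-30m&to=now&width=800&height=400" \
  -o /tmp/panel.png && file /tmp/panel.png
# → /tmp/panel.png: PNG image data, 800 x 400, …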

Configure the source

bash
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name": "grafana-prod",
    "connector": "grafana",
    "realm": "infrastructure",
    "config": {
      "base_url": "https://grafana.internal",
      "api_token": "glsa_…",
      "org_id": 1
    }
  }'

Or the equivalent in TypeScript:

ts
await client.sources.create({
  name: 'grafana-prod',
  connector: 'grafana',
  realm: 'infrastructure',
  config: {
    base_url: 'https://grafana.internal',
    api_token: process.env.GRAFANA_TOKEN,
    org_id: 1,
  },
})

Validation hits /api/user and rejects on 401/403, so bad credentials fail fast at creation time.
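
You can reproduce that check by hand if source creation fails — it's a plain authenticated call:

bash
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://grafana.internal/api/user
# 200 → token is valid · 401/403 → fix the token (or its org) before retrying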

Worked example

rego
# policies/observability.rego
package agentcy

import rego.v1

# Strict read-only — no writes to Grafana or k8s.
default allow := false

allow if {
  input.tool in {
    "grafana_list_dashboards",
    "grafana_get_dashboard",
    "grafana_list_datasources",
    "grafana_query",
    "grafana_render_panel",
    "grafana_render_dashboard",
    "kubernetes.list_pods",
    "kubernetes.describe_deployment",
    "kubernetes.logs_for_pod",
  }
}

# Rate-limit expensive renders — at most 30 per minute per user.
deny contains msg if {
  input.tool == "grafana_render_dashboard"
  count(input.recent_calls_60s) > 30
  msg := "rate limit: 30 dashboard renders per minute per user"
}
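
A quick local sanity check of the policy before wiring it in, assuming the standard opa CLI (the input shape mirrors the fields the policy reads):

bash
cat > /tmp/input.json <<'EOF'
{"tool": "grafana_render_panel", "recent_calls_60s": []}
EOF
opa eval -d policies/observability.rego -i /tmp/input.json 'data.agentcy.allow'
# the expression should evaluate to true; swap in a write tool to see it flip to false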

Channel binding (Slack):

yaml
name: observability-qa
trigger:
  channel: slack
  bindings:
    - mention: "@agentcy"
      channel: ["#sre", "#oncall", "#ask-platform"]
realm: infrastructure
agent: observability-agent
enabled_skills: [panel-choice, metric-interpretation]
prompt: |
  You answer observability questions about our infrastructure.

  When asked about a service:
    1. Find the most relevant dashboard via grafana_list_dashboards (search by service name + relevant tag).
    2. Identify the most relevant panel via grafana_get_dashboard.
    3. Render the panel via grafana_render_panel for the time range that answers the question (default: last 30m).
    4. If the user asked about *current* state, also query kubernetes for live pod status.
    5. Reply with the rendered panel + a one-line interpretation + the dashboard URL.

  Always cite. Never invent metric values — if the data isn't there, say so.

What good looks like

shaked   @agentcy is checkout-svc latency healthy right now?

agentcy  📊 Checkout p99 latency last 30 min:
         [rendered panel — small image of the latency timeseries]
         p99 is 142ms (baseline ~120ms). Slight elevation since 14:02 UTC,
         coincides with the canary rollout of v2.4.1 (3/12 pods updated).
         All 12 pods are Ready in cluster prod-us.

         Source: https://grafana.internal/d/checkout-latency
         Cluster snapshot: 2026-04-25 14:18 UTC

A more complex one — "why are Loki errors spiking?":

shaked   @agentcy why are 5xx errors spiking in checkout-svc?

agentcy  🔍 Pulled the last 15 min of 5xx from Loki for {service="checkout-svc"}:
         [rendered chart — error count by route]
         98% of errors come from POST /checkout/finalize.
         Top error message: "downstream timeout: payments-svc (deadline=2s)".
         Cross-checked: payments-svc p99 is 3.4s right now (vs 800ms baseline).

         Likely root cause: payments-svc latency, not checkout-svc itself.
         Source: https://grafana.internal/d/checkout-errors
                 https://grafana.internal/d/payments-latency

Variations

  • PR-bot integration — when a PR opens, render the SLO panel for the affected service and post it as a PR comment so reviewers see "here's what current latency looks like before this lands".
  • Daily SLO digest — cron at 09:00 UTC, render the top 5 SLO panels, post to #sre-status. Suppress unchanged panels via memory.
  • Voice query — bind to the voice channel — "hey Agentcy, show me checkout latency" — TTS reads back the interpretation while the chart drops in Slack.
  • Alert enrichment — when SRE & Incident Response fires, this recipe's agent acts as the panel-rendering subroutine that adds visual context to the diagnosis.

Troubleshooting

  • renderer_unavailable on every render. The renderer sidecar isn't reachable. Check GF_RENDERING_SERVER_URL on the Grafana process — should be http://renderer:8081/render (Compose) or the Railway internal URL. Verify with curl $GF_RENDERING_SERVER_URL/version from inside the Grafana container.

  • Tools return SVGs but renders are blank/white. The renderer is up but Grafana can't call back into itself. Set GF_RENDERING_CALLBACK_URL=http://grafana:3000/ (or the actual address Grafana is reachable at from the renderer's perspective).

  • Agent picks the wrong dashboard. grafana_list_dashboards searches by title and tags. Tag your dashboards consistently (team:checkout, service:checkout-svc) and the agent's choices get dramatically better. Document this in your panel-choice skill.

  • Charts always show "no data". Time range mismatch. The agent defaults to now-30m → now. If your dashboard's variables (datasource, instance) are required, the agent needs to discover them — grafana_get_dashboard returns the panel's targets, including any templated variables. Make sure your skill teaches the agent to fill them in.

  • Slow renders timing out. Heatmaps and large dashboards can take >30s. The renderer's default timeout is 30s — bump RENDERING_TIMEOUT=60 on the renderer service. For dashboards with many panels, prefer rendering individual panels via grafana_render_panel rather than the full grafana_render_dashboard.

  • Token works in curl but Agentcy says 401. Org mismatch. Tokens are scoped per Grafana org. If your token was created in org 2, set org_id: 2 in the source config — default is 1.
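
The renderer and org checks above, as copy-pasteable commands (Compose service and env names assumed from the stack earlier):

bash
# Renderer reachable from the Grafana container? Should print the renderer version.
docker compose exec grafana sh -c 'curl -s "$GF_RENDERING_SERVER_URL/version"'

# Which Grafana org is the token scoped to? Compare with org_id in the source config.
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" https://grafana.internal/api/org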

Built by AgentcyLabs. For in-house deployment or Agentcy Cloud (PaaS) access, visit agentcylabs.com.