# Observability Q&A
Stop alt-tabbing between Grafana, kubectl, and a Loki tab. Ask `@agentcy is checkout-svc latency healthy?` in Slack — the agent queries the right datasource, renders the right panel, and replies with the chart inline plus a one-line interpretation, with every claim cited back to a Grafana URL.
The trick: Grafana's image renderer. Without it, agents can only reply with raw timeseries numbers. With it, they return real charts a human can glance at — the same panels engineers already trust.
*Architecture: a Slack question flows into Agentcy's observability agent, which draws on the context graph of tribal knowledge (service ↔ dashboard map, panel tags, prior queries) and the connected sources (Grafana, k8s, logs) to produce a rendered panel as output — chart plus references.*
## At a glance
- Inputs: Grafana (dashboards + datasources via Prometheus, Loki, SQL), Kubernetes (live pod state for context), an image renderer.
- Trigger: Slack `@agentcy` mention or chat message.
- Output: Slack thread with rendered Grafana panels + interpretation + Grafana URL.
- Gates: read-only — no silencing, no annotations (the Grafana connector is read-only in v1).
## Stack
| Layer | What we use |
|---|---|
| Channel | Slack bindings — @agentcy mention in #sre, #oncall, #ask-platform |
| Connectors | Grafana (6 tools) · Kubernetes (live pod state) |
| Renderer | grafana-image-renderer — sidecar service that turns panels into PNG |
| Agent | Observability agent with skills for "panel choice" + "metric interpretation" |
| Realm | infrastructure — scopes which Grafana org and which clusters the agent can see |
## What you'll build
- The agent receives `@agentcy is checkout-svc latency healthy?` in Slack.
- It runs `grafana_list_dashboards` to find dashboards mentioning `checkout-svc`.
- It picks the right one (e.g. Checkout Service · Latency) via `grafana_get_dashboard` and identifies the relevant panel.
- It calls `grafana_render_panel` for that panel — and gets back an SVG (preferred) or PNG (renderer fallback). See the sketch after this list for the raw endpoint this maps to.
- It cross-checks live cluster state with `kubernetes.list_pods` for `checkout-svc`.
- It posts the rendered chart in the Slack thread with a one-line interpretation, citing the dashboard URL and the cluster snapshot timestamp.
## Prerequisites
Three things, two of them often missed.
### 1 · A reachable Grafana
Your own self-hosted Grafana, Grafana Cloud, or the bundled stack Agentcy ships for evaluation:
```bash
# From an unpacked Agentcy release
docker compose --profile grafana up -d
# → grafana on :13000, renderer sidecar internal on :8081
```

### 2 · A service-account token (Viewer role is enough)
In Grafana → Administration → Service accounts → Add service account → Add token. Copy the `glsa_…` (self-hosted) or `glc_…` (Cloud) value.
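Worth a sanity check before handing the token to Agentcy — this is the same `/api/user` call Agentcy's own validation makes (the base URL is illustrative):

```bash
# A 200 with the service account's identity means the token is good;
# a 401/403 means it will fail Agentcy's source validation too.
curl -s -H "Authorization: Bearer glsa_…" https://grafana.internal/api/user
```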
### 3 · ⚠️ The image renderer
This is the prerequisite people miss. Grafana's `/render/d-solo/...` endpoint is what turns a panel into a PNG, and it requires `grafana-image-renderer` — either as a plugin installed in Grafana, or as a sidecar HTTP service. Without it:
- SVG-supported panels (timeseries / stat / gauge / bar / table / text) still render because Agentcy synthesizes those in-process.
- Heatmaps, custom panel plugins, and complex visualizations return `renderer_unavailable` and the agent gracefully skips them.
Recommended: run the renderer as a sidecar.
```yaml
services:
  grafana:
    image: grafana/grafana-oss:11.3.0
    environment:
      GF_RENDERING_SERVER_URL: http://renderer:8081/render
      GF_RENDERING_CALLBACK_URL: http://grafana:3000/
  renderer:
    image: grafana/grafana-image-renderer:latest
    environment:
      ENABLE_METRICS: "true"
```

Or, if you deploy on Railway:

```toml
# infra/grafana/railway.toml — main Grafana service
[deploy]
healthcheckPath = "/api/health"
healthcheckTimeout = 180
# Then add a second Railway service for the renderer:
# image: grafana/grafana-image-renderer:latest
# In the Grafana service env:
#   GF_RENDERING_SERVER_URL=http://${{renderer.RAILWAY_PRIVATE_DOMAIN}}:8081/render
```

The bundled `infra/grafana/` Compose stack ships this preconfigured.
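To confirm Grafana can actually reach the renderer, probe it from inside the Grafana container (service names match the Compose stack above; assumes `curl` is available in the image):

```bash
# grafana-image-renderer answers on <server-url>/version when healthy —
# the same check the Troubleshooting section below recommends.
docker compose exec grafana curl -s http://renderer:8081/render/version
```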
## Configure the source
```bash
curl -X POST http://localhost:8080/api/v1/sources \
-H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
-d '{
"name": "grafana-prod",
"connector": "grafana",
"realm": "infrastructure",
"config": {
"base_url": "https://grafana.internal",
"api_token": "glsa_…",
"org_id": 1
}
}'
```

Or the same thing via the TypeScript client:

```ts
await client.sources.create({
name: 'grafana-prod',
connector: 'grafana',
realm: 'infrastructure',
config: {
base_url: 'https://grafana.internal',
api_token: process.env.GRAFANA_TOKEN,
org_id: 1,
},
})
```

Validation hits `/api/user` and rejects on 401/403, so bad credentials fail fast at creation time.
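To read the source back after creation — assuming the API supports `GET` on the same path, which is an assumption rather than something documented here:

```bash
# Hypothetical read-back: list sources and pick out grafana-prod.
# Assumes the endpoint returns a JSON array of source objects.
curl -s -H "authorization: Bearer $TOKEN" http://localhost:8080/api/v1/sources \
  | jq '.[] | select(.name == "grafana-prod")'
```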
## Worked example

The policy first — strict read-only, with a rate limit on expensive renders:

```rego
# policies/observability.rego
package agentcy
# Strict read-only — no writes to Grafana or k8s.
default allow := false
allow if {
    input.tool in {
        "grafana_list_dashboards",
        "grafana_get_dashboard",
        "grafana_list_datasources",
        "grafana_query",
        "grafana_render_panel",
        "grafana_render_dashboard",
        "kubernetes.list_pods",
        "kubernetes.describe_deployment",
        "kubernetes.logs_for_pod",
    }
}
# Rate-limit expensive renders — at most 30 per minute per user.
deny contains msg if {
    input.tool == "grafana_render_dashboard"
    count(input.recent_calls_60s) > 30
    msg := "rate limit: 30 dashboard renders per minute per user"
}
```
Channel binding (Slack):

```yaml
name: observability-qa
trigger:
  channel: slack
  bindings:
    - mention: "@agentcy"
      channel: ["#sre", "#oncall", "#ask-platform"]
realm: infrastructure
agent: observability-agent
enabled_skills: [panel-choice, metric-interpretation]
prompt: |
  You answer observability questions about our infrastructure.
  When asked about a service:
  1. Find the most relevant dashboard via grafana_list_dashboards (search by service name + relevant tag).
  2. Identify the most relevant panel via grafana_get_dashboard.
  3. Render the panel via grafana_render_panel for the time range that answers the question (default: last 30m).
  4. If the user asked about *current* state, also query kubernetes for live pod status.
  5. Reply with the rendered panel + a one-line interpretation + the dashboard URL.
  Always cite. Never invent metric values — if the data isn't there, say so.
```

### What good looks like
**shaked** — @agentcy is checkout-svc latency healthy right now?

**agentcy** — 📊 Checkout p99 latency last 30 min:

[rendered panel — small image of the latency timeseries]

p99 is 142ms (baseline ~120ms). Slight elevation since 14:02 UTC,
coincides with the canary rollout of v2.4.1 (3/12 pods updated).
All 12 pods are Ready in cluster prod-us.

Source: https://grafana.internal/d/checkout-latency
Cluster snapshot: 2026-04-25 14:18 UTC

A more complex one — "why are Loki errors spiking?":
**shaked** — @agentcy why are 5xx errors spiking in checkout-svc?

**agentcy** — 🔍 Pulled the last 15 min of 5xx from Loki for `{service="checkout-svc"}`:

[rendered chart — error count by route]

98% of errors come from POST /checkout/finalize.
Top error message: "downstream timeout: payments-svc (deadline=2s)".
Cross-checked: payments-svc p99 is 3.4s right now (vs 800ms baseline).
Likely root cause: payments-svc latency, not checkout-svc itself.

Source: https://grafana.internal/d/checkout-errors
Source: https://grafana.internal/d/payments-latency
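The Loki query behind a chart like that looks roughly like the following — the `service` label and the JSON log shape (`status` and `route` fields) are assumptions about your logging pipeline, not something this recipe prescribes:

```bash
# Sketch: 5xx count by route over the last 15 minutes, via logcli.
# Assumes JSON-formatted logs that carry status and route fields.
logcli query --since=15m \
  'sum by (route) (count_over_time({service="checkout-svc"} | json | status=~"5.." [15m]))'
```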
## Variations

- PR-bot integration — when a PR opens, render the SLO panel for the affected service and post it as a PR comment so reviewers see "here's what current latency looks like before this lands".
- Daily SLO digest — cron at 09:00 UTC, render the top 5 SLO panels, post to `#sre-status`. Suppress unchanged panels via memory.
- Voice query — bind to the voice channel — "hey Agentcy, show me checkout latency" — TTS reads back the interpretation while the chart drops in Slack.
- Alert enrichment — when SRE & Incident Response fires, this recipe's agent acts as the panel-rendering subroutine that adds visual context to the diagnosis.
## Troubleshooting
**`renderer_unavailable` on every render.** The renderer sidecar isn't reachable. Check `GF_RENDERING_SERVER_URL` on the Grafana process — it should be `http://renderer:8081/render` (Compose) or the Railway internal URL. Verify with `curl $GF_RENDERING_SERVER_URL/version` from inside the Grafana container.

**Tools return SVGs but renders are blank/white.** The renderer is up but Grafana can't call back into itself. Set `GF_RENDERING_CALLBACK_URL=http://grafana:3000/` (or the actual address Grafana is reachable at from the renderer's perspective).

**Agent picks the wrong dashboard.** `grafana_list_dashboards` searches by title and tags. Tag your dashboards consistently (`team:checkout`, `service:checkout-svc`) and the agent's choices get dramatically better. Document this in your panel-choice skill.

**Charts always show "no data".** Time-range mismatch. The agent defaults to `now-30m → now`. If your dashboard's variables (datasource, instance) are required, the agent needs to discover them — `grafana_get_dashboard` returns each panel's `targets`, including any templated variables. Make sure your skill teaches the agent to fill them in.

**Slow renders timing out.** Heatmaps and large dashboards can take >30s. The renderer's default timeout is 30s — bump `RENDERING_TIMEOUT=60` on the renderer service. For dashboards with many panels, prefer rendering individual panels via `grafana_render_panel` rather than the full `grafana_render_dashboard`.

**Token works in `curl` but Agentcy says 401.** Org mismatch. Tokens are scoped per Grafana org. If your token was created in org 2, set `org_id: 2` in the source config — the default is 1.
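For the org-mismatch case, you can ask Grafana which org a token is scoped to (a standard Grafana API endpoint):

```bash
# Returns the org the token authenticates into; its id should match
# org_id in the Agentcy source config.
curl -s -H "Authorization: Bearer glsa_…" https://grafana.internal/api/org
```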
## Next
- Connector: Grafana — full tool reference, bundled stack, Railway deployment
- Concept: Skills — for the panel-choice + metric-interpretation skills
- How-To: Slack channel
- Recipe: SRE & Incident Response — same connector, alert-driven instead of Q&A