Platform Engineering
[Diagram: Sources (Kubernetes, GitHub, Grafana) feed the context graph of tribal knowledge — service catalog, conventions, CODEOWNERS, onboarding facts. The Agentcy platform agent (with IaC tools) answers and cites; output goes to the Slack #ask-platform channel.]
At a glance
The agent's job: be the first responder in #ask-platform. Answer self-service questions developers would otherwise interrupt the platform team with — "why is my pod restarting?", "is the prod SLO panel up?", "how do I deploy a new service?" — in-thread, with citations.
Stack
- Kubernetes — pods, deployments, events, logs.
- GitHub — service repos, GitHub Actions runs, CODEOWNERS.
- Grafana — live SLO panels.
- Slack — `#ask-platform` channel binding.
- Memory — runbooks, conventions, "we always X, never Y" facts.
- Channels & Triggers — a Slack mention triggers the agent.
What you'll build
A Slack channel binding so any @agentcy mention in #ask-platform triggers the agent. The agent uses the catalog to find the right tool per question:
- "why is my checkout-svc pod crashing?" → `kubernetes.events_for_pod` + `kubernetes.logs_for_pod`.
- "how do I create a new service?" → memory recall for the service-template runbook + `github.get_file` to fetch the cookie-cutter template.
- "is the SLO panel green?" → `grafana_render_panel` for the latest snapshot.
- "who owns service-X?" → `github.get_file(.github/CODEOWNERS)` or memory recall.
The agent threads its reply to the original message so the channel doesn't get noisy.
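The routing above can be sketched as a keyword-to-tool table. This is illustrative only: in Agentcy the LLM selects tools from the catalog at run time, and the keyword patterns below are invented for the sketch.

```python
# Illustrative sketch of catalog-based tool routing -- NOT Agentcy's actual
# implementation. First matching route wins; unmatched questions fall back
# to memory recall (tribal knowledge).
ROUTES = [
    (("crash", "restart", "pod"), ["kubernetes.events_for_pod", "kubernetes.logs_for_pod"]),
    (("new service", "create a"), ["memory.recall", "github.get_file"]),
    (("slo", "panel"), ["grafana_render_panel"]),
    (("who owns",), ["github.get_file", "memory.recall"]),
]

def route(question: str) -> list[str]:
    """Return candidate tools for a question (first matching route wins)."""
    q = question.lower()
    for keywords, tools in ROUTES:
        if any(k in q for k in keywords):
            return tools
    return ["memory.recall"]  # fall back to tribal knowledge
```

In the real system the catalog descriptions do this work; the point is only that each question class maps to a small, predictable tool set.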
Prerequisites
- Kubernetes with read-only access (pods, deployments, events, logs).
- GitHub with `contents: read` on service repos and `members: read` on the org.
- Grafana configured (for SLO panels).
- Slack with the bot in `#ask-platform`. Scopes: `chat:write`, `channels:history` (to receive mentions).
- A frontier-class LLM — these questions vary widely.
- Memory seeded with platform conventions (see step 3).
- Realm — `infrastructure` recommended.
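For the Kubernetes prerequisite, a minimal read-only RBAC sketch (the role name is a placeholder; scope it down to a namespaced Role if cluster-wide read is too broad):

```yaml
# Minimal read-only access for the agent's service account (placeholder name).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentcy-readonly
rules:
  - apiGroups: [""]   # core API group: pods, pod logs, events
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
```

Bind it to the agent's service account with a ClusterRoleBinding, or use the built-in `view` cluster role if broader read access is acceptable.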
Step-by-step
1. Configure the connectors
```text
1. Open /connectors → "+ Add Connector".
2. Pick Kubernetes — paste kubeconfig or use in-cluster service account.
   Realm: infrastructure.
3. Pick GitHub — PAT or App with read on the repos.
4. Pick Grafana — service account token, Viewer role.
5. Slack already exists — re-use.
```

```bash
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"k8s-prod","connector":"kubernetes","realm":"infrastructure",
    "config":{"auth":{"kind":"in_cluster"}}
  }'
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"github-services","connector":"github","realm":"infrastructure",
    "config":{"auth":{"kind":"pat","token":"ghp_..."},"orgs":["acme"]}
  }'
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"grafana-prod","connector":"grafana","realm":"infrastructure",
    "config":{"base_url":"https://grafana.internal","api_token":"glsa_..."}
  }'
```

2. Bind Slack #ask-platform to the agent
```text
1. Open /channels → Slack card → Bindings → "+ New".
2. Agent: default · channel: slack · slack_channel: #ask-platform ·
   mentioned: yes · preserve_thread: yes.
3. Save.
4. Invite @agentcy-bot to #ask-platform in Slack.
```

```bash
curl -X POST http://localhost:8080/api/v1/bindings \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "agent":"default",
    "match_rule":{
      "channel":"slack",
      "slack_channel":"#ask-platform",
      "mentioned":true,
      "preserve_thread":true
    }
  }'
```

3. Seed memory with platform conventions
The agent's quality depends heavily on having internal context the LLM can't infer. Add memories.
```text
1. Open /memory → "+ New memory" — repeat for each fact below.
2. Kind: fact. Realm: infrastructure.
3. Suggested seed memories:
   - Service template lives in acme/service-template (cookie-cutter).
   - All services deploy via the deploy-svc GitHub Action; never kubectl apply by hand.
   - Production cluster is prod-us-east; staging is staging-us-east.
   - Every service must register in service-catalog.yaml with an owner.
   - Production SLO panels are in Grafana folder "SLOs · Production".
   - On-call rotation lives in PagerDuty schedule "platform-oncall".
4. Save each.
```

```bash
curl -X POST http://localhost:8080/api/v1/memory \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "kind":"fact","realm":"infrastructure",
    "text":"Service template lives in acme/service-template (cookie-cutter). To create a new service, run `gh repo create --template acme/service-template`."
  }'
curl -X POST http://localhost:8080/api/v1/memory \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "kind":"fact","realm":"infrastructure",
    "text":"All services deploy via the deploy-svc GitHub Action workflow. Never kubectl apply by hand. The action is triggered by pushing a tag matching v*.*.*."
  }'
# ... repeat for each fact
```

4. (Optional) Restrict to read-only
The platform helper should never accidentally `kubectl delete pod`. Add a Rego policy:
```text
1. Open /security → Policies → "+ New Policy".
2. Name: platform-helper-readonly.
3. Paste the Rego from the API tab.
4. Test → enable.
```

```rego
package agentcy.platform_helper

deny[msg] {
  input.subject.binding == "ask-platform"
  input.resource.tool_effect != "read"
  msg := "platform-helper is read-only — no destructive tools from #ask-platform"
}

# Allow Slack thread replies (write but safe)
allow {
  input.subject.binding == "ask-platform"
  input.resource.connector == "slack"
  input.resource.tool == "slack.post_message"
}
```

5. Try it from Slack
In #ask-platform:
@agentcy why is checkout-svc-7c8f2-abc12 stuck in CrashLoopBackOff?
The agent calls kubernetes.events_for_pod, gets the recent events, returns a thread reply with the cause + suggested fix. All in ~5 seconds.
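The triage step can be sketched as a lookup over pod event and termination reasons. This is a toy model of the reasoning, not the agent's implementation, and the reason-to-fix table is invented for the sketch:

```python
# Toy triage table: common pod failure reasons -> suggested first fix.
# Illustrative only; the agent derives this from live kubernetes.events_for_pod output.
FIXES = {
    "OOMKilled": "raise resources.limits.memory and redeploy",
    "ImagePullBackOff": "check the image tag and registry credentials",
    "CrashLoopBackOff": "read the container logs for the crashing process",
    "FailedScheduling": "check node capacity, taints, and tolerations",
}

def suggest_fix(reasons: list[str]) -> str:
    """Return the first recognized reason with its suggested fix."""
    for reason in reasons:
        if reason in FIXES:
            return f"{reason}: {FIXES[reason]}"
    return "no known pattern; inspect logs manually"
```

The agent's real answer also cites the raw events, so the developer can verify the diagnosis rather than trust it blindly.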
Worked example
A "what good looks like" Q&A — representative of real exchanges, not invented:
Dev: @agentcy how do I bump the memory limit for orders-svc?
Agentcy: Production deploys require a tagged release of `acme/orders-svc`. To bump the memory limit:
- Edit `k8s/deployment.yaml` in the repo — change `resources.limits.memory`.
- Open a PR (CODEOWNERS will require approval from `@platform-team`).
- After merge, push a tag: `git tag v1.2.4 && git push --tags`.
- The `deploy-svc` GitHub Action picks up the tag and deploys.

Current limit on prod is `512Mi` (latest deploy: `kubernetes.get_deployment("orders-svc")`). Last bump was 2026-02-04 (PR #389).

Sources: acme/orders-svc CODEOWNERS · deploy-svc workflow · platform memory.
A second example:
Dev: @agentcy is the auth service SLO panel green right now?
Agentcy: [rendered Grafana panel attached] Auth p99 latency is at 142 ms (SLO threshold 300 ms). Error rate 0.04% (threshold 1%). All green for the last 24h.
Panel: Auth · production SLOs.
The transcript shows: `kubernetes.get_deployment`, `github.get_file`, `memory.recall`, `grafana_render_panel`, `slack.post_message`.
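The SLO verdict in the second answer reduces to simple threshold checks. A sketch (thresholds taken from the example above; the function name is ours, and the real numbers come from the rendered Grafana panel):

```python
def slo_green(p99_ms: float, err_pct: float,
              p99_slo_ms: float = 300.0, err_slo_pct: float = 1.0) -> bool:
    """True when both latency and error rate are inside their SLO thresholds."""
    return p99_ms < p99_slo_ms and err_pct < err_slo_pct

# The example's numbers: p99 latency 142 ms (< 300 ms) and error rate
# 0.04% (< 1%), so the agent reports all green.
```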
Variations
- Per-team channels. Bind `#ask-platform-data`, `#ask-platform-frontend`, etc., each with its own scoped memory and connector list.
- Memory-only mode for new joiners. Use the same agent but bind it to `#new-joiner-questions` — restrict it to memory recall only, no live infra reads.
- Add a "build me a new service" path. Combine with the Code Review use case — the agent in `#ask-platform` proposes a PR using the service template; the code-review agent reviews it.
- Voice mode. With voice enabled, on-call engineers can ask the agent questions hands-free during incidents.
Troubleshooting
The agent's answers are vague. This almost always means memory is sparse. Add more facts — the agent only knows what you've told it via memory plus what's reachable via tools.

Tool calls hit "permission denied" on K8s. The service account's RBAC needs, at minimum, `pods/log: get`, `events: list`, and `deployments: get` in the relevant namespaces. The read-only `view` cluster role is usually enough.

Replies appear in the channel, not in-thread. `preserve_thread: true` was off in the binding. Update the binding.

The bot replies even to messages that aren't questions for it. `mentioned: true` is off. Set it so only @agentcy mentions trigger the agent.

LLM cost is creeping up. This task is high-frequency by design. Set a `cost_cap_usd_per_day` on the binding's task. Frontier models cost more but reduce escalations to the platform team — measure both.
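A sketch of where such a cap could sit on the binding (`cost_cap_usd_per_day` is the field named above; its exact placement under a `task` object is an assumption about the schema, so check the binding API reference):

```json
{
  "agent": "default",
  "match_rule": { "channel": "slack", "slack_channel": "#ask-platform" },
  "task": { "cost_cap_usd_per_day": 5.0 }
}
```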
Next
- Concept: Channels & Triggers
- Concept: Memory System — investing in memory pays off most here.
- Use Case: SRE & Incident Response — the same agent can pivot to incident response.
- Use Case: CI/CD Intelligence