Platform Engineering
[Diagram: Sources (Kubernetes, GitHub, Grafana) feed the context graph of tribal knowledge — service catalog, conventions, CODEOWNERS, onboarding facts. The Agentcy platform agent (with IaC tools) answers and cites; output goes to the Slack #ask-platform channel.]
At a glance
The agent's job: be the first responder in #ask-platform. Answer self-service questions developers would otherwise interrupt the platform team with — "why is my pod restarting?", "is the prod SLO panel up?", "how do I deploy a new service?" — in-thread, with citations.
Stack
- Kubernetes — pods, deployments, events, logs.
- GitHub — service repos, GitHub Actions runs, CODEOWNERS.
- Grafana — live SLO panels.
- Slack — `#ask-platform` channel binding.
- Memory — runbooks, conventions, "we always X, never Y" facts.
- Channels & Triggers — a Slack mention triggers the agent.
What you'll build
A Slack channel binding so any @agentcy mention in #ask-platform triggers the agent. The agent uses the catalog to find the right tool per question:
- "why is my checkout-svc pod crashing?" → `kubernetes.events_for_pod` + `kubernetes.logs_for_pod`.
- "how do I create a new service?" → memory recall for the service-template runbook + `github.get_file` to fetch the cookie-cutter template.
- "is the SLO panel green?" → `grafana_render_panel` for the latest snapshot.
- "who owns service-X?" → `github.get_file(.github/CODEOWNERS)` or memory recall.
The agent threads its reply to the original message so the channel doesn't get noisy.
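The routing above can be sketched as a keyword-to-tool table. This is illustrative only: in Agentcy the LLM selects tools from the catalog at run time, and the keyword patterns below are invented for the sketch.

```python
# Illustrative sketch of catalog-based tool routing -- NOT Agentcy's actual
# implementation. First matching route wins; unmatched questions fall back
# to memory recall (tribal knowledge).
ROUTES = [
    (("crash", "restart", "pod"), ["kubernetes.events_for_pod", "kubernetes.logs_for_pod"]),
    (("new service", "create a"), ["memory.recall", "github.get_file"]),
    (("slo", "panel"), ["grafana_render_panel"]),
    (("who owns",), ["github.get_file", "memory.recall"]),
]

def route(question: str) -> list[str]:
    """Return candidate tools for a question (first matching route wins)."""
    q = question.lower()
    for keywords, tools in ROUTES:
        if any(k in q for k in keywords):
            return tools
    return ["memory.recall"]  # fall back to tribal knowledge
```

In the real system the catalog descriptions do this work; the point is only that each question class maps to a small, predictable tool set.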
Prerequisites
- Kubernetes with read-only access (pods, deployments, events, logs).
- GitHub with `contents: read` on service repos and `members: read` on the org.
- Grafana configured (for SLO panels).
- Slack with the bot in `#ask-platform`. Scopes: `chat:write`, `channels:history` (to receive mentions).
- A frontier-class LLM — these questions vary widely.
- Memory seeded with platform conventions (see step 3).
- Realm — `infrastructure` recommended.
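For the Kubernetes prerequisite, a minimal read-only RBAC sketch (the role name is a placeholder; scope it down to a namespaced Role if cluster-wide read is too broad):

```yaml
# Minimal read-only access for the agent's service account (placeholder name).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: agentcy-readonly
rules:
  - apiGroups: [""]   # core API group: pods, pod logs, events
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
```

Bind it to the agent's service account with a ClusterRoleBinding, or use the built-in `view` cluster role if broader read access is acceptable.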
Step-by-step
1. Configure the connectors
```text
1. Open /connectors → "+ Add Connector".
2. Pick Kubernetes — paste kubeconfig or use in-cluster service account.
   Realm: infrastructure.
3. Pick GitHub — PAT or App with read on the repos.
4. Pick Grafana — service account token, Viewer role.
5. Slack already exists — re-use.
```

```bash
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"k8s-prod","connector":"kubernetes","realm":"infrastructure",
    "config":{"auth":{"kind":"in_cluster"}}
  }'
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"github-services","connector":"github","realm":"infrastructure",
    "config":{"auth":{"kind":"pat","token":"ghp_..."},"orgs":["acme"]}
  }'
curl -X POST http://localhost:8080/api/v1/sources \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "name":"grafana-prod","connector":"grafana","realm":"infrastructure",
    "config":{"base_url":"https://grafana.internal","api_token":"glsa_..."}
  }'
```

2. Bind Slack #ask-platform to the agent
```text
1. Open /channels → Slack card → Bindings → "+ New".
2. Agent: default · channel: slack · slack_channel: #ask-platform ·
   mentioned: yes · preserve_thread: yes.
3. Save.
4. Invite @agentcy-bot to #ask-platform in Slack.
```

```bash
curl -X POST http://localhost:8080/api/v1/bindings \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "agent":"default",
    "match_rule":{
      "channel":"slack",
      "slack_channel":"#ask-platform",
      "mentioned":true,
      "preserve_thread":true
    }
  }'
```

3. Seed memory with platform conventions
The agent's quality depends heavily on having internal context the LLM can't infer. Add memories.
```text
1. Open /memory → "+ New memory" — repeat for each fact below.
2. Kind: fact. Realm: infrastructure.
3. Suggested seed memories:
   - Service template lives in acme/service-template (cookie-cutter).
   - All services deploy via the deploy-svc GitHub Action; never kubectl apply by hand.
   - Production cluster is prod-us-east; staging is staging-us-east.
   - Every service must register in service-catalog.yaml with an owner.
   - Production SLO panels are in Grafana folder "SLOs · Production".
   - On-call rotation lives in PagerDuty schedule "platform-oncall".
4. Save each.
```

```bash
curl -X POST http://localhost:8080/api/v1/memory \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "kind":"fact","realm":"infrastructure",
    "text":"Service template lives in acme/service-template (cookie-cutter). To create a new service, run `gh repo create --template acme/service-template`."
  }'
curl -X POST http://localhost:8080/api/v1/memory \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "kind":"fact","realm":"infrastructure",
    "text":"All services deploy via the deploy-svc GitHub Action workflow. Never kubectl apply by hand. The action is triggered by pushing a tag matching v*.*.*."
  }'
# ... repeat for each fact
```

4. (Optional) Restrict to read-only
The platform helper should never accidentally `kubectl delete pod`. Add a Rego policy:
```text
1. Open /security → Policies → "+ New Policy".
2. Name: platform-helper-readonly.
3. Paste the Rego from the API tab.
4. Test → enable.
```

```rego
package agentcy.platform_helper

deny[msg] {
  input.subject.binding == "ask-platform"
  input.resource.tool_effect != "read"
  msg := "platform-helper is read-only — no destructive tools from #ask-platform"
}

# Allow Slack thread replies (write but safe)
allow {
  input.subject.binding == "ask-platform"
  input.resource.connector == "slack"
  input.resource.tool == "slack.post_message"
}
```

5. Try it from Slack
In #ask-platform:
@agentcy why is checkout-svc-7c8f2-abc12 stuck in CrashLoopBackOff?
The agent calls kubernetes.events_for_pod, gets the recent events, returns a thread reply with the cause + suggested fix. All in ~5 seconds.
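The triage step can be sketched as a lookup over pod event and termination reasons. This is a toy model of the reasoning, not the agent's implementation, and the reason-to-fix table is invented for the sketch:

```python
# Toy triage table: common pod failure reasons -> suggested first fix.
# Illustrative only; the agent derives this from live kubernetes.events_for_pod output.
FIXES = {
    "OOMKilled": "raise resources.limits.memory and redeploy",
    "ImagePullBackOff": "check the image tag and registry credentials",
    "CrashLoopBackOff": "read the container logs for the crashing process",
    "FailedScheduling": "check node capacity, taints, and tolerations",
}

def suggest_fix(reasons: list[str]) -> str:
    """Return the first recognized reason with its suggested fix."""
    for reason in reasons:
        if reason in FIXES:
            return f"{reason}: {FIXES[reason]}"
    return "no known pattern; inspect logs manually"
```

The agent's real answer also cites the raw events, so the developer can verify the diagnosis rather than trust it blindly.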
Worked example
A "what good looks like" Q&A — representative of real exchanges, not invented:
Dev: @agentcy how do I bump the memory limit for orders-svc?
Agentcy: Production deploys require a tagged release of `acme/orders-svc`. To bump the memory limit:
- Edit `k8s/deployment.yaml` in the repo — change `resources.limits.memory`.
- Open a PR (CODEOWNERS will require approval from `@platform-team`).
- After merge, push a tag: `git tag v1.2.4 && git push --tags`.
- The `deploy-svc` GitHub Action picks up the tag and deploys.

Current limit on prod is `512Mi` (latest deploy: `kubernetes.get_deployment("orders-svc")`). Last bump was 2026-02-04 (PR #389).

Sources: acme/orders-svc CODEOWNERS · deploy-svc workflow · platform memory.
A second example:
Dev: @agentcy is the auth service SLO panel green right now?
Agentcy: [rendered Grafana panel attached] Auth p99 latency is at 142 ms (SLO threshold 300 ms). Error rate 0.04% (threshold 1%). All green for the last 24h.
Panel: Auth · production SLOs.
The transcript shows: `kubernetes.get_deployment`, `github.get_file`, `memory.recall`, `grafana_render_panel`, `slack.post_message`.
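The SLO verdict in the second answer reduces to simple threshold checks. A sketch (thresholds taken from the example above; the function name is ours, and the real numbers come from the rendered Grafana panel):

```python
def slo_green(p99_ms: float, err_pct: float,
              p99_slo_ms: float = 300.0, err_slo_pct: float = 1.0) -> bool:
    """True when both latency and error rate are inside their SLO thresholds."""
    return p99_ms < p99_slo_ms and err_pct < err_slo_pct

# The example's numbers: p99 latency 142 ms (< 300 ms) and error rate
# 0.04% (< 1%), so the agent reports all green.
```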
Variations
- Per-team channels. Bind `#ask-platform-data`, `#ask-platform-frontend`, etc., each with its own scoped memory and connector list.
- Memory-only mode for new joiners. Use the same agent but bind it to `#new-joiner-questions` — restrict it to memory recall only, no live infra reads.
- Add a "build me a new service" path. Combine with the Code Review use case — the agent in `#ask-platform` proposes a PR using the service template; the code-review agent reviews it.
- Voice mode. With voice enabled, on-call engineers can ask the agent questions hands-free during incidents.
Troubleshooting
The agent's answers are vague. This almost always means memory is sparse. Add more facts — the agent only knows what you've told it via memory plus what's reachable via tools.

Tool calls hit "permission denied" on K8s. The service account's RBAC needs, at minimum, `pods/log: get`, `events: list`, and `deployments: get` in the relevant namespaces. The read-only `view` cluster role is usually enough.

Replies appear in the channel, not in-thread. `preserve_thread: true` was off in the binding. Update the binding.

The bot replies even to messages that aren't questions for it. `mentioned: true` is off. Set it so only @agentcy mentions trigger the agent.

LLM cost is creeping up. This task is high-frequency by design. Set a `cost_cap_usd_per_day` on the binding's task. Frontier models cost more but reduce escalations to the platform team — measure both.
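A sketch of where such a cap could sit on the binding (`cost_cap_usd_per_day` is the field named above; its exact placement under a `task` object is an assumption about the schema, so check the binding API reference):

```json
{
  "agent": "default",
  "match_rule": { "channel": "slack", "slack_channel": "#ask-platform" },
  "task": { "cost_cap_usd_per_day": 5.0 }
}
```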
Next
- Concept: Channels & Triggers
- Concept: Memory System — investing in memory pays off most here.
- Use Case: SRE & Incident Response — the same agent can pivot to incident response.
- Use Case: CI/CD Intelligence