# Voice

Voice turns Agentcy into a spoken interface. You speak; STT transcribes; the agent loop runs; TTS synthesizes a reply; everything streams over WebRTC or a long-lived WebSocket.

Feature-gated behind `AGENTCY_FEATURES_VOICE=true`. Route group: `/api/v1/voice/*`.
## Providers
| Role | Provider | Default? |
|---|---|---|
| STT | Deepgram Nova-2 | yes |
| STT | Google Cloud | — |
| STT | Azure | — |
| TTS | ElevenLabs | yes |
| TTS | Deepgram Aura | — |
| TTS | Azure | — |
Pick via env:
```env
AGENTCY_FEATURES_VOICE=true
VOICE_STT_PROVIDER=deepgram
VOICE_TTS_PROVIDER=elevenlabs
DEEPGRAM_API_KEY=…
ELEVENLABS_API_KEY=…
ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL
```

## Start a session
```bash
curl -X POST http://localhost:8080/api/v1/voice/sessions \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "realm": "infrastructure",
    "voice_id": "EXAVITQu4vr4xnSDxMaL",
    "audio": {
      "sample_rate": 24000,
      "vad": true,
      "noise_suppression": true
    },
    "limits": {"max_duration_secs": 1800}
  }'
```

Response:
```json
{
  "session_id": "vs_01HABC…",
  "ws_url": "wss://your-agentcy/voice/sessions/vs_…/ws?token=…"
}
```

Connect the browser to `ws_url` and send audio frames (PCM16LE at the declared sample rate). Received frames are TTS audio; you play them back.
The browser stack in frontend/lib/voice/ implements the full client — use that as the reference implementation.
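As orientation for that capture path: Web Audio hands the client `Float32Array` chunks, while the socket wants PCM16LE bytes. A minimal encoder sketch (illustrative, not the shipped `frontend/lib/voice/` code):

```typescript
// Convert Web Audio Float32 samples (-1..1) to the PCM16LE bytes the
// voice WebSocket expects. Sketch only; the reference client does more
// (chunking, resampling, worklet scheduling).
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return out;
}
```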
## Message model
Voice inherits the chat agent loop. Every turn:
- User audio → STT partials → interim transcript shown in the UI.
- VAD detects end-of-utterance → finalize transcript.
- Transcript goes to the agent loop like a regular chat message.
- Agent response content deltas → TTS → audio frames back.
- Tool calls and approvals: raised as banners in the UI; speech pauses until resolved.
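The end-of-utterance step above can be sketched as a silence timer keyed to `vad_silence_ms`. This is a hypothetical energy-threshold VAD for intuition; the real pipeline is more sophisticated:

```typescript
// End-of-utterance sketch: finalize once input has been "silent" for
// vadSilenceMs after speech was heard. Threshold-based energy VAD is an
// assumption here, chosen only to keep the example self-contained.
class UtteranceDetector {
  private silentMs = 0;
  private speaking = false;
  constructor(private vadSilenceMs = 400, private threshold = 0.01) {}

  // Feed one audio chunk; returns true when the utterance just ended.
  push(chunk: Float32Array, chunkMs: number): boolean {
    const energy = chunk.reduce((a, s) => a + s * s, 0) / chunk.length;
    if (energy > this.threshold) {    // voiced frame: reset the timer
      this.speaking = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.speaking) return false; // silence before any speech
    this.silentMs += chunkMs;
    if (this.silentMs >= this.vadSilenceMs) {
      this.speaking = false;          // utterance finished; transcript finalizes
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```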
## Interruptions (barge-in)
When the agent is speaking and the user starts speaking again:
- VAD fires on user-start.
- Client sends an `interrupt` frame.
- Server cancels in-flight TTS, waits for the user utterance, processes it.
Barge-in works well with Deepgram + ElevenLabs turbo; latency is ~300 ms end-to-end on a decent link.
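Client-side, barge-in reduces to a small state check: send `interrupt` only when agent TTS is actually playing. A sketch (the frame shapes other than `interrupt` are assumptions, not the documented wire format):

```typescript
// Outbound frame shapes are assumed for illustration.
type OutFrame = { type: "interrupt" } | { type: "audio"; pcm: Uint8Array };

// Barge-in sketch: decide what to send when the local VAD fires on
// user speech. Interrupt only if agent TTS is currently playing.
class BargeIn {
  agentSpeaking = false; // toggled by server TTS start/stop notifications

  onUserSpeechStart(pcm: Uint8Array): OutFrame[] {
    const frames: OutFrame[] = [];
    if (this.agentSpeaking) {
      frames.push({ type: "interrupt" }); // server cancels in-flight TTS
      this.agentSpeaking = false;
    }
    frames.push({ type: "audio", pcm }); // then stream the user's audio
    return frames;
  }
}
```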
## Audio config
Reasonable defaults are set; only change if you know what you're doing.
```json
"audio": {
  "sample_rate": 24000,        // 16000 for phone, 24000 for web, 48000 for pro
  "channels": 1,
  "vad": true,                 // voice activity detection
  "vad_silence_ms": 400,       // end-of-utterance silence
  "noise_suppression": true,
  "echo_cancellation": true
}
```

## Limits
```json
"limits": {
  "max_duration_secs": 1800,   // kill after 30 min
  "max_silence_secs": 180      // kill after 3 min of silence
}
```

Org-wide caps via env:
```env
VOICE_MAX_CONCURRENT_SESSIONS_PER_ORG=3
VOICE_MAX_MINUTES_PER_DAY=240
```

## Telephony (SIP / Twilio)
Out of scope for this release. To bridge a phone call to Agentcy, use Twilio Media Streams to a small proxy that maps audio frames to our WebSocket frame format. Reference proxy lives in deployments/twilio-bridge/ (not shipped by default).
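Twilio Media Streams deliver 8 kHz G.711 μ-law, so any bridge has to decode to PCM16 before re-framing (resampling 8 kHz up to the session's rate is omitted here). A minimal decode sketch, not the reference proxy:

```typescript
// G.711 μ-law byte → signed 16-bit PCM sample (standard expansion).
function mulawToPcm16(byte: number): number {
  const u = ~byte & 0xff;               // μ-law bytes are stored complemented
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode one Twilio media payload (base64 μ-law) into PCM16 samples.
// Node runtime assumed for the bridge process.
function decodeTwilioPayload(b64: string): Int16Array {
  const raw = Buffer.from(b64, "base64");
  const out = new Int16Array(raw.length);
  for (let i = 0; i < raw.length; i++) out[i] = mulawToPcm16(raw[i]);
  return out;
}
```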
## Cost
Voice is substantially more expensive per minute than text chat because of STT + TTS + LLM. Rough: $0.05–$0.15/minute depending on providers and verbosity. The usage endpoint reports minutes and cost:
```bash
curl "http://…/voice/usage?days=7" -H "authorization: Bearer $TOKEN" | jq
```

## Gotchas
- Browsers need a user gesture to start playback. The reference client creates the AudioContext in a click handler; otherwise iOS blocks audio.
- Mobile networks drop WebSockets under flaky conditions. The client must reconnect with `?resume=<last_frame_id>`; session state survives up to 60 seconds of disconnect.
- STT languages vary. Deepgram Nova-2 is English-first; set `stt_language` for others.
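The reconnect gotcha boils down to: remember the last frame id, rebuild the URL with `resume`, and back off between attempts while staying inside the 60-second window. Hypothetical helpers; the shipped client in frontend/lib/voice/ already handles this:

```typescript
// Build the resume URL for a dropped voice WebSocket; the server
// replays from the given frame id.
function resumeUrl(wsUrl: string, lastFrameId: string): string {
  const u = new URL(wsUrl);
  u.searchParams.set("resume", lastFrameId);
  return u.toString();
}

// Exponential backoff delays, capped so the total retry schedule stays
// well inside the 60-second window that session state survives.
function backoffDelaysMs(attempts: number, baseMs = 500, capMs = 8000): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs),
  );
}
```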
## Next
- Concept: Agent Loop
- How-To: Approval Flows — approvals during a voice turn are UI banners, not speech.