
Voice

Voice turns Agentcy into a spoken interface. You speak; STT transcribes; the agent loop runs; TTS synthesizes a reply; everything streams over WebRTC or a long-lived WebSocket.

Feature-gated behind AGENTCY_FEATURES_VOICE=true. Route group: /api/v1/voice/*.

Providers

Role | Provider        | Default?
-----|-----------------|---------
STT  | Deepgram Nova-2 | yes
STT  | Google Cloud    |
STT  | Azure           |
TTS  | ElevenLabs      | yes
TTS  | Deepgram Aura   |
TTS  | Azure           |

Pick via env:

env
AGENTCY_FEATURES_VOICE=true
VOICE_STT_PROVIDER=deepgram
VOICE_TTS_PROVIDER=elevenlabs
DEEPGRAM_API_KEY=…
ELEVENLABS_API_KEY=…
ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL

Start a session

bash
curl -X POST http://localhost:8080/api/v1/voice/sessions \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "realm":"infrastructure",
    "voice_id":"EXAVITQu4vr4xnSDxMaL",
    "audio":{
      "sample_rate": 24000,
      "vad":         true,
      "noise_suppression": true
    },
    "limits":{"max_duration_secs": 1800}
  }'

Response:

json
{
  "session_id":"vs_01HABC…",
  "ws_url":"wss://your-agentcy/voice/sessions/vs_…/ws?token=…"
}

Connect the browser to ws_url and send audio frames (PCM16LE at the declared sample rate). Received frames are TTS audio; you play them back.
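A minimal sketch of the frame encoding, assuming you capture microphone audio via the Web Audio API as Float32 samples in [-1, 1]; the helper name is ours — the reference client in frontend/lib/voice/ is authoritative.

```typescript
// Convert Web Audio Float32 samples to the PCM16LE bytes the
// voice WebSocket expects (hypothetical helper, not the shipped client).
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return out;
}
```

Feed each encoded chunk to `ws.send(...)` as a binary frame; the declared `sample_rate` in the session config must match what your AudioContext actually produces.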

The browser stack in frontend/lib/voice/ implements the full client — use that as the reference implementation.

Message model

Voice inherits the chat agent loop. Every turn:

  1. User audio → STT partials → interim transcript shown in the UI.
  2. VAD detects end-of-utterance → finalize transcript.
  3. Transcript goes to the agent loop like a regular chat message.
  4. Agent response content deltas → TTS → audio frames back.
  5. Tool calls and approvals are raised as banners in the UI; speech pauses until they're resolved.
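The turn above can be sketched as a client-side event reducer. The event names ("stt_partial", "stt_final", "tts_audio", "approval_required") are illustrative assumptions — check the reference client in frontend/lib/voice/ for the real wire format.

```typescript
// Illustrative server→client events for one voice turn.
type VoiceEvent =
  | { type: "stt_partial"; text: string }        // interim transcript for the UI
  | { type: "stt_final"; text: string }          // finalized utterance → agent loop
  | { type: "tts_audio"; frame: Uint8Array }     // synthesized audio to play back
  | { type: "approval_required"; tool: string }; // tool approval banner

interface TurnState {
  interim: string;    // live caption shown while the user speaks
  finals: string[];   // finalized transcripts sent to the agent
  paused: boolean;    // speech paused pending approval
}

function reduceTurn(state: TurnState, ev: VoiceEvent): TurnState {
  switch (ev.type) {
    case "stt_partial":
      return { ...state, interim: ev.text };
    case "stt_final":
      return { interim: "", finals: [...state.finals, ev.text], paused: state.paused };
    case "tts_audio":
      return state; // hand the frame to the audio playback queue
    case "approval_required":
      return { ...state, paused: true }; // banner shown; speech pauses until resolved
  }
}
```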

Interruptions (barge-in)

When the agent is speaking and the user starts speaking again:

  • VAD fires on user-start.
  • Client sends interrupt frame.
  • Server cancels the in-flight TTS, waits for the user's utterance, and processes it.

Barge-in works well with Deepgram + ElevenLabs turbo; latency is ~300 ms end-to-end on a decent link.
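The client side of this flow can be sketched as a small controller; the `{ type: "interrupt" }` frame shape is an assumption drawn from the flow above, not a documented wire format.

```typescript
// Barge-in sketch: send an interrupt frame only when the agent is
// actually speaking when VAD fires on user speech onset.
class BargeInController {
  private agentSpeaking = false;

  constructor(private send: (msg: string) => void) {}

  onTtsStart() { this.agentSpeaking = true; }
  onTtsDone()  { this.agentSpeaking = false; }

  // Called by VAD on user-start; returns true if an interrupt was sent.
  onUserStart(): boolean {
    if (!this.agentSpeaking) return false;
    this.send(JSON.stringify({ type: "interrupt" })); // hypothetical frame
    this.agentSpeaking = false; // server cancels in-flight TTS
    return true;
  }
}
```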

Audio config

Reasonable defaults are set; change them only if you know what you're doing.

json
"audio": {
  "sample_rate":       24000,        // 16000 for phone, 24000 for web, 48000 for pro
  "channels":          1,
  "vad":               true,         // voice activity detection
  "vad_silence_ms":    400,          // end-of-utterance silence
  "noise_suppression": true,
  "echo_cancellation": true
}

Limits

json
"limits": {
  "max_duration_secs": 1800,         // kill after 30 min
  "max_silence_secs":   180          // kill after 3 min of silence
}

Org-wide caps via env:

env
VOICE_MAX_CONCURRENT_SESSIONS_PER_ORG=3
VOICE_MAX_MINUTES_PER_DAY=240

Telephony (SIP / Twilio)

Out of scope for this release. To bridge a phone call to Agentcy, use Twilio Media Streams to a small proxy that maps audio frames to our WebSocket frame format. Reference proxy lives in deployments/twilio-bridge/ (not shipped by default).

Cost

Voice is substantially more expensive per minute than text chat because of STT + TTS + LLM. Rough: $0.05–$0.15/minute depending on providers and verbosity. The usage endpoint reports minutes and cost:

bash
curl "http://…/voice/usage?days=7" -H "authorization: Bearer $TOKEN" | jq
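For budgeting, a back-of-envelope estimate from the minutes figure the usage endpoint reports, using the rough $0.05–$0.15/min range above (the helper and default rate are ours, not an API):

```typescript
// Rough cost estimate: minutes × assumed per-minute rate, rounded to cents.
function estimateVoiceCostUsd(minutes: number, perMinuteUsd = 0.10): number {
  return Math.round(minutes * perMinuteUsd * 100) / 100;
}
```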

Gotchas

  • Browsers need a user gesture to start playback. The reference client creates the AudioContext on a click handler; otherwise iOS blocks audio.
  • Mobile networks drop WebSockets under flaky conditions. The client must reconnect with ?resume=<last_frame_id> — session state survives up to 60 seconds of disconnect.
  • STT languages vary. Deepgram Nova-2 is English-first; set stt_language for others.

Built by AgentcyLabs. For in-house deployment or Agentcy Cloud (PaaS) access, visit agentcylabs.com.