
Voice

Voice turns Agentcy into a spoken interface. You speak; STT transcribes; the agent loop runs; TTS synthesizes a reply; everything streams over WebRTC or a long-lived WebSocket.

Feature-gated behind AGENTCY_FEATURES_VOICE=true. Route group: /api/v1/voice/*.

Providers

Role | Provider        | Default?
-----|-----------------|---------
STT  | Deepgram Nova-2 | yes
STT  | Google Cloud    |
STT  | Azure           |
TTS  | ElevenLabs      | yes
TTS  | Deepgram Aura   |
TTS  | Azure           |

Pick via env:

env
AGENTCY_FEATURES_VOICE=true
VOICE_STT_PROVIDER=deepgram
VOICE_TTS_PROVIDER=elevenlabs
DEEPGRAM_API_KEY=…
ELEVENLABS_API_KEY=…
ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL

Start a session

bash
curl -X POST http://localhost:8080/api/v1/voice/sessions \
  -H "authorization: Bearer $TOKEN" -H 'content-type: application/json' \
  -d '{
    "realm":"infrastructure",
    "voice_id":"EXAVITQu4vr4xnSDxMaL",
    "audio":{
      "sample_rate": 24000,
      "vad":         true,
      "noise_suppression": true
    },
    "limits":{"max_duration_secs": 1800}
  }'

Response:

json
{
  "session_id":"vs_01HABC…",
  "ws_url":"wss://your-agentcy/voice/sessions/vs_…/ws?token=…"
}

Connect the browser to ws_url and send audio frames (PCM16LE at the declared sample rate). Received frames are TTS audio; you play them back.
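A minimal sketch of the frame encoding, assuming you capture microphone audio via the Web Audio API as Float32 samples in [-1, 1]; the helper name is ours — the reference client in frontend/lib/voice/ is authoritative.

```typescript
// Convert Web Audio Float32 samples to the PCM16LE bytes the
// voice WebSocket expects (hypothetical helper, not the shipped client).
function floatToPcm16le(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return out;
}
```

Feed each encoded chunk to `ws.send(...)` as a binary frame; the declared `sample_rate` in the session config must match what your AudioContext actually produces.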

The browser stack in frontend/lib/voice/ implements the full client — use that as the reference implementation.

Message model

Voice inherits the chat agent loop. Every turn:

  1. User audio → STT partials → interim transcript shown in the UI.
  2. VAD detects end-of-utterance → finalize transcript.
  3. Transcript goes to the agent loop like a regular chat message.
  4. Agent response content deltas → TTS → audio frames back.
  5. Tool calls and approvals are raised as banners in the UI; speech pauses until they're resolved.
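The turn above can be sketched as a client-side event reducer. The event names ("stt_partial", "stt_final", "tts_audio", "approval_required") are illustrative assumptions — check the reference client in frontend/lib/voice/ for the real wire format.

```typescript
// Illustrative server→client events for one voice turn.
type VoiceEvent =
  | { type: "stt_partial"; text: string }        // interim transcript for the UI
  | { type: "stt_final"; text: string }          // finalized utterance → agent loop
  | { type: "tts_audio"; frame: Uint8Array }     // synthesized audio to play back
  | { type: "approval_required"; tool: string }; // tool approval banner

interface TurnState {
  interim: string;    // live caption shown while the user speaks
  finals: string[];   // finalized transcripts sent to the agent
  paused: boolean;    // speech paused pending approval
}

function reduceTurn(state: TurnState, ev: VoiceEvent): TurnState {
  switch (ev.type) {
    case "stt_partial":
      return { ...state, interim: ev.text };
    case "stt_final":
      return { interim: "", finals: [...state.finals, ev.text], paused: state.paused };
    case "tts_audio":
      return state; // hand the frame to the audio playback queue
    case "approval_required":
      return { ...state, paused: true }; // banner shown; speech pauses until resolved
  }
}
```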

Interruptions (barge-in)

When the agent is speaking and the user starts speaking again:

  • VAD fires on user-start.
  • Client sends interrupt frame.
  • Server cancels the in-flight TTS, waits for the user's utterance, and processes it.

Barge-in works well with Deepgram + ElevenLabs turbo; latency is ~300 ms end-to-end on a decent link.
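The client side of this flow can be sketched as a small controller; the `{ type: "interrupt" }` frame shape is an assumption drawn from the flow above, not a documented wire format.

```typescript
// Barge-in sketch: send an interrupt frame only when the agent is
// actually speaking when VAD fires on user speech onset.
class BargeInController {
  private agentSpeaking = false;

  constructor(private send: (msg: string) => void) {}

  onTtsStart() { this.agentSpeaking = true; }
  onTtsDone()  { this.agentSpeaking = false; }

  // Called by VAD on user-start; returns true if an interrupt was sent.
  onUserStart(): boolean {
    if (!this.agentSpeaking) return false;
    this.send(JSON.stringify({ type: "interrupt" })); // hypothetical frame
    this.agentSpeaking = false; // server cancels in-flight TTS
    return true;
  }
}
```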

Audio config

Reasonable defaults are set; change them only if you know what you're doing.

json
"audio": {
  "sample_rate":       24000,        // 16000 for phone, 24000 for web, 48000 for pro
  "channels":          1,
  "vad":               true,         // voice activity detection
  "vad_silence_ms":    400,          // end-of-utterance silence
  "noise_suppression": true,
  "echo_cancellation": true
}

Limits

json
"limits": {
  "max_duration_secs": 1800,         // kill after 30 min
  "max_silence_secs":   180          // kill after 3 min of silence
}

Org-wide caps via env:

env
VOICE_MAX_CONCURRENT_SESSIONS_PER_ORG=3
VOICE_MAX_MINUTES_PER_DAY=240

Telephony (SIP / Twilio)

Out of scope for this release. To bridge a phone call to Agentcy, use Twilio Media Streams to a small proxy that maps audio frames to our WebSocket frame format. Reference proxy lives in deployments/twilio-bridge/ (not shipped by default).

Cost

Voice is substantially more expensive per minute than text chat because of STT + TTS + LLM. Rough: $0.05–$0.15/minute depending on providers and verbosity. The usage endpoint reports minutes and cost:

bash
curl "http://…/voice/usage?days=7" -H "authorization: Bearer $TOKEN" | jq
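For budgeting, a back-of-envelope estimate from the minutes figure the usage endpoint reports, using the rough $0.05–$0.15/min range above (the helper and default rate are ours, not an API):

```typescript
// Rough cost estimate: minutes × assumed per-minute rate, rounded to cents.
function estimateVoiceCostUsd(minutes: number, perMinuteUsd = 0.10): number {
  return Math.round(minutes * perMinuteUsd * 100) / 100;
}
```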

Gotchas

  • Browsers need a user gesture to start playback. The reference client creates the AudioContext on a click handler; otherwise iOS blocks audio.
  • Mobile networks drop WebSockets under flaky conditions. The client must reconnect with ?resume=<last_frame_id> — session state survives up to 60 seconds of disconnect.
  • STT languages vary. Deepgram Nova-2 is English-first; set stt_language for others.

Built by AgentcyLabs. For in-house deployment or Agentcy Cloud (PaaS) access, visit agentcylabs.com.