How Ortavox achieves sub-600ms AI voice latency
Building a voice AI agent that feels human requires solving a latency problem most platforms ignore. Here is a full technical walkthrough of the Ortavox pipeline — from the moment a user speaks to the moment your agent responds.
The 600ms problem
Human conversation operates on a turn-taking rhythm, with gaps of roughly 200–400ms between speakers. Responses delayed much beyond 800ms begin to feel robotic. Most voice AI platforms chain sequential services: wait for speech to end, send audio to STT, get a transcript, send it to the LLM, wait for the full response, send that to TTS, then stream audio. This waterfall easily exceeds 2,000ms.
Ortavox breaks this waterfall by parallelizing and overlapping every stage of the pipeline, achieving a p50 end-to-end latency under 600ms from end-of-user-speech to first audio byte.
Stage 1 — Voice Activity Detection (VAD) at the edge
The pipeline starts before the user finishes speaking. Ortavox runs a lightweight VAD model directly in the WebSocket connection layer. VAD operates on 20ms audio frames and detects the precise moment speech ends — not with a fixed silence timeout, but with a probabilistic end-of-utterance score.
Why this matters: Traditional silence-based VAD adds 500–1,500ms of unnecessary padding. Ortavox's edge VAD triggers transcription within 50ms of actual speech end, eliminating this dead time entirely.
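The end-of-utterance logic described above can be sketched as a small scorer over per-frame VAD probabilities. The threshold, window size, and class names here are illustrative assumptions, not Ortavox internals:

```python
from collections import deque

FRAME_MS = 20  # VAD frame size used in the text above

class EndOfUtteranceDetector:
    """Turns per-frame speech probabilities into a probabilistic
    end-of-utterance decision instead of a fixed silence timeout."""

    def __init__(self, threshold: float = 0.85, window_frames: int = 10):
        self.threshold = threshold                  # confidence needed to fire
        self.window = deque(maxlen=window_frames)   # last 200ms of frame scores
        self.in_speech = False

    def feed(self, speech_prob: float) -> bool:
        """speech_prob is the VAD output for one 20ms frame (1.0 = speech).
        Returns True exactly once, when the utterance is judged finished."""
        self.window.append(speech_prob)
        if speech_prob > 0.5:
            self.in_speech = True
            return False
        if not self.in_speech or len(self.window) < self.window.maxlen:
            return False
        # End-of-utterance score: mean "silence" over the recent window.
        eou_score = 1.0 - sum(self.window) / len(self.window)
        if eou_score >= self.threshold:
            self.in_speech = False
            self.window.clear()
            return True
        return False
```

With these example parameters the detector fires roughly 180ms into silence; tightening the threshold or shrinking the window trades false triggers against added latency.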
VAD also powers interruption handling. If new speech is detected while the agent is playing audio, Ortavox immediately halts TTS playback, clears the audio buffer, and re-enters the listening state — all within 300ms.
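A minimal sketch of that barge-in path, assuming a player object with a `stop()` method and a clearable audio buffer (all names here are hypothetical, not Ortavox's API):

```python
from enum import Enum, auto

class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class InterruptionHandler:
    """On speech detected during playback: halt TTS, flush unplayed
    audio, and fall back to the listening state."""

    def __init__(self, tts_player, audio_buffer):
        self.tts_player = tts_player      # assumed to expose .stop()
        self.audio_buffer = audio_buffer  # assumed to expose .clear()
        self.state = AgentState.LISTENING

    def on_vad_frame(self, is_speech: bool) -> None:
        if is_speech and self.state is AgentState.SPEAKING:
            self.tts_player.stop()     # halt playback immediately
            self.audio_buffer.clear()  # drop queued, unplayed chunks
            self.state = AgentState.LISTENING
```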
Stage 2 — Streaming Speech-to-Text (STT)
As soon as VAD signals end-of-utterance, buffered audio is sent to the STT provider. Ortavox supports Deepgram Nova-3, OpenAI Whisper / gpt-4o-mini-transcribe, Google Chirp 3, Groq Whisper, Cartesia Ink, Gladia Solaria, Speechmatics, and MistralAI Voxtral in streaming mode. Deepgram Nova-3 is the default because it returns interim transcripts in real time, so Ortavox can begin feeding text to the LLM before transcription is complete.
This streaming overlap between STT and LLM inference typically saves 80–150ms per turn compared to waiting for a final transcript.
# Supported STT providers
deepgram: nova-3, nova-2-phonecall, nova-2-medical, nova-2-conversationalai
openai: gpt-4o-mini-transcribe, whisper-1, gpt-realtime
google: chirp_3, chirp_2, telephony
groq: whisper-large-v3-turbo, distil-whisper-large-v3-en
mistralai: voxtral-small-latest, voxtral-mini-latest
cartesia: ink-whisper
gladia: solaria-1
speechmatics: enhanced, standard
azure: cognitive-services/speech
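The interim-transcript overlap can be sketched as follows. The event shape and the rule of holding back only the last, still-revisable word are illustrative assumptions, not any provider's actual schema; real interim results can also revise earlier words, which this simplified sketch ignores:

```python
async def overlap_stt_with_llm(stt_events, send_to_llm):
    """Forward stable words to the LLM as interim transcripts arrive,
    holding back the last word of each interim result because the STT
    engine may still revise it."""
    sent = 0  # characters already forwarded to the LLM
    async for event in stt_events:
        text = event["text"]
        if event["is_final"]:
            if len(text) > sent:
                await send_to_llm(text[sent:])  # flush the remainder
            break
        # Treat everything except the trailing word as stable.
        stable = text.rsplit(" ", 1)[0] if " " in text else ""
        if len(stable) > sent:
            await send_to_llm(stable[sent:])
            sent = len(stable)
```

By the time the final transcript lands, most of the user's words are already in front of the LLM, which is where the 80–150ms saving comes from.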
Stage 3 — LLM Inference with streaming tokens
The transcript (or partial transcript) is combined with the conversation history and system prompt, then sent to the configured LLM. Ortavox supports OpenAI GPT-4.1 / GPT-4o / GPT-5, Anthropic Claude Sonnet 4 / Opus 4, Google Gemini 2.5 / 3, Groq Llama 4, and MistralAI.
Critically, Ortavox does not wait for a complete LLM response before starting TTS. It monitors the token stream for a natural sentence boundary — a period, question mark, or semantic pause — then immediately forwards those tokens to the TTS engine. This LLM-to-TTS pipeline overlap typically saves 200–400ms per turn.
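A minimal sketch of that sentence-boundary chunking; the punctuation regex here is an illustrative stand-in for whatever semantic-pause detection Ortavox actually uses:

```python
import re

# Illustrative boundary: sentence punctuation followed by whitespace.
SENTENCE_END = re.compile(r"[.!?;]\s")

def chunk_for_tts(token_stream):
    """Accumulate streamed LLM tokens and yield each completed sentence
    to TTS as soon as its boundary appears, instead of waiting for the
    full response."""
    buf = ""
    for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end()].strip()  # ship the finished sentence
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever trails the last boundary
```

The first yielded sentence can be in the TTS engine while the LLM is still generating the second, which is the overlap that saves the 200–400ms cited above.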
Function calling is fully supported. Define JSON tools in your agent config; Ortavox intercepts function-call tokens, executes your endpoint synchronously, injects the result, and resumes generation — all transparently within the same turn.
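A tool definition in an agent config might look like the following; the exact keys are illustrative, shown here in the common OpenAI-style function-calling schema rather than Ortavox's confirmed format:

```python
# Hypothetical tool definition (keys follow the common OpenAI-style
# function-calling schema; the tool name and fields are illustrative).
check_availability_tool = {
    "name": "check_availability",
    "description": "Check open reservation slots for a given date",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO date, e.g. 2026-03-14"},
            "party_size": {"type": "integer"},
        },
        "required": ["date"],
    },
}
```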
# Supported LLM providers
openai: gpt-4.1, gpt-4.1-mini, gpt-4o, gpt-4o-mini, gpt-5, gpt-realtime
anthropic: claude-opus-4, claude-sonnet-4, claude-3-7-sonnet, claude-3-5-sonnet
google: gemini-3-flash, gemini-2.5-pro, gemini-2.5-flash
groq: llama-4-maverick-17b, llama-4-scout-17b, llama-3.3-70b, kimi-k2
mistralai: mistral-large, mistral-medium, ministral-14b
Stage 4 — Text-to-Speech (TTS) synthesis
Sentence fragments from the LLM stream are forwarded to the TTS engine in real time. Ortavox supports Cartesia Sonic (official partner), ElevenLabs, OpenAI TTS, Google Gemini TTS, Groq PlayAI, MistralAI Voxtral, AWS Polly, Azure Neural Voices, and Kyutai (self-hosted). Each sentence fragment is synthesized as a separate audio chunk and queued for playback, so the first audio byte reaches the caller before the LLM has finished generating the full response.
Backpressure management: Ortavox tracks audio playback position and maintains a small buffer (1–2 sentences ahead). If the LLM generates faster than TTS can play, tokens are queued but not sent prematurely. If TTS outpaces LLM generation, the buffer drains gracefully into a silence frame rather than producing an audible glitch.
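The backpressure rule above can be sketched as a small buffer class (class and method names are hypothetical, not Ortavox's API; the frame size assumes 16-bit PCM at 8kHz):

```python
SILENCE_FRAME = b"\x00" * 320  # 20ms of 16-bit PCM at 8kHz

class PlaybackBuffer:
    """Keep at most `max_ahead` synthesized sentences queued ahead of
    playback; hold further sentences back rather than synthesizing early."""

    def __init__(self, max_ahead: int = 2):
        self.max_ahead = max_ahead
        self.queued = []   # synthesized audio chunks awaiting playback
        self.pending = []  # sentences not yet sent to TTS

    def on_sentence(self, sentence, synthesize):
        """Called per LLM sentence; `synthesize` stands in for the TTS call."""
        if len(self.queued) < self.max_ahead:
            self.queued.append(synthesize(sentence))
        else:
            self.pending.append(sentence)  # LLM is ahead: queue, don't send

    def on_chunk_played(self, synthesize):
        """Called when the player finishes a chunk; tops the buffer back up.
        Returns a silence frame when fully drained, never a glitch."""
        if self.queued:
            self.queued.pop(0)
        if self.pending and len(self.queued) < self.max_ahead:
            self.queued.append(synthesize(self.pending.pop(0)))
        if not self.queued and not self.pending:
            return SILENCE_FRAME
        return None
```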
# Supported TTS providers
cartesia: sonic-3, sonic-2 (official partner — 90ms latency)
elevenlabs: eleven_v3, eleven_flash_v2_5, eleven_turbo_v2_5, eleven_multilingual_v2
openai: gpt-4o-mini-tts, tts-1-hd, tts-1
google: gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts
groq: playai-tts, playai-tts-arabic
mistralai: voxtral-mini-tts-latest
aws: generative, long-form, neural, standard
azure: neural voices (500+ options)
kyutai: tts-1.6b-en_fr (self-hosted)
Stage 5 — WebSocket delivery and telephony
Audio is delivered to the caller via a persistent WebSocket connection. For web applications, you stream raw PCM or Opus audio directly. For telephony, Ortavox provides a Twilio Media Stream integration: point your Twilio phone number at the Ortavox WebSocket endpoint and the platform handles mu-law/linear PCM conversion, DTMF detection, and SIP signaling.
Enterprises can bring their own SIP trunk or Twilio account. Ortavox acts purely as the Voice AI layer — you retain full ownership of your telephone numbers and carrier relationships.
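Ortavox performs the mu-law conversion for you, but for intuition, decoding one G.711 mu-law byte (the 8kHz format Twilio Media Streams carries) into a 16-bit linear PCM sample looks roughly like this:

```python
def ulaw_to_pcm16(u: int) -> int:
    """Decode one 8-bit G.711 mu-law byte to a 16-bit linear PCM sample."""
    u = ~u & 0xFF                 # mu-law bytes are transmitted complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = ((mantissa << 3) + 0x84) << exponent  # 0x84 is the G.711 bias
    pcm = magnitude - 0x84
    return -pcm if sign else pcm
```

The logarithmic segments give telephone audio its characteristic 8-bit-per-sample bandwidth while preserving dynamic range near silence.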
End-to-end latency breakdown
The following is a representative breakdown of latency contributions for a typical turn using GPT-4o + Deepgram Nova-2 + ElevenLabs Turbo v2.5:
| Stage | p50 | p95 |
|---|---|---|
| VAD end-of-speech detection | ~20ms | ~50ms |
| STT transcription (streaming) | ~80ms | ~150ms |
| LLM first token (TTFT) | ~220ms | ~380ms |
| TTS first audio byte | ~180ms | ~250ms |
| Network + buffer overhead | ~40ms | ~80ms |
| Total (first audio to caller) | ~540ms | ~910ms |
Measured from end-of-user-speech to first audio byte delivered to the caller. Because pipeline stages overlap, each per-stage figure represents that stage's contribution to the critical path rather than its full wall-clock duration; end-to-end percentiles are measured directly, not derived by summing rows.
Security and compliance
Ortavox is SOC 2 Type II certified. All audio and transcript data is encrypted in transit (TLS 1.3) and at rest (AES-256). Call recordings are deleted after 30 days by default and can be disabled entirely.
For HIPAA-regulated workloads (healthcare, patient communications), Ortavox Enterprise includes a Business Associate Agreement (BAA) and data residency options in US, EU, and APAC regions.
Getting started
1. Get your API key: Sign up for a free Hobby account (100 min/month, no credit card). Your API key is available instantly in the dashboard.
2. Configure your agent: Choose your LLM, STT provider, and TTS voice. Set your system prompt and optionally define function-calling tools via JSON config.
3. Connect via WebSocket or Twilio: Stream audio over our WebSocket API or point a Twilio Media Stream at your Ortavox endpoint. The platform handles VAD, turn-taking, and interruptions automatically.
4. Go live: Your agent handles real calls with sub-600ms response latency. Monitor call transcripts, latency metrics, and usage in the dashboard.
Ready to build?
Start with 100 free minutes. No credit card required. First call in under 5 minutes.