How Ortavox achieves sub-600ms AI voice latency

Building a voice AI agent that feels human requires solving a latency problem most platforms ignore. Here is a full technical walkthrough of the Ortavox pipeline — from the moment a user speaks to the moment your agent responds.

The 600ms problem

Human conversation operates on a turn-taking rhythm with gaps of roughly 200–400ms between speakers. A response that arrives more than about 800ms after the user stops speaking feels robotic. Most voice AI platforms chain sequential services: wait for speech to end, send audio to STT, get a transcript, send it to the LLM, wait for the full response, send that to TTS, then stream audio. This waterfall easily exceeds 2,000ms.
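The arithmetic behind the waterfall problem can be sketched with illustrative stage budgets. The figures below are representative assumptions for a naive sequential pipeline and a pipelined one, not measurements of any specific provider:

```python
# Illustrative per-turn latency budgets in milliseconds (assumed values).
# A strict chain pays each stage's FULL duration, back to back.
SEQUENTIAL = {
    "silence timeout before STT": 700,
    "STT full transcript": 300,
    "LLM full response": 900,
    "TTS full synthesis": 400,
}

# An overlapped pipeline only pays each stage's time-to-first-output.
PIPELINED = {
    "VAD end-of-utterance": 50,
    "STT first tokens": 120,
    "LLM first sentence": 250,
    "TTS first audio byte": 150,
}

print(sum(SEQUENTIAL.values()))  # 2300 -- easily exceeds 2,000ms
print(sum(PIPELINED.values()))   # 570  -- in the sub-600ms range
```

The point of the comparison: overlapping stages does not make any single stage faster, it removes the dead time where downstream stages sit idle waiting for upstream completion.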

Ortavox breaks this waterfall by parallelizing and overlapping every stage of the pipeline, achieving a p50 end-to-end latency under 600ms from end-of-user-speech to first audio byte.

Stage 1 — Voice Activity Detection (VAD) at the edge

The pipeline starts before the user finishes speaking. Ortavox runs a lightweight VAD model directly in the WebSocket connection layer. VAD operates on 20ms audio frames and detects the precise moment speech ends — not with a fixed silence timeout, but with a probabilistic end-of-utterance score.

Why this matters: Traditional silence-based VAD adds 500–1,500ms of unnecessary padding. Ortavox's edge VAD triggers transcription within 50ms of actual speech end, eliminating this dead time entirely.
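A probabilistic end-of-utterance detector can be sketched as a small state machine over 20ms frames. The threshold and the number of confirming frames below are assumptions for illustration, not Ortavox's actual constants:

```python
FRAME_MS = 20          # VAD operates on 20ms audio frames
EOU_THRESHOLD = 0.85   # assumed probability cutoff, not an Ortavox constant
CONFIRM_FRAMES = 2     # assumed: require consecutive high-score frames

def end_of_utterance(eou_scores):
    """Given a stream of per-frame end-of-utterance probabilities,
    return the time in ms at which end-of-utterance is declared,
    or None if speech never ends in this window."""
    streak = 0
    for i, score in enumerate(eou_scores):
        streak = streak + 1 if score >= EOU_THRESHOLD else 0
        if streak >= CONFIRM_FRAMES:
            return (i + 1) * FRAME_MS
    return None
```

With two confirming frames, the decision lands 40ms after speech actually ends, consistent with triggering transcription within ~50ms rather than waiting out a fixed silence timeout.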

VAD also powers interruption handling. If new speech is detected while the agent is playing audio, Ortavox immediately halts TTS playback, clears the audio buffer, and re-enters the listening state — all within 300ms.
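The barge-in behavior described above reduces to a two-state machine: a minimal sketch, with class and method names invented for illustration:

```python
class TurnState:
    """Minimal barge-in state machine: when VAD detects user speech
    during agent playback, halt playback, drop queued audio, and
    return to listening."""
    LISTENING = "listening"
    SPEAKING = "speaking"

    def __init__(self):
        self.state = self.LISTENING
        self.audio_buffer = []   # TTS chunks queued for playback

    def start_playback(self, chunks):
        self.state = self.SPEAKING
        self.audio_buffer = list(chunks)

    def on_vad_speech(self):
        if self.state == self.SPEAKING:
            self.audio_buffer.clear()   # discard unplayed agent audio
            self.state = self.LISTENING
```

The critical detail is clearing the buffer, not just pausing playback: otherwise the agent resumes mid-sentence with stale audio after the interruption.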

Stage 2 — Streaming Speech-to-Text (STT)

As soon as VAD signals end-of-utterance, buffered audio is sent to the STT provider. Ortavox supports Deepgram Nova-3, OpenAI Whisper / gpt-4o-mini-transcribe, Google Chirp 3, Groq Whisper, Cartesia Ink, Gladia Solaria, Speechmatics, and MistralAI Voxtral in streaming mode. Deepgram Nova-3 is the default because it returns interim transcripts in real time — Ortavox begins feeding text to the LLM before transcription is 100% complete.

This streaming overlap between STT and LLM inference typically saves 80–150ms per turn compared to waiting for a final transcript.
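The overlap can be sketched as a consumer that forwards only the newly stabilized portion of each interim transcript. The per-result stability score below is an illustrative simplification; real providers expose interim/final flags or confidence values in their own formats:

```python
def stable_increments(interims, threshold=0.9):
    """Yield newly stabilized text from a stream of (text, stability)
    interim STT results, so the LLM can start consuming the utterance
    before the final transcript arrives."""
    emitted = 0
    for text, stability in interims:
        if stability >= threshold and len(text) > emitted:
            yield text[emitted:]     # forward only the unseen suffix
            emitted = len(text)
```

Each yielded fragment can be appended to the LLM prompt immediately, which is where the 80–150ms per-turn saving comes from.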

# Supported STT providers

deepgram: nova-3, nova-2-phonecall, nova-2-medical, nova-2-conversationalai

openai: gpt-4o-mini-transcribe, whisper-1, gpt-realtime

google: chirp_3, chirp_2, telephony

groq: whisper-large-v3-turbo, distil-whisper-large-v3-en

mistralai: voxtral-small-latest, voxtral-mini-latest

cartesia: ink-whisper

gladia: solaria-1

speechmatics: enhanced, standard

azure: cognitive-services/speech

Stage 3 — LLM Inference with streaming tokens

The transcript (or partial transcript) is combined with the conversation history and system prompt, then sent to the configured LLM. Ortavox supports OpenAI GPT-4.1 / GPT-4o / GPT-5, Anthropic Claude Sonnet 4 / Opus 4, Google Gemini 2.5 / 3, Groq Llama 4, and MistralAI.

Critically, Ortavox does not wait for a complete LLM response before starting TTS. It monitors the token stream for a natural sentence boundary — a period, question mark, or semantic pause — then immediately forwards those tokens to the TTS engine. This LLM-to-TTS pipeline overlap typically saves 200–400ms per turn.
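The sentence-boundary handoff can be sketched as a generator over the token stream. This version flushes on punctuation only; detecting semantic pauses, abbreviations, and decimals is left out for brevity:

```python
SENTENCE_END = {".", "?", "!", ";"}

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentence-sized chunks so TTS
    can start synthesizing before the full response is generated."""
    buf = []
    for tok in token_stream:
        buf.append(tok)
        if tok and tok.rstrip()[-1:] in SENTENCE_END:
            yield "".join(buf).strip()
            buf = []
    if buf:
        yield "".join(buf).strip()   # flush any trailing partial sentence
```

Each yielded chunk is handed to TTS immediately, so synthesis of sentence one runs concurrently with generation of sentence two.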

Function calling is fully supported. Define JSON tools in your agent config; Ortavox intercepts function-call tokens, executes your endpoint synchronously, injects the result, and resumes generation — all transparently within the same turn.
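A tool definition might look like the following. The shape shown is the widely used OpenAI-style function schema; the exact field names Ortavox expects in its agent config are an assumption here, as is the `book_appointment` tool itself:

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema.
book_appointment = {
    "type": "function",
    "function": {
        "name": "book_appointment",
        "description": "Book a time slot for the caller.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO 8601 date"},
                "time": {"type": "string", "description": "24h HH:MM"},
            },
            "required": ["date", "time"],
        },
    },
}

print(json.dumps(book_appointment, indent=2))
```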

# Supported LLM providers

openai: gpt-4.1, gpt-4.1-mini, gpt-4o, gpt-4o-mini, gpt-5, gpt-realtime

anthropic: claude-opus-4, claude-sonnet-4, claude-3-7-sonnet, claude-3-5-sonnet

google: gemini-3-flash, gemini-2.5-pro, gemini-2.5-flash

groq: llama-4-maverick-17b, llama-4-scout-17b, llama-3.3-70b, kimi-k2

mistralai: mistral-large, mistral-medium, ministral-14b

Stage 4 — Text-to-Speech (TTS) synthesis

Sentence fragments from the LLM stream are forwarded to the TTS engine in real time. Ortavox supports Cartesia Sonic (official partner), ElevenLabs, OpenAI TTS, Google Gemini TTS, Groq PlayAI, MistralAI Voxtral, AWS Polly, Azure Neural Voices, and Kyutai (self-hosted). Each sentence fragment is synthesized as a separate audio chunk and queued for playback, so the first audio byte reaches the caller before the LLM has finished generating the full response.

Backpressure management: Ortavox tracks audio playback position and maintains a small buffer (1–2 sentences ahead). If the LLM generates faster than TTS can play, tokens are queued but not sent prematurely. If TTS is faster than LLM generation, the buffer runs down gracefully to a silence frame rather than producing audible glitches.
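The backpressure scheme can be sketched with two queues: sentences waiting for synthesis, and synthesized chunks waiting for playback. The class and its 2-sentence lookahead limit are illustrative, matching the 1–2 sentence buffer described above:

```python
from collections import deque

MAX_AHEAD = 2   # keep at most 1-2 synthesized sentences ahead of playback

class TtsBuffer:
    """Hold finished sentences until the playback queue has room, so a
    fast LLM cannot flood the TTS engine or the caller's audio buffer."""
    def __init__(self):
        self.pending = deque()    # sentences not yet sent to TTS
        self.in_flight = deque()  # synthesized chunks queued for playback

    def on_sentence(self, sentence):
        self.pending.append(sentence)
        self._fill()

    def on_chunk_played(self):
        if self.in_flight:
            self.in_flight.popleft()
        self._fill()              # playback freed a slot; top up

    def _fill(self):
        while self.pending and len(self.in_flight) < MAX_AHEAD:
            # stand-in for a real TTS call; we queue the text itself
            self.in_flight.append(self.pending.popleft())
```

Sentences beyond the lookahead window stay in `pending` until playback catches up, which is what keeps interruption handling cheap: there is never more than a couple of sentences of committed audio to discard.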

# Supported TTS providers

cartesia: sonic-3, sonic-2 (official partner — 90ms latency)

elevenlabs: eleven_v3, eleven_flash_v2_5, eleven_turbo_v2_5, eleven_multilingual_v2

openai: gpt-4o-mini-tts, tts-1-hd, tts-1

google: gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts

groq: playai-tts, playai-tts-arabic

mistralai: voxtral-mini-tts-latest

aws: generative, long-form, neural, standard

azure: neural voices (500+ options)

kyutai: tts-1.6b-en_fr (self-hosted)

Stage 5 — WebSocket delivery and telephony

Audio is delivered to the caller via a persistent WebSocket connection. For web applications, you stream raw PCM or Opus audio directly. For telephony, Ortavox provides a Twilio Media Stream integration: point your Twilio phone number at the Ortavox WebSocket endpoint and the platform handles mu-law/linear PCM conversion, DTMF detection, and SIP signaling.
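The mu-law/linear PCM conversion mentioned above is standard G.711 decoding: Twilio Media Streams carry 8kHz mu-law audio, one byte per sample, which must be expanded to 16-bit linear PCM for the rest of the pipeline. A minimal per-sample decoder:

```python
def ulaw_to_pcm16(sample):
    """Decode one G.711 mu-law byte to a 16-bit linear PCM sample."""
    u = ~sample & 0xFF              # mu-law bytes are stored complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude
```

A whole Twilio frame decodes with `[ulaw_to_pcm16(b) for b in payload]`; the reverse (linear to mu-law) is needed on the outbound TTS leg.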

Enterprises can bring their own SIP trunk or Twilio account. Ortavox acts purely as the Voice AI layer — you retain full ownership of your telephone numbers and carrier relationships.

End-to-end latency breakdown

The following is a representative breakdown of latency contributions for a typical turn using GPT-4o + Deepgram Nova-2 + ElevenLabs Turbo v2.5:

Stage                              p50       p95
VAD end-of-speech detection        ~20ms     ~50ms
STT transcription (streaming)      ~80ms     ~150ms
LLM first token (TTFT)             ~220ms    ~380ms
TTS first audio byte               ~180ms    ~250ms
Network + buffer overhead          ~40ms     ~80ms
Total (first audio to caller)      ~540ms    ~910ms

Measured from end-of-user-speech to first audio byte delivered to the caller. Because pipeline stages overlap, each figure is the stage's marginal contribution to end-to-end latency rather than its standalone duration.

Security and compliance

Ortavox is SOC 2 Type II certified. All audio and transcript data is encrypted in transit (TLS 1.3) and at rest (AES-256). Call recordings are deleted after 30 days by default and can be disabled entirely.

For HIPAA-regulated workloads (healthcare, patient communications), Ortavox Enterprise includes a Business Associate Agreement (BAA) and data residency options in US, EU, and APAC regions.

Getting started

  1. Get your API key

     Sign up for a free Hobby account (100 min/month, no credit card). Your API key is available instantly in the dashboard.

  2. Configure your agent

     Choose your LLM, STT provider, and TTS voice. Set your system prompt and optionally define function-calling tools via JSON config.

  3. Connect via WebSocket or Twilio

     Stream audio over our WebSocket API or point a Twilio Media Stream at your Ortavox endpoint. The platform handles VAD, turn-taking, and interruptions automatically.

  4. Go live

     Your agent handles real calls with sub-600ms response latency. Monitor call transcripts, latency metrics, and usage in the dashboard.
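Steps 2 and 3 above amount to a small configuration document. The sketch below is hypothetical: the field names are assumptions, not the actual Ortavox schema, though the model identifiers come from the supported-provider lists earlier in this article:

```python
import json

# Hypothetical agent configuration; field names are illustrative only.
agent_config = {
    "llm": {"provider": "openai", "model": "gpt-4o"},
    "stt": {"provider": "deepgram", "model": "nova-3"},
    "tts": {"provider": "cartesia", "model": "sonic-3"},
    "system_prompt": "You are a friendly scheduling assistant.",
    "tools": [],   # optional function-calling tool definitions
}

print(json.dumps(agent_config, indent=2))
```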

Ready to build?

Start with 100 free minutes. No credit card required. First call in under 5 minutes.