Engineering · Latency · Voice AI · Architecture

How to Achieve Sub-600ms AI Voice Latency: A Technical Deep Dive

March 15, 2025 · 9 min read · Ortavox Engineering

Most voice AI platforms take more than 2,000ms to respond. Here is exactly how Ortavox achieves sub-600ms end-to-end latency through parallel pipeline processing, probabilistic VAD, streaming LLM inference, and overlapped TTS synthesis.

Human conversation has a natural rhythm. The gap between when one person stops speaking and the other starts is typically 200–400ms. When an AI voice agent takes longer than 800ms to respond, the conversation feels robotic. When it takes 2,000ms or more — which is common on most platforms — it actively damages user trust.

At Ortavox, we obsess over this number. Our p50 end-to-end latency is under 600ms, measured from the moment the user stops speaking to the moment the first audio byte reaches the caller. Here is exactly how we achieve it.
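The measurement boundary matters: the clock starts at VAD end-of-speech, not at silence-timeout expiry, and stops at the first outbound audio byte. A minimal sketch of that instrumentation using Python's monotonic clock (the class and method names are illustrative, not our internals):

```python
import time

class TurnTimer:
    """Times one conversational turn: end-of-speech to first audio byte."""

    def mark_end_of_speech(self) -> None:
        # Stamp the moment VAD declares the user's utterance finished.
        self._eos = time.monotonic()

    def mark_first_audio_byte(self) -> float:
        # Called when the first synthesized chunk is written to the
        # caller's socket; returns end-to-end latency in milliseconds.
        return (time.monotonic() - self._eos) * 1000.0
```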

Why most platforms are slow

The typical voice AI pipeline is sequential: wait for speech to end → send to STT → get full transcript → send to LLM → wait for complete response → send to TTS → wait for audio → stream to caller. This waterfall easily exceeds 2,000ms even with fast providers.

| Stage | Sequential (typical) | Ortavox (parallel) |
|---|---|---|
| VAD end-of-speech | 500–1,500ms (silence timeout) | ~20ms (probabilistic) |
| STT transcription | 200–400ms (wait for final) | ~80ms (streaming) |
| LLM first token | 300–500ms | ~220ms |
| TTS first audio | 300–500ms (wait for full text) | ~180ms (streaming) |
| Total | 1,300–2,900ms | ~500–600ms |
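To make the table concrete, here is a toy sketch of the overlapped pipeline shape in Python asyncio. The stage bodies simulate streaming providers with sleeps rather than calling real APIs; the structural point is that stages are connected by queues and run concurrently, so TTS starts on the first sentence while the LLM is still generating the second:

```python
import asyncio

SENTINEL = None  # end-of-stream marker passed between stages

async def stt_stage(words: asyncio.Queue) -> None:
    # Simulated streaming STT: interim words arrive every 40ms.
    for word in "what is my account balance".split():
        await asyncio.sleep(0.04)
        await words.put(word)
    await words.put(SENTINEL)

async def llm_stage(words: asyncio.Queue, tokens: asyncio.Queue) -> None:
    # Consume interim words as they arrive (prefill overlaps with speech),
    # then stream response tokens downstream one at a time.
    while (word := await words.get()) is not SENTINEL:
        pass  # in a real system: append word to the model's context
    for token in "Your balance is $42. Anything else?".split():
        await asyncio.sleep(0.03)  # simulated inter-token latency
        await tokens.put(token + " ")
    await tokens.put(SENTINEL)

async def tts_stage(tokens: asyncio.Queue, audio: asyncio.Queue) -> None:
    # Synthesize each sentence the moment its boundary appears.
    sentence = ""
    while (token := await tokens.get()) is not SENTINEL:
        sentence += token
        if token.rstrip().endswith((".", "!", "?")):
            await audio.put(f"<audio: {sentence.strip()}>")
            sentence = ""
    if sentence.strip():
        await audio.put(f"<audio: {sentence.strip()}>")
    await audio.put(SENTINEL)

async def main() -> None:
    words, tokens, audio = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stt_stage(words)),
        asyncio.create_task(llm_stage(words, tokens)),
        asyncio.create_task(tts_stage(tokens, audio)),
    ]
    # The first chunk prints while the LLM is still generating the rest.
    while (chunk := await audio.get()) is not SENTINEL:
        print(chunk)
    await asyncio.gather(*tasks)

asyncio.run(main())
```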

The four optimizations that matter

1. Probabilistic VAD instead of silence timeouts

Traditional VAD waits for 500–1,500ms of silence before declaring end-of-speech. Ortavox runs a lightweight neural VAD model on every 20ms audio frame, producing a probability score for end-of-utterance. When the score crosses a threshold, transcription is triggered immediately — no silence timeout. This eliminates 400–1,000ms from every single turn.
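A minimal sketch of the trigger logic, assuming a model that scores each 20ms frame. The scoring function here is an energy-based stand-in (the article does not name the actual model), and the threshold value is illustrative:

```python
import numpy as np

FRAME_MS = 20          # one end-of-utterance decision per 20ms frame
EOU_THRESHOLD = 0.90   # illustrative threshold, not a tuned value

def eou_probability(frame: np.ndarray) -> float:
    # Stand-in scorer: low frame energy -> high end-of-utterance score.
    # A real implementation runs a lightweight neural VAD here.
    return float(np.clip(1.0 - 50.0 * np.abs(frame).mean(), 0.0, 1.0))

def first_eou_ms(frames) -> int | None:
    """Return how many ms into the stream end-of-utterance fired,
    triggering transcription immediately -- no silence timeout."""
    for i, frame in enumerate(frames):
        if eou_probability(frame) >= EOU_THRESHOLD:
            return (i + 1) * FRAME_MS
    return None
```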

2. Streaming STT with immediate LLM forwarding

Rather than waiting for a final transcript, Ortavox forwards interim transcripts to the LLM as they arrive from Deepgram Nova-2. By the time the final transcript arrives, the LLM has already processed 60–80% of the input and is generating tokens. This overlap saves 80–150ms per turn.
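In outline, the forwarding loop looks something like the sketch below. The `stt_results` stream and `llm_session` interface are hypothetical stand-ins rather than a real provider SDK, and the sketch assumes interim transcripts grow monotonically, which real interims do not always do:

```python
async def forward_interims(stt_results, llm_session) -> None:
    """Feed interim transcript text to the LLM as it stabilizes, so the
    model has prefilled most of the input before the final arrives."""
    forwarded = ""
    async for text, is_final in stt_results:  # (transcript, finality) pairs
        if text.startswith(forwarded) and len(text) > len(forwarded):
            # Send only the new suffix; the LLM already holds the rest.
            await llm_session.append_input(text[len(forwarded):])
            forwarded = text
        if is_final:
            # Final transcript confirmed: start token generation now.
            await llm_session.generate()
            return
```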

3. Sentence-boundary TTS triggering

Ortavox monitors the LLM token stream for natural sentence boundaries. The moment a complete sentence (typically 15–40 tokens) is available, it is forwarded to TTS for synthesis. The first audio chunk reaches the caller before the LLM has finished generating the full response. This overlap saves 200–400ms per turn.
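A sketch of the boundary detector; the punctuation regex and the minimum chunk length are illustrative assumptions, not the exact production rules:

```python
import re

BOUNDARY = re.compile(r'[.!?]["\')\]]?\s')  # sentence-ending punctuation
MIN_CHARS = 12  # assumed floor so tiny fragments aren't synthesized alone

def sentence_chunks(tokens):
    """Yield TTS-ready text the moment a sentence boundary appears,
    rather than waiting for the full LLM response."""
    buf = ""
    for token in tokens:
        buf += token
        match = BOUNDARY.search(buf)
        if match and match.end() >= MIN_CHARS:
            yield buf[:match.end()].strip()
            buf = buf[match.end():]
    if buf.strip():
        yield buf.strip()  # flush the trailing partial sentence

# Each printed chunk would be handed to TTS immediately:
for chunk in sentence_chunks(["Your total ", "is $42. ", "Shall I ", "confirm? "]):
    print(chunk)
```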

4. WebSocket persistence and audio buffer management

HTTP connections add 50–200ms of overhead per request. Ortavox maintains persistent WebSocket connections for the entire call. Audio chunks from TTS are queued in a small ring buffer (1–2 sentences ahead) and streamed continuously.
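Sketched below: a bounded asyncio queue plays the role of the ring buffer, and `ws` is assumed to be a long-lived socket exposing an async `send` (as in the `websockets` library); the end-of-response sentinel is our own convention for illustration:

```python
import asyncio

LOOKAHEAD_SENTENCES = 2  # keep at most 1-2 sentences buffered ahead

async def audio_sender(ws, audio_chunks: asyncio.Queue) -> None:
    """Drain synthesized audio onto one persistent WebSocket. The bounded
    queue backpressures TTS so it never races far past an interruption."""
    while True:
        chunk = await audio_chunks.get()
        if chunk is None:  # end-of-response sentinel
            break
        # Same socket for the whole call: no per-request HTTP setup cost.
        await ws.send(chunk)

# The producer side creates the buffer as:
#   audio_chunks = asyncio.Queue(maxsize=LOOKAHEAD_SENTENCES)
# so TTS blocks on put() once the caller is two sentences behind.
```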

Measured results

| Metric | p50 | p95 | p99 |
|---|---|---|---|
| End-to-end latency (first audio byte) | 541ms | 892ms | 1,240ms |
| VAD end-of-speech detection | 18ms | 45ms | 82ms |
| STT to LLM first token (TTFT) | 298ms | 410ms | 580ms |
| LLM to TTS first audio | 183ms | 260ms | 390ms |
| Interruption halt latency | 94ms | 210ms | 340ms |
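Percentile summaries like these fall directly out of raw per-turn samples; a minimal sketch (the sample values below are made up for illustration, not our production data):

```python
import numpy as np

# Per-turn end-to-end latencies in ms, one sample per conversational turn.
turn_latency_ms = np.array([512, 498, 541, 530, 677, 559, 601, 892])
for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(turn_latency_ms, p):.0f}ms")
```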
