How to Achieve Sub-600ms AI Voice Latency: A Technical Deep Dive
Most voice AI platforms take more than 2,000ms to respond. Here is exactly how Ortavox achieves sub-600ms end-to-end latency through parallel pipeline processing, edge VAD, streaming LLM inference, and overlapped TTS synthesis.
Human conversation has a natural rhythm. The gap between when one person stops speaking and the other starts is typically 200–400ms. When an AI voice agent takes longer than 800ms to respond, the conversation feels robotic. When it takes 2,000ms or more — which is common on most platforms — it actively damages user trust.
At Ortavox, we obsess over this number. Our p50 end-to-end latency is under 600ms, measured from the moment the user stops speaking to the moment the first audio byte reaches the caller. Here is exactly how we achieve it.
Why most platforms are slow
The typical voice AI pipeline is sequential: wait for speech to end → send to STT → get full transcript → send to LLM → wait for complete response → send to TTS → wait for audio → stream to caller. This waterfall easily exceeds 2,000ms even with fast providers.
| Stage | Sequential (typical) | Ortavox (parallel) |
|---|---|---|
| VAD end-of-speech | 500–1,500ms (silence timeout) | ~20ms (probabilistic) |
| STT transcription | 200–400ms (wait for final) | ~80ms (streaming) |
| LLM first token | 300–500ms | ~220ms |
| TTS first audio | 300–500ms (wait for full text) | ~180ms (streaming) |
| Total | 1,300–2,900ms | ~500–600ms |
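The difference between the waterfall and the overlapped pipeline can be sketched with stub stage latencies. The numbers and the 30–50% overlap fractions below are purely illustrative stand-ins, not the measured figures from the table:

```python
import asyncio
import time

# Stub stage latencies in seconds (illustrative, not real measurements)
STT, LLM_FIRST_TOKEN, TTS_FIRST_AUDIO = 0.30, 0.30, 0.30

async def sequential() -> float:
    """Waterfall: each stage waits for the previous one to fully finish."""
    start = time.monotonic()
    await asyncio.sleep(STT)              # wait for the final transcript
    await asyncio.sleep(LLM_FIRST_TOKEN)  # wait for the complete response
    await asyncio.sleep(TTS_FIRST_AUDIO)  # wait for synthesized audio
    return time.monotonic() - start

async def overlapped() -> float:
    """Each stage starts on *partial* output from the previous one."""
    start = time.monotonic()
    await asyncio.sleep(STT * 0.3)              # first interim transcript
    await asyncio.sleep(LLM_FIRST_TOKEN * 0.5)  # first sentence of tokens
    await asyncio.sleep(TTS_FIRST_AUDIO * 0.5)  # first audio chunk
    return time.monotonic() - start

seq = asyncio.run(sequential())
ovl = asyncio.run(overlapped())
print(f"sequential: {seq:.2f}s  overlapped: {ovl:.2f}s")
```

Even with identical per-stage latencies, the overlapped path reaches first audio in a fraction of the waterfall's time, which is the whole thesis of the optimizations below.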
The four optimizations that matter
1. Probabilistic VAD instead of silence timeouts
Traditional VAD waits for 500–1,500ms of silence before declaring end-of-speech. Ortavox runs a lightweight neural VAD model on every 20ms audio frame, producing a probability score for end-of-utterance. When the score crosses a threshold, transcription is triggered immediately — no silence timeout. This eliminates 400–1,000ms from every single turn.
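A minimal sketch of the per-frame trigger loop. The `score_frame` callable stands in for the neural VAD model (not a real Ortavox API), and the threshold value is illustrative; real thresholds would be tuned per deployment:

```python
from typing import Callable, Iterable

FRAME_MS = 20
EOU_THRESHOLD = 0.85  # illustrative; tuned per deployment in practice

def detect_end_of_utterance(
    frames: Iterable[bytes],
    score_frame: Callable[[bytes], float],
) -> int:
    """Return the index of the first frame whose end-of-utterance
    probability crosses the threshold, or -1 if it never does.

    `score_frame` is a stand-in for a lightweight neural VAD model that
    maps one 20ms audio frame to P(utterance has ended)."""
    for i, frame in enumerate(frames):
        if score_frame(frame) >= EOU_THRESHOLD:
            return i  # trigger transcription immediately, no silence timeout
    return -1

# Toy scorer for demonstration: all-zero (silent) frames score high.
def toy_scorer(frame: bytes) -> float:
    return 0.95 if not any(frame) else 0.1

# 10 frames of "speech" followed by 3 frames of silence (320 bytes =
# 20ms of 16-bit mono audio at 8kHz)
frames = [b"\x01" * 320] * 10 + [b"\x00" * 320] * 3
print(detect_end_of_utterance(frames, toy_scorer))  # -> 10
```

The key property: the decision fires on the first frame that crosses the threshold, so detection cost is bounded by one frame interval rather than a fixed silence window.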
2. Streaming STT with immediate LLM forwarding
Rather than waiting for a final transcript, Ortavox forwards interim transcripts to the LLM as they arrive from Deepgram Nova-2. By the time the final transcript arrives, the LLM has already processed 60–80% of the input and is generating tokens. This overlap saves 80–150ms per turn.
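The overlap can be sketched as a consumer that feeds interim results to the LLM as they stream in. The transcript feed below is a stub standing in for a streaming STT WebSocket, and how the LLM ingests partial input (prompt updates, a streaming-input API) is provider-specific and assumed here:

```python
import asyncio

async def interim_transcripts():
    """Stub for a streaming STT feed; yields growing partial transcripts
    as (text, is_final) pairs, the shape interim results typically take."""
    for partial in ["book a", "book a table", "book a table for two"]:
        await asyncio.sleep(0.02)  # simulated network cadence
        yield partial, False
    yield "book a table for two", True  # final transcript confirms the interim

async def forward_to_llm():
    """Forward interim text so LLM prefill overlaps transcription.
    Appending to a list stands in for updating the in-flight prompt."""
    prefilled = []
    async for text, is_final in interim_transcripts():
        prefilled.append(text)  # LLM sees partial input before STT finalizes
        if is_final:
            return prefilled

chunks = asyncio.run(forward_to_llm())
print(chunks[-1])  # -> book a table for two
```

By the time the final result arrives, most of the input has already been pushed downstream, which is where the claimed 80–150ms of overlap comes from.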
3. Sentence-boundary TTS triggering
Ortavox monitors the LLM token stream for natural sentence boundaries. The moment a complete sentence fragment is available (typically 15–40 tokens), it is forwarded to TTS for synthesis. The first audio chunk reaches the caller before the LLM has finished generating the full response. This overlap saves 200–400ms per turn.
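A simple version of the boundary monitor: accumulate tokens and flush a chunk to TTS whenever the buffer ends at a sentence boundary. The regex here is a deliberately naive assumption; a production system would need to handle abbreviations, decimals, and similar mid-sentence periods:

```python
import re
from typing import Iterable, List

# Naive boundary: ., !, or ? at end of buffer, optionally followed by a
# closing quote/bracket. Real systems need smarter segmentation.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s*$')

def sentence_chunks(tokens: Iterable[str]) -> List[str]:
    """Accumulate LLM tokens and emit a chunk at each sentence boundary,
    so TTS synthesis can start before generation finishes."""
    chunks, buf = [], ""
    for tok in tokens:
        buf += tok
        if SENTENCE_END.search(buf):
            chunks.append(buf.strip())  # forward to TTS immediately
            buf = ""
    if buf.strip():
        chunks.append(buf.strip())      # flush any trailing partial sentence
    return chunks

tokens = ["Your", " table", " is", " booked", ".", " See", " you", " at", " 7", "."]
print(sentence_chunks(tokens))
# -> ['Your table is booked.', 'See you at 7.']
```

The first chunk leaves for TTS after five tokens, while the LLM is still generating the second sentence, which is exactly the overlap the section describes.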
4. WebSocket persistence and audio buffer management
Setting up a fresh HTTP connection adds 50–200ms of overhead per request (DNS lookup, TCP handshake, TLS negotiation). Ortavox instead maintains persistent WebSocket connections for the entire call. Audio chunks from TTS are queued in a small ring buffer (1–2 sentences ahead) and streamed continuously to the caller.
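A minimal sketch of the buffer between TTS output and the caller-facing WebSocket. The capacity and method names are illustrative, not Ortavox's actual implementation; the `clear` method also hints at why a small buffer helps with fast interruption handling:

```python
from collections import deque
from typing import Optional

class AudioRingBuffer:
    """Bounded buffer holding a few synthesized audio chunks (roughly
    1-2 sentences ahead) between the TTS stage and the WebSocket writer.
    A capacity of 4 chunks is an illustrative choice."""

    def __init__(self, capacity: int = 4):
        self._buf: deque = deque(maxlen=capacity)

    def push(self, chunk: bytes) -> bool:
        """Queue a chunk; returns False when full (backpressure signal
        telling the TTS stage to pause synthesis)."""
        if len(self._buf) == self._buf.maxlen:
            return False
        self._buf.append(chunk)
        return True

    def pop(self) -> Optional[bytes]:
        """Next chunk to write to the persistent WebSocket, or None."""
        return self._buf.popleft() if self._buf else None

    def clear(self) -> None:
        """Drop all queued audio at once, e.g. when the caller interrupts."""
        self._buf.clear()

rb = AudioRingBuffer(capacity=2)
print(rb.push(b"chunk-1"), rb.push(b"chunk-2"), rb.push(b"chunk-3"))
# -> True True False
print(rb.pop())  # -> b'chunk-1'
```

Keeping the buffer small (1–2 sentences) is a deliberate trade-off: enough queued audio to stream continuously, but little enough that clearing it on an interruption halts playback almost immediately.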
Measured results
| Metric | p50 | p95 | p99 |
|---|---|---|---|
| End-to-end latency (first audio byte) | 541ms | 892ms | 1,240ms |
| VAD end-of-speech detection | 18ms | 45ms | 82ms |
| STT to LLM first token (TTFT) | 298ms | 410ms | 580ms |
| LLM to TTS first audio | 183ms | 260ms | 390ms |
| Interruption halt latency | 94ms | 210ms | 340ms |