Engineering · Voice AI · Architecture

Interruptions, Barge-In, and Turn-Taking: The Hardest Problems in Conversational Voice AI

February 18, 2026 · 10 min read · Ortavox Engineering

Most voice AI demos sound great until someone tries to interrupt. Here is a deep technical analysis of how interruption detection, barge-in, and turn-taking work — and why getting them wrong destroys call quality.

There is a specific moment in every voice AI demo where the product either feels human or feels robotic. It is not the quality of the voice. It is not even the quality of the LLM response. It is the moment the user tries to speak while the bot is talking. How the agent handles that split second determines whether the conversation feels natural or broken.

This post covers everything we have learned about interruptions, barge-in, and turn-taking from running millions of production voice calls. Some of it is counterintuitive. Almost all of it is harder than it looks.

What 'interruption handling' actually involves

When a user speaks while the agent is playing audio, the system must do five things in sequence — all within a few hundred milliseconds:

  1. Detect that new speech has started (VAD on the incoming audio stream)
  2. Distinguish real speech from background noise, echo, or the agent's own audio bleeding into the microphone
  3. Make a confidence decision: is this a genuine interruption or an accidental sound?
  4. Halt the outgoing TTS audio buffer and stop LLM generation
  5. Clear the conversation state and re-enter listening mode cleanly
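Condensed into code, the five steps look roughly like the following sketch. This is a hypothetical handler, not any platform's actual API; `cancel_tts`, `abort_llm`, and `reset_turn_state` are placeholder event names:

```python
from dataclasses import dataclass, field

@dataclass
class BargeInHandler:
    """Hypothetical sketch of the five-step interruption sequence."""
    vad_threshold: float = 0.6
    events: list = field(default_factory=list)

    def on_audio_frame(self, speech_prob: float, is_echo: bool) -> bool:
        # Step 1: did VAD fire on this inbound frame?
        if speech_prob < self.vad_threshold:
            return False
        # Step 2: reject the agent's own audio bleeding back into the mic
        if is_echo:
            return False
        # Step 3: here the confidence decision is the threshold test itself;
        # a production system would also require consecutive frames
        # Steps 4-5: halt output and reset conversation state
        self.events += ["cancel_tts", "abort_llm", "reset_turn_state"]
        return True
```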

Step 2 is where most platforms fail. Echo cancellation on phone calls is imperfect. The agent's audio playing through a speaker — especially on a phone held to someone's ear — creates acoustic feedback that triggers false VAD activations. Without explicit echo cancellation at the platform level, the agent will constantly interrupt itself.

The four failure modes

1. False barge-in (most common)

The agent stops mid-sentence because it detected background noise, a cough, or echo as speech. This is the most common production failure. The fix requires tuning VAD sensitivity per-deployment: a call center with noisy open offices needs different thresholds than a healthcare call where the patient is in a quiet room.

2. Missed interruption (worst UX)

The user speaks but the agent ignores them and keeps talking. This typically happens when VAD sensitivity is set too low (over-corrected to avoid false barge-ins), or when the platform processes audio in chunks that are too large (200ms+ frames instead of 20ms frames). The user feels unheard — the most damaging outcome for conversational trust.
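The frame-size point is easy to quantify. In the worst case, speech begins just after a frame boundary, so detection waits one full frame of capture delay plus however many confirmation frames the detector requires (two is assumed below for illustration):

```python
def worst_case_detection_ms(frame_ms: int, confirm_frames: int = 2) -> int:
    # one frame of capture delay + confirm_frames of confirmation delay
    return frame_ms + confirm_frames * frame_ms

small_frames = worst_case_detection_ms(20)    # 20ms frames: 60ms worst case
large_frames = worst_case_detection_ms(200)   # 200ms frames: 600ms worst case
```

At 20ms frames the worst case is imperceptible; at 200ms frames the user has been talking for more than half a second before the agent even notices.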

3. Slow halt latency

The agent detects the interruption but takes 500–800ms to actually stop talking. This happens when the audio buffer is large (to handle jitter), when the TTS synthesis pipeline doesn't support mid-stream cancellation, or when the WebSocket backpressure management is naive. The user hears the agent keep talking for nearly a second after they spoke.
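Mid-stream cancellation is mostly a buffering discipline. A minimal sketch of an outbound buffer that supports it (illustrative, not a real platform API):

```python
from collections import deque

class CancellableAudioBuffer:
    """Outbound TTS buffer that can be flushed mid-stream (sketch)."""

    def __init__(self):
        self._chunks = deque()
        self._cancelled = False

    def push(self, chunk: bytes) -> None:
        # late-arriving synthesis chunks after a cancel are dropped
        if not self._cancelled:
            self._chunks.append(chunk)

    def cancel(self) -> None:
        # Drop everything not yet sent; halt latency is then bounded by
        # what the transport has already flushed, not by buffer depth.
        self._cancelled = True
        self._chunks.clear()

    def pop(self):
        return self._chunks.popleft() if self._chunks else None
```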

4. Dirty state after barge-in

The agent stops, hears the new input, but then continues with a non-sequitur because the LLM context was not properly cleared. The half-generated response text from before the interruption is still in the buffer and gets appended to the new prompt. This produces responses that reference things the agent never said.
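One way to avoid the dirty-state failure is to reconcile the conversation history against what was actually played. The helper below is a hypothetical sketch: it keeps only the spoken prefix of the interrupted response and discards the unspoken remainder rather than appending it to the next prompt:

```python
def reset_after_barge_in(history: list, spoken_prefix: str) -> list:
    """Hypothetical helper: rebuild history after a barge-in.
    `spoken_prefix` is the portion of the agent's response that was
    actually played before the cancel; everything else the LLM had
    generated is thrown away."""
    new_history = list(history)
    if spoken_prefix:
        # record only what the user audibly heard, marked as cut off
        new_history.append({
            "role": "assistant",
            "content": spoken_prefix.rstrip() + " [interrupted]",
        })
    return new_history
```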

How Ortavox handles it

Ortavox runs a 20ms-frame neural VAD model on every inbound audio stream. VAD produces a probability score, not a binary flag — this score is compared against an adaptive threshold that adjusts based on the ambient noise floor of the first 2 seconds of the call. This significantly reduces false barge-ins in noisy environments without sacrificing real-interruption detection.
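A noise-floor-adaptive threshold can be sketched in a few lines. The constants here are illustrative, not Ortavox's actual tuning:

```python
def adaptive_threshold(calibration_probs: list, base: float = 0.5,
                       margin: float = 0.2, cap: float = 0.9) -> float:
    """VAD scores observed during the first ~2s of the call are assumed
    to be ambient noise; the barge-in threshold sits `margin` above
    that floor, clamped to the range [base, cap]."""
    noise_floor = sum(calibration_probs) / len(calibration_probs)
    return min(cap, max(base, noise_floor + margin))
```

In a quiet room the floor is negligible and the base threshold applies; on a noisy line the threshold rises with the floor, which is what suppresses false barge-ins without disabling detection entirely.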

When barge-in is confirmed (probability > threshold for 2 consecutive 20ms frames), the system: (1) immediately sends a cancel signal to the TTS stream, (2) drops all buffered audio chunks that have not yet been sent, (3) cancels any pending LLM generation tokens via streaming abort, and (4) resets the conversation turn state. Measured halt latency: p50 94ms, p95 210ms.
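The two-consecutive-frame confirmation rule is simple to state precisely. A sketch (the real detector also manages the cancel fan-out described above):

```python
class BargeInDetector:
    """Confirms barge-in after N consecutive frames above threshold,
    mirroring the 2-frame rule described above (names illustrative)."""

    def __init__(self, threshold: float, confirm_frames: int = 2):
        self.threshold = threshold
        self.confirm_frames = confirm_frames
        self._streak = 0

    def update(self, speech_prob: float) -> bool:
        if speech_prob > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any sub-threshold frame resets the streak
        return self._streak >= self.confirm_frames
```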

For deployments with extreme noise (factory floors, construction sites), set VAD sensitivity to 'low' in your agent config. For quiet professional environments, 'high' sensitivity gives the most responsive interruption detection.
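As a config sketch, the two settings might look like this. The key names are illustrative, not Ortavox's actual schema:

```python
# Hypothetical agent config fragments; key names are illustrative.
factory_floor_config = {
    "vad": {
        "sensitivity": "low",   # tolerate loud ambient noise
        "frame_ms": 20,
        "confirm_frames": 2,
    },
}

quiet_office_config = {
    "vad": {
        "sensitivity": "high",  # most responsive interruption detection
        "frame_ms": 20,
        "confirm_frames": 2,
    },
}
```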

Turn-taking: beyond interruptions

Turn-taking is the broader problem of managing the conversational floor — who speaks when. Interruptions are just one aspect. The harder problem is end-of-turn detection: knowing when the user has finished speaking and the agent should respond.

Silence-timeout approaches (wait for 500ms of silence → respond) are simple but add 500ms of latency to every single turn. They also fail in two ways: triggering on mid-sentence pauses (the user says 'I was wondering...' and pauses to think), and failing to trigger cleanly when a user trails off gradually rather than stopping outright.
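A minimal silence-timeout endpointer makes both failure modes concrete (illustrative sketch):

```python
class SilenceTimeoutEndpointer:
    """Naive end-of-turn detector: respond after `timeout_ms` of
    continuous silence (sketch)."""

    def __init__(self, timeout_ms: int = 500, frame_ms: int = 20):
        self.silence_frames_needed = timeout_ms // frame_ms
        self._silent = 0

    def update(self, is_speech: bool) -> bool:
        # any speech frame resets the silence counter
        self._silent = 0 if is_speech else self._silent + 1
        return self._silent >= self.silence_frames_needed
```

Every turn pays the full timeout even when the user stopped cleanly, and any thinking pause longer than the timeout fires the response prematurely.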

More sophisticated approaches use prosodic features (pitch, energy, duration) combined with semantic completeness signals from the partial transcript. Ortavox's VAD model is trained to distinguish 'thinking pause' from 'end of turn' using these features, reducing premature agent responses by approximately 60% compared to pure silence-timeout VAD in our production data.
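A toy fusion of those signals illustrates the idea. The weights and the completeness heuristic below are invented for this sketch and bear no relation to the production model:

```python
def end_of_turn_score(silence_prob: float, pitch_falling: bool,
                      partial_transcript: str) -> float:
    """Toy fusion of acoustic and semantic end-of-turn signals.
    Weights and the completeness heuristic are illustrative only."""
    text = partial_transcript.strip().lower()
    # crude semantic-completeness proxy: a transcript ending on a
    # conjunction or filler word is probably mid-thought
    incomplete_endings = ("and", "but", "so", "because", "um", "uh")
    complete = 0.0 if text.endswith(incomplete_endings) else 1.0
    prosody = 1.0 if pitch_falling else 0.0
    return 0.4 * silence_prob + 0.2 * prosody + 0.4 * complete
```

A real system would learn these weights and use a model-based completeness signal, but the structure — acoustic evidence gated by what the partial transcript says — is the same.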

Backchanneling: the next frontier

Human conversations use backchannels — brief acknowledgments ('mm-hmm', 'I see', 'right') — to signal that the listener is engaged. Current voice AI agents are entirely silent when the user is speaking. This silence feels unnatural in longer turns (10+ seconds of user speech). Backchanneling is an active area of research for the next generation of conversational AI and will meaningfully improve perceived naturalness.

Ready to build?

Start with 100 free minutes. No credit card required.