Mistral Voxtral TTS: What It Means for Voice AI Developers
Mistral just released Voxtral TTS — an open-weights 4B parameter model with 70ms latency, 3-second voice cloning, and $0.016/1K character pricing. Here is what it changes for developers building voice agents.
On March 23, 2026, Mistral AI released Voxtral TTS, a roughly 4-billion-parameter text-to-speech model. By Mistral's own benchmarks it is competitive with ElevenLabs v3, its API is priced below most hosted alternatives, and the weights are available on Hugging Face. For developers building voice AI agents, that combination is worth evaluating.
What Voxtral TTS is
Voxtral TTS is a transformer-based autoregressive flow-matching model with three components: a 3.4B parameter decoder backbone (built on Ministral 3B), a 390M parameter flow-matching acoustic transformer, and a 300M parameter neural audio codec. Total: approximately 4 billion parameters.
| Spec | Value |
|---|---|
| Architecture | Transformer + flow-matching acoustic model |
| Parameter count | ~4B (3.4B backbone + 390M acoustic + 300M codec) |
| Languages | 9 (EN, FR, DE, ES, NL, PT, IT, HI, AR) |
| Voice cloning threshold | 3 seconds of reference audio |
| Latency (typical input) | ~70ms |
| Real-time factor | ~9.7x (generates audio 9.7x faster than real-time) |
| Max output length | 2 minutes per generation |
| Pricing (API) | $0.016 per 1,000 characters |
| License | CC BY NC 4.0 (open weights on Hugging Face) |
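The real-time factor in the table can be turned into a rough estimate of total generation time. The sketch below does that arithmetic in Python. It assumes the ~9.7x factor holds for every clip length, which the article does not confirm, and the function name is illustrative only.

```python
# Figures taken from the spec table above.
REAL_TIME_FACTOR = 9.7      # seconds of audio produced per second of compute
MAX_OUTPUT_SECONDS = 120    # 2-minute cap per generation


def synthesis_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimate wall-clock generation time for a clip of the given duration.

    Assumes the real-time factor is constant across clip lengths.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / rtf


if __name__ == "__main__":
    for clip in (5, 30, MAX_OUTPUT_SECONDS):
        print(f"{clip:>4}s of audio -> ~{synthesis_seconds(clip):.1f}s to generate")
```

Under that assumption, a maximum-length 2-minute generation would take roughly 12 seconds end to end. That is a different number from the ~70ms latency figure, so streaming matters for long replies.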
How it compares to ElevenLabs and OpenAI
Mistral's own benchmarks place Voxtral above ElevenLabs Flash v2.5 in naturalness and on par with ElevenLabs v3 (the current quality standard) — with emotion-steering support added. OpenAI TTS-1-HD is slower and less expressive by comparison. The 70ms latency at typical input puts Voxtral in the same tier as Cartesia's Sonic model, which has been the go-to choice for latency-sensitive voice agents.
| Model | Latency | Price per 1K chars | Voice cloning | Open weights |
|---|---|---|---|---|
| Voxtral TTS (Mistral) | ~70ms | $0.016 | Yes (3s) | Yes (CC BY NC) |
| ElevenLabs Flash v2.5 | ~75ms | ~$0.030 | Yes | No |
| ElevenLabs v3 | ~150ms | ~$0.060 | Yes | No |
| OpenAI TTS-1-HD | ~200ms | $0.030 | No | No |
| Cartesia Sonic | ~80ms | $0.015 | Yes (voice design) | No |
| Azure Neural TTS | ~120ms | $0.016 | Custom only | No |
Pricing comparison note: ElevenLabs pricing depends heavily on tier. The figures above reflect pay-as-you-go API rates as of March 2026. Volume discounts apply on all platforms.
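To see what the per-1K-character rates mean at a given volume, the sketch below multiplies them out. The prices are copied from the table above. The monthly character volume is a hypothetical input you would replace with your own, and volume discounts are ignored.

```python
# Per-1K-character rates copied from the comparison table above (USD).
PRICE_PER_1K_CHARS = {
    "Voxtral TTS": 0.016,
    "ElevenLabs Flash v2.5": 0.030,
    "ElevenLabs v3": 0.060,
    "OpenAI TTS-1-HD": 0.030,
    "Cartesia Sonic": 0.015,
    "Azure Neural TTS": 0.016,
}


def monthly_cost(chars_per_month: int, price_per_1k: float) -> float:
    """List-price cost of synthesizing `chars_per_month` characters."""
    return chars_per_month / 1000 * price_per_1k


def cost_table(chars_per_month: int) -> list[tuple[str, float]]:
    """Return (provider, monthly cost) pairs sorted from cheapest to priciest."""
    costs = {
        name: round(monthly_cost(chars_per_month, price), 2)
        for name, price in PRICE_PER_1K_CHARS.items()
    }
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # 10 million characters per month is a hypothetical workload.
    for name, cost in cost_table(10_000_000):
        print(f"{name:<24} ${cost:,.2f}")
```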
The open-weights angle is significant
Every other major TTS provider in this comparison is fully closed. Voxtral's CC BY NC 4.0 license means you can download the weights, run inference on your own hardware, and avoid per-character API costs entirely for non-commercial workloads. For enterprise deployments with data residency requirements or extremely high call volumes (500K+ min/month), self-hosted Voxtral may also be economically compelling, subject to the license terms.
The catch: CC BY NC means no commercial self-hosting without a separate agreement with Mistral. For commercial use, you are still on the API. But the open weights enable fine-tuning, evaluation, and research that closed models cannot.
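Whether self-hosting pays off at 500K+ minutes per month depends on numbers the article does not give. The sketch below shows the break-even arithmetic only. The characters-per-minute rate and the monthly self-hosting bill are hypothetical placeholders, not measured figures, and any commercial license fee would have to be added to the self-hosting side.

```python
API_PRICE_PER_1K_CHARS = 0.016  # Voxtral API rate quoted in the article (USD)


def api_cost(minutes_per_month: float, chars_per_minute: float) -> float:
    """Monthly API spend for a given volume of synthesized speech."""
    return minutes_per_month * chars_per_minute / 1000 * API_PRICE_PER_1K_CHARS


def break_even_minutes(self_host_monthly_usd: float, chars_per_minute: float) -> float:
    """Minutes per month above which a fixed self-hosting bill beats the API."""
    cost_per_minute = chars_per_minute / 1000 * API_PRICE_PER_1K_CHARS
    return self_host_monthly_usd / cost_per_minute


if __name__ == "__main__":
    CHARS_PER_MINUTE = 900      # assumption: rough density of spoken text
    SELF_HOST_MONTHLY = 5_000   # assumption: GPUs plus operations, in USD

    print(f"API cost at 500K min/month: ${api_cost(500_000, CHARS_PER_MINUTE):,.0f}")
    print(f"Break-even volume: {break_even_minutes(SELF_HOST_MONTHLY, CHARS_PER_MINUTE):,.0f} min/month")
```

With those placeholder inputs, 500K minutes costs about $7,200 per month on the API, so the comparison is sensitive to how much the self-hosted stack really costs to run.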
Voice cloning at 3 seconds
Voxtral's voice adaptation requires as little as 3 seconds of reference audio, lower than most competitors. More impressively, it supports zero-shot cross-lingual voice transfer: you can generate French-accented English from a French speaker reference, or Spanish-accented English from a Spanish reference. The adapted voice keeps the speaker's natural delivery: pauses, rhythm, intonation, and emotional range.
For voice agent applications — where brands often want a consistent agent persona across languages — this is genuinely useful. A single reference voice can be adapted across all 9 supported languages without per-language re-recording.
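Before submitting a reference clip, it is cheap to confirm that it clears the 3-second threshold. The sketch below does this for uncompressed WAV files with Python's standard `wave` module. The function names are illustrative, and it checks duration only, not audio quality or the formats Voxtral actually accepts.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # cloning threshold quoted in the article


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum
```

A call such as `is_usable_reference("agent_voice.wav")` (a hypothetical file name) returns `False` for clips that are too short, which lets an upload step reject them early.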
How to use Voxtral with Ortavox
Ortavox supports bring-your-own-key (BYOK) for TTS providers, including any OpenAI-compatible TTS endpoint. Once Voxtral's API endpoint stabilizes post-launch, you can configure it as your TTS provider directly in your Ortavox agent settings:
{
  "tts": {
    "provider": "custom",
    "endpoint": "https://api.mistral.ai/v1/audio/speech",
    "model": "voxtral-tts",
    "voice": "your-cloned-voice-id",
    "api_key": "YOUR_MISTRAL_API_KEY",
    "audio_format": "pcm_16000"
  }
}

At $0.016/1K chars and ~70ms latency, Voxtral is among the strongest cost-to-quality TTS options for voice agents. We recommend benchmarking it against your current ElevenLabs or Cartesia setup, especially if you handle multilingual calls.
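For a quick benchmark outside Ortavox, the endpoint from the config above can be called directly. The sketch below builds the request and times the first byte of the response using only the standard library. The body fields (`input`, `voice`, `response_format`) follow OpenAI's speech API and are an assumption about what Mistral's endpoint accepts, so check them against Mistral's documentation once the API stabilizes.

```python
import json
import time
import urllib.request

# Endpoint and model name come from the config above. The body fields follow
# OpenAI's speech API shape, which is an ASSUMPTION about Mistral's endpoint.
ENDPOINT = "https://api.mistral.ai/v1/audio/speech"


def build_speech_request(text: str, voice: str, api_key: str,
                         model: str = "voxtral-tts",
                         response_format: str = "pcm") -> urllib.request.Request:
    """Build (but do not send) a POST request for one synthesis call."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def time_to_first_byte(request: urllib.request.Request, timeout: float = 10.0) -> float:
    """Send the request and return seconds elapsed until the first body byte."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read(1)
    return time.perf_counter() - start
```

Running `time_to_first_byte` over the same set of sentences for each provider gives a like-for-like latency comparison from your own network location, which is more informative than published figures.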
What this means for the market
Mistral entering TTS puts pricing pressure on ElevenLabs, Cartesia, and Azure. ElevenLabs Flash v2.5 at $0.030/1K chars now costs nearly twice as much as Voxtral ($0.016) for comparable quality. Cartesia at $0.015 remains competitive on price but lacks open weights and cross-lingual cloning.
For platform-agnostic voice infrastructure like Ortavox, this is straightforwardly positive: more TTS options at lower cost mean cheaper calls for developers. If Voxtral's parity with ElevenLabs v3 holds up outside Mistral's own benchmarks, choosing the cheaper option no longer means trading away quality.