Mistral Voxtral TTS: What It Means for Voice AI Developers

March 25, 2026·7 min read·Ortavox Team

Mistral just released Voxtral TTS — an open-weights 4B parameter model with 70ms latency, 3-second voice cloning, and $0.016/1K character pricing. Here is what it changes for developers building voice agents.

On March 23, 2026, Mistral AI released Voxtral TTS, a ~4 billion parameter text-to-speech model that Mistral benchmarks as competitive with ElevenLabs v3, that is priced below most hosted alternatives, and that is available as open weights on Hugging Face. For developers building voice AI agents, it is worth a close look.

What Voxtral TTS is

Voxtral TTS is a transformer-based autoregressive flow-matching model with three components: a 3.4B parameter decoder backbone (built on Ministral 3B), a 390M parameter flow-matching acoustic transformer, and a 300M parameter neural audio codec. Total: approximately 4 billion parameters.

Spec	Value
Architecture	Transformer + flow-matching acoustic model
Parameter count	~4B (3.4B backbone + 390M acoustic + 300M codec)
Languages	9 (EN, FR, DE, ES, NL, PT, IT, HI, AR)
Voice cloning threshold	3 seconds of reference audio
Latency (typical input)	~70ms
Real-time factor	~9.7x (generates audio 9.7x faster than real-time)
Max output length	2 minutes per generation
Pricing (API)	$0.016 per 1,000 characters
License	CC BY-NC 4.0 (open weights on Hugging Face)
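
To make the real-time factor in the table concrete, here is a minimal sketch of the arithmetic. It assumes the ~9.7x figure holds for the whole clip and says nothing about the ~70ms latency figure, which is presumably time to first audio. The function name is illustrative, not part of any Mistral SDK.

```python
def generation_seconds(audio_seconds: float, real_time_factor: float = 9.7) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor.

    A real-time factor of 9.7 means 9.7 seconds of audio are produced
    per second of compute, so generation time is duration / factor.
    """
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


# Longest single generation in the spec table: 2 minutes of audio.
full_clip = generation_seconds(120.0)
print(f"120 s of audio takes about {full_clip:.1f} s to generate")
```

At that rate, a maximum-length 2-minute generation would finish in roughly 12 seconds, which is why streaming the output matters more for voice agents than total generation time.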

How it compares to ElevenLabs and OpenAI

Mistral's own benchmarks place Voxtral above ElevenLabs Flash v2.5 in naturalness and on par with ElevenLabs v3 (the current quality standard), with emotion steering also supported. OpenAI TTS-1-HD is slower and less expressive by comparison. At ~70ms for typical input, Voxtral's latency is in the same tier as Cartesia's Sonic model, which has been the go-to choice for latency-sensitive voice agents.

Model	Latency	Price per 1K chars	Voice cloning	Open weights
Voxtral TTS (Mistral)	~70ms	$0.016	Yes (3s)	Yes (CC BY-NC)
ElevenLabs Flash v2.5	~75ms	~$0.030	Yes	No
ElevenLabs v3	~150ms	~$0.060	Yes	No
OpenAI TTS-1-HD	~200ms	$0.030	No	No
Cartesia Sonic	~80ms	$0.015	Yes (voice design)	No
Azure Neural TTS	~120ms	$0.016	Custom only	No

Pricing comparison note: ElevenLabs pricing depends heavily on tier. The figures above reflect pay-as-you-go API rates as of March 2026. Volume discounts apply on all platforms.
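
Per-character prices are easier to compare once they are converted into a monthly bill. The sketch below does that with the prices from the table. The 900 characters per minute of speech is our assumption (about 150 words per minute at roughly 6 characters per word, spaces included), not a figure from any provider, so substitute your own measurements.

```python
# Per-1K-character prices from the comparison table (USD).
PRICE_PER_1K_CHARS = {
    "Voxtral TTS": 0.016,
    "ElevenLabs Flash v2.5": 0.030,
    "ElevenLabs v3": 0.060,
    "OpenAI TTS-1-HD": 0.030,
    "Cartesia Sonic": 0.015,
    "Azure Neural TTS": 0.016,
}

# ASSUMPTION: characters of text per minute of synthesized speech.
CHARS_PER_MINUTE = 900


def monthly_cost(price_per_1k: float, minutes: float,
                 chars_per_minute: float = CHARS_PER_MINUTE) -> float:
    """TTS bill in USD for a given number of synthesized minutes."""
    total_chars = minutes * chars_per_minute
    return price_per_1k * total_chars / 1000


def cost_table(minutes: float) -> dict:
    """Monthly cost per provider, cheapest first, rounded to cents."""
    by_price = sorted(PRICE_PER_1K_CHARS.items(), key=lambda item: item[1])
    return {name: round(monthly_cost(price, minutes), 2) for name, price in by_price}


for name, cost in cost_table(10_000).items():
    print(f"{name:<24} ${cost:>10,.2f}")
```

Only the minutes the agent actually spends speaking count here. Caller speech and silence do not generate TTS characters.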

The open-weights angle is significant

Every other major TTS provider in this comparison is fully closed. Voxtral's CC BY-NC 4.0 license means you can download the weights, run inference on your own hardware, and avoid per-character API costs entirely for non-commercial workloads. For enterprise deployments with data residency requirements or extremely high call volumes (500K+ min/month), self-hosted Voxtral may be economically compelling, provided the licensing terms allow it.

The catch: CC BY-NC means no commercial self-hosting without a separate agreement with Mistral. For commercial use, you are still on the API. But the open weights enable fine-tuning, evaluation, and research that closed models do not allow.
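
For teams weighing the API against a licensed self-hosted deployment, a rough break-even estimate looks like the sketch below. Both the 900 characters per minute and the $3,000 monthly infrastructure cost are hypothetical placeholders, not Mistral figures or measured numbers. Replace them with your own.

```python
PRICE_PER_1K_CHARS = 0.016  # Voxtral API price quoted in this article (USD)
CHARS_PER_MINUTE = 900      # ASSUMPTION: text characters per minute of speech


def api_cost_per_minute(price_per_1k: float = PRICE_PER_1K_CHARS,
                        chars_per_minute: float = CHARS_PER_MINUTE) -> float:
    """API cost in USD for one minute of synthesized speech."""
    return price_per_1k * chars_per_minute / 1000


def breakeven_minutes(self_hosted_monthly_cost: float) -> float:
    """Monthly volume at which a fixed self-hosting cost equals the API bill."""
    return self_hosted_monthly_cost / api_cost_per_minute()


# API bill at the 500K min/month volume mentioned above.
api_bill = 500_000 * api_cost_per_minute()

# HYPOTHETICAL fixed cost for GPUs, ops, and licensing; not a real quote.
HYPOTHETICAL_SELF_HOSTED_COST = 3_000.0

print(f"API bill at 500K min/month: ${api_bill:,.0f}")
print(f"Break-even volume: {breakeven_minutes(HYPOTHETICAL_SELF_HOSTED_COST):,.0f} min/month")
```

Under these assumptions the API costs about $7,200 per month at 500K minutes, and a $3,000 self-hosted setup would break even at roughly 208,000 minutes. The model ignores engineering time and the cost of a commercial license, both of which push the break-even point higher.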

Voice cloning at 3 seconds

Voxtral's voice adaptation requires as little as 3 seconds of reference audio, lower than most competitors. More impressively, it supports zero-shot cross-lingual voice transfer: you can generate French-accented English from a French speaker reference, or Spanish-accented English from a Spanish reference. The adaptation carries over the speaker's natural characteristics: pauses, rhythm, intonation, and emotional range.

For voice agent applications — where brands often want a consistent agent persona across languages — this is genuinely useful. A single reference voice can be adapted across all 9 supported languages without per-language re-recording.

How to use Voxtral with Ortavox

Ortavox supports bring-your-own-key (BYOK) for TTS providers, including any OpenAI-compatible TTS endpoint. Once Voxtral's API endpoint stabilizes post-launch, you can configure it as your TTS provider directly in your Ortavox agent settings:

```json
{
  "tts": {
    "provider": "custom",
    "endpoint": "https://api.mistral.ai/v1/audio/speech",
    "model": "voxtral-tts",
    "voice": "your-cloned-voice-id",
    "api_key": "YOUR_MISTRAL_API_KEY",
    "audio_format": "pcm_16000"
  }
}
```
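
If you want to test the endpoint outside Ortavox first, a request might be built along these lines. This is a sketch, not Mistral's documented API: the endpoint and model name come from the config above, while the body fields (`model`, `input`, `voice`, `response_format`) are assumed to follow the OpenAI-style speech request shape. Reading `pcm_16000` as 16-bit mono PCM at 16 kHz is also an assumption.

```python
import json
import os
import urllib.request

# Endpoint from the config above. The request body shape is ASSUMED to be
# OpenAI-compatible and should be checked against Mistral's documentation.
ENDPOINT = "https://api.mistral.ai/v1/audio/speech"


def build_speech_request(text: str, voice: str, api_key: str,
                         model: str = "voxtral-tts",
                         audio_format: str = "pcm_16000") -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": audio_format,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 16_000,
                         bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio, ASSUMING 16-bit mono samples at 16 kHz."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


# Read the key from the environment rather than hard-coding it.
key = os.environ.get("MISTRAL_API_KEY", "YOUR_MISTRAL_API_KEY")
request = build_speech_request("Hello from Voxtral.", "your-cloned-voice-id", key)
print(request.get_method(), request.full_url)

# To send it for real:
# with urllib.request.urlopen(request, timeout=10) as response:
#     audio = response.read()
#     print(f"{pcm_duration_seconds(len(audio)):.2f} s of audio")
```

Keeping request construction separate from sending makes it easy to inspect the payload and to time the network call on its own when comparing providers.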

At $0.016/1K chars and ~70ms latency, Voxtral is now one of the strongest cost-to-quality TTS options for voice agents, with only Cartesia Sonic priced slightly lower in our comparison. We recommend benchmarking it against your current ElevenLabs or Cartesia setup, especially if you handle multilingual calls.

What this means for the market

Mistral entering TTS puts pricing pressure on ElevenLabs, Cartesia, and Azure. ElevenLabs Flash v2.5 at $0.030/1K chars is now nearly twice the price of Voxtral ($0.016/1K chars) for comparable quality. Cartesia at $0.015 remains competitive on price but lacks open weights and cross-lingual cloning.

For platform-agnostic voice infrastructure like Ortavox, this is straightforwardly positive. More TTS options at lower cost means cheaper calls for developers. If Mistral's benchmark results hold up in independent testing, Voxtral's parity with ElevenLabs v3 means choosing the cheaper option no longer requires a quality trade-off.
