speech-to-text streaming-latency diarization open-weights api-pricing

Voxtral Transcribe 2 ships with sub-200ms latency

Two new speech-to-text models: Mini for batch work at $0.003/min with diarization, and Realtime for live agents with latency configurable down to sub-200ms. Realtime is also available as open weights.

July 3, 2026

Summary

Replaces chunked-audio adapters with a native streaming architecture, enabling real-time voice agents without retrofitting offline models for live use. Diarization and context biasing reduce post-processing overhead for meeting transcription and domain-specific vocabularies.

Why it matters

Implementation verdict

Realtime can replace Deepgram Nova and AssemblyAI for voice agents; Mini competes on cost and accuracy for batch work. Requires an API key or a Hugging Face weight download. Available now, with a playground in Mistral Studio for immediate testing. Worth trialing if you're building voice UX or call center automation.
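For a rough sense of what the $0.003/min batch price means at volume, here is a minimal sketch of the arithmetic. It assumes billing is strictly linear per minute with no rounding, minimums, or tiers, which this brief does not confirm; the function name and the 1,000-hour workload are hypothetical.

```python
from decimal import Decimal

# Mini batch price quoted in this brief (USD per audio minute).
PRICE_PER_MINUTE_USD = Decimal("0.003")


def batch_cost_usd(audio_seconds, price_per_minute=PRICE_PER_MINUTE_USD):
    """Estimate transcription cost in USD.

    Assumption: linear per-minute billing with no rounding up,
    minimum charge, or volume tiers. Check the provider's pricing
    page before budgeting against this number.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    # Round to a hundredth of a cent for display.
    return (minutes * price_per_minute).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    monthly_hours = 1000  # hypothetical workload
    print(batch_cost_usd(monthly_hours * 3600))
```

Under those assumptions, one hour of audio comes to about $0.18, so the per-minute price matters mainly for large archives or high call volume.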

Sources

1. latency configurable down to sub-200ms
2. approximately 4% word error rate on FLEURS and $0.003/min
3. At 480ms delay, it stays within 1-2% word error rate, enabling voice agents with near-offline accuracy
4. 4B parameter footprint, it runs efficiently on edge devices
5. Voxtral Realtime ships under Apache 2.0

Dev Signal
