Run a local speech pipeline for Reachy Mini robots
A VAD → STT → LLM → TTS cascade on a single machine eliminates the cloud dependency; swap components as models improve.
May 28, 2026
Summary
speech-to-speech is a cascaded VAD → STT → LLM → TTS pipeline that runs on a single machine and exposes a Realtime API-compatible /v1/realtime WebSocket, so Reachy Mini voice agents can run without a cloud speech backend.
Why it matters
Running the pipeline locally removes API latency, API cost, and the cloud privacy surface from voice agent deployments. Developers can iterate on each pipeline component independently without redeploying the entire infrastructure.
Implementation verdict
Replaces cloud speech backends (OpenAI Realtime API, Hugging Face Inference Endpoints). Bootstrapping requires llama.cpp, the speech-to-speech CLI, and 2-3 terminal sessions. Ready now: Gemma-4, Silero VAD, Parakeet-TDT, and Qwen3-TTS are tested and recommended. The latency bottleneck is LLM inference; to scale, decouple the LLM from the rest of the pipeline via the Responses API protocol.
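To make the "swap components" idea concrete, here is a minimal sketch of a cascade whose stages are plain callables, with per-stage timing so the slowest stage is visible. It is not the speech-to-speech implementation: the `Cascade` class and the `fake_*` stages are hypothetical stand-ins for real VAD, STT, LLM, and TTS models.

```python
from dataclasses import dataclass, field
from time import perf_counter
from typing import Callable, Dict, Optional


@dataclass
class Cascade:
    """Hypothetical VAD -> STT -> LLM -> TTS cascade with swappable stages."""
    vad: Callable[[bytes], Optional[bytes]]  # speech segment, or None for silence
    stt: Callable[[bytes], str]              # speech audio -> transcript
    llm: Callable[[str], str]                # transcript -> reply text
    tts: Callable[[str], bytes]              # reply text -> audio
    timings: Dict[str, float] = field(default_factory=dict)

    def _timed(self, name, fn, arg):
        # Record wall-clock seconds spent in one stage.
        start = perf_counter()
        out = fn(arg)
        self.timings[name] = perf_counter() - start
        return out

    def run(self, audio: bytes) -> Optional[bytes]:
        self.timings.clear()
        speech = self._timed("vad", self.vad, audio)
        if speech is None:
            return None  # no speech detected, skip the later stages
        text = self._timed("stt", self.stt, speech)
        reply = self._timed("llm", self.llm, text)
        return self._timed("tts", self.tts, reply)


# Stand-in stages (assumptions for illustration, not real models).
def fake_vad(audio: bytes) -> Optional[bytes]:
    return audio if any(audio) else None  # all-zero bytes count as silence

def fake_stt(speech: bytes) -> str:
    return "hello robot"

def fake_llm(text: str) -> str:
    return f"You said: {text}"

def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = Cascade(fake_vad, fake_stt, fake_llm, fake_tts)
```

Swapping a stage is a single assignment, for example `pipeline.llm = some_other_llm`, and `pipeline.timings` after a `run` call shows where the time went.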
Sources
- 1. speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket
- 2. Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest
- 3. The main bottleneck in the system is LLM inference latency
- 4. Full support for the Responses API protocol, including tool-call streaming used by the speech-to-speech backend, landed in vLLM 0.21.0
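For readers unfamiliar with the Responses API protocol mentioned in source 4, the sketch below builds (but does not send) a streaming request with one tool attached. The base URL, model name, and the `move_head` tool are assumptions made for illustration; the payload follows the general shape of the OpenAI Responses API, and a real server may expect different settings.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: wherever the LLM server listens
MODEL = "my-local-model"            # assumption: placeholder model name


def build_responses_request(user_text: str) -> Request:
    """Build a POST to /v1/responses asking for a streamed reply."""
    payload = {
        "model": MODEL,
        "input": user_text,
        "stream": True,  # ask the server to stream events as they are generated
        "tools": [
            {
                "type": "function",
                "name": "move_head",  # hypothetical robot tool
                "description": "Turn the robot's head by a yaw angle.",
                "parameters": {
                    "type": "object",
                    "properties": {"yaw_degrees": {"type": "number"}},
                    "required": ["yaw_degrees"],
                },
            }
        ],
    }
    return Request(
        f"{BASE_URL}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_responses_request("Look to your left.")
```

Passing `req` to `urllib.request.urlopen` would send it. Because the request goes over HTTP, the LLM server can live on a different machine from the audio stages, which is the decoupling the verdict describes.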
Dev Signal