voice-agents local-inference cascade-architecture robotics llm-latency

Run local speech pipeline for Reachy Mini robots

A VAD → STT → LLM → TTS cascade running on a single machine eliminates the cloud dependency, and each component can be swapped as models improve.

May 28, 2026

Summary

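speech-to-speech is a cascaded VAD → STT → LLM → TTS pipeline that runs on one machine and exposes a Realtime API-compatible /v1/realtime WebSocket, so a Reachy Mini voice agent can run without a cloud speech backend. Each stage can be replaced independently as better models appear.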

Why it matters

Running the pipeline locally removes API round-trip latency, per-request cost, and the privacy exposure of sending audio to a third party. Developers can iterate on individual pipeline components without redeploying the entire infrastructure.

Implementation verdict

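Replaces cloud speech backends such as the OpenAI Realtime API and Hugging Face Inference Endpoints. Bootstrapping requires llama.cpp, the speech-to-speech CLI, and two to three terminal sessions. Ready now: Gemma-4, Silero VAD, Parakeet-TDT, and Qwen3-TTS are tested and recommended. The latency bottleneck is LLM inference. Because the speech-to-speech backend talks to the LLM over the Responses API protocol, the LLM server can be swapped or scaled separately from the speech stages; vLLM 0.21.0 added full support for that protocol, including tool-call streaming.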
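
The swap-one-stage-at-a-time idea and the LLM bottleneck can be illustrated with a minimal Python sketch. Everything here is hypothetical: the `Cascade` class, the stage names, and the `fake_*` stand-ins are invented for illustration and do not mirror the speech-to-speech package's actual API. The sketch only shows the shape of a cascade whose stages are replaceable and individually timed.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stage type: any callable that maps one value to the next.
Stage = Callable[[Any], Any]


@dataclass
class Cascade:
    """Ordered list of named stages, each replaceable on its own."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def swap(self, name: str, new_stage: Stage) -> None:
        """Replace one stage in place, leaving the others untouched."""
        for i, (stage_name, _) in enumerate(self.stages):
            if stage_name == name:
                self.stages[i] = (name, new_stage)
                return
        raise KeyError(f"no stage named {name!r}")

    def run(self, audio: Any) -> Tuple[Any, Dict[str, float]]:
        """Run every stage in order; record per-stage wall time in seconds."""
        timings: Dict[str, float] = {}
        value = audio
        for name, stage in self.stages:
            start = time.perf_counter()
            value = stage(value)
            timings[name] = time.perf_counter() - start
        return value, timings


# Stand-in stages (assumptions): real ones would wrap a VAD model,
# an STT model, an LLM server, and a TTS model.
def fake_vad(audio: str) -> str:
    return audio.strip()          # pretend to trim silence


def fake_stt(segment: str) -> str:
    return f"transcript of {segment}"


def fake_llm(text: str) -> str:
    return f"reply to: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")   # pretend to synthesize audio


pipeline = Cascade([
    ("vad", fake_vad),
    ("stt", fake_stt),
    ("llm", fake_llm),
    ("tts", fake_tts),
])

output, timings = pipeline.run("  hello  ")
slowest_stage = max(timings, key=timings.get)  # where to optimize first
```

With real models plugged in, the `timings` dictionary would be the place to confirm which stage dominates, and `pipeline.swap("llm", other_llm)` would change only that stage.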

Sources

1. speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket
2. Cascades are the most flexible option in the open-source landscape today, and with the right pieces they're also the fastest
3. The main bottleneck in the system is LLM inference latency
4. Full support for the Responses API protocol, including tool-call streaming used by the speech-to-speech backend, landed in vLLM 0.21.0
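
Since source 1 describes a Realtime API-compatible /v1/realtime WebSocket, a client would in principle send the same JSON events it sends to a hosted Realtime backend. The sketch below only builds those messages and opens no connection. The host and port are placeholders (only the /v1/realtime path comes from the source), the helper names are hypothetical, and which Realtime events the local server actually accepts is an assumption to verify against the project's documentation.

```python
import base64
import json


def realtime_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the WebSocket URL. Host and port are assumed placeholders."""
    return f"ws://{host}:{port}/v1/realtime"


def audio_append_event(pcm_bytes: bytes) -> str:
    """Realtime API `input_audio_buffer.append` event: base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def commit_event() -> str:
    """Realtime API `input_audio_buffer.commit` event: close the current turn."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def response_create_event() -> str:
    """Realtime API `response.create` event: ask the server for a reply."""
    return json.dumps({"type": "response.create"})


# Example message sequence for one short chunk of audio.
messages = [
    audio_append_event(b"\x00\x01"),
    commit_event(),
    response_create_event(),
]
```

A WebSocket client of your choice would send each string in `messages` to `realtime_url()`. Because the pipeline runs its own VAD, the explicit commit step may be unnecessary in practice.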

Dev Signal
