Most production voice assistants still make me wait two or three seconds at the P95—and that’s when they’re not fumbling a tool call. Hugging Face and Cerebras just posted a speech-to-speech demo that, for the first time, takes P95 latency off the table as a UX problem.
Open Stack, Closed Loops
The pipeline is a fully cascaded, modular system: Nvidia’s Parakeet for ASR → Google DeepMind’s Gemma 4 31B VLM on Cerebras → Alibaba’s Qwen3-TTS for synthesis. Every component is open and replaceable. That’s not academic: it already runs on 9,000 Reachy Mini robots in the wild.
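To make "cascaded and replaceable" concrete, here is a minimal sketch of the shape such a pipeline takes. It is not the demo's code: the `CascadedPipeline` class, the stage signatures, and the `fake_*` stand-ins are all hypothetical, chosen so the example runs without loading any model.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical stage signatures; the real demo's interfaces may differ.
ASR = Callable[[bytes], str]   # audio in, transcript out
LLM = Callable[[str], str]     # transcript in, reply text out
TTS = Callable[[str], bytes]   # reply text in, audio out


@dataclass(frozen=True)
class CascadedPipeline:
    """Three independent stages chained in order: ASR -> LLM -> TTS."""
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        transcript = self.asr(audio)   # speech to text
        reply = self.llm(transcript)   # text to text (the language model turn)
        return self.tts(reply)         # text to speech


# Stand-in stages (assumptions for illustration): they only shuffle bytes
# and strings around, so the sketch runs with no model installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(asr=fake_asr, llm=fake_llm, tts=fake_tts)

# Swapping one stage leaves the other two untouched.
shouting = replace(pipeline, llm=str.upper)
```

Because each stage is just a callable behind a fixed signature, replacing the language model (or the ASR, or the TTS) is a one-line change, which is the property the "replaceable" claim rests on.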
Cerebras targets the bottleneck that makes voice AI feel broken: the language model turn. Its hardware runs inference fast enough that the rest of the pipeline isn’t left waiting. The blog cites median latency improvements, but the real win is stability at the tail: in the demo, the P95 delays that make a conversation feel unreliable are gone.
Why This Changes the Game for Embodied AI
For a robot or voice assistant, a 300 ms median with occasional 2-second spikes feels dead. Cerebras’ consistent performance means responses arrive within a tight window. The demo uses Gemma 4 31B, not a tiny distilled model, so you get real reasoning capability without the latency tax.
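A quick way to see why the median hides the problem: compare a spiky latency profile against a steady one. The numbers below are made up for illustration (they are not measurements from the demo), and `percentile` is a small helper written here using the nearest-rank definition.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (assumed values, not demo data).
spiky = [300] * 18 + [2000] * 2   # mostly fast, two 2-second stalls
steady = [350] * 20               # slightly slower, but never stalls

spiky_p50, spiky_p95 = percentile(spiky, 50), percentile(spiky, 95)
steady_p50, steady_p95 = percentile(steady, 50), percentile(steady, 95)

print(f"spiky:  p50={spiky_p50} ms, p95={spiky_p95} ms")
print(f"steady: p50={steady_p50} ms, p95={steady_p95} ms")
```

The spiky profile wins on median (300 ms against 350 ms) and loses badly at P95 (2000 ms against 350 ms). In a conversation, the steady one is the one that feels alive.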
The team isn’t chasing cost savings here. They’re chasing the threshold where you stop noticing the AI is a machine. That’s the only latency metric that matters for conversational AI.
Poke at the demo on Hugging Face Spaces, grab the repo, and swap in your own models. The future of voice AI won’t be built behind closed APIs. It’ll be open, modular, and fast enough that the humans on the other end stop noticing the wait.
Source: Hugging Face and Cerebras bring Gemma 4 to real-time voice AI
Domain: huggingface.co