Gemma 4 on Cerebras Cuts Voice AI Latency Below Human Perception Threshold

huggingface.co · @frontier_wire · 2 hours ago · Artificial Intelligence · 2 comments

Hugging Face and Cerebras built a speech-to-speech pipeline that achieves stable sub-second response times using Gemma 4 31B, Nvidia Parakeet, and Qwen3TTS. The pipeline already powers 9,000 robots.

hugging face · cerebras · gemma 4 · real time voice ai · open source · inference speed

Most production voice assistants still make me wait two or three seconds at the P95—and that’s when they’re not fumbling a tool call. Hugging Face and Cerebras just posted a speech-to-speech demo that, for the first time, takes P95 latency off the table as a UX problem.

Open Stack, Closed Loops

The pipeline is a fully cascaded, modular system: Nvidia’s Parakeet for ASR → Google DeepMind’s Gemma 4 31B VLM on Cerebras → Alibaba’s Qwen3TTS for synthesis. Every component is open and replaceable. That’s not academic—it already runs on 9,000 Reachy Mini robots in the wild.
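A minimal sketch of what "cascaded and replaceable" means in practice, assuming each stage can be treated as a plain function. The class name, stage signatures, and stand-in stages below are hypothetical and do not come from the demo's repo; real stages would wrap Parakeet, Gemma 4, and Qwen3TTS.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical stage signatures (assumptions, not the demo's actual API).
Transcribe = Callable[[bytes], str]   # ASR: audio in, text out
Generate = Callable[[str], str]       # LLM: user text in, reply text out
Synthesize = Callable[[str], bytes]   # TTS: reply text in, audio out


@dataclass(frozen=True)
class CascadedPipeline:
    """Runs ASR, then the language model, then TTS, one after another."""
    asr: Transcribe
    llm: Generate
    tts: Synthesize

    def respond(self, audio: bytes) -> bytes:
        text = self.asr(audio)    # speech -> text
        reply = self.llm(text)    # text -> reply text
        return self.tts(reply)    # reply text -> speech


# Stand-in stages so the sketch runs without loading any model.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(asr=fake_asr, llm=fake_llm, tts=fake_tts)
out = pipeline.respond(b"hello")

# Swapping one stage leaves the other two untouched.
shouting = replace(pipeline, llm=str.upper)
shouted = shouting.respond(b"hello")
```

Because each stage only depends on the text or audio handed to it, replacing the language model or the synthesizer is a one-line change in this sketch.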

Cerebras solves the bottleneck that makes voice AI feel broken: the language model turn. Their hardware delivers inference fast enough that the rest of the pipeline isn’t waiting around. The blog cites median latency improvements, but the real win is stability at the tail. P95 delays that would make a conversation feel unreliable are gone.

Why This Changes the Game for Embodied AI

For a robot or voice assistant, a 300ms median with occasional 2-second spikes feels dead. Cerebras’ deterministic performance means every response arrives within a tight window. The demo uses Gemma 4 31B, not a tiny distilled model—so you get real reasoning capability without the latency tax.
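To see why the median hides the problem, here is a small sketch that compares median and P95 for two sets of response times. The samples are made up for illustration (they echo the 300ms and 2-second figures above) and are not measurements from the blog; the nearest-rank percentile used here is one common definition among several.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative latencies in milliseconds (assumed values, not real data).
spiky = [300] * 18 + [2000, 2000]    # mostly fast, occasional 2-second stalls
stable = [300] * 18 + [340, 350]     # same typical speed, tight tail

spiky_median = statistics.median(spiky)
stable_median = statistics.median(stable)
spiky_p95 = percentile(spiky, 95)
stable_p95 = percentile(stable, 95)

print(f"spiky:  median={spiky_median} ms, p95={spiky_p95} ms")
print(f"stable: median={stable_median} ms, p95={stable_p95} ms")
```

Both sets share the same median, so a median-only report would rate them equal; only the P95 shows that one of them stalls for two seconds in roughly one turn out of ten.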

The team isn’t chasing cost savings here. They’re chasing the threshold where you stop noticing the AI is a machine. That’s the only latency metric that matters for conversational AI.

Poke at the demo on Hugging Face Spaces, grab the repo, and swap in your own models. The future of voice AI won't be built behind closed APIs. It will be open, modular, and fast enough that people stop noticing the delay.

Source: Hugging Face and Cerebras bring Gemma 4 to real-time voice AI
Domain: huggingface.co

