ServiceNow Benchmarks 7 ASR Models on Code-Switched Speech: Findings Are Stark

ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal 3-Pro top the charts, but performance varies wildly by language pair and metric.

Over half the world speaks more than one language, yet most ASR benchmarks ignore code-switching, the natural mid-sentence blending of two languages. ServiceNow AI researchers just published the first serious enterprise-grade benchmark for this, and the results show that picking the wrong ASR model can cost you real accuracy with bilingual customers.

259 Spanish-English Utterances and 7 ASR Models

The team built a dataset from internal IT and HR interactions, covering four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. Each utterance runs 12–40 words with at least three switchable content words. They used GPT-5 to generate natural code-switched text, then ElevenLabs Multilingual V2 for speech synthesis, with native-speaker linguists reviewing everything. The final dataset: 259 Spanish-English, 298 French-English, 188 Canadian French-English, and 173 German-English utterances, 918 in total. This isn't a toy: it's grounded in real contact-center scenarios like password resets and benefits inquiries.
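To make the dataset composition concrete, here is a minimal Python sketch that tallies the reported utterance counts and checks the stated 12–40 word length bound. The function names and the sample utterance are hypothetical illustrations, not part of the ServiceNow release, and whitespace tokenization is an assumption about how words are counted. The "switchable content words" criterion is not checked because the article does not define it.

```python
# Utterance counts per language pair, as reported in the article.
UTTERANCE_COUNTS = {
    "Spanish-English": 259,
    "French-English": 298,
    "Canadian French-English": 188,
    "German-English": 173,
}

# Length bounds stated in the article (words per utterance).
MIN_WORDS, MAX_WORDS = 12, 40


def within_length_bounds(utterance: str) -> bool:
    """Return True if the whitespace-tokenized word count is in [12, 40].

    Whitespace splitting is an assumption; the benchmark may count words differently.
    """
    return MIN_WORDS <= len(utterance.split()) <= MAX_WORDS


def pair_shares(counts: dict) -> dict:
    """Fraction of the full dataset contributed by each language pair."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


total_utterances = sum(UTTERANCE_COUNTS.values())

# Invented example for illustration only; it is not drawn from the dataset.
sample = (
    "Hola, necesito ayuda to reset my password "
    "porque mi cuenta está locked desde ayer"
)

if __name__ == "__main__":
    print(f"Total utterances: {total_utterances}")
    for pair, share in pair_shares(UTTERANCE_COUNTS).items():
        print(f"{pair}: {share:.1%}")
    print("Sample within length bounds:", within_length_bounds(sample))
```

The shares show that French-English is the largest slice and German-English the smallest, which is worth keeping in mind when comparing per-pair scores built on different sample sizes.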

Top Models: ElevenLabs Scribe V2, Gemini 3 Flash, AssemblyAI Universal 3-Pro

Across the board, three models consistently lead: ElevenLabs Scribe V2, Google's Gemini 3 Flash, and AssemblyAI's Universal 3-Pro. But the headline masks a critical nuance: performance depends heavily on the language pair. A model that nails Spanish-English might stumble on German-English. Deepgram's Nova 3 Multilang and OpenAI's Whisper Large V3 Turbo fall in the middle tier. Mistral's Voxtral Small and Nvidia's Parakeet TDT 0.6b V3 trail behind, especially on semantic metrics.

Why Semantic WER and Answer Error Rate Matter More Than Raw WER

Standard Word Error Rate is a blunt instrument: it can't tell a harmless typo from a catastrophic wrong word. That's why ServiceNow added Semantic WER (using Gemma-4-31B as the judge) and Answer Error Rate (AER), which measures whether an LLM reading the ASR transcript can answer downstream comprehension questions correctly. The gap between WER and AER in some models is alarming: a model may have decent WER but fail to preserve enough meaning to answer a simple question about the utterance. For enterprise voice agents, that's a failed interaction.
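A small Python sketch can illustrate why raw WER is blunt. It implements the standard word-level edit-distance definition of WER and a simplified AER, taken here as the fraction of comprehension questions answered incorrectly. This is an assumption-laden illustration rather than the benchmark's implementation: the text normalization (lowercasing, whitespace splitting), the function names, and the example transcripts are all hypothetical, and the LLM judging step behind Semantic WER and AER is not reproduced.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def answer_error_rate(is_correct: list) -> float:
    """Fraction of comprehension questions answered incorrectly.

    `is_correct` holds one boolean per question; how each answer is graded
    (for example, by an LLM judge) is outside this sketch.
    """
    if not is_correct:
        raise ValueError("need at least one graded answer")
    return sum(1 for ok in is_correct if not ok) / len(is_correct)


# Invented transcripts: one harmless misspelling, one meaning-changing word.
reference = "please reset my password for the payroll portal"
harmless = "please reset my pasword for the payroll portal"
harmful = "please reset my password for the patrol portal"

if __name__ == "__main__":
    print(word_error_rate(reference, harmless))
    print(word_error_rate(reference, harmful))
    print(answer_error_rate([True, False, True, True]))
```

Both hypotheses differ from the reference by a single substituted word, so they receive the same WER even though only the second would send a voice agent to the wrong system. Metrics that judge meaning or downstream answers are meant to separate those two cases.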

The team released everything publicly through their AU-Harness evaluation framework. If you're building a voice agent for a bilingual customer base, don't assume one ASR model fits all: run your own language pair through this harness first.

Source: Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Domain: huggingface.co
