EVA-Bench 2.0 Puts Voice Agents Through 213 Enterprise Scenarios Across 121 Tools

213 evaluation scenarios across 121 tools. That's a 4x increase from the original release, and every scenario has been validated for solvability against GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6. ServiceNow-AI just dropped EVA-Bench Data 2.0, and it's the first voice-agent benchmark I've seen that treats domain specificity as a first-class problem rather than an afterthought.

Why Domain Specificity Matters

A voice agent that nails flight rebooking with alphanumeric confirmation codes can completely choke on HR policy questions. That's the insight behind expanding from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Each domain has its own vocabulary, workflow complexity, and user expectations. The Healthcare domain is grounded in real US policy, including NPI numbers, FMLA, and insurance coverage, which is the material practitioners actually deal with.

The benchmark covers 35+ distinct workflows across these domains. But raw scale isn't the point. EVA-Bench 2.0 explicitly samples three scenario types: single-intent calls, multi-intent calls with up to four intents in one conversation, and adversarial calls where callers try to bypass troubleshooting, misclassify urgency, or access unauthorized records. The scenarios also include unsatisfiable goals, where the user's request simply cannot be fulfilled. Models tend to struggle more with those than with happy-path interactions, so leaving them out would give a misleadingly rosy picture.

Reproducibility as a First-Class Design Goal

Without reproducible scenarios, a score difference tells you nothing. Is it a real capability gap or just how the conversation happened to play out? Every scenario in EVA-Bench 2.0 has exactly one correct resolution path. The user goal is structured as a decision tree, not a vague intent statement. The simulator always gets the same instructions, scenario generation eliminates cases where multiple action sequences could succeed, and authentication flows (OTP-based elevation where it would actually appear in production) are pinned to specific scenarios.
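To make the "exactly one correct resolution path" idea concrete, here is a minimal Python sketch. It is not EVA-Bench's actual data format: `GoalNode`, `resolution_paths`, `has_single_resolution`, and the rebooking example are hypothetical names invented for illustration. The sketch models a user goal as a decision tree and checks that only one sequence of branches ends in success.

```python
from dataclasses import dataclass, field


@dataclass
class GoalNode:
    """One step in a hypothetical user-goal decision tree."""
    description: str
    # Maps an agent behavior (a made-up label) to the simulated user's next step.
    branches: dict[str, "GoalNode"] = field(default_factory=dict)
    resolved: bool = False  # True only on a leaf that counts as success


def resolution_paths(node, path=()):
    """Return every branch sequence that ends at a resolved leaf."""
    if not node.branches:
        return [path] if node.resolved else []
    found = []
    for label, child in node.branches.items():
        found.extend(resolution_paths(child, path + (label,)))
    return found


def has_single_resolution(root):
    """A scenario is unambiguous here if exactly one path resolves it."""
    return len(resolution_paths(root)) == 1


# Hypothetical airline scenario, assumed purely for illustration.
rebooking_goal = GoalNode(
    "Caller wants to move a flight to the next day",
    branches={
        "agent_offers_same_fare": GoalNode("Accept the rebooking", resolved=True),
        "agent_offers_higher_fare": GoalNode(
            "Decline and ask for alternatives",
            branches={
                "agent_has_no_alternative": GoalNode("End the call unresolved"),
            },
        ),
    },
)

print(has_single_resolution(rebooking_goal))
```

A generation pipeline could run a check like this as a filter and discard any scenario where more than one path resolves, which is one plausible reading of how ambiguous cases get eliminated.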

Scenario generation uses SyGra, a graph-based synthetic data pipeline, with GPT-5.4 as the backbone. Three components (user goal, tool schema, and policy) are generated jointly to prevent the inconsistencies you get when you produce them independently. It's a concrete, reproducible process that any team building its own evaluation dataset can crib from.
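The kind of inconsistency that joint generation is meant to prevent can be illustrated with a small cross-check. The Python sketch below is an assumption-laden illustration, not SyGra's API or the benchmark's schema: `find_inconsistencies`, the dictionary layouts, and the ITSM tool names are all hypothetical. It flags goals that call tools the schema does not define, calls missing required arguments, and policy rules that refer to unknown tools.

```python
def find_inconsistencies(expected_calls, tool_schema, policy_rules):
    """List mismatches between a scenario's goal, tools, and policy.

    expected_calls: list of (tool_name, {arg: value}) the goal requires.
    tool_schema:    {tool_name: set of required argument names}.
    policy_rules:   {tool_name: short rule text restricting that tool}.
    All three layouts are assumptions made for this sketch.
    """
    problems = []
    for tool, args in expected_calls:
        if tool not in tool_schema:
            problems.append(f"goal calls unknown tool: {tool}")
            continue
        missing = tool_schema[tool] - set(args)
        if missing:
            problems.append(f"{tool} is missing arguments: {sorted(missing)}")
    for tool in policy_rules:
        if tool not in tool_schema:
            problems.append(f"policy mentions unknown tool: {tool}")
    return problems


# Hypothetical ITSM scenario pieces, invented for illustration.
tool_schema = {
    "reset_password": {"employee_id", "otp_code"},
    "create_incident": {"employee_id", "summary"},
}
policy_rules = {"reset_password": "Requires OTP verification first"}

consistent_calls = [("reset_password", {"employee_id": "E1042", "otp_code": "000000"})]
drifted_calls = [("unlock_account", {"employee_id": "E1042"})]

print(find_inconsistencies(consistent_calls, tool_schema, policy_rules))
print(find_inconsistencies(drifted_calls, tool_schema, policy_rules))
```

When the three pieces are produced independently, a mismatch like `drifted_calls` is easy to introduce. Generating them together, or validating them together as above, keeps the goal, the tools, and the policy aligned.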

What This Enables Next

If you're evaluating a voice agent, EVA-Bench 2.0 gives you 213 realistic, validated scenarios with no ambiguity about the correct outcome. If you're building your own benchmark, the team published enough detail on their generation process to serve as a practical reference. They also preview a multilingual extension, which would widen the benchmark beyond English-only enterprise deployments. That's the kind of incremental, specific progress that actually raises the bar for evaluation.

Source: EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
Domain: huggingface.co
