Claude Opus 4.7 scores 47 % on ITBench‑AA’s SRE benchmark, the highest among frontier models, yet every contender falls below the 50 % threshold.
Benchmark Design and Methodology
ITBench‑AA evaluates agents on 59 SRE tasks (40 public and 19 new, held‑out cases), each presenting a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and topology. Models run inside the open‑source Stirrup harness and receive shell access to a sandboxed file system. For each task, the agent submits a JSON list of root‑cause Kubernetes entities (Deployments, Services, Pods, etc.). Scoring uses average precision at full recall: a repeat that misses any ground‑truth root cause scores zero, while a repeat that identifies all of them receives a precision‑based score, so extra entities in the list lower the result. A 100‑turn cap and three repeats per task keep the evaluation bounded.
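The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: it assumes the submitted list is treated as an unordered set, that the precision‑based score is the number of ground‑truth entities divided by the number of submitted entities, and that the task score is the mean over repeats. The function names and the entity strings are hypothetical.

```python
from statistics import mean


def score_repeat(predicted, ground_truth):
    """Score one repeat: zero unless every ground-truth entity is present.

    Assumption: once recall is complete, the score is plain set precision
    (true root causes / submitted entities). The real scorer may differ.
    """
    pred = set(predicted)
    truth = set(ground_truth)
    if not pred or not truth <= pred:
        return 0.0  # empty submission or at least one root cause missed
    return len(truth) / len(pred)


def score_task(repeats, ground_truth):
    """Average the per-repeat scores (the source reports three repeats per task)."""
    return mean(score_repeat(r, ground_truth) for r in repeats)


# Hypothetical entities, for illustration only.
truth = ["Deployment/checkout", "Service/payments"]
repeats = [
    # Exact match: full recall, no false positives.
    ["Deployment/checkout", "Service/payments"],
    # Full recall, but two extra entities halve the precision.
    ["Deployment/checkout", "Service/payments", "Pod/cart-0", "Pod/cart-1"],
    # One root cause missing: the repeat scores zero.
    ["Deployment/checkout"],
]
task_score = score_task(repeats, truth)
```

Under these assumptions, an agent that pads its answer with plausible‑looking symptoms is penalized on precision, and an agent that omits a single root cause loses the whole repeat, which is consistent with the low scores reported across models.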
Model Performance Landscape
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47 %, followed by GPT‑5.5 (xhigh) at 46 % and Qwen3.7 Max at 42 %. Open‑weight models trail: Gemma 4 31B (Reasoning) scores 37 % at $0.14 per task and GLM‑5.1 (Reasoning) scores 40 % at $1.23 per task. Gemini 3.1 Pro Preview, a proprietary model, scores 30 % at $2.23 per task. Turn counts vary nearly threefold across models, and longer trajectories do not translate to higher accuracy. For example, Gemini 3.1 Pro Preview averages 83 turns per task yet scores 30 %, while Gemma 4 31B averages 58 turns for 37 %. Models that over‑investigate tend to report upstream fault‑injection mechanisms or co‑occurring symptoms as root causes, which adds false positives and lowers precision.
Cost vs Accuracy Trade‑off
The benchmark exposes a stark cost‑accuracy divide. Claude Opus 4.7, the top performer, costs $5.38 per task. In contrast, Gemma 4 31B delivers 37 % accuracy for just $0.14 per task, outperforming Gemini 3.1 Pro Preview on both metrics. GLM‑5.1 matches Gemini 3.5 Flash’s score at a lower price. These figures illustrate that the frontier of agentic enterprise IT tasks remains expensive, and that open‑weight models can still compete on value.
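One way to make the trade‑off concrete is to check which models are Pareto‑optimal on score and cost. The sketch below uses only the four models for which this summary gives both a score and a per‑task cost; the helper names are illustrative, and the result would change if the remaining models were included.

```python
# (score in %, cost in USD per task), taken from the figures quoted in this summary.
models = {
    "Claude Opus 4.7": (47, 5.38),
    "GLM-5.1 (Reasoning)": (40, 1.23),
    "Gemma 4 31B (Reasoning)": (37, 0.14),
    "Gemini 3.1 Pro Preview": (30, 2.23),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    score_a, cost_a = a
    score_b, cost_b = b
    no_worse = score_a >= score_b and cost_a <= cost_b
    strictly_better = score_a > score_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(entries):
    """Return the names of models that no other model dominates."""
    return [
        name
        for name, point in entries.items()
        if not any(
            dominates(other_point, point)
            for other_name, other_point in entries.items()
            if other_name != name
        )
    ]


frontier = pareto_frontier(models)

# Score points per dollar, a rough value metric (an assumption, not a benchmark metric).
points_per_dollar = {name: score / cost for name, (score, cost) in models.items()}
```

On these four data points, Gemini 3.1 Pro Preview is the only model dominated by another entry, since Gemma 4 31B scores higher at a lower cost. The other three each represent a different point on the cost‑accuracy curve.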
Forward Look
ITBench‑AA’s first SRE benchmark confirms that even the most advanced frontier models struggle with real‑world Kubernetes incident diagnosis. The next iteration—expanding to FinOps and CISO tasks—will test whether these gaps close as models learn to navigate more complex enterprise domains. For now, developers and operators must weigh the high cost of top performers against the modest gains in accuracy, and consider hybrid approaches that combine lightweight agents with human oversight.
Source: ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks - by Artificial Analysis and IBM
Domain: huggingface.co