
Snorkel AI's Senior SWE-Bench Judges Agents on Taste, Not Just Tests

senior-swe-bench.snorkel.ai · @systems_wire · 2 hours ago · Artificial Intelligence · 2 comments

Agents are evaluated on realistic feature requests and runtime bug investigations, with a validation agent generating behavioral tests and a taste scoring system that penalizes sloppy code.

Tags: snorkel ai, senior swe bench, ai agents, software engineering, benchmark, large language models

Snorkel AI’s Senior SWE-Bench throws out over-specified requirements in favor of natural-language messages that read like something a product manager would actually write. Instead of a detailed instruction naming code symbols and expected function signatures, agents get a task like “Add Google Books as a metadata source to BookWorm for fallback/staging imports”, the same kind of request a senior engineer would receive on a real team.

What Makes a Senior Engineer? Realistic Tasks and Runtime Investigation

Senior SWE-Bench splits tasks into two categories: feature tasks and bug tasks. Feature tasks mirror the ambiguity of real-world feature requests: no one tells the agent exactly which files to touch or which API endpoints to call. The validation agent, a separate LLM-based system, automatically generates behavioral tests that adapt to whatever solution the agent submits. If the agent picks a different import strategy or refactors a module, the tests adjust accordingly, so a tasteful redesign is not penalized.
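To make the idea of a behavioral test concrete, here is a minimal sketch of a check that passes for two differently structured solutions because it asserts only on observable results. It is not the benchmark's actual validation agent: every name here (solution_a, solution_b, MetadataSource, behavioral_check) and the sample record are hypothetical.

```python
from typing import Callable, Dict, Optional

Record = Optional[Dict[str, str]]

# Made-up sample data for illustration only.
SAMPLE = {"1111111111111": "Example Title"}


def solution_a(isbn: str) -> Record:
    """Hypothetical solution 1: a plain function over a dict."""
    title = SAMPLE.get(isbn)
    return {"isbn": isbn, "title": title} if title else None


class MetadataSource:
    """Hypothetical solution 2: the same behavior behind a class."""

    def __init__(self, records: Dict[str, str]) -> None:
        self._records = records

    def fetch(self, isbn: str) -> Record:
        if isbn not in self._records:
            return None
        return {"isbn": isbn, "title": self._records[isbn]}


def solution_b(isbn: str) -> Record:
    return MetadataSource(SAMPLE).fetch(isbn)


def behavioral_check(lookup: Callable[[str], Record]) -> bool:
    """Assert only on observable results, never on internal names or structure."""
    known = lookup("1111111111111")
    unknown = lookup("2222222222222")
    return known is not None and known.get("title") == "Example Title" and unknown is None
```

Because the check never names a class, module, or helper, both designs satisfy it, while a lookup that finds nothing for a known ISBN does not.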

Bug tasks are sourced from PRs that required significant runtime investigation: digging through logs, profiling data, reproduction steps. An agent can’t just pattern-match; it has to start services, reproduce the issue, and diagnose subtle runtime problems. That’s a far cry from static bug-fixing benchmarks where the bug is isolated in a single function.

Taste Scoring and Codebase Practices: The Real Innovation

Agents that pass the tests still face a quality gate. Senior SWE-Bench combines runtime correctness with several “taste” metrics derived from observed codebase practices. If the code works but ignores load-bearing conventions (say, modifying STAGED_SOURCES in a way that breaks the import pipeline), the taste score drops. This isn’t about linting; it’s about shipping code that fits the ecosystem without being told.

The validation agent also checks that the solution respects practices the instruction never spells out. In the Google Books example, the requirements include that stage_bookworm_metadata must replace any direct Amazon-only logic, and that the affiliate server handler must fall back to Google Books only when both high_priority=true and stage_import=true are set. An agent that skips those checks might pass the tests but flunk on taste.
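The fallback condition described above can be sketched as a small gate function. This is an illustration under stated assumptions, not Open Library's handler: LookupRequest and should_fall_back_to_google_books are hypothetical names, and treating a missing Amazon result as the trigger is an assumption. Only the two flags, high_priority and stage_import, come from the task description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LookupRequest:
    """Hypothetical request shape; only the two flag names come from the article."""
    isbn: str
    high_priority: bool = False
    stage_import: bool = False


def should_fall_back_to_google_books(
    req: LookupRequest, amazon_result: Optional[dict]
) -> bool:
    """Fall back only when Amazon returned nothing AND both flags are set."""
    if amazon_result is not None:
        # Assumption: a successful Amazon lookup always wins.
        return False
    return req.high_priority and req.stage_import
```

Keeping the gate in one pure function makes the "both flags" rule easy to test in isolation, which is the kind of edge a behavioral test would probe.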

Example: Integrating Google Books into Open Library’s BookWorm

The published example task is a real Open Library feature: adding Google Books as a fallback metadata provider when Amazon lookups fail or only an ISBN-13 is available. The full specification runs to thousands of characters, while the instruction the agent actually sees is a short paragraph. Agents must stage metadata, handle ambiguous ISBN queries, and avoid introducing unreliable data when Google Books returns multiple results. The validation agent writes behavioral tests for edge cases like missing fields or multiple matches.
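One conservative way to handle the multiple-results case is to accept a match only when exactly one result comes back and it carries a title. The sketch below assumes each result is a dict with a volumeInfo object holding title, authors, and publishedDate. The function name pick_unambiguous_match and the output keys are hypothetical, not taken from the benchmark or from Open Library.

```python
from typing import Dict, List, Optional


def pick_unambiguous_match(items: List[dict]) -> Optional[Dict[str, object]]:
    """Return staged metadata only for a single, sufficiently complete result."""
    if len(items) != 1:
        # Zero results: nothing to stage. Several results: ambiguous, so skip.
        return None
    volume = items[0].get("volumeInfo", {})
    title = volume.get("title")
    if not title:
        # Missing required field: treat the record as unreliable.
        return None
    return {
        "title": title,
        "authors": volume.get("authors", []),
        "publish_date": volume.get("publishedDate"),
    }
```

Returning None instead of guessing among candidates trades coverage for reliability, which matches the stated goal of not introducing unreliable data.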

Senior SWE-Bench is open-source and available now. Expect it to shift the conversation from “can an agent fix a bug” to “can an agent ship a feature that a senior engineer would approve without rewriting.”


Source: Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
Domain: senior-swe-bench.snorkel.ai
