The best AI-generated code detector tested scores just 0.56 F1 when asked to tag individual lines as human- or AI-written. A coin flip on a balanced binary task would earn an F1 of about 0.5, so that is barely above random guessing. The benchmark exposing this failure is HybridCodeAuthorship, released by researchers who built a real-world-style dataset of interleaved human and AI lines from GitHub code.
Why the Old Benchmarks Are Useless for Real Codebases
Every prior benchmark for AI-code detection used whole-file or whole-snippet labels: either the code is all human or all AI. That’s not how developers use Copilot or similar assistants. In practice, a developer writes the loop structure, asks the LLM to fill in a function body, then edits the result. The final file is a patchwork.
HybridCodeAuthorship mimics this pattern by constructing Python files from CodeSearchNet, a large corpus of functions drawn from open-source GitHub repositories, and replacing specific lines with code generated by current LLMs. The result is line-level ground truth across thousands of files, each with a different proportion of AI and human contributions.
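To make "line-level ground truth" concrete, here is a minimal sketch of how one hybrid file with per-line labels could be assembled. It is an illustration only, not the authors' pipeline: the `splice_ai_lines` helper, the `LabeledFile` type, and the 0/1 label encoding are all hypothetical, and the "generated" line is a hand-written stand-in for an LLM completion.

```python
from dataclasses import dataclass
from typing import List

HUMAN, AI = 0, 1  # hypothetical label encoding, not taken from the paper


@dataclass
class LabeledFile:
    lines: List[str]
    labels: List[int]  # one authorship label per line


def splice_ai_lines(human_lines: List[str], start: int, end: int,
                    ai_lines: List[str]) -> LabeledFile:
    """Replace human_lines[start:end] with ai_lines, tracking authorship per line."""
    if not 0 <= start <= end <= len(human_lines):
        raise ValueError("span out of range")
    lines = human_lines[:start] + ai_lines + human_lines[end:]
    labels = ([HUMAN] * start
              + [AI] * len(ai_lines)
              + [HUMAN] * (len(human_lines) - end))
    return LabeledFile(lines, labels)


# A human-written function whose body gets swapped for a "generated" one.
human = [
    "def mean(xs):",
    "    total = 0",
    "    for x in xs:",
    "        total += x",
    "    return total / len(xs)",
]
generated = ["    return sum(xs) / len(xs)"]  # stand-in for an LLM completion

hybrid = splice_ai_lines(human, 1, 5, generated)
for label, line in zip(hybrid.labels, hybrid.lines):
    print("AI   " if label == AI else "HUMAN", line)
```

The point of the design is that the label list always has the same length as the line list, so a detector can be scored on every line rather than on the file as a whole.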
HybridCodeAuthorship Puts a Stake in the Ground — Barely
The authors benchmarked two state-of-the-art detectors on their dataset. The best performer, AIGCode Detector, achieved an F1 score of just 0.48 at the chunk level and 0.56 at the line level. Those numbers tell you that even the strongest detectors tested can’t reliably distinguish a human-written line from an AI-written one when the two appear mixed together. For any organization trying to audit code for compliance, productivity analysis, or security risk, that’s a gap you can drive a truck through.
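For readers who want to see what a line-level F1 measures, the sketch below computes it from per-line labels and compares a detector against a trivial "flag everything as AI" baseline. It assumes AI-written lines are the positive class and uses made-up labels; the article does not say which averaging scheme the paper uses, so treat this as one plausible reading rather than the benchmark's scoring code.

```python
from typing import Sequence


def line_level_f1(y_true: Sequence[int], y_pred: Sequence[int],
                  positive: int = 1) -> float:
    """F1 for the positive class, computed over individual lines."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for an 8-line file: 1 = AI-written, 0 = human-written.
y_true = [0, 1, 1, 0, 1, 0, 0, 0]
y_pred = [0, 1, 0, 1, 1, 0, 0, 0]   # a hypothetical detector's output
always_ai = [1] * len(y_true)        # trivial baseline: flag every line

detector_f1 = line_level_f1(y_true, y_pred)
baseline_f1 = line_level_f1(y_true, always_ai)
print(f"detector F1: {detector_f1:.3f}")
print(f"always-AI baseline F1: {baseline_f1:.3f}")
```

Comparing against a trivial baseline matters because F1 depends on how many lines are AI-written: the more AI lines a file contains, the higher a flag-everything strategy scores.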
What this means next: if you’re relying on AI code detection for risk management, you’re flying blind on hybrid codebases. HybridCodeAuthorship is now the benchmark to beat, and the room for improvement is enormous: from 0.56 to something useful.
Source: HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection
Domain: arxiv.org