4,300 Coding Agent Sessions Expose Where LLM Serving Falls Short

430,000 tool calls across 4,300 coding-agent sessions, all from actual daily use of Claude Code and Codex, not some synthetic benchmark. That's the TraceLab dataset released by University of Washington researchers, and it's the first real look at what serving infrastructure has to handle when agents go autonomous.

Long loops, short outputs, and the KV-cache problem

Coding agents don't chat. They run long autonomous loops (think 80+ steps without human interruption) with contexts that stretch to tens of thousands of tokens, while individual outputs are often under 50 tokens. That's a brutal pattern for current LLM serving systems, which optimize for balanced prompt-to-generation ratios. The TraceLab analysis shows prefix cache hit rates of 70-80%, but the remaining misses cluster around tool call boundaries and agent state resets, so naive caching strategies waste capacity at exactly those points.
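To make the hit-rate figure concrete, here is a minimal sketch of how a prefix cache hit rate could be computed for one session. It assumes a simplified cache that remembers only the previous prompt of the session. The function names and the toy token IDs are illustrative and do not come from the TraceLab code.

```python
from typing import List, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(prompts: List[Sequence[int]]) -> float:
    """Fraction of prompt tokens reusable from the previous prompt's prefix.

    Assumption: the cache holds only the most recent prompt of the session.
    """
    cached: Sequence[int] = []
    hit = 0
    total = 0
    for prompt in prompts:
        hit += common_prefix_len(cached, prompt)
        total += len(prompt)
        cached = prompt
    return hit / total if total else 0.0


# Toy session: step 2 appends to step 1, step 3 rewrites most of the context
# (a stand-in for an agent state reset).
session = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2, 9, 9]]
print(f"{prefix_hit_rate(session):.2f}")  # 6 of 14 prompt tokens reused -> 0.43
```

In this toy session the append-only step reuses its whole previous prompt, while the rewritten context in the last step loses most of it, which is the kind of miss the paragraph above attributes to state resets.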

Tool calls are anything but uniform

Some tools get called thousands of times (file read, grep, lint); others appear once. That heavy-tailed distribution defeats one-size-fits-all batching and latency prediction. The researchers document that a small set of tools, roughly 20 types, accounts for 90% of calls, but the tail includes shell commands, git operations, and API interactions that take orders of magnitude longer. If you're serving agents, you need semantic-aware tool-latency prediction, not static timeouts.
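As a rough illustration of the difference from a static timeout, the sketch below keeps a latency history per tool name and predicts a quantile from it. This is a deliberately simple stand-in: the paper's semantic-aware prediction may use much richer signals than the tool name, and the class name, default fallback, and sample values here are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


class ToolLatencyPredictor:
    """Per-tool latency quantiles from observed durations (illustrative only)."""

    def __init__(self, default_s: float = 30.0) -> None:
        self.default_s = default_s  # fallback for tools with no history
        self.samples: Dict[str, List[float]] = defaultdict(list)

    def observe(self, tool: str, seconds: float) -> None:
        """Record one completed call of `tool` that took `seconds`."""
        self.samples[tool].append(seconds)

    def predict(self, tool: str, q: float = 0.95) -> float:
        """Nearest-rank q-quantile of past latencies, or the default if unseen."""
        data = sorted(self.samples.get(tool, []))
        if not data:
            return self.default_s
        idx = max(0, math.ceil(q * len(data)) - 1)
        return data[idx]


predictor = ToolLatencyPredictor()
for seconds in (0.01, 0.02, 0.03, 0.04):
    predictor.observe("grep", seconds)
for seconds in (2.0, 45.0, 120.0):
    predictor.observe("shell", seconds)

print(predictor.predict("grep"))   # fast, frequently called tool
print(predictor.predict("shell"))  # slow tail tool
```

A scheduler could use such an estimate to decide how long a session is likely to stay blocked on a tool call, instead of applying one timeout to grep and a long shell command alike.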

Four concrete optimization targets

TraceLab doesn't just wave hands. The paper calls out four specific serving improvements: lower-overhead tool calling (reduce the per-invocation token cost), append-length-aware prefill (allocate compute proportional to how much context actually changed), semantic-aware tool-latency prediction (learn which tool types typically block), and improved KV-cache management around human-paced gaps (those moments when the developer reviews code before letting the agent continue). Each of these maps directly to a measurable inefficiency in the trace data.
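The last of those four targets, KV-cache management around human-paced gaps, can be pictured as a keep, offload, or evict decision made when a session goes idle. The sketch below is a made-up heuristic under stated assumptions, not the paper's method: the thresholds, the prefill throughput parameter, and the three action names are all hypothetical.

```python
def kv_action(
    context_tokens: int,
    expected_gap_s: float,
    prefill_tokens_per_s: float,
    keep_gap_s: float = 5.0,        # hypothetical: gaps this short are not worth moving data
    min_recompute_s: float = 1.0,   # hypothetical: cheaper recomputes are simply evicted
) -> str:
    """Decide what to do with a session's KV cache while it waits.

    Returns "keep" (leave in accelerator memory), "offload" (move to a
    cheaper tier such as host memory), or "evict" (drop and recompute later).
    """
    if expected_gap_s <= keep_gap_s:
        return "keep"
    # Time it would take to rebuild the cache from scratch by re-running prefill.
    recompute_s = context_tokens / prefill_tokens_per_s
    if recompute_s >= min_recompute_s:
        return "offload"
    return "evict"


# A short tool call, a long human review of a large context, and a long
# pause on a small context, all at an assumed 10,000 prefill tokens/s.
print(kv_action(40_000, 0.5, 10_000))   # keep
print(kv_action(40_000, 60.0, 10_000))  # offload
print(kv_action(2_000, 60.0, 10_000))   # evict
```

The expected gap would have to come from somewhere, for instance a latency estimate per tool or a per-user review-time history, and that estimate is where the trace data becomes useful.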

Open data, immediate utility

The team published the full trace collection pipeline and analysis code on GitHub. Any serving-system engineer can run their own optimizations against these 4,300 sessions to benchmark improvements. That's the standard we need before anyone claims a new scheduler or cache eviction policy matters: real workloads, not cherry-picked examples. TraceLab supplies that baseline, and the next generation of agent serving should be measured against it.

Source: TraceLab: Characterizing Coding Agent Workloads for LLM Serving
Domain: arxiv.org