ToolSense Audit Reveals LLM Tool Retrieval Collapses by Up to 64 Points on Realistic Queries

Parametric tool retrieval can collapse by 50 to 64 percentage points on ambiguous queries, falling below a basic embedding baseline. That's what ToolSense found when it evaluated five model configurations trained on ToolBench's ~47k tools.

Knowledge-Retrieval Dissociation: The Core Finding

The authors of ToolSense built a diagnostic framework that takes any tool catalog and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. When they applied it to the ToolBench catalog and evaluated five parametric model training configurations, the results told a different story from the standard ToolBench benchmarks. On the RRB, several configurations collapsed by roughly 50 to 64 percentage points compared to fully-specified ToolBench queries, landing below the embedding-model baseline. That's not a minor degradation – that's the parametric approach failing where it was supposed to excel.
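The drop above is reported in percentage points, which is not the same as a relative percentage decline. A minimal sketch of the difference follows; the helper names and the accuracy values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def point_drop(baseline_acc: float, realistic_acc: float) -> float:
    """Absolute drop in percentage points between two accuracies given in percent."""
    return baseline_acc - realistic_acc


def relative_drop(baseline_acc: float, realistic_acc: float) -> float:
    """Relative drop, expressed as a percentage of the baseline accuracy."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (baseline_acc - realistic_acc) / baseline_acc


# Hypothetical numbers for illustration only, NOT results from the paper.
baseline = 80.0   # accuracy (%) on fully-specified queries
realistic = 20.0  # accuracy (%) on ambiguous queries

points = point_drop(baseline, realistic)        # 60.0 percentage points
relative = relative_drop(baseline, realistic)   # 75.0 percent of the baseline
```

With these made-up inputs, a 60-point drop corresponds to losing three quarters of the original accuracy, which is why the unit matters when reading the headline figure.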

How ToolSense Diagnoses Real-World Tool Understanding

ToolSense's RRB uses queries at three ambiguity tiers to simulate the messy, underspecified requests a deployed agent actually receives. Standard ToolBench benchmarks rely on verbose, fully-specified queries and constrained decoding that restricts outputs to valid token paths – a setup that masks whether the model understands the tools or just pattern-matches. The probing benchmarks (MCQ and QA) go further: they test factual knowledge about tool semantics directly. Some models that performed strongly on retrieval scored near-random on these probes. That dissociation – strong retrieval without factual understanding – is exactly the kind of failure that bites you in production.
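To see why constrained decoding can flatter a model, here is a minimal sketch of one common way to implement it: a prefix trie over the token sequences of valid tool identifiers. This is an assumption about the general technique, not the ToolBench or ToolSense implementation; the token IDs, tool names, `END` sentinel, and scoring function are all hypothetical.

```python
END = -1  # hypothetical sentinel marking the end of a valid tool identifier


def build_trie(tool_token_ids):
    """Build a nested-dict prefix trie from token-ID sequences of valid tools."""
    root = {}
    for seq in tool_token_ids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def allowed_next(trie, prefix):
    """Return the tokens that keep the prefix on a valid path (empty if off-path)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy(trie, score_fn, max_len=16):
    """Greedy decoding that only ever considers tokens allowed by the trie."""
    prefix = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        best = max(options, key=lambda t: score_fn(prefix, t))
        if best == END:
            break
        prefix.append(best)
    return prefix


# Toy catalog: two hypothetical tools encoded as token-ID sequences.
tools = {"weather": [1, 2], "stocks": [1, 3, 4]}
trie = build_trie(tools.values())

# Toy "model" that most prefers token 9, which belongs to no valid tool.
toy_scores = {9: 5.0, 3: 2.0, 4: 1.5, 1: 1.0, 2: 0.5, END: 0.0}


def toy_score(prefix, tok):
    return toy_scores.get(tok, 0.0)


decoded = constrained_greedy(trie, toy_score)  # [1, 3, 4], i.e. "stocks"
```

The toy model's favorite token is invalid, yet the decoder still emits a well-formed tool identifier because invalid tokens are never candidates. That is the masking effect described above: a valid output says little about whether the model knows what the tool does.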

What This Means for LLM Agent Deployments

If your agent's tool retrieval is parametric and you're relying on ToolBench-style benchmarks for validation, you may be in for a rude awakening. The 50 to 64 point drop on realistic queries is not an edge case; several of the training configurations tested showed it. And near-random scores on the factual probes suggest these models hold little usable knowledge of what each tool does, and are mostly good at reproducing retrieval training signals. The ToolSense framework is open-source at github.com/SAP/toolsense, so any team building on large tool catalogs can run its own diagnostics before shipping. If you're deploying an agent over thousands of tools, benchmark with ToolSense before trusting parametric retrieval in the wild.
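When running your own probes, it helps to check whether an MCQ score is actually distinguishable from guessing. The sketch below uses an exact one-sided binomial tail; it is a generic statistical check, not part of ToolSense, and the function names, the four-option format, and the example counts are assumptions made for illustration.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy from uniform random guessing on an MCQ."""
    return 1.0 / num_options


def p_value_above_chance(correct: int, total: int, num_options: int) -> float:
    """Exact one-sided binomial tail: P(X >= correct) under random guessing."""
    p = chance_accuracy(num_options)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts for illustration only, NOT results from the paper.
near_chance = p_value_above_chance(correct=27, total=100, num_options=4)
clearly_better = p_value_above_chance(correct=60, total=100, num_options=4)

# A large p-value means the score is consistent with guessing.
looks_random = near_chance > 0.05
looks_informed = clearly_better < 0.05
```

Under these assumptions, 27 correct out of 100 four-option questions is consistent with guessing, while 60 out of 100 is not. The 0.05 threshold is a conventional choice, not something the source prescribes.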

Source: ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Domain: arxiv.org
