The authors design controllable environments, inspired by practical embodied AI scenarios, to quantify exploration and exploitation errors in language model agents. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG), and map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, they design a metric that quantifies exploration and exploitation errors directly from an agent's actions. Evaluating a variety of frontier LM agents, they find that even state-of-the-art models struggle on the task, with different models exhibiting distinct failure modes. They further observe that reasoning models solve the task more effectively, and that both exploration and exploitation can be significantly improved through minimal harness engineering. The authors release their code as a resource for the AI research community.
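To make the idea of policy-agnostic error counting concrete, here is a minimal illustrative sketch. It is not the paper's actual metric: the action format, the revisit-based notion of an exploration error, and the prerequisite-based notion of an exploitation error are all assumptions made for this toy example.

```python
def count_errors(trace, prerequisites):
    """Toy, policy-agnostic error counter (illustrative assumptions only).

    trace: list of ("move", cell) or ("task", node) actions.
    prerequisites: maps each task-DAG node to the set of nodes it depends on.

    An action counts as an exploration error if it moves to an
    already-observed cell, and as an exploitation error if it attempts
    a task node whose DAG prerequisites are not yet completed.
    """
    seen = set()          # grid cells observed so far
    completed = set()     # task-DAG nodes finished so far
    explore_err = exploit_err = 0
    for kind, payload in trace:
        if kind == "move":
            if payload in seen:
                explore_err += 1   # revisited a known cell
            seen.add(payload)
        elif kind == "task":
            if prerequisites.get(payload, set()) <= completed:
                completed.add(payload)
            else:
                exploit_err += 1   # a DAG prerequisite is unmet
    return explore_err, exploit_err


trace = [("move", (0, 0)), ("move", (0, 1)), ("move", (0, 0)),
         ("task", "pick"), ("task", "place")]
prereq = {"pick": set(), "place": {"pick"}}
print(count_errors(trace, prereq))  # → (1, 0): one revisit, no DAG violations
```

Because the counter inspects only the recorded actions and the ground-truth DAG, it applies unchanged to any agent policy, which is the sense in which such a metric is policy-agnostic.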
Source: Exploration and Exploitation Errors Are Measurable for Language Model Agents