CUGA topped AppWorld from July 2025 to February 2026 and WebArena from February to September 2025, and its hosted example apps run on gpt-oss-120b, a smaller open-weight model, rather than a frontier API. Those results come from the harness, not model size: planning, reflection, and variable tracking keep a long-horizon agent on course where a raw model would drift.
What the Harness Saves You From Writing
Most agent projects spend a week wiring model clients, tool adapters, state management, and UI streaming before the agent does anything useful. CUGA inverts that. You pip install cuga, build a CugaAgent with a tool list and a system prompt, and await agent.invoke(...). Everything below that call is the harness: interchangeable tool bindings (OpenAPI, MCP, LangChain), declarative guardrails, multi-agent delegation over A2A, Docling-powered RAG, and one-env-var provider switching across OpenAI, watsonx, Ollama, and more.
The harness plans before acting, then executes with a mix of tool calls and generated code (CodeAct). On a twenty-step task it tracks intermediate results and runs a reflection step that catches bad calls and re-plans instead of barreling ahead. That machinery, not a hand-tuned prompt, is why CUGA topped those benchmarks.
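The plan, execute, reflect cycle described above can be sketched in plain Python. This is a toy illustration, not CUGA's implementation: Step, resolve, run, and the reflect and replan callbacks are hypothetical names, and in a real harness the planning and reflection steps would be model calls rather than hand-written functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    tool: str               # name of the tool to call
    args: Dict[str, Any]    # literal values, or "$name" references to earlier results
    save_as: str            # variable name under which the result is tracked


def resolve(args: Dict[str, Any], variables: Dict[str, Any]) -> Dict[str, Any]:
    """Replace "$name" references with tracked intermediate results."""
    return {
        key: variables[value[1:]] if isinstance(value, str) and value.startswith("$") else value
        for key, value in args.items()
    }


def run(plan: List[Step],
        tools: Dict[str, Callable[..., Any]],
        reflect: Callable[[Step, Any], bool],
        replan: Callable[[Step, Any, Dict[str, Any]], List[Step]],
        max_replans: int = 2) -> Dict[str, Any]:
    """Execute a plan step by step, re-planning when reflection rejects a result."""
    variables: Dict[str, Any] = {}
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        result = tools[step.tool](**resolve(step.args, variables))
        if reflect(step, result):
            variables[step.save_as] = result      # accept and track the result
        elif replans < max_replans:
            replans += 1
            # Put the replacement steps ahead of the remaining plan.
            queue = replan(step, result, variables) + queue
        else:
            raise RuntimeError(f"step {step.tool!r} still failing after {max_replans} replans")
    return variables


# Toy data and tools, invented for this example.
CATALOG = {"database": ["service-a", "service-b"], "databse": []}
tools = {
    "search": lambda query: CATALOG.get(query, []),
    "count": lambda items: len(items),
}

# The first step contains a typo, so the search comes back empty.
plan = [
    Step("search", {"query": "databse"}, "hits"),
    Step("count", {"items": "$hits"}, "n"),
]


def reflect(step: Step, result: Any) -> bool:
    # Reject empty search results; accept everything else.
    return not (step.tool == "search" and result == [])


def replan(step: Step, result: Any, variables: Dict[str, Any]) -> List[Step]:
    # Retry the failed search with a corrected query.
    return [Step("search", {"query": "database"}, step.save_as)]


state = run(plan, tools, reflect, replan)
print(state)  # {'hits': ['service-a', 'service-b'], 'n': 2}
```

The second step never sees the empty result: reflection catches it, the replacement step fills the hits variable, and the count step then resolves $hits against the corrected value.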
One App, One File, Four Arguments
IBM Research published two dozen single-file FastAPI apps, each wrapping one CugaAgent. The IBM Cloud advisor agent fits in a single main.py. Its agent factory passes four arguments to CugaAgent:
```python
import os

# _DIR, _SYSTEM, and _make_tools() are defined elsewhere in main.py.
def make_agent():
    from cuga import CugaAgent
    from _llm import create_llm

    return CugaAgent(
        model=create_llm(provider=os.getenv("LLM_PROVIDER"), model=os.getenv("LLM_MODEL")),
        tools=_make_tools(),
        special_instructions=_SYSTEM,
        cuga_folder=str(_DIR / ".cuga"),
    )
```
tools and special_instructions carry the actual app logic. _make_tools() mixes a local function (which searches the IBM Cloud Catalog via its API) with MCP servers for generic web search. Nothing in the app code knows which model sits behind it. The cuga_folder holds state and policy files, so the same agent runs governed in production without a rewrite.
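How an app can stay ignorant of its model is easiest to see in the factory that reads the environment. The sketch below is an assumption about what a helper like create_llm might do, not the contents of the real _llm module: LLMConfig, SUPPORTED, and llm_from_env are hypothetical, and the function returns a plain config object where a real helper would return a model client.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    """Stand-in for a model client; a real factory would return one."""
    provider: str
    model: str


# Providers named in the article; the set itself is an assumption of this sketch.
SUPPORTED = {"openai", "watsonx", "ollama"}


def create_llm(provider: Optional[str], model: Optional[str]) -> LLMConfig:
    """Validate the provider/model pair and return a provider-neutral handle."""
    if not provider or not model:
        raise ValueError("LLM_PROVIDER and LLM_MODEL must both be set")
    name = provider.strip().lower()
    if name not in SUPPORTED:
        raise ValueError(f"unsupported provider: {provider!r}")
    return LLMConfig(provider=name, model=model)


def llm_from_env(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Build the handle from environment variables, as make_agent() does."""
    return create_llm(env.get("LLM_PROVIDER"), env.get("LLM_MODEL"))


# Passing a dict instead of os.environ keeps the example reproducible.
llm = llm_from_env({"LLM_PROVIDER": "Ollama", "LLM_MODEL": "gpt-oss-120b"})
print(llm)  # LLMConfig(provider='ollama', model='gpt-oss-120b')
```

Because the caller only ever sees the returned handle, switching providers means changing an environment variable, and the code that builds the agent stays untouched.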
Smaller Models, Same Benchmarks
CUGA's reflection step and variable tracking let a smaller open-weight model hold up where it normally would not. The hosted apps run on gpt-oss-120b, not a frontier API. The cost/latency tradeoff is set from config, not code: Fast, Balanced, and Accurate reasoning modes, with code execution in local, Docker, or E2B sandboxes. That dial matters because the harness takes on work the model would otherwise have to do itself, which makes smaller models viable for complex agent loops.
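A config-driven dial of this kind can be modeled as two enumerations parsed from a settings mapping. This is a minimal sketch, not CUGA's configuration schema: the keys "mode" and "sandbox", the defaults, and the names ReasoningMode, Sandbox, RunSettings, and load_settings are all hypothetical. Only the three mode names and the three sandbox options come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class ReasoningMode(Enum):
    FAST = "fast"
    BALANCED = "balanced"
    ACCURATE = "accurate"


class Sandbox(Enum):
    LOCAL = "local"
    DOCKER = "docker"
    E2B = "e2b"


@dataclass(frozen=True)
class RunSettings:
    mode: ReasoningMode
    sandbox: Sandbox


def load_settings(config: Mapping[str, str]) -> RunSettings:
    """Parse settings from a config mapping; unknown values raise ValueError.

    The defaults ("balanced", "local") are assumptions made for this sketch.
    """
    mode = ReasoningMode(config.get("mode", "balanced").lower())
    sandbox = Sandbox(config.get("sandbox", "local").lower())
    return RunSettings(mode=mode, sandbox=sandbox)


settings = load_settings({"mode": "Accurate", "sandbox": "docker"})
print(settings.mode.name, settings.sandbox.name)  # ACCURATE DOCKER
```

Validating the values up front means a typo in the config fails at startup instead of partway through a long agent run.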
The pattern across all two dozen apps is a clean split: generic capabilities come from shared MCP servers, and task-specific logic lives in inline tools. The result is a library of copyable blueprints that make the case for a harness, not a framework, as the right abstraction for production agent work. The same code that recommends IBM Cloud services in a demo runs sovereign and governed in production without changing a line.
Source: Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
Domain: huggingface.co