NUS MRAgent Uses 118K Tokens per Query - LangMem Burns 27x More

National University of Singapore's MRAgent framework uses just 118,000 tokens per query on the LongMemEval benchmark, while LangMem burns through 3.26 million - a 27x reduction in token consumption without sacrificing accuracy.

The runtime savings are real too: MRAgent finishes in 586 seconds versus A-Mem's 1,122 seconds, roughly half the time. LangMem doesn't even publish runtime numbers, probably because nobody waits that long.

Why Passive Retrieval Fails for Long-Horizon Agents

Static retrieval pipelines - vector search, graph traversal, top-k similarity - break when agents need to reason over dozens of sessions and hundreds of dialogue turns. Three problems: they can't revise queries mid-reasoning, they flood context windows with irrelevant noise, and they rely on fixed relevance functions that ignore accumulating evidence.

MRAgent's authors argue that memory recall should mimic human recall as cognitive neuroscience describes it: start with small triggers (a name, an action, a place), follow associative stepping stones, and reconstruct the full story sequentially. Because the cheap triggers are checked first, an agent can drop dead ends before wasting tokens on heavy content.

Active Memory Reconstruction with Cue-Tag-Content

Instead of treating memory as a static database, MRAgent organizes it as a multi-layered graph with three node types: Cues (fine-grained keywords), Tags (semantic bridges that summarize relationships), and Content (actual episodic or semantic memories).
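To make the three layers concrete, here is a minimal sketch of how such a graph could be held in memory. It is an illustration under stated assumptions, not MRAgent's actual code: the class name `MemoryGraph`, its field names, the node ids, and the placeholder content strings are all hypothetical, and the real framework stores its graph in a graph database rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryGraph:
    """Toy three-layer memory graph: cue -> tag -> content (hypothetical layout)."""

    # cue keyword -> ids of the tags reachable from that cue
    cue_to_tags: dict[str, set[str]] = field(default_factory=dict)
    # tag id -> short summary that is cheap for the agent to read
    tag_summaries: dict[str, str] = field(default_factory=dict)
    # tag id -> ids of the content nodes behind that tag
    tag_to_content: dict[str, set[str]] = field(default_factory=dict)
    # content id -> full memory text (the expensive part)
    content: dict[str, str] = field(default_factory=dict)

    def add_memory(self, content_id: str, text: str, tag_id: str,
                   tag_summary: str, cues: list[str]) -> None:
        """Store one memory and link it to its tag and cues."""
        self.content[content_id] = text
        self.tag_summaries[tag_id] = tag_summary
        self.tag_to_content.setdefault(tag_id, set()).add(content_id)
        for cue in cues:
            self.cue_to_tags.setdefault(cue.lower(), set()).add(tag_id)

    def tags_for(self, cues: list[str]) -> set[str]:
        """Return every tag id reachable from any of the given cues."""
        found: set[str] = set()
        for cue in cues:
            found |= self.cue_to_tags.get(cue.lower(), set())
        return found


# Placeholder data only; the content strings stand in for real episodes.
graph = MemoryGraph()
graph.add_memory("c1", "<full episode: tournament win>",
                 "t_victory", "Tournament Victory", ["Nate", "win"])
graph.add_memory("c2", "<full episode: tournament sign-up>",
                 "t_participation", "Tournament Participation", ["Nate"])
candidates = graph.tags_for(["Nate"])
```

The point of the layout is that a lookup from a cue returns only tag ids and short summaries, so the full content strings stay untouched until a tag has been judged worth following.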

When an agent gets a query like "How did Nate use his prize money after winning his third tournament?", it extracts cues from the prompt - "Nate", "video game tournament", "win" - then navigates to candidate Tags. It evaluates short Tag summaries to decide whether to follow them. Tags like "Tournament Victory" get a green light; "Tournament Participation" gets dropped. Only after pruning does it retrieve full episodic content.

This iterative search-and-prune loop keeps the LLM's context clean and focused. The agent knows when it has enough evidence and stops, avoiding redundant exploration.
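The loop can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function `recall`, the dictionary layout, and the two callbacks `is_relevant` and `has_enough` are assumptions made for the example. In a real agent those two decisions would be made by the LLM; here they are plain Python functions so the sketch runs on its own.

```python
def recall(cues, cue_to_tags, tag_summaries, tag_to_content, content,
           is_relevant, has_enough):
    """Cue -> Tag -> Content recall that prunes tags before loading content."""
    evidence = []
    seen_tags = set()
    for cue in cues:
        for tag_id in sorted(cue_to_tags.get(cue, ())):
            if tag_id in seen_tags:
                continue
            seen_tags.add(tag_id)
            # Cheap step: judge only the short tag summary.
            if not is_relevant(tag_summaries[tag_id]):
                continue  # pruned, so its content is never loaded
            # Expensive step: load full content for tags that survived.
            for content_id in sorted(tag_to_content.get(tag_id, ())):
                evidence.append(content[content_id])
            # Stop as soon as the collected evidence is judged sufficient.
            if has_enough(evidence):
                return evidence
    return evidence


# Placeholder graph; content strings stand in for real episodes.
cue_to_tags = {"nate": {"t_victory", "t_participation"}, "win": {"t_victory"}}
tag_summaries = {"t_victory": "Tournament Victory",
                 "t_participation": "Tournament Participation"}
tag_to_content = {"t_victory": {"c1"}, "t_participation": {"c2"}}
content = {"c1": "<episode: third tournament win>",
           "c2": "<episode: signed up for a tournament>"}

result = recall(
    ["nate", "win"], cue_to_tags, tag_summaries, tag_to_content, content,
    is_relevant=lambda summary: "Victory" in summary,   # stand-in for an LLM judgment
    has_enough=lambda evidence: len(evidence) >= 1,     # stand-in for an LLM judgment
)
```

In this toy run the "Tournament Participation" tag is rejected on its summary alone, so its episode text never enters the context, and the search stops once the first relevant episode is in hand.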

Benchmark Results and the Cost Difference

On LoCoMo and LongMemEval, using Gemini 2.5 Flash and Claude Sonnet 4.5 backbones, MRAgent outperformed standard RAG, A-Mem, MemoryOS, LangMem, and Mem0 across all question types. But the headline numbers are the token counts.

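On LongMemEval, MRAgent consumed 118K tokens per sample. A-Mem needed 632K. LangMem burned 3.26 million. That's not a typo - 3.26 million tokens per query, about 27.6 times MRAgent's usage and roughly 5 times A-Mem's. At any fixed per-token price the bill scales by the same factor, and at production query volumes that difference alone could sink a deployment.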
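The arithmetic behind those multiples is easy to check. The sketch below uses only the three token counts reported above; the price constant is a made-up placeholder for illustration, not a quote from any provider, and the function names are hypothetical.

```python
# Reported LongMemEval token usage per query.
TOKENS_PER_QUERY = {
    "MRAgent": 118_000,
    "A-Mem": 632_000,
    "LangMem": 3_260_000,
}


def ratio_to_mragent(system: str) -> float:
    """How many times more tokens a system uses than MRAgent."""
    return TOKENS_PER_QUERY[system] / TOKENS_PER_QUERY["MRAgent"]


def cost_per_query(system: str, usd_per_million_tokens: float) -> float:
    """Per-query cost at a flat per-token price (ignores input/output splits)."""
    return TOKENS_PER_QUERY[system] / 1_000_000 * usd_per_million_tokens


# HYPOTHETICAL price, chosen only to make the numbers easy to read.
HYPOTHETICAL_USD_PER_MILLION = 1.00

for name in TOKENS_PER_QUERY:
    print(name,
          round(ratio_to_mragent(name), 1),
          round(cost_per_query(name, HYPOTHETICAL_USD_PER_MILLION), 3))
```

Whatever price is plugged in, the ratios stay the same: LangMem costs about 27.6 times as much per query as MRAgent, and A-Mem about 5.4 times as much.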

The Upfront Ingestion Catch

MRAgent requires a pre-built Cue-Tag-Content graph. Developers must set up an automated distillation pipeline - a background job that passes raw interaction histories through prompt templates to extract cues, tags, and content before storing them in a graph database.

The authors designed MRAgent with an automated LLM-driven pipeline to do this labeling. No manual tagging required. You orchestrate the ingestion once, then the agent reaps the efficiency gains on every query.
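As a rough picture of what such an ingestion job does, here is a minimal sketch. It rests on several assumptions: the function names `ingest` and `fake_distill`, the record keys, and the one-tag-per-session simplification are all hypothetical, and `fake_distill` is a keyword-matching stand-in for the LLM prompt templates that the real pipeline would call. Plain dictionaries stand in for the graph database.

```python
def ingest(sessions, distill):
    """Build cue/tag/content indexes from raw interaction histories.

    `distill` stands in for the LLM call. It takes raw text and returns
    a dict with the keys "cues", "tag", and "content".
    """
    graph = {"cue_to_tags": {}, "tag_to_content": {}, "content": {}}
    for session_id, raw_text in sessions.items():
        record = distill(raw_text)
        content_id = f"content:{session_id}"
        # Content layer: the full memory text.
        graph["content"][content_id] = record["content"]
        # Tag layer: link the tag to the content it summarizes.
        graph["tag_to_content"].setdefault(record["tag"], []).append(content_id)
        # Cue layer: link each keyword to the tag, without duplicates.
        for cue in record["cues"]:
            tags = graph["cue_to_tags"].setdefault(cue.lower(), [])
            if record["tag"] not in tags:
                tags.append(record["tag"])
    return graph


def fake_distill(raw_text):
    """Placeholder for an LLM prompt: keyword match for cues, first sentence as tag."""
    known_cues = ("nate", "tournament", "win")
    lowered = raw_text.lower()
    return {
        "cues": [cue for cue in known_cues if cue in lowered],
        "tag": raw_text.split(".")[0].strip(),
        "content": raw_text,
    }


# Placeholder session text, not real benchmark data.
sessions = {"s1": "Nate won a tournament. (rest of session omitted)"}
graph = ingest(sessions, fake_distill)
```

The job runs once per batch of new history, which is why it fits a background worker: the per-query path only reads the indexes it produces.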

Code is available on GitHub. For any team building long-horizon agents that doesn't want its API bill to exceed its rent, MRAgent's architecture is the template to copy.

Source: New agentic memory framework uses 118K tokens per query. LangMem burns through 3.26M.
Domain: venturebeat.com