
SWE-MeM Trains Coding Agents to Juggle Context, Hits 60.2% on SWE-Bench

60.2% resolve rate on SWE-Bench Verified with a 30B model - SWE-MeM lets coding agents decide when and what to compress, jointly optimizing memory and resolution.

Tags: swe mem, swe bench, grpo, code agents, memory management, large language models

A 30B model reaches a 60.2% resolve rate on SWE-Bench Verified, and not just because it is code-savvy: it knows how to forget. That's the headline from SWE-MeM, a training framework built on the premise that static compression schedules are a dead end for long-horizon coding agents.

Why Context Budgets Kill Agent Performance

Long software engineering sessions dump noisy interaction histories into a limited context window. Existing memory tricks either compress everything uniformly or enforce rigid timing, and both approaches waste tokens and lose signal. Worse, they treat memory management and issue resolution as separate problems, so the agent can't learn how its memory decisions affect its ability to resolve the issue.

SWE-MeM: Let the Agent Choose Its Compression

SWE-MeM gives the agent a flexible memory tool — not a fixed policy. The agent decides when, what, and how to compress based on trajectory state, task progress, and remaining context budget. Training uses synthesized proactive memory-management trajectories plus Memory-aware GRPO, which splits trajectories and assigns step-level credit to jointly optimize both memory decisions and resolution quality. No more hand-coded heuristics.
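To make the idea of an agent-invoked memory tool concrete, here is a minimal sketch in Python. It is not the paper's implementation: the names `Turn`, `Context`, `compress`, and `group_relative_advantages` are hypothetical, token counts are approximated by whitespace splitting, and the summary text is passed in by the caller rather than generated by a model. The advantage function shows only the plain group normalization commonly used in GRPO; the trajectory splitting and step-level credit of Memory-aware GRPO are not reproduced here.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Turn:
    role: str  # e.g. "tool", "assistant", or "memory" for a summary
    text: str

    @property
    def tokens(self) -> int:
        # Assumption: crude whitespace count; a real agent would use its tokenizer.
        return len(self.text.split())


@dataclass
class Context:
    budget: int  # maximum tokens allowed in the window
    turns: list[Turn] = field(default_factory=list)

    def used(self) -> int:
        return sum(t.tokens for t in self.turns)

    def remaining(self) -> int:
        return self.budget - self.used()

    def compress(self, start: int, end: int, summary: str) -> int:
        """Replace turns[start:end] with one summary turn; return tokens freed.

        The agent, not a fixed schedule, picks the span (what), the moment
        of the call (when), and the summary wording (how).
        """
        if not (0 <= start < end <= len(self.turns)):
            raise ValueError("invalid span")
        before = self.used()
        self.turns[start:end] = [Turn("memory", summary)]
        return before - self.used()


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    ctx = Context(budget=50)
    ctx.turns.append(Turn("tool", "traceback " * 20))
    ctx.turns.append(Turn("tool", "grep output " * 10))
    ctx.turns.append(Turn("assistant", "the bug is in parse_config"))
    freed = ctx.compress(0, 2, "tests fail in parse_config; grep found 3 call sites")
    print(freed, ctx.remaining())
```

In this sketch the remaining budget is something the agent can inspect before deciding whether a compression call is worth making, which is the kind of state-dependent choice a fixed compression schedule cannot express.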

The Numbers That Matter

On SWE-Bench Verified, the 4B model resolves 43.4% of issues and the 30B model resolves 60.2%. Both beat existing memory management baselines on performance and on efficiency, meaning fewer tokens spent per solved issue. The abstract does not report absolute token savings, but outperforming baselines on both axes is the point: you don't have to sacrifice accuracy to stay under budget.

The practical takeaway: we're past the point where dumping the entire conversation into context works. Agents that can't manage their own memory won't scale to real-world repositories. SWE-MeM points toward agents that learn when to archive, summarize, or drop — and that skill alone might be worth more than another 10B parameters.


Source: SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents
Domain: arxiv.org
