JetBrains ran 353 real-world coding tasks across Java (225), C# (38), and Python (90) to decide which AI agent to recommend. Codex with GPT-5.4 mini solved 39.9% of them versus Junie's 39.1%, a gap of 0.8 percentage points that was too small for the benchmark alone to settle the question. The real tie-breaker came from an online A/B test with real users, where Codex showed lower churn and higher engagement.
How JetBrains Set the Bar for Agent Selection
The methodology was brutally practical. JetBrains first ruled out any setup that would push more than 2% of users over $20/month in token costs, then ranked candidates by solve rate, median cost per task, and median latency. They tested across three ecosystems using tasks that required actual code changes in real repositories - not toy exercises. The benchmark dataset lives in the Developer Productivity AI Arena (DPAIA) repository on GitHub, so anyone can reproduce the results. For each candidate agent they tracked compilation success and tool calls, but those signals didn't change the final ranking.
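The two-stage selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not JetBrains' actual tooling: the `Candidate` type, the `shortlist` function, the lexicographic ordering of the three ranking metrics, every `share_users_over_budget` value, and the "PriceyAgent" row are all hypothetical. Only the Codex and Junie solve rates, median costs, and median latencies come from the published results.

```python
from dataclasses import dataclass

# Budget gate from the article: at most 2% of users may exceed $20/month.
MAX_SHARE_OVER_BUDGET = 0.02


@dataclass(frozen=True)
class Candidate:
    name: str
    solve_rate: float               # fraction of benchmark tasks solved (0..1)
    median_cost_usd: float          # median cost per task
    median_latency_s: float         # median latency per task
    share_users_over_budget: float  # HYPOTHETICAL projected share of users above $20/month


def shortlist(candidates):
    """Drop setups that fail the budget gate, then rank the rest.

    Assumption: ties are broken lexicographically (solve rate first,
    then cost, then latency). The article lists the three metrics but
    does not say how they were combined.
    """
    affordable = [
        c for c in candidates
        if c.share_users_over_budget <= MAX_SHARE_OVER_BUDGET
    ]
    return sorted(
        affordable,
        key=lambda c: (-c.solve_rate, c.median_cost_usd, c.median_latency_s),
    )


CANDIDATES = [
    # Benchmark metrics from the article; budget shares are made up.
    Candidate("Codex", 0.399, 0.1387, 170.4, share_users_over_budget=0.01),
    Candidate("Junie", 0.391, 0.1132, 147.6, share_users_over_budget=0.01),
    # Entirely hypothetical setup that solves more tasks but breaks the budget.
    Candidate("PriceyAgent", 0.450, 0.9000, 200.0, share_users_over_budget=0.10),
]

ranked = shortlist(CANDIDATES)
for c in ranked:
    print(c.name, c.solve_rate, c.median_cost_usd, c.median_latency_s)
```

The point of the gate-then-rank shape is that a setup with a better solve rate never reaches the ranking step if it fails the cost constraint, which is why the hypothetical over-budget entry is absent from the output.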
Codex vs Junie: The Numbers That Mattered
Both agents were evaluated within the same cost tier. Codex used GPT-5.4 mini with medium reasoning, the setting that gave the best balance of solve rate and cost. Junie used Gemini 3 Flash, pre-selected by Junie's own team as their most promising model.
Head-to-head across all ecosystems, Codex led on solve rate (39.9% vs 39.1%), while Junie was cheaper (median cost per task $0.1132 vs $0.1387) and faster (median latency 147.6s vs 170.4s). Cost per successful solve also favored Junie: $0.4337 vs $0.4941.
The per-ecosystem results were split. In Java, Junie actually won on solve rate (45.2% vs 43.9%) and was cheaper. In C#, Codex had the higher solve rate (62.6% vs 58.7%). In Python, both struggled - Codex solved 20.2%, Junie 15.6%. The offline numbers were too close to call a clear winner.
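As a quick sanity check, the per-ecosystem solve rates can be combined into the overall figures. The sketch below assumes the overall rate is a task-count-weighted mean of the per-ecosystem rates. The article does not state this explicitly, but the arithmetic reproduces the reported 39.9% and 39.1%. The names `TASKS`, `SOLVE_RATE_PCT`, and `overall_rate` are illustrative.

```python
# Task counts per ecosystem, as reported (225 + 38 + 90 = 353).
TASKS = {"Java": 225, "C#": 38, "Python": 90}

# Per-ecosystem solve rates in percent, as reported.
SOLVE_RATE_PCT = {
    "Codex": {"Java": 43.9, "C#": 62.6, "Python": 20.2},
    "Junie": {"Java": 45.2, "C#": 58.7, "Python": 15.6},
}


def overall_rate(per_ecosystem):
    """Task-count-weighted mean of the per-ecosystem solve rates (percent)."""
    total_tasks = sum(TASKS.values())
    weighted = sum(per_ecosystem[eco] * n for eco, n in TASKS.items())
    return weighted / total_tasks


for agent, rates in SOLVE_RATE_PCT.items():
    print(f"{agent}: {overall_rate(rates):.1f}%")
# Codex: 39.9%
# Junie: 39.1%
```

Because Java accounts for 225 of the 353 tasks, Junie's Java lead offsets most of Codex's advantage in C# and Python, which is why the overall gap shrinks to 0.8 percentage points.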
Why Online A/B Testing Broke the Tie
JetBrains ran an A/B experiment with real users, tracking activation rate, churn (users switching to another agent or returning to chat), and failure rate. They couldn't measure task success directly at scale, so they used behavioral signals. Codex came out ahead on those engagement metrics. That tipped the decision. I like this approach: when offline benchmarks are a toss-up, let actual developer behavior decide. It's honest engineering judgment, not chasing a synthetic leaderboard.
The recommendation isn't permanent. JetBrains explicitly says they'll re-evaluate as models evolve and new agents appear. The DPAIA repository is open, so the community can watch and even contribute. For now, Codex is the default, but you can swap agents anytime - it's a recommendation, not a requirement.
Source: Codex is now the recommended agent in JetBrains IDEs
Domain: blog.jetbrains.com