Ornith-1.0 9B Doubles Qwen3.5 on Agentic Coding Benchmarks

Ornith-1.0's 9B model scores 43.1% on Terminal-Bench 2.1 — more than double the 21.3% of Qwen3.5-9B, and the 35B variant hits 64.2% against Qwen3.5-35B's 41.4%.

Self-Improving RL: Joint Optimization of Scaffold and Solutions

DeepReinforce AI built Ornith-1.0 with a reinforcement learning loop that doesn't just train the model to produce code. It jointly optimizes the scaffold — the agentic orchestrator that drives rollouts — and the resulting solution. As the model generates better search trajectories, the scaffold adapts, and vice versa. That co-training is what lets these models punch above their parameter count.

All variants are post-trained on top of Gemma 4 or Qwen 3.5, but the RL fine-tuning is the differentiator. The 397B MoE variant even trades blows with Claude Opus 4.8: 82.4% vs 87.6% on SWE-bench Verified, but Ornith leads on SWE Atlas QnA (41.2% vs 48.8%) and RF (42.6% vs 46.7%).

Benchmark Results: Beating Models Well Above Their Weight Class

The tables are dense with comparisons. Ornith-1.0-397B scores 77.5% on Terminal-Bench 2.1 (Terminus-2), beating Qwen3.5-397B's 53.5% by 24 points and coming within reach of Claude Opus 4.8's 85%. On SWE-bench Multilingual it ties Qwen3.5-397B at 69.3% but thrashes Gemma4-31B's 51.7%. The 9B model's 27.2% on NL2Repo nearly doubles Qwen3.5-9B's 16.2%.

DeepReinforce ran all evaluations with consistent harnesses: Harbor/Terminus-2 for Terminal-Bench, OpenHands for SWE-bench, mini-SWE-agent for SWE Atlas. All runs used temperature=1.0, top_p=0.95 to 1.0, and context windows up to 400K tokens. The numbers are averaged over 5 runs with 32 CPU cores and 48GB RAM per evaluation node.

From 9B to 397B: Sizes, Formats, and Deployment

Ornith-1.0 ships in three architectures: a dense 9B that fits on a single 80GB GPU, a 35B MoE, and a 397B MoE. Each comes in bf16, FP8 (for memory-efficient serving on FP8-capable GPUs), and GGUF quantized variants for local inference via llama.cpp or Ollama. Context window is 256K tokens on all checkpoints. The recommended sampling parameters are temperature=0.6, top_p=0.95, top_k=20 to reproduce benchmark results use temperature=1.0.

Serving requires recent runtimes: Transformers ≥ 5.8.1, vLLM ≥ 0.19.1, SGLang ≥ 0.5.9. The model exposes an OpenAI-compatible interface with reasoning and tool-call parsers built in.

With MIT licensing and no regional restrictions, Ornith-1.0 gives every developer a shot at agentic coding performance that, until now, required closed APIs or massive proprietary clusters.

Source: Ornith-1.0: self-improving open-source models for agentic coding
Domain: github.com

Ornith-1.0 9B Doubles Qwen3.5 on Agentic Coding Benchmarks

Self-Improving RL: Joint Optimization of Scaffold and Solutions

Benchmark Results: Beating Models Well Above Their Weight Class

From 9B to 397B: Sizes, Formats, and Deployment

More in Artificial Intelligence