GPT-5.6 Sol Hits 91.9% on Terminal-Bench, but Only Government-Vetted Partners Get It

GPT-5.6 Sol scores 91.91% on Terminal-Bench 2.1, a command-line automation benchmark that measures planning, tool use, and iterative error correction. That edges out Claude Mythos 5 at 88% and beats its own Max configuration at 88.76%. But you can't just fire up an API key and start running it. OpenAI is gating access behind a White House preview process, limiting initial access to a narrow set of trusted partners whose details are shared with the U.S. government.

Three Tiers, One Government Gate

OpenAI split the GPT-5.6 family into three permanent capability tiers that will evolve independently. Sol is the flagship at $5.00 per million input tokens and $30.00 per million output tokens, the same pricing as GPT-5.5. Terra targets high-volume production workloads at half the cost: $2.50 input / $15.00 output. Luna is the budget option for low-latency pipelines at $1.00 / $6.00. All three are accessible only through the API and Codex, and no open-source release is pending, owing to dual-use risks from the models' cybersecurity capabilities.
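
To make the tier gap concrete, here is a rough cost sketch using only the list prices quoted above. The function name, the tier keys, and the 200,000-input / 20,000-output request size are illustrative assumptions, not anything from OpenAI's announcement, and real bills would also depend on caching and any other fees.

```python
# List prices quoted in the article, in USD per million tokens: (input, output).
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def request_cost(tier, input_tokens, output_tokens):
    """Estimate the list-price cost in USD of one request (hypothetical helper)."""
    input_rate, output_rate = PRICES[tier]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed workload: one agentic call with 200k input tokens and 20k output tokens.
for tier in PRICES:
    print(f"{tier}: ${request_cost(tier, 200_000, 20_000):.2f}")
```

At that assumed size, the same request costs about $1.60 on Sol, $0.80 on Terra, and $0.32 on Luna, so Luna runs at a fifth of the flagship price.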

The release coordination stems from a June 2 executive order from President Donald J. Trump that requires federal agencies to collaborate on benchmarking and capability assessment before wide model deployment. OpenAI says it previewed the models and capabilities with the U.S. government ahead of launch, and at the government's request, limited access to "a small group of trusted partners." The order set a 30-day review window ending July 2, so broader availability hinges on that outcome.

Subagents and Max Reasoning Effort

The architectural change in GPT-5.6 Sol centers on how inference compute gets allocated. A new max reasoning effort mode explicitly grants the model extended time for complex problems. The ultra mode goes further, deploying specialized subagents that divide and conquer multi-step, long-horizon projects. On Agent's Last Exam, a benchmark covering 55 professional domains, Sol is the only model to break 50% success, scoring 50.9% in code mode while using fewer tokens than prior architectures.

On GeneBench v1 for long-horizon genomics analysis, Sol systematically outperforms GPT-5.5 while consuming fewer total tokens across simulated latency periods. Terminal-Bench 2.1 was the standout: 91.91% in ultra mode, a new state-of-the-art for command-line automation. These benchmarks appear only as screenshots in the announcement, so I'll treat the numbers as provisional until third-party verification lands.

Prompt Caching Economics and Cerebras Speed

OpenAI revamped prompt caching for agentic loops that shuttle massive context windows around. Developers can set explicit cache breakpoints with a guaranteed 30-minute minimum cache lifetime. First cache write carries a 1.25x premium over the standard uncached input rate, but subsequent cache reads get a 90% discount. For enterprises that repeatedly pass codebase definitions or conversation histories, that pricing predictability matters.
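
A back-of-the-envelope sketch of that caching math, assuming Sol's $5.00 per million input rate, a single cache write followed by reads that all land inside the cache lifetime, and reads billed at 10% of the uncached rate. The function names and the 100,000-token prefix are hypothetical, and this is arithmetic on the quoted multipliers, not OpenAI's billing logic.

```python
def uncached_input_cost(prefix_tokens, calls, base_rate=5.00):
    """Cost in USD of resending the same prefix on every call with no caching."""
    return prefix_tokens * base_rate / 1_000_000 * calls


def cached_input_cost(prefix_tokens, calls, base_rate=5.00,
                      write_multiplier=1.25, read_discount=0.90):
    """Cost in USD with one cache write, then (calls - 1) discounted cache reads.

    Assumes every later call hits the cache before it expires.
    """
    per_call = prefix_tokens * base_rate / 1_000_000
    write = per_call * write_multiplier
    reads = per_call * (1 - read_discount) * (calls - 1)
    return write + reads


# Assumed workload: a 100k-token codebase prefix reused across an agentic loop.
for calls in (1, 2, 10):
    plain = uncached_input_cost(100_000, calls)
    cached = cached_input_cost(100_000, calls)
    print(f"{calls} call(s): uncached ${plain:.3f}, cached ${cached:.3f}")
```

Under these assumptions a single call costs more with caching (the 1.25x write premium), but the cache already pays for itself on the second call, and ten calls come to roughly $1.08 instead of $5.00.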

For latency-critical applications, OpenAI is launching GPT-5.6 Sol on Cerebras hardware this July, claiming up to 750 tokens per second. That targets real-time, frontier-grade reasoning where every millisecond counts.

Red-Teaming and Real-Time Safety Friction

OpenAI dedicated roughly 700,000 A100e GPU hours to automated red-teaming, searching for universal jailbreaks that bypass safeguards across varied contexts. The result is a multi-layered safety stack: model-level refusals baked into weights, real-time token-by-token classifiers for cyber and biological output, and reasoning review pauses that kick in mid-generation when a high-risk violation is flagged. A secondary, larger reasoning model then verifies the context and withholds malicious output before it reaches the user.

Here's the operational friction: legitimate defensive security work (code reviews, vulnerability discovery, patch engineering) uses the same primitives as offensive exploits. OpenAI admits the classifiers will regularly produce false positives. Expect localized latency spikes, paused API generations, and intermittent request refusals during the preview period. Persistent flagging can trigger automated account-level reviews across historical conversations. OpenAI is negotiating customer-operated safety overrides and privacy-preserving detection to insulate corporate data from manual review.

In testing, Sol remains optimized for defensive containment. Run against the Chromium and Firefox codebases, it isolated bugs and exploitation primitives but could not autonomously engineer a functional, full-chain exploit, keeping it below OpenAI's "Cyber Critical" threshold.

The Geopolitical Precedent

OpenAI took the unusual step of publicly critiquing the government gatekeeping within the official product announcement: "We don't believe this kind of government access process should become the long-term default. It keeps the best tools from users, developers, enterprises, cyber defenders, and global partners who need them." That tension between capability and access will define the next few months.

General availability across ChatGPT and the wider public API is expected to roll out incrementally over the coming weeks, but the regulatory scaffolding around this release sets a precedent that will shape every frontier model launch from here.

Source: OpenAI unveils GPT-5.6 Sol, Terra and Luna models - but only accessible to limited preview partners for now, per US Gov
Domain: venturebeat.com