Subquadratic's SubQ Benchmarks Show 56x Speedup and 98% Retrieval at 12M Tokens

technologyreview.com · @market_structure · 2 hours ago · Artificial Intelligence

Independent tests by Appen confirm Subquadratic's sparse-attention model is 56× faster than FlashAttention and scores 98% on long-context retrieval at scales most models can't touch.

Tags: subquadratic, subq, appen, sparse attention, large language models, transformers

Appen's independent evaluation found that SubQ runs 56 times faster than FlashAttention and scores 98% on a 12-million-token needle-in-a-haystack test. That's not just a press-release number. It's a third-party measurement, and it suggests this Miami startup might have actually solved the quadratic attention bottleneck that has plagued LLMs since the 2017 Transformer paper.

Subquadratic came out of stealth last month with big claims: faster, cheaper, 12× larger context windows. The AI community rightly shrugged. Dan McAteer summed it up: "either the biggest breakthrough since the Transformer ... or it's AI Theranos." Now the company has published Appen's results, and they deserve a close look.

Why Dense Attention Burns Through Compute

Standard Transformer-based LLMs use dense attention: each token's representation is compared with every other token's. A 10,000-word document triggers about 50 million pairwise comparisons (counting each pair once and treating words as tokens). Double the tokens, quadruple the compute. That's the quadratic wall. Subquadratic's SubQ replaces dense attention with a dynamic sparse attention mechanism that skips most of those comparisons. The company won't disclose exactly how it selects which token pairs matter, but claims the selection is computed on the fly per input.
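The arithmetic behind the quadratic wall is easy to check. The short Python sketch below counts query-key comparisons for a single attention head in a single layer, assuming one token per word and causal attention (each token looks at itself and everything before it). The function name is illustrative, and real models multiply these counts by the number of heads and layers.

```python
def dense_attention_pairs(n_tokens: int, causal: bool = True) -> int:
    """Count query-key score computations for one head in one layer."""
    if causal:
        # Each token attends to itself and every earlier token.
        return n_tokens * (n_tokens + 1) // 2
    # Bidirectional attention scores every ordered pair.
    return n_tokens * n_tokens


for n in (10_000, 20_000, 12_000_000):
    print(f"{n:>12,} tokens -> {dense_attention_pairs(n):,} causal pairs")
```

Going from 10,000 to 20,000 tokens raises the count from roughly 50 million to roughly 200 million, the fourfold jump described above. At 12 million tokens the same formula gives about 72 trillion comparisons per head per layer.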

Previous sparse-attention attempts used fixed patterns (e.g., always compare word 1 to word 5). That doesn't work well for natural language, where the tokens that matter change from one input to the next. Subquadratic says its dynamic selection does. The skepticism is warranted: many have tried, and earlier attempts have generally fallen short of dense attention's quality. But Appen's numbers suggest SubQ might be different.
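To make the fixed-versus-dynamic distinction concrete, here is a toy NumPy sketch. It is not Subquadratic's method, which the company has not disclosed. The fixed pattern is a sliding window, and the dynamic pattern is a simple top-k rule chosen purely for illustration. All function names are hypothetical. Note that this toy computes every score before picking the top k, so it shows the selection idea only and saves no compute; a practical system would need a much cheaper way to choose pairs.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax that tolerates -inf entries (masked positions)."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def fixed_window_mask(n: int, window: int) -> np.ndarray:
    """Static pattern: token i may attend only to tokens within `window` positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Input-dependent pattern: each query keeps its k highest-scoring keys."""
    keep = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to the pairs allowed by `mask`."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # disallowed pairs get zero weight
    return softmax(scores) @ values


rng = np.random.default_rng(0)
n, d = 8, 4
queries, keys, values = (rng.standard_normal((n, d)) for _ in range(3))
scores = queries @ keys.T / np.sqrt(d)

out_fixed = masked_attention(queries, keys, values, fixed_window_mask(n, window=1))
out_dynamic = masked_attention(queries, keys, values, topk_mask(scores, k=3))
```

The window mask is the same for every input, so a relevant token far away is never seen. The top-k mask changes with the content of `queries` and `keys`, which is the property Subquadratic claims for its own selection step.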

Appen's Tests: Speed, Coding, and Retrieval

Appen ran three key benchmarks. On raw speed, SubQ was 56× faster than FlashAttention, a widely used optimized implementation of exact (dense) attention. On LiveCodeBench (real competitive coding problems), SubQ scored 89.7%, competitive with frontier models from OpenAI and Anthropic. On the RULER needle-in-a-haystack test at 12 million tokens, SubQ hit 98%: near-perfect retrieval at a scale where even GPT-4-class models struggle.
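For readers unfamiliar with needle-in-a-haystack testing, the general shape of such a test can be sketched in a few lines of Python. This is not RULER's harness or Appen's setup; the filler text, the "magic number" needle, and every function name are assumptions made for illustration, and `toy_reader` is a plain string search standing in for a real model call.

```python
import random
from typing import Callable


def build_haystack(n_filler: int, needle: str, depth: float, seed: int = 0) -> str:
    """Return filler text with `needle` inserted at a relative depth in [0, 1]."""
    rng = random.Random(seed)
    words = ["alpha", "beta", "gamma", "delta", "epsilon"]
    filler = [" ".join(rng.choices(words, k=8)) + "." for _ in range(n_filler)]
    position = round(depth * n_filler)
    return " ".join(filler[:position] + [needle] + filler[position:])


def retrieval_accuracy(answer_fn: Callable[[str, str], str], trials: int = 20) -> float:
    """Fraction of trials in which the reply contains the hidden value."""
    hits = 0
    for t in range(trials):
        secret = str(1000 + t)
        needle = f"The magic number is {secret}."
        depth = t / max(trials - 1, 1)  # sweep the needle from start to end
        context = build_haystack(200, needle, depth, seed=t)
        reply = answer_fn(context, "What is the magic number?")
        hits += secret in reply
    return hits / trials


def toy_reader(context: str, question: str) -> str:
    """Stand-in for a model call: plain string search, not an LLM."""
    marker = "The magic number is "
    start = context.find(marker)
    if start < 0:
        return ""
    start += len(marker)
    return context[start:start + 4]


print(retrieval_accuracy(toy_reader))
```

A score like the reported 98% is this kind of hit rate, measured with a real model and a context of 12 million tokens instead of a few thousand words.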

CEO Justin Dangel gave a cost comparison: running Anthropic's Opus 4.6 through the RULER 128 test cost $2,600. SubQ cost $8. That's a 325× cost reduction, though it's a single measurement on a specific task. Still, the pattern is consistent: SubQ trades generality for extreme efficiency on long-context and coding workloads.
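The ratio follows directly from the two figures Dangel quoted. A quick Python check, using only those two numbers (the variable and function names are illustrative):

```python
def cost_ratio(baseline_usd: float, candidate_usd: float) -> float:
    """How many times cheaper the candidate run is than the baseline run."""
    return baseline_usd / candidate_usd


opus_cost_usd = 2_600  # quoted cost of the Opus 4.6 run
subq_cost_usd = 8      # quoted cost of the SubQ run

ratio = cost_ratio(opus_cost_usd, subq_cost_usd)
savings = 1 - subq_cost_usd / opus_cost_usd
print(f"{ratio:.0f}x cheaper, {savings:.1%} lower cost")
```

Because both inputs come from one run of one benchmark, the ratio describes that run only, not a general price difference between the two models.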

The Skeptic's File: Borrowed Weights and Limited Access

Subquadratic bootstrapped SubQ using weights from Qwen, an open-source Chinese model. That's common practice, but it undermines the claim of a full architecture reinvention. Independent researcher Will Depue (ex-OpenAI) says the public evidence doesn't yet justify the stronger claim of solving the quadratic bottleneck. Until SubQ is widely accessible (the company has a massive waitlist and only a few hundred enterprise testers), these benchmarks are promising but not proof.

Subquadratic's CTO Alex Whedon acknowledges the delays, saying the small team can't serve everyone at once. But he's blunt: "We're more up against it than OpenAI is." If sparse attention truly works at production scale, the next year will show whether SubQ is the first viable post-Transformer architecture or just another well-marketed paper. The Appen results make me lean toward the former, but I'm waiting for the open-source release or an API anyone can hammer on.


Source: A startup claims it broke through a bottleneck that's holding back LLMs
Domain: technologyreview.com
