131 Tokens/s on an 8GB Laptop by Killing the LLM at Inference

By moving the knowledge store offline and using only a tiny router and a 1B model at runtime, response time drops from roughly 4.5 seconds to 518 ms on consumer hardware.

Tags: arxiv, large language models, mixture of experts, offline knowledge store, bm25 routing, 8gb laptop

A single researcher cut end-to-end latency from 4,465 ms to 518 ms on an 8 GB laptop, not by shrinking the model but by never calling it at runtime. Effective throughput rose from 15.7 to 131 tokens per second, and the small model's streaming decode held steady at 226–237 tok/s with a 29–62 ms time-to-first-token.
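The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. It assumes that "effective throughput" means response tokens divided by end-to-end seconds; the implied response length is therefore a derived estimate, not a figure reported by the source.

```python
# Figures quoted in the article.
before_ms, after_ms = 4465, 518
before_tps, after_tps = 15.7, 131.0

latency_speedup = before_ms / after_ms      # ratio of end-to-end latencies
throughput_gain = after_tps / before_tps    # ratio of effective throughputs

# Assumption: effective throughput = response tokens / end-to-end seconds,
# so the implied response length is throughput * latency. This is an
# inference from the quoted figures, not a number stated by the source.
tokens_before = before_tps * before_ms / 1000
tokens_after = after_tps * after_ms / 1000

print(f"latency speedup: {latency_speedup:.1f}x")
print(f"throughput gain: {throughput_gain:.1f}x")
print(f"implied tokens per response: {tokens_before:.0f} vs {tokens_after:.0f}")
```

Under that assumption both ratios land between 8x and 9x, and the implied response length is about 70 tokens in both configurations, which is consistent with the two measurements describing comparable responses.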

What Changed: Removing the Large Model from the Hot Path

Earlier work had already proved a 35B-class Mixture-of-Experts model fits in 8 GB of GPU memory. That solved placement but not latency: the large model still took roughly four seconds per query. The fix was brutal and effective. During an offline phase, the large model reads source documents and writes verified answer entries into a structured knowledge store. At runtime, only a lightweight router, a deterministic renderer, and a 1B-class model are active. The large model is gone.
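The split between the two phases can be sketched as follows. This is a minimal illustration of the pattern as described here, not the author's implementation: `Entry`, `build_store`, `answer`, and the callables `extract`, `verify`, `route`, and `small_model` are all hypothetical names, and the deterministic renderer is assumed to return the stored answer unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Entry:
    """One verified answer written by the large model during the offline phase."""
    entry_id: str
    question: str
    answer: str
    source_doc: str


def build_store(documents: dict[str, str],
                extract: Callable[[str, str], list[Entry]],
                verify: Callable[[Entry, str], bool]) -> list[Entry]:
    """Offline phase: the only place the large model (inside `extract`) runs."""
    store = []
    for name, text in documents.items():
        for entry in extract(name, text):
            if verify(entry, text):      # keep only entries that pass verification
                store.append(entry)
    return store


def answer(query: str,
           store: list[Entry],
           route: Callable[[str, list[Entry]], tuple[Optional[Entry], bool]],
           small_model: Callable[[str, Optional[Entry]], str]) -> str:
    """Runtime phase: router, renderer, and small model only."""
    entry, confident = route(query, store)
    if entry is not None and confident:
        return entry.answer              # assumed deterministic render, no model call
    return small_model(query, entry)     # low confidence: escalate to the small model
```

The point of the structure is that nothing reachable from `answer` can invoke the large model; its cost is paid once, inside `build_store`.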

The structural bottleneck became obvious when the author tested three different large model families (Qwen, Gemma, and GLM class): all three showed the same multi-second runtime cost, and all three also produced usable knowledge stores offline. The problem isn't a specific model; it's the architecture of invoking a giant transformer for every query.

Routing and Fidelity: BM25 Beats Naive Search, Confidence Gate Lifts Accuracy

On a 563-entry store built from seventeen real documents, naive keyword routing collapsed to 1.5% top-1 accuracy. BM25-based routing hit 92.8% top-1 (99.4% top-3). A confidence gate escalated 12.3% of queries to the small model and pushed effective top-1 to 98.0%.
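To make the routing step concrete, here is a small self-contained BM25 router with a confidence gate. It is a sketch under stated assumptions: the `k1` and `b` values are conventional defaults rather than the paper's settings, and the margin-based gate (`min_margin`) is a hypothetical rule, since the article does not say how the confidence gate is computed.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Router:
    """Okapi BM25 over entry texts (k1 and b are common defaults, not the paper's)."""

    def __init__(self, entries: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_counts = [Counter(tokenize(e)) for e in entries]
        self.lengths = [sum(c.values()) for c in self.term_counts]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        n = len(entries)
        # Number of entries that contain each term at least once.
        doc_freq = Counter(t for c in self.term_counts for t in c)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query: str, i: int) -> float:
        counts, length = self.term_counts[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def route(self, query: str, min_margin: float = 0.2):
        """Return (best_index, confident). The margin rule is a hypothetical gate."""
        ranked = sorted(range(len(self.term_counts)),
                        key=lambda i: self.score(query, i), reverse=True)
        best = self.score(query, ranked[0])
        runner_up = self.score(query, ranked[1]) if len(ranked) > 1 else 0.0
        # Confident only if the winner clearly separates from the runner-up.
        confident = best > 0 and (best - runner_up) >= min_margin * best
        return ranked[0], confident
```

A query whose best match barely beats the runner-up, or matches nothing at all, comes back with `confident` set to `False`; in the pipeline described above, that is the case that would be escalated to the small model.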

Exact-match fidelity varied wildly by envelope format, from 9/9 down to 0/9 for identical content, which tells me the small model's output formatting is brittle. A 16-case verification gate blocked all ten corrupted entries while admitting all six supported ones. That's a practical safety net for production use.
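The article does not describe how the verification gate decides, so the following is only an illustration of the general idea: admit an entry when its answer is supported by the source text, block it otherwise, and score the gate on labelled cases. The token-overlap rule, the `min_overlap` parameter, and the function names are all hypothetical.

```python
import re


def content_tokens(text: str) -> set[str]:
    """Lowercased word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(answer: str, source_text: str, min_overlap: float = 1.0) -> bool:
    """Admit an answer only if enough of its tokens appear in the source.

    Hypothetical rule: `min_overlap` is the required fraction of answer
    tokens found in the source text; 1.0 means every token must appear.
    """
    answer_toks = content_tokens(answer)
    if not answer_toks:
        return False                      # nothing checkable, so block it
    found = answer_toks & content_tokens(source_text)
    return len(found) / len(answer_toks) >= min_overlap


def run_gate(cases: list[tuple[str, str, bool]]) -> tuple[int, int]:
    """Count corrupted cases blocked and supported cases admitted.

    Each case is (answer, source_text, expected_supported).
    """
    blocked = admitted = 0
    for answer, source, expected in cases:
        verdict = is_supported(answer, source)
        if expected and verdict:
            admitted += 1
        elif not expected and not verdict:
            blocked += 1
    return blocked, admitted
```

A plain overlap check like this would catch an altered number or name but not a reordering that changes the meaning, so a production gate would likely need something stricter.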

What This Enables Next

The bottleneck is structural, not implementation-specific: all three model families paid the same multi-second cost when invoked at query time. The offline knowledge store pattern decouples the expensive reasoning from the fast retrieval loop. This is the first credible recipe I've seen for serving a 35B-class model's knowledge at interactive speeds on a stock consumer laptop, and it suggests that the next wave of local intelligence won't run the LLM itself but will query what the LLM has already learned.


Source: The Brain That Goes Quiet: Serving a Large Model's Knowledge at 131 Tokens per Second on an 8 GB Laptop by Removing the Large Model from the Runtime Path
Domain: arxiv.org
