Sub-Second LLM Inference à travers les niveaux sans changements de VPN ou de pare-feu

0.54 seconds median time-to-first-token for an HPC-hosted LLM through an institutional firewall, no VPN or rule changes required. That's the headline number from STREAM, a new middleware that stitches local, HPC, and cloud GPU tiers into a single routing fabric with an OpenAI-compatible API endpoint.

Dual-Channel Streaming Beats Batch Mode by 21x

STREAM's key architectural bet is decoupling the control plane from the data plane. The control plane uses Globus Compute for authentication and job dispatch. The data plane runs over a WebSocket relay. That split lets token streams punch through firewalls that would normally block an interactive SSH or batch-job port.

Llama 3.2 3B on an institutional HPC cluster goes from an 11.40-second batch-mode TTFT to 0.54 seconds over the relay. That's a 21.1x speedup. The relay operator can't read the tokens either - AES-256-GCM encrypts the entire payload end-to-end.

Three-Tier Routing With a Judge Keeps Simple Queries Free

STREAM runs a local LLM as a complexity judge. When you ask a trivial question like "What's the capital of France?", the judge routes it to your local model. Harder queries go to HPC or cloud. The result: 85.1% of 1,200 benchmark queries across ten domains stayed on the free local or HPC tiers. Cloud API calls dropped to 14.9%, slashing cost and data-exposure risk.

Tier-aware context summarization prevents long conversations from forcing simple follow-ups into expensive tiers. If the whole conversation history is large, STREAM summarizes it before sending a trivial query to HPC or local. The cloud tier only sees the full context when the complexity judge deems it necessary.

HPC-as-API Proxy Removes the Cluster Expertise Barrier

STREAM exposes the entire multi-tier system as a single OpenAI-compatible endpoint. Any standard client - LangChain, LlamaIndex, a curl script - can call it without understanding PBS scripts, Slurm partitions, or GPU allocation. The sub-second TTFT makes this proxy mode practical; nobody is going to wait 11 seconds per token just to avoid a cloud bill.

Llama 3.2 3B measured 0.26 seconds locally, 0.54 seconds via HPC relay, and 1.68 seconds on cloud. The local tier is fastest but hardware-limited; the HPC tier offers a middle ground that's nearly as fast and infinitely scalable within institutional walls. STREAM makes that middle ground callable like an API, which is exactly what research labs with GPU clusters need to turn batch-oriented HPC into an interactive LLM backend.

Source: STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming
Domain: arxiv.org

Sub-Second LLM Inference à travers les niveaux sans changements de VPN ou de pare-feu

Dual-Channel Streaming Beats Batch Mode by 21x

Three-Tier Routing With a Judge Keeps Simple Queries Free

HPC-as-API Proxy Removes the Cluster Expertise Barrier

More in Systems Engineering