Source linked

Sub-Second LLM Inference à travers les niveaux sans changements de VPN ou de pare-feu

Le middleware STREAM obtient un token temps-à-première moyen de 0,54 s pour l'inférence HPC en séparant les plans de contrôle et de données, permettant aux chercheurs d'utiliser des clusters de GPU institutionnels avec une API compatible avec OpenAI.

streamglobus computellama 3 2hpc inferencemulti tier routingtime to first token

0.54 seconds median time-to-first-token for an HPC-hosted LLM through an institutional firewall, no VPN or rule changes required. That's the headline number from STREAM, a new middleware that stitches local, HPC, and cloud GPU tiers into a single routing fabric with an OpenAI-compatible API endpoint.

Dual-Channel Streaming Beats Batch Mode by 21x

STREAM's key architectural bet is decoupling the control plane from the data plane. The control plane uses Globus Compute for authentication and job dispatch. The data plane runs over a WebSocket relay. That split lets token streams punch through firewalls that would normally block an interactive SSH or batch-job port.

Llama 3.2 3B on an institutional HPC cluster goes from an 11.40-second batch-mode TTFT to 0.54 seconds over the relay. That's a 21.1x speedup. The relay operator can't read the tokens either - AES-256-GCM encrypts the entire payload end-to-end.

Three-Tier Routing With a Judge Keeps Simple Queries Free

STREAM runs a local LLM as a complexity judge. When you ask a trivial question like "What's the capital of France?", the judge routes it to your local model. Harder queries go to HPC or cloud. The result: 85.1% of 1,200 benchmark queries across ten domains stayed on the free local or HPC tiers. Cloud API calls dropped to 14.9%, slashing cost and data-exposure risk.

Tier-aware context summarization prevents long conversations from forcing simple follow-ups into expensive tiers. If the whole conversation history is large, STREAM summarizes it before sending a trivial query to HPC or local. The cloud tier only sees the full context when the complexity judge deems it necessary.

HPC-as-API Proxy Removes the Cluster Expertise Barrier

STREAM exposes the entire multi-tier system as a single OpenAI-compatible endpoint. Any standard client - LangChain, LlamaIndex, a curl script - can call it without understanding PBS scripts, Slurm partitions, or GPU allocation. The sub-second TTFT makes this proxy mode practical; nobody is going to wait 11 seconds per token just to avoid a cloud bill.

Llama 3.2 3B measured 0.26 seconds locally, 0.54 seconds via HPC relay, and 1.68 seconds on cloud. The local tier is fastest but hardware-limited; the HPC tier offers a middle ground that's nearly as fast and infinitely scalable within institutional walls. STREAM makes that middle ground callable like an API, which is exactly what research labs with GPU clusters need to turn batch-oriented HPC into an interactive LLM backend.


Source: STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming
Domain: arxiv.org

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.