
GroqRack Beats GPUs on Decode Latency but Loses on Batch Scaling

GPUs still own the prefill phase, but GroqRack cuts time per output token during decode until batch sizes rise.

groq, groqrack, nvidia, llm inference, prefill decode, ai accelerators

Llama2-7B decodes at 3x lower latency on GroqRack than on an A100 at batch size 1, but that edge collapses to zero by batch size 16.

That's the headline from a new phase-aware evaluation of LLM inference across GPUs and emerging AI accelerators, posted on arXiv. The researchers separate prefill and decode performance rather than averaging them into a single number that hides phase-specific strengths.

Why Phase-Aware Metrics Expose the Real Winner

Most LLM serving benchmarks report a combined latency or throughput. That conflates two fundamentally different workloads: compute-bound prefill (where you process the prompt and generate the first token) and memory-bound decode (where you stream output one token at a time).

The paper measures time-to-first-token (TTFT) for prefill and time-per-output-token (TPOT) for decode separately across GPUs and the GroqRack accelerator. GPUs consistently win the prefill phase, which is heavy on matrix multiplications parallelizable across many cores. GroqRack's architecture, built around a deterministic systolic array, is not designed for that kind of compute.
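To make the two metrics concrete, here is a minimal sketch of how TTFT and TPOT can be derived from per-token arrival timestamps. The `RequestTrace` type and the sample timestamps are hypothetical and are not taken from the paper; the definitions used (first-token delay, and mean gap between subsequent tokens) are the common ones, and the paper's exact measurement method may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (names are illustrative)."""
    sent_at: float            # seconds: when the prompt was submitted
    token_times: List[float]  # seconds: arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time-to-first-token: covers the prefill phase."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time-per-output-token: mean gap between tokens after the first (decode)."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


# Made-up timestamps, chosen only to illustrate the arithmetic.
trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT = {ttft(trace):.2f} s, TPOT = {tpot(trace):.2f} s")
```

Reporting the two numbers separately is what lets a slow-prefill, fast-decode device show up as such, instead of disappearing into one averaged latency.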

GroqRack's Decode Edge Vanishes With Batching

During decode at batch size 1, GroqRack delivers significantly lower TPOT than any GPU tested. The reason: Groq's streaming processor has minimal memory movement overhead for single-sequence generation, so it can fire out tokens faster than a GPU waiting on HBM bandwidth.

But the paper shows that edge disappears as soon as you increase batch size. GPUs regain the advantage in decode throughput because their massive parallelism lets them process many sequences concurrently. GroqRack does not currently support batching, so it cannot amortize its fixed latency across multiple requests.
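The amortization argument can be sketched with a toy cost model. It assumes a batched decode step costs a fixed amount (for example, streaming the weights once) plus a small per-sequence increment, while an unbatched device serves one sequence at a time at a constant TPOT. The model, the function names, and every number below are illustrative assumptions, not measurements from the paper.

```python
from typing import Optional


def batched_throughput(batch: int, fixed_step_s: float, per_seq_s: float) -> float:
    """Tokens/s for a device whose fixed per-step cost is shared by the whole batch."""
    return batch / (fixed_step_s + batch * per_seq_s)


def unbatched_throughput(tpot_s: float) -> float:
    """Tokens/s for a device that decodes one sequence at a time."""
    return 1.0 / tpot_s


def crossover_batch(fixed_step_s: float, per_seq_s: float, tpot_s: float,
                    max_batch: int = 1024) -> Optional[int]:
    """Smallest batch size at which the batched device out-produces the unbatched one."""
    target = unbatched_throughput(tpot_s)
    for batch in range(1, max_batch + 1):
        if batched_throughput(batch, fixed_step_s, per_seq_s) > target:
            return batch
    return None  # no crossover within max_batch


# Hypothetical costs: 30 ms fixed + 1 ms per sequence vs. a flat 10 ms TPOT.
print(crossover_batch(fixed_step_s=0.03, per_seq_s=0.001, tpot_s=0.01))
```

With these made-up inputs the unbatched device is roughly 3x faster at batch size 1, yet the batched device overtakes it within a handful of sequences. The crossover point depends entirely on the assumed costs.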

Heterogeneous Disaggregation as the Next Step

The researchers go further: they analyze disaggregated architectures where prefill runs on GPUs and decode runs on GroqRack, connected over a network. Their simulation shows measurable gains under certain workload and network conditions, specifically when decode dominates the request mix and latency targets are tight.
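A back-of-the-envelope latency model shows why network conditions and decode length matter here. It assumes the handoff between devices costs one round trip plus the time to move some per-request state over the link, and that decode on the accelerator has a lower TPOT than on the GPU. This is not the paper's simulator; the function names, the handoff model, and all values are hypothetical.

```python
from typing import Optional


def gpu_only_latency(ttft_s: float, tpot_gpu_s: float, out_tokens: int) -> float:
    """End-to-end latency when prefill and decode both run on the GPU."""
    return ttft_s + (out_tokens - 1) * tpot_gpu_s


def disaggregated_latency(ttft_s: float, tpot_accel_s: float, out_tokens: int,
                          transfer_bytes: float, link_bytes_per_s: float,
                          link_rtt_s: float) -> float:
    """Prefill on the GPU, then a network handoff, then decode on the accelerator."""
    handoff_s = link_rtt_s + transfer_bytes / link_bytes_per_s
    return ttft_s + handoff_s + (out_tokens - 1) * tpot_accel_s


def break_even_tokens(ttft_s: float, tpot_gpu_s: float, tpot_accel_s: float,
                      transfer_bytes: float, link_bytes_per_s: float,
                      link_rtt_s: float, max_tokens: int = 100_000) -> Optional[int]:
    """Smallest output length at which the disaggregated setup is faster."""
    for n in range(1, max_tokens + 1):
        split = disaggregated_latency(ttft_s, tpot_accel_s, n,
                                      transfer_bytes, link_bytes_per_s, link_rtt_s)
        if split < gpu_only_latency(ttft_s, tpot_gpu_s, n):
            return n
    return None  # the handoff never pays for itself within max_tokens


# Hypothetical inputs: 1 GB of state over a 10 GB/s link with a 1 ms round trip.
print(break_even_tokens(ttft_s=0.2, tpot_gpu_s=0.03, tpot_accel_s=0.01,
                        transfer_bytes=1e9, link_bytes_per_s=1e10, link_rtt_s=0.001))
```

In this sketch the split only wins once the per-token savings during decode outweigh the one-time handoff. That matches the article's point that gains show up when decode dominates the request mix.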

This is the kind of paper that makes you realize the single-accelerator mindset is already obsolete. The real question is not which chip wins, but how to compose the right mix for the phase at hand. That's where the performance leverage is hiding.


Source: Prefill/Decode-Aware Evaluation of LLM Inference on Emerging AI Accelerators
Domain: arxiv.org
