AWS P-EAGLE parallelizes speculative decoding and delivers a 1.69x speedup

By replacing sequential draft generation with parallel multi-token prediction, P-EAGLE eliminates the linear latency cost of deeper speculation, achieving up to 1.69x higher throughput than EAGLE-3 on NVIDIA B200.

aws, sagemaker, p-eagle, speculative decoding, large language models, nvidia b200

P-EAGLE eliminates the sequential bottleneck that has limited speculative decoding to shallow depths, replacing K serial draft passes with a single parallel forward pass.

How P-EAGLE Breaks the Sequential Chain

In standard EAGLE, each draft token depends on the previous token's embedding and hidden state, forcing K sequential forward passes to propose K candidates. Even EAGLE-3, which uses direct token prediction and multi-layer representations, cannot escape this linear latency cost. P-EAGLE introduces two learnable placeholders, a mask token embedding and a shared hidden state, that substitute for the missing inputs at positions 2 through K. All K draft positions are constructed simultaneously and processed through the drafter's transformer layers (just 4 layers, 2-5% of target model parameters) in one forward pass. Because the number of drafter passes no longer grows with K, deeper speculation costs about the same drafting latency as shallow speculation.
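The input construction described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not AWS's implementation: the names `DraftInput`, `build_parallel_draft_inputs` and `drafter_forward_passes` are hypothetical, vectors are plain Python lists, and the two placeholders are fixed constants here, whereas in the real drafter they are learned parameters.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class DraftInput:
    position: int            # 1-based draft position
    token_embedding: Vector
    hidden_state: Vector
    is_placeholder: bool     # True when the inputs are stand-ins, not real values


def build_parallel_draft_inputs(
    last_token_embedding: Vector,
    last_hidden_state: Vector,
    mask_embedding: Vector,   # learnable in the real model; a constant in this sketch
    shared_hidden: Vector,    # learnable in the real model; a constant in this sketch
    k: int,
) -> List[DraftInput]:
    """Build all K draft inputs at once so a single forward pass can cover them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Position 1 has real inputs taken from the last verified step.
    inputs = [DraftInput(1, last_token_embedding, last_hidden_state, False)]
    # Positions 2..K have no real predecessor yet, so the placeholders stand in.
    for pos in range(2, k + 1):
        inputs.append(DraftInput(pos, mask_embedding, shared_hidden, True))
    return inputs


def drafter_forward_passes(k: int, parallel: bool) -> int:
    """Drafter forward passes needed to propose k tokens."""
    return 1 if parallel else k


batch = build_parallel_draft_inputs(
    [0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [1.0, 1.0], k=7
)
print([d.is_placeholder for d in batch])
print(drafter_forward_passes(7, parallel=False), drafter_forward_passes(7, parallel=True))
```

The pass count is the point of the sketch: sequential drafting needs K passes for K tokens, while the parallel batch needs one regardless of K.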

Benchmark Results: Up to 1.69x Over EAGLE-3

AWS tested P-EAGLE on Qwen3-Coder-30B-A3B-Instruct with FP8 quantization on NVIDIA B200 GPUs. On the Speedy-Bench Code benchmark at concurrency 8, P-EAGLE with K=7 delivered 4,638 output tokens per second (OTPS) compared to EAGLE-3's best of 3,762, a roughly 1.23x speedup. At concurrency 1, P-EAGLE reached 1.41x over EAGLE-3, and at concurrency 4 the best P-EAGLE configuration (K=11) delivered 3,710 OTPS against 3,215 for EAGLE-3 (about 1.15x). Against baseline inference (no speculation), P-EAGLE achieved up to 4.17x throughput. The P-EAGLE/EAGLE-3 ratio peaked at 1.69x in certain settings. The key point is that P-EAGLE keeps its gains even at high concurrency (c=128), where EAGLE-3's sequential drafting overhead erodes its speedup.
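As a quick sanity check, the speedup ratios can be recomputed from the throughput pairs quoted above. The snippet uses only the OTPS figures given in this article; the helper name `speedup` and the dictionary labels are illustrative. Ratios computed this way can differ in the last digit from rounded figures quoted in prose.

```python
def speedup(new_otps: float, old_otps: float) -> float:
    """Throughput ratio of one system over another."""
    return new_otps / old_otps


# (P-EAGLE OTPS, EAGLE-3 OTPS) pairs as quoted in the article.
reported = {
    "Speedy-Bench Code, c=8, P-EAGLE K=7": (4638, 3762),
    "c=4, P-EAGLE K=11": (3710, 3215),
}

for label, (p_eagle, eagle3) in reported.items():
    print(f"{label}: {speedup(p_eagle, eagle3):.2f}x")
```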

One-Click Deployment on SageMaker JumpStart

Four models ship with pre-trained P-EAGLE heads: GPT-OSS-120B, GPT-OSS-20B, Qwen3-Coder-30B-A3B-Instruct, and Gemma-4-31B-IT. Deploying from SageMaker JumpStart requires no manual drafter training and no custom containers. The environment variable SM_VLLM_SPECULATIVE_CONFIG accepts a JSON object with "parallel_drafting": true and "num_speculative_tokens": 3 (the default). AWS claims the output is mathematically identical to the target model's standard generation, because speculative decoding verifies every draft token against the target model before accepting it. No quality compromise, just parallel speed.
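A minimal sketch of how that environment variable's value could be assembled. Only the variable name and the two keys come from the text above; the `environment` dictionary and the way it reaches the endpoint are assumptions, and a real deployment may need further keys that this article does not list.

```python
import json

# The two keys named in the article; any others are unknown and omitted here.
speculative_config = {
    "parallel_drafting": True,
    "num_speculative_tokens": 3,  # the stated default
}

# Environment variables are strings, so the config is serialized to JSON text.
# "environment" is a hypothetical dict to hand to whatever deployment call you use.
environment = {
    "SM_VLLM_SPECULATIVE_CONFIG": json.dumps(speculative_config),
}

print(environment["SM_VLLM_SPECULATIVE_CONFIG"])
```

Serializing with `json.dumps` rather than hand-writing the string avoids the common mistake of passing Python's `True` where JSON expects `true`.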

P-EAGLE's parallel drafting removes the fundamental limit that forced practitioners to trade speculation depth for latency. With native SageMaker AI support and an open-source contribution, expect to see this technique become the default for production LLM serving - especially on reasoning workloads where median output lengths approach 3,900 tokens.


Source: Parallelize speculative decoding with P-EAGLE on Amazon SageMaker AI
Domain: aws.amazon.com
