Moondream's Photon inference engine decodes at roughly 33 milliseconds per token on an NVIDIA B200, and much of that speed comes from eliminating what Moondream calls the GPU bubble.
Most AI inference stacks leave the GPU idle for part of every decode loop. The CPU plans the next step, launches kernels, and commits the sampled token, and the GPU waits while it does. Photon's trick: overlap that CPU housekeeping with the GPU compute for the next token, so the GPU is not left waiting on the CPU. Moondream reports up to 35% higher decode throughput from this fix alone.
How GPU Bubbles Form
Autoregressive generation forces a sequential decode loop: one token at a time, each depending on the last. The GPU runs the forward pass for a token, but the CPU must then pick the next token, update metadata, select requests, and launch the next forward. That CPU work is a fixed cost per token, and the GPU sits idle during that window. Moondream calls this idle stretch a "GPU bubble."
The naive approach is blocking: the CPU finishes all bookkeeping before launching the next GPU forward, and the GPU twiddles its thumbs in the meantime. The fix is to pipeline: start the next GPU forward while the CPU is still processing results from the previous step.
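To make the difference concrete, here is a minimal timing sketch of the two loops. It is a simplified model, not Photon's scheduler: the function names and the 20 ms / 7 ms figures are illustrative assumptions, chosen only so the CPU work is shorter than the GPU forward.

```python
def blocking_total_ms(steps, gpu_ms, cpu_ms):
    """Each step runs the GPU forward, then the CPU bookkeeping, strictly in sequence."""
    return steps * (gpu_ms + cpu_ms)


def pipelined_total_ms(steps, gpu_ms, cpu_ms):
    """CPU bookkeeping for step k overlaps the GPU forward for step k + 1."""
    if steps == 0:
        return 0.0
    # The first forward has nothing to overlap with, and the last round of
    # bookkeeping has no forward to hide behind. In between, each step costs
    # whichever of the two is slower.
    return gpu_ms + (steps - 1) * max(gpu_ms, cpu_ms) + cpu_ms


# Illustrative numbers only (assumed, not measurements from Moondream).
steps, gpu_ms, cpu_ms = 1000, 20.0, 7.0

blocking = blocking_total_ms(steps, gpu_ms, cpu_ms)
pipelined = pipelined_total_ms(steps, gpu_ms, cpu_ms)
speedup = blocking / pipelined

print(f"blocking:  {blocking:.0f} ms")
print(f"pipelined: {pipelined:.0f} ms")
print(f"throughput ratio: {speedup:.3f}")
```

In this model the gain depends only on the ratio of CPU time to GPU time per step: as long as the bookkeeping is shorter than the forward, it disappears from the steady-state cost.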
Pipelined Decoding: Overlap the Idle
Moondream's key insight: the sampled token doesn't need to leave the GPU before the next forward starts. The next forward can read it straight from GPU memory. The copy to CPU for detokenization, streaming, and termination checks can happen in the background. That one delay-tolerant copy is what removes the bubble.
To make pipelining safe, Photon uses three mechanisms. First, ping-pong slots: two sets of pre-allocated GPU and pinned host buffers (DecodeSlots), so the next step's forward doesn't overwrite results the CPU hasn't read yet. Second, separate copy streams: each step's device-to-host copy runs on its own stream, anchored by a CUDA event so it waits only on that step's outputs, not on the next forward. Third, CUDA graphs: fixed buffer addresses allow Photon to capture the decode step once and replay it, eliminating kernel launch overhead.
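The ping-pong idea can be sketched without a GPU. The following is a pure-Python simulation, not Photon's code: `DecodeSlot` here is a hypothetical stand-in for a buffer pair, `fake_forward` is a toy replacement for the model, and streams, events, and CUDA graphs are not modeled at all. It only shows the ordering: the next forward is launched before the CPU commits the previous token, and alternating slots keeps unread results from being overwritten.

```python
from collections import deque


class DecodeSlot:
    """Hypothetical stand-in for one pre-allocated GPU + pinned-host buffer pair."""

    def __init__(self, index):
        self.index = index
        self.token = None     # result written by the "forward" that used this slot
        self.unread = False   # True until the CPU has committed the result


def fake_forward(prev_token):
    """Toy replacement for the model: maps the previous token to the next one."""
    return (prev_token * 31 + 7) % 1000


def pipelined_decode(first_token, steps, num_slots=2):
    slots = [DecodeSlot(i) for i in range(num_slots)]
    in_flight = deque()         # slots whose results the CPU has not read yet
    committed = []              # tokens the CPU has seen, in order
    device_token = first_token  # stays "on the GPU" between forwards

    for step in range(steps):
        slot = slots[step % num_slots]
        if slot.unread:
            raise RuntimeError(f"step {step} would overwrite unread slot {slot.index}")

        # "Launch" the next forward: it reads the previous token from device memory.
        device_token = fake_forward(device_token)
        slot.token, slot.unread = device_token, True
        in_flight.append(slot)

        # While that forward runs, the CPU commits the previous step's result.
        if len(in_flight) > 1:
            done = in_flight.popleft()
            committed.append(done.token)
            done.unread = False

    # Drain whatever is still in flight after the last launch.
    while in_flight:
        done = in_flight.popleft()
        committed.append(done.token)
        done.unread = False
    return committed


tokens = pipelined_decode(first_token=1, steps=5)
print(tokens)
```

Calling `pipelined_decode(1, 5, num_slots=1)` raises a `RuntimeError` on the second step, which is the hazard the second slot exists to prevent in this toy model.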
Concrete Gains on B200
On a single NVIDIA B200, Photon achieves ~33ms per token for a vision-language model, which works out to about 30 tokens per second. Moondream reports that the pipelined approach yields up to 35% higher decode throughput than a blocking implementation, with no model changes and no hardware upgrades, just smarter scheduling.
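The reported figures can be converted into more familiar units with a few lines of arithmetic. The helper names below are illustrative, and the second calculation assumes the full 35% gain applies and comes entirely from removing idle time; it is a back-of-the-envelope reading, not a number Moondream reports.

```python
def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to decode throughput."""
    return 1000.0 / ms_per_token


def idle_fraction_removed(throughput_gain):
    """Share of each blocking step that was idle, given the relative throughput gain.

    If pipelining raises throughput by `throughput_gain` (0.35 for 35%), the
    blocking step took (1 + gain) times as long as the pipelined step, so the
    removed portion is 1 - 1 / (1 + gain) of the blocking step.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


rate = tokens_per_second(33.0)
idle = idle_fraction_removed(0.35)

print(f"{rate:.1f} tokens/s")
print(f"{idle:.1%} of each blocking step was bubble")
```

Under those assumptions, a 35% throughput gain corresponds to roughly a quarter of every blocking decode step being spent with the GPU idle.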
Moondream also notes that the technique avoids runtime GPU memory allocations (which force a device sync), and that the forwards all share one compute stream: the ping-pong slots don't parallelize GPU work, they only let the CPU and GPU overlap. Each step's copy stream runs concurrently with the next forward on that compute stream, hiding the bookkeeping latency.
With this approach, Moondream makes the case that the next big gain in LLM inference may come not from a new model architecture but from smarter scheduling of the hardware we already have.
Source: Popping the GPU Bubble
Domain: moondream.ai