On end-to-end autoregressive generation with Qwen3-1.7B, AgentCompile averages a 5.66x speedup over PyTorch eager across five workloads, with no hand-tuned kernels required.
AgentCompile isn’t another “LLM writes CUDA” gimmick. The LLM never emits final code. Instead, it outputs semantic labels, candidate priorities, parameter hints, and risk annotations, all of which feed a traditional compiler pipeline as search metadata. The real work happens through templates, interface and hardware-constraint checks, empirical validation, and latency-based selection.
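To make the four kinds of advice concrete, here is a minimal Python sketch of how such metadata could be represented and turned into a search order. The class names, field names, and example values are all hypothetical; the source does not describe AgentCompile's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CandidateAdvice:
    """Advice about one candidate. Every field name here is an assumption."""
    family: str                                      # implementation family to try
    priority: float                                  # higher means "try earlier"
    param_hints: dict = field(default_factory=dict)  # e.g. suggested tile sizes
    risk: str = "low"                                # free-form risk annotation


@dataclass(frozen=True)
class RegionAdvice:
    """Advice about one compiler region (hypothetical structure)."""
    region_id: str
    semantic_label: str
    candidates: tuple = ()


def search_order(advice: RegionAdvice) -> list:
    """Return candidate families ordered from highest to lowest priority."""
    ranked = sorted(advice.candidates, key=lambda c: c.priority, reverse=True)
    return [c.family for c in ranked]


# Illustrative values only; none of these come from the paper.
advice = RegionAdvice(
    region_id="r0",
    semantic_label="attention",
    candidates=(
        CandidateAdvice("tiled", 0.4),
        CandidateAdvice("fused", 0.9, {"block": 128}, "medium"),
    ),
)
order = search_order(advice)
```

Nothing in this structure is executable code from the model. It only tells the search where to look first, which matches the article's point that the LLM's output is metadata rather than a kernel.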
How a Compiler Uses LLM Advice Without Blind Trust
The authors treat the LLM as a smart search adviser, not a code generator. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes which specializations are worth trying and which CUDA implementation families are plausible. The compiler then materializes the candidates, checks them against hardware constraints, and validates each one empirically.
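The advise-then-verify loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not AgentCompile's implementation: plain Python callables stand in for CUDA candidates, and `select_candidate`, `fits_hardware`, and `measure` are hypothetical names.

```python
import time


def wall_clock_latency(fn, inputs, repeats=5):
    """Best-of-N wall-clock time in seconds for running fn over all inputs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        best = min(best, time.perf_counter() - start)
    return best


def select_candidate(candidates, reference, inputs, fits_hardware,
                     measure=wall_clock_latency):
    """Pick the fastest candidate that passes every check, or None.

    candidates: list of (name, llm_priority, implementation) tuples.
    """
    best_name, best_latency = None, float("inf")
    # The priority only orders the search; it never bypasses a check.
    for name, _priority, fn in sorted(candidates, key=lambda c: c[1], reverse=True):
        if not fits_hardware(name):                     # hardware-constraint check
            continue
        if any(fn(x) != reference(x) for x in inputs):  # empirical validation
            continue
        latency = measure(fn, inputs)                   # latency-based selection
        if latency < best_latency:
            best_name, best_latency = name, latency
    return best_name


def reference(x):
    return x * 2

def off_by_one(x):      # confidently recommended, but wrong
    return x * 2 + 1

def double_shift(x):
    return x << 1

def double_add(x):
    return x + x

candidates = [
    ("off_by_one", 0.9, off_by_one),
    ("too_big", 0.8, double_add),   # rejected by the hardware check below
    ("shift", 0.5, double_shift),
    ("add", 0.3, double_add),
]
# Stub latencies keep the example deterministic; real use would keep the default.
stub = {double_shift: 1.0, double_add: 2.0}
chosen = select_candidate(
    candidates, reference, inputs=range(1, 5),
    fits_hardware=lambda name: name != "too_big",
    measure=lambda fn, inputs: stub[fn],
)
rejected = select_candidate(
    [("off_by_one", 0.9, off_by_one)], reference, range(1, 5), lambda name: True)
```

The highest-priority candidate here is incorrect and is discarded during validation, so a bad recommendation costs search time but never reaches the output.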
If specialization is unsupported or unprofitable, AgentCompile falls back gracefully. This design sidesteps the biggest problem with LLM-generated kernels: hallucinated or incorrect code that silently degrades performance. The LLM guides the search; the compiler validates the result.
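A fallback rule of this kind might look like the sketch below. The function name, the millisecond units, and the `min_speedup` threshold are assumptions made for illustration; the source does not say how AgentCompile decides that a specialization is unprofitable.

```python
from typing import Optional


def choose_path(baseline_ms: float, specialized_ms: Optional[float],
                min_speedup: float = 1.05) -> str:
    """Return "specialized" only if a validated candidate is clearly faster."""
    if specialized_ms is None:
        # Unsupported: no candidate survived the constraint and validation checks.
        return "fallback"
    if specialized_ms <= 0 or baseline_ms <= 0:
        raise ValueError("latencies must be positive")
    if baseline_ms / specialized_ms < min_speedup:
        # Unprofitable: the measured gain is below the (assumed) threshold.
        return "fallback"
    return "specialized"


unsupported = choose_path(10.0, None)
unprofitable = choose_path(10.0, 9.9)
profitable = choose_path(10.0, 4.0)
```

Requiring a margin above 1.0x is a design choice in this sketch: it keeps measurement noise from flipping a region onto a specialized path that is not reliably faster.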
What the Benchmarks Actually Show
Across five representative workloads, the average end-to-end speedups over PyTorch eager mode were 5.66x for Qwen3-1.7B, 4.05x for Qwen3-4B, and 4.26x for Llama-3.2-1B-Instruct. These aren’t synthetic microbenchmarks; they reflect full autoregressive generation pipelines.
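For readers who want to reproduce this kind of figure, the arithmetic is simple: per-workload speedup is eager latency divided by compiled latency, and the headline number is an average over workloads. The sketch below uses placeholder latencies that are not taken from the paper, and it shows both the arithmetic and the geometric mean because the source does not say which one the authors report.

```python
from statistics import geometric_mean, mean


def speedups(eager_s, compiled_s):
    """Per-workload speedup: eager latency divided by compiled latency."""
    return {name: eager_s[name] / compiled_s[name] for name in eager_s}


# Placeholder latencies in seconds, chosen only to make the arithmetic easy.
eager_s = {"workload_a": 8.0, "workload_b": 9.0}
compiled_s = {"workload_a": 2.0, "workload_b": 3.0}

per_workload = speedups(eager_s, compiled_s)       # 4.0x and 3.0x
arithmetic = mean(per_workload.values())           # 3.5
geometric = geometric_mean(per_workload.values())  # sqrt(12), about 3.46
```

The two averages differ even in this tiny example, so it is worth checking which one a paper uses before comparing its headline speedup with your own measurements.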
AgentCompile doesn’t require custom kernels or hand-tuned CUDA libraries. It compiles the model graph directly into CUDA implementations, selecting the fastest candidate per region. The authors plan to open-source the project, giving the community a tool that turns LLM advisory metadata into measurable latency wins without sacrificing correctness.
Source: AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference
Domain: arxiv.org