AWS Ships AI Agents That Write and Profile Trainium Kernels for You

A custom softmax kernel fused with a scale operation on Trainium hit a max error of 0.000008 across four test shapes, well within bfloat16 tolerance, and the code was generated, debugged, and profiled by an AI agent rather than hand-tuned by an expert. AWS has announced Neuron Agentic Development, a collection of five skills and multiple agents that turn natural language, PyTorch, or NumPy into production-ready NKI (Neuron Kernel Interface) kernels for Trainium and Inferentia.

Five Skills That Automate the Kernel Pipeline

The package breaks the developer workflow into discrete steps: write, debug, profile, and analyze. The neuron-nki-writing skill translates high-level descriptions into correct NKI code, respecting hardware constraints such as a partition dimension of 128 and a PSUM free dimension of 512/4096. The neuron-nki-debugging skill systematically resolves all 28 NCC error codes and validates numerical parity against CPU references. Two profiling skills, neuron-nki-profiling and neuron-nki-profile-querying, capture execution traces via Neuron Explorer and then run SQL queries against the resulting parquet files to compute performance bounds and pinpoint bottleneck engines. A fifth skill, neuron-nki-docs, serves API signatures and architecture guides on demand.
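
To make the parity step concrete, here is a minimal sketch in plain NumPy of what checking a kernel against a CPU reference might look like. The function names (max_abs_error, check_parity), the tolerance, and the float32 stand-in "kernel" are illustrative assumptions; the announcement does not describe the skill's actual harness.

```python
import numpy as np

# Illustrative tolerance: 2**-7 is bfloat16's spacing just above 1.0.
# The threshold the AWS skill actually applies is not stated in the article.
ATOL = 2.0 ** -7


def max_abs_error(candidate, reference):
    """Largest elementwise absolute difference, computed in float64."""
    c = np.asarray(candidate, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    if c.shape != r.shape:
        raise ValueError(f"shape mismatch: {c.shape} vs {r.shape}")
    return float(np.max(np.abs(c - r)))


def check_parity(kernel_fn, reference_fn, shapes, atol=ATOL, seed=0):
    """Run both functions on random inputs; report (max error, passed) per shape."""
    rng = np.random.default_rng(seed)
    report = {}
    for shape in shapes:
        x = rng.standard_normal(shape).astype(np.float32)
        err = max_abs_error(kernel_fn(x), reference_fn(x))
        report[tuple(shape)] = (err, err <= atol)
    return report


def softmax(x, dtype):
    """Numerically stable softmax along the last axis in the given dtype."""
    z = x.astype(dtype)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Stand-in for a device kernel: the same math in float32, checked against float64.
report = check_parity(
    kernel_fn=lambda x: softmax(x, np.float32),
    reference_fn=lambda x: softmax(x, np.float64),
    shapes=[(2, 8, 64), (1, 128, 512)],
)
for shape, (err, ok) in report.items():
    print(shape, f"max_err={err:.2e}", "PASS" if ok else "FAIL")
```

On real hardware, kernel_fn would wrap the compiled kernel instead of a NumPy function; the comparison logic would stay the same.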

Agents orchestrate these skills autonomously. The neuron-nki-agent is the unified entry point that selects the right workflow. There are also specialized agents for writing, debugging, documentation, and profile analysis. Each can run up to 10 iterations before simplifying, so within that budget the agent keeps trying different fixes until the kernel compiles and passes correctness checks.
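
The bounded retry behavior can be pictured as a simple loop. The sketch below is one possible reading, not AWS's implementation: the callables check, fix, and simplify are hypothetical placeholders, and treating "simplifying" as a fallback step after the budget is spent is an assumption.

```python
MAX_ITERATIONS = 10  # iteration budget quoted in the article


def refine(kernel, check, fix, simplify, max_iterations=MAX_ITERATIONS):
    """Try to repair `kernel` until `check` passes or the budget runs out.

    check(kernel)    -> (ok, diagnostics)   hypothetical compile + parity check
    fix(kernel, d)   -> new kernel          hypothetical repair step
    simplify(kernel) -> simpler kernel      hypothetical fallback
    Returns (kernel, attempts_used, simplified).
    """
    for attempt in range(1, max_iterations + 1):
        ok, diagnostics = check(kernel)
        if ok:
            return kernel, attempt, False
        kernel = fix(kernel, diagnostics)
    # Budget exhausted without a passing check: fall back to something simpler.
    return simplify(kernel), max_iterations, True


# Toy usage: the "kernel" is an integer that passes once it reaches 3.
result = refine(
    kernel=0,
    check=lambda k: (k >= 3, f"value {k} is too small"),
    fix=lambda k, diagnostics: k + 1,
    simplify=lambda k: "simplified",
)
print(result)  # (3, 4, False)
```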

From PyTorch to Optimized NKI in Minutes

AWS demonstrated the workflow on a real softmax bottleneck. The prompt: "Write an NKI kernel that computes scaled softmax: softmax(x * scale) along the last dimension, for input shape [batch, seq_len, hidden_dim] in bfloat16." The agent produced a three-pass kernel (row max, sum-of-exp, normalize) using nisa.activation(np.exp, ...) for hardware-accelerated exp, float32 accumulation, and tiling with P_MAX=128 and F_MAX=2048. When run against a PyTorch reference, the kernel initially failed because nisa.tensor_tensor doesn't auto-broadcast reduction results. The agent consulted its reference patterns, identified the correct broadcast mechanism via stride-0 access views (.ap()), and rewrote the kernel. All four test shapes then passed on real Trainium hardware.
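
For readers who want to see the three-pass structure, here is a NumPy model of it. This is not the generated NKI kernel, which the article does not reproduce; it is a CPU sketch under stated assumptions. The function name is made up, inputs are float32 because NumPy has no native bfloat16, and np.broadcast_to stands in for the stride-0 access view (it also returns a view whose broadcast axis has stride 0).

```python
import numpy as np

P_MAX = 128   # rows per tile (partition dimension), value from the article
F_MAX = 2048  # columns per tile (free dimension), value from the article


def scaled_softmax_three_pass(x, scale):
    """NumPy model of softmax(x * scale) along the last axis, processed in tiles."""
    rows = x.reshape(-1, x.shape[-1])  # fold [batch, seq_len] into one row axis
    n_rows, n_cols = rows.shape
    out = np.empty((n_rows, n_cols), dtype=np.float32)

    for r0 in range(0, n_rows, P_MAX):
        # Scale once and keep all intermediate math in float32.
        block = rows[r0:r0 + P_MAX].astype(np.float32) * np.float32(scale)
        p = block.shape[0]

        # Pass 1: running row max across free-dimension tiles.
        row_max = np.full((p, 1), -np.inf, dtype=np.float32)
        for c0 in range(0, n_cols, F_MAX):
            tile = block[:, c0:c0 + F_MAX]
            row_max = np.maximum(row_max, tile.max(axis=1, keepdims=True))

        # Pass 2: sum of exp(x - max), accumulated in float32.
        row_sum = np.zeros((p, 1), dtype=np.float32)
        for c0 in range(0, n_cols, F_MAX):
            tile = block[:, c0:c0 + F_MAX]
            # Explicit broadcast of the [p, 1] reduction result to the tile shape.
            max_view = np.broadcast_to(row_max, tile.shape)
            row_sum += np.exp(tile - max_view).sum(axis=1, keepdims=True)

        # Pass 3: normalize each tile by the per-row sum.
        for c0 in range(0, n_cols, F_MAX):
            tile = block[:, c0:c0 + F_MAX]
            max_view = np.broadcast_to(row_max, tile.shape)
            sum_view = np.broadcast_to(row_sum, tile.shape)
            out[r0:r0 + p, c0:c0 + tile.shape[1]] = np.exp(tile - max_view) / sum_view

    return out.reshape(x.shape)
```

The explicit broadcast_to calls mirror the fix described above: the per-row max and sum have shape [p, 1], and the elementwise step needs them presented at the full tile shape. NumPy would broadcast implicitly here, so the calls are spelled out only to make that step visible.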

Profiling That Points to the Exact Source Lines

For a SwiGLU MLP kernel, the profile-analysis agent ran a two-part investigation. It first extracted kernel-level statistics and performance bounds, finding that the Tensor Engine dominated execution with significant idle gaps. A deeper query into the DMA engine revealed redundant and inefficient transfers: inputs were being reloaded eight times, and DMA instructions were well below target size. The agent even identified the three exact lines of NKI code responsible for the suboptimal transfers. Localizing an inefficiency to individual source lines turns a vague "profile is slow" into an actionable fix.
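
The kind of query involved might look like the sketch below. Everything in it is hypothetical: the announcement does not publish the profile schema, so the table and column names (dma, source_line, tensor, bytes), the target size, and the sample rows are invented for illustration, and an in-memory SQLite table stands in for the parquet files.

```python
import sqlite3

TARGET_DMA_BYTES = 4096  # hypothetical threshold, not taken from AWS documentation


def find_dma_hotspots(rows, target_bytes=TARGET_DMA_BYTES):
    """Group DMA transfers by source line and tensor; flag repeats and small transfers.

    rows: iterable of (source_line, tensor, bytes) tuples (hypothetical schema).
    Returns (source_line, tensor, transfers, avg_bytes) for each flagged group.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dma (source_line INTEGER, tensor TEXT, bytes INTEGER)")
    conn.executemany("INSERT INTO dma VALUES (?, ?, ?)", rows)
    query = """
        SELECT source_line, tensor, COUNT(*) AS transfers, AVG(bytes) AS avg_bytes
        FROM dma
        GROUP BY source_line, tensor
        HAVING COUNT(*) > 1 OR AVG(bytes) < ?
        ORDER BY transfers DESC, source_line
    """
    result = conn.execute(query, (target_bytes,)).fetchall()
    conn.close()
    return result


# Synthetic trace: one input reloaded eight times in small pieces,
# one small single load, and one large store that should not be flagged.
sample = [(42, "x", 512)] * 8 + [(60, "w", 1024), (57, "out", 8192)]
hotspots = find_dma_hotspots(sample)
for line, tensor, transfers, avg_bytes in hotspots:
    print(f"line {line}: {tensor} moved {transfers}x, avg {avg_bytes:.0f} bytes")
```

Grouping by source line is what ties a transfer pattern back to specific lines of kernel code, which is the step the article highlights.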

AWS is clear that this is just the first step. The vision is to make the entire profile-diagnose-refactor loop fully agentic, so developers don't have to interpret profiling results and hand-craft fixes. AWS also plans to extend the approach beyond custom kernels to model porting, operator gaps, and correctness validation at scale. The Neuron Agentic Development repository is live now: clone it, spin up a trn2.3xlarge instance, and tell an agent to write your next kernel.

Source: Stop hand-tuning kernels: How Neuron Agentic Development accelerates AWS Trainium optimizations
Domain: aws.amazon.com
