ANEForge: Direct Python Access to Apple Neural Engine Hits 90us Dispatch

A small fused program on ANEForge dispatches in about 90 microseconds, barely above the 70us hardware floor Apple's ANE daemon enforces. That's not a benchmark of a toy kernel; it's the cost of one complete call into Apple's fixed-function neural accelerator from Python, with no CoreML in the path.

What ANEForge Unlocks

Apple's Neural Engine ships in every recent Mac and iPhone, but production code can only reach it through CoreML. CoreML treats the ANE as a scheduling hint, not a guarantee; a model silently falls back to CPU or GPU if the engine is busy or the graph doesn't match. ANEForge compiles a lazy tensor graph built from 58 fused operators and 19 native bridge operators into a single ANE program. That program goes straight through the same ANE daemon and kernel-driver stack that Apple's own internal frameworks use.
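To make the "lazy tensor graph compiled into a single program" idea concrete, here is a minimal pure-Python sketch. It is not ANEForge's API: the names LazyTensor, const, compile_program, and run are hypothetical, the three operators are stand-ins for the package's fused operators, and the "program" is just an instruction list executed on the CPU. It only illustrates that building the graph computes nothing, and that the whole graph is flattened and then executed in one call.

```python
class LazyTensor:
    """Hypothetical graph node: records an op and its inputs, computes nothing."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value  # only set on constant leaves

    def __add__(self, other):
        return LazyTensor("add", (self, other))

    def __mul__(self, other):
        return LazyTensor("mul", (self, other))

    def relu(self):
        return LazyTensor("relu", (self,))


def const(x):
    """Create a constant leaf node."""
    return LazyTensor("const", value=float(x))


def compile_program(root):
    """Flatten the graph into one ordered instruction list (a single 'program')."""
    program, slot = [], {}

    def visit(node):
        if id(node) in slot:          # shared subgraphs are emitted once
            return slot[id(node)]
        args = tuple(visit(i) for i in node.inputs)
        slot[id(node)] = len(program)
        program.append((node.op, args, node.value))
        return slot[id(node)]

    visit(root)
    return program


def run(program):
    """Execute the whole program in one call, standing in for one dispatch."""
    regs = []
    for op, args, value in program:
        if op == "const":
            regs.append(value)
        elif op == "add":
            regs.append(regs[args[0]] + regs[args[1]])
        elif op == "mul":
            regs.append(regs[args[0]] * regs[args[1]])
        elif op == "relu":
            regs.append(max(0.0, regs[args[0]]))
        else:
            raise ValueError(f"unknown op: {op}")
    return regs[-1]  # the root is always emitted last


x = const(-3.0)
y = (x * const(2.0) + const(1.0)).relu()  # graph is built, nothing is computed
program = compile_program(y)              # six instructions, one program
result = run(program)                     # relu(-3 * 2 + 1) = 0.0
```

The design point the sketch tries to capture is that deferring execution lets the compiler see the whole graph before anything runs, so the per-call cost is paid once per program rather than once per operator.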

Performance Against the Dispatch Floor

ANEForge's dispatch latency lands within about 20us of the engine's per-program floor: a small fused program completes in ~90us against a floor of ~70us. A full forward pass of a pretrained ResNet-18 takes 0.33ms end to end. Vision Transformer and sentence-encoder outputs match their framework references, and a Stable Diffusion U-Net forward pass shows the approach holding up on a real-world workload.
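The arithmetic behind those figures can be spelled out in a few lines. The 70us and 90us values come from the article; the ten-dispatch comparison is a made-up illustration of why fusing work into one program matters when every dispatch pays the floor, not a measurement.

```python
DISPATCH_FLOOR_US = 70.0  # per-program dispatch floor quoted in the article
SMALL_PROGRAM_US = 90.0   # small fused program latency quoted in the article

# How far the measured dispatch sits above the floor.
overhead_us = SMALL_PROGRAM_US - DISPATCH_FLOOR_US
overhead_pct = 100.0 * overhead_us / DISPATCH_FLOOR_US


def floor_cost_us(num_dispatches, floor_us=DISPATCH_FLOOR_US):
    """Minimum time spent on dispatch alone for a given number of programs."""
    return num_dispatches * floor_us


# Hypothetical comparison: ten separate dispatches versus one fused program.
unfused_us = floor_cost_us(10)
fused_us = floor_cost_us(1)

print(f"overhead above floor: {overhead_us:.0f}us ({overhead_pct:.1f}%)")
print(f"dispatch floor saved by fusing 10 programs into 1: {unfused_us - fused_us:.0f}us")
```

Under these assumptions the small program runs roughly 29% above the floor, and each program folded into a fused one avoids another ~70us of fixed cost.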

Training and Stateful Workloads on Fixed-Function Silicon

Most people assume the ANE is inference-only. ANEForge reaches the engine's native fused attention, streams int8, int4, and sparse weights, and keeps decoder and optimizer state resident across steps. It runs the forward pass, backward pass, and optimizer update of training on the engine. That makes it the first public tool to use Apple's neural accelerator for full training loops, not just inference.

The package targets macOS 14 and later, with each release verified against a recorded OS and ANE-compiler version. ANEForge turns the ANE from a black-box scheduling option into a programmable compute unit that Python can orchestrate directly.

Source: ANEForge: Python for direct computation on the Apple Neural Engine
Domain: arxiv.org
