IMAX CGLA Cuts Whisper ASR Energy 10.48x vs RTX 4090

The projected 28nm ASIC implementation of the IMAX CGLA reaches a power-delay product (PDP) of 11.58J for Whisper-tiny.en Q8_0. That is 2.35x lower than Jetson AGX Orin (27.16J) and 10.48x lower than an RTX 4090 (121.38J). Notably, IMAX is a programmable coarse-grained linear array, not a fixed-function accelerator.

Dot-Product Bottleneck: 90.6% of Whisper Execution Time

Profiling Whisper-tiny.en on an ARM Cortex-A72 shows that dot-product operations consume 90.6% of execution time in FP16 and 87.1% in Q8_0. That single kernel dominates the cost of edge ASR inference, so it is the bottleneck the IMAX team set out to offload.
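To put those fractions in perspective, Amdahl's law bounds the end-to-end speedup that offloading one kernel can deliver. The sketch below applies it to the two profiled fractions; the function name is illustrative, and the result is a bound on runtime relative to the Cortex-A72 baseline, not a figure from the paper and not a statement about energy.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float = float("inf")) -> float:
    """Overall speedup when `fraction` of runtime is sped up by `kernel_speedup`.

    With the default (infinite kernel speedup) this returns the upper bound
    1 / (1 - fraction): the non-offloaded remainder still has to run.
    """
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Dot-product share of execution time, as quoted in the article.
profiled = {"FP16": 0.906, "Q8_0": 0.871}

for fmt, fraction in profiled.items():
    print(f"{fmt}: at most {amdahl_speedup(fraction):.1f}x end-to-end speedup")
```

Even with an infinitely fast dot-product kernel, the ceiling is about 10.6x for FP16 and 7.8x for Q8_0, which is why the remaining host-side work still matters.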

IMAX CGLA Architecture and Kernel Offloading

IMAX is a Coarse-Grained Linear Array (CGLA) architecture, programmable enough to handle real ASR pipelines. The offloading scheme combines kernel mapping, local-memory sizing, and burst scheduling. Implementation details include inline FP16-to-FP32 conversion, 2-way SIMD FMA on a 64-bit datapath, column-wise multithreading, and mixed execution, in which aligned vector segments run on IMAX while residual segments run concurrently on the host CPU. The authors evaluated the design with an FPGA prototype and a 28nm ASIC projection at 840MHz.
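The mixed-execution idea can be illustrated in software. The following is a minimal sketch, not the authors' implementation: `accel_dot` and `host_dot` are hypothetical stand-ins for the IMAX kernel and the CPU path, `ALIGN = 16` is an assumed segment granularity, and Python floats (double precision) take the place of FP32 accumulators. Column-wise multithreading is not modeled.

```python
import struct
from concurrent.futures import ThreadPoolExecutor

LANES = 2   # 2-way SIMD FMA, as described in the article
ALIGN = 16  # hypothetical segment granularity; must be a multiple of LANES


def fp16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision pattern (inline conversion step)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_fp16_bits(value: float) -> int:
    """Encode a float as a half-precision bit pattern (used to build test data)."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def accel_dot(a_bits, b):
    """Stand-in for the offloaded kernel: one partial sum per SIMD lane."""
    acc = [0.0] * LANES
    for i in range(0, len(a_bits), LANES):
        for lane in range(LANES):
            acc[lane] += fp16_bits_to_float(a_bits[i + lane]) * b[i + lane]
    return sum(acc)


def host_dot(a_bits, b):
    """Stand-in for the host CPU path that handles the unaligned residual."""
    return sum(fp16_bits_to_float(x) * y for x, y in zip(a_bits, b))


def mixed_dot(a_bits, b, align=ALIGN):
    """Split at the last aligned boundary and run both parts concurrently."""
    if len(a_bits) != len(b):
        raise ValueError("vectors must have the same length")
    split = (len(a_bits) // align) * align
    with ThreadPoolExecutor(max_workers=2) as pool:
        accel = pool.submit(accel_dot, a_bits[:split], b[:split])
        host = pool.submit(host_dot, a_bits[split:], b[split:])
        return accel.result() + host.result()


# 37 elements: 32 go to the "accelerator" path, 5 stay on the "host" path.
weights = [float_to_fp16_bits(0.25 * (i % 8)) for i in range(37)]
activations = [1.0 + (i % 4) for i in range(37)]
print(mixed_dot(weights, activations))
```

The split point is the only coordination between the two paths, so the residual work can overlap with the offloaded segment instead of forcing the accelerator to handle awkward tail lengths.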

For Whisper-tiny.en, 32KB of local memory with a burst length of 16 jointly minimizes PDP and energy-delay product (EDP). That same 32KB covers 93.8% of dot-product operations in the tiny model, dropping to about 66.5% for Whisper-base.en and Whisper-small.en. The PDP gap narrows as model size grows, but IMAX remains competitive.
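For readers unfamiliar with the two metrics, PDP is power multiplied by latency (an energy, in joules) and EDP is that energy multiplied by latency again, so EDP penalizes slow configurations more heavily. The sketch below shows how a joint minimum over a configuration sweep could be found. The `Run` type is hypothetical, and the power and latency values are arbitrary placeholders chosen only so the script runs; they are not measurements from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    local_mem_kb: int
    burst_len: int
    power_w: float    # placeholder value, not a measurement
    latency_s: float  # placeholder value, not a measurement


def pdp(run: Run) -> float:
    """Power-delay product in joules: power x latency."""
    return run.power_w * run.latency_s


def edp(run: Run) -> float:
    """Energy-delay product in joule-seconds: energy x latency."""
    return pdp(run) * run.latency_s


# Illustrative sweep with made-up numbers (assumption, not data from the paper).
sweep = [
    Run(local_mem_kb=16, burst_len=8, power_w=1.0, latency_s=4.0),
    Run(local_mem_kb=32, burst_len=16, power_w=1.2, latency_s=3.0),
    Run(local_mem_kb=64, burst_len=16, power_w=1.6, latency_s=2.8),
]

best_pdp = min(sweep, key=pdp)
best_edp = min(sweep, key=edp)
print("min PDP:", best_pdp.local_mem_kb, "KB, burst", best_pdp.burst_len)
print("min EDP:", best_edp.local_mem_kb, "KB, burst", best_edp.burst_len)
print("jointly optimal:", best_pdp == best_edp)
```

A configuration is "jointly" optimal only when the same point wins on both metrics; larger memories can cut latency yet lose on PDP if their added power outweighs the time saved.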

2.35x Lower PDP Than Jetson AGX Orin

The cross-platform comparison is TDP-based: each platform's PDP is computed from its thermal design power rather than from measured draw. The baselines are Jetson AGX Orin at 27.16J and RTX 4090 at 121.38J, against 11.58J for IMAX, a clear win for local ASR. Even for the base and small models, where coverage drops, the gap still favors IMAX.
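The headline ratios follow directly from the quoted PDP values. As a quick check, the snippet below recomputes them; the dictionary keys and function name are illustrative.

```python
# PDP values in joules for Whisper-tiny.en Q8_0, as quoted in the article.
pdp_joules = {
    "IMAX 28nm ASIC (projected)": 11.58,
    "Jetson AGX Orin": 27.16,
    "RTX 4090": 121.38,
}


def pdp_ratio(baseline: str, candidate: str = "IMAX 28nm ASIC (projected)") -> float:
    """How many times larger the baseline's PDP is than the candidate's."""
    return pdp_joules[baseline] / pdp_joules[candidate]


for name in ("Jetson AGX Orin", "RTX 4090"):
    print(f"{name}: {pdp_ratio(name):.2f}x the PDP of IMAX")
```

This reproduces the 2.35x and 10.48x figures, confirming that "lower" here means the baseline PDP divided by the IMAX PDP.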

These results position IMAX as a programmable architecture for lower-PDP local ASR in the tiny-model regime. Expect this line of work to push edge speech interfaces toward silicon that does one thing well and wastes almost nothing doing it.

Source: Design and Evaluation of Energy-Efficient Whisper Dot-Product Kernel Offloading on a CGLA Architecture
Domain: arxiv.org
