CXL-Hybrid Memory Boosts LLM Inference Throughput by Up to 35.7%

Up to 35.7% higher LLM inference throughput than conventional CPU offloading, with little software orchestration: that's what ITME claims by turning CXL-hybrid memory into a TB-scale, byte-addressable remote memory expansion.

The Shared Context Bottleneck

Agentic and long-context LLM workloads are pushing past single-server DRAM limits. That has forced the industry toward disaggregated shared context layers, which externalize cumulative inference state (TB-scale KV caches) across distributed clusters. Today's approach puts a DPU inside a just-a-bunch-of-flash (JBOF) architecture to accelerate NVMe-over-Fabrics (NVMe-oF) target processing. That works, but it demands significant software-level optimization and is hard to make cost-efficient. The ideal architecture for scaling shared context infrastructure is still up for grabs.
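
To get a feel for why these caches reach TB scale, here is a back-of-the-envelope sketch using the standard KV cache sizing arithmetic (one key and one value tensor per layer). The model shape and workload numbers are hypothetical and chosen only for illustration; none of them come from the ITME paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes needed to hold cached keys and values for num_tokens tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model shape (assumption): 80 layers, 8 KV heads,
# head dimension 128, 16-bit cache entries.
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=1)

# Hypothetical workload (assumption): 64 cached contexts of 128,000 tokens each.
total = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       num_tokens=64 * 128_000)

print(f"{per_token / 1024:.0f} KiB per token")   # 320 KiB per token
print(f"{total / 1e12:.2f} TB in total")         # 2.68 TB in total
```

Under these assumptions a few dozen long contexts already need more memory than a single server's DRAM typically provides, which is the pressure the article describes.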

How ITME Exploits Deterministic Access Patterns

ITME (Inference Tiered Memory Expansion) flips the script: instead of bolting on a software-heavy offload layer, it uses CXL-hybrid memory to present a massive, byte-addressable remote memory pool. The key insight is that model weights and prefix caches have deterministic access patterns, so the system knows in advance what will be accessed next. That predictability lets ITME proactively move data across the memory-storage hierarchy without complex software orchestration. Direct byte-addressability also makes the stack far simpler than NVMe-oF plus a DPU.
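
The idea of exploiting a known access order can be illustrated with a toy simulation. This is a minimal sketch and not ITME's actual mechanism: the function name `run_schedule`, the two-tier model, and the assumption that every prefetch completes before the block is needed are all simplifications introduced here.

```python
from collections import OrderedDict


def run_schedule(schedule, fast_capacity, lookahead):
    """Replay a known access schedule against a small fast tier.

    Returns the number of demand misses: accesses that found the block
    missing from the fast tier and had to wait on the slow tier.
    """
    if fast_capacity <= lookahead:
        raise ValueError("fast tier must hold more blocks than the lookahead window")

    fast = OrderedDict()  # resident block ids, least recently used first
    demand_misses = 0

    def stage(block, protected):
        # Copy a block into the fast tier, evicting the least recently
        # used resident block that is not needed in the upcoming window.
        if block in fast:
            fast.move_to_end(block)
            return
        while len(fast) >= fast_capacity:
            victim = next(b for b in fast if b not in protected)
            del fast[victim]
        fast[block] = None

    for i, block in enumerate(schedule):
        window = schedule[i : i + 1 + lookahead]  # current block plus lookahead
        protected = set(window)
        if block in fast:
            fast.move_to_end(block)
        else:
            demand_misses += 1
            stage(block, protected)
        # The schedule is deterministic, so upcoming blocks can be staged
        # now (assumed to finish before they are accessed).
        for upcoming in window[1:]:
            stage(upcoming, protected)
    return demand_misses


# Hypothetical workload: 8 weight blocks read in the same order on each
# of 3 passes, with room for only 4 blocks in the fast tier.
schedule = list(range(8)) * 3
print(run_schedule(schedule, fast_capacity=4, lookahead=0))  # 24
print(run_schedule(schedule, fast_capacity=4, lookahead=2))  # 1
```

With no lookahead, the cyclic pattern defeats least-recently-used caching and every access misses. With a two-block lookahead, only the very first access waits on the slow tier, because each later block is staged before it is needed. Real hardware would also have to account for transfer latency and bandwidth, which this sketch ignores.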

Validation on SK Hynix CMM and an FPGA Prototype

The authors validated ITME using production-grade SK Hynix CMM (CXL Memory Module) devices and PCIe Gen5 NVMe SSDs. They also built an FPGA-based hardware prototype to demonstrate functional feasibility. The result: up to 35.7% higher throughput than conventional CPU offloading when the KV cache footprint exceeds host memory limits. That's not a simulation; it was measured on real hardware.

CXL-hybrid memory won't solve every memory wall problem, but for inference workloads with predictable access patterns, it offers a path to TB-scale remote memory with minimal software pain.

Source: ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories
Domain: arxiv.org
