
CXL-Hybrid Memory Boosts LLM Inference Throughput by Up to 35.7%

By exposing TB-scale, byte-addressable remote memory through CXL-hybrid memory, ITME reports up to 35.7% higher throughput than conventional CPU offloading on production-grade SK Hynix hardware.

sk hynix, cxl, llm inference, memory expansion, kv cache, disaggregated memory

Up to 35.7% higher LLM inference throughput than conventional CPU offloading, without a heavy software offload layer: that is what ITME claims by exposing CXL-hybrid memory as TB-scale, byte-addressable remote memory.

The Shared Context Bottleneck

Agentic and long-context LLM workloads are pushing past single-server DRAM limits. Industry has been forced toward disaggregated shared context layers that externalize cumulative inference states — TB-scale KV caches — across distributed clusters. Today's approach uses a DPU inside a just-a-bunch-of-flash (JBOF) architecture to accelerate NVMe-over-fabrics target processing. That works but carries significant software-level optimization overhead and cost-efficiency burdens. The ideal architecture for scaling shared context infrastructure is still up for grabs.
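To see how quickly KV caches reach TB scale, here is a back-of-the-envelope estimate. The model shape (80 layers, 8 KV heads, head dimension 128, FP16) and the 512 GiB host are hypothetical values chosen for illustration; they are not taken from the ITME paper. The formula is the standard one for attention KV caches: two tensors (keys and values) per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for a transformer decoder.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes FP16/BF16 storage (an assumption, not a paper value).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def exceeds_host(kv_bytes, host_dram_bytes):
    """True when the KV cache alone no longer fits in host DRAM."""
    return kv_bytes > host_dram_bytes


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=1)
shared_context = kv_cache_bytes(80, 8, 128, num_tokens=4_000_000)
host_dram = 512 * 2**30  # hypothetical 512 GiB server

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"4M cached tokens: {shared_context / 1e12:.2f} TB")
print(f"exceeds host DRAM: {exceeds_host(shared_context, host_dram)}")
```

Under these assumptions each cached token costs 320 KiB, so four million tokens of shared context across sessions come to roughly 1.31 TB, well beyond the DRAM of a single server.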

How ITME Exploits Deterministic Access Patterns

ITME (Inference Tiered Memory Expansion) flips the script: instead of bolting on a software-heavy offload layer, it leverages a CXL-hybrid memory to present a massive, byte-addressable remote memory pool. The key insight is that model weights and prefix caches have deterministic access patterns — the system knows exactly what will be accessed next. That predictability lets ITME proactively manage data movement across the memory-storage hierarchy without complex software orchestration. Direct byte-addressability simplifies the stack drastically compared to NVMe-oF + DPU.
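The value of a known access order can be illustrated with a toy two-tier simulation. This is a minimal sketch, not ITME's actual mechanism: the function name `simulate`, the LRU eviction policy, and the block-granular schedule are all assumptions made for illustration. It compares demand-only loading with staging the next few scheduled blocks into the fast tier ahead of use.

```python
from collections import OrderedDict


def simulate(schedule, fast_capacity, lookahead):
    """Count fast-tier hits and misses for a known access schedule.

    schedule      -- block ids in the order they will be accessed
    fast_capacity -- number of blocks the fast tier can hold
    lookahead     -- how many upcoming blocks to stage after each access
                     (0 means demand-only loading)
    Eviction is plain LRU; this is an illustrative choice, not ITME's policy.
    """
    fast = OrderedDict()  # block id -> True, ordered from least to most recent
    hits = misses = 0

    def stage(block):
        # Bring a block into the fast tier, evicting the LRU block if full.
        if block in fast:
            fast.move_to_end(block)
        else:
            if len(fast) >= fast_capacity:
                fast.popitem(last=False)
            fast[block] = True

    for i, block in enumerate(schedule):
        if block in fast:
            hits += 1
        else:
            misses += 1  # block must be fetched from the slow tier on demand
        stage(block)
        # Because the schedule is deterministic, upcoming blocks can be
        # staged before they are requested.
        for upcoming in schedule[i + 1 : i + 1 + lookahead]:
            stage(upcoming)
    return hits, misses


# Weights streamed layer by layer repeat the same cycle on every pass.
layer_schedule = list(range(8)) * 2
print("demand-only:", simulate(layer_schedule, fast_capacity=4, lookahead=0))
print("lookahead=2:", simulate(layer_schedule, fast_capacity=4, lookahead=2))
```

With this cyclic schedule and a four-block fast tier, demand-only LRU misses on all 16 accesses, while a lookahead of two misses only on the very first access. A real system would also have to hide transfer latency, which this sketch does not model.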

Validation on SK Hynix CMM and an FPGA Prototype

The authors validated ITME using production-grade SK Hynix CMM (CXL Memory Module) devices and PCIe Gen5 NVMe SSDs. They also built an FPGA-based hardware prototype to demonstrate functional feasibility. The result: up to 35.7% higher throughput than conventional CPU offloading when the KV cache footprint exceeds host memory. These are measurements on real hardware, not simulation results.

CXL-hybrid memory won't solve every memory wall problem, but for inference workloads with predictable access patterns, it offers a path to TB-scale remote memory with minimal software pain.


Source: ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories
Domain: arxiv.org
