KernelSight-LM Beats Roofline 7x in GPU Kernel Latency Prediction

arxiv.org · @systems_wire · yesterday · Artificial Intelligence · 6 comments

Tags: kernelsight lm, llm inference, gpu kernel, roofline model, inference simulator

KernelSight-LM predicts per-kernel GPU latency on unseen hardware to 3.8% error with just one calibration sweep — a 7.3x improvement over a comparable roofline baseline's 27.7% error.

How KernelSight-LM Decomposes LLM Inference

LLM inference couples serving-layer policies (prefix caching, continuous batching) with low-level GPU kernel execution. KernelSight-LM decomposes each serving step into four components: a roofline kernel model with a learned efficiency term, a communication model, a host-overhead model, and a discrete-event scheduler that captures caching and batching mechanics. That scheduler is what lets the simulator reproduce real-world serving behavior instead of treating each token as an independent event.
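As a rough illustration of the first component, a roofline estimate takes the slower of the compute-bound and memory-bound times and divides by an efficiency factor. The sketch below is not KernelSight-LM's code: the class names, the `efficiency` field, and the additive way `step_latency` combines kernel, communication, and host time are all assumptions made for illustration, and the discrete-event scheduler is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


@dataclass
class Kernel:
    flops: float             # floating-point operations performed
    bytes_moved: float       # bytes read from and written to memory
    efficiency: float = 1.0  # assumed fraction of the roofline achieved, in (0, 1]


def roofline_latency(kernel: Kernel, gpu: Gpu) -> float:
    """Ideal roofline time scaled by an efficiency term (seconds)."""
    compute_time = kernel.flops / gpu.peak_flops
    memory_time = kernel.bytes_moved / gpu.mem_bandwidth
    # The kernel is bound by whichever resource takes longer.
    return max(compute_time, memory_time) / kernel.efficiency


def step_latency(kernels, gpu, comm_time=0.0, host_overhead=0.0) -> float:
    """Assumed additive composition: no overlap between components."""
    kernel_time = sum(roofline_latency(k, gpu) for k in kernels)
    return kernel_time + comm_time + host_overhead


# Hypothetical hardware and kernel numbers, chosen only to make the arithmetic easy.
gpu = Gpu(peak_flops=1e12, mem_bandwidth=1e11)
matmul = Kernel(flops=2e9, bytes_moved=1e8, efficiency=0.5)
print(roofline_latency(matmul, gpu))
```

Here the kernel is compute-bound (2 ms of compute against 1 ms of memory traffic), and the assumed 50% efficiency doubles the estimate to about 4 ms.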

Two Prediction Tiers Trade Data for Accuracy

Two tiers let users choose based on available target-GPU data. The cross-generation tier uses no target-GPU measurements — just hardware specs and kernel microbenchmarks from previously profiled GPUs — and achieves 12.1% per-kernel error, a 1.8x improvement over the 22.0% roofline baseline. The target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, dropping per-kernel error to 3.8%.
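The article does not say how the efficiency term is fitted from the calibration sweep. One hypothetical approach, shown here only to make the idea concrete, is to take the median ratio of ideal roofline time to measured time across the sweep and reuse it for new kernels. The function names and the sample timings are invented for this sketch.

```python
from statistics import median


def fit_efficiency(ideal_times, measured_times):
    """Median ratio of ideal roofline time to measured time over a sweep.

    Assumption: a single scalar efficiency per kernel class; the real
    model may learn something richer.
    """
    ratios = [ideal / measured for ideal, measured in zip(ideal_times, measured_times)]
    return median(ratios)


def predict_latency(ideal_time, efficiency):
    """Scale an ideal roofline time by the fitted efficiency."""
    return ideal_time / efficiency


# Hypothetical microbenchmark sweep on the target GPU (times in ms).
ideal = [1.0, 2.0, 4.0]
measured = [2.0, 4.0, 10.0]

efficiency = fit_efficiency(ideal, measured)
print(efficiency, predict_latency(3.0, efficiency))
```

Using the median instead of the mean keeps a single outlier measurement in the sweep from skewing the fitted value.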

End-to-End Errors Match Dedicated Profiling Tools

Across six model families, the cross-generation tier yields median errors of 15.4% for time to first token (TTFT), 12.8% for time per output token (TPOT), and 3.0% for throughput. The target-measured tier improves those to 14.3%, 6.2%, and 2.7% respectively. These numbers match the accuracy of dedicated profiling tools while collecting far less on-device data.
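The summary does not define how the median errors are computed. A common reading, assumed here, is the median of absolute percentage errors between predicted and measured values across workloads. The function name and the sample numbers below are illustrative, not taken from the paper.

```python
from statistics import median


def median_abs_pct_error(predicted, measured):
    """Median of |predicted - measured| / measured, expressed in percent."""
    errors = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return median(errors)


# Hypothetical TTFT predictions against measurements (ms) for three workloads.
predicted_ttft = [110.0, 95.0, 100.0]
measured_ttft = [100.0, 100.0, 100.0]

print(median_abs_pct_error(predicted_ttft, measured_ttft))
```

The per-workload errors in this example are 10%, 5%, and 0%, so the median is about 5%.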

KernelSight-LM’s kernel-level bottleneck breakdowns let engineers plan capacity and run hardware-software co-design experiments without deploying every model variant on every GPU generation.

Source: KernelSight-LM: A Kernel-Level LLM Inference Simulator
Domain: arxiv.org
