KernelSight-LM Beats Roofline by 7x on GPU Kernel Latency Prediction

arxiv.org · @systems_wire · yesterday · Artificial Intelligence · 6 comments

KernelSight-LM models token-level execution with a roofline kernel model and, using a single calibration sweep, achieves 3.8% per-kernel latency error on unseen GPUs, a 7.3x improvement over a comparable baseline.

kernelsight lm · llm inference · gpu kernel · roofline model · inference simulator

KernelSight-LM predicts per-kernel GPU latency on unseen hardware to 3.8% error with just one calibration sweep — a 7.3x improvement over a comparable roofline baseline's 27.7% error.

How KernelSight-LM Decomposes LLM Inference

LLM inference couples serving-layer policies (prefix caching, continuous batching) with low-level GPU kernel execution. KernelSight-LM decomposes each serving step into four components: a roofline kernel model with a learned efficiency term, a communication model, a host-overhead model, and a discrete-event scheduler that captures caching and batching mechanics. That scheduler is what lets the simulator reproduce real-world serving behavior instead of treating each token as an independent event.
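As a rough illustration of the kernel and overhead components described above, the following is a minimal sketch of a textbook roofline estimate scaled by an efficiency factor. The summary does not give KernelSight-LM's actual formulas, so the names `Gpu`, `Kernel`, `roofline_latency`, and `step_latency`, the division by a single scalar efficiency, and the assumption that the components add without overlap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float      # peak compute rate in FLOP/s
    peak_bandwidth: float  # peak memory bandwidth in bytes/s


@dataclass(frozen=True)
class Kernel:
    flops: float        # floating-point operations the kernel performs
    bytes_moved: float  # bytes read from and written to device memory


def roofline_latency(kernel: Kernel, gpu: Gpu, efficiency: float = 1.0) -> float:
    """Ideal roofline time divided by an efficiency factor in (0, 1].

    Assumption: efficiency is one scalar per kernel; the paper's learned
    term may depend on kernel shape, GPU, or other features.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    compute_time = kernel.flops / gpu.peak_flops
    memory_time = kernel.bytes_moved / gpu.peak_bandwidth
    # The kernel is bound by whichever resource takes longer.
    return max(compute_time, memory_time) / efficiency


def step_latency(kernels, gpu, efficiency, comm_time=0.0, host_overhead=0.0):
    """Add kernel, communication, and host terms for one serving step.

    Assumption: the terms do not overlap in time.
    """
    kernel_time = sum(roofline_latency(k, gpu, efficiency) for k in kernels)
    return kernel_time + comm_time + host_overhead
```

For example, a kernel with 2e9 FLOPs and 1e9 bytes of traffic on a hypothetical GPU with 1e12 FLOP/s and 1e11 bytes/s is memory-bound at 10 ms ideal, or 20 ms at an efficiency of 0.5. The discrete-event scheduler is not modeled here; it would decide which kernels make up each step.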

Two Prediction Tiers Trade Data for Accuracy

Two tiers let users choose based on available target-GPU data. The cross-generation tier uses no target-GPU measurements — just hardware specs and kernel microbenchmarks from previously profiled GPUs — and achieves 12.1% per-kernel error, a 1.8x improvement over the 22.0% roofline baseline. The target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, dropping per-kernel error to 3.8%.
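The improvement factors quoted above are ratios of baseline error to model error, and the target-measured tier can be pictured as fitting a correction from one sweep of measurements. The sketch below illustrates both under stated assumptions: `fit_efficiency` is a hypothetical least-squares fit of a single scalar, not the paper's calibration procedure, and the sweep values in the usage note are invented for illustration.

```python
def relative_error_pct(predicted: float, measured: float) -> float:
    """Absolute relative error of one prediction, in percent."""
    return abs(predicted - measured) / measured * 100.0


def improvement_factor(baseline_error_pct: float, model_error_pct: float) -> float:
    """How many times smaller the model's error is than the baseline's."""
    return baseline_error_pct / model_error_pct


def fit_efficiency(ideal_times, measured_times) -> float:
    """Fit one efficiency e so that measured is approximately ideal / e.

    Assumption: a single scalar fitted by least squares through the origin.
    The slowdown s = 1 / e minimizes sum((measured - s * ideal) ** 2).
    """
    numerator = sum(i * m for i, m in zip(ideal_times, measured_times))
    denominator = sum(i * i for i in ideal_times)
    slowdown = numerator / denominator
    return 1.0 / slowdown


# Error figures reported in the article.
cross_generation = improvement_factor(22.0, 12.1)
target_measured = improvement_factor(27.7, 3.8)
print(f"{cross_generation:.1f}x, {target_measured:.1f}x")
```

With an invented sweep where ideal roofline times of 1, 2, and 4 ms are measured at 2, 4, and 8 ms, `fit_efficiency` returns 0.5. The printed ratios round to the 1.8x and 7.3x figures stated in the text.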

End-to-End Errors Match Dedicated Profiling Tools

Across six model families, the cross-generation tier yields median errors of 15.4% for TTFT, 12.8% for TPOT, and 3.0% for throughput. The target-measured tier improves those to 14.3%, 6.2%, and 2.7% respectively. These errors match the accuracy of dedicated profiling tools while requiring far less on-device data.
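The summary does not expand the metric names, so the sketch below assumes the common serving definitions: TTFT as time to first token, TPOT as time per output token after the first, and throughput as output tokens per second over the whole run. The function names and the `(arrival, token_times)` request layout are hypothetical, and how the paper aggregates its median error across model families is not stated here.

```python
from statistics import median


def ttft(arrival: float, token_times: list) -> float:
    """Time to first token: first token timestamp minus request arrival."""
    return token_times[0] - arrival


def tpot(token_times: list) -> float:
    """Time per output token, averaged over the tokens after the first."""
    if len(token_times) < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def throughput(requests: list) -> float:
    """Output tokens per second from the earliest arrival to the last token.

    Each request is an (arrival, token_times) pair with times in seconds.
    """
    total_tokens = sum(len(times) for _, times in requests)
    start = min(arrival for arrival, _ in requests)
    end = max(times[-1] for _, times in requests)
    return total_tokens / (end - start)


def median_error_pct(predicted: list, measured: list) -> float:
    """Median absolute relative error between simulated and measured values."""
    errors = [abs(p - m) / m * 100.0 for p, m in zip(predicted, measured)]
    return median(errors)
```

A simulator would be scored by computing each metric once from simulated timestamps and once from measured ones, then passing the paired values to `median_error_pct`.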

KernelSight-LM's kernel-level bottleneck breakdowns let engineers plan capacity and run hardware-software co-design experiments without deploying every model variant on every GPU generation.

Source: KernelSight-LM: A Kernel-Level LLM Inference Simulator
Domain: arxiv.org

