CloakLM Obfuscates GPU Memory to Foil Model Theft from PCIe Snooping

arxiv.org · @systems_wire · 3 hours ago · Cybersecurity · 1 comment

Attackers can reconstruct entire DNNs from passive PCIe observation or HBM dumps because model weights sit in large, contiguous memory regions. CloakLM breaks that regularity with software-only shuffling and page...

cloaklm · gpu memory obfuscation · model exfiltration · confidential computing · vllm · pytorch

Hermes already demonstrated lossless DNN reconstruction from passive PCIe observation. TunnelS can exfiltrate HBM contents at high throughput via driver-level access without disrupting inference. Both attacks exploit one invariant: model weights are stored in large, contiguous, repeatedly accessed memory regions.

CloakLM removes that invariant without hardware changes. It is a software-only memory-obfuscation framework that combines three mechanisms: PCIe traffic shaping, inter- and intra-layer weight shuffling, and physical HBM page remapping. Authorized execution sees a valid virtual memory layout at what the authors describe as near-native performance; unauthorized observers see fragmented, semantically incoherent state.
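As a rough illustration of that split view, the toy model below scatters a logically contiguous buffer across randomly chosen physical pages and keeps a page table for the authorized reader. It is a sketch under stated assumptions, not CloakLM's implementation: PAGE_SIZE, scatter, and read_virtual are hypothetical names, the "HBM" is a plain Python list, and random.Random stands in for whatever keyed randomness a real system would use.

```python
import random

PAGE_SIZE = 4  # values per page; tiny on purpose so the layout is easy to inspect


def scatter(buffer, num_physical_pages, seed):
    """Place a logically contiguous buffer on randomly chosen physical pages.

    Hypothetical helper for illustration only. Returns the simulated physical
    memory and the page table (logical page index -> physical page index).
    """
    num_logical = (len(buffer) + PAGE_SIZE - 1) // PAGE_SIZE
    rng = random.Random(seed)  # assumption: not a cryptographic source
    page_table = rng.sample(range(num_physical_pages), num_logical)

    physical = [None] * (num_physical_pages * PAGE_SIZE)  # None marks unused slots
    for logical_page, physical_page in enumerate(page_table):
        chunk = buffer[logical_page * PAGE_SIZE:(logical_page + 1) * PAGE_SIZE]
        start = physical_page * PAGE_SIZE
        physical[start:start + len(chunk)] = chunk
    return physical, page_table


def read_virtual(physical, page_table, length):
    """Authorized view: walk the page table to rebuild the contiguous buffer."""
    out = []
    for physical_page in page_table:
        start = physical_page * PAGE_SIZE
        out.extend(physical[start:start + PAGE_SIZE])
    return out[:length]


weights = list(range(32))  # stand-in for one weight buffer
physical, page_table = scatter(weights, num_physical_pages=64, seed=7)
restored = read_virtual(physical, page_table, len(weights))
raw_dump = [v for v in physical if v is not None]  # what a linear scan would collect
```

A reader holding page_table recovers the buffer in its original order, while a linear scan of physical sees the same values in page-scattered order with unused gaps between them. Note that in this toy the values inside a page stay in order, which is why page remapping alone is only one of the three mechanisms.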

Three Knobs to Fragment the Memory Layout

First, PCIe traffic shaping inserts dummy transfers and reorders legitimate memory transactions so that passive snooping no longer yields a clean sequence of weight reads. Second, inter-layer shuffling permutes the order in which weight tiles are placed across HBM pages, while intra-layer shuffling scrambles coefficients within each tile. Third, physical page remapping maps logically contiguous weight buffers to physically scattered HBM pages, so even a raw HBM dump looks like noise.
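The two shuffling levels can be sketched with plain permutations. The example below is a minimal illustration, assuming a layer's weights are already split into equal-sized tiles; cloak, uncloak, and the seed-based key are hypothetical and are not taken from the paper. It permutes the order of tiles (the inter-layer step as described above) and then the coefficients inside each tile (the intra-layer step), keeping the inverse permutations as the secret needed to undo both.

```python
import random


def make_permutation(n, rng):
    """Return a random permutation of range(n) together with its inverse."""
    perm = list(range(n))
    rng.shuffle(perm)
    inverse = [0] * n
    for dst, src in enumerate(perm):
        inverse[src] = dst
    return perm, inverse


def apply_permutation(items, perm):
    """Position i of the result holds items[perm[i]]."""
    return [items[p] for p in perm]


def cloak(tiles, seed):
    """Shuffle tile order, then shuffle coefficients within each tile.

    Hypothetical helper. random.Random is used for reproducibility only;
    a real system would need a secret key and a stronger source of randomness.
    """
    rng = random.Random(seed)
    tile_perm, tile_inv = make_permutation(len(tiles), rng)
    placed = apply_permutation(tiles, tile_perm)

    stored, coeff_invs = [], []
    for tile in placed:
        perm, inv = make_permutation(len(tile), rng)
        stored.append(apply_permutation(tile, perm))
        coeff_invs.append(inv)
    return stored, (tile_inv, coeff_invs)


def uncloak(stored, key):
    """Authorized path: undo the per-tile shuffle, then the tile-order shuffle."""
    tile_inv, coeff_invs = key
    placed = [apply_permutation(t, inv) for t, inv in zip(stored, coeff_invs)]
    return apply_permutation(placed, tile_inv)


# Eight tiles of sixteen coefficients each, with values that reveal their origin.
tiles = [[float(16 * t + c) for c in range(16)] for t in range(8)]
stored, key = cloak(tiles, seed=1234)
recovered = uncloak(stored, key)
```

Here recovered equals tiles exactly, while stored holds the same coefficients in an order that is only meaningful to a holder of key. How an actual serving stack would fold the inverse permutation into its kernels so that it does not materialize the unshuffled weights is outside what this sketch shows.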

CloakLM integrates directly with vLLM and PyTorch, two workhorses of LLM serving. The authors position it as complementary to confidential computing (e.g., AMD SEV-SNP or Intel TDX) rather than a replacement.

Near-Native Performance Under Attack

Evaluation used distributed inference workloads on LLaMA and Qwen models. The abstract describes the overhead as "near-native" but gives no specific latency or throughput numbers; the implied claim is that any performance regression is small enough not to undermine adoption. On the security side, the authors report that CloakLM substantially increases resistance to PCIe snooping and HBM dump attacks, making inference-time model exfiltration significantly less practical.

This work arrives as model providers are moving to shared GPU infrastructure where the host-to-GPU interconnect, accelerator fabric, and neighboring components are outside the tenant trust boundary. CloakLM gives those providers a software-only option that requires neither new silicon nor changes that break the existing serving stack.


Source: CloakLM: Obfuscating GPU Memory Layout to Mitigate Model Exfiltration for Serving
Domain: arxiv.org
