FlexServe Cuts Secure LLM Inference Time to First Token by 10x on Mobile

FlexServe builds flexible resource isolation on top of ARM TrustZone to achieve 10x faster time-to-first-token than naive TrustZone-based LLM serving, without sacrificing security against a compromised OS.

FlexServe delivers an average 10.05x speedup in time to first token (TTFT) for secure on-device LLM inference compared with a naive TrustZone strawman prototype. For multi-model agent workflows, the end-to-end speedup reaches 24.30x against that same strawman. Those numbers come from a prototype implemented by the paper’s authors and compared against two TrustZone-based baseline designs.

Why TrustZone Makes LLM Inference Painful

ARM TrustZone is the standard hardware isolation mechanism for sensitive workloads on mobile SoCs. It protects model weights and user data even when the OS kernel is compromised, which makes it a natural fit for on-device LLM serving. The problem: TrustZone treats memory and the NPU as monolithic secure zones. Switching in and out of secure mode is expensive, and the NPU can’t be shared efficiently between secure and non-secure tasks. That overhead drags down inference performance, especially time to first token (TTFT).

Flexible Resource Isolation: Swap, Don’t Carve

FlexServe’s core contribution is a Flexible Resource Isolation mechanism built on two primitives. Flex-Mem lets memory pages be toggled between unprotected and protected modes at page granularity, avoiding fixed carve-outs. Flex-NPU does the same for the neural processing unit — the NPU can be dynamically assigned to secure or normal world execution. Both switches are fast enough to be performed per inference request.
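To make the idea of per-request toggling concrete, here is a toy Python model. It is only a sketch of the concept described above, not FlexServe's interface: the names `World`, `FlexPagePool`, `FlexNPU`, and `run_secure_request` are all hypothetical, and real protection changes happen in hardware and secure firmware rather than in a Python set.

```python
from dataclasses import dataclass, field
from enum import Enum


class World(Enum):
    NORMAL = "normal"
    SECURE = "secure"


@dataclass
class FlexPagePool:
    """Toy stand-in for page-granular protection (hypothetical, not the paper's API)."""
    num_pages: int
    protected: set = field(default_factory=set)

    def _check(self, page: int) -> None:
        if not 0 <= page < self.num_pages:
            raise IndexError(f"page {page} out of range")

    def protect(self, pages) -> None:
        for page in pages:
            self._check(page)
        self.protected.update(pages)

    def unprotect(self, pages) -> None:
        self.protected.difference_update(pages)

    def can_access(self, page: int, world: World) -> bool:
        # Secure world sees every page; normal world only sees unprotected ones.
        self._check(page)
        return world is World.SECURE or page not in self.protected


@dataclass
class FlexNPU:
    """Toy stand-in for an NPU that can be handed to either world."""
    owner: World = World.NORMAL

    def assign(self, world: World) -> None:
        self.owner = world


def run_secure_request(pool: FlexPagePool, npu: FlexNPU, pages) -> bool:
    """Protect pages and claim the NPU for one request, then hand both back.

    Returns True if the normal world was locked out of every page during the request.
    This toy model ignores what happens to page contents on release.
    """
    pool.protect(pages)
    npu.assign(World.SECURE)
    try:
        return all(not pool.can_access(p, World.NORMAL) for p in pages)
    finally:
        npu.assign(World.NORMAL)
        pool.unprotect(pages)
```

The contrast with a fixed carve-out is that nothing here is reserved ahead of time: pages and the NPU belong to the normal world until a request needs them.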

On top of that, the system adds an LLM-Aware Memory Manager that prefetches model weights into Flex-Mem pages, and a Secure Inference Pipeline that overlaps NPU execution with memory transfers. A Multi-Model Scheduler optimizes task ordering for agent-style workflows that chain multiple models.
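Why overlapping transfers with NPU execution helps can be seen in a simple timing model. The sketch below is an illustration under stated assumptions, not the paper's pipeline: the function names and the per-layer millisecond figures are made up, loads are assumed to run back to back on one channel, and buffering is assumed to be unlimited so a load never waits for compute.

```python
def sequential_latency(load_ms, compute_ms):
    """Load everything first, then compute everything."""
    return sum(load_ms) + sum(compute_ms)


def pipelined_latency(load_ms, compute_ms):
    """Overlap the load of later layers with the compute of earlier ones.

    A layer can start computing only after its own weights are loaded
    and the previous layer has finished computing.
    """
    if len(load_ms) != len(compute_ms):
        raise ValueError("need one load time and one compute time per layer")
    load_done = 0.0
    compute_done = 0.0
    for load, compute in zip(load_ms, compute_ms):
        load_done += load
        compute_done = max(compute_done, load_done) + compute
    return compute_done


# Hypothetical per-layer costs in milliseconds.
loads = [4, 4, 4]
computes = [3, 3, 3]
print(sequential_latency(loads, computes))  # 21
print(pipelined_latency(loads, computes))   # 15.0
```

In this model the pipelined total is bounded below by whichever stage is slower overall, which is why prefetching weights matters most when loading dominates.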

What the Numbers Actually Mean

Against a naive strawman that loads the entire model into TrustZone’s secure memory before inference (the obvious but slow approach), FlexServe averages 10.05x faster TTFT. Even against an optimized strawman that uses pipelining and a statically allocated secure NPU, FlexServe still achieves a 2.44x TTFT improvement. For multi-model agent chains (e.g., a planner followed by a generator), the end-to-end speedup over the optimized baseline reaches 4.05x.
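A speedup factor is easier to interpret as the share of baseline latency that remains. The short sketch below does that arithmetic for the two TTFT figures quoted above. The helper names are hypothetical, and the implied gap between the two baselines is only a rough indication, since the reported speedups are averages over workloads and dividing averages is not exact.

```python
def remaining_fraction(speedup: float) -> float:
    """Fraction of the baseline latency left after a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def implied_ratio(speedup_vs_a: float, speedup_vs_b: float) -> float:
    """Rough speed of baseline B relative to baseline A, implied by two speedups."""
    return speedup_vs_a / speedup_vs_b


naive, optimized = 10.05, 2.44  # TTFT speedups quoted in the text

print(f"vs naive:     {remaining_fraction(naive):.1%} of baseline TTFT remains")
print(f"vs optimized: {remaining_fraction(optimized):.1%} of baseline TTFT remains")
print(f"optimized strawman is roughly {implied_ratio(naive, optimized):.2f}x faster than naive")
```

Read this way, a 10.05x speedup leaves about a tenth of the naive baseline's TTFT, while a 2.44x speedup leaves about two fifths of the optimized baseline's.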

Those results come from a mobile prototype; the paper doesn’t name the exact SoC, but the technique should carry over to other SoCs that support ARM TrustZone.

FlexServe shows that securing on-device LLM inference doesn’t have to mean accepting crippled performance. Expect isolation primitives like these to show up in production mobile stacks as on-device agents become the norm.


Source: FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
Domain: arxiv.org
