
Cerebrium Cuts GPU Cold Starts 80% with Memory Checkpointing

By saving and restoring fully initialized GPU containers (model weights, compiled kernels, CUDA state), Cerebrium turns a multi-minute startup into seconds.

Tags: cerebrium, gvisor, checkpointing, gpu cold start, cuda, serverless inference

An 80% faster cold start sounds like a marketing number, but Cerebrium shows the receipts: restoring a checkpointed gVisor sandbox, GPU memory included, cuts startup time from minutes to seconds for workloads like vLLM.

Why Your GPU Workload Takes Three Minutes

That three-minute cold start isn't pulling the container image—Cerebrium already solved that with a custom image runtime. The real cost is everything after the image lands: importing Python modules, loading PyTorch, assembling model weights, copying them to GPU, running torch.compile, capturing CUDA graphs, initializing the KV cache. Every single step is deterministic. Yet every scale-up pays to recompute the same result.

Cerebrium's insight: if the output of initialization is identical every time, freeze it once and restore it on demand. Their production numbers show an 80%+ reduction in cold start time for real CUDA workloads.
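For a rough sense of scale, the arithmetic can be sketched as below. The three-minute baseline and the 80% figure both come from the article; the function name is illustrative, and real numbers will vary by workload.

```python
def restored_cold_start_seconds(baseline_seconds: float, reduction: float) -> float:
    """Cold start time left after removing `reduction` (0..1) of the baseline."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_seconds * (1.0 - reduction)


# Assumed inputs: a 180-second cold start and an 80% reduction.
remaining = restored_cold_start_seconds(180.0, 0.80)
print(f"{remaining:.0f} seconds")
```

Under those assumptions a three-minute start drops to roughly 36 seconds, and anything beyond 80% pushes it lower still.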

How Checkpointing Skips the Rebuild

The mechanics are straightforward. Pause all application processes and GPU work. Dump CPU memory and GPU memory to files. Upload those files to fast durable storage. Restore does the reverse: pull files, rehydrate memory, repair state that can't survive a move, unpause. The restored process picks up exactly where it left off—PyTorch already imported, model weights resident on GPU, kernels compiled, runtime warm.
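The sequence above can be sketched as a toy, in-process model. This is not Cerebrium's code and it calls no real gVisor or CUDA API: `Sandbox`, `checkpoint`, `restore`, the file names, and the dict standing in for durable storage are all hypothetical, chosen only to make the order of steps concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Toy stand-in for a sandboxed container; not a real gVisor object."""
    cpu_memory: bytes = b""
    gpu_memory: bytes = b""
    host_state: dict = field(default_factory=dict)  # state tied to one machine
    paused: bool = False


def checkpoint(sandbox: Sandbox, storage: dict, key: str) -> None:
    sandbox.paused = True                       # 1. pause processes and GPU work
    files = {                                   # 2. dump CPU and GPU memory to "files"
        "cpu.img": bytes(sandbox.cpu_memory),
        "gpu.img": bytes(sandbox.gpu_memory),
    }
    storage[key] = files                        # 3. upload to durable storage (a dict here)


def restore(storage: dict, key: str, new_host: str) -> Sandbox:
    files = storage[key]                        # 1. pull the files
    sandbox = Sandbox(                          # 2. rehydrate memory, still paused
        cpu_memory=files["cpu.img"],
        gpu_memory=files["gpu.img"],
        paused=True,
    )
    sandbox.host_state = {"host": new_host}     # 3. repair state that can't survive a move
    sandbox.paused = False                      # 4. unpause
    return sandbox


storage = {}
warm = Sandbox(cpu_memory=b"torch imported", gpu_memory=b"weights + kernels",
               host_state={"host": "node-a"})
checkpoint(warm, storage, "demo")
restored = restore(storage, "demo", new_host="node-b")
print(restored.gpu_memory == warm.gpu_memory, restored.host_state)
```

The point of the sketch is the ordering: memory comes back byte-for-byte, while anything bound to the original machine is rebuilt before the process is allowed to run again.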

Cerebrium implements this inside their custom gVisor-based container runtime. When a container starts, the runtime checks: does a compatible checkpoint exist for this image, GPU type, machine type, runtime version? If yes, skip the normal boot and restore directly into the sandbox. If not, boot from scratch and allow the user to trigger a checkpoint once warmed.
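One way to picture the start-time check is a lookup keyed on the four attributes listed above. This is a minimal sketch under that assumption; `CheckpointKey`, `choose_start_path`, and the example values are hypothetical names, not part of Cerebrium's runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointKey:
    """A checkpoint is only reusable if every field matches exactly."""
    image: str
    gpu_type: str
    machine_type: str
    runtime_version: str


def choose_start_path(key: CheckpointKey, available: set) -> str:
    """Return 'restore' when a compatible checkpoint exists, else 'cold_boot'."""
    return "restore" if key in available else "cold_boot"


# Hypothetical example values.
available = {CheckpointKey("vllm-app:v3", "gpu-a", "machine-x", "rt-1.4")}

same = CheckpointKey("vllm-app:v3", "gpu-a", "machine-x", "rt-1.4")
other_gpu = CheckpointKey("vllm-app:v3", "gpu-b", "machine-x", "rt-1.4")

print(choose_start_path(same, available))
print(choose_start_path(other_gpu, available))
```

Treating the key as all-or-nothing means a change to any one field, such as a new GPU type, sends the container down the normal boot path rather than risking a restore into an incompatible environment.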

What It Takes to Freeze CUDA State

Making this work reliably for real GPU workloads like vLLM required extending the runtime so it can decide quickly, at container start, whether a usable checkpoint exists. Cerebrium added a checkpoint service on every host that handles storage, caching, and fallback logic. The checkpoint itself captures CPU memory, GPU memory, process state, model weights, and compiled kernels. Restoring breaks if any piece is missing or stale, so the service tracks compatibility with image versions, GPU types, and runtime versions.
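The article does not describe the service's internals, so the following is only a sketch of how storage, caching, and fallback could fit together on one host. `CheckpointService`, `start_container`, and the `fetch_remote` callable are hypothetical, and treating an `OSError` from storage as a cache miss is an assumption made for illustration.

```python
from typing import Callable, Dict, Optional


class CheckpointService:
    """Per-host lookup: local cache first, then remote storage."""

    def __init__(self, fetch_remote: Callable[[str], Optional[bytes]]):
        self._fetch_remote = fetch_remote
        self._cache: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        if key in self._cache:                  # cache hit: no network round trip
            return self._cache[key]
        try:
            blob = self._fetch_remote(key)      # cache miss: ask durable storage
        except OSError:
            blob = None                         # assumption: storage errors count as a miss
        if blob is not None:
            self._cache[key] = blob
        return blob


def start_container(service: CheckpointService, key: str,
                    cold_boot: Callable[[], str]) -> str:
    blob = service.get(key)
    if blob is None:
        return cold_boot()                      # fallback: boot from scratch
    return f"restored from {len(blob)} byte checkpoint"


remote = {"vllm-app:v3": b"\x00" * 1024}        # stand-in for remote storage
service = CheckpointService(remote.get)
print(start_container(service, "vllm-app:v3", lambda: "cold boot"))
print(start_container(service, "unknown-app", lambda: "cold boot"))
```

The design choice worth noting is that a failed or missing checkpoint never blocks startup: the worst case is the same cold boot the container would have paid without checkpointing.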

This isn't academic. Cerebrium runs large language models, real-time avatars, transcription models, and diffusion models in production. The initialization cost that pushes you to keep GPUs warm, just to avoid a three-minute wait, largely goes away. You can scale down aggressively, release GPUs, and restore from a checkpoint when traffic comes back.


Source: Reduce GVisor Cold Starts with GPU Snapshotting
Domain: cerebrium.ai
