SageMaker AI Container Caching Cuts GenAI Scale-Out Latency by Over 50%

aws.amazon.com · @frontier_wire · 2 hours ago · Machine Learning · 1 comment

Container caching removes the image pull step, cutting end-to-end startup latency from 525 seconds to 258 seconds for a Qwen3-8B model on ml.g6.2xlarge.

amazon sagemaker · aws · container caching · inference scaling · generative ai · large language models

For a Qwen3-8B model on an ml.g6.2xlarge instance using the Large Model Inference (LMI) container (17.7 GB compressed), Amazon SageMaker AI's new container caching cuts scale-out startup from 525 seconds to 258 seconds, a 51% reduction. That's not a synthetic benchmark; it's what you get when you remove the container image pull from the critical path.

How container caching eliminates the image pull bottleneck

Before caching, scaling out to a new instance meant pulling the container image from Amazon ECR (333 seconds) and downloading model artifacts from Amazon S3 (168 seconds) in parallel. The two transfers competed for network bandwidth, and end-to-end startup took 525 seconds. After caching, the container image is already local, so pull time is 0 seconds. Freed from bandwidth contention, the model artifact download drops from 168 seconds to 77 seconds, and end-to-end startup falls to 258 seconds. Both totals are larger than the transfer times because end-to-end startup also includes steps other than these two downloads.
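The arithmetic behind those figures can be checked with a short sketch. The numbers are the ones quoted above; the function names are illustrative, and the "other steps" remainder assumes the two transfers overlap fully, which the article implies but does not break down.

```python
# Timings quoted in the article for Qwen3-8B on ml.g6.2xlarge, in seconds.
BEFORE = {"image_pull": 333, "model_download": 168, "end_to_end": 525}
AFTER = {"image_pull": 0, "model_download": 77, "end_to_end": 258}


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def summarize() -> dict:
    """Percentage reduction for each quoted phase, rounded to one decimal."""
    return {phase: round(reduction_pct(BEFORE[phase], AFTER[phase]), 1)
            for phase in BEFORE}


def other_steps(timings: dict) -> int:
    """Startup time not explained by the two transfers.

    Assumption: image pull and model download overlap fully, so together
    they cost max(pull, download) seconds of wall-clock time.
    """
    return timings["end_to_end"] - max(timings["image_pull"],
                                       timings["model_download"])


if __name__ == "__main__":
    for phase, pct in summarize().items():
        print(f"{phase}: {pct}% faster")
    print("other steps before:", other_steps(BEFORE), "s")
    print("other steps after:", other_steps(AFTER), "s")
```

Under that overlap assumption, the time outside the two transfers stays roughly constant before and after caching, which fits the claim that the saving comes from the image pull itself.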

Caching works for both single model endpoints and inference component-based endpoints. If a cached image isn't available, SageMaker falls back to pulling from ECR automatically - scaling never blocks.
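The fallback behavior is a standard cache-or-fetch pattern. The sketch below is only an illustration of that pattern, not SageMaker's implementation: `resolve_image`, `cache_lookup`, and `pull_from_registry` are hypothetical names, and the real service handles this internally with no user code.

```python
from typing import Callable, Optional, Tuple


def resolve_image(
    image_uri: str,
    cache_lookup: Callable[[str], Optional[str]],
    pull_from_registry: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (local_path, source) for a container image.

    Hypothetical helper: uses the cached copy when one exists and
    otherwise pulls from the registry, so a cache miss slows startup
    but never blocks it.
    """
    cached_path = cache_lookup(image_uri)
    if cached_path is not None:
        return cached_path, "cache"
    return pull_from_registry(image_uri), "registry"


if __name__ == "__main__":
    # Toy stand-ins for a local cache and a registry pull.
    cache = {"example/lmi:latest": "/var/cache/images/lmi"}
    path, source = resolve_image(
        "example/lmi:latest",
        cache.get,
        lambda uri: f"/var/lib/images/{uri}",
    )
    print(path, source)
```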

Real-world performance gains from early access customers

Three early access customers tested container caching on production endpoints. Customer 1 ran an ml.g4dn.xlarge instance with a 15.7 GB image and no model weights (0 GB): P50 scale-out latency dropped from 381 seconds to 134 seconds, a 65% reduction. Customer 2, on an ml.g5.2xlarge with a 17.5 GB image and a 5.8 GB model, saw P50 fall from 346 seconds to 164 seconds (about 53%). Customer 3, on an ml.g5.xlarge with a 10.6 GB image and a 6.5 GB model, cut P50 from 346 seconds to 216 seconds (38%). The gain varies with image size, model size, and instance type: the largest reduction came from the endpoint with no model weights to download, and the smallest from the endpoint with the smallest image.
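The customer figures can be recomputed the same way. This sketch uses only the numbers quoted above; the tuple layout and function names are illustrative.

```python
# (label, instance type, image GB, model GB, P50 before s, P50 after s)
CUSTOMERS = [
    ("Customer 1", "ml.g4dn.xlarge", 15.7, 0.0, 381, 134),
    ("Customer 2", "ml.g5.2xlarge", 17.5, 5.8, 346, 164),
    ("Customer 3", "ml.g5.xlarge", 10.6, 6.5, 346, 216),
]


def p50_reduction_pct(before: float, after: float) -> float:
    """Percentage drop in P50 latency, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


def reductions() -> dict:
    """Map each customer label to its P50 reduction in percent."""
    return {label: p50_reduction_pct(before, after)
            for label, _instance, _image, _model, before, after in CUSTOMERS}


if __name__ == "__main__":
    for label, instance, image_gb, model_gb, before, after in CUSTOMERS:
        print(f"{label} ({instance}, image {image_gb} GB, model {model_gb} GB): "
              f"{before}s -> {after}s, {p50_reduction_pct(before, after)}% lower")
```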

Combining all three auto scaling optimizations

Container caching is the third piece of a larger puzzle. Sub-minute CloudWatch metrics detect scale-out needs 6x faster than standard 1-minute metrics. A data cache for inference components removes image and model download when placing new model copies on existing instances. Now container caching removes image pull when launching entirely new instances. Together, they turn minutes of cold start into rapid, predictable scaling.
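Of the three pieces, only the sub-minute metrics need to be opted into through a scaling policy. As a hedged sketch, the function below builds the arguments for a target tracking policy that uses a high-resolution concurrency metric. The endpoint name, variant name, target value, and cooldowns are placeholders, and the predefined metric type shown is an assumption to verify against the current Application Auto Scaling documentation; the article itself does not name it.

```python
def high_resolution_policy(endpoint_name: str, variant_name: str,
                           target_concurrency: float) -> dict:
    """Build keyword arguments for a target tracking scaling policy.

    The result is intended for Application Auto Scaling's
    put_scaling_policy call; nothing here contacts AWS.
    """
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-concurrency",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_concurrency,
            "PredefinedMetricSpecification": {
                # Assumed sub-minute metric name; check the docs before use.
                "PredefinedMetricType":
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution",
            },
            # Placeholder cooldowns in seconds, not recommendations.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }


if __name__ == "__main__":
    # "my-endpoint" and "AllTraffic" are example names only.
    print(high_resolution_policy("my-endpoint", "AllTraffic", 5.0))
```

In practice the dictionary would be passed as `put_scaling_policy(**kwargs)` on a boto3 `application-autoscaling` client, after the variant has been registered as a scalable target.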

No configuration changes are required - container caching activates automatically on supported accelerator instance types (like ml.g6, ml.g5, ml.g4dn) across all commercial AWS Regions. Security isolation per customer endpoint is maintained, and caches are purged when endpoints are deleted.

With sub-minute metrics and two caching layers, SageMaker AI now responds to traffic spikes predictably; expect further reductions as AWS continues investing in the remaining bottlenecks.


Source: Introducing container caching in Amazon SageMaker AI for faster model scaling
Domain: aws.amazon.com
