Five Production Patterns for LLM Inference That Dodge Rate Limits Without Retries

10 concurrent requests to Amazon Bedrock hit a 3 RPM rate limit on the primary model, yet all 10 completed — the LLM gateway automatically diverted the remaining 7 to a secondary model without a single application retry. That’s the difference between a brittle prototype and a production inference pipeline.

AWS engineers Marcos Ortiz, Khubyar Behramsha, and Sushovan Basak just published a detailed walkthrough of five resilience patterns for LLM inference, moving from native Bedrock features to a full gateway architecture. Each pattern addresses a specific failure mode: regional outages, quota exhaustion, noisy neighbors, and provider-level throttling.

Cross-Region Inference Multiplies Throughput Without Code Changes

Amazon Bedrock’s cross-Region inference (CRIS) profiles automatically route requests to the optimal AWS Region based on real-time availability, latency, and demand. In one demo, 10 requests sent to a CRIS profile were distributed across three Regions: 70% to us-east-2, 20% to us-west-2, and 10% to us-east-1. No manual traffic management, no custom retry logic.

CRIS profiles stay within a geographic boundary (US or EU), satisfying data residency rules while increasing aggregate throughput beyond any single-Region quota. For latency-tolerant workloads, global profiles can route across multiple commercial Regions, further boosting capacity.

Account Sharding Creates Fault Isolation Boundaries

When one team’s traffic spike should not tank another team’s inference, AWS recommends sharding across multiple accounts. Each account gets independent Bedrock quotas and its own CRIS profile. In the demo, two accounts independently distributed requests across three Regions, with one account sending 70% to us-east-2 and 30% to us-west-2, while the other split 20%/30%/50% across three Regions. A failure in account A does not touch account B.

This pattern is particularly useful for multi-tenant SaaS platforms where strict workload isolation is non-negotiable.

Model Fallback with LiteLLM: 10 Requests, 100% Success Despite Rate Limits

The gateway layer is where resilience gets surgical. Using LiteLLM (the open-source proxy AWS also containerizes in their Multi-Provider Generative AI Gateway), the team defined a primary model capped at 3 RPM and a fallback with 25 RPM. When 10 requests hit the gateway, the first 3 went to the primary. The instant the primary refused, LiteLLM silently routed the remaining 7 to the fallback. Result: 100% completion, 0 application changes.

This pattern also supports cost-aware fallback — throttle expensive models and spill over to cheaper alternatives.

Load Balancing and Multi-Tenant Quota Isolation Prevent Noisy Neighbors

Load balancing across multiple model instances with a shuffle strategy lets you A/B test new models or distribute load. In the demo, two primary models handled 3 requests each before hitting limits; the remaining 4 fell back to a third model, again achieving 100% success.

Multi-tenant quota isolation ensures one consumer’s burst does not degrade others. The team configured Consumer A at 3 RPM and Consumers B and C at 10 RPM each. When all three sent 5 concurrent requests, Consumer A saw only a 60% success rate (3 succeeded, 2 rate-limited), while B and C achieved 100%. The gateway enforces independent rate-limit buckets per tenant — fair allocation by design.

Making These Patterns Operational at Scale

Each pattern builds on the previous one: start with CRIS for zero-effort throughput gains, add account sharding for isolation, then introduce a gateway for fallback, load balancing, and per-consumer quotas. For production, the AWS Solution for Multi-Provider Generative AI Gateway containerizes LiteLLM on Amazon ECS or EKS with automatic scaling, AWS WAF, secrets management, and CloudWatch observability — turning these five patterns into a deployable architecture.

Source: Implementing resilience patterns with Amazon Bedrock and LLM gateway
Domain: aws.amazon.com