Shopify's AI Stack Routes Around Dead Models Without Skipping a Beat

When Claude Fable 5 shut down, Shopify engineers didn't scramble - their internal LLM proxy silently switched them to Claude Opus or GPT 5.5. Farhan Thawar, Shopify's head of engineering, lays out the stack in a VentureBeat podcast, and it's the kind of infrastructure-first thinking most enterprises only dream about.

The Proxy That Makes Model Exits Invisible

Shopify buys tokens in bulk and routes every AI call through a single proxy. When a provider goes down, changes pricing, or deprecates a model, failover happens "automatically, seamlessly," per Thawar. Engineers don't even notice. The proxy also gives them reporting and the ability to "spray across different providers" without locking into any one API. Thawar's point: never be super tied to a single model vendor. A backup plan isn't enough - you need zero-touch failover.
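The article does not describe how the proxy is built, so the following is only a minimal sketch of the general pattern: try backends in priority order, fall through on failure, and keep a simple usage report. The names `FailoverProxy`, `Backend`, `ProviderError`, and the two stand-in model functions are hypothetical and are not Shopify's API.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderError(Exception):
    """Raised by a backend that is down, deprecated, or otherwise unavailable."""


@dataclass
class Backend:
    name: str                     # hypothetical label, e.g. "primary"
    call: Callable[[str], str]    # takes a prompt, returns a completion


class FailoverProxy:
    """Tries backends in priority order and records which one served each call."""

    def __init__(self, backends: list[Backend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = backends
        self.served_by: dict[str, int] = {}   # per-backend call counts for reporting

    def complete(self, prompt: str) -> str:
        errors: list[str] = []
        for backend in self.backends:
            try:
                result = backend.call(prompt)
            except ProviderError as exc:
                errors.append(f"{backend.name}: {exc}")
                continue                       # fall through to the next backend
            self.served_by[backend.name] = self.served_by.get(backend.name, 0) + 1
            return result
        raise ProviderError("all backends failed: " + "; ".join(errors))


# Stand-ins for real provider clients (assumptions for illustration only).
def retired_model(prompt: str) -> str:
    raise ProviderError("model has been shut down")


def healthy_model(prompt: str) -> str:
    return f"echo: {prompt}"


proxy = FailoverProxy([
    Backend("primary", retired_model),
    Backend("fallback", healthy_model),
])
reply = proxy.complete("hello")
print(reply)            # echo: hello
print(proxy.served_by)  # {'fallback': 1}
```

The caller only ever sees `proxy.complete(...)`, which is what lets a retired model disappear without any change to application code.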

A Distillation Pipeline That Cuts Costs Up to 30x

Thawar's team runs a dedicated distillation pipeline that turns a large teacher model into a smaller, specialized student. Engineers feed in the teacher, training data, evals, and a target model (say, Opus 4.8 distilling down to Qwen 3.5). The pipeline runs for about a day and spits out metrics on speed, cost, and accuracy. Results can be dramatic: 2x cheaper and faster in typical cases, up to 30x cheaper and faster in extreme ones. No approval process required - engineers just deploy if the tradeoff looks good. Their internal platform Tangle visualizes the pipeline as it runs.

Accuracy stays the priority. As Thawar puts it, "It isn't just about cost and latency, which are big; it's about accuracy." The distilled models power Sidekick, Shopify's AI assistant for merchants, which handles dozens of specialized subtasks. Smaller models mean less toil, faster responses, lower bills.
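To make the tradeoff concrete, here is a rough sketch of how teacher and student metrics might be compared once a pipeline run finishes. The `Metrics` fields, the numbers, and the 0.01 accuracy tolerance are all made up for illustration; the article gives only the 2x and 30x figures, not the underlying data or any decision rule.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float           # fraction of eval cases passed, 0.0 to 1.0
    cost_per_1k_calls: float  # dollars; an illustrative unit, not from the article
    latency_ms: float


def compare(teacher: Metrics, student: Metrics) -> dict[str, float]:
    """Summarize how the student stacks up against the teacher."""
    return {
        "cost_ratio": teacher.cost_per_1k_calls / student.cost_per_1k_calls,
        "speedup": teacher.latency_ms / student.latency_ms,
        "accuracy_drop": teacher.accuracy - student.accuracy,
    }


def worth_deploying(report: dict[str, float], max_accuracy_drop: float = 0.01) -> bool:
    """Accuracy gates the decision first; cost savings only count after that."""
    if report["accuracy_drop"] > max_accuracy_drop:
        return False
    return report["cost_ratio"] > 1.0


# Invented numbers chosen to reproduce the "30x" extreme case.
teacher = Metrics(accuracy=0.95, cost_per_1k_calls=60.0, latency_ms=1800.0)
student = Metrics(accuracy=0.945, cost_per_1k_calls=2.0, latency_ms=60.0)

report = compare(teacher, student)
print(report["cost_ratio"], report["speedup"])  # 30.0 30.0
print(worth_deploying(report))                  # True

# A cheap student that loses too much accuracy is rejected despite the savings.
weak_student = Metrics(accuracy=0.80, cost_per_1k_calls=1.0, latency_ms=40.0)
print(worth_deploying(compare(teacher, weak_student)))  # False
```

Checking accuracy before cost mirrors Thawar's ordering of priorities: a 60x cheaper model is still rejected if it fails the evals.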

Taming Tokenmaxxing with Circuit Breakers

Shopify also built a usage dashboard tracking who burns the most expensive tokens and who spends hours on reasoning. If a model runs for 10 hours and churns through tokens, the system pings the user: "Did you mean to spend this?" Sometimes the answer is "absolutely." Other times, the prompt stops a runaway job. It's a humane circuit breaker against the tokenmaxxing problem.
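A check like this could be sketched as follows. The 10-hour threshold comes from the article; the token threshold, the `JobUsage` fields, and the `review` function are assumptions made for illustration, and the confirmation callback stands in for however the real system pings the user.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobUsage:
    user: str
    hours_running: float
    tokens_used: int


def needs_confirmation(job: JobUsage,
                       max_hours: float = 10.0,
                       max_tokens: int = 50_000_000) -> bool:
    """Flag jobs that have run long or consumed a lot of tokens (thresholds assumed)."""
    return job.hours_running >= max_hours or job.tokens_used >= max_tokens


def review(job: JobUsage, user_confirms: Callable[[JobUsage], bool]) -> str:
    """Ask before stopping: the user can say the spend was intentional."""
    if not needs_confirmation(job):
        return "continue"
    return "continue" if user_confirms(job) else "stop"


short_job = JobUsage("alice", hours_running=0.5, tokens_used=20_000)
long_job = JobUsage("bob", hours_running=10.0, tokens_used=80_000_000)

print(review(short_job, lambda job: False))  # continue (never asked)
print(review(long_job, lambda job: True))    # continue ("absolutely")
print(review(long_job, lambda job: False))   # stop (runaway job)
```

Asking rather than killing the job outright is what keeps the breaker "humane": legitimate long-running work continues once the user confirms it.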

Thawar's dream is to push the distillation pipeline further: eventually, engineers won't specify a target model. They'll just hand the pipeline a teacher, data, and evals and say: "You tell me the right distillation target." The system might return a model small enough to run on a phone - or it might say no profitable distillation exists. That's the kind of leverage that makes model churn a feature, not a crisis.
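One way to picture that automatic target selection is a search over candidate students that returns the cheapest one still passing the evals, or nothing at all. This is a speculative sketch of an idea the article describes only as a goal; the `Candidate` fields, the candidate names, the numbers, and the 0.01 tolerance are all invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str                 # hypothetical student model label
    accuracy: float           # eval pass rate after distillation
    cost_per_1k_calls: float  # illustrative unit


def pick_target(candidates: list[Candidate],
                teacher_accuracy: float,
                teacher_cost: float,
                max_accuracy_drop: float = 0.01) -> Optional[Candidate]:
    """Return the cheapest candidate that stays accurate and beats the teacher on cost."""
    viable = [
        c for c in candidates
        if teacher_accuracy - c.accuracy <= max_accuracy_drop
        and c.cost_per_1k_calls < teacher_cost
    ]
    # default=None covers the "no profitable distillation exists" outcome.
    return min(viable, key=lambda c: c.cost_per_1k_calls, default=None)


candidates = [
    Candidate("student-small", accuracy=0.80, cost_per_1k_calls=1.0),
    Candidate("student-medium", accuracy=0.945, cost_per_1k_calls=4.0),
    Candidate("student-large", accuracy=0.95, cost_per_1k_calls=20.0),
]

best = pick_target(candidates, teacher_accuracy=0.95, teacher_cost=60.0)
print(best.name if best else "no profitable distillation")  # student-medium

none_found = pick_target(candidates[:1], teacher_accuracy=0.95, teacher_cost=60.0)
print(none_found)  # None
```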

Source: How Shopify built an AI stack that doesn't care which models survive
Domain: venturebeat.com
