
A Retry Storm Burned Half a Month's AI Budget in One Day

A single day's LLM API calls cost more than an entire month of server infrastructure. Here's how a schema mismatch and automatic retries created a $XX retry storm.

llm, claude code, retry storm, idempotency, schema migration, task queue

Half of my month's AI API bill landed on a single day, and that day's LLM usage alone cost more than running the entire server fleet for a month. The person who built the app, our CFO, who shipped to production in two days with Claude Code, had no idea what happened. "Honestly, I don't remember what I did."

They didn't need to remember. The money wasn't burned by a human.

The Retry Storm That Threw Away Successes

I assumed the CFO had just hammered the AI all day—twenty-plus commits around the generation flow, death by a thousand cuts. But the app-side logs told a different story: the same heavy batch had run 21 times for a single tenant. A human doesn't press the same button 21 times.

The batch called several LLMs in sequence, then wrote results to the DB. Every LLM call returned a 200: successful, billed, paid. The failure hit on the very last step: the write referenced a column that wasn't in the DB yet, which raised a "column does not exist" error and returned a 500. The task queue saw the 500, assumed a transient glitch, and re-ran the entire batch. From scratch. Twenty-one courses eaten and paid for, then forgotten.

Two Pitfalls That Made It Unstoppable

Pitfall one: the deploy order was backwards. Code shipped assuming a new column existed; the migration that added that column hadn't been applied to production yet. Deterministic failure—the kind that never fixes itself no matter how many times you retry.
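One way to keep a backwards deploy from reaching the paid steps is a preflight check that refuses to start the batch when the schema is behind the code. The sketch below is only an illustration, not what the app actually does: it uses SQLite from the standard library so it runs anywhere, and the table name, column names, and function names are all hypothetical. On another database the lookup would go through that database's own catalog instead of PRAGMA table_info.

```python
import sqlite3


def missing_columns(conn, table, required):
    """Return the required columns that the table does not have yet.

    `table` is assumed to be a trusted constant from the code, not user input.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return sorted(set(required) - present)


def run_batch_if_schema_ready(conn, table, required, run_batch):
    """Fail before any LLM call is made if the migration has not been applied."""
    missing = missing_columns(conn, table, required)
    if missing:
        raise RuntimeError(f"schema not ready, missing columns: {missing}")
    return run_batch()
```

The point of the design is ordering: the cheap, deterministic check runs first, so a missing column costs one metadata query instead of a full round of LLM calls.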

Pitfall two: the managed task queue kept retrying out of kindness. For a transient network blip, that's correct. For a missing column, it's a money furnace. Every retry re-ran the full LLM payload because the batch wasn't idempotent—no skip-already-processed logic. Deterministic failure × automatic retry × non-idempotent work. That's the pattern that burns money quietly.
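The missing skip-already-processed logic can be sketched in a few lines. This is a minimal illustration under assumptions, not the app's real code: StepCache, run_batch, and the (tenant_id, step name) key are hypothetical, and the in-memory dict stands in for a durable store (a table or key-value store) that would have to survive across retries in a real queue worker.

```python
class StepCache:
    """Remembers the result of each paid step so a re-run can skip it.

    Hypothetical helper: the dict is a stand-in for durable storage.
    """

    def __init__(self):
        self._done = {}

    def get_or_run(self, key, fn):
        # Only call fn (the billed LLM request) the first time a key is seen.
        if key not in self._done:
            self._done[key] = fn()
        return self._done[key]


def run_batch(tenant_id, steps, cache, save):
    """Run each (name, call) step once per tenant, then persist the results."""
    results = [cache.get_or_run((tenant_id, name), call) for name, call in steps]
    # If save() raises, the cached results remain, so a retry repeats only the write.
    save(results)
    return results
```

With this shape, a retry after a failed write replays only the write. The LLM results from the first attempt are reused rather than bought again.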

Three Lessons for Anyone Running Cost-Bearing AI Workflows

First: a deterministic failure doesn't get better with repetition. On schema mismatches and 4xx-class errors, abort immediately, and cap retries. Second: any batch that calls billable APIs or LLMs must be idempotent from day one. Without it, a re-run isn't a redo; it's double billing. Third: deploy in the order schema, then code. Code before schema mass-produces deterministic errors.
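The first lesson can be sketched as a retry wrapper that separates the two kinds of failure. Everything here is an assumption for illustration: PermanentError and run_with_retries are hypothetical names, and a managed task queue would typically express the same idea through its own retry settings or a non-retryable response rather than a loop like this.

```python
import time


class PermanentError(Exception):
    """A deterministic failure (e.g. a missing column) that retrying cannot fix."""


def run_with_retries(task, max_attempts=3, base_delay=0.5):
    """Retry transient failures with a cap; give up at once on permanent ones."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PermanentError:
            # Deterministic failure: abort immediately, no further attempts.
            raise
        except Exception:
            # Possibly transient: retry, but never more than max_attempts times.
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The application code is then responsible for translating errors it knows to be deterministic, such as "column does not exist", into PermanentError before they reach the wrapper.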

The CFO's face scrunched when I explained: "It succeeded, you got billed, and then it threw the success away." That's the counterintuitive horror. Next time you deploy a cost-bearing batch, wire up idempotency and a deterministic failure detector before you hit deploy.


Source: Why did one day of AI cost more than a month of servers?
Domain: junueno.dev
