A Retry Storm Burned Half a Month's AI Budget in One Day

Half of my month's AI API bill landed on a single day, and that day's LLM usage alone cost more than running the entire server fleet for a month. The person who built the app behind it, our CFO, who had shipped it to production in two days with Claude Code, had no idea what had happened. "Honestly, I don't remember what I did."

They didn't need to remember. The money wasn't burned by a human.

The Retry Storm That Threw Away Successes

I assumed the CFO had just hammered the AI all day—twenty-plus commits around the generation flow, death by a thousand cuts. But the app-side logs told a different story: the same heavy batch had run 21 times for a single tenant. A human doesn't press the same button 21 times.

The batch called several LLMs in sequence, then wrote the results to the DB. Every LLM call returned a 200: successful, billed, paid. The failure hit on the very last step, where a column that wasn't in the DB yet triggered a "column does not exist" error and a 500. The task queue saw the 500, assumed a transient glitch, and re-ran the entire batch from scratch. Twenty-one courses eaten and paid for, then forgotten.

Two Pitfalls That Made It Unstoppable

Pitfall one: the deploy order was backwards. Code shipped assuming a new column existed; the migration that added that column hadn't been applied to production yet. Deterministic failure—the kind that never fixes itself no matter how many times you retry.
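One way to catch this ordering mistake before any money is spent is a preflight check that compares the columns the code expects against the columns the database actually has. The following is a minimal sketch, not the author's implementation. It assumes SQLite through Python's standard sqlite3 module so that it runs on its own; the table name, column names, and the SchemaNotReady exception are hypothetical. On PostgreSQL the same idea would query information_schema.columns instead of PRAGMA table_info.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
REQUIRED_COLUMNS = {"generation_results": {"id", "tenant_id", "summary"}}


class SchemaNotReady(Exception):
    """Raised when the database is missing columns the code depends on."""


def missing_columns(conn, required):
    """Return {table: missing column names} for every table that falls short."""
    missing = {}
    for table, columns in required.items():
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # A table that does not exist yields no rows, so all its columns count as missing.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}
        gap = columns - present
        if gap:
            missing[table] = gap
    return missing


def assert_schema_ready(conn, required=REQUIRED_COLUMNS):
    """Fail fast, before any billed work starts, if the schema lags the code."""
    gap = missing_columns(conn, required)
    if gap:
        raise SchemaNotReady(f"missing columns: {gap}")
```

Calling assert_schema_ready(conn) as the first line of the batch means a missing migration fails before the first LLM call rather than after the last one.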

Pitfall two: the managed task queue kept retrying out of kindness. For a transient network blip, that's correct. For a missing column, it's a money furnace. Every retry re-ran the full LLM payload because the batch wasn't idempotent—no skip-already-processed logic. Deterministic failure × automatic retry × non-idempotent work. That's the pattern that burns money quietly.
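The missing skip-already-processed logic could look roughly like the sketch below: each LLM step's output is stored under a key as soon as it returns, so a re-run reuses the stored output instead of paying again. This is an illustration under assumptions, not the original code. The llm_step_results table, the batch_key format, and the call_llm callable are all hypothetical, and SQLite stands in for whatever database the app uses.

```python
import sqlite3


def init_store(conn):
    """Create the table that records completed, already-paid-for steps."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS llm_step_results ("
        " batch_key TEXT NOT NULL,"
        " step TEXT NOT NULL,"
        " output TEXT NOT NULL,"
        " PRIMARY KEY (batch_key, step))"
    )


def run_step_once(conn, batch_key, step, call_llm):
    """Return the stored output for (batch_key, step); call the LLM only if absent."""
    row = conn.execute(
        "SELECT output FROM llm_step_results WHERE batch_key = ? AND step = ?",
        (batch_key, step),
    ).fetchone()
    if row is not None:
        return row[0]  # already paid for; reuse instead of re-billing
    output = call_llm()
    # Commit right away so a failure in a later step cannot discard this result.
    conn.execute(
        "INSERT INTO llm_step_results (batch_key, step, output) VALUES (?, ?, ?)",
        (batch_key, step, output),
    )
    conn.commit()
    return output
```

The design choice that matters is committing each result separately from the final write. In the incident described here the last DB write was the step that failed, so results saved only at the end would still have been lost.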

Three Lessons for Anyone Running Cost-Bearing AI Workflows

First: a deterministic failure doesn't get better with repetition. For schema mismatches and 4xx-class errors, abort immediately, and cap retries for everything else. Second: any batch that calls billed APIs or LLMs must be idempotent from day one. Without idempotency, a re-run isn't a redo; it's double billing. Third: deploy in the order schema, then code. Code before schema mass-produces deterministic errors.
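The first lesson can be sketched as a retry wrapper that tells the two kinds of failure apart. This is a minimal sketch under assumptions: the two exception classes, the cap of three attempts, and the backoff values are all hypothetical, and a managed task queue would normally express the same policy through its own retry settings rather than through application code.

```python
import time


class DeterministicError(Exception):
    """A failure that will repeat identically, e.g. a missing column or a 4xx."""


class TransientError(Exception):
    """A failure that may clear on its own, e.g. a timeout or dropped connection."""


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying only transient failures, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except DeterministicError:
            raise  # retrying cannot help; surface it immediately
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached; stop spending
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

For this to help, the batch has to translate a schema error into DeterministicError, or into whatever non-retryable signal the queue understands, instead of letting it surface as a generic 500.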

The CFO's face scrunched when I explained: "It succeeded, you got billed, and then it threw the success away." That's the counterintuitive horror. Next time you deploy a cost-bearing batch, wire up idempotency and a deterministic failure detector before you hit deploy.

Source: Why did one day of AI cost more than a month of servers?
Domain: junueno.dev
