Published 8 January 20269 min read

From $4,000 to $40 a month: the real cost curve of agent guardrails

What stopped the bleeding wasn't a smaller model — it was killing the planner's appetite for extra steps.

agentic-aiproductioncost-engineering

The Stripe email landed on a Tuesday in March 2025, 09:14. I was on my second coffee. Subject: "Your OpenAI usage is approaching the threshold." Body: $1,247 spent, four days into the billing cycle.

I did the math on the back of a Migros receipt. $1,247 ÷ 4 × 30 = $9,350. Then I remembered traffic had been climbing since Monday, so the realistic projection was closer to $4K. Still. Four thousand dollars a month for what was supposed to be a tool-selection helper on a marketplace search bar. Three days. The agent had been live for three days.

I closed the laptop, drank the coffee, opened the laptop. Started reading the logs.

The marketplace CFO doesn't know my name. He knows my line item. That's the whole relationship. When the line item gets weird, his finance person sends a polite email, and the politeness gets thinner each week. By Friday, polite was already on day four.

What was actually running

Nova's tool-selection agent sits in front of a 7-million-product catalog. When someone searches "kahve makinesi 2.000 TL altı, espresso da yapsın", the agent decides which retrieval tools to call and in what order. We had five: BM25 search, vector search, image-match (for "looks like this"), brand-affinity scoring, and stock-check.

The first version was, I'll own this, the kind of setup you put together and don't poke. Planner: GPT-4o. Tool result interpreter: GPT-4o. No depth cap. No budget cap. The reasoner chained however many tool calls felt right.

What "felt right" turned out to be: an average of 8 LLM calls per user query. 32K input tokens. 4K output tokens. Times 12K daily queries, that's around 95K calls, 3.5M input tokens, 480K output tokens. Every day.

The planner loved long planning steps. Call BM25, read the results, decide they were ambiguous, plan a vector search, read those, decide image-match might help "for completeness", chain another step, then summarize. Each step was another full-context call. The model didn't know about my Stripe limit. It was being thorough.

Here's the catch. I had given thoroughness zero opportunity cost. The planner prompt said "you may call additional tools if the result is incomplete." Nothing told it that incomplete-but-close was a place to stop. So it never stopped.

There was no upper bound on how long a query could run. No logic gate. No "if you've called this many tools, stop." The planner's appetite for chaining was infinite, and I had given it an unlimited kitchen.

Guard 1 — the budget envelope

I shipped the budget envelope on Tuesday afternoon. This is the version I would not show anyone:

// First attempt — Tuesday 14:30, panic mode
class BudgetEnvelope {
  private spent = 0;
  charge(input: number, output: number) {
    this.spent += input + output;
    // would've saved us $800 if we'd added this a week earlier
    if (this.spent > 50_000) {
      console.warn("Budget high");  // <- this is not a guardrail
    }
  }
}

That console.warn is the kind of code you write at 14:30 on no sleep. It logs. It does nothing. The agent keeps spending.

By Wednesday morning I had the real version:

class BudgetEnvelope {
  private spent = 0;
  constructor(private readonly capPerQuery = 50_000) {}
 
  charge(input: number, output: number) {
    this.spent += input + output;
    if (this.spent > this.capPerQuery) {
      // throw, don't warn. the planner will catch and finalize
      // with whatever tool results it has so far.
      throw new BudgetExceeded(this.spent, this.capPerQuery);
    }
  }
}

50K tokens per query, hard cap. The planner caught the exception and returned a best-effort answer from whatever tools had already run. Some answers got worse. Most stayed the same, because most of the "extra" steps were planner indulgence, not signal.

This one change dropped the daily projection from $4K to $1.4K. A 66% cut from a single class. The planner had been spending most of its budget on the eighth tool call, where marginal information was near zero.

I want to underline this, because it's the part I keep forgetting and re-learning. The budget envelope is not a cost control. It's a planning shape control. Once the planner knew it had a fixed budget, it started front-loading the high-value tool calls because the prompt said "you have a token budget." The behavior change in the first call was bigger than the cutoff in the last.

Guard 2 — model tiering

Next obvious move. Did the planner really need GPT-4o? Planning is structured. The planner outputs a JSON object: "the next tool is X." It's not writing prose. It's not reasoning about quantum mechanics.

I swapped the planner to Claude Haiku 4.5. Moved the result-interpretation step to a small model with a tight structured-output schema. The reasoner-quality model stayed in one place: the final answer composition, where natural language actually mattered.

Tiering, in production:

Step	Before	After
Planner	GPT-4o	Haiku 4.5
Tool result parse	GPT-4o	Haiku 4.5 + schema
Final composition	GPT-4o	GPT-4o

Monthly projection went from $1.4K to $380. Planning quality dropped maybe 2% on my eval set, well inside the noise floor of how I was scoring "good plan" anyway.

A parenthetical on the eval set. It was 230 hand-labeled queries that I built over three afternoons at my café. Not a gold standard you can blindly run against. Just enough to answer "does the new planner behave like last week's?" Verifying that the regime hadn't shifted was the goal, not perfection.

This is the step most people would do first. I did it second on purpose. Swapping models is the easy fix that gets celebrated. Cutting steps is the boring fix that actually moves the bill.

Guard 3 — caching that earns its keep

The first cache I tried was a dumb LRU on the raw query string. It hit 4% of traffic. People don't search the exact same string.

The second cache used sentence embeddings. Hash the embedding into a coarse bucket, look up similar past queries, reuse the tool-selection plan if cosine similarity is above 0.92. The first Tuesday after deploy, morning rush hit 38%. People were searching the same things in slightly different words. Weekly average settled at 22%.

async function planWithCache(query: string) {
  const emb = await embed(query);
  const hit = await cache.findSimilar(emb, 0.92);
  if (hit) {
    // reuse the plan, not the answer.
    // answers go stale; tool selection rarely does.
    return runPlan(hit.plan, query);
  }
  const plan = await planner(query);
  await cache.put(emb, plan, ttl="6h");
  return runPlan(plan, query);
}

The important detail: I cached the plan, not the answer. Stock and pricing change. Which retrieval tools to fire for "wireless headphones under 3K" doesn't. Tool result cache went on deterministic queries only — same SKU lookup, same brand affinity score — with a 15-minute TTL, because that's how long the warehouse data stays consistent.

$380 to $120. Caching pulled another two-thirds out.

Cache invalidation is supposed to be the hard problem, but this time it wasn't. The plan cache TTL was short, six hours. Plans were keyed by category, not by user. Two different users with the same "coffee machine" intent shared a plan but got different run results. Skip that detail and user A's answer goes to user B, and you've stepped into the kind of bug you can't walk back from.

Guard 4 — adaptive depth

Cost was bearable now, but there was one more thing. Reading the logs, I noticed something the planner had been hiding from itself: most queries don't need five tools. They need two.

A junior on my team had built a confidence scorer for tool results. Basically "are these results clean enough to answer?" It was sitting unused in a different service. I wired it in.

async function adaptivePlan(query: string, plan: Tool[]) {
  const results = [];
  for (const tool of plan) {
    const r = await run(tool, query);
    results.push(r);
    const confidence = scoreResults(results, query);
    // if we're already confident, stop. the planner
    // wants to keep going. the planner is wrong.
    if (confidence > 0.85 && results.length >= 2) break;
  }
  return results;
}

Numbers from the next week of logs: 18% of queries needed more than two tool calls. The other 82% were stopping early with the same answer they'd have gotten at step 5. The planner was doing five-step plans because the prompt examples showed five-step plans. The real world did not need them.

$120 to $40.

This step took longer than the others. Calibrating "confidence > 0.85" meant two days of resifting logs. What happens at 0.75? Does answer quality drop? At 0.90? You cut 12% instead of 18%. Threshold tuning like this can't be done in theory without real production traffic to test your hypothesis against. I waited a week, then dialed it in from the logs.

What I'd tell myself in March 2025

The biggest single drop didn't come from a cheaper model. It came from guard 1 and guard 4 together: the budget envelope plus stop-when-you're-done. Together they did 80% of the savings. The model swap was 20%.

The counter-intuitive piece. Cheaper models are the lazy answer. The internet loves them because they're a one-line PR. "Switched to Haiku, saved 80%" is a great tweet. The real engineering work is figuring out which steps shouldn't run at all, and that takes a few evenings with your logs.

Reading this, some people will say "I'd start with Haiku." A reasonable reflex, but a misleading one. Even with Haiku, the planner, left without a budget, was calling 14 tools instead of 8. I saw it in the test. When per-unit cost drops, the planner's appetite goes free. Cheapening the model before binding the shape just pads the bottom of the pit with a soft mattress. You still fall. Slower.

The math on where I landed: $40/month, 360K queries (12K daily × 30). That's $0.0003 per query. The marketplace CFO's reaction when I sent the new numbers was a single thumbs-up emoji, which from him counts as a parade.

The hardest part isn't writing the agent. It's writing the parts that don't run.

Looking back, the four guardrails are all saying the same thing in different layers. Don't trust the model, bound it. The budget envelope bounds tokens. Model tiering bounds the cost class the decision is made in. The cache bounds the same decision being made twice. Adaptive depth bounds the planner's imagined plan length. Four places, same sentence: finishing is your job, not thinking forever.

My second-most-expensive month was when I forgot to delete a debug log statement that called the LLM to summarize itself. I noticed it on day six. My coffee had gone cold while I was reading the invoice, and I just sat at the desk for ten minutes.