
Which Prompts Need the Expensive Model? I Audited All 90

I'd wired the expensive model into all 90-odd prompts and never once asked which of them actually needed it. So I went and looked, prompt by prompt, and the answer was a bit humbling.

By Roman Rose, Founder, Parity Layer · 8 min read

Key takeaways

  • You find out which prompts need the expensive model by auditing them one at a time on your own traffic, not by eyeballing which ones feel hard, because your gut is a terrible judge of this.
  • Spend is almost always wildly concentrated: a small handful of high-volume prompt types burn most of the bill, and those are rarely the ones that actually need the smartest model.
  • The real split isn't hard versus easy, it's whether a cheaper model comes back better or at least as good on that exact prompt with your inputs, which you can only know per prompt.
  • Guessing the tiering from your gut is the same mistake that got the bill big in the first place, so you prove each prompt against your own baseline instead of moving on a hunch.
  • On the prompts that pass, most teams save somewhere in the 30 to 60% range, with the baseline sitting there as an instant fallback so a wrong call never costs you quality.

The short answer is you find out which prompts need the expensive model by auditing them one at a time on your own real traffic, and the number that genuinely need it is almost always far smaller than the number you've got running on it. You break your spend down by prompt, you find the handful of high-volume prompts burning most of the bill, and for each one you test a cheaper model against your current one on your own inputs, then you keep the expensive model only where cheaper actually comes back worse. Everything else, and honestly it's usually most of it, can move down a tier with no quality drop. That's the whole method, and everything under here is just me showing how I got to it and why I trust the answer, because I only found this out the hard way after wiring the expensive model into all 90-odd of my prompts and never once asking, prompt by prompt, whether each of them needed it.

Now, the reason this matters is that most prompts a business runs all day, the classification, the extraction, the summarising, the qualifying, the generation off structured data, don't actually need a frontier model to hit the same quality bar, they just default to it because you picked one model once and pointed everything at it. That's exactly what I did when I built Sentrama, and it's exactly how the bill got big. So the useful question was never whether a cheaper model exists, it obviously does, it was which of my specific prompts could take one without my customers ever noticing, and the only honest way to answer that is to look.

Why does every prompt end up on the expensive model in the first place?

Because you pick one model once, right, and picking per prompt sounds like a nightmare so nobody does it. When you're building fast you choose the best model you can afford, you wire it into everything, and it works, and working things get ignored. I taught myself to vibe code about six months ago and built Sentrama, my AI sales dialer and CRM, basically from scratch in a couple of months with a teammate called Jay, and there's a sister company, Real Recruitment, on the same kind of prompts, and across both of them I just set the latest OpenAI model as the default for everything and left it. The data was qualifying itself, the reason for each call was writing itself off everything we'd scraped about a person, the reports were going out on their own, and obviously the fact it all just worked is exactly why I stopped paying any attention to what it cost. Then we onboarded more and more customers, more and more SDRs, more and more calls, and the bill climbed from about a grand a month to nearly four and then up into five figures, and one day I'm sat in a finance meeting staring at a number I'd been ignoring for months. And the uncomfortable bit wasn't the number, it was realising I had no idea which prompts were even responsible for it, because I'd never once broken it down.

So which prompts were actually driving the bill?

Not the ones I'd have guessed, which is basically the whole lesson. When I finally broke the spend down by prompt instead of staring at one big aggregate number, it was wildly lopsided, a small handful of high-volume prompt types were burning most of the bill and the long tail of everything else barely registered, and that shape is completely normal, most teams find a Pareto-ish split where a fraction of prompts drive most of the spend. But here's the trap I nearly fell into, the expensive prompts and the prompts that need an expensive model are not the same set. A prompt can burn a fortune purely because it runs ten thousand times a day and generates long outputs, not because it's cognitively hard, and a lot of my heaviest spenders were dead simple jobs, summarise this call, pull these fields out, classify this record, that a much cheaper model handles perfectly well. So the money wasn't sat where the difficulty was, it was sat where the volume was, and I'd been paying frontier prices on volume that never needed the intelligence.
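A minimal sketch of that per-prompt breakdown, assuming you can export usage with a prompt identifier, token counts, and a call count per row. The field names, the prompt ids, the volumes, and the two prices are all made-up illustrations, not figures from my bill or from any provider's price list.

```python
from collections import defaultdict

# Assumed prices in dollars per million tokens. Substitute your provider's real rates.
PRICE_PER_M_INPUT = 5.00
PRICE_PER_M_OUTPUT = 15.00


def call_cost(input_tokens, output_tokens):
    """Dollar cost of a single call at the assumed prices."""
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def spend_by_prompt(usage_rows):
    """Return (prompt_id, cost, cumulative_share) tuples, biggest spender first."""
    totals = defaultdict(float)
    for row in usage_rows:
        per_call = call_cost(row["input_tokens"], row["output_tokens"])
        totals[row["prompt_id"]] += per_call * row.get("calls", 1)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    report, running = [], 0.0
    for prompt_id, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += cost
        report.append((prompt_id, cost, running / grand_total))
    return report


# Hypothetical usage export: two simple high-volume prompts, two rare heavy ones.
USAGE = [
    {"prompt_id": "summarise_call", "input_tokens": 3000, "output_tokens": 400, "calls": 10_000},
    {"prompt_id": "extract_fields", "input_tokens": 1500, "output_tokens": 150, "calls": 8_000},
    {"prompt_id": "agent_plan", "input_tokens": 6000, "output_tokens": 1200, "calls": 50},
    {"prompt_id": "weekly_report", "input_tokens": 4000, "output_tokens": 900, "calls": 20},
]

report = spend_by_prompt(USAGE)
for prompt_id, cost, cumulative_share in report:
    print(f"{prompt_id:16s} ${cost:9.2f}  cumulative {cumulative_share:6.1%}")
```

In this toy data the two simple, high-volume prompts account for nearly all the spend, while the agentic prompt that is hardest per call barely registers. The cumulative column is the thing to read: it tells you how far down the list you need to audit before you have covered most of the bill.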

How do you actually tier a prompt without just guessing?

You resist the urge to sort them into hard and easy off your gut, because your gut is exactly what got you here. The instinct is to look down the list and go, that one feels complex so it stays expensive, that one feels trivial so it goes cheap, and I promise you that instinct is wrong often enough to be dangerous, because some prompts that look trivial have a nasty edge case buried in them and some that look scary are actually really forgiving. So the honest split isn't hard versus easy at all, it's a much narrower question, does a cheaper model produce an answer that's better, or at least as good, on this specific prompt with my actual inputs, and the only way to answer that is to run it on your own traffic and compare. Here's roughly how the audit shook out for me, and the pattern shows up nearly everywhere I've looked since.

| Prompt type | My gut said | What the data said |
| --- | --- | --- |
| High-volume classification and field extraction | Simple, obviously cheap | Correct, moves down a tier with no quality drop and this is where most of the saving was hiding |
| Call and meeting summarisation | Needs the smart model to read nuance | Mostly fine on a cheaper model, a couple of edge-case variants genuinely didn't and stayed put |
| Writing the reason for each call from scraped data | Trivial templated stuff, cheap | Actually more sensitive than I thought, some variants needed the stronger model to not sound generic |
| Long-horizon multi-step reasoning and anything agentic | Needs the expensive model | Correct, left it exactly where it was, this is not the stuff to cut |

Roughly how my 90-odd prompts sorted once I actually looked, rather than guessed. The split is illustrative, not a promise, and yours will differ, which is the whole reason you measure your own.

The honest headline from that table is that my instincts were right maybe half the time, which is basically a coin flip, and a coin flip is not a good enough basis for a five-figure decision. The prompts I was most confident were trivial included a couple that quietly needed the stronger model, and the prompts I was nervous about downgrading turned out completely fine. If I'd tiered the whole thing off my gut I'd have degraded a couple of things my customers would've noticed and left money on the table on the ones I was too cautious to touch.

Why not just trust a benchmark to tell you which model each prompt needs?

Because a benchmark isn't your traffic, and it never will be. A model can top a leaderboard and still fall over on the specific, slightly messy prompt you run ninety times a day, and the leaderboards get contaminated and gamed anyway, so "this cheaper model scores well" tells you basically nothing about whether it holds up on your reason-for-the-call generation or your particular flavour of summary. Per-token prices have genuinely fallen fast, you'll see figures like the cost to hit a given quality bar dropping by something like a factor of ten a year, so there's always a cheaper option a click away, and that was never the hard part. The hard part is proof. My customers were used to a certain quality across those prompts and I wasn't about to quietly make their results worse to fix my own margins, and they had no idea what model ran underneath and it honestly wasn't their job to care. So the question was never "is there a cheaper model", it was "can I prove a cheaper one is at least as good on this exact prompt before I switch it", and a leaderboard cannot answer that for you.

So how do you prove it per prompt?

You capture your real production traffic for that prompt and you judge a cheaper model against your current one on three things at once.

The first is format, the boring one that breaks everything, so you hold it to a full match, because if your app wants strict JSON with certain fields and the cheaper model comes back with prose or drops a key, it doesn't matter how clever the answer was, the whole pipeline downstream is just broken. The second is categorical, meaning did the two models make the same call, the same classification, the same yes or no, and when they disagree you re-judge that case blind so a coin-flip disagreement doesn't get counted as a real miss. And the third is semantic, are they actually saying the same thing in substance, obviously allowing for the fact that two good answers are basically never word-for-word identical.

And then the bit almost everyone skips, which is honestly what makes the whole thing work, you measure all of that against how much your own baseline model already disagrees with itself, because if you ask your existing model the same prompt twice you don't get the same answer back, there's a natural spread, and you genuinely cannot call a swap worse until you know what your own model's normal wobble looks like first. A prompt only gets downgraded if the cheaper model stays inside your baseline's own deviation band, using your own baseline model as the judge.

After three or four months down a deep, deep rabbit hole that's the method I landed on. It's now patent pending, but the concept is dead simple: you can't measure better until you've measured your own wobble. If you want the fuller walk-through of the mechanism it's on the "How it works" page, the deep methodology version is in "Is a cheaper AI model good enough, and how to prove it", and if you're weighing this against the usual routers there's a breakdown in "AI model routing explained".
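To make the wobble idea concrete, here is a simplified sketch of two of the three checks, format and categorical, measured against the baseline's own self-disagreement. It is a stand-in under stated assumptions, not the patent-pending method itself: the `label`/`reason` schema, the function names, and the rule "candidate gap must not exceed baseline wobble" are all illustrative, and the semantic check and the blind re-judging step are left out because they need a judge model.

```python
import json

# Hypothetical output schema for one prompt: a JSON object with exactly these keys.
REQUIRED_KEYS = {"label", "reason"}


def format_ok(raw):
    """True only if raw parses as a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS


def disagreement_rate(outputs_a, outputs_b):
    """Fraction of paired outputs whose 'label' differs. Both sides must be valid JSON."""
    pairs = list(zip(outputs_a, outputs_b))
    if not pairs:
        raise ValueError("need at least one pair of outputs to compare")
    differing = sum(1 for a, b in pairs if json.loads(a)["label"] != json.loads(b)["label"])
    return differing / len(pairs)


def passes_proof(baseline_run_1, baseline_run_2, candidate_run):
    """Gate a downgrade: full format match, then stay inside the baseline's own wobble."""
    # Gate 1: every candidate output must match the format, no exceptions.
    if not all(format_ok(out) for out in candidate_run):
        return False
    # Gate 2: run the baseline twice on the same inputs to measure its normal spread,
    # then require the candidate to disagree with the baseline no more than that.
    baseline_wobble = disagreement_rate(baseline_run_1, baseline_run_2)
    candidate_gap = disagreement_rate(baseline_run_1, candidate_run)
    return candidate_gap <= baseline_wobble


def out(label):
    """Build a well-formed output for the toy data below."""
    return json.dumps({"label": label, "reason": "n/a"})


# Toy data: the same five inputs run twice on the baseline and once per candidate.
baseline_1 = [out(x) for x in ["hot", "cold", "hot", "warm", "cold"]]
baseline_2 = [out(x) for x in ["hot", "cold", "warm", "warm", "cold"]]
candidate_good = [out(x) for x in ["hot", "cold", "hot", "warm", "hot"]]
candidate_bad = [out(x) for x in ["cold", "cold", "warm", "cold", "hot"]]
candidate_prose = candidate_good[:4] + ["The lead looks hot to me."]

print(passes_proof(baseline_1, baseline_2, candidate_good))
print(passes_proof(baseline_1, baseline_2, candidate_bad))
print(passes_proof(baseline_1, baseline_2, candidate_prose))
```

Here the baseline disagrees with itself on one input in five, so a candidate that differs on one in five is inside the band, one that differs on four in five is not, and one that answers in prose fails before the comparison even starts. Five inputs is only enough to show the shape; on real traffic you would want a far larger sample before trusting either rate.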

What did the audit actually save, and where does it not apply?

On the prompts that passed, the saving lands somewhere in the 30 to 60% range for most teams, and I'm giving you a band and not one hero number on purpose, because it genuinely varies prompt by prompt, which is the entire point of auditing each one on its own rather than swinging the whole bill at once. My own number came down a fair bit harder than that band, but I'm not going to sit here and promise you my result as if it's yours. And I'd rather be straight about where this doesn't apply, because that's what makes the rest believable. This isn't for coding agents, swapping in a cheaper model is genuinely worse there and the failure modes get nasty. It's built for the high-volume workhorse prompts a business runs over and over, the classification and extraction and summarisation and qualification and generation-off-structured-data stuff that runs all day. And if your AI spend is genuinely tiny, a rounding error next to payroll, then honestly don't bother auditing it yet, go build something. The moment it stops being a rounding error and starts showing up in a finance meeting, that's when going prompt by prompt earns its keep, and either way your baseline sits there as an instant fallback, so a downgraded prompt that ever drifts just snaps straight back to your original model and you're never worse off than today.
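The fallback idea can be sketched as a small wrapper around one prompt. Everything named here is hypothetical: the `TieredPrompt` class, the rolling window, the failure threshold, and the use of a bare format check as the drift signal are assumptions for illustration, and the models are stub functions rather than real API calls.

```python
from collections import deque


class TieredPrompt:
    """Route one prompt to a cheaper model, snapping back to the baseline on drift."""

    def __init__(self, cheap_model, baseline_model, is_valid, window=50, max_failure_rate=0.02):
        self.cheap_model = cheap_model
        self.baseline_model = baseline_model
        self.is_valid = is_valid
        self.max_failure_rate = max_failure_rate
        self.recent = deque(maxlen=window)  # rolling record of pass/fail checks
        self.downgraded = True

    def run(self, prompt_input):
        # Once the prompt has snapped back, every call goes to the baseline.
        if not self.downgraded:
            return self.baseline_model(prompt_input)

        output = self.cheap_model(prompt_input)
        ok = self.is_valid(output)
        self.recent.append(ok)

        failure_rate = self.recent.count(False) / len(self.recent)
        if failure_rate > self.max_failure_rate:
            self.downgraded = False  # drift detected: revert this prompt to the baseline

        # A failed output is never returned; the baseline answers that call instead.
        return output if ok else self.baseline_model(prompt_input)


# Stub models for the demo. The cheap one starts returning prose after three calls.
calls = {"n": 0}


def flaky_cheap_model(text):
    calls["n"] += 1
    return '{"label": "hot"}' if calls["n"] <= 3 else "Sure! The label is hot."


def baseline_model(text):
    return '{"label": "hot"}'


def looks_like_json(output):
    return output.startswith("{") and output.endswith("}")


prompt = TieredPrompt(flaky_cheap_model, baseline_model, looks_like_json,
                      window=5, max_failure_rate=0.2)
results = [prompt.run("lead notes") for _ in range(6)]
print(results)
print("still downgraded:", prompt.downgraded)
```

The caller gets a well-formed answer on every one of the six calls, even though the cheaper model went wrong on the fourth. That is the property that matters: a wrong tiering call costs you some of the saving on that prompt, not the quality of the output.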

The one-line version

Most of your prompts never needed the expensive model, you just never checked. Break the spend down by prompt, ignore your gut about which ones are hard, and prove a cheaper model per prompt on your own traffic before you switch it. The ones that pass save you money at no quality cost, and the baseline catches the rest.

Frequently asked questions

How do I figure out which of my prompts actually need the expensive model?

Break your spend down by prompt type first, because it's almost always wildly concentrated in a small handful of high-volume prompts. Then, prompt by prompt, test a cheaper model against your current one on your own real traffic and keep the expensive model only where cheaper genuinely comes back worse. Don't sort them by which ones feel hard, your gut is wrong often enough to be dangerous. The only reliable answer is measured, per prompt, on your inputs.

Won't downgrading a prompt to a cheaper model quietly hurt quality?

Only if you switch on a guess. If you prove each prompt first, format is held to a full match so nothing downstream breaks, categorical and semantic equivalence are judged against your own baseline's self-consistency, and a prompt only moves if the cheaper model stays inside your baseline's normal deviation. And if a switched prompt ever drifts later, your baseline is the instant fallback, so you're never worse off than today.

Isn't it obvious which prompts are simple and which are hard?

It really isn't, and that assumption is the expensive mistake. When I audited my own, my gut was right roughly half the time, some prompts I was sure were trivial had a nasty edge case that needed the stronger model, and some I was nervous about downgrading turned out completely fine. Difficulty and cost also aren't the same thing, a lot of your biggest spenders are simple jobs that just run at huge volume, so you have to measure rather than eyeball it.

What kind of saving should I expect from tiering my prompts?

On the prompts that pass the proof, most teams land somewhere in the 30 to 60% range. It's a band and not a single figure because it genuinely varies prompt by prompt, which is exactly why you audit each one on its own rather than swinging your whole bill at a single cheaper model. You can watch a swap pass or fail the proof on up to 10 of your own prompts free, no credit card, before you commit to anything.

Sources

  1. Epoch AI - LLM inference prices have fallen rapidly but unequally across tasks
  2. a16z - LLMflation: LLM inference cost trends
  3. Singh et al. (2025) - The Leaderboard Illusion
  4. Zhang et al. (2024) - A Careful Examination of LLM Performance on Grade School Arithmetic (GSM1k)
  5. Zheng et al. (2023) - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  6. State of FinOps 2026 (AI/LLM spend attribution)

Prove it on your own prompts

See whether a cheaper model matches or beats your output for 30-60% less. Up to 10 prompts free, no credit card.
