How I Cut My Own AI Bill Without Dropping My Customers' Quality (2026)

The whole thing started because I refused to make my customers' results worse to save myself money. So I built a way to prove a cheaper model matched mine on my own prompts first. Here is how that actually works.

By Roman Rose, Founder, Parity Layer

Key takeaways

  • The reason to prove a cheaper model, rather than just swap one in, is that you refuse to quietly degrade the quality your customers rely on.
  • Benchmarks are not your traffic. A leaderboard win is a hypothesis, and the only test that counts is your own prompts against the model it would replace.
  • You judge a swap on three things at once: format holds, the models agree (re-judged blind when they disagree), and it means the same thing.
  • You cannot call a swap worse until you know how much your own model already disagrees with itself, so you measure that self-consistency band first.
  • Most teams land in the 30 to 60% savings range on the prompts that pass, with an instant fallback to the baseline so a later drift never burns you.

There was one thing I was never going to do: quietly make my customers' results worse to improve my own margins. That was the line I wouldn't cross. I went looking for a cheaper model to save money, obviously, but the second it started to cost quality I'd have walked away, because the people paying me were used to a certain standard. They had no idea what model was running underneath, and honestly it wasn't their job to care. So the real question was never "is there a cheaper model", because there always is. The question was "can I actually prove a cheaper model is at least as good on my own prompts before I switch anything", and that turns out to be a much harder, and much more interesting, thing to answer than it sounds.

So let me answer that first, plainly. You prove a cheaper model is good enough by testing it against your own production prompts, not against a leaderboard, and you judge it on three things at once: does it come back in the exact format your app expects so nothing downstream breaks, does it make the same categorical calls your current model makes, and is it saying the same thing in substance. Then, crucially, you measure all of that against how much your own baseline model already disagrees with itself, because you can't honestly call a swap worse or better until you know what your own model's normal wobble looks like. If the cheaper model stays inside that band on your real traffic, it's good enough. Everything below is the long-winded story of how I got there.

How did I end up here in the first place?

I guess the journey started when I taught myself to vibe code, maybe six months ago. We were originally an agency, stitching together a third-party dialer, a CRM and a bunch of other calls to deliver our service, and we quickly realised that if we could pull all of it under one roof we'd save a lot on cost and give the customer a genuinely better experience at the same time. I started with Claude Code, which quickly became my best friend, and I went on the whole journey with a teammate called Jay, who came from the data side. About two months in we'd built the platform from scratch and started migrating everyone onto it. That platform is Sentrama now, our AI sales dialer and CRM, and there's a sister company, Real Recruitment, built on the same kind of prompts.

Very quickly I was writing prompts and embedding them into the platform, using the latest model from OpenAI, and honestly I just set it and left it. It worked, so I stopped paying attention to what it cost. Then we onboarded more customers, more SDRs and more clients, we made more and more calls, and the OpenAI bill climbed from about a thousand a month to nearly four thousand, and then kept going into five figures. After one particular finance meeting I sat there thinking, okay, why don't I just try a cheaper model? That sent me down a deep rabbit hole, because I did maths at A level and I'm genuinely into stats and proofs and proving something actually works. We never ship anything at any of my companies without proving it's going to have an impact and measuring every metric we can. I've got an admin dashboard that I think would send most people a bit crazy.

Why don't benchmarks answer this?

Because a benchmark isn't your traffic. A model can top a leaderboard and still fall over on the weird, specific, slightly messy prompt you actually run in production ninety times a day. Leaderboards get contaminated and gamed anyway, so "it scores well" tells you almost nothing about whether it'll hold up on your reason-for-the-call generation or your cold-call summaries. The only test that means anything is your own prompt, your own inputs and your own definition of a good answer, which is why I stopped trusting the leaderboards and started testing against my own live traffic instead.

So how do you actually prove it?

You compare the cheaper model to your current one across three axes, all at the same time. The first is format, and this is the boring one that breaks everything, so we hold it to a full match. If your app expects strict JSON with certain fields and the cheaper model comes back with prose or drops a key, it doesn't matter how smart the answer was: the whole pipeline downstream is broken. The second is categorical, which just means did the two models make the same call, the same classification, the same yes or no. When they disagree we re-judge that case blind, so a coin-flip disagreement doesn't get counted as a real one. The third is semantic: are they actually saying the same thing in substance, allowing for the fact that two good answers are almost never word-for-word identical.
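
To make the three axes concrete, here is a minimal Python sketch rather than the production implementation. It assumes both models return JSON with a `label` field (the categorical call) and a `summary` field (the free text); those field names, the `REQUIRED_KEYS` schema and the 0.6 threshold are hypothetical. The semantic check is a crude string-similarity stand-in for whatever embedding or judge-based comparison you would really use.

```python
import json
from difflib import SequenceMatcher

# Hypothetical schema for illustration; use whatever fields your app expects.
REQUIRED_KEYS = {"label", "summary"}


def format_ok(raw: str) -> bool:
    """Format axis: output must parse as a JSON object with every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def same_call(baseline_raw: str, candidate_raw: str) -> bool:
    """Categorical axis: both models made the same classification."""
    return json.loads(baseline_raw)["label"] == json.loads(candidate_raw)["label"]


def similar_meaning(baseline_raw: str, candidate_raw: str, threshold: float = 0.6) -> bool:
    """Semantic axis stand-in: character-level similarity of the two summaries.
    This is a placeholder, not a real measure of meaning."""
    a = str(json.loads(baseline_raw)["summary"]).lower()
    b = str(json.loads(candidate_raw)["summary"]).lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def compare(baseline_raw: str, candidate_raw: str) -> dict:
    """Score one prompt on all three axes. A format failure fails everything,
    because nothing downstream can use the answer."""
    if not format_ok(candidate_raw):
        return {"format": False, "categorical": False, "semantic": False}
    return {
        "format": True,
        "categorical": same_call(baseline_raw, candidate_raw),
        "semantic": similar_meaning(baseline_raw, candidate_raw),
    }
```

Calling `compare(baseline_output, candidate_output)` on each captured prompt gives you one row per prompt. The sketch assumes the baseline output is itself valid, and the blind re-judging of categorical disagreements is left out.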

Why does your model's disagreement with itself matter?

This is the part most people skip, and it's the part that makes the whole thing work. If you give your existing model the same prompt twice you don't get the exact same answer back. There's a natural spread. So before I can claim a cheaper model is worse, I first have to know how much my own baseline already disagrees with itself, and that self-consistency band is what you measure everything else against. We get your own baseline model to judge the cheaper model's answers as worse, same, or better, weighing the actual downstream business impact rather than just word overlap, and a swap only passes if it stays inside your baseline's own deviation. After three to four months of trialling different methods that's the solution we landed on, and it's now patent pending, but the concept itself is dead simple: you can't measure better until you've measured your own wobble.
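
The idea of the band can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the patent-pending method: it only looks at categorical labels, it measures the band from two baseline runs over the same prompts in the same order, and the function names and the optional `margin` are hypothetical.

```python
def disagreement_rate(pairs):
    """Fraction of (a, b) label pairs that differ."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(a != b for a, b in pairs) / len(pairs)


def passes_band(baseline_run_1, baseline_run_2, candidate_run, margin=0.0):
    """Return (passed, band, candidate_rate).

    band           = how often the baseline disagrees with itself
    candidate_rate = how often the candidate disagrees with the baseline
    The candidate passes if its rate is no larger than the band plus margin.
    """
    if not (len(baseline_run_1) == len(baseline_run_2) == len(candidate_run)):
        raise ValueError("all runs must cover the same prompts")
    band = disagreement_rate(zip(baseline_run_1, baseline_run_2))
    candidate_rate = disagreement_rate(zip(baseline_run_1, candidate_run))
    return candidate_rate <= band + margin, band, candidate_rate
```

If the baseline flips its own answer on 2 prompts in 10, a cheaper model that differs from it on 1 in 10 is inside the wobble, and one that differs on 4 in 10 is not.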

Proof-based routing vs a static gateway, what's the honest difference?

A static gateway lets you point traffic at whatever model you pick and swap it with a config change, and that's genuinely useful for what it is: cheap to run and fast to set up. So I'll be straight about the trade-off, because there is one. A static gateway is instant, but it doesn't prove anything. The decision to switch is still yours and it's still basically a guess, and if the cheaper model quietly degrades on one prompt type you tend to find out from a customer, not from a dashboard. Proof-based routing does the boring work first: it captures your real traffic and proves the swap per prompt before anything moves. So it isn't instant, and you pay for it in a bit of upfront calibration and patience. What you get back is that you only switch the prompts where cheaper genuinely matches or beats your baseline, proven on your own data. Everything else stays exactly as it was, with your baseline sitting there as an instant fallback the moment anything drifts.
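
As a rough picture of what per-prompt routing with a fallback means, here is a small Python sketch. Everything in it is hypothetical: the `ProofRouter` class, the idea of keying on a `prompt_type` string, and the choice to stop routing a prompt type after a single off-format answer are assumptions made for the example, not a description of how any real product works.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # takes a prompt, returns raw model output


class ProofRouter:
    """Send a prompt type to the cheaper model only once it has been proven,
    and fall back to the baseline the moment an output fails validation."""

    def __init__(self, baseline: ModelFn, cheaper: ModelFn,
                 validate: Callable[[str], bool]):
        self.baseline = baseline
        self.cheaper = cheaper
        self.validate = validate
        self.proven: set[str] = set()  # prompt types cleared for the cheaper model

    def mark_proven(self, prompt_type: str) -> None:
        self.proven.add(prompt_type)

    def call(self, prompt_type: str, prompt: str) -> tuple[str, str]:
        """Return (model_used, output)."""
        if prompt_type in self.proven:
            output = self.cheaper(prompt)
            if self.validate(output):
                return "cheaper", output
            # Drift detected: stop routing this prompt type (an assumption
            # of this sketch) and answer from the baseline instead.
            self.proven.discard(prompt_type)
        return "baseline", self.baseline(prompt)
```

Unproven prompt types never touch the cheaper model, which is the difference from a static gateway where the config change moves everything at once.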

Is this for coding agents?

No, and I'd rather say that plainly than pretend otherwise. This isn't for coding agents. It's genuinely worse there and the failure modes get pretty nasty. It's built for the high-volume production prompts a business runs over and over: classification, extraction, summarisation, qualification, generation from structured data, the workhorse stuff that runs all day. So if you're routing a coding agent, this isn't your tool, and I'd rather tell you that now than sell you a bad fit and have you find out later.

Want the detailed method?

This is the story version. If you want the measurement rigour written up properly, the blind judge, the swapped answer order, the length control, the confidence intervals and the research behind it, I did that separately in "Is a cheaper AI model good enough, and how to prove it".

Frequently asked questions

Is a cheaper AI model good enough to replace my current one?

On a specific, well-defined task, often yes, but only once you prove it on your own prompts. A cheaper model is not universally better. Measured against your current model on real traffic, it can match or beat the baseline on narrow tasks, though on broad reasoning and long-horizon agentic work it usually still loses. The answer is per prompt type, which is exactly why you measure per prompt type instead of switching everything at once.

How many comparisons do I need before I trust the result?

Enough to be statistically confident, not one lucky call. A single win is variance. You want a run of comparisons on your own prompts before you route real traffic, and you keep an instant fallback armed in case anything drifts later.
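
One standard way to see why a single win is just variance is to put a confidence interval around the pass rate. The sketch below uses the Wilson score interval for a proportion; choosing Wilson and the 95% level (`z = 1.96`) is an assumption for illustration, not necessarily what the detailed method uses.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `n` comparisons.
    z = 1.96 corresponds to roughly 95% confidence."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

One pass out of one comparison gives a lower bound of roughly 0.21, so you have learned almost nothing. Ninety-five passes out of 100 gives a lower bound of roughly 0.89, which is something you can actually route traffic on.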

Can a model be trusted to judge another model's answer?

Yes, if you correct for how these judges fail. They favour the answer they see first, they favour longer answers, and they favour their own family's output. So you judge blind, you swap the order, and the judge is never one of the contestants. The judge should reason at the same standard as the model being replaced, anchored to your baseline's class.
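
The order-swap part is easy to sketch in Python. The `judge` callable here is hypothetical: it is assumed to take a prompt and two unlabelled answers and return "first", "second" or "tie". Treating a verdict that flips when the order flips as "same" is this sketch's assumption about how to discount position bias. Length control is not shown.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer)
# -> "first" | "second" | "tie". The judge never sees which model wrote what.
Judge = Callable[[str, str, str], str]


def blind_verdict(judge: Judge, prompt: str, baseline_out: str, candidate_out: str) -> str:
    """Judge twice with the answer order swapped and return 'worse', 'same'
    or 'better' for the candidate. Inconsistent verdicts count as 'same'."""
    first_pass = judge(prompt, baseline_out, candidate_out)   # baseline shown first
    second_pass = judge(prompt, candidate_out, baseline_out)  # candidate shown first

    # Translate positional verdicts back into which model won each pass.
    winner_1 = {"first": "baseline", "second": "candidate", "tie": "tie"}[first_pass]
    winner_2 = {"first": "candidate", "second": "baseline", "tie": "tie"}[second_pass]

    if winner_1 != winner_2 or winner_1 == "tie":
        return "same"
    return "better" if winner_1 == "candidate" else "worse"
```

A judge that always prefers whichever answer it sees first comes out as "same" on every case, which is the point: its preference was about position, not quality.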

Won't switching break my response format?

It can, which is why format is proven separately and held to a full match. The output shape is validated on every call, and if the cheaper model ever drifts off-format the system reverts to your baseline instantly. The cost saving never comes at the price of a broken field.

Sources

  1. Zhang et al. (2024) - GSM1k: A Careful Examination of LLM Performance on Grade School Arithmetic
  2. Singh et al. (2025) - The Leaderboard Illusion
  3. Zheng et al. (2023) - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  4. Yang et al. (2023) - Large Language Models as Optimizers (OPRO)
  5. Ong et al. (2024) - RouteLLM

