Why 10% of Your Users Eat Your AI Margin, and How to Control the Token Whales (2026)

About 10% of your users burn 70-80% of your tokens, so a flat credit price means your heaviest users quietly run at negative margin. Here is the worked math, and the cost-side fix that flattens the tail without rate-limiting your best customers.

Parity Layer | June 25, 2026 | 10 min read

Key takeaways

Roughly 10% of users drive 70-80% of token consumption, so a flat credit price leaves your whale tier running at negative margin while blended margin looks healthy.
AI-product gross margin is about 52% in 2026 versus 70-90% for traditional SaaS; inference is now a real variable cost of goods.
Waiting for cheaper tokens does not heal margin: volume growth, migration to newer models, and agentic token use eat the deflation, led by your whales.
Repricing and rate-limiting punish your best users; cutting per-call cost 30-60% flattens the fat tail without touching price or experience.
Prove a cheaper model is at least as good on your own prompts at about 95% confidence over 30+ comparisons, then switch with instant fallback.

In most AI products, about 10% of users burn 70-80% of the tokens, per Kyle Poyar's analysis of AI credit pricing (Growth Unhinged). You priced credits or AI actions once, for an average user who does not exist. The whales pay you the same flat-ish rate and cost you many times more to serve, so a slice of your heaviest users quietly runs at negative margin. The fix is not rate-limiting your best customers. It is cutting the per-call inference cost 30-60% so the expensive tail flattens by itself.

GitHub Copilot makes the danger concrete: it charged $10/month and lost an average of $20/user, up to $80 for the heaviest, before moving to usage-based token billing on June 1, 2026 (GitHub). Copilot is a coding tool, so it serves as evidence here, not as a use case Parity serves. The same fat-tail math hits support agents, summarizers, classifiers, content generators, and RAG answer engines that price in credits.

Why do 10% of users eat most of your AI margin?

Because token consumption follows a power law, not an average. Poyar's data shows 70-80% of AI token usage comes from roughly 10% of users (Growth Unhinged). You set one credit price for a blended average, but the average is a fiction: a small group of power users runs long prompts, big contexts, and chained calls, so their true cost of goods dwarfs what their plan brings in. The mean hides the whale.

Three behaviors push the tail wider. Power users send larger inputs and ask for longer outputs, both of which are billed per token. They use your product more days per month, so volume compounds. And on credit plans with rollover or generous included buckets, they front-load consumption while light users leave credits on the table, which makes your reported blended margin look healthier than the per-whale reality.
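To check whether your own base follows this pattern, here is a minimal sketch. It assumes you can export one monthly token total per user; `top_share` is a hypothetical helper name and `monthly_tokens` is made-up sample data, not a measurement.

```python
def top_share(tokens_per_user, top_fraction=0.10):
    """Share of all tokens consumed by the heaviest `top_fraction` of users."""
    if not tokens_per_user:
        raise ValueError("need at least one user")
    ranked = sorted(tokens_per_user, reverse=True)
    # Always count at least one user so tiny samples still work.
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Made-up sample: ten users, one of them a whale.
monthly_tokens = [
    40_000, 55_000, 60_000, 70_000, 90_000,
    120_000, 150_000, 200_000, 300_000, 3_900_000,
]

print(f"Top 10% share of tokens: {top_share(monthly_tokens):.0%}")
```

Run it on real usage exports before trusting any tier table: if the top decile's share is well below the 70-80% range, the whale math may matter less for your product.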

What does a power user actually cost you?

More than the plan price, often by a wide margin. The clean way to see it is a usage-tier table: split your base into light, median, and whale, then put inference cost per user against the revenue that user's plan generates. The light and median tiers subsidize the whales, and if the whale tier is large enough it drags the whole cohort toward or below breakeven.

User tier	Share of users	Share of tokens	Monthly inference cost / user (illustrative)	Monthly revenue / user	Gross margin on AI
Light	60%	8%	$1.20	$20	+94%
Median	30%	15%	$4.50	$20	+78%
Whale	10%	77%	$34.00	$20	-70%
Blended	100%	100%	$5.47	$20	+73%

Illustrative only. A flat $20 plan with a power-law token distribution: the blended margin looks fine (+73%) while the whale tier runs deeply negative. The blended cost is the user-weighted average of the tier costs ($1.20, $4.50, $34.00 at 60/30/10 shares). Cost and revenue figures are examples, not Parity measurements.
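The table's arithmetic can be reproduced in a few lines. This is a sketch using only the illustrative figures above; the names `TIERS`, `gross_margin`, and `blended_cost` are hypothetical, and you would substitute your own tier shares and costs.

```python
PLAN_PRICE = 20.00  # flat monthly revenue per user, from the table

# tier -> (share of users, monthly inference cost per user), from the table
TIERS = {
    "Light": (0.60, 1.20),
    "Median": (0.30, 4.50),
    "Whale": (0.10, 34.00),
}


def gross_margin(revenue, cost):
    """Gross margin as a fraction of revenue."""
    return (revenue - cost) / revenue


def blended_cost(tiers):
    """User-weighted average inference cost per user."""
    return sum(share * cost for share, cost in tiers.values())


for name, (share, cost) in TIERS.items():
    print(f"{name:<8} cost ${cost:>5.2f}  margin {gross_margin(PLAN_PRICE, cost):+.0%}")

blended = blended_cost(TIERS)
print(f"Blended  cost ${blended:>5.2f}  margin {gross_margin(PLAN_PRICE, blended):+.0%}")
```

The blended line is a weighted average, so it can stay strongly positive while one tier is deeply negative. That is why the per-tier rows are worth computing separately.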

The blended +73% is the number that lulls founders to sleep. It is real, and it is also the average that hides the whale. The moment whales grow as a share of your base, or a competitor's launch pulls them to heavier workflows, that blended line slides. Replit is the cautionary public example: its gross margins swung between -14% and +36% across 2025 (Growth Unhinged), the signature of a business whose COGS moves with usage it priced as if it were fixed.
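How fast that blended line slides can be sketched with the same illustrative figures. The assumption here, which is not from the article, is that non-whale users keep their 2:1 light-to-median mix as the whale share grows; `blended_margin` and `breakeven_whale_share` are hypothetical helper names.

```python
PLAN_PRICE = 20.00
WHALE_COST = 34.00
# Average cost of a non-whale user at the table's 60/30 light/median mix ($2.30).
NON_WHALE_COST = (0.60 * 1.20 + 0.30 * 4.50) / 0.90


def blended_margin(whale_share):
    """Blended gross margin when `whale_share` of users are whales."""
    cost = (1 - whale_share) * NON_WHALE_COST + whale_share * WHALE_COST
    return (PLAN_PRICE - cost) / PLAN_PRICE


def breakeven_whale_share():
    """Whale share of users at which blended cost equals the plan price."""
    return (PLAN_PRICE - NON_WHALE_COST) / (WHALE_COST - NON_WHALE_COST)


for share in (0.10, 0.20, 0.30):
    print(f"whales at {share:.0%} of users: blended margin {blended_margin(share):+.0%}")
```

Under these assumptions, each additional 10 points of whale share removes roughly 16 points of blended margin, so a modest shift in user mix moves the headline number a long way.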

The structural trap

You sell a spread. As Software Pricing Partners puts it, "When your credits roughly correlate with tokens and customers know the providers publish their prices, you have made your margin visible. You are selling a spread, and the buyer's job is to compress it" (Six Fatal Flaws of Credit-Based Pricing). Whales are the buyers most motivated to compress it, because they consume the most.

Why is the AI-product margin already lower than you think?

Because inference is now a real variable cost of goods, and it has already pulled the category's gross margin down to roughly 50%, well below classic SaaS. Traditional software runs 70-90% gross margin. AI products averaged about 52% in 2026, up from 41% in 2024 and 45% in 2025, per ICONIQ Growth (State of AI). Bessemer's State of AI 2025 found fast-ramping "Supernovas" averaging about 25% gross margin, with steadier "Shooting Stars" near 60% (Bessemer).

The SaaS CFO walks through the mechanism in one line: start at $100 revenue with $20 traditional COGS, an 80% margin. Add an AI feature that costs $15 in inference and COGS becomes $35, so margin falls from 80% to 65%, and that is before heavy users (The SaaS CFO). The anchor worth taping to your monitor: for every $1M in AI product revenue, roughly $230K can walk out as inference cost before anyone on your team is paid. Whales are the reason the "before heavy users" caveat matters so much.
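Written out as a worked equation, using only the figures quoted above:

```latex
\[
\text{gross margin} = \frac{\text{revenue} - \text{COGS}}{\text{revenue}}
\]
\[
\frac{100 - 20}{100} = 80\%
\quad\longrightarrow\quad
\frac{100 - (20 + 15)}{100} = 65\%
\]
\[
\text{inference share of revenue} = \frac{\$230\text{K}}{\$1\text{M}} = 23\%
\]
```

The first two lines are the per-$100 example. The third restates the $230K anchor as a share of revenue, which is the number a whale-heavy mix pushes upward.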

This is why Bain frames AI as introducing genuine variable costs into businesses that used to be near-fixed-cost, complicating the Rule of 40 (Bain). The whale problem is the same problem viewed through one cohort.

Should I just wait for token prices to fall?

No. Per-token prices are falling fast, roughly 10x per year by a16z's LLMflation measure (a16z) and a median near 50x per year by Epoch AI (Epoch AI), yet product margins do not self-heal. Three forces eat the discount: aggregate GenAI volume keeps climbing fast, customers migrate to each newest and most expensive frontier model, and reasoning or agentic workloads multiply the tokens spent per task.

Ethan Ding captures the migration effect bluntly: when a better model ships, "99% of demand immediately shifts" to it (Ethan Ding). Your whales lead that migration. They are the first to demand the new flagship and the first to push it into longer chains, so the per-token deflation you were counting on gets spent against you by exactly the cohort already running at negative margin.
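The three forces multiply, which is why deflation alone rarely shows up in the bill. Here is a rough sketch of that arithmetic. Every input in the example is a hypothetical placeholder, not a figure from the sources cited above, and `bill_multiplier` is an illustrative helper name.

```python
def bill_multiplier(price_deflation, request_growth,
                    tokens_per_task_growth, model_price_premium):
    """Ratio of next year's inference bill to this year's.

    price_deflation:        how many times cheaper a like-for-like token gets
    request_growth:         growth in the number of tasks served
    tokens_per_task_growth: growth in tokens per task (reasoning, agents)
    model_price_premium:    price of the newly adopted model relative to the
                            deflated like-for-like price
    """
    effective_price = model_price_premium / price_deflation
    return request_growth * tokens_per_task_growth * effective_price


# Hypothetical inputs chosen only to show the mechanics.
example = bill_multiplier(
    price_deflation=10,
    request_growth=2,
    tokens_per_task_growth=2.5,
    model_price_premium=4,
)
print(f"Bill changes by {example:.1f}x")
```

With these placeholder inputs a 10x price drop still leaves the bill at 2.0x. Plug in your own growth rates; the point is only that three modest multipliers can outrun one large divisor.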

Can I just reprice or rate-limit the whales instead?

You can, and the market is doing it constantly, but it is the painful lever. PricingSaaS counted more than 1,800 pricing changes across the top 500 SaaS and AI companies in 2025, about 3.6 per company, with credit-based pricing up 126% year over year and hybrid seat-plus-credit plans rising from 27% to 41% (Growth Unhinged). Cursor's redenomination raised effective per-unit rates more than 20x (Software Pricing Partners). Every one of those changes risks churning the heavy users who are also your loudest advocates and your best expansion accounts.

Rate limits and throttles are worse: you degrade the experience for the 10% who use the product most. Tom Tunguz frames the real escape: "Reselling inference at cost is a zero-margin business: a payment rail, not a software company." The fix, he argues, is to widen margin by reducing inference cost through routing, caching, and distillation (Tunguz). That is a cost-side move, and it does not touch your price or your power users.

How does cutting per-call cost flatten the fat tail?

Because the whale problem is multiplicative: whales hurt because they make many calls, so any per-call saving compounds hardest on the heaviest users. Cut the cost of a single call 30-60% and the whale tier, which generates 77% of your tokens, gets the largest absolute reduction. The light tier barely notices because it was already profitable. You flatten the tail without sending a single "your plan is changing" email.

User tier	Share of tokens	Monthly cost / user before (illustrative)	Monthly cost / user after 45% cut	Margin before	Margin after
Light	8%	$1.20	$0.66	+94%	+97%
Median	15%	$4.50	$2.48	+78%	+88%
Whale	77%	$34.00	$18.70	-70%	+7%
Blended	100%	$5.47	$3.01	+73%	+85%

Illustrative only. A 45% per-call cost cut on a $20 plan moves the whale tier from -70% to about +7% and lifts blended margin from +73% to +85%, with no price change and no throttling. Each blended cost is the user-weighted average of its tier costs at 60/30/10 user shares. Figures are examples, not Parity measurements.
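The same table can be recomputed for any cut size, along with the smallest cut that brings a tier to breakeven. This is a sketch on the illustrative figures only. It assumes the saving applies uniformly to every call; `apply_cut`, `margin`, and `breakeven_cut` are hypothetical helper names.

```python
PLAN_PRICE = 20.00
USER_SHARE = {"Light": 0.60, "Median": 0.30, "Whale": 0.10}
COST_BEFORE = {"Light": 1.20, "Median": 4.50, "Whale": 34.00}


def apply_cut(costs, cut):
    """Scale every tier's cost by (1 - cut): a uniform per-call saving."""
    return {tier: cost * (1 - cut) for tier, cost in costs.items()}


def margin(cost, price=PLAN_PRICE):
    """Gross margin as a fraction of the plan price."""
    return (price - cost) / price


def breakeven_cut(cost, price=PLAN_PRICE):
    """Smallest per-call cut that brings a tier's cost down to the plan price."""
    return max(0.0, 1 - price / cost)


cost_after = apply_cut(COST_BEFORE, 0.45)
for tier, before in COST_BEFORE.items():
    saved = before - cost_after[tier]
    print(f"{tier:<7} saves ${saved:5.2f}/user  "
          f"margin {margin(before):+.0%} -> {margin(cost_after[tier]):+.0%}")

blended_after = sum(USER_SHARE[t] * cost_after[t] for t in cost_after)
print(f"Blended cost after cut: ${blended_after:.2f}")
print(f"Whale breakeven cut: {breakeven_cut(COST_BEFORE['Whale']):.1%}")
```

In this example the whale tier needs a cut of about 41% just to reach breakeven. A cut at the low end of the 30-60% range would shrink the loss without eliminating it.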

Notice where the gain lands. The whale tier moves from -70% to roughly +7%, the difference between a cohort that bleeds you and one that pays its own way. Blended margin climbs to +85% because the cut is biggest exactly where the tokens are. This is the FinOps-for-AI move applied to a cohort, not a billing redesign. For the broader playbook, see the FinOps for AI guide.

How do you cut the cost without degrading the whale's output?

By proving a cheaper model is at least as good on your own prompts before any traffic moves, and falling back instantly when it is not. Generic routers (OpenRouter, LiteLLM, Portkey, Martian, Not Diamond, Cloudflare AI Gateway, Helicone) pick models by heuristic or by classifiers tuned on generic benchmarks like MMLU and RouterBench. That creates two failure modes: over-routing, where a cheap model degrades a hard task and ships quality risk to your user, and under-routing, where you overpay out of caution. None of them prove equivalence on your actual traffic before switching.

Parity runs that proof on the customer's own prompts. A blind self-baseline judge compares the cheaper model's answer against the baseline, a switch requires about 95% statistical confidence over 30 or more comparisons on your own prompts before any traffic moves, and the response format is guaranteed with instant fallback to the baseline. Once a cheaper model's prompt is optimized for your specific task and measured against your baseline, it can match or beat that baseline on that task. The output is better, or at least as good, proven on your prompts, not asserted from a leaderboard. See how it works and how to prove a cheaper model is good enough.
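The article does not spell out the statistical procedure. One common way to reach "about 95% confidence over 30 or more comparisons" is a one-sided exact binomial (sign) test, sketched below. This is an assumption for illustration, not a description of Parity's implementation. A comparison counts as a success when the blind judge rates the cheaper answer at least as good as the baseline. The `null_rate` of 0.5 and the helper names are likewise illustrative choices.

```python
from math import comb


def binomial_tail(successes, n, null_rate=0.5):
    """P(X >= successes) for X ~ Binomial(n, null_rate)."""
    return sum(
        comb(n, k) * null_rate**k * (1 - null_rate) ** (n - k)
        for k in range(successes, n + 1)
    )


def passes_gate(successes, n, null_rate=0.5, alpha=0.05, min_comparisons=30):
    """True if there are enough comparisons and the one-sided p-value is below alpha."""
    if n < min_comparisons:
        return False  # too little evidence, regardless of the win rate
    return binomial_tail(successes, n, null_rate) < alpha


for wins in (19, 20, 21):
    p = binomial_tail(wins, 30)
    print(f"{wins}/30 successes: p = {p:.4f}, switch allowed: {passes_gate(wins, 30)}")
```

Under these choices, 20 successes out of 30 is the smallest count that clears the 5% threshold. How ties are scored and which null rate is used change that bar, so treat both as decisions to make deliberately.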

Why this beats a router for the whale problem

A whale's workload is the hardest to route blindly, because it is long, varied, and high-stakes for your churn. Routing by generic benchmark risks degrading exactly the cohort you most need to keep. Proving equivalence on that whale's own prompts first, then switching with instant fallback, is the difference between protecting margin and gambling with your best accounts.
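The "switch with instant fallback" pattern can be sketched as a thin wrapper. This is not Parity's implementation. It assumes the required response format is a JSON object with known keys, and that each model is a plain callable taking a prompt and returning a string. `call_with_fallback`, `is_valid_format`, and the stand-in models are hypothetical.

```python
import json


def is_valid_format(text, required_keys):
    """Accept the response only if it is a JSON object containing the required keys."""
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data.keys())


def call_with_fallback(prompt, cheap_model, baseline_model, required_keys):
    """Try the cheaper model first; use the baseline on any error or bad format."""
    try:
        answer = cheap_model(prompt)
        if is_valid_format(answer, required_keys):
            return answer, "cheap"
    except Exception:
        pass  # any failure on the cheap path falls through to the baseline
    return baseline_model(prompt), "baseline"


# Stand-in models for demonstration only.
def fake_cheap(prompt):
    return "not json"


def fake_baseline(prompt):
    return json.dumps({"answer": "ok"})


result, served_by = call_with_fallback("q", fake_cheap, fake_baseline, {"answer"})
print(served_by, result)
```

Returning which path served the request makes the fallback rate measurable. A rising rate is an early sign that the cheaper model no longer fits the workload.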

What about the credit price itself?

You can leave it alone, which is the point. Real products show how visible the spread already is. Intercom Fin charges $0.99 per resolution (Intercom). Across the category, vendors increasingly meter AI by credits or actions, weight steps by model tier (cheaper models cost fewer credits than frontier ones), and bill roughly in proportion to the underlying token volume. In every such case the customer can roughly back into your provider cost. Cutting the cost behind the credit widens your spread quietly, where repricing widens it loudly and invites the buyer to fight back.

The strategic sequence is simple: cut per-call cost first, protect or widen margin, and keep repricing as a later lever for value capture rather than survival. You stop selling a thinning spread and start selling software again.

Key takeaways

Roughly 10% of your users drive 70-80% of token consumption, so a flat credit price means the whale tier often runs at negative margin while your blended number looks fine.
AI-product gross margin sits near 52% in 2026 versus 70-90% for classic SaaS; whales are why the "before heavy users" caveat keeps biting.
Waiting for cheaper tokens does not heal margin: volume growth, migration to the newest models, and agentic token multiplication eat the deflation, and whales lead all three.
Repricing and rate-limiting punish your best users; cutting per-call cost 30-60% flattens the fat tail without touching price or experience.
Prove a cheaper model is at least as good on your own prompts (about 95% confidence over 30+ comparisons), then switch with instant fallback, instead of routing blind on generic benchmarks.

You can run this on your own prompts for free, up to 10 prompts, no credit card, at the Parity dashboard. Start with the LLM cost optimization guide if you want the full playbook first.

Frequently asked questions

What counts as a power user or token whale?

A power user is any account whose token consumption sits far above the median, typically through long inputs, long outputs, frequent sessions, or chained calls. In most AI products the top 10% of users generate 70-80% of total tokens, so a handful of accounts can dominate your inference bill and your COGS.

How can blended margin look fine while whales lose money?

Blended margin is an average across all users. Light and median users are usually profitable and outnumber whales, so their healthy margins mask a small negative-margin tail. The risk is that as whales grow as a share of usage, the blended number slides toward or below breakeven, as Replit's -14% to +36% swings in 2025 illustrate.

Why not just raise prices or add usage-based billing for heavy users?

You can, and many companies did (over 1,800 SaaS pricing changes in 2025), but repricing risks churning your heaviest and most vocal users, and rate limits degrade the experience for the people who use your product most. Cutting per-call cost first widens margin quietly without touching price, then repricing becomes a value lever instead of a survival move.

How much can cutting per-call cost actually help the whale tier?

Because whales make the most calls, a per-call saving compounds hardest on them. In an illustrative model, a 45% per-call cost cut moves the whale tier from -70% to about +7% margin and lifts blended margin from +73% to +85%, with no price change and no throttling. Your real numbers depend on your prompts and traffic.

How is this different from a model router?

Generic routers (OpenRouter, LiteLLM, Portkey, Cloudflare AI Gateway, Helicone, and similar) pick models by heuristic or by classifiers tuned on generic benchmarks, which risks degrading hard tasks (over-routing) or overpaying (under-routing). Parity proves a cheaper model is at least as good on your own prompts, at about 95% confidence over 30+ comparisons, before switching, with the response format guaranteed and instant fallback to the baseline.
