AI cost per user: how to model it before you ship
2026-05-22 / 4 min / ai / unit-economics / pricing / founders
Founders ask 'how much will the AI cost?' and quote API prices. The API price is the least useful number in that conversation. Here's a practical model for cost per active user per month: the six terms that matter, and the five levers that actually move them.
"How much will the AI cost?" is the question every founder asks. It is also the one most founders ask the wrong way.
What they usually want is the API price - dollars per million tokens, dollars per request. That number is on the provider's pricing page. It is the least useful number in the conversation.
The number you actually need is cost per active user per month. Everything that matters in AI economics lives there - pricing, margin, scale risk, whether the feature survives a quarter at 10x volume. If you cannot state it, you cannot make a decision.
Here is the model I use when working through that number with a founder.
What the formula actually looks like
Cost per active user per month is six terms, not one. Each is a place where a first-pass model usually goes wrong.
- Sessions per user - the "monthly active user" in your dashboard generates 5 to 30 sessions, not one. Variance kills the average.
- Turns per session - a single user message often hides 3 to 8 backend calls. Tool use, retrieval, reranking, follow-ups. Per-request math undercounts by 3-5x.
- Tokens per turn - input tokens are usually 5 to 20x output tokens once you add retrieved context, system prompts, and tool definitions. Founders model output-heavy when the bill is input-heavy.
- Retry multiplier - 1.10 to 1.25 in a healthy system. Higher if you hit refusals or schema failures. Most first models assume 1.00.
- Retrieval cost - embedding calls, vector DB reads, reranking. Often 10 to 30% of inference spend by itself. Frequently missed entirely.
- Overhead - eval traffic, shadow runs during a model switch, observability at AI volume. Usually 5 to 15% on top of everything else.
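A model with only the first three terms runs light on the real cost. By the ranges above, the last three alone compound to roughly 1.3 to 1.9x, and the gap reaches 2-3x once turns per session or input tokens are also undercounted.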
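One way to wire the six terms together, as a minimal sketch. The way the terms multiply here is my reading of the list above, not a formula from any provider, and every field name, price, and input value is a placeholder to swap for your own measurements.

```python
from dataclasses import dataclass, replace


@dataclass
class UsageAssumptions:
    sessions_per_user: float        # sessions per active user per month
    turns_per_session: float        # backend model calls per session, not user messages
    input_tokens_per_turn: float    # includes system prompt, retrieved context, tool definitions
    output_tokens_per_turn: float
    input_price_per_mtok: float     # USD per million input tokens (hypothetical)
    output_price_per_mtok: float    # USD per million output tokens (hypothetical)
    retry_multiplier: float = 1.0   # 1.10 to 1.25 in a healthy system
    retrieval_share: float = 0.0    # retrieval cost as a fraction of inference spend
    overhead_share: float = 0.0     # evals, shadow runs, observability, on top of everything


def cost_per_user_month(a: UsageAssumptions) -> float:
    """USD per active user per month under the assumptions in `a`."""
    turns = a.sessions_per_user * a.turns_per_session
    per_turn = (
        a.input_tokens_per_turn * a.input_price_per_mtok
        + a.output_tokens_per_turn * a.output_price_per_mtok
    ) / 1_000_000
    inference = turns * per_turn * a.retry_multiplier
    retrieval = inference * a.retrieval_share
    return (inference + retrieval) * (1 + a.overhead_share)


# Hypothetical inputs, for illustration only
first_three = UsageAssumptions(
    sessions_per_user=12, turns_per_session=5,
    input_tokens_per_turn=6_000, output_tokens_per_turn=500,
    input_price_per_mtok=3.0, output_price_per_mtok=15.0,
)
all_six = replace(first_three, retry_multiplier=1.15,
                  retrieval_share=0.20, overhead_share=0.10)

print(f"first three terms: ${cost_per_user_month(first_three):.2f}")
print(f"all six terms:     ${cost_per_user_month(all_six):.2f}")
```

With these made-up inputs the first-three-terms model comes to about $1.53 per user per month and the six-term model to about $2.32. The useful part is not the totals but seeing which term moves the result most when you change it.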
Why the average is a trap
A founder shows me a number: "We estimate 60 cents per user per month."
The right next question is: which user?
In every AI product I have worked on, user-level cost is heavy-tailed. The P50 user costs well below the average. The P95 user costs 4 to 10x the P50. The P99 user can cost 30 to 50x as much as the P50 user.
The reasons are mundane. A small share of users hammer the feature. Long sessions accumulate context. Some users find a workflow that hits your most expensive code path on every turn. None of it shows up in the mean.
This breaks two decisions.
- Pricing - a flat $X/user/month tier can flip negative on the P99 cohort alone. The fix is not a higher flat price. It is a usage-aware cap or an explicit tier.
- Scale - doubling MAU does not reliably double cost. With a heavy tail, a handful of new P99 users, or a new cohort that skews heavier than your early adopters, moves the bill more than the headcount suggests. It can land closer to triple.
Always model the distribution, never just the average.
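A small sketch of what modeling the distribution can look like in practice. The functions take a plain list of per-user monthly costs. The synthetic lognormal data, its parameters, and the $2.00 flat price are assumptions to make the example run, and a real tail may be heavier than a lognormal produces. Feed it per-user costs from your own billing logs instead.

```python
import random
import statistics


def cost_summary(costs: list[float]) -> dict[str, float]:
    """Mean and P50/P95/P99 of per-user monthly cost."""
    # n=100 returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(costs, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(costs),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


def share_unprofitable(costs: list[float], price: float) -> float:
    """Fraction of users whose AI cost alone exceeds the flat price."""
    return sum(c > price for c in costs) / len(costs)


# Synthetic stand-in for real per-user billing data (assumed shape and scale)
rng = random.Random(0)
synthetic_costs = [0.25 * rng.lognormvariate(0.0, 1.3) for _ in range(10_000)]

summary = cost_summary(synthetic_costs)
for name, value in summary.items():
    print(f"{name}: ${value:.2f}")
print(f"users above a $2.00 flat price: {share_unprofitable(synthetic_costs, 2.00):.1%}")
```

The comparison to watch is the mean against the P50, and the share of users who cost more than the tier they pay for.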
What founders consistently underestimate
In approximate order of impact:
- Context bloat over time - prompts grow as the product evolves. A prompt that started at 1,500 tokens is at 6,000 a year later. The cost per turn drifted up while nobody was watching.
- Retrieval pulled too aggressively - "top 10 chunks" felt safe at design time. Half are never used by the model and you paid for them anyway.
- Eval traffic in production - good teams run a real eval harness against production traffic or new model versions. That traffic is real money on the bill.
- Idle session context - a chat that keeps a session open for 30 days replays history on every turn. Founders model the first turn and forget the tenth carries the first nine.
- The wrong model by default - 60 to 80% of traffic does not need the most expensive model. The discovery usually comes after a quarter of overpaying.
None of these are visible on day one. All of them are visible by month three.
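The replayed-history effect is easy to put numbers on. A minimal sketch, assuming every turn resends the system prompt plus all earlier user messages and replies, with no prompt caching or history truncation. The token counts in the usage lines are hypothetical.

```python
def session_input_tokens(turns: int, system_tokens: int,
                         user_tokens: int, reply_tokens: int) -> int:
    """Total input tokens billed over one session when history is replayed each turn."""
    total = 0
    for turn in range(1, turns + 1):
        # Everything said in earlier turns is sent again as input
        history = (turn - 1) * (user_tokens + reply_tokens)
        total += system_tokens + history + user_tokens
    return total


# Hypothetical session: 1,500-token system prompt, 100-token messages, 300-token replies
replayed = session_input_tokens(turns=10, system_tokens=1_500,
                                user_tokens=100, reply_tokens=300)
first_turn_times_ten = 10 * session_input_tokens(turns=1, system_tokens=1_500,
                                                 user_tokens=100, reply_tokens=300)
print(replayed, first_turn_times_ten)
```

For these inputs the replayed total is 34,000 input tokens against 16,000 for "first turn times ten", and the gap widens with every extra turn because the history term grows quadratically.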
The five levers that actually move cost
Once you have a real number and the distribution behind it, five levers consistently work. The first matters more than the others combined.
- Use the smallest model that passes your evals - the single largest source of savings in every product I have seen. Most teams default to the strongest model out of caution and never revisit. Run your eval set against a cheaper model. If it passes, switch.
- Cache aggressively - prompt-prefix caching, exact-match on deterministic tools, semantic caching on repeat questions. A well-cached product cuts inference spend 20 to 40% with no quality drop.
- Cap the long tail - max tokens per session, max turns per hour, max retrieval depth per query. The difference between predictable margin and bleeding on power users.
- Route across providers - once you cross meaningful traffic, staying single-provider leaves money on the table. Capability-aware routing with a strict latency budget consistently delivers 20 to 30% in provider savings - an orchestration layer I built for one team landed at 28%.
- Charge for the expensive parts - if a workflow is structurally expensive (long-context summarization, multi-step agents, custom retrieval over a customer's full archive), price it as usage-based, not bundled into a flat tier.
Model choice, caching, and routing are engineering. Caps and usage-based pricing are product decisions as much as engineering ones. All five are work, not luck.
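Two of the levers above reduce to very little code once the measurements exist. The sketch below picks the cheapest model that clears an eval threshold and enforces a per-session token cap. The model names, costs, pass rates, and limits are hypothetical, and the eval pass rate is assumed to come from your own eval set, not from a public benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical model label
    cost_per_1k_turns: float  # measured USD per 1,000 turns on your traffic
    eval_pass_rate: float     # share of your own eval set this model passes


def cheapest_passing(candidates: list[Candidate],
                     min_pass_rate: float) -> Optional[Candidate]:
    """Cheapest candidate that meets the eval threshold, or None if none do."""
    passing = [c for c in candidates if c.eval_pass_rate >= min_pass_rate]
    return min(passing, key=lambda c: c.cost_per_1k_turns, default=None)


@dataclass
class SessionCap:
    max_tokens: int
    used_tokens: int = 0

    def allow(self, requested_tokens: int) -> bool:
        """Admit the turn only if it fits in the remaining session budget."""
        if self.used_tokens + requested_tokens > self.max_tokens:
            return False
        self.used_tokens += requested_tokens
        return True


# Hypothetical measurements
models = [
    Candidate("small", cost_per_1k_turns=0.40, eval_pass_rate=0.93),
    Candidate("medium", cost_per_1k_turns=1.50, eval_pass_rate=0.96),
    Candidate("large", cost_per_1k_turns=6.00, eval_pass_rate=0.98),
]
choice = cheapest_passing(models, min_pass_rate=0.95)
print(choice.name if choice else "no candidate passes")
```

What happens when `allow` returns False is the product decision: degrade to a cheaper model, summarize and truncate history, or prompt an upgrade. The code only makes the ceiling explicit.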
Six questions before you ship
If you cannot answer these in numbers, you do not have a cost model yet.
- What is your projected cost per active user per month, broken into inference, retrieval, and overhead?
- What does that number look like at the P50, P95, and P99 user?
- What is your retry multiplier, and how did you measure it?
- What share of your traffic runs on the most expensive model, and have you tested whether it needs to?
- What is your cost ceiling per user, and what happens when a user hits it?
- At 10x current usage, what in the model breaks first?
A team that can answer those six is much less likely to have a margin surprise in month four.
The conversation worth having
The right time to model AI cost per user is before you ship, when the levers are cheap to pull. The expensive version is discovering the math in production and re-architecting under pressure. If you are still upstream of that, scope the engagement first.
This is most of what gets covered in a paid scoping call. We sit with the formula, fill in the numbers from your traffic, and find the two or three levers that move your specific product. The output is a defensible number and a short list of what to change first.
If you are about to commit to a number on a slide, book a scoping call first. One hour of math has saved more than one company a quarter of revenue.
