AI · 6 min read · 10 September 2025

The Hidden Costs of Production AI

The cost discussion around AI applications almost always focuses on inference. Once you are at scale, inference is rarely the most important cost.

AI · Engineering · Production · Architecture · Costs

The cost discussion around AI applications almost always focuses on inference. How many tokens does the application use? What is the cost per call? How do we reduce unnecessary model invocations? These are reasonable questions and the numbers can be significant. But after watching several teams reach production scale, I have come to think that inference cost is rarely the most important cost.
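For a sense of why those questions feel so tractable, the per-call arithmetic is simple. The sketch below uses placeholder per-token prices (not any provider's actual rates) and made-up traffic numbers:

```python
# Illustrative only: placeholder prices and traffic, not real rates.
INPUT_PRICE_PER_1K = 0.003   # USD per 1,000 input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015  # USD per 1,000 output tokens (assumed)

def monthly_inference_cost(calls_per_day: int, input_tokens: int,
                           output_tokens: int) -> float:
    """Estimate monthly inference spend from per-call token counts."""
    per_call = (
        (input_tokens / 1000) * INPUT_PRICE_PER_1K
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    )
    return per_call * calls_per_day * 30

# e.g. 50,000 calls/day at 2,000 input / 500 output tokens ≈ $20,250/month
print(monthly_inference_cost(50_000, 2_000, 500))
```

The number is easy to compute and easy to put in a dashboard, which is exactly why it dominates the conversation.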

The costs that tend to matter more once an AI application is running at scale are harder to see in a dashboard. Human review is one. Most AI applications in regulated industries or high-stakes domains cannot simply deploy model outputs directly to end users without some level of human oversight. The infrastructure for that oversight, from the review tools and processes to reviewer training and the time of the people doing the reviewing, is expensive and tends to be underestimated during planning.
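In practice, oversight usually means routing some fraction of outputs through a review queue before they reach users. Here is a minimal, hypothetical sketch: the confidence score, the threshold, and the in-memory queue are all assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from queue import Queue

# Hypothetical threshold: outputs scoring below it are held for human review.
REVIEW_THRESHOLD = 0.85

@dataclass
class ModelOutput:
    request_id: str
    text: str
    confidence: float  # assumed to come from the model or a separate scorer

review_queue: Queue = Queue()  # in production, a persistent queue with tooling

def route_output(output: ModelOutput) -> str | None:
    """Deliver high-confidence outputs; hold the rest for a reviewer."""
    if output.confidence >= REVIEW_THRESHOLD:
        return output.text  # delivered directly to the end user
    review_queue.put(output)  # a reviewer approves, edits, or rejects it later
    return None
```

The routing logic is trivial. The cost lives in everything behind the queue: the reviewer interface, the escalation process, and the salaried people working through it.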

Evaluation costs are another. A team that is serious about knowing how well their AI application is working needs to invest continuously in evaluation. Ground truth datasets need to be maintained and expanded. Evaluation runs need to happen regularly, especially after model updates or prompt changes. The people who understand the domain well enough to evaluate model outputs are expensive. This is not an optional cost that can be deferred until scale justifies it. Teams that skip evaluation early build blind spots that become problems later.
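Continuous evaluation tends to look like a harness that replays a maintained ground-truth set against the current system and reports a score. A minimal sketch, under assumed conventions: `model_fn`, the JSONL layout, and exact-match scoring are all illustrative, and real scoring is usually domain-specific.

```python
import json

def run_evaluation(model_fn, ground_truth_path: str) -> float:
    """Score `model_fn` against a maintained ground-truth set.

    The JSONL layout ({"input": ..., "expected": ...}) and exact-match
    scoring are illustrative assumptions; real scoring is usually
    domain-specific and often needs expert judgement.
    """
    with open(ground_truth_path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(
        1
        for case in cases
        if model_fn(case["input"]).strip() == case["expected"].strip()
    )
    return passed / len(cases)

# Re-run after every model update or prompt change and compare against the
# previous score, so regressions surface before users find them.
```

The harness itself is cheap. The recurring cost is the dataset: keeping it current as the application's scope grows, and paying domain experts to label and re-label it.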

The operational complexity cost is often the most surprising. AI applications have different failure modes than conventional software. They fail silently and unexpectedly. A model update from the provider changes behaviour in ways the application owner did not anticipate. A new category of user queries surfaces a failure mode that did not appear in testing. Debugging these failures calls for different skills and tooling than debugging a conventional application does. Building that capability takes time and tends to be more expensive than anticipated.
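One piece of that tooling is worth sketching: structured logs that tie every output to the exact model and prompt version that produced it, so a silent provider-side change shows up as a discontinuity rather than a mystery. The field names below are illustrative, not a standard schema.

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("ai_app")

def log_model_call(model_id: str, prompt_version: str,
                   user_input: str, output: str) -> None:
    """Record enough context to debug a silent behaviour change later.

    Field names are illustrative. The point is that every output is tied
    to the exact model and prompt version that produced it, so a
    provider-side update shows up as a discontinuity in the logs.
    """
    logger.info(json.dumps({
        "ts": time.time(),
        "model_id": model_id,          # the provider's versioned model name
        "prompt_version": prompt_version,
        "input_hash": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
    }))
```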

Prompt maintenance is a cost that teams rarely plan for. The prompts that make an AI application work are not set-and-forget. They need updating as the application scope changes, as model behaviour shifts, and as new failure modes are discovered. In larger applications with many prompts, prompt management becomes an engineering discipline in itself.
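Treating prompts as versioned, reviewable artifacts is one common shape that discipline takes. A minimal sketch, with assumed names and an in-memory registry standing in for what would normally be version control:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    """A versioned prompt, treated like any other deployable artifact."""
    name: str
    version: str
    template: str

# Illustrative in-memory registry; in practice prompts live in version
# control so changes are reviewed, evaluated, and easy to roll back.
PROMPTS = {
    ("summarise_ticket", "v3"): PromptTemplate(
        name="summarise_ticket",
        version="v3",
        template="Summarise the following support ticket in two sentences:\n\n{ticket}",
    ),
}

def render(name: str, version: str, **values: str) -> str:
    """Fill a specific prompt version, so callers pin what they depend on."""
    return PROMPTS[(name, version)].template.format(**values)
```

Pinning callers to an explicit version means a prompt change is a deliberate, testable release rather than a silent edit.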

I am not arguing that these costs make AI applications uneconomical. In many cases the value delivered is large enough that the costs are clearly justified. The argument is that teams that base the build decision on expected inference cost alone will be surprised by what the total cost of ownership actually looks like.
