claude-api-costs-in-production.txt

What does it actually cost to run Claude in production?

Author

Adam

The first question I get on almost every AI scoping call is some version of "what will this cost us per month?" It is a fair question, and the honest answer is: far less than most people expect, provided the integration is designed with cost in mind from day one.

How LLM pricing works

Claude, like every major model API, is priced per token. A token is roughly three quarters of a word. You pay one rate for the text you send in (the prompt, your documents, the conversation history) and a higher rate for the text the model writes back.

The output rate is typically four to five times the input rate, and that gap matters. Workloads that read a lot and write a little (classification, extraction, routing) are cheap. Workloads that write a lot (drafting long reports) cost more per call.
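To make the read-heavy versus write-heavy split concrete, here is a minimal sketch of per-call cost. The rates (3 and 15 dollars per million tokens, a five-to-one ratio) and both workload sizes are illustrative assumptions, not quoted prices.

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in dollars of one call; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed rates for illustration only; check the current price list.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00

# Hypothetical workloads: a classifier reads context and writes a short label,
# a report drafter reads the same context and writes several pages.
classify = call_cost(2_000, 20, INPUT_RATE, OUTPUT_RATE)
draft = call_cost(2_000, 3_000, INPUT_RATE, OUTPUT_RATE)

print(f"classification: ${classify:.4f} per call")
print(f"report draft:   ${draft:.4f} per call")
```

Under these assumptions both calls read the same amount, yet the draft costs roughly eight times as much, and most of that difference is output.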

A realistic example

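Say a customer operations team runs 20,000 enquiries a month through an assistant that reads the enquiry plus account context (about 2,000 input tokens) and drafts a reply (about 300 output tokens). That is 40 million input tokens and 6 million output tokens a month. On Claude Sonnet, at list prices of about 3 dollars per million input tokens and 15 dollars per million output tokens, that comes to roughly 210 dollars a month before any caching or routing. Not per seat: total. For a team of ten handling those enquiries, the software cost is a rounding error next to the hours saved.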
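The arithmetic behind an estimate like this is short enough to sketch. The request volume and token counts come from the example above; the two rates are assumptions passed in as parameters (3 and 15 dollars per million tokens here), and the result moves with whatever the current price list says.

```python
def monthly_cost(requests, input_tokens, output_tokens, input_rate, output_rate):
    """Return (total input tokens, total output tokens, dollars per month).

    Rates are dollars per million tokens.
    """
    total_in = requests * input_tokens
    total_out = requests * output_tokens
    cost = (total_in * input_rate + total_out * output_rate) / 1_000_000
    return total_in, total_out, cost


total_in, total_out, cost = monthly_cost(
    requests=20_000,
    input_tokens=2_000,
    output_tokens=300,
    input_rate=3.00,    # assumed rate, dollars per million input tokens
    output_rate=15.00,  # assumed rate, dollars per million output tokens
)

print(f"{total_in / 1e6:.0f}M input tokens, {total_out / 1e6:.0f}M output tokens")
print(f"about ${cost:.0f} per month before caching, routing, or batching")
```

Swapping in your own volumes and the rates of the model you plan to use gives a first-pass budget in a few seconds.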

If you want to model your own numbers, I built a free calculator that does the arithmetic across Claude, GPT, and Gemini at the same usage: the LLM cost calculator.

The three levers that cut costs most

  • Prompt caching: if every request re-sends the same instructions or reference documents, caching that prefix can cut the cost of those repeated input tokens by up to 90% on repeat calls.
  • Model routing: most pipelines have steps that do not need the strongest model. Routing easy steps to Haiku and hard ones to Sonnet or Opus routinely halves the bill.
  • Batch processing: anything that does not need an instant answer (overnight reports, bulk tagging) can run through batch endpoints at half price.
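A rough way to see how the three levers combine is to treat each as a multiplier on a baseline bill. This is a simplified sketch, not a pricing model: it assumes the levers are independent, ignores any surcharge for writing to the cache, and every share and ratio in the example call is hypothetical. The 90% and 50% discounts are the figures from the list above.

```python
def optimized_cost(
    input_cost,
    output_cost,
    cached_share=0.0,       # fraction of input tokens served from cache
    cache_discount=0.90,    # "up to 90%" off cached input
    easy_share=0.0,         # fraction of traffic routed to a smaller model
    small_model_ratio=1.0,  # smaller model's price relative to the main one
    batch_share=0.0,        # fraction of traffic that can wait for batch
    batch_discount=0.50,    # batch at half price
):
    """Estimate a monthly bill after caching, routing, and batching."""
    # Caching only reduces the input side of the bill.
    cached_input = input_cost * (1 - cached_share * cache_discount)
    total = cached_input + output_cost
    # Routing and batching are modelled as scaling the whole bill.
    total *= (1 - easy_share) + easy_share * small_model_ratio
    total *= 1 - batch_share * batch_discount
    return total


# Hypothetical baseline of 120 dollars input and 90 dollars output per month,
# half the input cacheable, 60% of traffic routed to a model at 30% of the price.
estimate = optimized_cost(
    120.0, 90.0, cached_share=0.5, easy_share=0.6, small_model_ratio=0.3
)
print(f"estimated bill: ${estimate:.2f} per month")
```

The point of a model like this is less the exact figure than seeing which lever moves the bill most for a given workload before any engineering time is spent.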

None of these are exotic. They are standard engineering decisions, but they have to be made deliberately. Retrofitting them after launch is harder than designing them in.

Where bills actually blow up

In practice the painful bills I have seen were never caused by the per-token price. They were caused by unbounded conversation history being re-sent on every turn, retry loops without backoff, and agents allowed to wander without step limits. All preventable with sensible guardrails: a cap on how much history is re-sent, retries with backoff and an attempt limit, and a step budget for agents.
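Each of those three failure modes has a guardrail that fits in a few lines. The sketch below is one possible shape, assuming the system prompt is passed separately from the message list; the function names, limits, and the `step` callable are hypothetical, and no particular SDK is implied.

```python
import random
import time


def trim_history(messages, max_messages=20):
    """Keep only the most recent turns so the prompt cannot grow without bound."""
    return messages[-max_messages:]


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying with exponential backoff and jitter, then give up."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # A real integration would catch only transient errors
            # (rate limits, timeouts) rather than every exception.
            if attempt == max_attempts - 1:
                raise
            # Waits of about 1s, 2s, 4s, plus jitter so clients do not
            # all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def run_agent(step, max_steps=10):
    """Run step() until it returns True or the step budget is spent."""
    for n in range(1, max_steps + 1):
        if step():
            return n
    raise RuntimeError(f"agent stopped after {max_steps} steps")
```

The limits themselves matter less than having them: each one turns an open-ended cost into a bounded one that can be multiplied out in advance.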

If you are weighing Claude against other models on cost and capability, my Claude vs ChatGPT comparison covers the trade-offs in detail. And if you want a number for your specific workload, book a free scoping call and we will model it together.