Compare LLM API costs and local inference TCO side by side
Which models do you want to price?
Select the providers you're considering. Results sort cheapest first.
Anthropic
OpenAI
How much do you use per day?
Not sure? Pick a preset — you can fine-tune from there.
System prompt + user message combined
The model's response — usually 5–20% of input
Repeated prompts served from cache — use 0% if unsure
This calculator answers the question every AI engineer eventually hits: “Should I pay for API calls or buy hardware?” It compares cloud LLM API costs (Claude, GPT-5, Gemini) against local inference hardware total cost of ownership — and tells you the exact month where local pays for itself.
Three modes in one tool: API Cost mode shows your monthly spend across all major models side by side. Local Hardware TCO models the real cost of running a GPU or Pi accelerator, including electricity derived from actual inference time (not a naïve 24/7 power-draw assumption). Break-even mode overlays the two and draws the crossover line.
API Cost mode — enter your daily input and output token volumes, your cache hit rate, and select which models to compare. The table sorts cheapest-first. Start here if you’re running or planning a cloud-only workload.
Local Hardware TCO mode — pick a hardware option (RTX 5060 Ti 16GB, RTX 5060 8GB, Pi AI HAT+ 2), select a model that runs on it, enter your token volume and local electricity rate, and set an amortization period. The calculator computes inference time from published tok/s benchmarks, then derives actual electricity cost from that runtime.
Break-even mode — choose one cloud model and one hardware option. The chart shows cumulative cost over 36 months for each path, and marks the month where local becomes cheaper.
Key inputs:
API cost: monthly_cost = (input_tokens/day × 30 × (1 − cache_rate) × input_price_per_Mtok) + (input_tokens/day × 30 × cache_rate × cached_price_per_Mtok) + (output_tokens/day × 30 × output_price_per_Mtok)
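The API cost formula above translates directly to code. The prices in the example call are placeholders for illustration, not any provider's current rate card:

```python
def monthly_api_cost(
    input_tokens_per_day: float,
    output_tokens_per_day: float,
    cache_hit_rate: float,          # 0.0 to 1.0
    input_price_per_mtok: float,    # $ per million fresh input tokens
    cached_price_per_mtok: float,   # $ per million cached input tokens
    output_price_per_mtok: float,   # $ per million output tokens
) -> float:
    """Monthly API cost in dollars, assuming a 30-day month."""
    m = 1_000_000
    fresh = input_tokens_per_day * 30 * (1 - cache_hit_rate) / m * input_price_per_mtok
    cached = input_tokens_per_day * 30 * cache_hit_rate / m * cached_price_per_mtok
    output = output_tokens_per_day * 30 / m * output_price_per_mtok
    return fresh + cached + output

# Example: 1M input tok/day, 150k output tok/day, 50% cache hits,
# $3/Mtok input, $0.30/Mtok cached, $15/Mtok output (placeholder prices)
cost = monthly_api_cost(1_000_000, 150_000, 0.5, 3.0, 0.30, 15.0)  # → 117.0
```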
Local TCO: inference time is derived from published decode benchmarks (tok/s). Electricity cost = (inference_seconds/day × TGP_watts × 30) ÷ 3,600,000 (converting watt-seconds to kilowatt-hours) × your electricity rate. Hardware cost is amortized linearly over your chosen period. Total monthly cost = amortized_hardware + electricity.
TGP (Total Graphics Power) is used as a conservative upper bound. Real inference draw is typically 60–85% of TGP, so your actual electricity cost may run lower.
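The local TCO math, sketched in Python. The tok/s, wattage, and price figures in the example are illustrative placeholders, not measured benchmarks for any specific card:

```python
def monthly_local_tco(
    tokens_per_day: float,       # tokens generated per day
    decode_tok_per_s: float,     # published decode benchmark for model + hardware
    tgp_watts: float,            # Total Graphics Power (conservative upper bound)
    rate_per_kwh: float,         # local electricity rate, $/kWh
    hardware_cost: float,        # purchase price in dollars
    amortization_months: int,
) -> float:
    """Monthly cost of local inference: amortized hardware + electricity."""
    inference_seconds_per_day = tokens_per_day / decode_tok_per_s
    # watt-seconds per 30-day month -> kWh (1 kWh = 3,600,000 watt-seconds)
    kwh_per_month = inference_seconds_per_day * tgp_watts * 30 / 3_600_000
    electricity = kwh_per_month * rate_per_kwh
    amortized = hardware_cost / amortization_months
    return amortized + electricity

# Example (illustrative): 150k tok/day at 40 tok/s on a 180 W card,
# $0.15/kWh, $450 hardware amortized over 24 months  ≈ $19.59/month
cost = monthly_local_tco(150_000, 40, 180, 0.15, 450, 24)
```

Note how little the electricity term matters at moderate load: amortized hardware dominates, which is why the amortization period you choose moves the break-even point far more than your power rate does.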
The 10–100× cost surprise is real. A prototype that costs $50/month at toy load can scale to $500–$5,000/month in production, and the developers who get hit hardest are the ones who never modeled the token math before launch.
Two failure modes I’ve seen: (1) teams that default to GPT-4-class models for everything because “quality” — not realizing that Claude Haiku or Gemini Flash handles 80% of their tasks at 10–15× lower cost; (2) teams that buy expensive GPU hardware expecting to save money, without doing the break-even math — and find out the card pays off in 18 months at their actual load, by which time a cheaper API model has launched.
The cache hit rate input exists because most people underestimate it. If your system prompt is 4,000 tokens and most requests share it, prompt caching can cut your input bill by 40–50% with zero code complexity on the Anthropic API. This calculator lets you model that directly.
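A back-of-envelope check of that claim, using hypothetical placeholder prices (a 10× discount for cached input, roughly in line with published cached-read pricing):

```python
# A 4,000-token shared system prompt plus 500 fresh tokens per request;
# $3.00/Mtok fresh input vs. $0.30/Mtok cached input (placeholder rates).
SHARED, FRESH = 4_000, 500
FRESH_PRICE, CACHED_PRICE = 3.00 / 1e6, 0.30 / 1e6

miss_cost = (SHARED + FRESH) * FRESH_PRICE              # whole prompt billed fresh
hit_cost = SHARED * CACHED_PRICE + FRESH * FRESH_PRICE  # shared part billed cached

hit_rate = 0.5  # half of requests reuse the cached prompt
blended = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
saving = 1 - blended / miss_cost  # → 0.4, i.e. a 40% cut to the input bill
```

With these placeholder numbers, even a 50% hit rate lands at the bottom of the 40–50% range quoted above; higher hit rates push the saving further.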
Use this tool before you commit to an architecture. Run the break-even tab before you order hardware. The numbers will change your decision more often than you expect.
For informational purposes only. Not financial, medical, or legal advice. You are solely responsible for how you use these tools.