Last updated: June 26, 2026
“LLM costs are killing my side project.” If you’ve ever opened a billing dashboard and felt that, you’re in good company — it’s one of the most common things developers say once they ship anything on top of GPT, Claude, or Gemini. The token meter runs quietly in the background until the invoice stops matching the traffic, and a lot of people describe the same moment: it burned through the budget in a few hours and nobody warned them.
Here’s the part that doesn’t get said enough: most API bills are bigger than they need to be by half or more, and almost none of the fixes mean rewriting your stack. Below are nine ways to spend less per token, ordered roughly from “do it this afternoon” to “worth a sprint.”
1. Stop sending everything to the biggest model
The single most expensive habit is calling a frontier model for work a small one would nail. The gap is enormous. As of mid-2026, lightweight tiers — Google’s Gemini Flash-Lite, DeepSeek’s Flash tier, OpenAI’s nano models — run on the order of $0.10–$0.30 per million input tokens, while the flagship models cost roughly $2.50–$15. That’s often 10–25× cheaper for the same call. Prices move constantly, so check each provider’s current pricing page, but the pattern holds: classify your tasks, and let the cheap model handle the routine ones. Summaries, classification, extraction, and formatting almost never need the expensive model.
2. Route by difficulty instead of picking one model
A step past “use a cheaper model” is to use both, and decide per request. This is model routing (or a model cascade): a quick classifier sends easy prompts to a cheap model and reserves the expensive one for the genuinely hard tasks. Since most requests in a typical app are routine, the weighted-average cost drops a long way — teams that add routing commonly report 40–70% savings without users noticing a quality change.
3. Turn on prompt caching — the fastest way to save on tokens
If you send the same system prompt, instructions, or document on every call, you’re paying to re-read it every time. Prompt caching fixes that, and as of 2026 OpenAI, Anthropic, and Google all support it. Cached input can cost as little as a tenth of the normal rate — up to ~90% off the repeated portion. If your system prompt is more than a thousand tokens and you make thousands of calls a day, you’ll see the savings the same day you deploy it. This is usually the highest-return change on the list relative to effort.
4. Batch the work that isn’t time-sensitive
Not every request needs an answer in two seconds. Both OpenAI and Anthropic offer a Batch API that processes requests asynchronously — results come back within a day — at roughly a 50% discount. Nightly report generation, bulk tagging, embeddings backfills, evals: anything that can wait is trivially eligible. For a lot of teams that’s 20–40% of total spend moved to half price for basically no engineering.
5. Trim the prompt and the context
You pay for every token you send, including the whitespace. Tighten bloated system prompts, drop the few-shot examples the model no longer needs, and stop replaying the entire conversation history on every turn — summarize older messages instead. Compacting context routinely cuts input tokens 50–70% on long-running chats and agents.
6. Cap and shape the output
Output tokens usually cost several times more than input tokens, so a chatty model is an expensive one. Set max_tokens to something sane, ask explicitly for concise answers, and request structured output (JSON or a short schema) when you’ll parse it anyway. “Answer in one sentence” is a real cost control, not just a style note.
7. Cache answers, not just prompts
If users ask the same things — and they do — don’t pay twice. An exact-match cache catches repeated queries; a semantic cache catches questions that are worded differently but mean the same thing, returning the stored answer instead of a fresh generation. For support bots and FAQ-style traffic this can quietly remove a big chunk of calls.
8. Use retrieval (RAG) instead of stuffing the context window
Pasting a whole manual, codebase, or knowledge base into every prompt is the most common way to accidentally 10× your token count. Retrieval-augmented generation flips it: index the documents once, then fetch only the few relevant chunks per question. You send a fraction of the tokens and usually get a more focused answer too. If you’re wiring tools and context into an agent, our guide to the Model Context Protocol is a good next read.
9. Go open-source, or switch to a cheaper provider
For steady, high-volume workloads, hosted frontier APIs aren’t your only option. Open-weight models — Llama, Qwen, DeepSeek, Mistral — can be self-hosted, and fast inference providers like Groq, Together, and Fireworks serve open models at low per-token rates. For development and testing, lean on free tiers (Google AI Studio and Groq both have generous ones) so you’re not paying real money to debug. One caution: stay on the right side of each provider’s terms. Sharing or pooling API keys to dodge rate limits, or using “free token” schemes that abuse a service, gets accounts banned — it doesn’t save money. The legitimate levers above are where the real savings live.
Putting it together
These stack. Caching knocks down repeated input, batching halves the async work, routing keeps most traffic on cheap models, and trimming shrinks everything that’s left. Teams that layer all of them typically land at 70–85% lower spend — turning a painful bill into a rounding error — without shipping a worse product.
Frequently asked questions
What’s the cheapest LLM API right now?
As of mid-2026 the budget tiers — Gemini Flash-Lite, DeepSeek’s Flash tier, and OpenAI’s nano models — are the cheapest production options, around $0.10–$0.30 per million input tokens. The exact leader changes month to month, so compare current pricing pages before you commit.
Does prompt caching really cut costs that much?
For the cached portion, yes — cached input commonly costs about a tenth of normal input on Anthropic and OpenAI. The savings only apply to the repeated part of your prompt (your system instructions or a fixed document), so it helps most when that part is large and reused often.
Is it cheaper to self-host an open-source model?
It can be at high, steady volume, where you’re paying for GPUs you keep busy. At low or spiky volume, a hosted API — especially a cheap provider serving open models — is usually cheaper and far less hassle than running your own infrastructure.
How do I test without burning my budget?
Develop against free tiers (Google AI Studio, Groq), point your test suite at a cheap model, and only use the expensive model for final checks. Set a hard monthly spend limit in your provider dashboard so a runaway loop can’t surprise you.
Where this matters most
If you’re building AI agents, all of this compounds — agentic workflows fire many calls per task, so a 60% per-call saving is a 60% saving on the whole system. The same goes for coding assistants; see our roundup of the best AI coding tools in 2026 and our Claude vs ChatGPT vs Gemini comparison if you’re still choosing where to spend those tokens in the first place.
Written by the GeekSourceCodes team. Prices and model names change fast — verify against official pricing pages before you budget.