Insights · Cost & Token Optimization

AI Token Optimization Strategies: a realistic approach to reducing the cost of tokens used

Most teams hunting for inference savings reach for the wrong lever first. Here's what actually moves the bill, in the order to do it — based on what we see when we open up token traces on real engagements.

Inference cost Caching Routing 17 June 2026 · 12 min read

The short version

You don't need ten techniques. You need four, in order: instrument first, then prompt caching, then structured outputs + batch API, then model routing with judge sampling. Everything else — semantic caching, prompt compression, fine-tuning, local-first dev — is a surgical add-on once the foundations are running. Stop chasing the shiny piece. Start with what you can measure.

Token spend has a strange property: it grows quietly. A workload that cost $4,000 a month last spring is $28,000 a month this spring, the product surface looks the same, and no one on the team can quite explain where the extra calls came from. By the time it shows up on a board slide, the instinct is to chase whatever optimization technique trended on Hacker News that week — semantic caching, fine-tuning, routing to cheaper models. Most of those are real. Most of them are also the wrong place to start.

The honest answer, from running this audit on enterprise stacks burning $50K to $500K a month on inference, is that the levers stack in a specific order. The first three give you 50–80% of the savings most teams can realistically capture. The rest are surgical instruments — powerful when you need them, dangerous when you reach for them too early. This piece walks the stack, with the numbers and the failure modes that don't make it into the marketing pages.

00Before anything else: instrument

You cannot optimize what you cannot measure. Sounds obvious. Yet half the engagements we walk into don't know their cache-hit rate per route, the output-length distribution, or the per-feature unit cost. They know the total monthly bill and that's it.

The minimum kit before any other lever:

A tracing layer on every LLM call — Helicone, Langfuse, or Phoenix. Token counts in and out, model used, cache hit/miss, latency, error type.
Cost broken down per route (which user feature, which agent step), not per day. Cost-per-feature is the only number that lets you decide what to optimize.
An offline eval set on which you re-run prompts after every change. If you don't have one, build one before touching anything else.

Skipping this step is the single most expensive mistake teams make. You'll lower one number, raise another silently, and ship a regression you find out about from a customer.

01Prompt caching — the highest-ROI lever, period

Anthropic, OpenAI, and Google all now offer prompt caching at deeply discounted rates. The mechanic is the same: when you reuse a stable prefix — system prompt, tool schemas, few-shot examples, retrieved documents — across many calls, the provider tokenizes it once and reads from cache on subsequent calls at a fraction of the input price.

The 2026 economics are striking:

−90%

Anthropic cache read vs base input price

−50–90%

OpenAI prefix cache (model-dependent)

−75%

Gemini context cache read discount

For Anthropic Claude Sonnet 4.6, cache reads bill at $0.30 / million tokens versus $3.00 base — a 10× cut on the cached portion of every call. The trade-off is the write cost: 1.25× base price on a 5-minute TTL, or 2× on the 1-hour TTL. So caching pays off only when the same prefix gets used multiple times within that window.

The one thing that matters: prompt order

The mistake we see constantly is teams enabling caching, watching the bill barely move, and concluding it "doesn't work." It works. It's that their prompt is structured wrong. Caches match on prefix. A single dynamic value (a timestamp, a user ID, a randomized greeting) at the top of the prompt invalidates the entire cache for that call.

The rule: static content first, dynamic content last. System instructions, tool schemas, persona, formatting rules, few-shot examples — all of these go in the stable prefix block. Per-user data, retrieved documents specific to this query, the user's actual message — these go in the variable tail.

One team we worked with moved their cache-hit rate from 7% to 74% with a single prompt restructuring exercise — no architectural change, no model change, just reordering. Their monthly bill dropped 59% the following month. Aim for ≥70% cache-hit rate on stable workloads. Below 60%, your prefix is in the wrong place.

Where it fails: per-user dynamic context placed before the static portion; workloads where calls are spaced further apart than the cache TTL (you keep paying the write premium); choosing the 1-hour TTL when your real traffic refreshes within 5 minutes — you pay the 2× write multiplier for no benefit.

02Structured outputs & the death of retry loops

OpenAI Structured Outputs (GA since Aug 2024) and Anthropic constrained decoding (Nov 2025) let you bind the model's output to a JSON schema. The model can't ramble, can't drift, can't return a half-broken JSON your parser then has to retry on. It emits exactly the schema or nothing.

There's no headline % savings here because the benefit is silent: every retry loop you used to run — "parse failed, ask again, this time really return valid JSON" — disappears. On output-heavy workloads, that's typically 15–40% of total token spend you didn't realize was retries. Latency drops too, since modern engines (vLLM, SGLang, TensorRT-LLM) ship XGrammar-based constrained decoding with sub-40 microsecond per-token overhead.

If your application has anything resembling "extract these fields from a document," "classify this into one of these categories," or "decide which tool to call next," and you're not using structured outputs, that's the second cheapest fix on the list.

Where it fails: over-nested schemas the model can't satisfy and gives up on; tool-use loops where the model genuinely needs latitude in choosing the next call; reasoning tasks where rigid output structure constrains the chain-of-thought.

03Batch APIs — the 50% discount almost nobody uses

OpenAI, Anthropic, Together, Fireworks, and Groq all offer batch endpoints. You submit a set of independent requests; results come back within 24 hours; the bill is 50% of the synchronous price on both input and output. Anthropic Message Batches handles up to 100,000 requests per submission.

For any workload that doesn't need a sub-minute response, batch is free money. The workloads that qualify are bigger than people realize:

Offline eval runs (re-scoring your golden dataset)
Document enrichment pipelines (tagging, summarizing, embedding generation)
Daily classification jobs (incoming tickets, leads, content moderation queues)
RAG index regeneration
Catalog-wide content generation
Post-hoc summarization (call transcripts, meeting notes processed overnight)

The pattern: if it would be acceptable for a result to arrive while the user is asleep, it belongs on batch. Most teams have at least one workload like this running on synchronous endpoints out of habit. Moving it is a one-day project for 50% off.

04Model routing — the second biggest lever, with a real failure mode

Not every query needs a frontier model. Triage classifications, intent detection, simple lookups, formatting passes — all of these run perfectly well on a Haiku, a Gemini Flash, a Llama 3.x, or a Mistral. The job of a router is to classify the incoming query's difficulty and dispatch it to the cheapest model that can handle it, escalating to the expensive tier only when the cheap one flunks a confidence check.

The headline number from RouteLLM (UC Berkeley, Anyscale, Canva, ICLR 2025): 85% cost reduction on MT Bench at 95% of GPT-4 quality, calling the strong model on only 14% of queries. Verify before you celebrate, though — the same paper shows the savings drop to 45% on MMLU and 35% on GSM8K. Routing wins depend heavily on your query distribution. A conservative real-world expectation is 30–60% cost reduction, not 85%.

Other documented production patterns: a 70/30 Haiku/Opus split cuts input cost by roughly 67%. An 80/20 split toward cheaper tiers approaches 79%. As of 2026, around 37% of enterprises now run five or more models in production behind a router.

And then there's DeepSeek V3 / V4 at $0.14 per million input tokens — 18–36× cheaper than frontier. For any non-frontier-reasoning workload, this is the single biggest unit-economics shift of the past year. Worth evaluating before assuming Claude or GPT is the right floor.

Where it fails — and this one will quietly break your product: the router sends a query to the cheap model that cannot actually handle it, the cheap model returns a plausible-looking answer, and no one notices for weeks. We've seen accuracy regressions of 8–12 percentage points slip through this way. The non-negotiable: log every routed decision, sample 1–5% for offline judge eval, alert on quality drift. Without that, routing is a quality bomb on a slow fuse.

05The local-first development loop

Here's a pattern we're seeing converge in 2026, particularly at teams with strong engineering culture: local models for the inner development loop, frontier APIs only for the outer loop.

The argument is simple. A developer iterating on a prompt or an agent during a working day will fire maybe 200–1,000 model calls — most of which are throwaway. Most of those calls are trying to answer "does my code path work," not "is the model output high-quality." Running every one of those against a paid API is paying a frontier-grade price for what's effectively a unit test. A shared on-prem GPU server — a single H100, or a pair of RTX 5090s, or an M4 Ultra Mac — serving a team via Ollama, vLLM, or SGLang can absorb most of that traffic at zero marginal cost.

The hardware is finally ready. A dual RTX 5090 setup runs DeepSeek-R1 70B at quantized weights and roughly 27 tokens/second, at 35–45% of an H100's cost. Qwen3-Coder-Next (Feb 2026) reaches 58.7% on SWE-bench Verified, runs on a single 24GB GPU, and delivers sub-150ms feedback at over 1,100 tokens/second on a 5090. Epoch AI's data shows frontier capability from 6–12 months prior is now runnable on a single consumer GPU.

The right framing, though, is narrower than "local for dev." We've watched teams over-commit to this and burn weeks. The honest scope:

Local works for: the coding loop (Continue.dev + Ollama for autocomplete, refactor suggestions, throwaway agent runs in the IDE); offline iteration on prompt drafts before they reach a real eval; privacy-sensitive prototyping on customer data that can't leave the network; offline judge passes on golden datasets.
Local does NOT work for: final prompt engineering against the production model. A prompt tuned to Qwen will not transfer cleanly to Claude or GPT — different tokenizers, different instruction-following profiles, different tool-use disciplines. Tool-calling, structured-output reliability, and long-context recall still favour frontier models meaningfully.

There's also a worth-knowing inverse: Arize's published case study goes the other direction — prototype the prompt against Claude Sonnet, then distill / migrate the stabilized prompt to Llama 3.2 3B for production. Their result: per-inference cost dropped from $0.44 to effectively zero, latency improved 16–25%. The key is they ran 84 evals per candidate model on a 14-conversation golden set before switching. The pattern is "prototype big, ship small," not "prototype small, hope big."

The defensible operating model, then: local for the inner loop (developers typing, autocompleting, exploring), frontier for the outer loop (eval, CI regression, production traffic). Once a workload is stable, evaluate whether a smaller distilled or fine-tuned local model can take over that production path. Two distinct optimizations — don't conflate them.

Tooling consensus (mid-2026)

Ollama — individual developer iteration on a laptop or shared workstation. Simple, batteries-included, single-user.
vLLM — shared on-prem server, multiple developers, broad hardware support, continuous batching, paged attention.
SGLang — same niche as vLLM, currently winning on multi-turn conversations and structured output workloads. xAI, NVIDIA and Azure deployments adopting it for production. Worth evaluating alongside vLLM.
LM Studio — non-engineer access; useful when product managers or analysts want to run a local model without command-line.
Continue.dev — IDE integration (VS Code, JetBrains) pointed at any of the above. Closes the loop from local model to developer workflow.

06Semantic caching — useful, but mis-sold

Semantic caching looks like prompt caching's smarter cousin: instead of matching on exact prefix, it embeds the incoming query, looks for previously-cached responses whose embedding is within some cosine similarity threshold, and returns the stored answer if found. Tools: GPTCache, Portkey, Helicone, Redis Vector.

On a stable-query workload — a customer support FAQ bot, structured lookups — published case studies show up to 73% cost reduction. On the wrong workload, it's the most dangerous technique on this list.

Where it fails — really fails: a poorly-tuned threshold produces a confident wrong answer with a 200 OK status. Production data from teams that didn't sample for false positives has documented false-positive rates above 95% — the cache is returning the wrong answer almost every time it hits, and the team doesn't know because no human is checking. Other recurring failure modes: forgetting to scope the cache key by model_name + temperature (a Haiku query gets an Opus response; a fact-lookup at temp=0.1 gets a creative-writing answer at temp=0.8); no cross-encoder failover when uncertainty is high; no false-positive sampling. Start at threshold 0.92, tune from there, and sample at least 1% of hits to a judge.

Use it for narrow, stable, high-volume domains. Skip it for code generation, reasoning, or anything where two semantically-close prompts can have meaningfully different correct outputs.

07Prompt compression & surgical fixes

When context is already saturated and caching can't help (because every call has different content), prompt compression becomes useful. LLMLingua and LLMLingua-2 algorithmically remove low-information tokens; LongLLMLingua tunes for RAG workloads.

Published benchmarks: 2–3× compression yields ~80% cost reduction with under 5% accuracy hit on most NLP tasks; aggressive 5–7× compression hits 85–90% savings at the cost of 5–15% accuracy regression. LongLLMLingua reportedly improved RAG quality 17–21% while cutting tokens 4×.

The two trade-offs to know: code, math, and legal/contractual reasoning are domains where every token carries semantic weight — compression hurts faster than it helps. And compression can fight prompt caching: if the compressor's output isn't deterministic, you destroy the cache prefix on every call. Pick one or the other on the same workload, not both.

08Fine-tuning & distillation — last, not first

Fine-tuning a smaller open-weight model on your specific task, or distilling from a frontier teacher into a smaller student, is the highest-leverage optimization for a known, stable, high-volume workload. Oxide AI reported 95% task accuracy and 37% lower latency from a LoRA-tuned Llama against GPT-4 on their narrow problem. A teacher/student pair (GPT-5 distilled into a Llama or Mistral) can deliver near-teacher quality at roughly 10× cheaper unit cost on narrow tasks.

The reason it's last on this list: it's the most expensive to maintain. You need ≥500 high-quality labeled examples. You need an eval pipeline that catches regressions. If the underlying task spec shifts every quarter, you're retraining constantly. And if frontier-model reasoning is doing the real work, distillation will quietly collapse on edge cases.

Only fine-tune workloads that are stable, high-volume, and narrow. If the workload is two of three, you're better off optimizing prompt + cache + route. If it's only one, leave it alone.

09The honest order to do these in

If you take one thing from this piece, take this sequence. We've watched enough engagements to be confident it's the order that minimizes regret per dollar.

Week 1 — Instrument. Helicone or Langfuse on every call. Cache-hit rate, per-route cost, p50/p95 latency, output length distribution. Build the offline eval set.
Week 2 — Prompt caching. Restructure prompts: static content first, dynamic last. Target ≥70% cache-hit rate. Expect 40–75% cost reduction on stable workloads. This pays for itself before you finish the sprint.
Week 2 — Structured outputs. Schema-enforce every output you can. Kill retry loops.
Week 3 — Batch API. Move every workload that doesn't need sub-minute response onto batch. 50% off, immediate.
Weeks 4–6 — Model routing. Start conservatively. Haiku / Flash for classification and triage; frontier for reasoning. Mandatory: judge sampling on routed decisions from day one.
Weeks 4–6, parallel — Local-first dev loop. Stand up a shared Ollama or vLLM endpoint. Continue.dev for IDE integration. Cut 50–80% of dev-time API spend with zero production risk, provided final evals run against the production model.
Month 3 — Semantic caching, narrow scope only. FAQ-style workloads with mandatory false-positive sampling.
Month 3 — Prompt compression, surgical. Where caching can't help and context is bloated.
Month 4+ — Fine-tune / distill the top one to three workloads by token spend. Only after you've identified them through the instrumentation in step 1.

The techniques most teams reach for first — routing, fine-tuning, semantic caching — have the highest looks impressive in a blog post / lowest actual ROI without prior instrumentation ratio. They're real techniques. They're not first.

10The anti-patterns to watch for

A few patterns we see often enough to flag:

Routing without judge sampling → silent quality collapse. The single most expensive mistake in the routing stack.
Aggressive semantic caching without false-positive monitoring → confident wrong answers indistinguishable from fresh inference. Documented 95%+ false-positive rates in the wild.
1-hour cache TTL when traffic refreshes every 5 minutes → you pay the 2× write multiplier for no benefit. Match TTL to actual traffic pattern.
Compressing prompts in ways that break cache prefix → the two optimizations cancel out, often net negative.
Building routing or caching layers on top of an unmeasured baseline → you can't tell if it worked, and you'll undo something good while chasing something neutral.
Optimizing the unit cost of a workload that shouldn't exist. The single largest source of overspend on the $50K+/month accounts we audit isn't unit-cost inefficiency — it's workloads that aren't producing business value. Before you cut the cost of a call, decide if the call belongs in the product at all.

That last one is the unglamorous truth. Token optimization is engineering. The dollars saved by removing a workload your customers didn't notice was there are bigger than the dollars saved by every technique above combined.

Bottom line

Instrument first. Cache prompts. Enforce structured outputs. Push async to batch. Then route, with judge sampling. Then reach for the surgical tools. The teams getting the largest cost cuts in 2026 aren't running the most exotic stack — they're running the boring stack, in order, on top of real measurement.