Cache Pinning, Context Budgets, and the 85% Cost Reduction

Three Stores, Triple the Price

My grandmother had three grandchildren and a long shopping list. When she needed to get groceries done quickly, she would sometimes send all three of us to different stores at the same time — thinking that splitting the work would save time.

The problem was, each store thought it was seeing us for the first time. No shopkeeper knew what the others had already sold us. So we came home with three jars of the same chutney, no rice, and a bill three times what it needed to be.

This is exactly what was happening in the system. AI requests were being sent to multiple providers in rotation, spreading the work across all of them evenly. Each provider stores a warm memory of recent requests — things it has seen before, questions it can answer quickly without starting from scratch. But because requests kept rotating, no single provider ever built up that warm memory. Every call arrived like a stranger at the door. Every call cost as much as the very first one.

The cache hit rate was 22%. Which means for every five requests, only one was benefiting from any memory at all.

Always Send the Same Grandchild

The fix was not complicated. Stop rotating. When a thread of work began with one provider, keep all the follow-up requests with that same provider. Let the shopkeeper learn your order.

We called it provider pinning. When a conversation or task started, the system noted which provider handled the first request. Every subsequent request in that same thread went to the same provider. The shopkeeper who already knew the order — who had the context warm and ready — handled the next question too.

The cache hit rate went from 22% to 94% in days. Not because the providers got smarter. Because the system stopped confusing them.

The cost dropped immediately. The same volume of work, the same week, cost a fraction of what it had been costing. And the responses were faster too — a warm cache means less computation, which means less waiting.

Packing the Right Suitcase

My grandmother packed differently for different trips. For a weekend visit nearby, a small bag — just what was needed for two days. For a three-week pilgrimage, a proper case with everything she might need. She never took the big suitcase for a short trip. She said it was exhausting to manage what you didn't need.

Different tasks need different amounts of context. A quick question needs a small window — just enough history to understand what is being asked. A complex multi-day task might need the full record of everything that has happened. The mistake is treating every task as if it needs the big suitcase.

The system now works with context budgets — hard limits on how much history each request gets, based on what kind of task it is. Simple, fast tasks get a small window. Good for most everyday work, a medium one with room for recent conversations and key memories. Only the hardest problems get the full case, with everything available.

The key insight: You don't need all context — you need the right context. When a task would overflow its budget, the system doesn't fail. It compresses — ranking memories by how useful they are right now, summarizing old exchanges, letting go of what doesn't matter for this particular task.

Grandmother's Recipe Book

My grandmother didn't write down every conversation she ever had about cooking. She wrote recipes. The one technique that made the dish work. A short note about who taught her and when. Sometimes a small warning — "don't add the salt before the end, I made this mistake once and ruined the whole pot."

Everything else she let go of.

Long conversations accumulate the same way that kitchens accumulate clutter. A thread running for days might contain fifty exchanges — most of them noise, repeated questions, half-formed thoughts, things that felt important in the moment and aren't relevant anymore. Sending all of that to an AI provider is like reading out the entire conversation history before asking a simple question.

A background process now runs after every few exchanges. It reads the recent history and compresses it — keeping the original problem statement, the decision that was made, the reasoning behind it, and any updates that changed the direction. Everything else is released. When the system needs context for the next request, it retrieves the compressed version first. The full history is still there if truly needed, but it is no longer the default.

That compression reduced the average token count per request by 78%. Not by losing important information. By letting go of the unimportant.

What Changed

Three fixes working together — pinning, budgets, compression. Here is what the numbers looked like before any of them, and after all three were in place.

Metric	Before	After
Cache Hit Rate	22%	94%
Avg Context Tokens / Request	42,000	9,200
Cost Per Request	$0.24	$0.036
Monthly Cost (10K requests)	$2,400	$360
Latency (p95)	3.2s	1.1s
Task Success Rate	84%	91%

Each fix was valuable on its own. Pinning alone recovered cache efficiency. Budgets alone would have reduced token spend. Compression alone would have made context denser. Together, they were transformative — each one amplifying the others.

Doing More With Less

My grandmother's lesson about the three stores was not really about groceries. It was about the cost of not paying attention to how work gets distributed. Efficiency is not the same as speed. Speed sent three grandchildren to three stores. Efficiency sends the right grandchild to the right store, with a clear list, and remembers what was bought last week.

At small scale, none of this matters. Ten requests a day — who cares if the cache is warm or cold. At 10,000 requests a day, that 22% cache hit rate is $2,400 a month in unnecessary spending. At 100,000 requests, it is $24,000. The inefficiency was always there. It just did not have enough volume to be visible.

Thiruvalluvar wrote about wise people who understand the limits of their resources and work skillfully within them. That is not a pessimistic instruction. It is a practical one. Working within limits is what makes scale possible. A system that burns without accounting will always hit a wall. A system that has learned its own constraints can grow as far as the work takes it.

We are building toward that kind of discipline — the kind my grandmother's recipe book embodied. Write down what matters. Let go of what doesn't. Send the same grandchild to the same store. Do exactly what is needed. And stop.