The Cache Is the Conversation

I was prototyping a Q&A feature that searches across a content library. The first version was a straight pipeline: every question fired a routing call to figure out which sources were relevant, fetched their contents, then asked the model to answer. Every question. Even the follow-ups.

The fix isn’t more code. It’s less code with the model making more decisions. And once you see why, the word “agent” stops sounding like marketing.

What the prototype did

The classic pattern: hardcode the steps, run them all, every time.

question → route → load sources → answer

This is fine when each step is cheap. It becomes wasteful when one of the steps is “ask a language model to make a decision,” and the decision rarely changes within a conversation.

Think about asking a friend who’s just been reading a book: “What does chapter three say about that topic?” Then a moment later: “And what’s the second example they used?” Your friend doesn’t reread chapter three to answer the second question.

The hardcoded pipeline treats the second question identically to the first. It re-runs the routing call. It re-fetches the chapters. It pays for an extra LLM call before it even starts answering, just to derive routing information it already had a minute ago.
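
To make the cost concrete, here is a minimal sketch of that hardcoded pipeline. The helpers routeQuestion, loadSources, and answerWithModel are hypothetical stand-ins, stubbed so the snippet runs on its own; they are not the prototype's real code. The counter is only there to show that routing fires on every turn, follow-up or not.

```typescript
// Hypothetical stand-ins for the three pipeline stages. In a real system,
// routeQuestion and answerWithModel would each be an LLM call.
let routingCalls = 0;

async function routeQuestion(_question: string): Promise<string[]> {
  routingCalls += 1; // count every routing call so the waste is visible
  return ["book/chapter-3"]; // stubbed: pretend the router always picks this source
}

async function loadSources(ids: string[]): Promise<string[]> {
  return ids.map((id) => `contents of ${id}`); // stubbed fetch
}

async function answerWithModel(question: string, sources: string[]): Promise<string> {
  return `answer to "${question}" using ${sources.length} source(s)`; // stubbed answer
}

// The hardcoded pipeline: every step runs on every question.
async function answerQuestion(question: string): Promise<string> {
  const ids = await routeQuestion(question);
  const sources = await loadSources(ids);
  return answerWithModel(question, sources);
}

// Two turns of one conversation: a fresh question, then a follow-up.
async function demo(): Promise<number> {
  await answerQuestion("What does chapter three say about that topic?");
  await answerQuestion("And what's the second example they used?");
  return routingCalls; // one routing call per question, follow-up included
}
```

Nothing in answerQuestion can tell the two turns apart, which is the whole problem: the pipeline has no place to notice that the sources are already in hand.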

The shift: let the model decide

Instead of always running the routing step, we expose it as a tool the model can choose to invoke. We give the model the user’s question and the conversation history, and we tell it: “Here’s a tool called route_to_sources. Call it when you need to look up new material.”

The model now decides per turn. On a fresh topic, it calls the tool, gets sources back, answers. On a follow-up about content already loaded in the conversation, it skips the tool and answers directly from what it already sees.

That decision is part of the model’s own output for the turn: it either emits a tool call or starts answering. There’s no if statement in our code that decides whether to route. We don’t write a heuristic. The choice is the model’s.
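
A rough sketch of what that loop can look like. Everything here is an assumption made for illustration: the Message and ModelStep shapes, the runTurn helper, and especially toyModel, which is a deterministic stand-in for the LLM so the snippet runs without an API. A real model call would be asynchronous and would judge from the content of the conversation rather than from a fixed rule.

```typescript
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string }
  | { role: "tool"; name: string; content: string };

// What the model returns for one step: a tool request or a final answer.
type ModelStep =
  | { kind: "tool_call"; name: "route_to_sources"; query: string }
  | { kind: "answer"; text: string };

type Model = (history: Message[]) => ModelStep;

let toolCalls = 0;

// Hypothetical tool implementation; a real one would search the library.
function routeToSources(query: string): string {
  toolCalls += 1;
  return `sources for: ${query}`;
}

// Our code only dispatches on what the model chose. It never decides
// whether routing is needed.
function runTurn(model: Model, history: Message[], question: string): string {
  history.push({ role: "user", content: question });
  // Bounded loop so a misbehaving model cannot request tools forever.
  for (let i = 0; i < 3; i++) {
    const step = model(history);
    if (step.kind === "answer") {
      history.push({ role: "assistant", content: step.text });
      return step.text;
    }
    // The model asked for the tool: run it and keep the result in the history.
    history.push({ role: "tool", name: step.name, content: routeToSources(step.query) });
  }
  throw new Error("model did not produce an answer");
}

// Toy stand-in for the LLM: routes only when no sources are in the history yet.
const toyModel: Model = (history) => {
  const hasLoadedSources = history.some((m) => m.role === "tool");
  if (!hasLoadedSources) {
    const last = history[history.length - 1];
    return { kind: "tool_call", name: "route_to_sources", query: last ? last.content : "" };
  }
  return { kind: "answer", text: "answer drawn from sources already in the conversation" };
};

const history: Message[] = [];
runTurn(toyModel, history, "What does chapter three say about that topic?");
runTurn(toyModel, history, "And what's the second example they used?");
// toolCalls is 1 here: the tool ran on the first turn only.
```

The detail that matters is the push of the tool result into the history. If that line were missing, the second turn would have nothing to answer from.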

So this is what caching really means

In normal software, a cache is an explicit thing:

if (cache.has(key)) return cache.get(key);
const value = await expensiveLookup(key);
cache.set(key, value); // store the result so the next lookup is a hit
return value;

You write the if. You pick the key. You decide what counts as a hit. The cache is a data structure you maintain.

In the agent flow, the “cache” looks completely different:

There’s no external store.
There’s no key.
The check isn’t code — it’s reasoning.

The conversation history is the cache. Everything the model loaded in earlier turns is sitting right there in its context window, as long as we keep the tool results in the message history. When a new question comes in, the model looks at the question, looks at the history, and judges: “Do I already have what I need, or should I fetch?” It’s a fuzzy semantic check, not a key-value lookup.

This is why agents feel different from old pipelines. The decision-making moved from our code into the model’s reasoning step.

Why this feels natural — the human analogy

When someone asks you a follow-up in conversation, you don’t think:

Go to the bookshelf.
Pull down the relevant book.
Look up the answer.
Come back and respond.

You think:

Do I already remember enough from a minute ago to answer?
If yes — answer.
If no — then go to the bookshelf.

That first step — the glance at your working memory — is itself a cache check. It’s nearly free. The bookshelf trip is the expensive operation. Humans don’t even realize they’re doing the check, because it’s instantaneous and uses the same machinery as the answering itself.

The LLM’s context window is exactly that working memory. The agent pattern lets the model do that glance step before deciding to do the expensive operation. The old pipeline skipped the glance entirely — it always went to the bookshelf.

That’s what made the penny drop for me. It had always seemed obvious to me that the agent should be the one deciding when to fetch, not the code around it, and this is why: it’s how humans think.

The agent is the software. That’s the move. We stop writing software that uses an LLM as a step inside our control flow, and start writing software where the LLM is the control flow. Our job shifts from “decide when to call the model” to “describe what the model can do.”

And once that lands, the answer to “why isn’t it built this way from the start” is: because we were thinking of LLMs as expensive functions, not as judgment. The pipeline pattern is what you write when you don’t trust the model to make decisions. The agent pattern is what you write when you do.

Why a hand-coded heuristic isn’t good enough

You might think: “Can’t we just check for follow-up signal words like ‘previous’ or ‘again’ and skip the routing then?” You could. It would catch the obvious cases.

But real follow-ups don’t always announce themselves:

“What do they mean by that?” — no signal word, but clearly a follow-up.
“And the second chapter?” — only signaled by ellipsis.
“Why?” — one word, fully context-dependent.

A human reading those instantly knows whether they have enough context. So does the model — if you give it the choice to make. A regex falls over on every one of them.
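
Here is the signal-word heuristic written out, as a small sketch. The function name looksLikeFollowUp is made up for this example, and the word list is just the two words mentioned above.

```typescript
// A naive follow-up detector built from the two signal words mentioned above.
const FOLLOW_UP_SIGNALS = /\b(previous|again)\b/i;

function looksLikeFollowUp(question: string): boolean {
  return FOLLOW_UP_SIGNALS.test(question);
}

// The three follow-ups from the list above.
const followUps = ["What do they mean by that?", "And the second chapter?", "Why?"];

// Every follow-up the heuristic fails to recognise.
const missed = followUps.filter((q) => !looksLikeFollowUp(q));
// missed contains all three questions, so the pipeline would re-route on each.
```

You could keep growing the word list, but each addition trades one miss for a new false positive somewhere else. "Why?" has no word you could safely key on at all.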

Tool descriptions are program logic, not documentation

This is where the second piece comes in. The description we write for a tool — the natural-language sentence that tells the model what the tool does and when to use it — is not documentation. It’s the cache policy.

When we write something like:

route_to_sources — Searches the content library for relevant material. Call this when the user asks about a topic not already covered by earlier turns. Skip it on clarifying or follow-up questions about content already in the conversation.

…we are literally writing the if-statement of the cache. In natural language. Evaluated by the model.
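
As a sketch, this is roughly where that sentence ends up in code. The ToolDefinition shape (a name, a description, and a JSON-Schema-style input) is an assumption modelled on how tool-calling APIs are commonly laid out, not any specific vendor's schema, and the query parameter is hypothetical.

```typescript
// Assumed, vendor-neutral shape for a tool definition.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const routeToSourcesTool: ToolDefinition = {
  name: "route_to_sources",
  // This string is the cache policy. The model reads it on every turn.
  description: [
    "Searches the content library for relevant material.",
    "Call this when the user asks about a topic not already covered by earlier turns.",
    "Skip it on clarifying or follow-up questions about content already in the conversation.",
  ].join(" "),
  inputSchema: {
    type: "object",
    properties: {
      // Hypothetical parameter: what the model wants to look up.
      query: { type: "string", description: "The topic to search the library for." },
    },
    required: ["query"],
  },
};
```

Notice how little of this object is schema and how much is prose. The types say what the model can pass; the description says when it should bother.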

Small wording changes can shift call rates measurably. Changing “Searches” to “Looks up” can change behavior. Drop “not already covered by earlier turns” and the model tends to call the tool more often, which means more routing calls and fetches to pay for. Add “Prefer answering directly when possible” and the model can get stingy, skipping routing on questions where it should have routed.

Because of this, you don’t want tool descriptions sprinkled across route handlers as inline strings, the way you might write log messages. You want them somewhere structured. That’s what “registry-managed” means.

What “registry-managed prompts” looks like in practice

A prompt registry is just a small module — call it lib/prompts/registry.ts — that owns every prompt the system uses. Each prompt has:

A name (a stable identifier, like libraryRouterDescription).
One or more variants (so you can A/B test wordings without forking the whole route).
A version trail in git history.
Eventually, evaluation results: when you change a description, you can rerun a golden set of test conversations and see how the call rate shifted.

Call sites import by name: resolvePromptVariant("libraryRouterDescription"). The route doesn’t know what the current text says. It just asks for the policy.
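
A minimal sketch of what such a module might contain. Only the file name, the prompt name, and the resolvePromptVariant call come from the description above; the PromptEntry shape, the variant names, the optional second argument, and the error handling are all assumptions.

```typescript
// Hypothetical sketch of lib/prompts/registry.ts.
// In a real module, resolvePromptVariant would be exported.
interface PromptEntry {
  defaultVariant: string;
  variants: Record<string, string>;
}

const BASE_ROUTER_DESCRIPTION =
  "Searches the content library for relevant material. " +
  "Call this when the user asks about a topic not already covered by earlier turns. " +
  "Skip it on clarifying or follow-up questions about content already in the conversation.";

const registry: Record<string, PromptEntry> = {
  libraryRouterDescription: {
    defaultVariant: "v1",
    variants: {
      v1: BASE_ROUTER_DESCRIPTION,
      // Illustrative second wording for an A/B test.
      v2_prefer_direct: BASE_ROUTER_DESCRIPTION + " Prefer answering directly when possible.",
    },
  },
};

// Look up a prompt by its stable name; fall back to the default variant.
function resolvePromptVariant(name: string, variant?: string): string {
  const entry = registry[name];
  if (!entry) throw new Error(`Unknown prompt: ${name}`);
  const key = variant ?? entry.defaultVariant;
  const text = entry.variants[key];
  if (text === undefined) {
    throw new Error(`Unknown variant "${key}" for prompt "${name}"`);
  }
  return text;
}

// A call site asks for the policy by name and never sees the wording.
const description = resolvePromptVariant("libraryRouterDescription");
```

Throwing on an unknown name is a deliberate choice in this sketch: a typo in a prompt name should fail loudly at startup rather than silently ship an empty description.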

The benefits compound:

One source of truth. The same tool used by two different routes gets the same description. No silent drift.
Reviewable. PR diffs that change a tool description are now obvious — they live in a known file. Inline strings hide in handler bodies and slip past review.
Testable. You can evaluate a description change against a corpus of historical questions, measure how routing decisions shifted, and decide if the change was good before shipping.
Versionable. When call rates regress in prod, you can git blame the description to find the change.
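
The "testable" part can start very small. Below is a sketch of a golden-set check under some assumptions: the GoldenCase shape, the evaluate function, and the two stub deciders are invented for illustration, and the four cases are toy data rather than real results. In a real run, the decide function would call the model with a given description variant and report whether it invoked the tool.

```typescript
interface GoldenCase {
  priorTurns: string[]; // what was already said in the conversation
  question: string;
  shouldRoute: boolean; // human-labelled expectation
}

// Reports whether the tool was called for a case. Stubbed below.
type Decide = (c: GoldenCase) => boolean;

interface EvalResult {
  callRate: number; // fraction of cases where the tool was called
  missedRoutes: number; // should have routed, did not
  unnecessaryRoutes: number; // routed when the history already had the answer
}

function evaluate(cases: GoldenCase[], decide: Decide): EvalResult {
  let calls = 0;
  let missedRoutes = 0;
  let unnecessaryRoutes = 0;
  for (const c of cases) {
    const called = decide(c);
    if (called) calls += 1;
    if (c.shouldRoute && !called) missedRoutes += 1;
    if (!c.shouldRoute && called) unnecessaryRoutes += 1;
  }
  const callRate = cases.length === 0 ? 0 : calls / cases.length;
  return { callRate, missedRoutes, unnecessaryRoutes };
}

// Toy golden set: two fresh topics, two follow-ups.
const goldenSet: GoldenCase[] = [
  { priorTurns: [], question: "What does chapter three say about that topic?", shouldRoute: true },
  { priorTurns: ["(chapter three loaded)"], question: "What do they mean by that?", shouldRoute: false },
  { priorTurns: ["(chapter three loaded)"], question: "Why?", shouldRoute: false },
  { priorTurns: ["(chapter three loaded)"], question: "What about pricing?", shouldRoute: true },
];

// Two extreme stand-ins for a model: the old pipeline, and an overly stingy wording.
const alwaysRoute: Decide = () => true;
const neverRoute: Decide = () => false;

const pipelineLike = evaluate(goldenSet, alwaysRoute); // callRate 1, unnecessaryRoutes 2
const tooStingy = evaluate(goldenSet, neverRoute); // callRate 0, missedRoutes 2
```

Tracking the two error counts separately matters more than the call rate alone. A wording change that lowers the call rate looks like a saving until you see that the drop came from missed routes.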

It’s the same instinct as not scattering SQL queries inline through application code. SQL describes a contract with the database; you want it in one place where it can be reviewed, optimized, and audited. Tool descriptions describe a contract with the model. Same logic.

The deeper shift

In non-agent code, we (the engineers) decide when expensive operations happen and code that decision as control flow. In agent code, we describe what operations are available and let the model decide when to invoke them based on the situation.

We give up imperative control in exchange for context-sensitive judgment. That’s why tool descriptions are load-bearing: they are not documentation, they are program logic. Small wording changes shift behavior measurably because you are literally editing the decision policy.

Putting it together

Two ideas to take away:

The “cache” in an agent is the conversation history itself, and the cache check is the model’s reasoning. The check runs inside the model’s reasoning step rather than as an if statement in our code.
The tool description is the program. Treat it the way you treat SQL: name it, version it, review it carefully, evaluate it before deploying. Don’t hide it as an inline string.

And the instinct that says “wait, shouldn’t the software just figure this out on its own?” — that instinct is exactly right. The agent is the software. We were just used to writing it the other way.