An LLM-Wiki for a 640-Page Book

I used Karpathy's LLM-wiki pattern on my personal notes, then adapted it for Q&A over a 640-page book that stays put. Same pattern, different layer count.

ai llm rag-alternative context-engineering pattern gemini

Karpathy’s LLM-wiki pattern ran cleanly on my personal notes. Then I had a different need — Q&A over a 640-page book that stays put — and adapted the same pattern for that shape of source.

The pattern has three layers. Raw sources sit at the bottom — the LLM reads them but never modifies them. On top of that the LLM maintains a wiki of summaries, entity pages, concept pages, and cross-references. And a config document defines how the wiki gets structured and updated. At query time the model reads the wiki, not the sources. Compressed, interlinked, always current. On my notes — a corpus I keep growing — that’s exactly what I want. The LLM’s compiled wiki is cleaner than any chunk of my raw notes, and it stays current as the notes accumulate.

The book I was building Q&A over doesn’t grow: 29 chapters, fixed at publication. And users asking about a chapter wanted the author’s actual words with a page citation, not a paraphrase. So the adaptation was two layers instead of three: raw sources and a routing index, with no middle paraphrase layer for this use.

Same pattern, configured to the source.

The routing index

The routing index is a catalog, not a summary. One entry per chapter with its slug, title, page range, the concepts and terms that actually appear in that chapter, and a short “covers” paragraph describing what topics get discussed. No claims made, no conclusions drawn. Routing metadata in the author’s own terminology.
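As a rough illustration of the entry described above, here is one way the catalog could be typed. The field names, the `IndexEntry` type, and the example values are all assumptions made for this sketch; the post doesn't show the actual index format.

```typescript
// Hypothetical shape for one routing-index entry. Field names are assumptions.
interface IndexEntry {
  slug: string; // stable identifier the router returns
  title: string;
  pages: { start: number; end: number }; // inclusive page range
  terms: string[]; // concepts and terms that literally appear in the chapter
  covers: string; // short description of topics discussed, no claims
}

// Placeholder values for illustration only, not taken from the actual book.
const exampleEntry: IndexEntry = {
  slug: "ch-07-example",
  title: "Example Chapter Title",
  pages: { start: 141, end: 162 },
  terms: ["example term", "another term"],
  covers: "Discusses example term and how it relates to another term.",
};

// The set of valid slugs, used later to validate what the router returns.
function knownSlugs(index: IndexEntry[]): Set<string> {
  return new Set(index.map((entry) => entry.slug));
}

// The whole index is serialized into the router prompt.
function serializeIndex(index: IndexEntry[]): string {
  return JSON.stringify(index, null, 2);
}
```

Keeping `terms` as a flat list of literal strings is what makes the later verification pass possible: each one can be checked against the chapter text with a plain substring search.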

Query time becomes two LLM calls instead of one retrieval.

User question
   │
   ▼
┌──────────────────────────┐
│  Router call             │  reads full index (~30KB) + question
│  structured JSON output  │  returns 1–3 chapter slugs
└──────────────────────────┘
   │
   ▼
Slug validation — unknown slugs dropped
   │
   ▼
┌──────────────────────────┐
│  Answer call             │  loads only those chapters' raw markdown
│  streaming SSE           │  streams answer with direct quotes + page citations
└──────────────────────────┘
   │
   ▼
Answer with verbatim quotes from the source

The router call

The router reads the full index plus the user’s question and returns 1–3 chapter slugs. The index is small enough — around 30KB — that the whole thing fits comfortably in context. The router’s job is cheap: narrow 29 chapters to a handful.

Two design choices matter here. First, the router uses a responseSchema (the structured-output mode) to return JSON of shape { slugs: string[] }. That kills the class of bugs where the model responds with prose instead of JSON, or wraps the JSON in a code fence, or decides to be helpful and add a natural-language preamble. The router call either gets valid JSON back or it throws.
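A minimal sketch of that contract, under some assumptions: the config object below uses field names from Gemini's generation config (`responseMimeType`, `responseSchema`, `temperature`), but the post doesn't show which SDK call the project makes, and `parseRouterOutput` is a hypothetical helper name. The parser is a belt-and-braces check on the returned text.

```typescript
// Assumed request settings. The exact SDK call used in the project is not shown.
const routerGenerationConfig = {
  temperature: 0, // routing decision, not a creative one
  responseMimeType: "application/json",
  responseSchema: {
    type: "OBJECT",
    properties: {
      slugs: { type: "ARRAY", items: { type: "STRING" } },
    },
    required: ["slugs"],
  },
};

// Parse the router's raw text output and throw on anything that is not
// an object with a string[] "slugs" field.
function parseRouterOutput(raw: string): string[] {
  // JSON.parse throws a SyntaxError on prose, preambles, or fenced output.
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("router output is not a JSON object");
  }
  const slugs = (parsed as { slugs?: unknown }).slugs;
  if (!Array.isArray(slugs) || !slugs.every((s) => typeof s === "string")) {
    throw new Error("router output is missing a string[] 'slugs' field");
  }
  return slugs as string[];
}
```

With structured output enabled the parser should rarely fire, but it turns any surprise into an exception at the boundary instead of a malformed value further downstream.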

Second, the returned slugs get validated against a known set loaded from the index itself. Invalid slugs — the router hallucinating a chapter that doesn’t exist — get dropped silently, not routed to. If the model returns three slugs and one is garbage, the two real ones get kept. Only if every returned slug is invalid does the router throw. Forgiving on partial failure, strict on total failure.
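The validation rule above fits in a few lines. This is a sketch, not the project's code: `validateSlugs` is a hypothetical name, and the known set is assumed to have been built from the index when it was loaded.

```typescript
// Keep slugs that exist in the index and drop unknown ones silently.
// Throw only when nothing valid remains.
function validateSlugs(returned: string[], known: Set<string>): string[] {
  const valid = returned.filter((slug) => known.has(slug));
  if (valid.length === 0) {
    throw new Error(
      `router returned no known slugs: ${JSON.stringify(returned)}`
    );
  }
  return valid;
}
```

For example, with a known set of `ch-03`, `ch-11`, and `ch-12`, calling `validateSlugs(["ch-03", "ch-99", "ch-11"], known)` returns `["ch-03", "ch-11"]`, while `validateSlugs(["ch-99"], known)` throws.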

Temperature is zero. This is a routing decision, not a creative one.

The answer call

Whatever chapters the router picked get loaded from storage as raw markdown. The system prompt wraps those chapters with instructions to quote directly from the source and add page citations after every quote. The answer streams back as Server-Sent Events — tokens flow to the client as they’re generated, and a final done event carries token usage for both calls.
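To make the streaming shape concrete, here is a sketch of how the frames could be formatted. The wire format (an optional `event:` line, a `data:` line, then a blank line) is standard Server-Sent Events; the payload shapes, the `CallUsage` type, and the function names are assumptions for illustration.

```typescript
// Format one Server-Sent Events frame: optional event name, one data line,
// then a blank line that terminates the frame.
function sseFrame(data: unknown, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  // JSON.stringify escapes newlines, so the payload stays on one data line.
  lines.push(`data: ${JSON.stringify(data)}`);
  return lines.join("\n") + "\n\n";
}

// Hypothetical usage payload for one LLM call.
interface CallUsage {
  inputTokens: number;
  outputTokens: number;
}

// One frame per generated chunk of text.
function tokenFrame(text: string): string {
  return sseFrame({ text });
}

// Final frame: carries token usage for both the router and the answer call.
function doneFrame(router: CallUsage, answer: CallUsage): string {
  return sseFrame({ usage: { router, answer } }, "done");
}
```

Naming the final event lets the client tell "stream finished" apart from "connection dropped", and it is a natural place to attach the usage totals.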

The full book is never in any prompt. 640 pages is too much context even by today’s standards, and the router-then-answer split means the answer call only ever loads 1–3 chapters worth of text. The router sees metadata, the answer call sees the subset of source it needs, neither ever sees the whole book.
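A sketch of how the answer prompt could be assembled from only the routed chapters. The instruction wording, the `Chapter` type, the delimiter format, and the in-memory `Map` standing in for storage are all assumptions; the sketch also assumes the chapter markdown already carries page markers the model can cite.

```typescript
interface Chapter {
  slug: string;
  title: string;
  markdown: string; // raw chapter text, assumed to include page markers
}

// Illustrative wording only. The real system prompt is not shown in the post.
const ANSWER_INSTRUCTIONS = [
  "Answer using only the chapters provided below.",
  "Quote the author directly and add a page citation after every quote.",
].join("\n");

// Load only the routed chapters. Nothing else from the book enters the prompt.
function buildAnswerPrompt(
  slugs: string[],
  store: Map<string, Chapter>
): string {
  const sections = slugs.map((slug) => {
    const chapter = store.get(slug);
    if (chapter === undefined) throw new Error(`chapter not found: ${slug}`);
    return [
      `=== CHAPTER: ${chapter.title} (${chapter.slug}) ===`,
      chapter.markdown,
      `=== END CHAPTER: ${chapter.slug} ===`,
    ].join("\n");
  });
  return `${ANSWER_INSTRUCTIONS}\n\n${sections.join("\n\n")}`;
}
```

Because the function takes validated slugs as input, the prompt size is bounded by the one to three chapters the router picked, however large the book is.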

One operational detail worth naming: the answer route checks req.signal.aborted between tokens and breaks out of the loop cleanly if the client disconnects. The query log gets written in a finally block, so successes, failures, and aborts all hit the same logging path. That log insert is awaited: without the await, the serverless runtime occasionally drops the in-flight DB write when the request closes.
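The abort check and the awaited log write can be sketched like this. `streamAnswer`, `AbortLike`, `QueryStatus`, and the injected `send` and `writeLog` callbacks are hypothetical names for this sketch; in a real route, `req.signal` would be passed as the signal and `writeLog` would perform the DB insert.

```typescript
// Minimal view of an abort signal. req.signal satisfies this interface.
interface AbortLike {
  readonly aborted: boolean;
}

type QueryStatus = "success" | "failure" | "aborted";

async function streamAnswer(
  tokens: AsyncIterable<string>,
  signal: AbortLike,
  send: (token: string) => void,
  writeLog: (status: QueryStatus) => Promise<void>
): Promise<QueryStatus> {
  let status: QueryStatus = "failure"; // stays "failure" if the loop throws
  try {
    for await (const token of tokens) {
      // Check between tokens and stop cleanly if the client disconnected.
      if (signal.aborted) {
        status = "aborted";
        break;
      }
      send(token);
    }
    if (status !== "aborted") status = "success";
  } finally {
    // Awaited so the write finishes before the handler returns.
    await writeLog(status);
  }
  return status;
}
```

Defaulting the status to "failure" means an exception from the model stream is logged as a failure without a separate catch block, and the error still propagates to the caller after the log write completes.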

Building the index: thin, rich, verified

First pass at the index was too thin. I had the LLM write one short descriptor per chapter — a title and a sentence or two about the chapter’s theme. The router handled obvious questions but anything non-trivial fell through the cracks. The vocabulary in a real question almost never overlapped with a one-sentence thematic description. Either the wrong chapter got picked or nothing did.

So I had the model re-read each chapter and expand the concepts/terms list — produce the terms that literally appear in the chapter as routing keywords. Routing recall jumped. Non-trivial questions started landing on the right chapter.

A new failure mode showed up underneath. When I spot-checked a few answers that felt slightly off, the router was picking a chapter based on a term that wasn’t actually in that chapter’s markdown — the model had cross-attributed the term from a neighboring chapter it also knew about. Confident routing, wrong target.

The fix was mechanical. A bash loop: for every proposed term, grep the claimed chapter markdown. If the term isn’t literally there, drop it from the index.

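# For each chapter, read the terms its index entry claims and verify that
# each one appears literally in the chapter markdown.
# extract_terms_from_index and mark_for_removal are placeholder helpers;
# extract_terms_from_index is assumed to print one term per line.
for chapter in chapters/*.md; do
  slug=$(basename "$chapter" .md)
  # Read line by line so multi-word terms are not split on whitespace.
  while IFS= read -r term; do
    [ -n "$term" ] || continue
    # -F: fixed string, -q: quiet, --: a term starting with "-" is not an option
    if ! grep -qF -- "$term" "$chapter"; then
      echo "DROP: '$term' from $slug (not in source)"
      mark_for_removal "$slug" "$term"
    fi
  done < <(extract_terms_from_index "$slug")
done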

Across 29 chapters and 1038 proposed keywords, 791 survived. The other 247 split between two categories — hallucinated paraphrase variants that never appear literally in any chapter, and terms cross-contaminated from chapters the model was keeping mentally adjacent.
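If the index lives in application code rather than on disk, the same check can be run in-process. This is an equivalent sketch, not the script I actually ran: `verifyTerms` and `VerifyResult` are hypothetical names, and the check is a literal, case-sensitive substring test, roughly what `grep -F` applies.

```typescript
interface VerifyResult {
  kept: string[]; // terms found literally in the chapter
  dropped: string[]; // terms the chapter never contains
}

// Split proposed terms by whether they appear verbatim in the chapter text.
function verifyTerms(terms: string[], chapterMarkdown: string): VerifyResult {
  const kept: string[] = [];
  const dropped: string[] = [];
  for (const term of terms) {
    if (chapterMarkdown.indexOf(term) !== -1) {
      kept.push(term);
    } else {
      dropped.push(term);
    }
  }
  return { kept, dropped };
}
```

Returning the dropped terms alongside the kept ones makes it easy to review what was pruned, which is how the two categories of bad keyword became visible.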

Concrete shape of the failure: one chapter had routing keywords added that only existed in two other chapters. If a user asked a question using one of those keywords, the router would pick the wrong chapter, the answer call would load a chapter that couldn’t supply a supporting quote, and the response would be confident prose with nothing to back it. The grep-verify pass eliminated that whole class of failure in one afternoon.

The covers paragraph — the prose description — is harder to verify the same way. A sentence can mention a term that doesn’t appear verbatim but is semantically present. I kept those paragraphs with some residual cross-reference noise; the routing keywords are the load-bearing part of the index anyway.

Two opposing failure modes

The iteration arc is what I’d put on a whiteboard if I had to explain this to a junior engineer. Two opposing failure modes, with verification as the thing that lets you push aggressively on recall without paying the precision tax.

Thin index: recall failure from sparseness

Too little metadata per chapter means the router can’t find a match for specific questions. The question’s vocabulary doesn’t overlap with the index, so the router has nothing to latch onto and recall drops.

Rich index: precision failure from hallucination

Plenty to match on, but some of what’s there isn’t real. Recall is high, but any given match may rest on a term the chapter never contains, in which case the router points to a chapter that can’t support an answer.

Verification is the knob between those two failure modes. With a cheap mechanical check — does this term literally exist in the claimed chapter — you can let the LLM be as aggressive as it wants on the enrichment pass, then let the verifier prune everything that doesn’t hold up. Recall goes up, precision gets preserved by the filter, and the LLM doesn’t need to self-police.

What this configuration trades

A full three-layer wiki gives you things a two-layer one doesn’t. Cross-source synthesis. Spotting contradictions. Concept pages that tie material from different chapters. All useful.

For this use I didn’t need them. Users asking about a specific chapter want the author’s words from that chapter. A direct quote beats a fluent blend that merges material across the book. Fluency without attribution isn’t what they’re paying for.

The two configurations can coexist in the same project. Some source types flow through the full three-layer wiki. Others flow through the routing variant. Same backbone, dispatched on source type.

The broader takeaway

Routing metadata produced by an LLM and consumed by another LLM is a place where cheap mechanical verification earns its keep. The model writes the metadata fluently and plausibly — that’s the whole point, and the whole risk. The downstream model reads the metadata and routes based on it. Errors compound silently because nothing structural tests the metadata against the source.

The same shape shows up anywhere an LLM generates structured data that another LLM then trusts. Function call arguments extracted from user input. Tags assigned to documents. Classifications that drive downstream routing. In each case there’s usually a mechanical check available — a grep, a regex, a schema validation, a “does this referenced thing exist” lookup. The check is fast and boring and it compounds.

Where I ended up

I used the LLM-wiki pattern for my growing notes, then adapted it for a book that doesn’t change. Same concept, two configurations, same engineering mindset underneath.

And whatever the LLM generates for the routing layer, grep it before trusting it. That pass is cheap and it compounds.