Deep modules — and why they matter more in the AI era
A guided tour of module shape — the difference between code that hides complexity and code that just relocates it — built from first principles, then walked through a real refactor.
1 · What is a module?
A module is any chunk of code with a line drawn around it. A file. A folder. A class. A package. A service. A repo. The unit doesn't matter — the line does.
What "line drawn" actually means
Forget the metaphor — think mechanism. When you put functions in a file and pick which ones to export, you've drawn a line. Exported = the outside can see it. Not exported = it stays inside. A folder with an index.ts that re-exports just three things has drawn its line at those three.
Same idea at every scale: a class draws its line with public/private. A REST API draws its line at the URL surface. A team draws its line at "what we own vs what we ask another team for." The literal mechanism changes; the concept is the same — something on this side, something on that side, and a deliberate boundary between them.
Every module has two parts:
- Interface — the surface other code touches. Function signatures, exported types, public methods, REST endpoints. What a caller has to learn.
- Implementation — everything inside that the caller doesn't have to know. Helpers, internal state, private functions, the actual logic.
Who is "the caller"?
The caller is whoever uses your code. If you wrote loadSources(...), the caller is the route file that runs await loadSources(...). Caller = the line of code that says "hey, please run this for me." Anything that imports your function, calls your endpoint, or instantiates your class is a caller.
So the line a module draws is really the border between what the caller cares about and what the module keeps to itself. The caller cares about the interface; the module cares about its insides; the line says "you don't need to look across this."
The whole point of drawing a line is to declare a contract: "outside this line, you only need to know X. Inside, the complexity is my problem."
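To make the line concrete, here is a minimal sketch. The visit counter and every name in it are hypothetical, invented only for illustration. The interface is the two methods a caller has to learn; the map and the path normalization are implementation, and nothing outside the factory can reach them.

```typescript
// Hypothetical example: none of these names come from a real codebase.
// Interface: the two methods a caller has to learn.
interface VisitCounter {
  record(path: string): void;
  top(limit: number): string[];
}

// Implementation: everything inside the factory is invisible to callers.
function createVisitCounter(): VisitCounter {
  const counts = new Map<string, number>(); // internal state

  // Internal helper: callers never learn that paths are normalized.
  function normalize(path: string): string {
    const trimmed = path.trim().toLowerCase().replace(/\/+$/, "");
    return trimmed === "" ? "/" : trimmed;
  }

  return {
    record(path) {
      const key = normalize(path);
      counts.set(key, (counts.get(key) ?? 0) + 1);
    },
    top(limit) {
      return Array.from(counts.entries())
        .sort((a, b) => b[1] - a[1])
        .slice(0, limit)
        .map((entry) => entry[0]);
    },
  };
}

const counter = createVisitCounter();
counter.record("/Docs/");
counter.record("/docs");
counter.record("/pricing");
const mostVisited = counter.top(1); // both "/Docs/" and "/docs" count as one path
```

The caller's contract is "record a path, ask for the top N." Whether the counts live in a map, a database, or a sketch data structure stays the module's problem.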
Vocabulary check
"Module" is deliberately broad here. A single function with parameters has an interface (the signature) and an implementation (the body). A folder with a public index.ts has an interface (its exports) and an implementation (everything else). The shape of the argument is the same at every scale.
2 · Deep vs shallow modules
This is John Ousterhout's framing from A Philosophy of Software Design. Picture a module as a rectangle. The width is the size of its interface — how much a caller has to learn. The depth is how much functionality lives behind it.
- Deep: narrow on top, fat below. One simple call, lots happening inside. Example: Unix read(fd, buf, n). Three arguments. Behind them: filesystem layers, page cache, device drivers, networking. The caller doesn't care.
- Shallow: wide on top, thin below. The interface is almost as complicated as the implementation. Example: an 8-argument wrapper that just forwards to another function. You pay full interface cost for almost no hiding.
The rule: a module earns its keep by hiding more than it exposes. If it doesn't hide much, it's noise — the caller would be better off reading the body inline.
Deep means the implementation rectangle dwarfs the interface rectangle. The hidden complexity earns the abstraction. As you slide the implementation thin, the module collapses into noise.
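A rough illustration of the two shapes, under stated assumptions: the header names in buildHeaders are made up, and parseDuration is a hypothetical helper chosen only because its body is much larger than its signature.

```typescript
// Shallow: five inputs map one-to-one onto five outputs. The signature
// is as much to learn as the body, so nothing is hidden.
// (Header names are hypothetical.)
function buildHeaders(
  userId: string,
  env: string,
  contentType: string,
  accept: string,
  traceId: string,
): Record<string, string> {
  return {
    "X-User-Id": userId,
    "X-Env": env,
    "Content-Type": contentType,
    Accept: accept,
    "X-Trace-Id": traceId,
  };
}

// Deep: one string in, one number out. Tokenizing, unit conversion and
// validation all stay behind the signature.
const UNIT_MS: Record<string, number> = {
  ms: 1,
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

function parseDuration(text: string): number {
  const input = text.trim();
  const token = /(\d+(?:\.\d+)?)(ms|s|m|h|d)/g;
  let total = 0;
  let position = 0;
  let match: RegExpExecArray | null;
  while ((match = token.exec(input)) !== null) {
    // Tokens must be contiguous: reject gaps such as "1h 30m" or "10x5m".
    if (match.index !== position) {
      throw new Error(`Invalid duration: "${text}"`);
    }
    total += Number(match[1]) * UNIT_MS[match[2]];
    position = token.lastIndex;
  }
  // Reject empty input and trailing characters that matched no token.
  if (input.length === 0 || position !== input.length) {
    throw new Error(`Invalid duration: "${text}"`);
  }
  return total;
}

const headers = buildHeaders("u1", "prod", "application/json", "*/*", "t-9");
const ninetyMinutes = parseDuration("1h30m"); // 5400000
```

The contract of parseDuration fits in one sentence ("duration string to milliseconds, throws on bad input"). The contract of buildHeaders is the body, restated.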
3 · The "cover the body" test
A useful diagnostic: cover the implementation with your hand and read only the interface. Can you predict what it does and what it costs? If yes, the module is deep — the interface communicates the contract. If you have to peek inside to know what's happening, the abstraction is leaky and the module is shallow.
"But if the implementation is deep, how can I know from the interface what's inside?"
This is a fair pushback, and it's worth untangling. There are two different things "predict" could mean:
- Predict the implementation — knowing the line-by-line code inside.
- Predict the contract — knowing what the function does and roughly what it costs.
A deep module hides #1 but exposes #2. From read(fd, buf, n) you cannot predict the page cache, the device drivers, the kernel data structures — you shouldn't be able to. But you can predict: "given a file descriptor and a buffer of size n, this fills the buffer with up to n bytes from the file." That's the contract. The signature gave you a short, useful summary that's smaller than the implementation.
A shallow module fails differently. With buildHeaders(userId, env, contentType, accept, traceId) you also can't fully predict the implementation from the signature — but here the contract is barely smaller than the implementation. Five inputs, five outputs, one-to-one. The "summary" is the same size as the body. The abstraction didn't compress anything.
The corrected test: can you state the contract in fewer words than it would take to read the body? If yes, deep — the abstraction is doing compression work. If no, shallow — the wrapper is just renaming the body.
Try it on three real signatures:
async function createPaymentProcess({ sum, payerInfo, description, notifyUrl, ... }): Promise<ProviderResponse>
A payment-provider integration wrapper. Predict: what's hidden inside? Is this deep or shallow?
function buildHeaders(userId, env, contentType, accept, traceId) { ... }
A hypothetical helper. What's hidden? Is the abstraction earning its place?
async function* streamAnswer(sources, question, options): AsyncGenerator<AnswerEvent>
An LLM streaming-answer generator. Predict the contract.
What this exercise reveals
Shallow modules feel productive — you're "decomposing" — but they often just spread complexity across more files instead of removing it. The dependency graph gets wider, the cognitive load goes up, nothing is actually encapsulated. The cover-body test catches this: if you have to peek to predict, the abstraction isn't doing real work.
4 · Why the same advice transfers to AI — for different reasons
Ousterhout was writing for humans reading code. His argument was about working memory and cognitive load: a deep module lets a reader page out the implementation and just reason about the interface. That argument still applies — but the AI era adds a separate, sharper one.
Humans and language models have inverted cognitive constraints:
Humans
- Limited working memory (Miller's 7 ± 2)
- Persistent expertise across days, months, years
- Pattern-recognize from experience
- Get tired; deep modules let you page out
Ousterhout's argument: narrow interface = small surface to hold in mind
Language models
- Very large working memory inside one context window (bounded, but far beyond what a human can hold)
- Zero persistent memory across sessions
- Pattern-recognize from training data + context
- Don't get tired; do generate plausible nonsense at large surface
AI-era argument: narrow interface = small token surface for misuse + bounded blast radius when the agent is wrong inside
Pocock's framing — "go back to old books" — is correct in outcome but a little misleading in cause. The reasons fundamentals transfer to AI aren't nostalgic. They're different physics, same shape:
- Wide interface ⇒ more bytes for the same operation ⇒ token cost
- Wide interface ⇒ more public API surface ⇒ more chances for the agent to misuse it
- Narrow boundary ⇒ cheap to test from outside ⇒ tight feedback loop
- Narrow boundary ⇒ bounded blast radius when the agent is wrong inside
5 · Pocock's claim, made concrete
Pocock says: "AI is really good at creating codebases like this" — the shallow shape. Why?
- Each turn the agent sees only so much; safer to make small isolated functions
- Fewer interface decisions to commit to
- Pattern-matched to "good code = small functions" from training data
- It's the path of least resistance when you don't have a holistic view
The compounding bit is what hurts: AI both produces shallow structure and fails to navigate shallow structure. Once a codebase tilts shallow, every new feature widens the dependency graph by another fan, and the next agent run gets lost in it.
A worked example
Imagine a content platform with several resource types — articles, podcasts, videos, a cross-resource "library" search. Each type exposes a streaming Q&A endpoint so a user can ask questions about that resource. The routes grew separately, each is ~350 lines, and they look like this:
app/api/articles/ask/route.ts ~305 lines
app/api/podcasts/ask/route.ts ~352 lines
app/api/videos/ask/route.ts ~345 lines
app/api/library/ask/route.ts ~368 lines
─────────
~1,370 lines
LOC = lines of code. So "352 LOC" just means a 352-line file.
Two of those (podcasts vs videos) are ~95% identical. Let's prove it.
6 · Side-by-side: two near-identical routes
The diff between the two routes is so narrow it's almost embarrassing: the vast majority is identical, and what differs is a handful of names.
The differences: one queries table (podcastQueries vs videoQueries), one sources function (loadPodcastSources vs loadVideoSources), one prompt variant + one index key, and the strings "podcast" vs "video" in log prefixes and the kind arg to streamAnswer. Everything else (auth, quota, conversation resolve, SSE plumbing, abort handling, persist-before-close, error mapping) is byte-for-byte the same.
The compounding tax
Adding a fifth content type by copy-paste would mean a fifth ~350-line file that differs from the others by ~20 lines. A sixth makes it six. Each new type is +350 lines of accidental duplication and another place to fix the next abort/quota/persistence bug. This is the shallow shape compounding.
7 · The lower layers were already deep
Here's what makes this kind of case interesting: the building blocks for a unified shell often already exist. In our example, the Q&A engine in lib/qa/ was already parameterized by content type — router, history loader, answer streamer, index loader all took an indexKey or a kind or a table as input.
// Already deep — content-type-agnostic, called the same way by every route:
routeQuestion(indexKey, question, { promptName }) // returns slugs
loadRoutingIndex(indexKey) // returns Index
loadThreadHistory(conversationId, limit, table) // returns ThreadHistory
streamAnswer(sources, question, { kind, ... }) // yields events
The deep-module engine was right there. Each lower-layer function had a narrow interface and fat insides. So why was the orchestration on top still copy-pasted?
The diagnosis
The route handler was the leaky orchestration layer above otherwise-clean abstractions. Every route reached into the same routeQuestion, streamAnswer, loadThreadHistory — and repeated the same auth, quota, conversation, SSE, persist scaffolding around them. The scaffolding was the shallow part.
The fix isn't to redesign the lower layers. It's to extract the route shell as another deep module above them — narrow surface (the per-type config), fat insides (the SSE/auth/quota/persist plumbing).
8 · Deep modules at the schema level
The single most important insight in this kind of refactor often isn't in your application code at all — it's in the database. A well-shaped schema can encode the deep-module move at the data layer, and that's what makes the application-level collapse possible:
content_items holds every kind of content: podcasts, videos, articles, interviews, and more. The crucial column:
contentType: text("content_type").notNull()
// values: "podcast" | "video" | "article" | "interview" | ...
Where the type lives. Every row knows which kind of content it is by this column. Anything that needs to scope by type filters here (e.g. the routing index for podcasts is built from WHERE contentType = 'podcast').
content_transcripts holds one row per transcribed content item. Crucially: no contentType column. The medium has already been resolved. By the time something is in this table, it's text.
- content_id → content_items.id
- text: the full transcript
- language: locale tag
- modelVersion, transcribedAt, costUsd, ...
Audio (podcasts), video, live recordings — they all converge on the same row shape because the medium-specific work (speech-to-text for audio, captions for video, etc.) happens upstream in the ingestion pipeline, not in the Q&A loader.
Because content_transcripts is medium-blind, the loader is too. Pre-refactor we had two near-identical loaders (loadPodcastSources and loadVideoSources) — same query, different error string. They collapsed into one because the data shape was the same.
// One function. Used by podcast routes, video routes,
// and every future transcribed-content route for free.
loadTranscriptSources(contentIds, kindLabel) → AnswerSource[]
Future content types that ingest into content_transcripts inherit this loader for free.
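A minimal sketch of that collapse, under loud assumptions: an in-memory array stands in for the content_transcripts table, the row and source shapes are guesses based on the columns listed above, and the function is synchronous for brevity (a real loader would query the database and be awaited).

```typescript
// Hypothetical shapes, modeled on the columns listed above.
interface TranscriptRow {
  contentId: string;
  text: string;
  language: string;
}

interface AnswerSource {
  contentId: string;
  text: string;
  language: string;
}

// Stand-in for the content_transcripts table. Note there is no
// content-type field: the rows are medium-blind.
const contentTranscripts: TranscriptRow[] = [
  { contentId: "pod-1", text: "Welcome to the show.", language: "en" },
  { contentId: "vid-7", text: "In this video we cover setup.", language: "en" },
];

// One loader for every transcribed medium. kindLabel only shapes the
// error message, which was the sole difference between the old loaders.
function loadTranscriptSources(
  contentIds: string[],
  kindLabel: string,
): AnswerSource[] {
  return contentIds.map((contentId) => {
    const row = contentTranscripts.find((r) => r.contentId === contentId);
    if (!row) {
      throw new Error(`${kindLabel} transcript not found: ${contentId}`);
    }
    return { contentId: row.contentId, text: row.text, language: row.language };
  });
}

const podcastSources = loadTranscriptSources(["pod-1"], "Podcast");
const videoSources = loadTranscriptSources(["vid-7"], "Video");
```

The two call sites differ only in the IDs and the label, which is the whole argument for having one function.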
Each route knows its content type because it's at a type-specific URL. /api/podcasts/ask means "user is asking about podcasts." That's the boundary where the type gets pinned.
The route doesn't have to filter by type when loading sources — the IDs it passes to loadTranscriptSources are already scoped to its type by upstream business logic. Type is a routing concern, not a loader concern.
The decision rule the schema encodes
Share when the shape is the same; separate when the shape legitimately differs. Transcripts: same shape across all media → one table, one loader. Query logs (podcast_queries, video_queries, live_queries): different foreign-key targets and per-type analytics needs → separate tables.
Audio vs video is different in real life. Their transcripts are not. The schema gets to make that distinction.
9 · "Should I collapse these two functions?"
Walk through the questions. Each path ends at collapse or keep separate based on real shape, not real-world labels.
- Same shape from the same source? → collapse
- Different storage (e.g. structured documents in object storage with custom markers, vs rows in a relational DB) → keep separate
- Different downstream consumers with different analytics needs (per-type query logs) → keep separate
- "Are they the same kind of thing in the real world?" is the wrong question. Audio vs video is different in real life; their transcripts are not.
The principle
Ousterhout's deep-module test isn't about the metaphysics of the data. It's about the interface to the data. Two functions that read the same shape from the same source should collapse, regardless of what their inputs represent in the world. Two functions that look superficially similar but read different shapes from different sources should stay separate, regardless of whether their inputs are "the same thing."
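The questions above can be sketched as a small function. This is an illustration of the rule, not a tool: the LoaderProfile type and its fields are hypothetical, and in practice the inputs are judgment calls rather than strings.

```typescript
// Hypothetical description of what a function reads and who depends on it.
interface LoaderProfile {
  source: string;      // where it reads from (a table, a bucket, ...)
  rowShape: string[];  // the fields it reads
  consumers: string[]; // downstream needs that constrain it
}

type Verdict = "collapse" | "keep separate";

// Order-insensitive comparison of two string lists.
function sameSet(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  const sortedA = a.slice().sort();
  const sortedB = b.slice().sort();
  return sortedA.every((value, i) => value === sortedB[i]);
}

// Note what is absent: no parameter asks what the data "is" in the world.
function shouldCollapse(a: LoaderProfile, b: LoaderProfile): Verdict {
  if (a.source !== b.source) return "keep separate";
  if (!sameSet(a.rowShape, b.rowShape)) return "keep separate";
  if (!sameSet(a.consumers, b.consumers)) return "keep separate";
  return "collapse";
}

const podcastLoader: LoaderProfile = {
  source: "content_transcripts",
  rowShape: ["content_id", "text", "language"],
  consumers: ["answer streamer"],
};
const videoLoader: LoaderProfile = {
  source: "content_transcripts",
  rowShape: ["language", "text", "content_id"],
  consumers: ["answer streamer"],
};
const bookLoader: LoaderProfile = {
  source: "object storage",
  rowShape: ["markdown", "page markers"],
  consumers: ["answer streamer"],
};

const podcastVsVideo = shouldCollapse(podcastLoader, videoLoader); // "collapse"
const podcastVsBook = shouldCollapse(podcastLoader, bookLoader); // "keep separate"
```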
10 · The redesign
Here's the new shape. The old four-route layout vs the new shell + thin callers.
Before — shallow orchestration
Each new content type adds another ~350 lines of mostly-duplicated scaffolding.
After — deep shell + thin callers
Path A: the new type ships at 80 LOC, the shell is paid once. Old routes migrate later — each migration removes ~280 lines.
What the shell owns
// lib/qa/run-ask-stream.ts (~360 lines)
//
// Owns:
// • Auth (requireAuth)
// • Quota (fetchAgentQuota, incrementAgentUsage)
// • Conversation resolve/create (with ownership check)
// • History load (loadThreadHistory(conversationId, limit, table))
// • SSE stream lifecycle (encoder, controller, closeSafely)
// • Abort handling (req.signal.aborted)
// • Persist-before-close discipline (the row hits the DB
// even when the client disconnects mid-stream)
// • Error mapping
//
// Takes from the caller (the per-type config):
// • table: which queries table to log to
// • plan(question, history) → { sources, selectedContentIds, routerUsage }
// • resolveScopePin(conversationId, contentId): per-type pin rule
// • logPrefix: for log prefixes (e.g. "live/ask")
// • answerKind: forwarded to streamAnswer
// • defaultPromptVariant
What a thin caller looks like
// app/api/live/ask/route.ts (80 lines including imports + zod body)
export async function POST(req: Request) {
  const session = await requireAuth();
  const parsed = Body.safeParse(await req.json());
  if (!parsed.success) return Response.json({ error: "invalid_body" }, { status: 400 });
  const { contentId, question, conversationId } = parsed.data;
  return runAskStream(req, {
    session: { userId: session.userId, role: session.role },
    question,
    contentId,
    providedConversationId: conversationId,
    table: liveQueries,
    logPrefix: "live/ask",
    answerKind: "live",
    defaultPromptVariant: PROMPTS.liveAnswer.default,
    async plan() {
      const sources = await loadTranscriptSources([contentId], "Live session");
      return {
        sources,
        selectedContentIds: [contentId],
        routerUsage: { inputTokens: null, outputTokens: null },
      };
    },
    async resolveScopePin(conversationId, requestedContentId) {
      // first prior turn fixes the session id; subsequent turns must match
      ...
    },
  });
}
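To show the division of labor in something that runs on its own, here is a heavily simplified sketch. It is not the shell from this refactor: it is synchronous, has no SSE, auth, or conversation handling, and every name (runAsk, AskConfig, ShellDeps, and so on) is a hypothetical stand-in. What it keeps is the shape: a narrow per-type config on top, and quota, chunk loop, abort handling, and persist-before-close underneath.

```typescript
// Simplified, synchronous sketch. All names are hypothetical stand-ins.
interface AskPlan {
  sources: string[];
  selectedContentIds: string[];
}

interface QueryRow {
  userId: string;
  question: string;
  answer: string;
  contentIds: string[];
  completed: boolean;
}

// Narrow surface: the per-type config a caller supplies.
interface AskConfig {
  userId: string;
  question: string;
  table: QueryRow[]; // stand-in for a per-type queries table
  plan(question: string): AskPlan;
}

// Shared collaborators, injected so the sketch runs without a server.
interface ShellDeps {
  remainingQuota(userId: string): number;
  answerChunks(sources: string[], question: string): string[];
  isAborted(): boolean;
}

interface AskResult {
  status: number;
  answer: string;
}

// Fat insides: quota check, chunk loop, abort handling, persistence.
function runAsk(config: AskConfig, deps: ShellDeps): AskResult {
  if (deps.remainingQuota(config.userId) <= 0) {
    return { status: 429, answer: "" };
  }
  const plan = config.plan(config.question);
  let answer = "";
  let completed = false;
  try {
    let aborted = false;
    for (const chunk of deps.answerChunks(plan.sources, config.question)) {
      if (deps.isAborted()) {
        aborted = true; // client disconnected mid-stream
        break;
      }
      answer += chunk;
    }
    completed = !aborted;
  } finally {
    // Persist-before-close: the row is written even on abort or error.
    config.table.push({
      userId: config.userId,
      question: config.question,
      answer,
      contentIds: plan.selectedContentIds,
      completed,
    });
  }
  return { status: 200, answer };
}

// A thin caller: only the per-type pieces appear here.
const liveQueries: QueryRow[] = [];
const liveConfig: AskConfig = {
  userId: "user-1",
  question: "What was announced?",
  table: liveQueries,
  plan: () => ({ sources: ["transcript text"], selectedContentIds: ["live-42"] }),
};

const full = runAsk(liveConfig, {
  remainingQuota: () => 5,
  answerChunks: () => ["Hello", " world"],
  isAborted: () => false,
});

let abortChecks = 0;
const cut = runAsk(liveConfig, {
  remainingQuota: () => 5,
  answerChunks: () => ["Hello", " world"],
  isAborted: () => abortChecks++ >= 1, // abort after the first chunk
});
```

Both runs leave a row in liveQueries; the aborted one is marked completed: false with the partial answer. A bug in that discipline would be fixed in one place instead of once per route.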
11 · How LOC grows with content types
The shallow approach scales linearly: each new content type costs another ~350 lines. The deep approach pays a one-time shell cost and then ~80 lines per type.
The break-even point
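With the figures above (a ~360-line shell plus ~80 lines per type, against ~350 lines per copied route), the deep approach is more expensive at N = 1: about 440 lines vs 350, because you wrote the shell for a single caller. At N = 2 it is already ahead on raw line count (about 520 vs 700), but two callers are thin evidence that the shell's interface is right. At N ≥ 3 the deep approach is clearly better (about 600 vs 1,050), and the gap keeps widening. This is why a good time to extract a shell is when you're about to write the third copy: the second was a coincidence; the third is a pattern.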
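The arithmetic can be checked with a few lines, taking this article's round figures at face value (~350 lines per copied route, a ~360-line shell, ~80 lines per thin caller). The constants are approximations from the text, and the comparison counts lines only; it says nothing about migration effort or the risk of fixing an interface from too few examples.

```typescript
// Approximate figures from the article, not measurements.
const ROUTE_LOC = 350;  // one copy-pasted route
const SHELL_LOC = 360;  // one-time cost of the shared shell
const CALLER_LOC = 80;  // one thin per-type caller

// Shallow: every content type is another full copy.
function shallowLoc(types: number): number {
  return ROUTE_LOC * types;
}

// Deep: pay for the shell once, then a thin caller per type.
function deepLoc(types: number): number {
  return SHELL_LOC + CALLER_LOC * types;
}

const comparison = [1, 2, 3, 4, 5, 6].map((types) => ({
  types,
  shallow: shallowLoc(types),
  deep: deepLoc(types),
  saved: shallowLoc(types) - deepLoc(types),
}));

for (const row of comparison) {
  console.log(
    `N=${row.types}: shallow=${row.shallow} deep=${row.deep} saved=${row.saved}`,
  );
}
// N=1: shallow=350 deep=440 saved=-90
// N=2: shallow=700 deep=520 saved=180
// N=3: shallow=1050 deep=600 saved=450
// N=4: shallow=1400 deep=680 saved=720
// N=5: shallow=1750 deep=760 saved=990
// N=6: shallow=2100 deep=840 saved=1260
```

Each additional type widens the gap by 270 lines (350 minus 80), which is the linear-versus-amortized shape in numbers.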
12 · Path A vs Path B vs Path C
Three reasonable answers to "we want to add a new content type and we have a shallow-orchestration problem." Each has trade-offs.
Path A · Add new with shell, leave old
- PR size: Small (~400 lines new, ~0 churn in old routes)
- Risk: Low. The shell is validated by one fresh caller before more callers commit to it
- Scaling: Old routes still pay tax until migrated; each migration is mechanical and independent
- Lock-in: None. If the shell signature is wrong, you fix it once and only one caller is affected
- When: You have an immediate new feature to ship and want to start the cleanup without betting the farm
Path B · Migrate everything at once
- PR size: Large (~1,500 lines of churn across 5 routes)
- Risk: Higher. Four old routes commit to a shell that's only been tested by one caller
- Scaling: Clean immediately; no temporary inconsistency
- Lock-in: If the shell signature has a subtle gap, you discover it across four routes simultaneously
- When: You're confident in the shell shape, the team can review a big diff, no shipping pressure
Path C · Registry-based content types
- PR size: Very large; touches every layer (DB, lib, API, UI, types)
- Risk: Significant. Replaces a static union with runtime registration
- Scaling: Adding a content type becomes one registry entry, the best long-term shape
- Lock-in: High. The registry pattern is hard to undo; if it doesn't fit a type, you've made things worse
- When: You have 5–6 content types proven on the shell and a clear next batch that fits the same shape
The right answer in most situations is Path A. Two reasons:
- The new feature needs to ship. Adding it through the shell gets it out the door and proves the shell shape under one fresh caller.
- The shell signature is unproven. Migrating four existing routes into an unproven shell would be the bigger bet. The next migration becomes the second caller — still cheap to course-correct if needed.
13 · What stays separate (and why that's part of the design)
A real deep-module move names what it doesn't unify. The honest version of this refactor explicitly excludes a few things, and each exclusion has a reason:
- Loaders for differently-stored content stay separate. Suppose alongside the transcribed-content types there's also a structured documents type, say long-form books stored as markdown in object storage with custom page markers (e.g. <!-- page N -->). That loader has to parse markers the transcript loader doesn't need to know about. Different storage, different shape, different parsing → forcing it through the shared loader would mean the loader knows about page markers it has no business knowing about.
- Routers with different strategies stay separate. A cross-resource "library" search that has to first pick which book and then which chapter is a two-stage routing problem. A single-resource Q&A endpoint is one-stage. Different routing strategies belong as different functions, not as one "router" with branching internals.
- Per-type query log tables stay separate. podcast_queries, video_queries, live_queries: each has the right foreign-key targets and the right per-type analytics needs. The shell's table parameter accepts any of them; the union of column shapes is named explicitly (call it AskQueriesTable) and only includes the ones that genuinely share a column set.
- Routes whose insert shape really differs stay outside the shell. If one route logs bookSlug + selectedChapterSlugs and another logs contentId + selectedContentIds, those are different schemas. Forcing them through the shell would require the shell to know about books, which defeats the point of a narrow boundary.
Honest exclusion is part of the design
A deep module that refuses to unify what doesn't fit is more honest than a deep module that swallows everything. The shell's name (runAskStream) and its type (AskQueriesTable) say exactly what it covers — transcribed-content Q&A. The fact that structurally different routes don't fit isn't a failure; it's the boundary the abstraction earns by being narrow.
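One way the type system can carry that exclusion, sketched under assumptions: the Table wrapper, the row shapes, and the book_queries table are hypothetical, and real query tables would have more columns than shown. The point is only that a table with a different insert shape fails to type-check against the shell's parameter.

```typescript
// Hypothetical row shapes; column names follow the examples in the text.
interface ContentQueryRow {
  contentId: string;
  selectedContentIds: string[];
  question: string;
}

interface BookQueryRow {
  bookSlug: string;
  selectedChapterSlugs: string[];
  question: string;
}

// Minimal in-memory stand-in for a database table.
interface Table<Row> {
  name: string;
  rows: Row[];
}

const podcastQueries: Table<ContentQueryRow> = { name: "podcast_queries", rows: [] };
const videoQueries: Table<ContentQueryRow> = { name: "video_queries", rows: [] };
const liveQueries: Table<ContentQueryRow> = { name: "live_queries", rows: [] };
const bookQueries: Table<BookQueryRow> = { name: "book_queries", rows: [] };

// The union names exactly the tables that share a column set.
type AskQueriesTable =
  | typeof podcastQueries
  | typeof videoQueries
  | typeof liveQueries;

// The shell-side logger accepts any member of the union, and nothing else.
function logAskQuery(table: AskQueriesTable, row: ContentQueryRow): number {
  table.rows.push(row);
  return table.rows.length;
}

const loggedCount = logAskQuery(liveQueries, {
  contentId: "live-42",
  selectedContentIds: ["live-42"],
  question: "What was announced?",
});

// logAskQuery(bookQueries, ...) does not compile: BookQueryRow has no
// contentId or selectedContentIds, so the book route stays outside.
```

The compiler error is the boundary doing its job: the shell never has to learn what a book is.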
14 · The naming move
Folder names are part of the interface — and they're easy to get wrong in a way that compounds. A common pattern: the first surface you build gets a domain-specific folder name (say, lib/articles/), then subsequent surfaces piggyback on the same infrastructure, and the folder name never gets updated. Now the folder says one thing and contains another.
This connects directly to a different Pocock point: ubiquitous language. Names are part of the interface. A wrong name doesn't just look bad — it leaks into how every future contributor and agent thinks about the module. An LLM exploring the codebase will read lib/articles/router.ts and reason about it as "the articles router," not "the generic Q&A router." Every prompt that needs to talk about routing has to first re-establish that the name lies.
lib/articles/          ← misleading (only one of many surfaces uses articles)
  router.ts            (used by ALL Q&A surfaces)
  answer.ts            (used by ALL Q&A surfaces)
  history.ts           (used by ALL Q&A surfaces)
  index-loader.ts      (used by ALL Q&A surfaces)
  sources.ts           (used by ALL Q&A surfaces)
lib/articles/ → lib/qa/   ← honest
The fix is git mv + a find/replace on imports. Cosmetic at the file level, real at the cognitive level. The trick is sequencing: do the rename after the migrations, not during them.
Do the rename last
If you rename mid-migration, every file you touch shows up in the diff for two reasons: the migration and the import path. That muddies review. Rename after migrations are done, when the number of imports still pointing at the legacy name is at its minimum.
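The find/replace half of the rename can be sketched as a pure function over file contents. This is an assumption-laden illustration: the "@/" path alias in the sample is hypothetical, only static `from "..."` specifiers are handled (dynamic import() and require() calls are not), and a real codebase would more likely lean on the editor's or language server's rename support.

```typescript
// Rewrites module specifiers that point into fromDir so they point into
// toDir. Only static `from "..."` specifiers are handled.
function rewriteImportPaths(
  source: string,
  fromDir: string,
  toDir: string,
): string {
  // Escape regex metacharacters in the directory name.
  const escaped = fromDir.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The directory must start a path segment (right after the quote or a
  // "/") and end one (followed by "/" or the closing quote), so that
  // "lib/articles-archive" and "mylib/articles" are left alone.
  const pattern = new RegExp(
    `(from\\s+["'](?:[^"']*/)?)${escaped}(?=[/"'])`,
    "g",
  );
  return source.replace(pattern, (_match, prefix: string) => prefix + toDir);
}

// Sample input; the "@/" alias is a hypothetical project convention.
const before = [
  'import { routeQuestion } from "@/lib/articles/router";',
  'import { streamAnswer } from "../lib/articles/answer";',
  'import { old } from "@/lib/articles-archive/old";',
].join("\n");

const after = rewriteImportPaths(before, "lib/articles", "lib/qa");
```

Run after the migrations, a pass like this touches each remaining import exactly once, which keeps the rename diff boring.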
15 · The lesson, in one breath
Deep modules in the AI era aren't about clever abstraction. They're about drawing the boundary at the place where complexity legitimately differs vs is identical — and being honest about which is which.
The leverage point moves from "writing code" to "drawing the right boundary." That sounds like it should make senior judgment matter less in an AI-assisted workflow. It's the opposite. AI can hand-roll the implementation behind any interface you give it. Choosing the right interface is the part that doesn't automate.
This kind of refactor isn't a clever architectural move. It's three small acts of judgment, in order:
- Notice that the routes are 95% identical at the orchestration layer.
- Notice that the lower layers were already deep — the engine exists; only the route shell is leaky.
- Refuse to unify the routes whose shape really differs, even though the temptation is to pull everything into the shell.
None of those three are mechanical. They're shape-sensing decisions. The mechanical part — writing the 360-line shell, the 80-line new route — is the easy half.
What to take away
If you take only one thing: the right question is never "what kind of thing is this in the world?" It's "what shape is the data, and where does the medium-specific work end?" That question separates the parts that should collapse from the parts that legitimately stay separate. Everything else is consequence.