Compounding Agentic Workflows: When the Skill Runs the Second Time

Claude drove my app today like a user. Clicked through a feature, took eleven screenshots, learned the UI as it went, then wrote me the Hebrew operator manual for it — text, layout, RTL formatting, screenshots inlined. PDF ready to email by lunch.

That sentence is the thing people quote-tweet. It’s also the wrong thing to focus on. The eleven screenshots and the PDF aren’t the engineering claim. The engineering claim is that this was the second run of a workflow, and the second run was structurally easier than the first. The skill that ran today was crystallized from yesterday. Tomorrow’s run will be easier than today’s.

That’s what I actually want to talk about. Agentic efficiency isn’t a property of the model or the tools. It’s a property of the artifact you build with the agent over time. Compounding is the mechanism. The second run is where you find out if the compounding worked.

The first run was one prompt

The app is a sales-and-orders system in Hebrew, used by non-engineering staff. When I ship a feature, two things eventually need to exist for it: a manual the operator reads to learn the feature, and an automation skill so Claude can hit the same flow over the API later for bulk work. Two consumers, same source of truth — the actual UI.

The first time I built that pair, it was one natural-language prompt. I described what I wanted: verify the code path, drive the app from a real browser tab like a customer would, screenshot every screen, write an automation skill that hits the same flow via API, then write a Hebrew manual for the operator with the screenshots inlined. Single session. End to end.

The output wasn’t a draft. It was a finished manual, shipped to the operator, in use. The API skill works. The PDF rendered correctly with RTL Hebrew. That distinction matters for what comes next, so I want to underline it: when I say “first run,” I don’t mean a prototype. I mean a real artifact that did its job in production. It was good enough that I wanted every future feature to get the same treatment.

/skillify, at the end

That’s why I ran /skillify at the end of the session. Skillify is a meta-skill: you point it at a workflow you’ve just finished, and it reviews what you actually did, extracts the load-bearing parts, and emits a SKILL.md the next run can read. The output is a project-local skill file, callable as a slash command, that encodes the workflow as instructions Claude reads at the top of every future invocation.

The framing matters. You don’t sit down and design a skill. You ship a real artifact by hand, and at the end — when the work was good enough that you want every future instance of this work to get the same treatment — you skill it. The skill is a crystallization of a successful run, not a top-down spec.

The difference is everything. A designed skill is somebody’s guess about what the workflow should be. A crystallized skill is what the workflow actually was the one time it produced a real artifact. The first is theory. The second is shipped.

The second run

Today I picked another feature in the app, pointed /feature-runbook at it, and Claude went. It read the route handlers in the backend. It attached chrome-cdp to a real production browser tab — read-only on every mutation form, since this was production, not dev. Opened forms, screenshotted empty states, closed without clicking save, used existing entities in the database for the “after” states. Eleven screenshots, zero writes.

Then it wrote the Hebrew manual paragraph by paragraph, image refs and all. Built it to HTML. Rendered to PDF through headless Chrome. Emailed me the PDFs through AgentMail when it was done. The whole loop was one slash command and a couple of clarifying answers.

I want to break apart what’s actually doing the work in that loop, because the skill paragraphs are the thing that compounds — and unless you can name the load-bearing parts, you can’t tell which of your own workflows are skillable.

Claude as the user, not Claude as the tester

The first primitive is the agent operating the app like a real human. Not testing it. Using it.

The distinction is sharp. A test asserts an expectation against a known input. A user discovers the UI as they go — what’s on this screen, where’s the button, what does this label mean, what happens if I click here. Chrome-cdp gives Claude that second posture. It opens the app, looks at what’s actually rendered, reads the buttons, navigates, and figures out the flow.

This matters because half the artifacts I want — the operator manual, the API skill, the screenshot set, the knowledge updates for the voice agent — depend on what the user sees, not what the component renders. Component code lies, in the friendly sense: it tells you what should appear, modulo i18n keys, modulo permissions, modulo conditional rendering, modulo the seven other things that decide what actually shows up. The browser is the single source of truth. Claude looks at it.

Read-only is the other half of this primitive. Production data is real. Mutation forms get opened and screenshotted at empty state, never submitted. The “after” state comes from existing entities the database already has. That rule is encoded as a paragraph in the skill: never click Save on a prod form; screenshot empty state, find an existing entity for the result state. That’s one paragraph. It applies to every future feature.
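A minimal sketch of how that rule could also be enforced mechanically, not only as an instruction. Everything here is hypothetical: the `Action` type, the `BLOCKED_LABELS` set, and the Hebrew button labels are assumptions for illustration, not part of chrome-cdp or of the actual skill.

```python
from dataclasses import dataclass

# Hypothetical labels of buttons that mutate data; assumed for illustration only.
BLOCKED_LABELS = {"save", "submit", "delete", "שמור", "מחק"}


@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "click", "screenshot", "navigate"
    label: str = ""  # visible text of the target element, if any


def is_allowed_on_prod(action: Action) -> bool:
    """Allow observation; refuse clicks on anything that looks like a mutation."""
    if action.kind != "click":
        return True
    return action.label.strip().lower() not in BLOCKED_LABELS


def filter_plan(plan: list[Action]) -> list[Action]:
    """Drop forbidden steps from a planned production walkthrough."""
    return [step for step in plan if is_allowed_on_prod(step)]
```

A label denylist like this is a backstop, not a guarantee. The paragraph in the skill is still what tells the agent why the rule exists.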

The skill knows when to ask

The second primitive is the agent knowing where the human-in-the-loop lives.

Today’s feature had two audiences. Internal staff use it one way. External users use it another. The first run hadn’t faced this, because that feature had only one audience. The second run hit it immediately.

What the skill did, early in the run, was pause and ask. Through the clarifying-question tool, it surfaced four scope questions: one runbook or two? Internal third-person Hebrew, or also a customer-facing second-person version? Walk through prod or local dev? Generate the paired API skill yes or no? I picked: both audiences, prod walkthrough, paired skill yes. Same screenshot directory, two manuals.

That’s an engineering decision a lot of agentic systems get wrong. Either they barrel ahead and produce one artifact when two were needed, or they ask twenty questions every run and the human ends up doing the work anyway. The right move is to ask exactly the questions that the work-so-far cannot answer. The skill encoded which questions those were — because the first run’s blind spot was that it never thought to branch on audience. Once that blind spot was visible, the skill paragraph wrote itself: if the feature has more than one user role, ask whether to produce one runbook or two.
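One way to picture "ask exactly the questions the work-so-far cannot answer" is as a function from known facts to open questions. This is a sketch under stated assumptions: `FeatureFacts` and its fields are hypothetical names, and the real skill expresses this logic as prose paragraphs, not code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FeatureFacts:
    """What the run already knows. None means it cannot be inferred yet."""
    user_roles: list[str] = field(default_factory=list)
    environment: Optional[str] = None       # "prod" or "dev"
    wants_api_skill: Optional[bool] = None


def clarifying_questions(facts: FeatureFacts) -> list[str]:
    """Return only the questions that the known facts leave unanswered."""
    questions = []
    if len(facts.user_roles) > 1:
        questions.append("One runbook, or one per audience?")
    if facts.environment is None:
        questions.append("Walk through prod or local dev?")
    if facts.wants_api_skill is None:
        questions.append("Generate the paired API skill?")
    return questions
```

A feature with a single role, a known environment, and a stated preference produces an empty list, so the run proceeds without interrupting the human.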

That’s the second run’s contribution. Run one was the discovery. Run two stress-tests the discovery on a feature with a different shape, and the skill levels up on what the first run had glossed over.

Run one is discovery. Run two is the stress test. Run three is the skill running you.

Write-time verification

The third primitive is verification baked into the writing step, not after.

The paired API skill the workflow generates is a markdown file with curl examples for every endpoint the feature touches. The first instinct is to write the curl examples from memory of how the feature behaves. The skill paragraph says no. Read every relevant route handler in source before writing a single curl example. Confirm the endpoint path, the HTTP method, the request body shape, the response shape. Only then write the curl.

Today that rule caught two field-naming bugs at write time. Tiny names that look right at a glance and fail at the API — the kind of bug that costs an hour of debugging when somebody runs the curl and gets a 400. The skill caught both before the file was saved.
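The shape of that check, as a minimal sketch. It assumes the route handlers have already been read into a small table of method, path, and field names. The `ROUTES` table, the `/api/orders` path, and the field names are invented for illustration and are not taken from the real app.

```python
# Hypothetical route table, as it might look after reading the handlers in source.
ROUTES = {
    ("POST", "/api/orders"): {
        "fields": {"customer_id", "items", "note"},
        "required": {"customer_id", "items"},
    },
}


def check_example(example: dict, routes: dict) -> list[str]:
    """Return mismatches between a drafted request example and the route table."""
    key = (example["method"].upper(), example["path"])
    spec = routes.get(key)
    if spec is None:
        return [f"no handler for {key[0]} {key[1]}"]
    body_fields = set(example.get("body", {}))
    unknown = body_fields - spec["fields"]
    missing = spec["required"] - body_fields
    problems = [f"unknown field: {name}" for name in sorted(unknown)]
    problems += [f"missing required field: {name}" for name in sorted(missing)]
    return problems
```

A draft that sends `customerId` where the handler expects `customer_id` comes back with two problems, one unknown field and one missing required field, before the example is ever written to the skill file.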

This is what “agentic efficiency” actually means in practice. It doesn’t mean the agent is fast. It means the agent is correct, because verification happens at the write step instead of after the artifact is shipped and somebody downstream pays the cost.

Encoded gotchas

The fourth primitive — and the one that does the actual compounding — is that every gotcha from a previous run is a paragraph in SKILL.md the next run reads at the top of the invocation.

A short list of paragraphs in the runbook skill, just to be concrete:

Verify every endpoint shape against the route handlers in source before writing curl examples.
Never click Save on a prod mutation form. Screenshot empty state, use an existing entity for the result state.
Hebrew strings in cdp.mjs eval shell args fail under zsh quoting. Write the JS to a file and eval $(cat path.js).
Markdown-to-PDF: use Chrome headless with --no-pdf-header-footer. Hebrew RTL needs dir="rtl" on the body, not just per-element.
AgentMail body-size limit is 9MB. If you have more than one PDF, send them as separate emails.

Each one of those paragraphs is a wall I hit on a previous run. Now they’re walls the next run doesn’t hit.
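To make the PDF paragraph concrete, here is a hedged sketch of the two pieces it describes: wrapping the rendered manual so `dir="rtl"` sits on the body, and building the headless Chrome command as an argument list. The function names and the `google-chrome` binary name are assumptions (on macOS the binary usually lives inside the application bundle), and the actual pipeline may differ.

```python
from pathlib import Path


def wrap_rtl(body_html: str, lang: str = "he") -> str:
    """Wrap rendered manual HTML so the whole document is right-to-left."""
    return (
        f'<!DOCTYPE html>\n<html lang="{lang}">\n'
        '<head><meta charset="utf-8"></head>\n'
        f'<body dir="rtl">\n{body_html}\n</body>\n</html>\n'
    )


def chrome_pdf_command(html_path: Path, pdf_path: Path,
                       chrome: str = "google-chrome") -> list[str]:
    """Build an argv list for headless PDF export, without a date/URL header or footer."""
    return [
        chrome,
        "--headless",
        "--no-pdf-header-footer",
        f"--print-to-pdf={pdf_path}",
        html_path.resolve().as_uri(),  # Chrome loads the local file via a file:// URL
    ]
```

Passing the list to `subprocess.run` without `shell=True` hands the arguments to Chrome directly, so no shell is involved in quoting them.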

That’s the compounding mechanism. The first run pays the discovery tax. The second run inherits it. The third run inherits two runs’ worth. By the fifth or sixth run, the skill is a small library of “things that bit me, sorted by when they bite,” and the human work is reduced to picking the feature and answering the clarifying questions. Nothing magical. Just paragraphs.

Why this is different from documentation

Documentation captures what the system does. A skill captures what the workflow does. Those are different artifacts.

Documentation rots when the system changes. Skills rot when the workflow changes — and workflows change much less often than systems do. The Hebrew shell-quoting paragraph above doesn’t care which feature you’re documenting. It cares that you’re driving chrome-cdp from a Mac with zsh, which is true for every run of this workflow on my machine.

That’s why skills compound and documentation doesn’t. Documentation is a snapshot of the world; skills are the procedure for shipping in it. As long as the procedure stays the same, every gotcha you encode against the procedure pays for itself on every future run.

Three shapes of compounding, not one

I want to be careful here, because it’s easy to read everything I just wrote as “compounding works when you do the exact same work twice.” That’s the narrowest version of the claim and it’s not the only one.

Workflow compounding is what the runbook example showed. Two runs of structurally similar work, the second cheaper than the first because the procedure was crystallized. This is what skills are best at. It needs the work to repeat in shape, even if the content differs.

Knowledge compounding is broader. Every gotcha you encode is a fact about the system you’re operating in — your shell, your DB, your deploy pipeline, your domain — and that fact stays true even when the next task is a totally different shape. The Hebrew shell-quoting paragraph from earlier doesn’t only apply to runbooks. It applies to any future workflow that drives chrome-cdp from this machine. Knowledge accrues across workflows, not just within one.

Learning compounding is broader still. Every time you reach for a tool, the next time you reach for it you reach a little faster and with a clearer sense of what it’s good for. That happens whether the work repeats or not. Skills are one way to capture it; context files are another; the engineer’s own intuition over time is a third. None of them require the next task to look like the last.

So “skill it” is one specific bet. It bets that the workflow will repeat. When that bet is right, you get the full speed-up of the runbook example. When it isn’t — research, original design, work where every instance is genuinely novel — you don’t, and skilling it would cost more attention than it saves. But knowledge and learning still compound underneath, in context files, in the gotchas you write down, in the patterns you internalize. Compounding doesn’t stop. The artifact you build to capture it changes shape.

The test for whether to skill a workflow is simple. After a run, ask: if I had to do this same kind of work again next week, would the bulk of what I just figured out apply, or would I be starting from scratch? If the answer is “most of it applies,” skill it. If the answer is “I’d start from scratch,” don’t skill the workflow — but write down what you learned somewhere the next run will see it.

That question is itself an instance of the third kind. I learned it by skilling things I shouldn’t have, and not skilling things I should have, until the right question crystallized. The heuristic lives in my head rather than in a SKILL.md, but it compounds the same way — paid for once, applied every time. Most work that feels novel turns out to be the same shape repeating, and the question is what tells me when I’m fooled.

The broader shape

Agentic efficiency isn’t an attribute of the model. Models get better, but the speed-up curve from a better model is logarithmic. The speed-up curve from a richer artifact between you and the agent has a different shape: every gotcha you encode is a fixed cost paid once and amortized over every future run.
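The amortization argument is plain arithmetic. In the sketch below the numbers are purely illustrative, in minutes, and are not measurements from the runs described here. The function name is likewise an assumption.

```python
def average_cost_per_run(discovery_cost: float, marginal_cost: float, runs: int) -> float:
    """Average cost per run when discovery is paid once and reused afterwards."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return discovery_cost / runs + marginal_cost


# Illustrative numbers only: 120 minutes of one-time discovery, 15 minutes per run.
for n in (1, 2, 6):
    print(n, average_cost_per_run(discovery_cost=120, marginal_cost=15, runs=n))
```

With those made-up inputs the average falls from 135.0 to 75.0 to 35.0 minutes per run. The one-time discovery cost shrinks toward zero per run, and what remains is the marginal work of picking the feature and answering questions.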

That’s why the second run matters more than the first. The first run produces a real artifact, which is necessary, but the first run is also where the cost is highest and the discovery is messiest. The second run is where you find out which of your discoveries generalize. The skill that ran today is the skill that yesterday’s session produced — and tomorrow’s session, on whatever feature comes next, will run a slightly better version of it, because today found two new gotchas that didn’t exist in yesterday’s SKILL.md.

Skill it, don’t re-run it. Crystallize the workflow when the work is good enough that you want every future instance to get the same treatment. Let the second run stress-test what the first run glossed over. Encode every gotcha as a paragraph. The agent doesn’t get better; the artifact between you and the agent does. That’s where the compounding lives.
