
The Prompting Playbook: Prompt vs Harness

A full explanation of Anthropic's prompting playbook: evals, failure modes, tools, harness design, and when not to solve agent problems with more prompt text.

agents prompting harness tool-use ai-engineering anthropic

The Core Thesis

Prompting is not “find the magic wording.” It is an engineering loop: define evals, clean the prompt, isolate failures, fix one failure mode at a time, and know when the prompt is the wrong layer to solve the problem. The biggest idea is that good AI behavior comes from the combination of prompt + model + harness + tools + evals, not from prompt text alone.

Part 1: Debugging An Existing Prompt

The first scenario is a customer-support bot for a telco. The prompt has become messy over time: multiple contributors, no clear owner, copied website text, old model-specific patches, contradictions, and policy, tone, data, and instructions all mixed together.

The first move is evals. The speaker says you need tests before changing the prompt: when a new model behaves differently, the tests tell you whether the issue is something the prompt can fix or a limit of the model's capability. The eval suite should include:

  1. Control cases: obvious cases the model should always pass.
  2. Edge cases: known historical failures.
  3. Capability-boundary cases: places where the model should refuse, escalate, or admit limitation.
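To make the three categories concrete, here is a minimal sketch of an eval suite. Everything in it is an assumption for illustration: the `EvalCase` shape, the keyword checks, the sample messages, and `stub_bot`, which stands in for a real model call. A real suite would use richer checks than substring matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                  # "control", "edge", or "boundary"
    user_message: str
    check: Callable[[str], bool]   # True if the reply is acceptable


def run_evals(bot: Callable[[str], str], cases: list[EvalCase]) -> dict[str, tuple[int, int]]:
    """Return {category: (passed, total)} for the given bot."""
    results: dict[str, tuple[int, int]] = {}
    for case in cases:
        passed, total = results.get(case.category, (0, 0))
        ok = case.check(bot(case.user_message))
        results[case.category] = (passed + int(ok), total + 1)
    return results


# Hypothetical cases for the telco bot; the checks are crude keyword matches.
CASES = [
    EvalCase("control", "What are your support hours?",
             lambda reply: "hours" in reply.lower()),
    EvalCase("edge", "What is my hotspot allowance on my legacy plan?",
             lambda reply: "gb" in reply.lower()),
    EvalCase("boundary", "I was charged twice this month.",
             lambda reply: "escalat" in reply.lower()),
]


def stub_bot(message: str) -> str:
    """Stand-in for a real model call (hypothetical canned replies)."""
    if "hours" in message:
        return "Our support hours are 9am to 5pm."
    if "hotspot" in message:
        return "Please check the legacy plans page."
    return "I am escalating this to a billing specialist."


report = run_evals(stub_bot, CASES)
```

With the stub above, the edge case fails because the bot deflects instead of stating the allowance. That per-category signal is what you want before touching the prompt.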

Then she applies general prompt hygiene. The important rule: if a human cannot tell which parts are policy, data, guidelines, and tone, the model probably cannot either. The fix is structure: separate role, policy, tone, data, and output format, often with XML-style tags.
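As a rough illustration of that structure, the sketch below assembles a system prompt from separately owned sections. The function name, the tag names, and the sample text are assumptions, not the prompt from the talk.

```python
def build_system_prompt(*, role: str, policy: str, tone: str,
                        customer_data: str, output_format: str) -> str:
    """Wrap each concern in its own XML-style tag so the sections stay separable."""
    sections = {
        "role": role,
        "policy": policy,
        "tone": tone,
        "customer_data": customer_data,
        "output_format": output_format,
    }
    return "\n\n".join(
        f"<{tag}>\n{body.strip()}\n</{tag}>" for tag, body in sections.items()
    )


# Hypothetical content for the telco support bot.
prompt = build_system_prompt(
    role="You are a customer-support agent for a telco.",
    policy="Escalate billing errors to a human agent.",
    tone="Friendly, concise, no jargon.",
    customer_data="Plan: legacy unlimited. Hotspot allowance: see account record.",
    output_format="Plain text, at most three short paragraphs.",
)
```

Keeping each section as its own argument also means each one can have its own owner and its own change history.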

Failure Mode 1: The Model Withholds Information

In the hotspot example, the model has the customer’s actual legacy-plan hotspot allowance but refuses to give it, because the prompt says not to give wrong plan details and to direct legacy users elsewhere. That defensive patch probably made sense for an older model, but the newer model over-optimizes for it.

Lesson: hallucination is not the only risk. Models can also become too cautious and withhold information they actually have. Track defensive prompt patches in version control with the reason they were added, so you can remove them when models improve.
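One possible way to record that history, sketched under assumptions: the `PromptPatch` fields, the model names, and the date are all made up for illustration. The point is only that each patch carries its reason and the model it was written for.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PromptPatch:
    text: str             # the instruction added to the prompt
    reason: str           # the failure it was meant to prevent
    added_for_model: str  # hypothetical model identifier
    added_on: date


# Hypothetical registry entry, kept in version control next to the prompt.
PATCHES = [
    PromptPatch(
        text="Do not state legacy plan details; direct legacy users elsewhere.",
        reason="An older model stated wrong legacy plan details.",
        added_for_model="model-v1",
        added_on=date(2024, 1, 15),
    ),
]


def patches_to_review(patches: list[PromptPatch], current_model: str) -> list[PromptPatch]:
    """Patches written for a different model are candidates for removal."""
    return [p for p in patches if p.added_for_model != current_model]
```

After a model upgrade, `patches_to_review(PATCHES, "model-v2")` returns the list of instructions to re-test against the evals with and without the patch.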

Failure Mode 2: Instructions Don’t Add Capability

For proration billing, the prompt says things like “always calculate correctly,” but the model gives vague math. The fix is not stronger wording. The fix is a tool: define a proration calculator, expose the tool schema, and implement the actual math.

Lesson: telling the model to do hard math correctly does not make it reliable at math. If the task requires exact computation, give it a tool.
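A minimal sketch of what such a tool could look like. The tool name, its parameters, and the proration formula (monthly price times days used, divided by days in the billing cycle) are assumptions; the source does not give the actual formula. The schema follows the name / description / JSON Schema shape commonly used for tool definitions.

```python
from decimal import Decimal, ROUND_HALF_UP

# Tool definition exposed to the model (hypothetical name and fields).
PRORATION_TOOL = {
    "name": "calculate_proration",
    "description": "Compute the prorated charge for a partial billing cycle.",
    "input_schema": {
        "type": "object",
        "properties": {
            "monthly_price": {"type": "string", "description": "Decimal string, e.g. '60.00'"},
            "days_used": {"type": "integer"},
            "days_in_cycle": {"type": "integer"},
        },
        "required": ["monthly_price", "days_used", "days_in_cycle"],
    },
}


def calculate_proration(monthly_price: str, days_used: int, days_in_cycle: int) -> str:
    """Assumed formula: price * days_used / days_in_cycle, rounded to cents."""
    if days_in_cycle <= 0 or not 0 <= days_used <= days_in_cycle:
        raise ValueError("days_used must be between 0 and days_in_cycle")
    amount = Decimal(monthly_price) * days_used / days_in_cycle
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def handle_tool_call(name: str, arguments: dict) -> str:
    """Harness-side dispatch: the model picks the tool, the code does the math."""
    if name == PRORATION_TOOL["name"]:
        return calculate_proration(**arguments)
    raise ValueError(f"unknown tool: {name}")
```

The model's job shrinks to extracting the three inputs from the conversation. The arithmetic runs in `Decimal`, so the result does not depend on how carefully the model reasons.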

Failure Mode 3: One-Sided Objectives Create Bad Tradeoffs

For billing errors, the model should escalate to a human, but the prompt emphasizes that escalation costs money and hurts fast-resolution metrics. The model therefore avoids escalation even when escalation is correct.

The fix is to state both sides of the tradeoff: escalation costs money, but mishandling billing errors costs refunds and customer trust.

Lesson: smarter models optimize objectives more seriously. If you give only one side of a business tradeoff, the model may follow it too well.

Part 2: Building A New Agent From Scratch

The second scenario is a scheduling agent that creates a week-long retail staff schedule under hard constraints. Here the lesson expands beyond prompt text.

Because scheduling has hard rules, evaluation can be programmatic: a Python function can count constraint violations instead of using an LLM judge.
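A sketch of such a checker, assuming three illustrative hard constraints (minimum staff per day, at least one manager per day, and employee availability) and a schedule stored as a day-to-staff mapping. The real rule set and data model in the talk may differ.

```python
def find_violations(schedule: dict[str, list[str]],
                    availability: dict[str, set[str]],
                    managers: set[str],
                    min_staff: int) -> list[str]:
    """Return one human-readable message per hard-constraint violation."""
    violations = []
    for day, staff in schedule.items():
        if len(staff) < min_staff:
            violations.append(f"{day}: needs {min_staff} staff, has {len(staff)}")
        if not any(person in managers for person in staff):
            violations.append(f"{day}: no manager on shift")
        for person in staff:
            if day not in availability.get(person, set()):
                violations.append(f"{day}: {person} is unavailable")
    return violations


# Hypothetical data for a one-day slice of the week.
draft = {"Wed": ["Sarah", "Lee"]}
availability = {"Sarah": {"Mon", "Tue"}, "Lee": {"Wed"}, "Dana": {"Wed"}}
problems = find_violations(draft, availability, managers={"Dana"}, min_staff=3)
score = len(problems)  # the eval metric: zero means the schedule passes
```

Because the output is a list of messages and not just a count, the same function can feed a repair step later.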

The video compares five approaches:

  1. Simple prompt + smaller model: fails all trials.
  2. Same prompt + stronger reasoning model: fewer violations, still not good enough.
  3. Strong model + adaptive thinking: passes, but uses much more latency and tokens.
  4. Smaller model + better prompt: improves, but may hit output limits and still be expensive.
  5. Agentic generate/evaluate/repair loop: passes with lower latency/tokens than trying to force one prompt to do everything.

Agentic generate/evaluate/repair loop

This is the last approach in the comparison above, and it deserves a closer look.

Instead of giving the model one giant prompt that says: “build a staff schedule, check all the rules, fix mistakes, and return a perfect answer,” you break the work into three smaller stages:

  1. Generate: the model creates a first draft.
  2. Evaluate: a separate step checks the draft and returns a precise list of problems.
  3. Repair: a third step receives the problems and fixes only those issues.

Why is this better? Because one big prompt asks the model to hold too many things in its head at once. It is creating, judging, repairing, and trying not to break anything all in one pass. In an agentic loop, each step is simpler, so the model burns fewer tokens, does less messy reasoning, and can sometimes reach a better result faster.

In the staff scheduling example from the video:

  • generate creates an initial schedule.
  • evaluate says: “Wednesday is missing one employee”, “Sarah was scheduled when she is unavailable”, “there is a shift without a manager”.
  • repair fixes those specific issues instead of rebuilding everything from scratch.

It is similar to how we work with code: first you write, then you run tests and lint, then you fix what failed. You do not try to write perfect code by compiling it in your head.

The big architectural lesson: sometimes the best “prompt” is not one larger prompt. Split the job into smaller steps: generate a draft, evaluate violations, repair targeted issues. This makes the process more reliable and lets you add soft constraints at runtime without rewriting the backend validator.
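The control flow of that loop can be sketched as below. The three callables are stubs standing in for two model calls and a validator. The schedule data, the single "manager on every shift" rule, and the `max_rounds` cap are assumptions for illustration.

```python
from typing import Callable

Schedule = dict[str, list[str]]
MANAGERS = {"Dana"}  # hypothetical


def solve(generate: Callable[[], Schedule],
          evaluate: Callable[[Schedule], list[str]],
          repair: Callable[[Schedule, list[str]], Schedule],
          max_rounds: int = 3) -> tuple[Schedule, list[str]]:
    """Generate a draft, then alternate evaluate/repair until clean or out of rounds."""
    draft = generate()
    for _ in range(max_rounds):
        problems = evaluate(draft)
        if not problems:
            return draft, []
        draft = repair(draft, problems)
    return draft, evaluate(draft)  # report whatever is still broken


def generate() -> Schedule:
    # Stand-in for a model call that drafts a schedule.
    return {"Tue": ["Dana", "Sarah"], "Wed": ["Sarah", "Lee"]}


def evaluate(draft: Schedule) -> list[str]:
    # Deterministic validator: list the days that have no manager.
    return [day for day, staff in draft.items() if not MANAGERS & set(staff)]


def repair(draft: Schedule, problems: list[str]) -> Schedule:
    # Stand-in for a model call that fixes only the flagged days.
    fixed = {day: list(staff) for day, staff in draft.items()}
    for day in problems:
        fixed[day].append("Dana")
    return fixed


final, remaining = solve(generate, evaluate, repair)
```

Note that `solve` is plain code, not a prompt: the ordering, the stopping rule, and the round cap all live in the harness.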

The Practical Playbook

Use this whenever you are maintaining or building an agent:

  1. Start with evals before prompt edits.
  2. Include control, edge, and boundary/capability cases.
  3. Clean prompt structure before debugging individual failures.
  4. Separate policy, data, tone, role, and output contract.
  5. Remove stale defensive patches from older models.
  6. Fix one failure mode at a time.
  7. Use tools when the model needs exact capability.
  8. State both sides of tradeoffs.
  9. Use structured outputs or stop sequences when format consistency matters.
  10. Consider model choice and harness design, not just wording.
  11. For complex tasks, split into generate/evaluate/repair or similar agentic loops.
  12. Treat prompts as versioned production artifacts, not vibes in a text box.

My Take

This is basically “prompting as systems engineering.” The most useful part is the insistence that a model failure should not immediately become “add another sentence to the prompt.” First ask: is this a prompt clarity issue, stale instruction issue, missing tool issue, wrong model issue, output-contract issue, or architecture issue?

That distinction is where the money is.

Prompt vs Harness

The generate/evaluate/repair loop is mostly a harness/orchestration change, not just a prompt change.

A prompt-only version looks like this:

user request -> one big prompt -> model -> final answer

The agentic version looks like this:

user request -> generate prompt -> draft
draft -> evaluator / validator -> list of problems
draft + problems -> repair prompt -> improved answer

So yes, the prompts change too, but the deeper change is that the system around the model changes. That surrounding system is the harness: the code that decides which model to call, in what order, with which tools, schemas, validators, retries, stop sequences, and follow-up calls.

What belongs to the harness?

If the issue is prompt structure, fix it in the prompt. For example, policy, tone, data, and task instructions are all mixed together.

If an old instruction is interfering, that is usually a prompt/versioning problem. For example, a defensive patch from an older model now makes a newer model too cautious.

If the model lacks a capability, that is a tool + harness problem. For example, do not just tell the model “calculate accurately”; give it a calculator tool and wire that tool into the API flow.

If the model is the wrong fit, that is a harness/config problem. Model choice, routing, fallback models, adaptive thinking, max tokens, and latency/cost tradeoffs all live outside the prompt.

If the output format is unreliable, that can start as a prompt issue, but the stronger fix is often harness-level: structured outputs, JSON schema, parser validation, retry-on-invalid-output, or stop sequences.

If the task needs multiple steps instead of one big answer, that is clearly harness/orchestration. Examples: generate/evaluate/repair, planner/executor, retrieve/read/write, classify/route/respond.

When Not To Fix With Prompting

If the model does math badly, do not just write “be accurate.” Add a calculator or deterministic function.

If the model needs fresh or private data, do not write “use the latest information.” Add search, RAG, database access, or retrieval.

If the model returns broken JSON, do not rely only on “return valid JSON.” Use structured outputs or schema validation with retry.
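A minimal sketch of validation with retry, using only the standard library. The required fields (`intent`, `escalate`), the retry wording, and the stub model are assumptions. Where the API offers structured outputs, that is usually the stronger option.

```python
import json
from typing import Callable

# Hypothetical output contract: field name -> expected Python type.
REQUIRED_FIELDS = {"intent": str, "escalate": bool}


def parse_reply(raw: str) -> dict:
    """Parse and validate one reply; raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} missing or not {expected.__name__}")
    return data


def call_with_retry(call_model: Callable[[str], str], prompt: str,
                    max_attempts: int = 3) -> dict:
    """Call the model, validate, and feed the error back on failure."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return parse_reply(raw)
        except ValueError as err:
            feedback = f"\n\nYour previous reply was invalid ({err}). Return only valid JSON."
    raise RuntimeError(f"no valid reply after {max_attempts} attempts")


def make_flaky_model() -> Callable[[str], str]:
    """Stub model: the first reply has chatter around the JSON, the second is clean."""
    replies = iter([
        'Sure! Here you go: {"intent": "billing"}',
        '{"intent": "billing", "escalate": true}',
    ])
    return lambda prompt: next(replies)


result = call_with_retry(make_flaky_model(), "Classify: I was charged twice.")
```

The prompt can still ask for valid JSON, but the guarantee comes from `parse_reply`, not from the wording.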

If the model must follow hard business rules, do not leave everything to judgment. Put deterministic routing in the harness. Example: if billing_conflict=true, escalate to a human.
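The billing example might look like this in harness code. The ticket shape, the `route` function, and the action names are assumptions. The only part taken from the text is the rule itself: a billing conflict always escalates.

```python
from typing import Callable


def route(ticket: dict, ask_model: Callable[[dict], str]) -> str:
    """Apply hard business rules first; only then let the model decide."""
    if ticket.get("billing_conflict") is True:
        return "escalate_to_human"  # deterministic: the model is never consulted
    return ask_model(ticket)


# Hypothetical stand-in for a model call that picks the next action.
def stub_model(ticket: dict) -> str:
    return "answer_directly"


decision = route({"id": 1, "billing_conflict": True}, stub_model)
```

Since the check runs before any model call, no prompt wording about escalation cost can talk the system out of it.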

If the task has many constraints, do not stuff all constraints into one huge prompt. Use a validator or a generate/evaluate/repair loop.

If the system is too slow or expensive, that is usually not a prompt problem. Look at model routing, caching, max tokens, smaller models, async flows, or step splitting.

If an action has side effects, like sending an email, charging a customer, deleting data, or publishing content, do not rely only on “ask before doing it.” Put a real approval gate in the harness.
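A sketch of such a gate, assuming a simple tool registry and an `approve` callback that represents the human or policy check. The tool names mirror the examples above; everything else is hypothetical.

```python
from typing import Any, Callable

# Tools that change the outside world and therefore need approval (assumed list).
SIDE_EFFECT_TOOLS = {"send_email", "charge_customer", "delete_data", "publish_content"}


def execute_tool(name: str, args: dict[str, Any],
                 tools: dict[str, Callable[..., Any]],
                 approve: Callable[[str, dict[str, Any]], bool]) -> dict[str, Any]:
    """Run a model-requested tool, but block side effects unless approved."""
    if name not in tools:
        return {"status": "error", "reason": f"unknown tool {name!r}"}
    if name in SIDE_EFFECT_TOOLS and not approve(name, args):
        return {"status": "blocked", "reason": "approval denied"}
    return {"status": "ok", "result": tools[name](**args)}


# Hypothetical tools: one read-only, one with a side effect.
outbox: list[str] = []


def send_email(to: str) -> str:
    outbox.append(to)
    return "sent"


TOOLS = {"lookup_plan": lambda customer_id: "legacy", "send_email": send_email}

denied = execute_tool("send_email", {"to": "a@example.com"}, TOOLS,
                      approve=lambda name, args: False)
```

The model can request `send_email` as often as it likes. Nothing leaves the outbox until `approve` returns true, regardless of what the prompt says.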

Practical Rule

Use prompting when the problem is about understanding, instruction clarity, tone, task framing, or prioritization.

Change tools or harness when the problem is about capability, reliability, validation, output format, external data, latency, cost, safety gates, or multi-step workflow.