I built a prompt optimization tool with Claude, based on GEPA (Genetic-Pareto), a reflective prompt-evolution method.

The idea: an LLM runs candidate prompts against test cases over repeated iteration rounds, refining them until it converges on the best prompt it can find for the task.

The method was originally released as an optimizer in the DSPy library, and it suits tasks with clearly measurable outputs, such as data extraction or invoice classification.
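
For reference, this is roughly how GEPA is invoked through DSPy. A minimal sketch based on the DSPy docs, not my tool's code: the model names and tiny dataset are placeholders, and exact signatures may differ between DSPy versions.

```python
import dspy

# Placeholder model name; substitute whatever LM you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# GEPA metrics can return textual feedback alongside the score;
# the reflection step uses that feedback to rewrite the prompt.
def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    score = float(gold.label == pred.label)
    feedback = "correct" if score else f"expected {gold.label}, got {pred.label}"
    return dspy.Prediction(score=score, feedback=feedback)

# Toy examples standing in for real labeled test cases.
trainset = [
    dspy.Example(document="Invoice #42 from Acme Corp", label="invoice").with_inputs("document"),
    dspy.Example(document="Meeting notes, Q3 review", label="notes").with_inputs("document"),
]

program = dspy.Predict("document -> label")

optimizer = dspy.GEPA(
    metric=metric,
    auto="light",                            # preset budget for the optimization run
    reflection_lm=dspy.LM("openai/gpt-4o"),  # a stronger model for the reflection step
)
# A real run would hold out a separate valset instead of reusing trainset.
optimized = optimizer.compile(program, trainset=trainset, valset=trainset)
```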

The Build Process

I wanted to understand the method deeply, so I started building examples with Claude Code using an LLM-as-judge setup. The process involved plenty of mistakes and misunderstandings: it explained, I corrected and guided it, and eventually I understood what we were building and reached a good result.
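
To make "LLM as judge" concrete, here is a minimal sketch of the idea using the Anthropic SDK. The rubric prompt is illustrative only, and the model name is just an example; this is the shape of a judge, not the one from my tool.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative rubric; a real judge prompt spells out the task's criteria.
JUDGE_PROMPT = """You are grading a model output against an expected answer.
Expected: {expected}
Actual: {actual}
Reply with a single integer score from 0 (wrong) to 10 (perfect)."""

def judge(expected: str, actual: str, model: str = "claude-sonnet-4-20250514") -> float:
    message = client.messages.create(
        model=model,
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(expected=expected, actual=actual),
        }],
    )
    try:
        return int(message.content[0].text.strip()) / 10.0
    except ValueError:
        return 0.0  # an unparseable verdict counts as a failure
```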

In the end we also built a skill for it and tested it: Claude in the browser reviewed and improved the skill and produced a test case, which I then ran with Claude Code. Once I saw the skill worked, I asked it to rebuild a visualization tool on top of it.

The Steps

This is how we went deep into all the stages (a condensed sketch of the full loop follows the list):

  • Base: the LLM runs the current prompt on the test cases and scores the results
  • Reflection: a reflection step analyzes the chain of thought and model traces to diagnose failures
  • Mutation: it writes improved prompts based on the reflection and runs them again (including crossover between successful prompts)
  • Selection: the winning prompt becomes the baseline for the next round
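
Put together, the loop looks roughly like this. It is a condensed sketch, not the tool itself: run_and_score, reflect, and mutate are hypothetical stand-ins for the Claude calls described above.

```python
import random

# Stand-ins for LLM calls: run the task, judge the output, reflect, rewrite.
def run_and_score(prompt: str, case: dict) -> tuple[str, float]:
    output = f"<model output for {case['input']}>"
    return output, random.random()  # placeholder for the LLM-as-judge score

def reflect(prompt: str, traces: list[tuple[str, float]]) -> str:
    return "summary of failure patterns found in the traces"

def mutate(prompt: str, analysis: str, n: int = 3) -> list[str]:
    return [f"{prompt} [revision {i} addressing: {analysis}]" for i in range(n)]

def evaluate(prompt: str, cases: list[dict]) -> float:
    return sum(run_and_score(prompt, c)[1] for c in cases) / len(cases)

def optimize(prompt: str, cases: list[dict], rounds: int = 5) -> str:
    best, best_score = prompt, evaluate(prompt, cases)
    for _ in range(rounds):
        traces = [run_and_score(best, c) for c in cases]  # Base: run and score
        analysis = reflect(best, traces)                  # Reflection
        for candidate in mutate(best, analysis):          # Mutation
            score = evaluate(candidate, cases)
            if score > best_score:                        # Selection
                best, best_score = candidate, score       # winner seeds the next round
    return best

print(optimize("Extract the invoice number.", [{"input": "doc1"}, {"input": "doc2"}]))
```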

Pareto Frontier

An important selection step is the Pareto frontier: don't pick candidates by highest score alone; balance competing objectives such as accuracy vs. length or creativity vs. factuality, along with whatever other variables matter for the task.
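
A minimal sketch of that selection rule, with illustrative objective names: a prompt survives on the frontier as long as no other prompt beats it on every objective at once.

```python
def pareto_front(candidates: list[dict[str, float]]) -> list[dict[str, float]]:
    """Keep every candidate that no other candidate beats on all objectives."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and strictly better somewhere
        return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

# Both axes scaled so higher is better.
prompts = [
    {"accuracy": 0.92, "brevity": 0.40},  # kept: highest accuracy
    {"accuracy": 0.90, "brevity": 0.75},  # kept: slightly less accurate, far shorter
    {"accuracy": 0.85, "brevity": 0.70},  # dropped: beaten on both axes by the one above
]
print(pareto_front(prompts))  # the first two survive
```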

The Result

An LLM that is both judge and optimizer: it learns from the data and rewrites the prompt based on its analysis of the results. Instead of trying prompts manually, the LLM does it in a structured, systematic way.

What Can We Learn?

  • It's always worth asking the model how best to work with it, even when you don't have a structured process
  • Give the model context and it will understand the task and write prompts for itself
  • The best way to learn is to build

Try the tool