Evals
Built-in evaluation harness for AI agents — compare MCP tool runs side-by-side to measure quality and guide prompt and tool improvements.
Evals let you compare two runs on the same input to measure quality differences between models, prompts, or tool sets.
Why evals?
Without evals, improving agents is guesswork. Evals give you a structured, repeatable way to answer questions like:
- Is Claude 3.5 Sonnet actually better than GPT-4o for this task?
- Did my prompt change improve or regress quality?
- Which tool set produces the most accurate results?
Running an eval
- Open the Evals panel (⌘3)
- Click New eval
- Select a baseline run (what you're comparing against)
- Configure the challenger — a different model, prompt, or tool set
- Click Run eval
Both runs execute against the same input, and the results are shown side-by-side.
Eval metrics
| Metric | Description |
|---|---|
| Duration | Wall-clock time for each run |
| Token usage | Prompt + completion tokens |
| Step count | Number of tool calls and model steps |
| Output diff | Side-by-side diff of the final output |
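
To make the numeric metrics concrete, here is a minimal sketch of how a baseline and a challenger could be compared by hand. The run objects and their field names (`durationMs`, `promptTokens`, `completionTokens`, `stepCount`) are assumptions made for illustration; they are not the tool's actual schema.

```javascript
// Hypothetical run summaries; the field names are assumptions, not the tool's schema.
const baseline = { durationMs: 4200, promptTokens: 1800, completionTokens: 650, stepCount: 7 };
const challenger = { durationMs: 3100, promptTokens: 1750, completionTokens: 900, stepCount: 5 };

// Token usage is prompt tokens plus completion tokens.
function totalTokens(run) {
  return run.promptTokens + run.completionTokens;
}

// Signed change from baseline to challenger; a negative value means the challenger used less.
function compareRuns(base, chal) {
  return {
    durationMs: chal.durationMs - base.durationMs,
    tokens: totalTokens(chal) - totalTokens(base),
    stepCount: chal.stepCount - base.stepCount,
  };
}

console.log(compareRuns(baseline, challenger));
// { durationMs: -1100, tokens: 200, stepCount: -2 }
```

In this made-up example the challenger is faster and takes fewer steps but spends more tokens, which is the kind of trade-off the side-by-side view is meant to surface.
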
Eval scoring
You can add a scorer to automatically grade outputs. Built-in scorers:
- Exact match — output must equal expected string
- Contains — output must include a substring
- JSON valid — output must be parseable JSON
- LLM judge — a second model grades the output on a rubric (requires API key)
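
One way to read the three deterministic scorers is as simple pass/fail checks. The functions below are an illustrative sketch, not the built-in implementations. They assume strict, case-sensitive comparison with no trimming, and a score of 1 for a pass and 0 for a fail. The built-in scorers may differ on any of these points.

```javascript
// Illustrative only: each function returns 1 for a pass and 0 for a fail.

// Exact match: the output must equal the expected string character for character.
function exactMatch(output, expected) {
  return output === expected ? 1 : 0;
}

// Contains: the output must include the given substring.
function contains(output, substring) {
  return output.includes(substring) ? 1 : 0;
}

// JSON valid: the output must parse as JSON without throwing.
function jsonValid(output) {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}

console.log(exactMatch("42", "42"));              // 1
console.log(contains("total: 42", "42"));         // 1
console.log(jsonValid('{"total": 42}'));          // 1
console.log(jsonValid("{total: 42}"));            // 0 (unquoted key is not valid JSON)
```
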
Custom scorers are JavaScript functions that receive the run output and return a score between 0 and 1.
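
As a sketch of what such a function might look like, the scorer below returns the fraction of required terms that appear in the output. It assumes the run output arrives as a string (or something that converts to one). The function name `keywordCoverage` and the `REQUIRED_TERMS` list are hypothetical, and how a scorer is registered with an eval is not shown here.

```javascript
// Hypothetical custom scorer: fraction of required terms found in the run output.
// REQUIRED_TERMS is an example list; replace it with terms that matter for your task.
const REQUIRED_TERMS = ["invoice", "total", "due date"];

function keywordCoverage(output) {
  // Assumption: the output is a string; coerce defensively and ignore case.
  const text = String(output).toLowerCase();
  const hits = REQUIRED_TERMS.filter((term) => text.includes(term)).length;
  // Always between 0 (no terms found) and 1 (all terms found).
  return hits / REQUIRED_TERMS.length;
}

console.log(keywordCoverage("Invoice total: $40, due date: 1 March")); // 1
console.log(keywordCoverage("No relevant content here"));              // 0
```

A graded score like this is more informative than pass/fail when partial answers are common, because a regression from 1 to 0.67 shows up as a change rather than a flat failure.
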
Eval history
All eval results are stored in SQLite alongside run data. The Evals panel shows a history of all past evaluations with their scores, making it easy to track improvement over time.