Invoked

Evals

Built-in evaluation harness for AI agents — compare MCP tool runs side-by-side to measure quality and guide prompt and tool improvements.

Evals let you compare two runs on the same input to measure quality differences between models, prompts, or tool sets.

Why evals?

Without evals, improving agents is guesswork. Evals give you a structured, repeatable way to answer questions like:

  • Is Claude 3.5 Sonnet actually better than GPT-4o for this task?
  • Did my prompt change improve or regress quality?
  • Which tool set produces the most accurate results?

Running an eval

  1. Open the Evals panel (⌘3)
  2. Click New eval
  3. Select a baseline run (what you're comparing against)
  4. Configure the challenger — a different model, prompt, or tool set
  5. Click Run eval

Both runs execute against the same input, and the results are shown side-by-side.

Eval metrics

Metric        Description
Duration      Wall-clock time for each run
Token usage   Prompt + completion tokens
Step count    Number of tool calls and model steps
Output diff   Side-by-side diff of the final output
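
To make the numeric metrics concrete, here is a minimal sketch of how a baseline and a challenger could be compared. The object shape and field names (durationMs, promptTokens, completionTokens, stepCount) are assumptions made for illustration; they are not a documented export format.

```javascript
// Hypothetical metrics comparison. Field names are assumed for illustration
// and do not come from the product's documented data model.
function totalTokens(run) {
  // Token usage is prompt tokens plus completion tokens.
  return run.promptTokens + run.completionTokens;
}

function compareRuns(baseline, challenger) {
  // Positive values mean the challenger used more than the baseline.
  return {
    durationMs: challenger.durationMs - baseline.durationMs,
    totalTokens: totalTokens(challenger) - totalTokens(baseline),
    stepCount: challenger.stepCount - baseline.stepCount,
  };
}
```

A negative delta on every field would mean the challenger was faster, cheaper, and took fewer steps, though that says nothing about output quality, which is what the output diff and scorers are for.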

Eval scoring

You can add a scorer to automatically grade outputs. Built-in scorers:

  • Exact match — output must equal expected string
  • Contains — output must include a substring
  • JSON valid — output must be parseable JSON
  • LLM judge — a second model grades the output on a rubric (requires API key)
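
The three deterministic scorers can be pictured as simple checks like the ones below. This is a sketch and not the tool's actual implementation: the function names, the 0/1 return values, and the strict comparison with no whitespace trimming or case folding are all assumptions.

```javascript
// Illustrative versions of the deterministic checks. Names, signatures,
// and the 0/1 scoring convention are assumptions, not the built-in code.

// Exact match: 1 if the output equals the expected string, otherwise 0.
function exactMatch(output, expected) {
  return output === expected ? 1 : 0;
}

// Contains: 1 if the output includes the substring, otherwise 0.
function contains(output, substring) {
  return output.includes(substring) ? 1 : 0;
}

// JSON valid: 1 if the output parses as JSON, otherwise 0.
function jsonValid(output) {
  try {
    JSON.parse(output);
    return 1;
  } catch {
    return 0;
  }
}
```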

Custom scorers are JavaScript functions that receive the run output and return a score between 0 and 1.
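
As an illustration, a custom scorer might award partial credit based on how many required terms appear in the output. The sketch below assumes the run output arrives as a string; the function name, the term list, and the way a scorer is registered with an eval are hypothetical and not taken from this page.

```javascript
// Hypothetical custom scorer: fraction of required terms found in the output.
// REQUIRED_TERMS and the function name are made up for this example.
const REQUIRED_TERMS = ["total", "currency", "due date"];

function keywordCoverage(output) {
  // Coerce to string and lowercase so matching is case-insensitive.
  const text = String(output).toLowerCase();
  const hits = REQUIRED_TERMS.filter((term) => text.includes(term)).length;
  // hits ranges from 0 to REQUIRED_TERMS.length, so the score stays in [0, 1].
  return hits / REQUIRED_TERMS.length;
}
```

Returning a fraction instead of a pass/fail value lets the history show gradual improvement between runs, which a 0-or-1 scorer would hide.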

Eval history

All eval results are stored in SQLite alongside run data. The Evals panel shows a history of all past evaluations with their scores, making it easy to track improvement over time.
