
Running Evals

Run structured evals to compare AI agent runs — benchmark models, prompts, and MCP tool sets side-by-side to measure quality and guide improvements.

When to run evals

Run evals when you want to answer a specific question:

  • "Is this model better for my use case than that one?"
  • "Did my prompt change make things better or worse?"
  • "What's the cost/quality tradeoff between Sonnet and Haiku?"

Setting up an eval

1. Choose a baseline

Select an existing run to use as the baseline: the reference the challenger is compared against. Pick a run that represents the current behavior you want to improve on.

2. Configure the challenger

The challenger is what you're testing. You can vary:

  • Model — same prompt, different model
  • Prompt — same model, different prompt (use the prompt editor)
  • Tools — same prompt and model, different tool set

3. Run

Click Run eval. Invoked executes the challenger against the same input as the baseline and presents the results side-by-side.

Reading eval results

The eval view shows:

Baseline                    Challenger
─────────────────           ─────────────────
Model: claude-3-opus        Model: claude-3-5-sonnet
Duration: 4.2s              Duration: 1.8s  ✓ faster
Tokens: 2,104               Tokens: 1,891   ✓ fewer
Steps: 6                    Steps: 4        ✓ fewer

Output diff:
  Line 1 unchanged
- Old phrasing from baseline
+ New phrasing from challenger
  Line 3 unchanged
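
The markers in the sample view follow a simple rule: for duration, tokens, and steps, a lower challenger value is better. A minimal sketch of that comparison is below. The run-summary objects and their field names (`durationSec`, `tokens`, `steps`) are illustrative assumptions, not a documented Invoked data format; the values mirror the sample view.

```javascript
// Hypothetical run summaries; field names are assumptions for illustration,
// not an Invoked export format. Values mirror the sample view.
const baseline = { model: 'claude-3-opus', durationSec: 4.2, tokens: 2104, steps: 6 }
const challenger = { model: 'claude-3-5-sonnet', durationSec: 1.8, tokens: 1891, steps: 4 }

// For every metric here, lower is better.
const METRICS = [
  { key: 'durationSec', better: 'faster', worse: 'slower' },
  { key: 'tokens', better: 'fewer', worse: 'more' },
  { key: 'steps', better: 'fewer', worse: 'more' },
]

// Label the challenger relative to the baseline, one row per metric.
function compareRuns(base, chal) {
  return METRICS.map(({ key, better, worse }) => {
    const delta = chal[key] - base[key]
    const verdict = delta < 0 ? better : delta > 0 ? worse : 'same'
    return { metric: key, baseline: base[key], challenger: chal[key], verdict }
  })
}

for (const row of compareRuns(baseline, challenger)) {
  console.log(`${row.metric}: ${row.baseline} -> ${row.challenger} (${row.verdict})`)
}
```

A single pair of runs shows direction only. Whether a difference is meaningful depends on how much these numbers vary from run to run.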

Adding a scorer

To automatically grade outputs, add a scorer to the eval:

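// scorer.js: returns a score from 0 to 1
// Scores the fraction of required keywords that appear in the output.
// `expected` is accepted but not used by this keyword check.
export function score(output, expected) {
  const keywords = ['risk', 'impact', 'timeline']
  // Coerce a missing or non-string output so the scorer never throws.
  const text = String(output ?? '').toLowerCase()
  const found = keywords.filter(k => text.includes(k))
  return found.length / keywords.length
}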

Load it from Evals → New eval → Add scorer → Load file.
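
The keyword scorer ignores its `expected` argument. When a test case does supply an expected output, a scorer can compare against it instead. The sketch below is one possible approach, not a built-in Invoked scorer: it returns the fraction of distinct words in `expected` that also appear in `output`. The helper names `tokenize` and `overlapScore` are hypothetical, and returning 0 when no expected output is given is a design choice made for this example.

```javascript
// Split text into lowercase words, dropping punctuation and empty strings.
function tokenize(text) {
  return String(text ?? '').toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
}

// Fraction of distinct expected words that also occur in the output (0 to 1).
function overlapScore(output, expected) {
  const expectedWords = new Set(tokenize(expected))
  // Assumption: with no expected output there is nothing to match, so score 0.
  if (expectedWords.size === 0) return 0
  const outputWords = new Set(tokenize(output))
  let hits = 0
  for (const word of expectedWords) {
    if (outputWords.has(word)) hits += 1
  }
  return hits / expectedWords.size
}

// In scorer.js this would be exported under the name the example uses:
//   export { overlapScore as score }
console.log(overlapScore('Timeline slips raise risk', 'risk and timeline'))
```

In the sample call, two of the three expected words ("risk" and "timeline") appear in the output, so the score is 2/3. Word overlap rewards shared vocabulary rather than correctness, so it suits rough regression checks better than fine-grained grading.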

Batch evals

Run the same comparison across multiple inputs to get aggregate statistics:

  1. Create a test set (CSV or JSON file with input and optional expected_output columns)
  2. In New eval, toggle Batch mode
  3. Upload your test set
  4. Run — Invoked executes both baseline and challenger for every row
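
A test set for step 1 can be written by hand or generated. The sketch below builds the same two cases in both formats. The column names `input` and `expected_output` come from step 1; the JSON layout (an array of objects, one per case) and the sample inputs are assumptions for illustration, so check them against what the upload dialog accepts.

```javascript
// Sample cases (illustrative). `expected_output` is optional, as in step 1.
const cases = [
  { input: 'Summarize the Q3 migration plan', expected_output: 'risk impact timeline' },
  { input: 'List open blockers for the release' },
]

// JSON form: assumed to be an array of objects, one per test case.
const json = JSON.stringify(cases, null, 2)

// CSV form: quote a field only when it contains a quote, comma, or line break,
// doubling any embedded quotes (the RFC 4180 convention).
function csvField(value) {
  const text = String(value ?? '')
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text
}

const csv = [
  'input,expected_output',
  ...cases.map(c => [c.input, c.expected_output].map(csvField).join(',')),
].join('\n')

console.log(json)
console.log(csv)
```

A case without an expected output becomes an empty trailing field in the CSV and an omitted key in the JSON.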

Aggregate results show mean/median scores, duration, and token usage across all cases.
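
To make the aggregate figures concrete, here is a minimal sketch of how mean and median could be computed from per-case results. The result objects, their field names, and their values are hypothetical, and the sketch assumes both statistics are wanted for every metric; it is not a description of how Invoked computes its aggregates.

```javascript
// Hypothetical per-case results for one side of a batch eval (made-up values).
const results = [
  { score: 1.0, durationSec: 1.9, tokens: 1850 },
  { score: 0.5, durationSec: 2.4, tokens: 2010 },
  { score: 1.0, durationSec: 1.6, tokens: 1790 },
  { score: 0.0, durationSec: 3.1, tokens: 2300 },
]

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

function median(values) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  // Even count: average the two middle values.
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Mean and median of one metric across all cases.
function summarize(rows, key) {
  const values = rows.map(r => r[key])
  return { mean: mean(values), median: median(values) }
}

for (const key of ['score', 'durationSec', 'tokens']) {
  console.log(key, summarize(results, key))
}
```

With these sample values the mean score is 0.625 and the median is 0.75. The single failing case pulls the mean down more than the median, which is why it helps to read both.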
