Testing & Evaluation¶
Testing a conversational agent is different from testing a traditional software system. The agent's behavior is probabilistic — the same input can produce slightly different outputs on different runs. A robust evaluation strategy tests deterministic behavior where it exists, and measures quality where it doesn't.
SCRAPI provides five complementary evaluation types, each targeting a different layer of your agent.
The five evaluation types¶
| Type | What it tests | Where it runs | Format |
|---|---|---|---|
| Platform Goldens | Deterministic turn-by-turn responses and tool calls | CX Agent Studio platform | YAML with `conversations:` key |
| Local Simulations | Open-ended conversation goals over multiple turns | Local machine using Sessions API | YAML with `evals:` key |
| Tool Tests | Isolated tool inputs and outputs with assertions | Local machine | YAML with `tests:` key |
| Callback Tests | Python unit tests for callback code | Local machine (pytest) | pytest files |
| Turn Evals | Single-turn response assertions | Local machine using Sessions API | Python code |
Choosing the right eval type¶
Use this decision tree to pick the right eval type for what you're testing:
```
Is the behavior deterministic (same input → exact same output)?
├── YES → Can you test it as a tool in isolation?
│   ├── YES → Tool Tests
│   └── NO  → Platform Goldens (turn-by-turn assertions)
└── NO  → Is this a multi-turn conversation goal?
    ├── YES → Local Simulations
    └── NO  → Turn Evals (single-turn quality checks)

Is the behavior in a Python callback?
└── YES → Callback Tests (pytest)
```
In practice, you'll use multiple types together. A well-tested agent typically has:
- Tool tests for every tool — fast, isolated, run in seconds
- Platform goldens for key happy-path conversations — verifies the full agent loop
- Local simulations for complex multi-step scenarios where exact responses vary
- Callback tests for any non-trivial callback logic
Quick comparison by use case¶
- "I want to test that my tool returns the right data"
- Use Tool Tests. They call the tool directly and assert on specific fields in the response.
- "I want to test that the agent says the right thing in a specific conversation"
- Use Platform Goldens. You script the exact conversation and the expected responses.
- "I want to test that the agent can complete a task without caring about exact phrasing"
- Use Local Simulations. You define a goal and success criteria; Gemini plays the user and judges whether the goal was met.
- "I want to test that my callback code is correct before deploying it"
- Use Callback Tests. They're just pytest — you can mock the platform objects and test your logic in isolation.
- "I want to quickly verify a single agent response meets a condition"
- Use Turn Evals. They're the most lightweight option for single-turn assertions.
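For illustration, here is a minimal sketch of the kind of pytest file a callback test can be. The callback under test (`normalize_zip_code`) and the shape of its request object are hypothetical, invented only to show the mock-and-assert pattern; substitute your own callback and whatever platform objects it actually receives.

```python
from unittest.mock import MagicMock


# Hypothetical callback under test. Assume it reads a session parameter from
# the incoming request and writes a normalized value back; the real signature
# depends on your agent's callback interface.
def normalize_zip_code(request):
    raw = request.session_params.get("zip_code", "")
    request.session_params["zip_code"] = raw.replace(" ", "").strip()
    return request


def test_normalize_zip_code_strips_whitespace():
    # Mock the platform request object so the callback runs in isolation,
    # with no agent, session, or network involved.
    request = MagicMock()
    request.session_params = {"zip_code": " 94 043 "}

    result = normalize_zip_code(request)

    assert result.session_params["zip_code"] == "94043"
```

Because nothing here touches the agent or the network, tests like this run in milliseconds, which is what makes them a good first line of defense before the slower, conversation-level evals.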
What's in this section¶
- Platform Goldens - YAML format, pushing evals to the platform, running them with `cxas run --wait`, and interpreting results.
- Local Simulations - YAML format, the `SimulationEvals` class, parallel execution, and audio modality support.
- Tool Tests - YAML format with operator assertions, the `ToolEvals` class, and `cxas test-tools`.
- Callback Tests - Directory structure, pytest integration, the `CallbackEvals` class, and `cxas test-callbacks`.
- Turn Evals - The `TurnEvals` class, the `TurnOperator` enum, and single-turn code examples.
- Running Evaluations - End-to-end CLI workflow, `cxas ci-test`, exit codes for CI, and combining multiple eval types.