Turn Evals¶
Turn Evals are the most lightweight evaluation type in SCRAPI. They send a single message (or a short scripted sequence of messages) to your agent and assert on specific properties of the response — whether the agent called a particular tool, what it said, or whether it transferred control to a sub-agent.
Use Turn Evals when you need quick, targeted assertions on specific agent behaviors without the overhead of scripting a full conversation.
When to use Turn Evals¶
Turn Evals are a good fit when:
- You want to verify that a specific input always triggers a specific tool call
- You're checking that the agent doesn't call tools in certain situations
- You want to verify agent transfer happens for the right inputs
- You need a fast check during development without waiting for platform goldens to run
They're less appropriate for multi-turn conversations or for checking the quality of natural language responses — use Local Simulations for those.
The TurnEvals class¶
```python
from cxas_scrapi.evals.turn_evals import TurnEvals

turn_evals = TurnEvals(
    app_name="projects/my-project/locations/us/apps/my-app",
)
```
The TurnOperator enum¶
TurnOperator defines the assertion types available for each turn:
| Operator | Description |
|---|---|
| `CONTAINS` | Agent response contains the expected string |
| `EQUALS` | Agent response exactly equals the expected string |
| `TOOL_CALLED` | The specified tool was called during this turn |
| `TOOL_INPUT` | The specified tool was called with the expected input arguments |
| `TOOL_OUTPUT` | The specified tool returned the expected output |
| `NO_TOOLS_CALLED` | No tools were called during this turn |
| `AGENT_TRANSFER` | The agent transferred to the specified sub-agent |
```python
from cxas_scrapi.evals.turn_evals import TurnOperator

# Available values
TurnOperator.CONTAINS
TurnOperator.EQUALS
TurnOperator.TOOL_CALLED
TurnOperator.TOOL_INPUT
TurnOperator.TOOL_OUTPUT
TurnOperator.NO_TOOLS_CALLED
TurnOperator.AGENT_TRANSFER
```
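The `type` values used in the YAML examples below are the lowercase forms of these operator names. Assuming `TurnOperator` is a standard Python enum (as the name suggests), converting between the two representations is a plain name lookup; this is a sketch, not a documented SCRAPI helper:

```python
from cxas_scrapi.evals.turn_evals import TurnOperator

# A YAML expectation such as `type: tool_called` corresponds to
# TurnOperator.TOOL_CALLED; standard Enum name lookup converts it.
yaml_type = "tool_called"
operator = TurnOperator[yaml_type.upper()]  # TurnOperator.TOOL_CALLED
```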
Writing turn eval tests¶
Defining tests in YAML¶
Turn eval tests are defined in YAML files. Each test specifies user input and expectations:
```yaml
conversations:
  - conversation: order_lookup_triggers_tool
    user: "What's the status of order ORD-12345?"
    variables:
      order_12345_status: "shipped"
    expectations:
      - type: tool_called
        value: "lookup_order"

  - conversation: welcome_message_check
    event: "welcome"
    expectations:
      - type: contains
        value: "Welcome"
      - type: no_tools_called

  - conversation: tool_receives_correct_order_id
    user: "Check order ORD-12345 please"
    expectations:
      - type: tool_input
        value:
          order_id: "ORD-12345"

  - conversation: billing_transfers_to_billing_agent
    user: "I want to dispute a charge on my bill"
    expectations:
      - type: agent_transfer
        value: "billing-agent"
```
Multi-turn test cases¶
Turn Evals support short scripted sequences using the `turns` field:

```yaml
conversations:
  - conversation: order_id_collection_flow
    turns:
      - turn: ask_about_order
        user: "I want to check my order"
        expectations:
          - type: no_tools_called
          - type: contains
            value: "order ID"

      - turn: provide_order_id
        user: "It's ORD-12345"
        expectations:
          - type: tool_called
            value: "lookup_order"
```
Running tests programmatically¶
```python
from cxas_scrapi.evals.turn_evals import TurnEvals

turn_evals = TurnEvals(app_name="projects/my-project/locations/us/apps/my-app")

# Load test cases from YAML
test_cases = turn_evals.load_turn_test_cases_from_file("evals/turn_evals/core_assertions.yaml")

# Run all tests — returns a pandas DataFrame
results_df = turn_evals.run_turn_tests(test_cases)
```
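For a quick pass/fail overview before digging into individual rows, standard pandas aggregation works on the returned DataFrame (the column names are documented under Interpreting results below):

```python
# Count SUCCESS vs FAILURE rows across all expectations.
print(results_df["status"].value_counts())
```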
Running from a YAML file¶
Turn eval test cases can also be written with a top-level `test_cases` key, where each case names its turns explicitly:
```yaml
test_cases:
  - name: "order_lookup_triggers_tool"
    session_parameters:
      order_12345_status: "shipped"
    turns:
      - turn: user
        user: "What's the status of order ORD-12345?"
        expectations:
          - type: tool_called
            value: "lookup_order"

  - name: "welcome_has_no_tools"
    turns:
      - turn: event
        event: welcome
        expectations:
          - type: no_tools_called
```
You can also load from a directory to run all YAML files at once:
```python
test_cases = turn_evals.load_turn_tests_from_dir("evals/turn_evals/")
results_df = turn_evals.run_turn_tests(test_cases)
```
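In a CI pipeline, the directory loader pairs naturally with a test runner. Here is a minimal sketch assuming pytest (the wrapper is not part of SCRAPI itself):

```python
from cxas_scrapi.evals.turn_evals import TurnEvals

def test_turn_evals():
    # Run every turn eval YAML under evals/turn_evals/ and fail the
    # build if any expectation row reports FAILURE.
    turn_evals = TurnEvals(app_name="projects/my-project/locations/us/apps/my-app")
    test_cases = turn_evals.load_turn_tests_from_dir("evals/turn_evals/")
    results_df = turn_evals.run_turn_tests(test_cases)
    failed = results_df[results_df["status"] == "FAILURE"]
    assert failed.empty, failed[["test_name", "errors"]].to_string()
```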
Interpreting results¶
`run_turn_tests` returns a DataFrame with one row per expectation:

| Column | Description |
|---|---|
| `test_name` | Name of the test case |
| `turn` | The turn within a multi-turn test that the row belongs to |
| `user` | The user input that was sent |
| `status` | `SUCCESS` or `FAILURE` |
| `errors` | Error details if the assertion failed |
| `expected` | What was expected |
| `actual` | What was observed |
```python
results_df = turn_evals.run_turn_tests(test_cases)

failed = results_df[results_df["status"] == "FAILURE"]
if not failed.empty:
    print("Failed assertions:")
    for _, row in failed.iterrows():
        print(f"  {row['test_name']}: {row['errors']}")
```
Integration with the skills system¶
The Run skill includes Turn Evals as part of its combined reporting. When you run all four eval types, Turn Eval results appear in the combined report alongside tool tests, goldens, and simulations, making it easy to see the full picture at a glance.