SimulationEvals¶
SimulationEvals runs AI-driven end-to-end conversation simulations against your CXAS agent. Instead of scripting exact utterances, you describe goals and success criteria — and a Gemini model figures out what to say at each turn to try to achieve them. This is a great way to test how your agent handles realistic, messy, unpredictable conversations.
Here are the key concepts:
- `Step` (Pydantic model): a single goal within a simulation, with a `goal`, `success_criteria`, optional `response_guide`, and a `max_turns` limit. Steps can also include a `static_utterance` for when you want a fixed first message, and `inject_variables` for seeding session state.
- `StepStatus` enum: tracks whether each step is `NOT_STARTED`, `IN_PROGRESS`, or `COMPLETED`.
- `simulate_conversation()`: drives the full multi-turn loop, returning an `LLMUserConversation` object that contains the transcript, step progress, and expectation results.
- `generate_report()`: produces a `SimulationReport` with two DataFrames: goal progress and expectation results. It renders as styled HTML in a Jupyter notebook.
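As a sketch, a fully populated step might look like the following. The field names come from the list above; the import path for `Step` is an assumption:

```python
from cxas_scrapi import Step  # import path is an assumption

step = Step(
    goal="User wants to update their mailing address",
    success_criteria="Agent confirms the new address back to the user",
    response_guide="Be terse; only give the street address when asked",  # optional hint for the simulated user
    max_turns=6,
    static_utterance="Hi, I moved recently and need to update my address.",  # fixed first message
    inject_variables={"customer_tier": "gold"},  # seeds session state
)
```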
Quick Example¶
```python
from cxas_scrapi import SimulationEvals

app_name = "projects/my-project/locations/us/apps/my-app-id"
sim = SimulationEvals(app_name=app_name)

test_case = {
    "steps": [
        {
            "goal": "User wants to check their account balance",
            "success_criteria": "Agent provides a numeric balance and account status",
            "max_turns": 5,
        },
        {
            "goal": "User asks to dispute a charge",
            "success_criteria": "Agent acknowledges the dispute and provides a reference number",
            "max_turns": 8,
        },
    ],
    "expectations": [
        "The agent should never ask for the full credit card number",
        "The agent should offer to escalate if it cannot resolve the dispute",
    ],
}

# Run the simulation
conversation = sim.simulate_conversation(
    test_case=test_case,
    console_logging=True,
)

# View the report
report = conversation.generate_report()
print(report)  # Colorized in terminal, styled HTML in Jupyter
```
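Because the report wraps two DataFrames, they can also be inspected with pandas directly. The accessor and column names below are hypothetical; check the `SimulationReport` object for the real ones:

```python
# Hypothetical accessors: inspect the SimulationReport object
# (e.g. with dir(report)) for the actual attribute names.
goals_df = report.goal_progress
expectations_df = report.expectation_results

# Standard pandas operations then apply, e.g. counting unmet expectations
# (the "passed" column name is likewise an assumption):
print((~expectations_df["passed"]).sum())
```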
Reference¶
SimulationEvals ¶
Bases: Apps
Wrapper class to simulate entire multi-turn conversations with a CXAS Agent.
Source code in src/cxas_scrapi/evals/simulation_evals.py
simulate_conversation ¶
simulate_conversation(test_case, model=_DEFAULT_GEMINI_MODEL, session_id=None, console_logging=True, modality='text')
Runs the simulated conversation loop.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_case | Dict[str, Any] | The test case dictionary defining evaluation steps. | required |
| model | str | The Gemini model used for evaluating turns. | _DEFAULT_GEMINI_MODEL |
| session_id | Optional[str] | Existing session ID to use for the conversation. | None |
| console_logging | bool | Whether to print the interaction transcript to the console. | True |
| modality | str | 'text' or 'audio'. | 'text' |
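For example, the same test case could be run over audio against an existing session (argument values are illustrative):

```python
conversation = sim.simulate_conversation(
    test_case=test_case,
    session_id="existing-session-id",  # illustrative; omit to start fresh
    console_logging=False,
    modality="audio",
)
```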
Source code in src/cxas_scrapi/evals/simulation_evals.py
run_simulations ¶
run_simulations(test_cases, runs=1, parallel=1, model=_DEFAULT_GEMINI_MODEL, modality='text', verbose=False)
Runs multiple simulations, optionally in parallel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| test_cases | List[Dict[str, Any]] | List of test case dictionaries. | required |
| runs | int | Number of runs per test case. | 1 |
| parallel | int | Number of parallel workers (capped at 25). | 1 |
| model | str | Gemini model to use. | _DEFAULT_GEMINI_MODEL |
| modality | str | 'text' or 'audio'. | 'text' |
| verbose | bool | Whether to log to the console (only active if parallel=1). | False |
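A sketch of a batch run, using only the parameters documented above:

```python
# Three repetitions of each test case, spread over five workers.
# verbose is left False because console logging only applies when parallel=1.
results = sim.run_simulations(
    test_cases=[test_case],
    runs=3,
    parallel=5,
)
```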
Source code in src/cxas_scrapi/evals/simulation_evals.py
export_results_to_golden ¶
export_results_to_golden(results, output_path=None)
Exports simulation results to a Golden Evaluation YAML file.
Fetches the full conversation trace for each simulation from the platform to ensure accuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| results | List[Dict[str, Any]] | The list of results returned by run_simulations. | required |
| output_path | Optional[str] | Optional local path to save the generated YAML. | None |

Returns:

| Type | Description |
|---|---|
| str | The generated YAML string. |
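Continuing from the `run_simulations` sketch above, the results can be exported and saved in one call:

```python
# Writes the YAML to disk and also returns it as a string
yaml_str = sim.export_results_to_golden(
    results=results,
    output_path="golden_evals.yaml",  # illustrative path
)
print(yaml_str.splitlines()[0])  # peek at the first line
```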
Source code in src/cxas_scrapi/evals/simulation_evals.py
Step ¶
Bases: BaseModel
StepStatus ¶
Bases: str, Enum
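After a simulation, per-step progress can be checked against `StepStatus`. The `conversation.steps` and `step.status` accessors here are assumptions for illustration:

```python
from cxas_scrapi import StepStatus  # import path is an assumption

for step in conversation.steps:  # hypothetical accessor for the Step objects
    if step.status is not StepStatus.COMPLETED:  # hypothetical status attribute
        print(f"Unfinished goal: {step.goal} ({step.status.value})")
```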