Local Simulations¶
Local Simulations take a different approach to testing than Platform Goldens. Instead of scripting exact conversations and expected responses, you describe a goal and let an AI-powered user simulator (Gemini) try to achieve it. At the end, Gemini judges whether the agent met the goal and any additional expectations you specified.
This is valuable when you want to test that an agent can complete a task, without caring about the exact phrasing of each response — which is especially important for voice agents where natural language variation is expected.
How simulations work¶
- SCRAPI starts a real session with your agent using the Sessions API
- Gemini plays the role of a human user, sending messages to the agent to try to achieve the goal
- The conversation continues until the goal is met, the max number of turns is reached, or the agent ends the session
- Gemini evaluates whether each step's
success_criteriawas met and whether anyexpectationswere satisfied - SCRAPI produces a report with pass/fail status for each step and expectation
Because the user is simulated by a language model, the conversation is non-deterministic — each run may produce slightly different messages. This mirrors how real users behave.
YAML format¶
Simulation files use the evals: key at the top level:
evals:
- name: "successful_order_lookup"
tags: ["P0", "order_management"]
session_parameters:
order_12345_status: "shipped"
order_12345_eta: "2026-04-18"
steps:
- goal: "Ask about the status of order ORD-12345"
success_criteria: "The user has provided order ID ORD-12345 and the agent has acknowledged it"
response_guide: "The user is a customer checking on a recent purchase. They are polite but want a quick answer."
max_turns: 3
- goal: "Get the order status and delivery date"
success_criteria: "The agent has provided the shipping status and the estimated delivery date"
max_turns: 2
expectations:
- "The agent correctly identified the order as shipped"
- "The agent mentioned the estimated delivery date"
- "The agent maintained a friendly, helpful tone throughout"
Top-level fields¶
| Field | Type | Description |
|---|---|---|
name | string | Unique name for this evaluation |
tags | list | Tags for filtering (e.g., ["P0", "smoke"]) |
session_parameters | dict | Variables injected at session start |
steps | list | Ordered sequence of conversational goals |
expectations | list | Post-conversation quality assertions evaluated by Gemini |
Step fields¶
| Field | Type | Description |
|---|---|---|
goal | string | What the simulated user is trying to accomplish in this step |
success_criteria | string | The condition that determines whether this step is complete |
response_guide | string | Persona and context hints for the simulated user |
max_turns | int | Maximum turns allowed before declaring the step incomplete |
static_utterance | string | Instead of AI simulation, send this exact text (useful for testing specific inputs) |
inject_variables | dict | Variables to inject for the first step only (overrides session_parameters) |
Expectations¶
Expectations are evaluated by Gemini after the full conversation completes, looking at the entire transcript. They're natural language assertions:
expectations:
- "The agent never made up information that wasn't in the tool response"
- "The agent asked for the order ID before looking it up"
- "The agent offered to help with anything else before ending"
Each expectation is judged as Met or Not Met, with a justification from Gemini.
The SimulationEvals class¶
For programmatic use, import SimulationEvals:
from cxas_scrapi.evals.simulation_evals import SimulationEvals
sim_evals = SimulationEvals(
app_name="projects/my-project/locations/us/apps/my-app",
)
Running a single evaluation programmatically¶
The simulate_conversation method takes a test_case dict defining the steps and expectations:
from cxas_scrapi.evals.simulation_evals import SimulationEvals
sim_evals = SimulationEvals(app_name="projects/my-project/locations/us/apps/my-app")
test_case = {
"steps": [
{
"goal": "Ask about order ORD-12345",
"success_criteria": "User provided order ID and agent acknowledged",
"max_turns": 3,
},
{
"goal": "Get delivery date",
"success_criteria": "Agent provided estimated delivery date",
"max_turns": 2,
},
],
"expectations": [
"Agent maintained professional tone",
"Agent never hallucinated data",
],
}
eval_conv = sim_evals.simulate_conversation(test_case=test_case)
report = eval_conv.generate_report()
# Goals report (one row per step)
print(report.goals_df)
# Expectations report (one row per expectation)
if report.expectations_df is not None:
print(report.expectations_df)
Running in parallel¶
Simulations can be slow because they involve multiple real API calls. Run them in parallel to speed things up:
import concurrent.futures
test_cases = [...] # list of test_case dicts
def run_single(tc):
eval_conv = sim_evals.simulate_conversation(test_case=tc, console_logging=False)
return eval_conv.generate_report()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
futures = [executor.submit(run_single, tc) for tc in test_cases]
reports = [f.result() for f in concurrent.futures.as_completed(futures)]
Parallel execution and rate limits
The Sessions API and Gemini both have rate limits. Start with max_workers=3 and increase if you're not hitting errors. The skills system's Run skill handles this automatically.
Audio modality¶
If your agent handles voice conversations, you can run simulations in audio mode:
sim_evals = SimulationEvals(app_name="projects/my-project/locations/us/apps/my-app")
eval_conv = sim_evals.simulate_conversation(
test_case=test_case,
modality="audio", # default is "text"
)
In audio mode, SCRAPI uses the Sessions API's audio streaming endpoint. The simulated user's messages are still text internally, but they're processed by the agent's audio pipeline, which exercises TTS/STT and any audio-specific callbacks.
Interpreting results¶
The SimulationReport object has two DataFrames:
goals_df¶
| Column | Description |
|---|---|
eval_name | Name of the simulation |
step_index | Which step (0-indexed) |
goal | The goal text |
status | Completed or Not Completed |
justification | Gemini's explanation |
turns_used | How many turns it took |
expectations_df¶
| Column | Description |
|---|---|
eval_name | Name of the simulation |
expectation | The expectation text |
status | Met or Not Met |
justification | Gemini's explanation |
Reading the output¶
# Overall pass rate
total = len(report.goals_df)
passed = (report.goals_df["status"] == "Completed").sum()
print(f"Steps completed: {passed}/{total} ({passed/total*100:.0f}%)")
# Failed steps
failed = report.goals_df[report.goals_df["status"] != "Completed"]
for _, row in failed.iterrows():
print(f"FAILED: {row['goal']}")
print(f" Reason: {row['justification']}")
Tips for writing good simulations¶
- Keep steps focused
- Each step should test one thing. Broad goals like "complete the full conversation" are hard to debug when they fail.
- Write meaningful success criteria
- "The agent helped the user" is too vague. "The agent provided the order status and delivery date" is testable.
- Use
response_guideto set tone - If your agent needs to handle impatient users or edge cases, use
response_guideto set that context for the simulator. - Use
static_utterancefor exact inputs - When you want to test how the agent handles a specific phrasing (e.g., "what's my ETA?"), use
static_utteranceto send that exact text. - Use session parameters for mocking
- Just like goldens, use
session_parametersto inject mock tool responses so your simulations are deterministic and fast.