Run Skill¶
The Run skill executes all four evaluation types against your agent and produces a combined report showing the overall health of your agent. It's the "are we good?" step in the development loop.
Invoking the Run skill¶
The foundry routes you to Run when you express an intent like:
- "Run the evals"
- "What's the pass rate?"
- "Test the agent"
- "How is the agent performing?"
The Run skill is a sub-skill of the Agent Foundry, so there is no separate command to invoke; the foundry routes any run or test intent to it automatically.
What the Run skill does¶
The Run skill orchestrates all four eval types, ordered from cheapest to most expensive so that failures surface as early as possible:
1. Callback tests (fastest)¶
Runs all pytest-based callback tests. These are the fastest tests — pure Python, no API calls. The skill reports how many tests passed and flags any failures before proceeding.
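Under the hood, this phase is equivalent to a plain pytest invocation. A minimal sketch, assuming callback tests live in an evals/callback_tests/ directory (only evals/tool_tests/ and evals/goldens/ appear elsewhere in these docs, so the path is a guess):

```python
import subprocess

# Run the callback suite quietly; pytest exits 0 only if every test passes.
result = subprocess.run(
    ["pytest", "evals/callback_tests", "-q", "--tb=short"],
    capture_output=True,
    text=True,
)
print(result.stdout)  # e.g. "12 passed in 0.31s"
print("Callback tests:", "PASS" if result.returncode == 0 else "FAIL")
```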
2. Tool tests (fast, isolated)¶
cxas test-tools \
--app-name "projects/.../apps/<app>" \
--test-file evals/tool_tests/order_tests.yaml
Runs tool tests against the live platform. These are fast but require network access. The skill includes latency statistics in the report.
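If you want latency percentiles beyond what the report prints, you can compute them from the per-test CSV the skill writes (see Report artifacts below). A sketch, assuming a latency_ms column; the actual header isn't documented here, so check your file first:

```python
import csv
import statistics

# tool-tests.csv has one row per tool test; "latency_ms" is an assumed column name.
with open("test-results/tool-tests.csv", newline="") as f:
    latencies = sorted(float(row["latency_ms"]) for row in csv.DictReader(f))

print(f"p50: {statistics.median(latencies):.0f} ms")
print(f"p95: {latencies[int(0.95 * (len(latencies) - 1))]:.0f} ms")
print(f"max: {latencies[-1]:.0f} ms")
```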
3. Platform goldens (push + run)¶
cxas push-eval \
--app-name "projects/.../apps/<app>" \
--file evals/goldens/order_lookup.yaml
cxas run --app-name "projects/.../apps/<app>" --wait --filter-auto-metrics
Pushes golden files and waits for results. This is the most comprehensive test but also the slowest.
4. Local simulations (AI-driven)¶
from cxas_scrapi.evals.simulation_evals import SimulationEvals
# ... runs all simulation YAML files in parallel ...
Runs simulations using the deployment configured in gecx-config.json. The skill uses parallel execution (up to 5 concurrent simulations) to keep runtime reasonable.
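The concurrency cap behaves like any bounded worker pool. A minimal sketch of the pattern only: run_simulation is a stand-in, not the real SimulationEvals API, and the evals/simulations/ directory is an assumption:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def run_simulation(path: Path) -> dict:
    # Stand-in for the real AI-driven conversation loop.
    time.sleep(0.1)
    return {"file": path.name, "completed": True}

yaml_files = sorted(Path("evals/simulations").glob("*.yaml"))  # assumed layout
with ThreadPoolExecutor(max_workers=5) as pool:  # matches the skill's cap of 5
    futures = [pool.submit(run_simulation, f) for f in yaml_files]
    for future in as_completed(futures):
        print(future.result())
```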
Combined report¶
After running all four types, the skill generates a combined report:
============================
CXAS Agent Evaluation Report
============================
App: My Support Agent
Run at: 2026-04-14 14:32:07
Callback Tests
--------------
12 tests | 12 passed | 0 failed
Status: PASS
Tool Tests
----------
8 tests | 7 passed | 1 failed
Status: FAIL
Failed:
- lookup_order/handles_api_timeout: agent_action field missing
Expected: is_not_null | Actual: null
Platform Goldens
----------------
8 conversations | 7 passed | 1 failed
Status: FAIL
Failed:
- order_management/bad_order_id_handling: turn 2
Expected: "couldn't find" | Actual: "I'll look that up for you"
Local Simulations
-----------------
3 evals | 8 steps | 6 completed | 2 not completed
Status: FAIL
Not completed:
- billing_inquiry/get_account_number: goal not achieved in 3 turns
============================
Overall Pass Rate: 28/31 (90%)
============================
The report is written to test-results/latest-report.md and printed to the terminal.
Report artifacts¶
The Run skill writes results to test-results/:
test-results/
├── latest-report.md # Human-readable combined report
├── latest-report.json # Machine-readable combined report
├── callback-tests.log # Raw pytest output
├── tool-tests.csv # Tool test results (one row per test)
├── goldens-results.json # Platform golden results
└── simulations-results.csv # Simulation results (one row per step)
These files are useful for tracking trends over time: if you commit test-results/ to Git, you can see how the pass rate evolves from commit to commit.
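The JSON report also makes a convenient CI gate. A sketch, assuming an overall_pass_rate key; the schema isn't documented here, so inspect your own latest-report.json before relying on it:

```python
import json
import sys

with open("test-results/latest-report.json") as f:
    report = json.load(f)

# "overall_pass_rate" is an assumed key; adjust to match the real schema.
rate = report.get("overall_pass_rate", 0.0)
print(f"Overall pass rate: {rate:.0%}")
sys.exit(0 if rate >= 0.90 else 1)  # fail the build below 90%
```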
Target pass rate¶
The Debug skill uses the pass rate from the Run skill to determine when to stop iterating. By default, it targets 100% for tool tests and callback tests, and 80% for goldens and simulations (to account for natural language variability).
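In other words, the stop condition looks like the check below. This mirrors the defaults described above, not the Debug skill's actual code:

```python
# Default targets: deterministic suites must be perfect; language-based suites get slack.
DEFAULT_TARGETS = {"callback": 1.0, "tool": 1.0, "goldens": 0.8, "simulations": 0.8}

def should_stop(pass_rates: dict[str, float]) -> bool:
    """True once every eval type meets or beats its target pass rate."""
    return all(pass_rates.get(name, 0.0) >= target
               for name, target in DEFAULT_TARGETS.items())

# Example: perfect tool tests but 75% goldens keeps the loop going.
print(should_stop({"callback": 1.0, "tool": 1.0, "goldens": 0.75, "simulations": 1.0}))  # False
```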
You can adjust the target by telling the foundry what you want, for example:
- "Keep iterating until goldens pass at 95%"
Running specific eval types¶
If you only want to run one type, say so:
- "Run only the tool tests"
- "Just run the callback tests"
The Run skill will run only the specified types and report accordingly.
Handling slow simulations¶
Local simulations are the slowest part of the Run skill. If you're in a tight iteration loop and don't need simulation results, tell the skill:
- "Run the evals, but skip the simulations"
The skill will skip the simulation phase and note that the report is incomplete.