Tool Tests

Tool tests let you verify that your tools work correctly in isolation — without running a full conversation through the agent. You call a tool with specific inputs and assert that the output matches your expectations using a set of comparison operators.

This is the fastest and most focused way to test tool behavior. Since tools are the interface between your agent and external systems, getting them right is critical.


Why test tools separately

When an agent produces a wrong answer, the cause might be:

  1. The tool returned the wrong data
  2. The agent misinterpreted the tool's data
  3. The instruction told the agent to do the wrong thing

Testing tools in isolation rules out cause #1. If your tool tests pass but your golden evals fail, you know the problem lies in the instruction or the agent's reasoning, not in the tool's implementation.


YAML format

Tool test files use the tests: key:

tests:
  - name: "lookup_order_found"
    tool: "lookup_order"
    args:
      order_id: "ORD-12345"
    variables:
      order_12345_status: "shipped"
      order_12345_eta: "2026-04-18"
    expectations:
      response:
        - path: "$.status"
          operator: equals
          value: "shipped"
        - path: "$.estimated_delivery"
          operator: contains
          value: "2026-04-18"
        - path: "$.order_id"
          operator: is_not_null

  - name: "lookup_order_not_found"
    tool: "lookup_order"
    args:
      order_id: "ORD-99999"
    variables:
      order_99999_status: "not_found"
    expectations:
      response:
        - path: "$.agent_action"
          operator: is_not_null
          # Ensures the tool returns an agent_action key on error

Test fields

Field         Type    Description
name          string  Unique name for this test case
tool          string  The tool display name to invoke
args          dict    Arguments to pass to the tool
variables     dict    Session variables to inject (mock data)
expectations  dict    Expectations object containing a response list

Assertion fields

Field     Type    Description
path      string  JSONPath expression pointing to the field to check
operator  string  Comparison operator (see below)
value     any     Expected value (not required for is_null / is_not_null)

Operators

Operator             Description                             value required?
equals               Exact equality                          Yes
contains             String/list containment                 Yes
greater_than         Numeric greater-than                    Yes
less_than            Numeric less-than                       Yes
length_equals        Collection/string length equals         Yes
length_greater_than  Collection/string length greater-than   Yes
length_less_than     Collection/string length less-than      Yes
is_null              Field is null or missing                No
is_not_null          Field is present and not null           No
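
All operators share the same assertion shape. A sketch of the numeric and length operators in use (the $.items and $.tracking_number fields are illustrative, not part of the lookup_order schema):

    expectations:
      response:
        - path: "$.items"
          operator: length_greater_than
          value: 0
        - path: "$.items[0].quantity"
          operator: greater_than
          value: 1
        - path: "$.tracking_number"
          operator: is_null  # no value field needed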

JSONPath expressions

The path field uses JSONPath syntax. Some examples:

$.status                    # Top-level "status" key
$.order.estimated_delivery  # Nested field
$.items[0].name             # First item in a list
$.items.length()            # Length of a list
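
Assuming the length() extension returns the list's length as a number, a length check can be written either way; for example, to assert a list has exactly three entries (illustrative field name):

    - path: "$.items"
      operator: length_equals
      value: 3
    # Should be equivalent, via the JSONPath length() extension:
    - path: "$.items.length()"
      operator: equals
      value: 3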

Running with the CLI

cxas test-tools \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/tool_tests/order_tests.yaml

The CLI prints a summary table and exits with code 0 if all tests pass, 1 if any fail.
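
That exit code makes the command easy to use as a CI gate; a minimal shell sketch (the deploy step is a placeholder, not part of the CLI):

cxas test-tools \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/tool_tests/order_tests.yaml \
  && ./deploy.sh  # hypothetical next step; runs only if all tests pass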

Debug mode

When a test fails and you're not sure why, use --debug to see the full tool request and response:

cxas test-tools \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/tool_tests/order_tests.yaml \
  --debug

Debug mode prints the raw API request, the raw response, and the assertion evaluation for each test.


The ToolEvals class

For programmatic use:

from cxas_scrapi.evals.tool_evals import ToolEvals

tool_evals = ToolEvals(
    app_name="projects/my-project/locations/us/apps/my-app",
)

# Load test cases from a YAML file
test_cases = tool_evals.load_tool_test_cases_from_file("evals/tool_tests/order_tests.yaml")

# Run the tests — returns a pandas DataFrame
results_df = tool_evals.run_tool_tests(test_cases)

print(results_df)

The returned DataFrame has columns:

Column Description
test_name Name of the test case
tool Tool display name
status PASSED, FAILED, or ERROR
latency (ms) Time to invoke the tool
errors Error details (on failure)

Filtering by pass/fail

test_cases = tool_evals.load_tool_test_cases_from_file("evals/tool_tests/order_tests.yaml")
results_df = tool_evals.run_tool_tests(test_cases)

failed = results_df[results_df["status"] == "FAILED"]
if not failed.empty:
    print(f"{len(failed)} test(s) failed:")
    print(failed[["test_name", "tool", "errors"]].to_string())
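
The same check drops naturally into a pytest suite; a minimal sketch using only the calls shown above:

from cxas_scrapi.evals.tool_evals import ToolEvals

def test_order_tools():
    # Load and run the YAML-defined tool tests, then fail the suite on any failure
    tool_evals = ToolEvals(app_name="projects/my-project/locations/us/apps/my-app")
    test_cases = tool_evals.load_tool_test_cases_from_file("evals/tool_tests/order_tests.yaml")
    results_df = tool_evals.run_tool_tests(test_cases)
    failed = results_df[results_df["status"] != "PASSED"]
    assert failed.empty, failed[["test_name", "tool", "errors"]].to_string()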

Summary report

The ToolEvals class can generate a summary report with pass rates and latency statistics:

summary_df = ToolEvals.generate_report(results_df)
print(summary_df)

High p99 latency on a tool that the agent calls frequently is worth investigating — it adds to the overall conversation latency.
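
If you need percentiles beyond what the report includes, they are easy to compute from the raw results with standard pandas (column names as documented above):

# Per-tool latency percentiles from the results DataFrame
percentiles = (
    results_df.groupby("tool")["latency (ms)"]
    .quantile([0.50, 0.90, 0.99])
    .unstack()
)
print(percentiles)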


Testing error handling

Every tool should handle errors gracefully by returning an agent_action key. Test this explicitly:

tests:
  - name: "handles_missing_order_id"
    tool: "lookup_order"
    args:
      order_id: ""  # Empty string — edge case
    expectations:
      response:
        - path: "$.agent_action"
          operator: is_not_null
          # The tool should return a friendly error message the agent can relay

  - name: "handles_api_timeout"
    tool: "lookup_order"
    args:
      order_id: "SIMULATE_TIMEOUT"
    variables:
      lookup_order_simulate_timeout: "true"
    expectations:
      response:
        - path: "$.agent_action"
          operator: contains
          value: "try again"

The linter rule T001 (tool-error-pattern) will warn if your tool doesn't include an agent_action error return.
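
On the tool side, the pattern looks roughly like this; a minimal Python sketch (fetch_order and the exact tool signature are assumptions for illustration, not part of this library):

def fetch_order(order_id: str):
    """Stub for a hypothetical backend call; a real version might raise
    TimeoutError or return None for an unknown order."""
    raise NotImplementedError

def lookup_order(order_id: str) -> dict:
    # Validate input before touching the backend
    if not order_id:
        return {"agent_action": "Ask the customer for a valid order ID."}
    try:
        order = fetch_order(order_id)
    except TimeoutError:
        return {"agent_action": "The order system timed out; apologize and ask the customer to try again."}
    if order is None:
        return {"agent_action": "No order matches that ID; ask the customer to double-check it."}
    # Happy path: return the fields the assertions above check
    return {
        "status": order["status"],
        "order_id": order_id,
        "estimated_delivery": order.get("eta"),
    }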


Full example

Here's a complete tool test file for an order management tool:

tests:
  - name: "order_found_shipped"
    tool: "lookup_order"
    args:
      order_id: "ORD-12345"
    variables:
      order_12345_status: "shipped"
      order_12345_eta: "2026-04-18"
    expectations:
      response:
        - path: "$.status"
          operator: equals
          value: "shipped"
        - path: "$.order_id"
          operator: equals
          value: "ORD-12345"
        - path: "$.estimated_delivery"
          operator: is_not_null

  - name: "order_found_processing"
    tool: "lookup_order"
    args:
      order_id: "ORD-67890"
    variables:
      order_67890_status: "processing"
    expectations:
      response:
        - path: "$.status"
          operator: equals
          value: "processing"

  - name: "order_not_found_returns_error"
    tool: "lookup_order"
    args:
      order_id: "ORD-INVALID"
    variables:
      order_invalid_status: "not_found"
    expectations:
      response:
        - path: "$.agent_action"
          operator: is_not_null

  - name: "empty_order_id_returns_error"
    tool: "lookup_order"
    args:
      order_id: ""
    expectations:
      response:
        - path: "$.agent_action"
          operator: is_not_null

Run it with the CLI:

cxas test-tools \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/tool_tests/order_management.yaml

Expected output:

Tool Test Results
=================
Test                              | Pass | Latency
----------------------------------|------|--------
order_found_shipped               | PASS |  234ms
order_found_processing            | PASS |  198ms
order_not_found_returns_error     | PASS |  201ms
empty_order_id_returns_error      | PASS |   45ms

4/4 tests passed. p50: 199ms, p90: 234ms