Skip to content

Platform Goldens

Platform Goldens are the most thorough evaluation type SCRAPI provides. They run on the CX Agent Studio platform itself, exercising the full agent loop — model inference, tool calls, and callbacks — in a controlled, deterministic way. You script a conversation, specify expected responses and tool call expectations, and the platform tells you whether the agent behaved as intended.


What goldens test

A golden is a scripted conversation. Each turn in the conversation has:

  • A user input (what the human says)
  • An expected agent response (what you want the agent to say)
  • Optional: expected tool calls with expected arguments and responses

The platform runs the conversation through the live agent and compares the actual output to your expectations using configurable match types (semantic similarity, substring containment, exact match, or regex).

Goldens are best for testing known, correct agent behavior — happy paths, error handling paths, and any behavior that should be deterministic.


YAML format

Golden files use the conversations: key at the top level:

# common session parameters inject mock data for all conversations in this file
common_session_parameters:
  order_12345_status: "shipped"
  order_12345_eta: "2026-04-18"

conversations:
  - conversation: "happy_path_order_lookup"
    turns:
      - user: "Hi, I'd like to check on my order"
        agent: "Of course! Could you share your order ID?"

      - user: "It's ORD-12345"
        tool_calls:
          - action: lookup_order
            args:
              order_id:
                value: "ORD-12345"
                $matchType: contains
        agent: "Your order ORD-12345 has shipped and should arrive by April 18th."

  - conversation: "missing_order_id"
    turns:
      - user: "Where's my stuff?"
        agent: "I'd be happy to help track your order. Could you provide your order ID?"

Required fields

Field Type Description
conversations list Array of conversation objects
conversation string Unique name for this conversation
turns list The ordered turns in the conversation

Turn fields

Field Type Description
user string The user's message for this turn
agent string or list[string] Expected agent response (required — omitting causes "UNEXPECTED RESPONSE" failures). Use a list when the agent may respond with multiple text chunks.
tool_calls list Expected tool invocations during this turn
event string Instead of user, inject a platform event (e.g., "welcome")

Tool call fields

Field Type Description
action string The tool display name
args dict Expected arguments
output dict Expected tool response (used to mock the tool's return value)

Match types ($matchType)

The $matchType field controls how argument and response values are compared:

Value Description
semantic Gemini-powered semantic similarity check (default for agent fields)
contains The actual value must contain the expected value as a substring
exact Exact string match
regexp Regular expression match
ignore Skip this field during comparison

Session parameters

common_session_parameters injects data into every conversation in the file. Use this to mock tool responses without making real API calls:

common_session_parameters:
  # When the agent calls lookup_order, the platform returns these values
  order_12345_status: "shipped"
  order_12345_eta: "2026-04-18"

You can also set session_parameters per conversation to override the common values for specific test cases.

Tags

Tags let you filter which goldens to run in CI or report on:

conversations:
  - conversation: "happy_path_order_lookup"
    tags: ["P0", "order_management"]
    turns:
      - ...

Pushing goldens to the platform

Before you can run goldens, you need to push them to the platform:

cxas push-eval \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/goldens/order_lookup.yaml

Push each golden file individually using --file.


Running goldens

Once pushed, run the evaluations and wait for results:

cxas run \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --wait

--wait polls until all evaluations complete and then prints a summary. Without --wait, the command starts the run and returns immediately — useful when you want to check results later.

Filtering runs

You can filter which conversations to run using tags:

cxas run \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --filter-auto-metrics \
  --wait

Exit codes

Exit code Meaning
0 All evaluations passed
1 One or more evaluations failed
2 SCRAPI error (invalid arguments, auth failure, etc.)

These exit codes make cxas run --wait suitable for CI — your pipeline fails if any golden fails.


Interpreting results

The output of cxas run --wait includes a summary table:

Evaluation Results
==================
Conversation             | Turns | Pass | Fail | Score
-------------------------|-------|------|------|------
happy_path_order_lookup  |   2   |   2  |   0  | 100%
missing_order_id         |   1   |   1  |   0  | 100%
bad_order_id_handling    |   2   |   1  |   1  |  50%

Total: 3 conversations, 5 turns, 4 pass, 1 fail

For each failing turn, the platform provides a detailed comparison showing what was expected versus what was actually produced.

Common failure patterns

"UNEXPECTED RESPONSE"
The turn has a user field but no agent field. The platform always expects an agent response — if you don't specify one, any response is flagged as unexpected. Fix: always add an agent field. Linter rule E008 catches this.
Semantic match failures
The agent's response was correct in meaning but phrased differently than expected. Consider making the expected agent text less specific, or using $matchType: contains for key facts.
Tool call argument mismatches
The agent called the tool with different arguments than expected. Check the instruction to ensure the agent is extracting the right parameters, and check if $matchType: ignore is appropriate for any arguments you don't care about.

Full working example

Here's a complete golden file for an order management agent:

common_session_parameters:
  order_12345_status: "shipped"
  order_12345_eta: "2026-04-18"
  order_99999_status: "not_found"

conversations:
  - conversation: "successful_order_lookup"
    tags: ["P0", "order_management"]
    turns:
      - event: welcome
        agent: "Welcome to Acme Support! How can I help you today?"

      - user: "I want to check my order status"
        agent: "Of course! Please share your order ID and I'll look that up for you."

      - user: "Order ID is ORD-12345"
        tool_calls:
          - action: lookup_order
            args:
              order_id:
                value: "ORD-12345"
                $matchType: contains
        agent: "shipped"

  - conversation: "order_not_found"
    tags: ["P0", "order_management", "error_handling"]
    turns:
      - user: "Check order ORD-99999"
        tool_calls:
          - action: lookup_order
            args:
              order_id:
                value: "ORD-99999"
                $matchType: exact
        agent: "I wasn't able to find that order"
# Push and run
cxas push-eval \
  --app-name "projects/my-project/locations/us/apps/my-app" \
  --file evals/goldens/order_management.yaml

cxas run --app-name "projects/my-project/locations/us/apps/my-app" --wait