Multi-Turn Conversation (Chat)
Eval Recipe for model migration
This Eval Recipe demonstrates how to evaluate a multi-turn conversation (chat) on Gemini 1.0 and Gemini 2.0 using the open source evaluation tool Promptfoo.
- Use case: multi-turn conversation.
- Evaluation Dataset is based on the Multi-turn Prompts Dataset and includes 5 conversations: `dataset.jsonl`. Each record in this file links to a JSON file with the system instruction followed by a few messages from the User and responses from the Assistant. This dataset does not include any ground truth labels.
- Prompt Template located in `prompt_template.txt` injects the `chat` variable from our dataset, which represents the conversation history.
- `promptfooconfig.yaml` contains all Promptfoo configuration (an illustrative sketch of these files follows this list):
    - `providers`: list of models that will be evaluated
    - `prompts`: location of the prompt template file
    - `tests`: location of the dataset file
    - `defaultTest`: configures the evaluation metric:
        - `type: select-best`: auto-rater that decides which of the two models configured above generated the best response
        - `providers`: configures the judge model
        - `value`: configures the custom criteria that is evaluated by the `select-best` auto-rater
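The exact contents of these files live in the recipe directory; the sketches below are illustrative only. A conversation file referenced from `dataset.jsonl` might look roughly like this (roles, field names, and wording are assumptions, not copied from the repo):

```json
[
  {"role": "system", "content": "You are a helpful travel planning assistant."},
  {"role": "user", "content": "I want to plan a weekend trip to Lisbon."},
  {"role": "assistant", "content": "Great choice! How many days do you have?"},
  {"role": "user", "content": "Two days. What should I prioritize?"}
]
```

A minimal `promptfooconfig.yaml` along the lines described above could look like the following sketch. The model IDs, criteria text, and judge configuration are assumptions; the file in the repo is the authoritative version.

```yaml
# Illustrative sketch only -- not the repo's actual promptfooconfig.yaml.
prompts:
  - file://prompt_template.txt          # template that injects the {{chat}} variable

providers:                              # the two models being compared (IDs are assumed)
  - vertex:gemini-1.0-pro
  - vertex:gemini-2.0-flash-001

tests: file://dataset.jsonl             # each record supplies the `chat` conversation history

defaultTest:
  assert:
    - type: select-best                 # auto-rater: picks the better of the two responses
      value: "Choose the response that is most helpful, stays consistent with the conversation history, and follows the system instruction."
      provider: vertex:gemini-2.0-flash-001   # judge model (assumed)
```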
How to run this Eval Recipe
- Google Cloud Shell is the easiest option as it automatically clones our GitHub repo.
- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured Google Cloud environment:

    ```sh
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    ```
- Install Promptfoo using these instructions.
- Navigate to the Eval Recipe directory in the terminal and run the command `promptfoo eval`.
- Run `promptfoo view` to analyze the eval results. You can switch the Display option to `Show failures only` in order to investigate any underperforming prompts. An example terminal session for these last two steps follows.
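For the last two steps, a typical terminal session might look like this; the recipe subdirectory name is an assumption, so use the actual path from the cloned repo:

```sh
# Paths are illustrative -- cd into the actual Eval Recipe directory in the cloned repo.
cd genai-on-vertex-ai/gemini/model_upgrades/multiturn_chat/promptfoo
promptfoo eval    # runs every test case in dataset.jsonl against each configured provider
promptfoo view    # starts a local web UI for browsing the eval results
```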
How to customize this Eval Recipe
- Copy the configuration file `promptfooconfig.yaml` to a new folder.
- Add your labeled dataset file with a JSONL schema similar to `dataset.jsonl`.
- Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
- That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained here (see the sketch below).
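As an illustration, alternative prompt templates and additional metrics could be layered into `promptfooconfig.yaml` roughly as follows; the extra file name and rubric wording are hypothetical, and the Promptfoo documentation is the reference for the full configuration schema:

```yaml
# Illustrative customizations only -- adjust file names and criteria to your own use case.
prompts:
  - file://prompt_template.txt
  - file://prompt_template_v2.txt       # hypothetical alternative template to compare

defaultTest:
  assert:
    - type: select-best
      value: "Choose the response that best satisfies the user's latest request."
    - type: llm-rubric                  # an additional model-graded metric
      value: "The reply is polite, on topic, and consistent with the system instruction."
```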