# Multi-turn Chat

## Eval Recipe for model migration

This Eval Recipe demonstrates how to compare the performance of Gemini 1.0 and Gemini 2.0 on a multi-turn conversation (Chat) use case using the Vertex AI Evaluation Service.

- Use case: multi-turn conversation (Chat).
- Evaluation Dataset: based on the Multi-turn Prompts Dataset, it includes 5 conversations listed in `dataset.jsonl`. Each record in this file links to a JSON file with the conversation history between the User and the Model. This dataset does not include any ground truth labels.
- Python script `eval.py` executes the evaluation (illustrative sketches of each step follow this list):
    - `load_dataset`: loads all conversation histories into a Pandas DataFrame.
    - `generate_chat_responses`: runs inference for each conversation in the dataset and saves the model responses.
    - `run_eval`: passes the dataset, including the responses from the Baseline and Candidate models, to a pairwise autorater in order to calculate the win rate.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
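To make the dataset layout concrete, here is a minimal sketch of a `load_dataset`-style loader. The manifest and field names are hypothetical, since the exact schema lives in the recipe's files rather than on this page:

```python
# Hypothetical loader mirroring load_dataset: dataset.jsonl points at
# per-conversation JSON files; the field names below are illustrative only.
import json
import pandas as pd

def load_dataset(manifest_path: str = "dataset.jsonl") -> pd.DataFrame:
    records = []
    with open(manifest_path) as manifest:
        for line in manifest:
            entry = json.loads(line)  # e.g. {"conversation": "chats/trip.json"}
            with open(entry["conversation"]) as f:
                history = json.load(f)  # list of {"role": ..., "text": ...} turns
            records.append({"history": history})
    return pd.DataFrame(records)
```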
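`generate_chat_responses` can be pictured as replaying each conversation against a model and capturing its reply to the final user message. A minimal sketch, assuming the `vertexai.generative_models` chat API and the illustrative `history` column from the loader above:

```python
from vertexai.generative_models import Content, GenerativeModel, Part

def generate_chat_responses(df, model_name: str) -> list[str]:
    """Replay each conversation and collect the model's final reply."""
    model = GenerativeModel(model_name)
    responses = []
    for history in df["history"]:
        # Seed the chat with every turn except the final user message,
        # then ask the model to answer that last message.
        chat = model.start_chat(history=[
            Content(role=turn["role"], parts=[Part.from_text(turn["text"])])
            for turn in history[:-1]
        ])
        responses.append(chat.send_message(history[-1]["text"]).text)
    return responses
```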
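Finally, the win-rate computation in `run_eval` maps onto the pairwise metrics of the Gen AI Evaluation SDK. A minimal sketch, assuming the `vertexai.evaluation` API and its prebuilt pairwise multi-turn chat quality template; the experiment name and column values are placeholders, and the exact columns expected depend on the metric prompt template you choose:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="your-project-id", location="us-central1")

# One row per conversation: the prompt plus the pre-generated responses
# from the baseline and candidate models (all values are placeholders).
eval_dataset = pd.DataFrame({
    "prompt": ["<final user message>"],
    "history": ["<prior conversation turns>"],
    "baseline_model_response": ["<reply from the baseline model>"],
    "response": ["<reply from the candidate model>"],
})

task = EvalTask(
    dataset=eval_dataset,
    # Prebuilt pairwise autorater template for multi-turn chat quality.
    metrics=[MetricPromptTemplateExamples.Pairwise.MULTI_TURN_CHAT_QUALITY],
    experiment="chat-migration-eval",  # hypothetical experiment name
)
result = task.evaluate()
print(result.summary_metrics)  # includes the candidate win rate
```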
## How to run this Eval Recipe
1. Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo.
2. Alternatively, you can use the following command to clone this repo to any Linux environment with a configured Google Cloud environment:

    ```bash
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    ```

3. Navigate to the Eval Recipe directory in your terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.
4. The resulting metrics will be displayed in the script output.
5. You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metrics.
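If you prefer to inspect past runs programmatically rather than in the console, the Vertex AI SDK can export an experiment's runs and metrics to a DataFrame. A minimal sketch, assuming the `aiplatform.get_experiment_df` helper and a hypothetical experiment name:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# One row per experiment run, with parameters and logged metrics
# (such as the pairwise win rate) as columns.
df = aiplatform.get_experiment_df("chat-migration-eval")  # hypothetical name
print(df)
```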
## How to customize this Eval Recipe
1. Edit the Python script `eval.py` (see the configuration sketch after this list):
    - set the `project` parameter of `vertexai.init` to your Google Cloud Project ID.
    - set the `baseline_model` parameter to the model that is currently used by your application.
    - set the `candidate_model` parameter to the model that you want to compare against your current model.
    - configure a unique `experiment_name` for tracking purposes.
2. Replace the dataset with your custom data in the same format.
3. Please refer to our documentation if you want to customize your evaluation further. Vertex AI Evaluation Service has many features that are not included in this recipe.
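As a reference for step 1, here is a minimal sketch of what the configuration block in `eval.py` could look like; the model IDs and experiment name below are placeholders, not the recipe's actual values:

```python
import vertexai

# All values below are illustrative; substitute your own.
vertexai.init(project="your-project-id", location="us-central1")

baseline_model = "gemini-1.0-pro"        # model your application uses today
candidate_model = "gemini-2.0-flash"     # model you want to migrate to
experiment_name = "chat-migration-eval"  # unique name for Vertex AI Experiments
```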