Multi-turn Chat¶
Eval Recipe for model migration¶
This Eval Recipe demonstrates how to compare the performance of a multi-turn conversation (Chat) application on Gemini 1.0 and Gemini 2.0 using the Vertex AI Evaluation Service.

- Use case: multi-turn conversation (Chat)
- Evaluation dataset is based on the Multi-turn Prompts Dataset. It includes 5 conversations: `dataset.jsonl`. Each record in this file links to a JSON file with the conversation history between the User and the Model. This dataset does not include any ground truth labels.
- Python script `eval.py` executes the evaluation (a sketch of this flow follows this list):
    - `load_dataset`: loads all conversation histories into a Pandas DataFrame.
    - `generate_chat_responses`: runs inference for each conversation in the dataset and saves the model responses.
    - `run_eval`: passes the dataset, including responses from the baseline and candidate models, to a pairwise autorater in order to calculate the win rate.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
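The sketch below illustrates how those three steps could fit together. It is a hedged illustration, not the recipe's actual code: the conversation-file field names (`conversation_file`, `role`, `text`), the model names, the project ID, the experiment name, and the use of the example pairwise multi-turn chat quality metric from `MetricPromptTemplateExamples` are all assumptions.

```python
# Minimal sketch of the eval.py flow, assuming the vertexai.evaluation API from
# the google-cloud-aiplatform SDK. All placeholder values are illustrative.
import json

import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import Content, GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")


def load_dataset(dataset_path: str) -> pd.DataFrame:
    """Load every conversation history referenced by the JSONL manifest."""
    records = []
    with open(dataset_path) as manifest:
        for line in manifest:
            record = json.loads(line)
            # Hypothetical field: each record points at a conversation JSON file.
            with open(record["conversation_file"]) as f:
                record["history"] = json.load(f)
            records.append(record)
    return pd.DataFrame(records)


def generate_chat_response(model_name: str, history: list) -> str:
    """Replay all but the last user turn, then get the model's next reply."""
    model = GenerativeModel(model_name)
    past_turns = [
        Content(role=turn["role"], parts=[Part.from_text(turn["text"])])
        for turn in history[:-1]
    ]
    chat = model.start_chat(history=past_turns)
    return chat.send_message(history[-1]["text"]).text


def run_eval(df: pd.DataFrame, experiment_name: str):
    """Send both models' responses to a pairwise autorater to get a win rate."""
    eval_task = EvalTask(
        # The metric's prompt template defines the exact columns it expects;
        # pairwise metrics compare "response" against "baseline_model_response".
        dataset=df,
        metrics=[MetricPromptTemplateExamples.Pairwise.MULTI_TURN_CHAT_QUALITY],
        experiment=experiment_name,
    )
    return eval_task.evaluate()


if __name__ == "__main__":
    df = load_dataset("dataset.jsonl")
    df["baseline_model_response"] = df["history"].apply(
        lambda h: generate_chat_response("gemini-1.0-pro", h))
    df["response"] = df["history"].apply(
        lambda h: generate_chat_response("gemini-2.0-flash", h))
    result = run_eval(df, experiment_name="chat-eval-demo")
    print(result.summary_metrics)
```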
How to run this Eval Recipe¶
- Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo.
- Alternatively, you can use the following commands to clone this repo to any Linux environment with a configured Google Cloud environment:

    ```bash
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    ```
- Navigate to the Eval Recipe directory in a terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.
- The resulting metrics will be displayed in the script output.
- You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metrics; a sketch for retrieving them programmatically follows this list.
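If you prefer to pull those experiment results programmatically rather than through the console, here is a minimal sketch using the google-cloud-aiplatform SDK. The project ID and experiment name are hypothetical placeholders.

```python
# Fetch the metrics logged by past evaluation runs from Vertex AI Experiments.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# One row per experiment run; logged parameters and metrics appear as
# "param.*" and "metric.*" columns.
runs_df = aiplatform.get_experiment_df("multi-turn-chat-eval")
print(runs_df.filter(regex=r"^(run_name|metric\.)"))
```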
How to customize this Eval Recipe¶
- Edit the Python script `eval.py` (see the configuration sketch at the end of this section):
    - set the `project` parameter of `vertexai.init` to your Google Cloud Project ID.
    - set the `baseline_model` parameter to the model that is currently used by your application.
    - set the `candidate_model` parameter to the model that you want to compare with your current model.
    - configure a unique `experiment_name` for tracking purposes.
- Replace the dataset with your custom data in the same format.
- Please refer to our documentation if you want to customize your evaluation further. Vertex AI Evaluation Service has many features that are not included in this recipe.
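For reference, the parameters listed above map to a small configuration block along the lines of the sketch below. Every value is a placeholder to replace with your own; the variable names follow the parameter names used in this recipe.

```python
# Hypothetical configuration block for eval.py; every value is a placeholder.
import vertexai

vertexai.init(
    project="your-project-id",  # your Google Cloud Project ID
    location="us-central1",
)

baseline_model = "gemini-1.0-pro"     # model currently used by your application
candidate_model = "gemini-2.0-flash"  # model you want to compare against it
experiment_name = "chat-eval-v1"      # unique name for Vertex AI Experiments tracking
```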
