
Multi-turn Chat

Eval Recipe for model migration

This Eval Recipe demonstrates how to use the Vertex AI Evaluation Service to compare the performance of Gemini 1.0 and Gemini 2.0 on a multi-turn conversation (Chat) task.

  • Use case: multi-turn conversation (Chat)

  • The evaluation dataset is based on the Multi-turn Prompts Dataset and includes 5 conversations: dataset.jsonl. Each record in this file links to a JSON file with the conversation history between the user and the model. This dataset does not include any ground truth labels.

  • Python script eval.py executes the evaluation in three steps (a minimal sketch of this flow follows this list):

    • load_dataset: loads all conversation histories into a pandas DataFrame.
    • generate_chat_responses: runs inference for each conversation in the dataset and saves the model responses.
    • run_eval: passes the dataset, including responses from the baseline and candidate models, to a pairwise autorater to calculate the win rate.
  • Shell script run.sh installs the required Python libraries and runs eval.py.
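
To make the flow concrete, here is a minimal sketch of what these three steps could look like with the Gen AI Evaluation SDK. This is not the actual eval.py: the dataset field names, model IDs, experiment name, and the choice of example metric prompt template are assumptions.

    import json

    import pandas as pd
    import vertexai
    from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PairwiseMetric
    from vertexai.generative_models import GenerativeModel

    vertexai.init(project="your-project-id", location="us-central1")

    def load_dataset(path="dataset.jsonl"):
        # Each record in dataset.jsonl points at a JSON file with one conversation
        # history; the "conversation_file" field name is hypothetical.
        records = [json.loads(line) for line in open(path)]
        histories = [json.dumps(json.load(open(r["conversation_file"]))) for r in records]
        return pd.DataFrame({"history": histories})

    def generate_chat_responses(df, model_id, column):
        # Simplified: sends each serialized history as a single prompt. A real
        # replay would step through the turns, e.g. via model.start_chat(history=...).
        model = GenerativeModel(model_id)
        df[column] = [model.generate_content(h).text for h in df["history"]]

    df = load_dataset()
    generate_chat_responses(df, "gemini-1.0-pro", "baseline_model_response")
    generate_chat_responses(df, "gemini-2.0-flash", "response")

    # Pairwise autorater: judges candidate vs. baseline responses and reports
    # the candidate win rate in the summary metrics.
    eval_task = EvalTask(
        dataset=df,
        metrics=[
            PairwiseMetric(
                metric="pairwise_multi_turn_chat_quality",
                metric_prompt_template=MetricPromptTemplateExamples.Pairwise.MULTI_TURN_CHAT_QUALITY,
            )
        ],
        experiment="chat-migration-eval",
    )
    print(eval_task.evaluate().summary_metrics)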

How to run this Eval Recipe

  • Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo:

    Open in Cloud Shell

  • Alternatively, you can use the following commands to clone this repo to any Linux environment with a configured Google Cloud environment:

    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    
  • Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script run.sh:

    cd multiturn_chat/vertex_script
    export PROJECT_ID="[your-project-id]"
    ./run.sh
    
  • The resulting metrics will be displayed in the script output.

  • You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metrics.
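
For example, assuming the google-cloud-aiplatform SDK and a placeholder experiment name, the run history and its metrics can be pulled into a DataFrame:

    from google.cloud import aiplatform

    aiplatform.init(project="your-project-id", location="us-central1")

    # Placeholder name; use the experiment_name configured in eval.py.
    runs = aiplatform.get_experiment_df("chat-migration-eval")
    print(runs.filter(regex=r"^(run_name|metric\.)"))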

How to customize this Eval Recipe

  1. Edit the Python script eval.py (a hedged sketch of these edits follows this list):
    • set the project parameter of vertexai.init to your Google Cloud Project ID.
    • set the baseline_model parameter to the model that is currently used by your application.
    • set the candidate_model parameter to the model that you want to compare against your current model.
    • configure a unique experiment_name for tracking purposes.
  2. Replace the dataset with your custom data in the same format.
  3. Refer to our documentation if you want to customize your evaluation further: the Vertex AI Evaluation Service offers many features that are not covered by this recipe.
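
For step 1, the edits could look like the sketch below; the model IDs are illustrative and the variables are stand-ins for the actual wiring in eval.py.

    import vertexai

    # Step 1: point the SDK at your project (and a supported region).
    vertexai.init(project="your-project-id", location="us-central1")

    # Illustrative values; set these to the models you are migrating between.
    baseline_model = "gemini-1.0-pro"           # model your application uses today
    candidate_model = "gemini-2.0-flash"        # model you want to evaluate against it
    experiment_name = "chat-migration-eval-v2"  # unique name for Vertex AI Experiments tracking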