# RAG Retrieval: Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the Vertex AI Evaluation Service.
We use `text-embedding-004` as our baseline model and `text-embedding-005` as our candidate model. Please follow the documentation here to get an understanding of the various text embedding models.
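Both models expose the same embeddings interface in the Vertex AI SDK, which is what makes them directly swappable behind the same RAG pipeline. The snippet below is a minimal sketch for orientation only; it is not part of the recipe's scripts, and the project and location values are placeholders:

```python
# Minimal sketch (not part of this recipe's scripts): loading and calling the baseline
# and candidate embedding models through the Vertex AI SDK. Project/location are placeholders.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")

baseline = TextEmbeddingModel.from_pretrained("text-embedding-004")   # baseline model
candidate = TextEmbeddingModel.from_pretrained("text-embedding-005")  # candidate model

# Both models share the same get_embeddings() interface, so they can be swapped
# behind the same RAG engine and compared on identical questions.
for model in (baseline, candidate):
    [embedding] = model.get_embeddings(["What is retrieval-augmented generation?"])
    print(len(embedding.values))  # dimensionality of the returned embedding vector
```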
- Use case: RAG retrieval.
- Metric: This eval uses a Pointwise Retrieval Quality template to evaluate the responses and pick a model as the winner. We define retrieval quality as the metric here: it checks whether `retrieved_context` contains all the key information present in `reference`. A sketch of how such a metric can be defined is shown after this list.
- Evaluation Datasets are based on the RAG Dataset, in compliance with the following license. They include 8 randomly sampled prompts in the JSONL files `baseline_dataset.jsonl` and `candidate_dataset.jsonl` with the following structure:
    - `question`: the question entered by the user
    - `reference`: the ground-truth answer for the question
    - `retrieved_context`: the context retrieved from the RAG engine
- Prompt Template is a zero-shot prompt located in `prompt_template.txt` with two prompt variables (`reference` and `retrieved_context`) that are automatically populated from our dataset.
- This eval recipe uses an LLM judge model (`gemini-2.0-flash`) to evaluate the retrieval quality of the embedding models.
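For illustration, the sketch below shows how such a pointwise retrieval-quality metric can be defined with the Vertex AI Evaluation SDK. The metric name and prompt wording are illustrative rather than the exact contents of `prompt_template.txt`, and the judge model configuration is handled in `eval.py`:

```python
# Illustrative sketch of a pointwise retrieval-quality metric; the recipe's actual
# judge prompt lives in prompt_template.txt.
from vertexai.evaluation import PointwiseMetric

RETRIEVAL_QUALITY_PROMPT = """\
You are evaluating the retrieval step of a RAG system.

Reference answer:
{reference}

Retrieved context:
{retrieved_context}

Does the retrieved context contain all of the key information present in the
reference answer? Explain briefly, then give a rating of 1 (yes) or 0 (no).
"""

# {reference} and {retrieved_context} are filled in from the corresponding
# dataset columns at evaluation time.
retrieval_quality = PointwiseMetric(
    metric="retrieval_quality",
    metric_prompt_template=RETRIEVAL_QUALITY_PROMPT,
)
```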
## Prerequisite
This recipe assumes that you have already generated evaluation datasets for the baseline (`text-embedding-004`) and candidate (`text-embedding-005`) embedding models. Please refer to the RAG Engine generation notebook to create two separate RAG engines and set up the corresponding datasets. The `retrieved_context` column in each dataset is the context retrieved from the respective RAG engine for each of the questions, as illustrated in the sketch below.
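As an illustration of this prerequisite step, the following sketch shows one way a dataset row could be assembled from an existing RAG Engine corpus. It assumes the `vertexai.preview.rag` API and a corpus already built with the embedding model under test; the corpus resource name and question/reference pair are placeholders, and argument names may differ across SDK versions:

```python
# Sketch only: build a JSONL evaluation dataset from an existing RAG Engine corpus.
# Assumes the vertexai.preview.rag API; argument names may differ across SDK versions.
import json
import vertexai
from vertexai.preview import rag

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

# Resource name of a RAG corpus built with the embedding model under test
# (e.g. created by the RAG Engine generation notebook). Placeholder value.
CORPUS_NAME = "projects/your-project-id/locations/us-central1/ragCorpora/1234567890"

def retrieve_context(question: str, top_k: int = 5) -> str:
    """Queries the RAG corpus and concatenates the retrieved chunks."""
    response = rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=CORPUS_NAME)],
        text=question,
        similarity_top_k=top_k,
    )
    return "\n".join(ctx.text for ctx in response.contexts.contexts)

# question/reference pairs come from the source RAG dataset; a placeholder row is shown here.
qa_pairs = [("What is the capital of France?", "Paris is the capital of France.")]

with open("baseline_dataset.jsonl", "w") as out:
    for question, reference in qa_pairs:
        row = {
            "question": question,
            "reference": reference,
            "retrieved_context": retrieve_context(question),
        }
        out.write(json.dumps(row) + "\n")
```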
- Python script `eval.py` configures the evaluation:
    - `run_eval`: configures the evaluation task, runs it on the two models and prints the results. A sketch of this flow is shown after this list.
    - `load_dataset`: loads the dataset, including the contents of all documents.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
- Google Cloud Shell is the easiest option as it automatically clones our GitHub repo.
- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured Google Cloud Environment:

    ```
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    ```

- Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.
- The resulting metrics will be displayed in the script output.
- You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metric scores.
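The sketch below outlines how `load_dataset` and `run_eval` could be structured; the actual implementation in `eval.py` may differ, and passing the judge model (`model`) to the Evaluation Service is omitted because the mechanism depends on the SDK version:

```python
# Rough sketch of the load_dataset / run_eval flow; the real eval.py may differ.
import pathlib

import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, PointwiseMetric

def load_dataset(dataset_local_path: str) -> pd.DataFrame:
    # Each JSONL row provides the question, reference and retrieved_context columns.
    return pd.read_json(dataset_local_path, lines=True)

def run_eval(model: str, embedding_model: str, experiment_name: str, dataset_local_path: str) -> dict:
    """Runs the retrieval-quality eval on one dataset and prints the summary metrics.

    `model` is the LLM judge named in the recipe (e.g. gemini-2.0-flash); how it is
    handed to the Evaluation Service is omitted in this sketch.
    """
    retrieval_quality = PointwiseMetric(
        metric="retrieval_quality",
        metric_prompt_template=pathlib.Path("prompt_template.txt").read_text(),
    )
    eval_task = EvalTask(
        dataset=load_dataset(dataset_local_path),
        metrics=[retrieval_quality],
        experiment=experiment_name,  # results are tracked in Vertex AI Experiments
    )
    result = eval_task.evaluate()
    print(f"{embedding_model} (judge: {model}):", result.summary_metrics)
    return result.summary_metrics
```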
## How to customize this Eval Recipe
The evaluation consists of two runs: one for the baseline model and one for the candidate model.
- Edit the Python script `eval.py` (an example of the customized calls is shown after this list):
    - Set the `project` parameter of `vertexai.init` to your Google Cloud Project ID.
    - Set the `model` parameter in the `run_eval` calls (e.g., `gemini-2.0-flash`) to the LLM you want to use as the judge for the evaluation task.
    - Set the `embedding_model` parameter to the model that you want to run the evaluation for.
    - Configure a unique `experiment_name` for tracking purposes.
    - Set the `dataset_local_path` parameter to the file you are running the evaluation for.
- Replace the contents of `prompt_template.txt` with your custom prompt template. Make sure that the prompt template variables map to the dataset attributes.
- Please refer to our documentation if you want to further customize your evaluation. Vertex AI Evaluation Service has many features that are not included in this recipe.
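For example, the customized initialization and `run_eval` calls inside `eval.py` could look like the following. This is a sketch using the parameter names listed above; the exact signature is defined in `eval.py` itself:

```python
# Sketch of the customized configuration inside eval.py, using the parameter names
# from the list above; replace the values with your own project, datasets and experiments.
vertexai.init(project="my-gcp-project")  # your Google Cloud Project ID

run_eval(
    model="gemini-2.0-flash",                     # LLM judge used for the evaluation
    embedding_model="text-embedding-004",         # baseline embedding model
    experiment_name="rag-retrieval-baseline",     # unique name tracked in Vertex AI Experiments
    dataset_local_path="baseline_dataset.jsonl",  # dataset generated with this embedding model
)

run_eval(
    model="gemini-2.0-flash",
    embedding_model="text-embedding-005",         # candidate embedding model
    experiment_name="rag-retrieval-candidate",
    dataset_local_path="candidate_dataset.jsonl",
)
```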
