# RAG Retrieval

## Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the Vertex AI Evaluation Service. We use `text-embedding-004` as our baseline model and `text-embedding-005` as our candidate model. Please follow the documentation here for an overview of the available text embedding models.
- Use case: RAG retrieval.
- Metric: This eval uses a Pointwise Retrieval Quality template to evaluate the responses and pick a winning model. We define *retrieval quality* as the metric here: it checks whether `retrieved_context` contains all the key information present in `reference`.
- Evaluation Datasets are based on the RAG Dataset, in compliance with the following license. They include 8 randomly sampled prompts in the JSONL files `baseline_dataset.jsonl` and `candidate_dataset.jsonl` with the following structure:
    - `question`: the user's input question
    - `reference`: the ground-truth answer for the question
    - `retrieved_context`: the context retrieved from the respective RAG engine
- Prompt Template is a zero-shot prompt located in `prompt_template.txt` with two prompt variables (`reference` and `retrieved_context`) that are automatically populated from our dataset.
- This eval recipe uses an LLM judge model (`gemini-2.0-flash`) to evaluate the retrieval quality of the embedding models.
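To make the metric concrete, here is a minimal sketch of how such a pointwise metric could be declared with the Vertex AI Evaluation SDK. The `vertexai.evaluation` import path and this exact wiring are assumptions for illustration, not a copy of the recipe's `eval.py`.

```python
# Sketch only: declare a pointwise "retrieval quality" metric from the judge
# prompt in prompt_template.txt, which references {reference} and
# {retrieved_context}. Assumes the vertexai.evaluation SDK
# (vertexai.preview.evaluation in older SDK versions).
from vertexai.evaluation import PointwiseMetric

with open("prompt_template.txt") as f:
    judge_prompt = f.read()

retrieval_quality = PointwiseMetric(
    metric="retrieval_quality",
    metric_prompt_template=judge_prompt,
)
```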
## Prerequisite

This recipe assumes that you have already created evaluation datasets for the baseline (`text-embedding-004`) and candidate (`text-embedding-005`) embedding models. Please refer to the RAG Engine generation notebook to create two separate RAG engines and set up the corresponding datasets. The `retrieved_context` column in each dataset is the context retrieved from the respective RAG engine for each of the questions.
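As a rough illustration of what that dataset-generation step produces, the sketch below writes a JSONL file in the expected format. The `retrieve_context` helper is a hypothetical placeholder for whatever retrieval call your RAG engine exposes; it is not part of this recipe.

```python
# Hypothetical helper for producing baseline_dataset.jsonl / candidate_dataset.jsonl.
# retrieve_context() is a placeholder for your own RAG Engine retrieval call.
import json


def retrieve_context(question: str) -> str:
    """Placeholder: query the RAG engine built on one embedding model."""
    raise NotImplementedError


def build_dataset(qa_pairs: list[tuple[str, str]], output_path: str) -> None:
    """Write one JSONL row per (question, reference) pair."""
    with open(output_path, "w") as f:
        for question, reference in qa_pairs:
            row = {
                "question": question,
                "reference": reference,
                "retrieved_context": retrieve_context(question),
            }
            f.write(json.dumps(row) + "\n")
```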
- Python script `eval.py` configures the evaluation (a rough sketch of both functions appears after this list):
    - `run_eval`: configures the evaluation task, runs it on the two models and prints the results.
    - `load_dataset`: loads the dataset, including the contents of all documents.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
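For orientation, here is a hedged sketch of what the two functions in `eval.py` described above might look like, assuming the `vertexai.evaluation` SDK and a pandas DataFrame dataset; the actual `eval.py` in this recipe is the source of truth.

```python
# Hedged sketch of load_dataset / run_eval, assuming the vertexai.evaluation SDK.
# The real eval.py may differ in detail; parameter names mirror the ones this
# recipe exposes for customization.
import json

import pandas as pd
from vertexai.evaluation import EvalTask, PointwiseMetric


def load_dataset(dataset_local_path: str) -> pd.DataFrame:
    """Load the JSONL rows (question, reference, retrieved_context) into a DataFrame."""
    with open(dataset_local_path) as f:
        return pd.DataFrame([json.loads(line) for line in f if line.strip()])


def run_eval(model: str, embedding_model: str, dataset_local_path: str, experiment_name: str):
    """Score one embedding model's retrieved contexts with the LLM judge."""
    # `model` is the judge LLM (e.g. gemini-2.0-flash); how the recipe passes it
    # to the evaluation service is omitted from this sketch.
    metric = PointwiseMetric(
        metric="retrieval_quality",
        metric_prompt_template=open("prompt_template.txt").read(),
    )
    task = EvalTask(
        dataset=load_dataset(dataset_local_path),
        metrics=[metric],
        experiment=experiment_name,
    )
    # No inference step is needed: the judge compares the precomputed
    # retrieved_context column against reference for every row.
    result = task.evaluate(experiment_run_name=embedding_model)
    print(f"{embedding_model}: {result.summary_metrics}")
    return result
```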
- Google Cloud Shell is the easiest option as it automatically clones our GitHub repo.
- Alternatively, you can use the following commands to clone this repo to any Linux environment with a configured Google Cloud environment:

    ```bash
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    cd genai-on-vertex-ai/gemini/model_upgrades
    ```
- Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.
- The resulting metrics will be displayed in the script output.
- You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metric scores.
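If you prefer to inspect those experiment runs programmatically, the Vertex AI SDK can export them to a DataFrame. A minimal sketch, assuming a hypothetical experiment name of `rag-retrieval-eval`:

```python
# Sketch: pull the metrics logged by the eval runs out of Vertex AI Experiments.
# "your-project-id" and "rag-retrieval-eval" are placeholders; use your own
# project and the experiment_name configured in eval.py.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")
runs = aiplatform.get_experiment_df("rag-retrieval-eval")

# Experiment run metrics are exported as "metric.*" columns.
print(runs[["run_name"] + [c for c in runs.columns if c.startswith("metric.")]])
```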
## How to customize this Eval Recipe:

You will need two evaluation runs: one for the baseline model and one for the candidate model.

- Edit the Python script `eval.py` (example calls are sketched after this list):
    - Set the `project` parameter of `vertexai.init` to your Google Cloud Project ID.
    - Set the `model` parameter in the `run_eval` calls (e.g. `gemini-2.0-flash`) to the LLM you want to use as the judge for the evaluation task.
    - Set the `embedding_model` parameter to the model that you want to run the evaluation for.
    - Configure a unique `experiment_name` for tracking purposes.
    - Set the `dataset_local_path` parameter to the file you are running the evaluations on.
- Replace the contents of `prompt_template.txt` with your custom prompt template. Make sure that the prompt template variables map to the dataset attributes.
- Please refer to our documentation if you want to further customize your evaluation. Vertex AI Evaluation Service has many features that are not included in this recipe.
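For reference, the customized calls at the bottom of `eval.py` might look like the sketch below. The parameter names come from the list above, but treat the actual signature of `run_eval` in `eval.py` as the source of truth.

```python
# Sketch of the two customized runs inside eval.py (run_eval is defined earlier
# in that script). Project ID, experiment name and dataset paths are placeholders.
import vertexai

vertexai.init(project="your-project-id")

# Baseline embedding model.
run_eval(
    model="gemini-2.0-flash",              # judge LLM
    embedding_model="text-embedding-004",
    dataset_local_path="baseline_dataset.jsonl",
    experiment_name="my-rag-retrieval-eval",
)

# Candidate embedding model.
run_eval(
    model="gemini-2.0-flash",
    embedding_model="text-embedding-005",
    dataset_local_path="candidate_dataset.jsonl",
    experiment_name="my-rag-retrieval-eval",
)
```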