RAG Embeddings Eval
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
RAG Embeddings Retrieval Eval Recipe¶
This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the Vertex AI Evaluation Service. We use text-embedding-004 as our baseline model and text-embedding-005 as our candidate model. Please refer to the Vertex AI documentation for an overview of the various text embedding models.
Use case: RAG retrieval

Metric: This eval uses a Pointwise Retrieval Quality template to evaluate the responses and pick an embedding model as the winner. We define retrieval quality as the metric here. It checks whether the retrieved_context contains all the key information present in reference.

Evaluation Datasets are based on the RAG Dataset, in compliance with its license. They include 8 randomly sampled prompts in the JSONL files baseline_dataset.jsonl and candidate_dataset.jsonl with the following structure:
- question: the question entered by the user
- reference: the ground-truth answer for the question
- retrieved_context: the context retrieved from the respective RAG engine

Prompt Template is a zero-shot prompt located in prompt_template.txt with two prompt variables (reference and retrieved_context) that are automatically populated from our dataset. A hypothetical example of the dataset row and template shape is sketched right after this overview.

This eval recipe uses an LLM judge model (gemini-2.0-flash) to evaluate the retrieval quality of the embedding models.
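For orientation only, here is a hypothetical dataset row and a minimal pointwise prompt template of the shape this recipe expects. Both are illustrative sketches; the actual files referenced by BASELINE_DATASET_URI, CANDIDATE_DATASET_URI and PROMPT_TEMPLATE_URI are the source of truth and may differ.

import json

# Hypothetical dataset row (illustration only; real rows come from the JSONL files in GCS).
example_row = {
    "question": "What is the boiling point of water at sea level?",
    "reference": "Water boils at 100 degrees Celsius at sea level.",
    "retrieved_context": "At sea level, water boils at 100 °C (212 °F).",
}
print(json.dumps(example_row))  # one line of a JSONL dataset

# Hypothetical pointwise template using the two prompt variables from the dataset.
example_prompt_template = (
    "You are evaluating retrieval quality for a RAG system.\n"
    "Reference answer: {reference}\n"
    "Retrieved context: {retrieved_context}\n"
    "Does the retrieved context contain all key information from the reference? "
    "Give a score and a short explanation."
)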
Prerequisite¶
This recipe assumes that you have already created evaluation datasets for the baseline (text-embedding-004) and candidate (text-embedding-005) embedding models. Please refer to the RAG Engine generation notebook to create two separate RAG engines and set up the corresponding datasets. The retrieved_context column in each dataset is the context retrieved from the respective RAG engine for each question. A minimal sketch of assembling such a dataset file is shown below.
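If you are assembling these JSONL files yourself, a minimal sketch could look like the following. The get_retrieved_context function is a hypothetical placeholder for your RAG engine's retrieval call and is not part of this recipe.

import json

def get_retrieved_context(question: str) -> str:
    # Hypothetical placeholder: replace with a real retrieval call against your RAG engine.
    raise NotImplementedError

def write_eval_dataset(rows: list[tuple[str, str]], output_path: str) -> None:
    """Write (question, reference) pairs plus retrieved context to a JSONL file."""
    with open(output_path, "w") as f:
        for question, reference in rows:
            record = {
                "question": question,
                "reference": reference,
                "retrieved_context": get_retrieved_context(question),
            }
            f.write(json.dumps(record) + "\n")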
Configure Eval Settings¶
%%writefile .env
PROJECT_ID=your-project-id # Google Cloud Project ID
LOCATION=us-central1 # Region for all required Google Cloud services
EXPERIMENT_NAME=rag-embeddings-eval-recipe-demo # Creates Vertex AI Experiment to track the eval runs
BASELINE_EMBEDDING_MODEL=text-embedding-004
CANDIDATE_EMBEDDING_MODEL=text-embedding-005
MODEL=gemini-2.0-flash # This model will be the judge for performing evaluations
BASELINE_DATASET_URI="gs://gemini_assets/rag_embeddings/baseline_dataset.jsonl" # Baseline embedding model dataset in Google Cloud Storage
CANDIDATE_DATASET_URI="gs://gemini_assets/rag_embeddings/candidate_dataset.jsonl" # Candidate embedding model dataset in Google Cloud Storage
PROMPT_TEMPLATE_URI="gs://gemini_assets/rag_embeddings/prompt_template.txt" # Text file in Google Cloud Storage
METRIC_NAME="retrieval_quality"
Install Python Libraries¶
%pip install --upgrade --quiet google-cloud-aiplatform[evaluation] python-dotenv
# The error "session crashed" is expected. Please ignore it and proceed to the next cell.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
import os
import json
import pandas as pd
import sys
import vertexai
from dotenv import load_dotenv
from google.cloud import storage
from datetime import datetime
from IPython.display import clear_output
from vertexai.evaluation import EvalTask, EvalResult, PointwiseMetric
Authenticate to Google Cloud (requires permission to open a popup window)¶
load_dotenv(override=True)
if os.getenv("PROJECT_ID") == "your-project-id":
raise ValueError("Please configure your Google Cloud Project ID in the first cell.")
if "google.colab" in sys.modules:
from google.colab import auth
auth.authenticate_user()
vertexai.init(project=os.getenv('PROJECT_ID'), location=os.getenv('LOCATION'))
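Optional sanity check (not part of the original recipe): using the os and storage modules imported above, confirm that the configured Cloud Storage objects are readable with your credentials before running the eval. This assumes your account has read access to the bucket.

# Optional: verify the eval assets are accessible before running the eval.
gcs_client = storage.Client(project=os.getenv("PROJECT_ID"))
for key in ("BASELINE_DATASET_URI", "CANDIDATE_DATASET_URI", "PROMPT_TEMPLATE_URI"):
    uri = os.getenv(key)
    found = storage.Blob.from_string(uri, gcs_client).exists()
    print(f"{key}: {uri} -> {'found' if found else 'MISSING'}")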
Run the eval on both models with the Pointwise autorater¶
def load_file(gcs_uri: str) -> str:
    """Download a text file from Google Cloud Storage and return its contents."""
    blob = storage.Blob.from_string(gcs_uri, storage.Client())
    return blob.download_as_string().decode('utf-8')

def load_dataset(dataset_uri: str) -> pd.DataFrame:
    """Load a JSONL eval dataset from Google Cloud Storage into a DataFrame."""
    jsonl = load_file(dataset_uri)
    samples = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    return pd.DataFrame(samples)

def load_prompt_template() -> str:
    """Download the pointwise metric prompt template from Google Cloud Storage."""
    return load_file(os.getenv("PROMPT_TEMPLATE_URI"))

def run_eval(model: str, embedding_model: str, dataset_uri: str) -> EvalResult:
    """Run the pointwise retrieval-quality eval for one embedding model's dataset."""
    timestamp = f"{datetime.now().strftime('%b-%d-%H-%M-%S')}".lower()
    return EvalTask(
        dataset=dataset_uri,
        metrics=[
            PointwiseMetric(
                metric=os.getenv('METRIC_NAME'),
                metric_prompt_template=load_prompt_template(),
            )
        ],
        experiment=os.getenv('EXPERIMENT_NAME'),
    ).evaluate(
        response_column_name='retrieved_context',
        experiment_run_name=f"{timestamp}-{embedding_model}-{model.replace('.', '-')}",
    )
baseline_metrics = run_eval(os.getenv("MODEL"), os.getenv("BASELINE_EMBEDDING_MODEL"), os.getenv("BASELINE_DATASET_URI"))
candidate_metrics = run_eval(os.getenv("MODEL"), os.getenv("CANDIDATE_EMBEDDING_MODEL"), os.getenv("CANDIDATE_DATASET_URI"))
clear_output()
print("Average score for baseline model retrieval quality:", round(baseline_metrics.summary_metrics[f'{os.getenv("METRIC_NAME")}/mean'],3))
print("Average score for candidate model retrieval quality:", round(candidate_metrics.summary_metrics[f'{os.getenv("METRIC_NAME")}/mean'],3))