RAG Embeddings Eval
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
RAG Embeddings Retrieval Eval Recipe
This Eval Recipe demonstrates how to compare the performance of two embedding models on a RAG dataset using the Vertex AI Evaluation Service.
We will use text-embedding-004 as our baseline model and text-embedding-005 as our candidate model. Please follow the documentation here to get an understanding of the various text embedding models.
- Use case: RAG retrieval
- Metric: This eval uses a Pointwise Retrieval Quality template to evaluate the responses and pick an embedding model as the winner. We define retrieval_quality as the metric here. It checks whether the retrieved_context contains all the key information present in reference.
- Evaluation Datasets are based on the RAG Dataset in compliance with the following license. They include 8 randomly sampled prompts in the JSONL files baseline_dataset.jsonl and candidate_dataset.jsonl with the following structure (a sample record is shown after this list):
  - question: The user's question
  - reference: The ground truth answer for the question
  - retrieved_context: The context retrieved from the RAG engine
- Prompt Template is a zero-shot prompt located in prompt_template.txt with two prompt variables (reference and retrieved_context) that are automatically populated from our dataset.
- This eval recipe uses an LLM judge model (gemini-2.0-flash) to evaluate the retrieval quality of the embedding models.
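For reference, a single dataset record might look like the following. The values below are illustrative placeholders, not actual rows from the RAG Dataset:
{"question": "What is the capital of France?", "reference": "The capital of France is Paris.", "retrieved_context": "Paris is the capital and most populous city of France, located on the Seine river."}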
Prerequisite
This recipe assumes that you have already created datasets for the baseline (text-embedding-004) and candidate (text-embedding-005) embedding models. Please refer to the RAG Engine generation notebook to create two separate RAG engines and set up the corresponding datasets. The retrieved_context column in each dataset is the context retrieved from the respective RAG engine for each question.
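If you still need to build these files, each JSONL line can be assembled from your own retrieval results. The snippet below is a minimal sketch, assuming a hypothetical retrieve_context() callable that wraps whichever RAG engine you created in that notebook:
import json

def build_dataset_line(question: str, reference: str, retrieve_context) -> str:
    """Builds one JSONL line in the structure this recipe expects.

    retrieve_context is a hypothetical callable that queries your RAG engine
    and returns the retrieved context for the question as a single string.
    """
    return json.dumps({
        "question": question,
        "reference": reference,
        "retrieved_context": retrieve_context(question),
    })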
Configure Eval Settings
%%writefile .env
PROJECT_ID=your-project-id        # Google Cloud Project ID
LOCATION=us-central1                  # Region for all required Google Cloud services
EXPERIMENT_NAME=rag-embeddings-eval-recipe-demo      # Creates Vertex AI Experiment to track the eval runs
BASELINE_EMBEDDING_MODEL=text-embedding-004
CANDIDATE_EMBEDDING_MODEL=text-embedding-005
MODEL=gemini-2.0-flash # This model will be the judge for performing evaluations
BASELINE_DATASET_URI="gs://gemini_assets/rag_embeddings/baseline_dataset.jsonl"  # Baseline embedding model dataset in Google Cloud Storage
CANDIDATE_DATASET_URI="gs://gemini_assets/rag_embeddings/candidate_dataset.jsonl"  # Candidate embedding model dataset in Google Cloud Storage
PROMPT_TEMPLATE_URI="gs://gemini_assets/rag_embeddings/prompt_template.txt"  # Text file in Google Cloud Storage
METRIC_NAME="retrieval_quality"
Install Python Libraries
%pip install --upgrade --quiet google-cloud-aiplatform[evaluation] python-dotenv
# The error "session crashed" is expected. Please ignore it and proceed to the next cell.
import IPython
IPython.Application.instance().kernel.do_shutdown(True)
import os
import json
import pandas as pd
import sys
import vertexai
from dotenv import load_dotenv
from google.cloud import storage
from datetime import datetime
from IPython.display import clear_output
from vertexai.evaluation import EvalTask, EvalResult, PointwiseMetric
Authenticate to Google Cloud (requires permission to open a popup window)
load_dotenv(override=True)
if os.getenv("PROJECT_ID") == "your-project-id":
    raise ValueError("Please configure your Google Cloud Project ID in the first cell.")
if "google.colab" in sys.modules:  
    from google.colab import auth  
    auth.authenticate_user()
vertexai.init(project=os.getenv('PROJECT_ID'), location=os.getenv('LOCATION'))
Run the eval on both models with the Pointwise autorater
def load_file(gcs_uri: str) -> str:
    """Downloads a text file from Google Cloud Storage and returns its contents."""
    blob = storage.Blob.from_string(gcs_uri, storage.Client())
    return blob.download_as_string().decode('utf-8')

def load_dataset(dataset_uri: str) -> pd.DataFrame:
    """Loads a JSONL dataset from Google Cloud Storage into a DataFrame."""
    jsonl = load_file(dataset_uri)
    samples = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    return pd.DataFrame(samples)

def load_prompt_template() -> str:
    """Loads the metric prompt template referenced by PROMPT_TEMPLATE_URI."""
    return load_file(os.getenv("PROMPT_TEMPLATE_URI"))
def run_eval(model: str, embedding_model: str, dataset_uri: str) -> EvalResult:
    """Runs the pointwise retrieval quality eval and logs it to the Vertex AI Experiment."""
    timestamp = datetime.now().strftime('%b-%d-%H-%M-%S').lower()
    return EvalTask(
        dataset=dataset_uri,
        metrics=[
            PointwiseMetric(
                metric=os.getenv('METRIC_NAME'),
                metric_prompt_template=load_prompt_template(),
            )
        ],
        experiment=os.getenv('EXPERIMENT_NAME'),
    ).evaluate(
        response_column_name='retrieved_context',
        experiment_run_name=f"{timestamp}-{embedding_model}-{model.replace('.', '-')}",
    )
baseline_metrics = run_eval(os.getenv("MODEL"), os.getenv("BASELINE_EMBEDDING_MODEL"), os.getenv("BASELINE_DATASET_URI"))
candidate_metrics = run_eval(os.getenv("MODEL"), os.getenv("CANDIDATE_EMBEDDING_MODEL"), os.getenv("CANDIDATE_DATASET_URI"))
clear_output()
print("Average score for baseline model retrieval quality:", round(baseline_metrics.summary_metrics[f'{os.getenv("METRIC_NAME")}/mean'],3))
print("Average score for candidate model retrieval quality:", round(candidate_metrics.summary_metrics[f'{os.getenv("METRIC_NAME")}/mean'],3))