# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Evaluating Retrieval Augmented Generation (RAG) Systems¶
Author(s): Egon Soares, Renato Leite
Overview¶
In this notebook, you will learn how to use the Vertex AI Rapid Evaluation SDK to evaluate components of a Retrieval Augmented Generation (RAG) System.
RAG systems have emerged as a powerful approach for improving the groundedness, relevancy, and factuality of large language model (LLM) responses by combining the capabilities of LLMs with information retrieval techniques from external sources.
Evaluating the various components of this system is crucial to ensure the quality of the overall response.
The diagram below illustrates a simplified view of a typical RAG system workflow.
In this notebook, we'll delve into the evaluation of two components of a RAG system:
- Question Rephrasing with LLM: During the "Search" step, LLMs can rephrase user questions to improve retrieval accuracy, leading to more relevant and informative responses in RAG systems. Here you will evaluate the rephrased question.
- Response from the RAG System: Evaluate the quality, accuracy, and relevance of the final answer generated by the RAG System.
It's important to note that this diagram is a simplified representation of a RAG System.
Real-world RAG systems often involve additional components and complexities, but this overview provides a solid foundation for understanding the core principles.
Reference Architecture¶
This diagram illustrates a simplified RAG system built on Google Cloud.
IMPORTANT: The purpose of this diagram is to illustrate the common Google Cloud components of a RAG system and identify potential areas where output can be evaluated.
It is not intended to be a final representation of how a RAG system should be designed.
System Architecture and GCP products:
- Data Ingestion: The system starts with various data sources, which can include web pages, files, databases, knowledge bases, etc.
- Preprocessing: The data is parsed and chunked by Document AI or with your custom scripts, and stored in Cloud Storage.
- Embedding and Storage: The processed data is then converted into vector embeddings using a Vertex AI Embeddings model, and these embeddings are stored in Vertex AI Vector Search.
- User Query: When a user submits a query, it is first rephrased using Vertex AI Gemini and converted into an embedding.
- Retrieval: The query embedding is used to search the stored embeddings and return the most relevant documents.
- Answer Generation: Finally, Vertex AI Gemini utilizes the retrieved documents and the rephrased question to generate a comprehensive and contextually relevant answer.
Based on this system architecture, we will provide some guidelines to evaluate the rephrased user question and the final response from the RAG System.
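To make the query-time flow concrete, the sketch below strings the steps together. It is purely illustrative: the function bodies are placeholders standing in for Vertex AI Gemini, the Vertex AI Embeddings model, and Vertex AI Vector Search, not actual API calls.
def rephrase_question(question: str) -> str:
    # Placeholder: a real system would call Vertex AI Gemini to clarify and rewrite the question.
    return question

def retrieve_documents(question: str, k: int = 3) -> list[str]:
    # Placeholder: a real system would embed the question with a Vertex AI Embeddings
    # model and query Vertex AI Vector Search for the top-k matching documents.
    return ["<retrieved document text>"] * k

def generate_answer(question: str, documents: list[str]) -> str:
    # Placeholder: a real system would call Vertex AI Gemini with the rephrased
    # question and the retrieved documents as context.
    return "<generated answer>"

def answer_with_rag(user_question: str) -> str:
    rephrased = rephrase_question(user_question)
    documents = retrieve_documents(rephrased)
    return generate_answer(rephrased, documents)
The two outputs evaluated in this notebook correspond to the return values of rephrase_question and generate_answer.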
References:
https://cloud.google.com/generative-ai-app-builder/docs/parse-chunk-documents#parse-chunk-rag
https://cloud.google.com/document-ai/docs/layout-parse-chunk
https://cloud.google.com/vertex-ai/generative-ai/docs/models/online-pipeline-services
Getting Started¶
Install Vertex AI SDK for Rapid Evaluation¶
! pip install --upgrade --user --quiet google-cloud-aiplatform
! pip install --upgrade --user --quiet datasets tqdm nest_asyncio
Authenticate your notebook environment (Colab only)¶
If you are using Colab, uncomment the Python code below and execute it in your Colab environment.
It authenticates your user so the notebook can access your Google Cloud project.
# import sys
# if "google.colab" in sys.modules:
#     from google.colab import auth
#     auth.authenticate_user()
Set Google Cloud project information and initialize Vertex AI SDK¶
PROJECT_ID = "<YOUR PROJECT ID>" # Replace with your project ID
LOCATION = "us-central1"
import vertexai
vertexai.init(project=PROJECT_ID, location=LOCATION)
Import Libraries¶
import nest_asyncio
import pandas as pd
from IPython.display import display, Markdown, HTML
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
)
nest_asyncio.apply()
Helper Functions¶
def display_eval_report(eval_result, metrics=None):
    """Displays the evaluation results."""
    title, summary_metrics, report_df = eval_result
    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        report_df = report_df.filter(
            [
                metric
                for metric in report_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the title with Markdown for emphasis
    display(Markdown(f"## {title}"))

    # Display the metrics DataFrame
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    # Display the detailed report DataFrame
    display(Markdown("### Report Metrics"))
    display(report_df)
def display_explanations(df, metrics=None, n=1):
    """Displays n sampled rows along with their metric scores and explanations."""
    style = "white-space: pre-wrap; width: 800px; overflow-x: auto;"
    df = df.sample(n=n)
    if metrics:
        df = df.filter(
            ["instruction", "context", "reference", "completed_prompt", "response"]
            + [
                metric
                for metric in df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    for _, row in df.iterrows():
        for col in df.columns:
            display(HTML(f"<h2>{col}:</h2> <div style='{style}'>{row[col]}</div>"))
        display(HTML("<hr>"))
Bring-Your-Own-Answer Evaluation for RAG¶
Use Case 1: Evaluate rephrased user query¶
To improve the quality of the RAG System response, one option is to rephrase the user question to improve its clarity and make it easier to understand.
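For context, a rephrasing step upstream of this evaluation might look like the sketch below. This is only an illustration: the model name ("gemini-1.5-pro") and the prompt wording are assumptions, not part of this notebook; the sketch reuses the GenerativeModel class imported earlier.
# Illustrative sketch of producing rephrased questions with Gemini.
# The model name and prompt are assumptions; adjust them for your project.
rephrase_model = GenerativeModel("gemini-1.5-pro")

def rephrase(question: str) -> str:
    prompt = (
        "Rewrite the following user question so it is clear, specific, and "
        "self-contained. Return only the rewritten question.\n\n"
        f"Question: {question}"
    )
    return rephrase_model.generate_content(prompt).text.strip()
The questions evaluated below stand in for the output of such a step.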
You will use two metrics to evaluate this task: Coherence and Fluency.
According to Vertex AI documentation, here is a brief description of both metrics.
Coherence: The coherence metric describes the model's ability to provide a coherent response.
Evaluation criteria for coherence:
- Follows logical flow: Ideas logically progress with clear transitions that are relevant to the main point.
- Organized: Writing structure is clear, employing topic sentences where appropriate and effective transitions to guide the reader.
- Cohesive: Word choices, sentence structures, pronouns, and figurative language reinforce connections between ideas.
Fluency: The fluency metric describes the model's language mastery.
Evaluation criteria for fluency:
- Has proper grammar: The language's grammar rules are correctly followed, including but not limited to sentence structures, verb tenses, subject-verb agreement, proper punctuation, and capitalization.
- Chooses words appropriately: Words chosen are appropriate and purposeful given their relative context and positioning in the text. The vocabulary demonstrates prompt understanding.
- Smooth: Sentences flow smoothly and avoid awkward phrasing or run-on sentences. Ideas and sentences connect logically, using transitions effectively where needed.
Reference: https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval
Prepare Dataset¶
To evaluate coherence and fluency, you only need to provide the input questions; they are passed to the Vertex AI Rapid Evaluation SDK as the response column of the evaluation dataset.
questions = [
    "Can I configure certificates manually?",
    "How many control plane instances should I use?",
    "Is it possible to run different replicas of a StatefulSet in different zones?",
]

rephrase_dataset = pd.DataFrame(
    {
        "response": questions,
    }
)
Create an EvalTask and define the metrics you want to use. You can also set an experiment ID to log all the results to Vertex AI Experiments.
eval_rephrase_task = EvalTask(
    dataset=rephrase_dataset,
    metrics=[
        "coherence",
        "fluency",
    ],
    experiment="evaluate-rephrase-01",
)
# Start the evaluation process. Depending on the number of samples in your
# evaluation dataset, this can take a few minutes to complete.
result = eval_rephrase_task.evaluate()
Overall Evaluation Result¶
If you want an overall view of all the metric results in one table, you can use the display_eval_report() helper function.
display_eval_report(("Eval Result", result.summary_metrics, result.metrics_table))
Detailed Explanation for an Individual Instance¶
If you need to delve into an individual result and see the detailed explanation of why a score was assigned and how confident the model is for each model-based metric, you can use the display_explanations() helper function.
For example, you can set n=2 to display the explanations for two randomly sampled instances, as follows:
display_explanations(result.metrics_table, n=2)
Use Case 2: Evaluate RAG answer¶
To evaluate the responses from the RAG system, we can use the following metrics:
- question_answering_quality
- question_answering_relevance
- question_answering_helpfulness
- groundedness
- fulfillment
According to Vertex AI documentation, here is a brief description of these metrics.
Question Answering Quality: The question_answering_quality metric describes the model's ability to answer questions given a body of text to reference.
Evaluation criteria for question_answering_quality:
- Follows instructions: The response answers the question and follows any instructions.
- Grounded: The response includes only information from the inference context and inference instruction.
- Relevance: The response contains details relevant to the instruction.
- Comprehensive: The model captures important details from the question.
Question Answering Relevance: The question_answering_relevance metric describes the model's ability to respond with relevant information when asked a question.
Evaluation criteria for question_answering_relevance:
- Relevance: The response contains details relevant to the instruction.
- Clarity: The response provides clearly defined information that directly addresses the instruction.
Question Answering Helpfulness: The question_answering_helpfulness metric describes the model's ability to provide important details when answering a question.
Evaluation criteria for question_answering_helpfulness:
- Helpful: The response satisfies the user's query.
- Comprehensive: The model captures important details to satisfy the user's query.
Groundedness: The groundedness metric describes the model's ability to provide or reference information included only in the input text.
Evaluation criteria for groundedness:
- Grounded: The response includes only information from the inference context and the inference instruction.
Fulfillment: The fulfillment metric describes the model's ability to fulfill instructions.
Evaluation criteria for fulfillment:
- Follows instructions: The response demonstrates an understanding of the instructions and satisfies all of the instruction requirements.
Prepare Dataset¶
To evaluate these metrics, we need to provide the user question, the retrieved documents, and the generated response.
# These are sample documents you will use as the context for your questions.
retrieved_contexts = []
for file_path in ["files/certificates.md", "files/cluster-large.md", "files/multiple-zones.md"]:
    with open(file_path) as fp:
        retrieved_contexts.append(fp.read())

print(retrieved_contexts[0])
# User questions
questions = [
    "Can I configure certificates manually?",
    "How many control plane instances should I use?",
    "Is it possible to run different replicas of a StatefulSet in different zones?",
]

# Generated responses from the LLM
generated_answers = [
    "Yes, if you don't want kubeadm to generate the required certificates, you can create them using a single root CA or by providing all certificates.",
    "At least one control plane instance per failure zone is recommended for fault tolerance. You can scale these instances vertically, and then horizontally after reaching a point of diminishing returns with vertical scaling.",
    "Yes, you can use Pod topology spread constraints to ensure that replicas of a StatefulSet are distributed across different zones whenever possible.",
]

# Dataset that will be fed to the Rapid Evaluation service.
eval_dataset = pd.DataFrame(
    {
        "instruction": questions,
        "context": retrieved_contexts,
        "response": generated_answers,
    }
)
Define an EvalTask with the metrics listed above.
answer_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "question_answering_quality",
        "question_answering_relevance",
        "question_answering_helpfulness",
        "groundedness",
        "fulfillment",
    ],
    experiment="evaluate-rag-answer-01",
)
result = answer_eval_task.evaluate()
display_eval_report(("Eval Result", result.summary_metrics, result.metrics_table))
display_explanations(result.metrics_table, n=1)
display_explanations(result.metrics_table, metrics=["question_answering_quality"])
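Beyond the helper functions, result.metrics_table is a plain pandas DataFrame, so you can slice or export it directly. The sketch below is only an example: the column selection and the CSV file name are assumptions, and it relies on metric-related column names containing the metric name, which is the same substring matching the helper functions above use.
# Keep only the groundedness-related columns for a quick, focused look.
# Exact column names may vary by SDK version; filtering by substring avoids
# hard-coding them.
groundedness_columns = [
    col for col in result.metrics_table.columns if "groundedness" in col
]
display(result.metrics_table[["instruction", "response"] + groundedness_columns])

# Optionally persist the full table to compare runs across experiments.
# The file name is arbitrary.
result.metrics_table.to_csv("rag_answer_eval_results.csv", index=False)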