# Summarization

## Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of a summarization prompt on Gemini 1.0 and Gemini 2.0 using a labeled dataset and the open source evaluation tool Promptfoo.
- Use case: summarize a news article.
- The Evaluation Dataset[^1] includes 5 news articles stored as plain text files, and a JSONL file with ground truth labels: `dataset.jsonl`. Each record in this file includes 2 attributes wrapped in the `vars` object. This structure allows Promptfoo to inject the article text into the prompt template and find the ground truth label required to score the quality of model-generated summaries (see the sample record after this list):
    - `document`: relative path to the plain text file containing the news article
    - `summary`: ground truth label (short summary of the article)
- Prompt Template is a zero-shot prompt located in `prompt_template.txt` with the variable `document`, which gets populated from the corresponding dataset attribute. A sketch of such a template follows this list.
- `promptfooconfig.yaml` contains all Promptfoo configuration (see the configuration sketch below):
    - `providers`: list of models that will be evaluated
    - `prompts`: location of the prompt template file
    - `tests`: location of the labeled dataset file
    - `defaultTest`: defines the scoring logic:
        - `type: rouge-n` rates similarity between the model response and the ground truth label
        - `value: "{{summary}}"` instructs Promptfoo to use the `summary` dataset attribute as the ground truth label
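For reference, a single record in `dataset.jsonl` might look like the line below. The article file name is an illustrative assumption; the two attributes wrapped in `vars` match the description above.

```json
{"vars": {"document": "articles/article_01.txt", "summary": "A short, one-sentence reference summary of the article."}}
```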
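A minimal zero-shot template sketch, assuming Promptfoo's `{{variable}}` templating and the `document` variable named above, could look like this. The wording is illustrative, not the template shipped in the repo:

```text
Summarize the following news article in one or two sentences.

Article:
{{document}}

Summary:
```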
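The configuration sketch below shows how the pieces described above typically fit together in `promptfooconfig.yaml`. The provider IDs and file paths are assumptions; check the actual file in the repo for the exact values.

```yaml
# Sketch only — provider IDs and paths are assumptions, not the repo's exact config.
prompts:
  - file://prompt_template.txt   # prompt template with the {{document}} variable

providers:
  - vertex:gemini-1.0-pro        # baseline model (assumed ID)
  - vertex:gemini-2.0-flash      # candidate model (assumed ID)

tests: file://dataset.jsonl      # labeled dataset with vars.document and vars.summary

defaultTest:
  assert:
    - type: rouge-n              # similarity between response and ground truth label
      value: "{{summary}}"       # use the "summary" dataset attribute as the label
```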
## How to run this Eval Recipe
- Configure your Google Cloud environment and clone this GitHub repo to your environment. We recommend Cloud Shell or Vertex AI Workbench.

    ```bash
    git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
    cd applied-ai-engineering-samples && \
    git sparse-checkout init && \
    git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
    git pull origin main
    ```
- Install Promptfoo using these instructions.
- Navigate to the Eval Recipe directory in a terminal and run the command `promptfoo eval`.
- Run `promptfoo view` to analyze the eval results. You can switch the Display option to "Show failures only" in order to investigate any underperforming prompts.
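Putting the steps above together, a typical terminal session might look like the sketch below. The recipe subdirectory name is an assumption, so adjust it to the actual path of this Eval Recipe inside the cloned repo.

```bash
# The path below the model_upgrades folder is an assumption — adjust to this recipe's actual directory.
cd genai-on-vertex-ai/gemini/model_upgrades/summarization/promptfoo
promptfoo eval    # runs every provider against every dataset record and scores with ROUGE-N
promptfoo view    # opens the local web viewer to inspect per-record results
```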
## How to customize this Eval Recipe
- Copy the configuration file `promptfooconfig.yaml` to a new folder.
- Add your labeled dataset file with a JSONL schema similar to `dataset.jsonl`.
- Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
- That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained here (see the customization sketch below).
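As an illustration of that last step, the sketch below adds a second prompt template and an extra metric to the config. The alternative template file name and the additional `similar` assertion (embedding similarity with a threshold) are assumptions; adapt them to the templates and metrics you actually want to compare.

```yaml
prompts:
  - file://prompt_template.txt            # original zero-shot template
  - file://prompt_template_fewshot.txt    # hypothetical alternative template to compare

defaultTest:
  assert:
    - type: rouge-n                       # existing lexical-overlap metric
      value: "{{summary}}"
    - type: similar                       # additional embedding-similarity metric (assumed choice)
      value: "{{summary}}"
      threshold: 0.8
```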
[^1]: Dataset (XSum) citation: Shashi Narayan, Shay B. Cohen, and Mirella Lapata. "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization." In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018.