Summarization
Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of a summarization prompt on Gemini 1.0 and Gemini 2.0 using a labeled dataset and the open-source evaluation tool Promptfoo.

- Use case: summarize a news article.
- The Evaluation Dataset¹ includes 5 news articles stored as plain text files, and a JSONL file with ground truth labels: `dataset.jsonl`. Each record in this file includes 2 attributes wrapped in the `vars` object (see the sample record after this list). This structure allows Promptfoo to inject the article text into the prompt template and find the ground truth label required to score the quality of model-generated summaries:
    - `document`: relative path to the plain text file containing the news article
    - `summary`: ground truth label (short summary of the article)
- Prompt Template is a zero-shot prompt located in `prompt_template.txt` with the variable `document` that gets populated from the corresponding dataset attribute (see the illustrative template after this list).
- `promptfooconfig.yaml` contains all Promptfoo configuration (see the configuration sketch after this list):
    - `providers`: list of models that will be evaluated
    - `prompts`: location of the prompt template file
    - `tests`: location of the labeled dataset file
    - `defaultTest`: defines the scoring logic:
        - `type: rouge-n` rates similarity between the model response and the ground truth label
        - `value: "{{summary}}"` instructs Promptfoo to use the "summary" dataset attribute as the ground truth label
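
For reference, a `dataset.jsonl` record with the two attributes described above might look like the line below. The file path and summary text are illustrative placeholders, not the actual dataset contents:

```json
{"vars": {"document": "articles/article_01.txt", "summary": "A one-sentence ground truth summary of the article."}}
```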
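
The prompt template itself can be a simple zero-shot instruction wrapped around the `{{document}}` placeholder. The wording below is an illustrative sketch, not the exact contents of `prompt_template.txt`:

```text
Summarize the following news article in a single sentence.

Article:
{{document}}

Summary:
```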
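
Putting it together, a minimal `promptfooconfig.yaml` for this kind of comparison could look roughly like the sketch below. The provider IDs and `file://` paths are assumptions for illustration; refer to the actual `promptfooconfig.yaml` in the repo for the exact configuration used by this recipe:

```yaml
prompts:
  - file://prompt_template.txt   # prompt template with the {{document}} variable

providers:                       # models to compare (illustrative Vertex AI model IDs)
  - vertex:gemini-1.0-pro
  - vertex:gemini-2.0-flash

tests: file://dataset.jsonl      # labeled dataset; each record carries vars.document and vars.summary

defaultTest:
  assert:
    - type: rouge-n              # ROUGE similarity between the model response and the ground truth label
      value: "{{summary}}"       # use the "summary" dataset attribute as the reference
```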
 
 
How to run this Eval Recipe
- Google Cloud Shell is the easiest option, as it automatically clones our GitHub repo.
- Alternatively, you can use the following command to clone this repo to any Linux environment with a configured Google Cloud environment: `git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && cd applied-ai-engineering-samples && git sparse-checkout init && git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && git pull origin main && cd genai-on-vertex-ai/gemini/model_upgrades`
- Install Promptfoo using the official installation instructions.
- Navigate to the Eval Recipe directory in the terminal and run the command `promptfoo eval`.
- Run `promptfoo view` to analyze the eval results. You can switch the Display option to `Show failures only` to investigate any underperforming prompts (an end-to-end sketch of these commands follows this list).
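
Assuming the repo is cloned and Promptfoo is installed, the end-to-end run looks roughly like this; the recipe subdirectory name is an assumption, so substitute the actual path of this Eval Recipe:

```bash
# Change into this recipe's directory (path shown is illustrative).
cd genai-on-vertex-ai/gemini/model_upgrades/summarization

# Run the evaluation defined in promptfooconfig.yaml, then open the local results viewer.
promptfoo eval
promptfoo view
```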
How to customize this Eval Recipe:
- Copy the configuration file `promptfooconfig.yaml` to a new folder.
- Add your labeled dataset file with a JSONL schema similar to `dataset.jsonl`.
- Save your prompt template to `prompt_template.txt` and make sure that the template variables map to the variables defined in your dataset.
- That's it! You are ready to run `promptfoo eval`. If needed, add alternative prompt templates or additional metrics to `promptfooconfig.yaml` as explained in the Promptfoo documentation (see the sketch after this list).
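
For example, comparing two prompt variants and adding a second metric could look roughly like the sketch below. The second template file and the extra `similar` assertion are illustrative assumptions, not part of this recipe:

```yaml
prompts:
  - file://prompt_template.txt      # original zero-shot template
  - file://prompt_template_v2.txt   # hypothetical alternative template to compare

defaultTest:
  assert:
    - type: rouge-n                 # lexical similarity vs. the ground truth label
      value: "{{summary}}"
    - type: similar                 # illustrative extra metric: embedding-based similarity
      value: "{{summary}}"          # requires an embedding provider to be configured
```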
¹ Dataset (XSum) citation: Shashi Narayan, Shay B. Cohen, and Mirella Lapata. "Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018.
