Text Classification
Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of a text classification prompt on Gemini 1.0 and Gemini 2.0 using the Vertex AI Evaluation Service.
- Use case: given a Product Description, find the most relevant Product Category from a predefined list of categories.
- Metric: this eval uses a single deterministic metric, "Accuracy", calculated by comparing model responses with ground truth labels.
- Labeled evaluation dataset (`dataset.jsonl`) is based on the MAVE dataset from Google Research. It includes 6 records that represent products from different categories. Each record provides two attributes, wrapped in the `vars` object. This structure lets the evaluation script recognize the variables needed to populate the prompt template and the ground truth labels used for scoring (see the example record after this list):
  - `product`: product name and description
  - `reference`: the name of the correct product category, which serves as the ground truth label
- Prompt template is a zero-shot prompt located in `prompt_template.txt` with just one prompt variable, `product`, which maps to the `product` attribute in the dataset (see the example template after this list).
- Python script `eval.py` configures the evaluation (a minimal sketch follows this list):
  - `run_eval`: configures the evaluation task, runs it on the 2 models, and prints the results.
  - `case_insensitive_match`: scores the accuracy of model responses by comparing them to ground truth labels.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
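For illustration, a record in `dataset.jsonl` follows the structure sketched below; the product text and category shown here are made up, while the real file contains MAVE-based product descriptions:

```json
{"vars": {"product": "TrailRunner 5 - lightweight running shoe with a breathable mesh upper and cushioned sole", "reference": "Shoe"}}
```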
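The prompt template could look roughly like the sketch below, assuming the script substitutes the `{product}` placeholder with the dataset attribute; the wording and category list are illustrative, not the actual contents of `prompt_template.txt`:

```
Classify the product below into exactly one category from this list:
Shoe, Jewelry, Watch, Handbag, Sunglasses, Backpack.
Respond with the category name only, without any explanation.

Product: {product}
```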
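A minimal sketch of how `eval.py` might wire these pieces together with the Vertex AI SDK is shown below; the model IDs, file handling, and experiment name are assumptions, so treat the actual script as the source of truth:

```python
import json
import pandas as pd
import vertexai
from vertexai.evaluation import CustomMetric, EvalTask
from vertexai.generative_models import GenerativeModel

def case_insensitive_match(instance: dict) -> dict:
    # Deterministic accuracy: 1.0 when the response equals the label, ignoring case and whitespace.
    response = instance["response"].strip().lower()
    reference = instance["reference"].strip().lower()
    return {"case_insensitive_match": 1.0 if response == reference else 0.0}

def run_eval(model_name: str, experiment_name: str):
    # Load the labeled dataset and flatten the "vars" wrapper into DataFrame columns.
    with open("dataset.jsonl") as f:
        dataset = pd.DataFrame([json.loads(line)["vars"] for line in f])
    task = EvalTask(
        dataset=dataset,
        metrics=[CustomMetric(name="case_insensitive_match", metric_function=case_insensitive_match)],
        experiment=experiment_name,
    )
    result = task.evaluate(
        model=GenerativeModel(model_name),
        prompt_template=open("prompt_template.txt").read(),
    )
    print(model_name, result.summary_metrics)

vertexai.init(project="[your-project-id]", location="us-central1")
run_eval("gemini-1.0-pro", "text-classification-eval")        # baseline model (assumed ID)
run_eval("gemini-2.0-flash-001", "text-classification-eval")  # candidate model (assumed ID)
```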
How to run this Eval Recipe
- Configure your Google Cloud environment and clone this GitHub repo. We recommend Cloud Shell or Vertex AI Workbench.
git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
cd applied-ai-engineering-samples && \
git sparse-checkout init && \
git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
git pull origin main
- Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.
cd genai-on-vertex-ai/gemini/model_upgrades/text_classification/vertex_script
export PROJECT_ID="[your-project-id]"
./run.sh
- The resulting scores will be displayed in the script output.
- You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metric scores.
How to customize this Eval Recipe
- Edit the Python script `eval.py` (see the illustrative configuration snippet at the end of this recipe):
  - set the `project` parameter of `vertexai.init` to your Google Cloud Project ID
  - set the `baseline_model` parameter to the model that is currently used by your application
  - set the `candidate_model` parameter to the model that you want to compare with your current model
  - configure a unique `experiment_name` for each template for tracking purposes
- Replace the contents of `dataset.jsonl` with your custom data in the same format.
- Replace the contents of `prompt_template.txt` with your custom prompt template. Make sure that the prompt template variables have the same names as the dataset attributes.
- Please refer to our documentation if you want to further customize your evaluation. The Vertex AI Evaluation Service has many features that are not included in this recipe, including LLM-based autoraters that can provide valuable metrics even without ground truth labels.
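To make the first customization step concrete, the snippet below sketches what the edited configuration values in `eval.py` could look like; the variable names and model IDs are assumptions and may not match the script exactly:

```python
# Hypothetical configuration section of eval.py - adjust names to match the actual script.
vertexai.init(project="your-project-id", location="us-central1")

baseline_model = "gemini-1.0-pro"         # model currently used by your application (assumed ID)
candidate_model = "gemini-2.0-flash-001"  # model you want to compare against (assumed ID)
experiment_name = "text-classification-v2"  # unique name per prompt template, tracked in Vertex AI Experiments
```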