Vertex AI: Gemini Evaluations Playbook

Experiment, Evaluate & Analyze model performance for your use cases

✨ Overview

The Gemini Evaluations Playbook provides recipes to streamline the experimentation and evaluation of Generative AI models for your use cases using the Vertex Generative AI Evaluation service. This lets you track model performance against your objectives and gain insights for optimizing the model under different conditions and configurations.

📏 Experimentation and evaluation workflow

Prompting strategies and best practices are essential for getting started with Gemini, but they're only the first step. To ensure your Generative AI solution with Gemini delivers repeatable and scalable performance, you need a systematic experimentation and evaluation process. This involves meticulous tracking of each experimental configuration, including prompt templates (system instructions, context, and few-shot learning examples), and model parameters like temperature and max output tokens.
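As a minimal sketch of what tracking each experimental configuration could look like, the snippet below records the prompt template pieces and model parameters named above in one immutable record and derives a short fingerprint from them. The `ExperimentConfig` class, its field names, and the placeholder model name are illustrative assumptions, not part of the Evals Playbook API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical record of one experimental configuration."""

    model_name: str
    system_instruction: str
    prompt_template: str
    few_shot_examples: tuple = ()
    temperature: float = 0.0
    max_output_tokens: int = 1024

    def fingerprint(self) -> str:
        """Short stable hash, so identical configurations are easy to spot."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
baseline = ExperimentConfig(
    model_name="gemini-model-placeholder",
    system_instruction="You are a concise summarizer.",
    prompt_template="Summarize the following text: {text}",
    temperature=0.2,
)

# Change exactly one parameter so the effect can be attributed to it.
variant = replace(baseline, temperature=0.7)

print(baseline.fingerprint(), variant.fingerprint())
```

Storing the fingerprint next to each result makes it possible to tell later whether two runs really used the same configuration.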

Your evaluation should report granular metrics for each experiment, not just the aggregate results of the overall evaluation exercise.
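To illustrate per-experiment reporting, the following sketch groups per-example scores by experiment and metric before averaging, instead of collapsing everything into a single number. The row layout, experiment names, and metric names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-example scores; real rows would come from your eval runs.
rows = [
    {"experiment": "exp-a", "metric": "fluency", "score": 4.0},
    {"experiment": "exp-a", "metric": "fluency", "score": 5.0},
    {"experiment": "exp-a", "metric": "groundedness", "score": 1.0},
    {"experiment": "exp-b", "metric": "fluency", "score": 3.0},
    {"experiment": "exp-b", "metric": "groundedness", "score": 0.0},
]


def summarize(score_rows):
    """Return {experiment: {metric: mean score}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in score_rows:
        grouped[row["experiment"]][row["metric"]].append(row["score"])
    return {
        experiment: {metric: mean(scores) for metric, scores in metrics.items()}
        for experiment, metrics in grouped.items()
    }


summary = summarize(rows)
for experiment, metrics in summary.items():
    print(experiment, metrics)
```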

By following this process, you'll not only maximize your GenAI solution's performance but also identify anti-patterns and system-level design improvements early on. This proactive approach is far more efficient than discovering issues after deployment.

[Figure: experimentation and evaluation process workflow]

[!NOTE] To add automation to your experimentation workflow, see the Vertex AI Prompt Optimizer.

📏 Architecture

The following diagram depicts the architecture of the Gemini Evaluations Playbook. The architecture leverages:

- Vertex Generative AI Evaluation service for running evaluations
- Google BigQuery for logging prompts, experiments, and eval runs

[Figure: Evals Playbook architecture]

🧩 Key Features

The Gemini Evaluations Playbook (referred to as the Evals Playbook) provides the following key features:

✅ Define, track, and compare experiments: Define a hierarchical structure of tasks, experiments, and evaluation runs to systematically organize and track your evaluation efforts.
✅ Log evaluation results with prompts and responses: Manage and log experiment configurations and results to BigQuery, enabling comprehensive analysis.
✅ Customize evaluation runs: Customize evaluations by configuring prompt templates, generation parameters, safety settings, and evaluation metrics to match your specific use case.
✅ Comprehensive metrics: Track a range of built-in and custom metrics to gain a holistic understanding of model performance.
✅ Iterative refinement: Use insights from evaluations to iteratively refine prompts, model configurations, and fine-tuning to achieve optimal outcomes.
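
The hierarchy of tasks, experiments, and evaluation runs described above can be pictured with a small data model. This is only a sketch under assumed names: the `Task`, `Experiment`, and `EvalRun` classes and their fields are hypothetical and do not mirror the classes in `utils/evals_playbook.py` or the BigQuery schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class EvalRun:
    """One evaluation run with its aggregate metric values."""

    run_id: str
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class Experiment:
    """One configuration under test; it may be evaluated several times."""

    experiment_id: str
    description: str
    runs: List[EvalRun] = field(default_factory=list)


@dataclass
class Task:
    """A use case (for example, summarization) that groups experiments."""

    task_id: str
    experiments: List[Experiment] = field(default_factory=list)

    def best_run(self, metric: str) -> Optional[Tuple[str, str]]:
        """Return (experiment_id, run_id) with the highest value for `metric`."""
        candidates = [
            (run.metrics[metric], experiment.experiment_id, run.run_id)
            for experiment in self.experiments
            for run in experiment.runs
            if metric in run.metrics
        ]
        if not candidates:
            return None
        _, experiment_id, run_id = max(candidates)
        return experiment_id, run_id


task = Task(
    task_id="summarization",
    experiments=[
        Experiment("exp-a", "baseline prompt", [EvalRun("run-1", {"fluency": 3.8})]),
        Experiment("exp-b", "few-shot prompt", [EvalRun("run-1", {"fluency": 4.4})]),
    ],
)
print(task.best_run("fluency"))
```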

🏁 Getting Started

STEP 1. Clone the repository

git clone https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && cd applied-ai-engineering-samples/genai-on-vertex-ai/gemini/evals_playbook

STEP 2. Prepare your environment

Start with the 0_gemini_evals_playbook_setup notebook to install the required libraries (using Poetry) and configure the necessary resources on Google Cloud. This includes setting up a BigQuery dataset and saving configuration parameters.
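
As a rough illustration of saving configuration parameters once and reusing them later, the sketch below round-trips a few values through an INI file with Python's standard `configparser`. The section name `evals` and the keys shown are assumptions; the `config.ini` written by the setup notebook may use different names.

```python
import configparser
import os
import tempfile


def save_config(path, params):
    """Write parameters to an INI file under a single (hypothetical) section."""
    parser = configparser.ConfigParser()
    parser["evals"] = params  # values must be strings
    with open(path, "w") as handle:
        parser.write(handle)


def load_config(path):
    """Read the parameters back as a plain dict of strings."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser["evals"])


# Placeholder values for illustration only.
params = {
    "project_id": "my-project",
    "location": "us-central1",
    "bq_dataset_id": "gemini_evals",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.ini")
    save_config(config_path, params)
    loaded = load_config(config_path)

print(loaded)
```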

STEP 3. Experiment, evaluate, and analyze

Run the 1_gemini_evals_playbook_evaluate notebook to design experiments, assess model performance on your generative AI tasks, and analyze evaluation results including side-by-side comparison of results across different experiments and runs.
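
A side-by-side comparison can be as simple as lining up the metrics two runs share and reporting the difference. The helper below is a sketch, not the notebook's implementation; `compare_runs`, the metric names, and the scores are made up for illustration.

```python
def compare_runs(baseline, candidate):
    """Return (metric, baseline, candidate, delta) rows for shared metrics."""
    rows = []
    for metric in sorted(set(baseline) & set(candidate)):
        delta = round(candidate[metric] - baseline[metric], 4)
        rows.append((metric, baseline[metric], candidate[metric], delta))
    return rows


# Hypothetical aggregate metrics from two runs.
run_a = {"rouge_l": 0.41, "fluency": 4.0}
run_b = {"rouge_l": 0.47, "fluency": 3.5, "safety": 1.0}

comparison = compare_runs(run_a, run_b)
for metric, base, cand, delta in comparison:
    print(f"{metric:<10} {base:>6} {cand:>6} {delta:>+8}")
```

Metrics present in only one run (here `safety`) are left out, so every delta compares like with like.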

Run the 2_gemini_evals_playbook_gridsearch notebook to systematically explore different experiment configurations by testing various prompt templates, model settings (like temperature), or combinations of these using a grid-search style approach.
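
The idea behind a grid-search style approach is to enumerate every combination of the options under test. A minimal sketch using `itertools.product` follows; the search space and the `expand_grid` helper are illustrative assumptions and are not taken from the notebook.

```python
from itertools import product

# Hypothetical search space: two prompt templates and three temperatures.
search_space = {
    "prompt_template": [
        "Summarize: {text}",
        "Summarize in one sentence: {text}",
    ],
    "temperature": [0.0, 0.5, 1.0],
}


def expand_grid(space):
    """Return one dict per combination of the option values in `space`."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]


configs = expand_grid(search_space)
for index, config in enumerate(configs):
    print(index, config)
```

Each resulting dictionary would correspond to one experiment configuration to evaluate. The number of combinations grows multiplicatively with each added option, which matters when checking quotas.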

🧬 Repository Structure

.
├── bigquery_sqls
│   └── evals_bigquery.sql
├── docs
├── notebooks
│   ├── 0_gemini_evals_playbook_setup.ipynb
│   ├── 1_gemini_evals_playbook_evaluate.ipynb
│   └── 2_gemini_evals_playbook_gridsearch.ipynb
├── utils
│   ├── config.py
│   └── evals_playbook.py
├── config.ini
└── pyproject.toml

Navigating the repository structure:

- [`/bigquery_sqls/evals_bigquery.sql`](/bigquery_sqls/evals_bigquery.sql): SQL queries to create BigQuery datasets and tables
- [`/notebooks`](/notebooks): Notebooks demonstrating the usage of the Evals Playbook
- [`/utils`](/utils): Utility and helper functions for running the notebooks
- [`/config.ini`](/config.ini): Saved configuration parameters created in [0_gemini_evals_playbook_setup](/notebooks/0_gemini_evals_playbook_setup.ipynb), for reuse across notebooks
- [`/docs`](/docs): Documentation explaining key concepts

📄 Documentation

🚧 Quotas and limits

Verify you have sufficient quota to run experiments and evaluations:

- BigQuery quotas
- Vertex AI Gemini quotas

🪪 License

Distributed with the Apache-2.0 license.

Also contains code derived from the following third-party packages:

* Python
* pandas
* LLM Comparator

🙋 Getting Help

If you have any questions or find any problems with this repository, please report them through GitHub issues.