Vertex AI: Gemini Evaluations Playbook
Experiment, Evaluate & Analyze model performance for your use cases
Overview
The Gemini Evaluations Playbook provides recipes to streamline the experimentation and evaluation of Generative AI models for your use cases using the Vertex Generative AI Evaluation service. This enables you to track and align model performance with your objectives, while providing insights to optimize the model under different conditions and configurations.
Experimentation and evaluation workflow
Prompting strategies and best practices are essential for getting started with Gemini, but they're only the first step. To ensure your Generative AI solution with Gemini delivers repeatable and scalable performance, you need a systematic experimentation and evaluation process. This involves meticulous tracking of each experimental configuration, including prompt templates (system instructions, context, and few-shot learning examples), and model parameters like temperature and max output tokens.
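For illustration, one lightweight way to capture such a configuration is as a plain record that is logged with every run; the field names and values below are assumptions for this sketch, not the playbook's actual schema.

```python
# Illustrative sketch only: one record per experiment so every run is
# reproducible and comparable. Field names are hypothetical.
experiment_config = {
    "experiment_id": "summarization-exp-001",
    "model": "gemini-1.5-pro",
    "system_instruction": "You are a concise, factual summarizer.",
    "prompt_template": "Summarize the following text:\n{text}",
    "num_few_shot_examples": 2,
    "generation_config": {"temperature": 0.2, "max_output_tokens": 256},
}
```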
Your evaluation should report granular metrics for each experiment, not just an overall result for the entire evaluation exercise.
By following this process, you'll not only maximize your GenAI solution's performance but also identify anti-patterns and system-level design improvements early on. This proactive approach is far more efficient than discovering issues after deployment.
[!NOTE] Refer to the Vertex AI Prompt Optimizer for adding automation to your experimentation workflow.
Architecture
The following diagram depicts the architecture of the Gemini Evaluations Playbook. The architecture leverages:
- Vertex Generative AI Evaluation service for running evaluations
- Google BigQuery for logging prompts, experiments, and eval runs
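As a rough sketch of that flow (not the playbook's own code), an evaluation can be run with the `vertexai.evaluation` SDK and its summary metrics written to a BigQuery table. The project ID, experiment name, and table below are placeholders, and the target table is assumed to already exist with a matching schema.

```python
import json

import pandas as pd
import vertexai
from google.cloud import bigquery
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

# A tiny evaluation dataset; real datasets hold many prompt/reference pairs.
eval_dataset = pd.DataFrame({
    "prompt": ["Summarize: The quick brown fox jumps over the lazy dog."],
    "reference": ["A fox jumps over a dog."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum", "fluency"],  # built-in metric names
    experiment="evals-playbook-demo",    # placeholder Vertex AI experiment name
)
result = eval_task.evaluate(model=GenerativeModel("gemini-1.5-pro"))

# Log the run so it can be compared against other experiments later.
bq = bigquery.Client(project="your-project-id")
bq.insert_rows_json(
    "your-project-id.gemini_evals_playbook.eval_runs",  # placeholder table
    [{
        "experiment": "evals-playbook-demo",
        "summary_metrics": json.dumps(result.summary_metrics),
    }],
)
```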
Key Features
The Gemini Evaluations Playbook (referred to as the Evals Playbook) provides the following key features:
- Define, track and compare experiments: Define and track a hierarchical structure of tasks, experiments, and evaluation runs to systematically organize and track your evaluation efforts.
- Log evaluation results with prompts and responses: Manage and log experiment configurations and results to BigQuery, enabling comprehensive analysis.
- Customize evaluation runs: Customize evaluations by configuring prompt templates, generation parameters, safety settings, and evaluation metrics to match your specific use case.
- Comprehensive Metrics: Track a range of built-in and custom metrics to gain a holistic understanding of model performance.
- Iterative refinement: Analyze insights from evaluation to iteratively refine prompts, model configurations, and fine-tuning to achieve optimal outcomes.

Getting Started
STEP 1. Clone the repository
```sh
git clone https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && cd applied-ai-engineering-samples/genai-on-vertex-ai/gemini/evals_playbook
```
STEP 2. Prepare your environment
Start with the 0_gemini_evals_playbook_setup notebook to install required libraries (using Poetry) and configure the necessary resources on Google Cloud. This includes setting up a BigQuery dataset and saving configuration parameters.
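Conceptually, the setup step boils down to something like the following sketch (the notebook's actual code differs; the dataset name and config keys here are assumptions):

```python
import configparser

from google.cloud import bigquery

PROJECT_ID = "your-project-id"
DATASET_ID = "gemini_evals_playbook"  # hypothetical dataset name

# Create the BigQuery dataset used for logging prompts, experiments and runs.
client = bigquery.Client(project=PROJECT_ID)
client.create_dataset(f"{PROJECT_ID}.{DATASET_ID}", exists_ok=True)

# Persist parameters so later notebooks can reuse them from config.ini.
config = configparser.ConfigParser()
config["DEFAULT"] = {"project_id": PROJECT_ID, "bq_dataset": DATASET_ID}
with open("config.ini", "w") as f:
    config.write(f)
```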
STEP 3. Experiment, evaluate, and analyze
Run the 1_gemini_evals_playbook_evaluate notebook to design experiments, assess model performance on your generative AI tasks, and analyze evaluation results, including side-by-side comparison of results across different experiments and runs.
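One way such a side-by-side comparison can be produced, assuming metrics were logged to BigQuery with per-experiment rows (the table and column names below are hypothetical), is to pivot the logged scores so each experiment becomes a column:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
sql = """
SELECT experiment, metric_name, AVG(metric_value) AS mean_score
FROM `your-project-id.gemini_evals_playbook.eval_runs`   -- hypothetical table
GROUP BY experiment, metric_name
"""
df = client.query(sql).to_dataframe()

# Pivot so each experiment becomes a column for easy side-by-side reading.
print(df.pivot(index="metric_name", columns="experiment", values="mean_score"))
```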
STEP 4. Optimize with grid search
Run the 2_gemini_evals_playbook_grid_search notebook to systematically explore different experiment configurations by testing various prompt templates or model settings (like temperature), or combinations of these using a grid-search style approach.
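The idea behind the grid search is simply to enumerate every combination of the swept parameters and evaluate each one as its own run, roughly like this sketch (the `evaluate_experiment` helper is hypothetical and stands in for whatever the notebook does per combination):

```python
from itertools import product

# Hypothetical sweep values; in practice these come from your experiment design.
prompt_templates = [
    "Summarize the following text:\n{text}",
    "Write a one-sentence summary of:\n{text}",
]
temperatures = [0.0, 0.4, 0.8]

for i, (template, temperature) in enumerate(product(prompt_templates, temperatures)):
    run_name = f"run-{i:02d}-temp-{temperature}"
    print(f"{run_name}: template #{prompt_templates.index(template)}, "
          f"temperature {temperature}")
    # evaluate_experiment(template, temperature)  # hypothetical helper: runs one
    #                                             # evaluation and logs the results
```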
𧬠Repository Structure¢
```
.
├── bigquery_sqls
│   └── evals_bigquery.sql
├── docs
├── notebooks
│   ├── 0_gemini_evals_playbook_setup.ipynb
│   ├── 1_gemini_evals_playbook_evaluate.ipynb
│   └── 2_gemini_evals_playbook_gridsearch.ipynb
├── utils
│   ├── config.py
│   └── evals_playbook.py
├── config.ini
└── pyproject.toml
```
Navigating repository structure
- [`/evals_bigquery.sql`](/utils/evals_bigquery.sql): SQL queries to create BigQuery datasets and tables
- [`/notebooks`](/notebooks): Notebooks demonstrating the usage of the Evals Playbook
- [`/utils`](/utils): Utility or helper functions for running the notebooks
- [`/config.ini`](/config.ini): Saves and reuses configuration parameters created in [0_gemini_evals_playbook_setup](/notebooks/0_gemini_evals_playbook_setup.ipynb)
- [`/docs`](/docs): Documentation explaining key concepts

Documentation
Quotas and limits
Verify you have sufficient quota to run experiments and evaluations:
- BigQuery quotas
- Vertex AI Gemini quotas
License
Distributed under the Apache-2.0 license.
Also contains code derived from the following third-party packages:
* Python
* pandas
* LLM Comparator
Getting Help
If you have any questions or find any problems with this repository, please report them through GitHub issues.