# Document Question Answering
## Eval Recipe for model migration
This Eval Recipe demonstrates how to compare the performance of a document question answering prompt on Gemini 1.0 and Gemini 2.0 using the Vertex AI Evaluation Service.
- Use case: answer questions based on information from the given document.
- The Evaluation Dataset is based on SQuAD2.0. It includes 6 documents stored as plain text files, and a JSONL file that provides the ground truth labels: `dataset.jsonl` (an example record is shown after this list). Each record in this file includes 3 attributes:
    - `document_path`: relative path to the plain text document file
    - `question`: the question that we want to ask about this particular document
    - `reference`: the expected correct answer, or the special code `ANSWER_NOT_FOUND`, which is used to verify that the model does not hallucinate answers when the document does not provide enough information to answer the given question.
- Prompt Template is a zero-shot prompt located in `prompt_template.txt` with two prompt variables (`document` and `question`) that are automatically populated from our dataset; a sketch of such a template follows this list.
- Python script `eval.py` configures the evaluation (a sketch of this flow also follows the list):
    - `run_eval`: configures the evaluation task, runs it on the 2 models, and prints the results.
    - `load_dataset`: loads the dataset, including the contents of all documents.
- Shell script `run.sh` installs the required Python libraries and runs `eval.py`.
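For reference, records in `dataset.jsonl` follow the format below. The document names, questions, and answers shown here are purely illustrative placeholders, not the actual contents of the recipe's dataset; only the three attribute names come from the recipe.

```json
{"document_path": "documents/solar_system.txt", "question": "How many planets orbit the Sun?", "reference": "Eight planets orbit the Sun."}
{"document_path": "documents/solar_system.txt", "question": "Who discovered the first exoplanet?", "reference": "ANSWER_NOT_FOUND"}
```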
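The exact wording of `prompt_template.txt` is not reproduced here, but a zero-shot template with the two variables described above could look roughly like this; the `{document}` and `{question}` placeholders are filled in from the matching dataset columns.

```
Answer the question using only information from the document below.
If the document does not contain enough information to answer, respond with ANSWER_NOT_FOUND.

Document:
{document}

Question:
{question}
```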
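To make the flow of `eval.py` concrete, here is a minimal sketch of how `load_dataset` and `run_eval` could be wired together with the Vertex AI Evaluation Service SDK. The metric choice, region, file handling, and function signatures are assumptions made for illustration; the actual script in the repository is the source of truth.

```python
import json
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel


def load_dataset(dataset_path: str = "dataset.jsonl") -> pd.DataFrame:
    """Loads dataset.jsonl and inlines the text of each referenced document."""
    with open(dataset_path) as f:
        records = [json.loads(line) for line in f]
    for record in records:
        with open(record["document_path"]) as doc:
            # Populates the {document} prompt variable for this record.
            record["document"] = doc.read()
    return pd.DataFrame(records)


def run_eval(project: str, baseline_model: str, candidate_model: str, experiment_name: str) -> None:
    """Runs the same evaluation task on both models and prints summary metrics."""
    vertexai.init(project=project, location="us-central1")  # region is an assumption
    with open("prompt_template.txt") as f:
        prompt_template = f.read()

    task = EvalTask(
        dataset=load_dataset(),
        # Illustrative metric; the recipe may use a different metric configuration.
        metrics=[MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY],
        experiment=experiment_name,
    )
    for model_name in (baseline_model, candidate_model):
        result = task.evaluate(
            model=GenerativeModel(model_name),
            prompt_template=prompt_template,
        )
        print(model_name, result.summary_metrics)
```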
## How to run this Eval Recipe
- Configure your Google Cloud environment and clone this GitHub repo to your environment. We recommend Cloud Shell or Vertex AI Workbench.

  ```
  git clone --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.git && \
  cd applied-ai-engineering-samples && \
  git sparse-checkout init && \
  git sparse-checkout set genai-on-vertex-ai/gemini/model_upgrades && \
  git pull origin main
  ```
- Navigate to the Eval Recipe directory in the terminal, set your Google Cloud Project ID, and run the shell script `run.sh`.

  ```
  cd genai-on-vertex-ai/gemini/model_upgrades/document_qna/vertex_script
  export PROJECT_ID="[your-project-id]"
  ./run.sh
  ```
- The resulting metrics will be displayed in the script output.
- You can use Vertex AI Experiments to view the history of evaluations for each experiment, including the final metric scores; a sketch of how to query past runs is shown below.
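As a rough sketch (not part of the recipe), past evaluation runs can be pulled from Vertex AI Experiments with the Vertex AI SDK; the experiment name below is a placeholder and must match the `experiment_name` configured in `eval.py`.

```python
from google.cloud import aiplatform

# Placeholder project ID and experiment name; the region is an assumption.
aiplatform.init(
    project="[your-project-id]",
    location="us-central1",
    experiment="document-qna-eval",
)

# Returns a pandas DataFrame with one row per run, including metric columns.
runs_df = aiplatform.get_experiment_df()
print(runs_df.head())
```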
## How to customize this Eval Recipe
- Edit the Python script `eval.py` (see the sketch after this list):
    - set the `project` parameter of `vertexai.init` to your Google Cloud Project ID
    - set the parameter `baseline_model` to the model that is currently used by your application
    - set the parameter `candidate_model` to the model that you want to compare with your current model
    - configure a unique `experiment_name` for tracking purposes
- Replace the contents of `dataset.jsonl` with your custom data in the same format.
- Replace the contents of `prompt_template.txt` with your custom prompt template. Make sure that the prompt template variables map to the dataset attributes.
- Please refer to our documentation if you want to further customize your evaluation. Vertex AI Evaluation Service has a lot of features that are not included in this recipe.
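As a rough illustration of the four settings above, the configurable part of `eval.py` might look something like the snippet below, using the hypothetical `run_eval` signature from the sketch earlier in this page. The variable names and model IDs are placeholders; adjust them to match the actual script.

```python
# Hypothetical configuration block; adjust names to match the actual eval.py.
PROJECT_ID = "my-gcp-project"              # passed to vertexai.init(project=...)
BASELINE_MODEL = "gemini-1.0-pro"          # model currently used by your application
CANDIDATE_MODEL = "gemini-2.0-flash"       # model you want to compare against
EXPERIMENT_NAME = "document-qna-eval-v2"   # unique name for tracking in Vertex AI Experiments

run_eval(
    project=PROJECT_ID,
    baseline_model=BASELINE_MODEL,
    candidate_model=CANDIDATE_MODEL,
    experiment_name=EXPERIMENT_NAME,
)
```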