# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Game Review Analysis Workflow with Vertex AI Extensions¶
Author(s) | Meltem Subasioglu |
Reviewers(s) | Yan Sun, Michael Sherman |
Last updated | 2024-04-21: Documentation Changes |
Overview¶
Vertex AI Extensions is a platform for creating and managing extensions that connect large language models to external systems via APIs. These external systems can provide LLMs with real-time data and perform data processing actions on their behalf.
In this tutorial, you'll use Vertex AI Extensions to complete a review analysis of a Steam game:
- Retrieve 50 reviews about the game from Steam
- Create a pre-built Code Interpreter extension in your project
- Use Code Interpreter to analyze the reviews and generate plots
- Retrieve 10 websites with more detailed reviews on the game
- Create and use the Vertex AI Search extension to research and summarize the website reviews
- Use Code Interpreter to build a report with all the generated assets
- [Optional]: Convert the report to PDF and upload to your Google Drive
- [Optional]: Send the PDF Report as an attachment via Gmail
▶ If you're already familiar with Google Cloud and the Vertex AI Extensions Code Interpreter Extension, you can skip reading between here and the "Getting Started" section.
Vertex AI Extensions¶
Vertex AI Extensions is a platform for creating and managing extensions that connect large language models to external systems via APIs. These external systems can provide LLMs with real-time data and perform data processing actions on their behalf. You can use pre-built or third-party extensions in Vertex AI Extensions.
Vertex AI Extensions Code Interpreter Extension¶
The Code Interpreter extension provides access to a Python interpreter with a sandboxed, secure execution environment that can be used with any model in the Vertex AI Model Garden. This extension can generate and execute code in response to a user query or workflow. It allows the user or LLM agent to perform various tasks such as data analysis and visualization on new or existing data files.
You can use the Code Interpreter extension to:
- Generate and execute code.
- Perform a wide variety of mathematical calculations.
- Sort, filter, select the top results, and otherwise analyze data (including data acquired from other tools and APIs).
- Create visualizations, plot charts, draw graphs, shapes, print results, etc.
Vertex AI Extensions Search Extension¶
The Vertex AI Search extension lets you access and search website corpuses and unstructured data to provide relevant responses to natural language questions, such as:
- "How did the competitive threats for the company change from Q1 of last year to Q1 of this year?"
- "What parts of the company are growing the fastest? How fast?"
Using this Notebook¶
If you're running outside of Colab, depending on your environment you may need to install pip packages that are included in the Colab environment by default but are not part of the Python Standard Library. Outside of Colab you'll also notice comments in code cells that look like #@something; these trigger special Colab functionality but don't change the behavior of the notebook.
This tutorial uses the following Google Cloud services and resources:
- Service Usage API
- Vertex AI Extensions
- Vertex AI Agent Builder
- Discovery Engine
- Google Cloud Storage Client
- Google Drive API Client
- Gmail API Client
This notebook has been tested in the following environment:
- Python version = 3.10.12 & 3.12.0
- google-cloud-aiplatform version = 1.47.0
- google-cloud-discoveryengine version = 0.11.11
Note: Vertex AI Extensions requires google-cloud-aiplatform version >= 1.47.0
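If you're not sure which SDK version is installed in your environment, a quick check (a minimal sketch; run it after the install step below) is:
import google.cloud.aiplatform as aiplatform
# Print the installed SDK version; it should be >= 1.47.0.
print(aiplatform.__version__)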
🗒 Please note: the optional section near the end of this notebook shows how to use Google's Workspace APIs to save a PDF report to your Google Drive and to send an email with the attached PDF. Using the Workspace APIs requires setting up an OAuth consent screen and going through a web-based authentication flow. Many remote notebook environments, including Colab and JupyterLab, don't support this out-of-the-box. If you want to run through the optional section, make sure you are running this notebook in an environment that can open a webpage that you can interact with, like a local development environment.
Useful Tips¶
1. This notebook uses Generative AI capabilities. Re-running a cell that uses Generative AI capabilities may produce similar but not identical results.
2. Because of #1, it is possible that an output from Code Interpreter produces errors. If that happens, re-run the cell that produced the coding error; the newly generated code will likely be bug free. The run_code_interpreter method below helps automate this, but you still may need to rerun cells that generate working code that doesn't perfectly follow the instructions in the prompt.
3. The use of Extensions and other Generative AI capabilities is subject to service quotas. Running the notebook using "Run All" may exceed your queries per minute (QPM) limitations. Run the notebook manually, and if you get a quota error, pause for up to 1 minute before retrying that cell. Code Interpreter defaults to Gemini on the backend and is subject to the Gemini quotas, view your Gemini quotas here.
4. The Code Interpreter extension is stateless: each request has no knowledge of previous operations or of files ingested or produced in previous steps. Therefore, every request to Code Interpreter must include all the files and instructions it needs to complete successfully.
5. Code Interpreter runs in a sandbox environment, so try to avoid prompts that need additional Python packages to run, or prompt Code Interpreter to ignore anything that needs packages beyond the built-in ones.
6. Tell Code Interpreter to catch and print any exceptions for you, and to suppress UserWarnings and FutureWarnings.
7. For debugging the output of Code Interpreter, it usually helps to copy the error message into the prompt and tell Code Interpreter to properly handle that error.
You can take a look at this section as an example for points 5-7.
Getting Started¶
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here.
Google Cloud Project Setup¶
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs.
- Make sure that billing is enabled for your project.
- Enable the Service Usage API
- Enable the Cloud Storage API.
- Enable the Vertex AI API.
- Enable the Agent Builder API
- Enable the Discovery Engine API for your project
- [Optional Section] Enable the Google Drive API.
- [Optional Section] Enable the Gmail API.
Google Cloud Permissions¶
To run the complete Notebook, including the optional section, you will need to have the Owner role for your project.
If you want to skip the optional section, you need at least the following roles:
- roles/serviceusage.serviceUsageAdmin to enable APIs
- roles/iam.serviceAccountAdmin to modify service agent permissions
- roles/discoveryengine.admin to modify Discovery Engine assets
- roles/aiplatform.user to use Vertex AI components
- roles/storage.objectAdmin to modify and delete GCS buckets
Install Vertex AI SDK and Other Required Packages¶
!pip install google-cloud-aiplatform --upgrade
# Note -- this may not work in some non-Colab environments. If you get errors
# when running 'import vertexai' below, you'll need to find another way to
# install the latest google-cloud-aiplatform package into your notebook kernel.
# In some kernel setups running "%pip install google-cloud-aiplatform --upgrade"
# in a code cell works if "!pip install ...." doesn't. This may apply to other
# package installations as well.
!pip install xhtml2pdf
!pip install google-cloud-discoveryengine --upgrade
## If you're running outside of colab, make sure to install the following modules as well:
!pip install pandas
!pip install google
!pip install google-api-python-client
!pip install google-oauth
!pip install google-auth-oauthlib
Restart Runtime¶
To use the newly installed packages in this notebook, you may need to restart the runtime. You can do this by running the cell below, which restarts the current kernel.
You may see the restart reported as a crash, but it is working as intended -- you are merely restarting the runtime.
The restart might take a minute or longer. After it's restarted, continue to the next step.
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Authenticate (Colab)¶
If you're running this notebook in Colab, the following cell authenticates your user account:
import sys

from google.auth import default

if "google.colab" in sys.modules:
    from google.colab import auth as google_auth

    google_auth.authenticate_user()
    creds, _ = default()
Authenticate (Outside Colab)¶
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. More authentication options are discussed here.
Once the Google Cloud CLI is properly installed on your system, follow the instructions in the next cells to set up your ADC.
Setting up Application Default Credentials¶
Outside of Colab, you can authenticate through Google Cloud via Application Default Credentials. It is recommended that you set up a new configuration to run this notebook.
To do so, open a terminal and run:
$ gcloud config configurations create CONFIG_NAME
This creates a new config with the specified name.
💡 NOTE: You can list all available configurations by running
$ gcloud config configurations list
💡
The configuration should be activated automatically. Next, login with your account by running
$ gcloud auth login EMAIL_ADDRESS
Use the email address of your Google Cloud Project Account.
Then, set your project:
$ gcloud config set project PROJECT_ID
You will possibly get a warning that the active project doesn't match the quota project. To change this, run:
$ gcloud auth application-default set-quota-project PROJECT_ID
When asked whether the cloudresourcemanager.googleapis.com API should be enabled, confirm with Y.
Finally, create the application default credentials:
$ gcloud auth application-default login
Your ADC is all set now. Fetch your credentials by running the next cell:
from google.auth import default
creds, _ = default()
Set Google Cloud Project Information and Initialize the Vertex AI SDK¶
To get started using Vertex AI, you must have an existing Google Cloud project and enable all the APIs mentioned in the 'Getting Started' section of this notebook.
Learn more about setting up a project and a development environment.
Make sure to change PROJECT_ID
in the next cell. You can leave the values for REGION
and API_ENV
unless you have a specific reason to change them.
import vertexai
PROJECT_ID = "YOUR_PROJECT_ID" # @param {type:"string"}
REGION = "us-central1" # @param {type: "string"}
API_ENV = "aiplatform.googleapis.com" # @param {type:"string"}
vertexai.init(
project=PROJECT_ID,
location=REGION,
api_endpoint=f"{REGION}-{API_ENV}",
)
Create a Google Cloud Storage Bucket¶
You will need a GCS bucket. For the scope of this notebook, you will create a bucket by running the cells below.
# @markdown Select a **unique** name for your bucket
GCS_BUCKET = "my_test_bucket123456" # @param {type:"string"}
The next cell creates your GCS bucket with the specified name:
from google.cloud import storage
# Create a client object.
client = storage.Client(project=PROJECT_ID)
# Create the bucket.
bucket = client.create_bucket(GCS_BUCKET)
print(f"Bucket {GCS_BUCKET} created successfully.")
Using Vertex AI Extensions to Analyze Game Reviews - Tutorial¶
Step 1: Create a Code Interpreter Extension¶
Now you can create the extension. The following cell uses the Python SDK to import the extension (thereby creating it) into Vertex AI Extensions.
from vertexai.preview import extensions
extension_code_interpreter = extensions.Extension.from_hub("code_interpreter")
extension_code_interpreter
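You don't have to create a new instance on every run. If you imported a Code Interpreter extension in an earlier session, you can list your existing extensions and reuse one by passing its resource name to extensions.Extension (the ID below is a hypothetical placeholder):
# List existing extension instances in the project (also used in the cleanup section).
print(extensions.Extension.list())

# To reuse an existing instance instead of creating a new one:
# extension_code_interpreter = extensions.Extension("YOUR_EXTENSION_ID")  # hypothetical ID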
Code Interpreter Helper Functions¶
These functions make it easier to inspect Code Interpreter's output, assemble Code Interpreter requests, and run generated code.
process_response¶
process_response displays the generated code and any output files, shows the output from code execution, surfaces code execution errors, and saves output files.
If the output of process_response looks strange, try making your notebook window wider -- this will help keep the HTML layout organized.
To use this functionality, call process_response(response), where response is the Code Interpreter response object.
import base64
import json
import pprint
import pandas
import sys
import IPython
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
css_styles = """
<style>
.main_summary {
font-weight: bold;
font-size: 14px; color: #4285F4;
background-color:rgba(221, 221, 221, 0.5); padding:8px;}
.main_summary:hover {background-color: rgba(221, 221, 221, 1);}
details {
background-color:#fff;
border: 1px solid #E8EAED;
padding:0px;
margin-bottom:2px; }
details img {width:50%}
details > div {padding:10px; }
div#left > * > div {
overflow:auto;
max-height:400px; }
div#right > pre {
overflow:auto;
max-height:600px;
background-color: ghostwhite;
padding: 10px; }
details details > div { overflow: scroll; max-height:400px}
details details {
background-color:rgba(246, 231, 217, 0.2);
border: 1px solid #FBBC04;}
details details > summary {
padding: 8px;
background-color:rgba(255, 228, 196, 0.6); }
details details > summary:hover { background-color:rgba(255, 228, 196, 0.9); }
div#left {width: 64%; padding:0 1%; }
div#right {
border-left: 1px solid silver;
width: 30%;
float: right;
padding:0 1%; }
body {color: #000; background-color: white; padding:10px 10px 40px 10px; }
#main { border: 1px solid #FBBC04; padding:10px 0; display: flow-root; }
h3 {color: #000; }
code { font-family: monospace; color: #900; padding: 0 2px; font-size: 105%; }
</style>
"""
# Parser to visualise the content of returned files as HTML.
def parse_files_to_html(outputFiles, save_files_locally=True):
    IMAGE_FILE_EXTENSIONS = set(["jpg", "jpeg", "png"])
    file_list = []
    details_tml = """<details><summary>{name}</summary><div>{html_content}</div></details>"""

    if not outputFiles:
        return "No Files generated from the code"

    # Sort output_files so images are displayed before other files such as JSON.
    for output_file in sorted(
        outputFiles,
        key=lambda x: x["name"].split(".")[-1] not in IMAGE_FILE_EXTENSIONS,
    ):
        file_name = output_file.get("name")
        file_contents = base64.b64decode(output_file.get("contents"))
        if save_files_locally:
            open(file_name, "wb").write(file_contents)

        if file_name.split(".")[-1] in IMAGE_FILE_EXTENSIONS:
            # Render Image
            file_html_content = ('<img src="data:image/png;base64, '
                                 f'{output_file.get("contents")}" />')
        elif file_name.endswith(".json"):
            # Pretty print JSON
            json_pp = pprint.pformat(
                json.loads(file_contents.decode()),
                compact=False,
                width=160)
            file_html_content = f'<span>{json_pp}</span>'
        elif file_name.endswith(".csv"):
            # CSV
            csv_md = pandas.read_csv(
                StringIO(file_contents.decode())).to_markdown(index=False)
            file_html_content = f'<span>{csv_md}</span>'
        elif file_name.endswith(".pkl"):
            # PKL
            file_html_content = '<span>Preview N/A</span>'
        else:
            file_html_content = f"<span>{file_contents.decode()}</span>"

        file_list.append({'name': file_name, "html_content": file_html_content})

    buffer_html = [details_tml.format(**_file) for _file in file_list]
    return "".join(buffer_html)
# Processing code interpreter response to html visualization.
def process_response(response: dict, save_files_locally=True) -> None:
    result_template = """
      <details open>
        <summary class='main_summary'>{summary}:</summary>
        <div><pre>{content}</pre></div>
      </details>
    """

    result = ""
    code = response.get('generated_code')

    if 'execution_result' in response and response['execution_result'] != "":
        result = result_template.format(
            summary="Executed Code Output",
            content=response.get('execution_result'))
    else:
        result = result_template.format(
            summary="Executed Code Output",
            content="Code does not produce printable output.")

    if response.get('execution_error', None):
        result += result_template.format(
            summary="Generated Code Raised a (Possibly Non-Fatal) Exception",
            content=response.get('execution_error', None))

    result += result_template.format(
        summary="Files Created <u>(Click on filename to view content)</u>",
        content=parse_files_to_html(
            response.get('output_files', []),
            save_files_locally=save_files_locally))

    display(
        IPython.display.HTML(
            (f"{css_styles}"
             f"""
              <div id='main'>
                <div id="right">
                  <h3>Generated Code by Code Interpreter</h3>
                  <pre><code>{code}</code></pre>
                </div>
                <div id="left">
                  <h3>Code Execution Results</h3>
                  {result}
                </div>
              </div>
             """
            )
        )
    )
run_code_interpreter¶
run_code_interpreter eases calling Code Interpreter by encoding files to base64 (a Code Interpreter requirement) and submitting them alongside the instructions. It also automates retries (5 by default) if the generated code doesn't execute or if Code Interpreter fails due to exceeding the Gemini (time-based) quotas. Additionally, run_code_interpreter populates a global CODE_INTERPRETER_WRITTEN_FILES variable to aid with cleaning up files created by Code Interpreter, though this notebook doesn't take advantage of this and implements alternate Code Interpreter output management later.
To use this functionality, call run_code_interpreter(instructions, filenames, retry_num, retry_wait_time), where instructions is the prompt for Code Interpreter and filenames is a list of local files in the working directory to submit to Code Interpreter. Optionally set retry_num to change the default number of retries from 5, and retry_wait_time to change the default 15-second wait between retries.
from time import sleep

global CODE_INTERPRETER_WRITTEN_FILES
CODE_INTERPRETER_WRITTEN_FILES = []

def run_code_interpreter(instructions: str,
                         filenames: list[str] = [],
                         retry_num: int = 5,
                         retry_wait_time: int = 15) -> dict:
    global CODE_INTERPRETER_WRITTEN_FILES

    # Base64-encode all input files, as required by Code Interpreter.
    file_arr = [
        {
            "name": filename,
            "contents": base64.b64encode(open(filename, "rb").read()).decode()
        }
        for filename in filenames
    ]

    attempts = 0
    res = {}
    while attempts <= retry_num:
        attempts += 1
        res = extension_code_interpreter.execute(
            operation_id="generate_and_execute",
            operation_params={
                "query": instructions,
                "files": file_arr
            },
        )
        CODE_INTERPRETER_WRITTEN_FILES.extend(
            [item['name'] for item in res['output_files']])
        if not res.get('execution_error', None):
            return res
        elif attempts <= retry_num:
            print(f"The generated code produced an error {res.get('execution_error')}"
                  f" - Automatic retry attempt # {attempts}/{retry_num}")
            sleep(retry_wait_time)
    return res
Step 2: Use Code Interpreter to Analyze Steam Reviews¶
In this section, you will specify a game title and parse some Steam reviews for the title from store.steampowered.com. Using the Code Interpreter extension, you will then perform automated analysis on the reviews.
#@markdown Specify the name of the game.
game = "Palworld" # @param {type: "string"}
Prepare the Reviews Dataset¶
Now, grab the Steam App ID for the game, if the game is supported on the platform. For this, do a Google Search to retrieve the Steam Game URL, and parse the ID out of the URL.
Note: if you are facing errors with importing googlesearch
, make sure that you don't have any conflicting packages installed. This is the googlesearch module that's installed when running pip install google
.
# Fetch the Steam review URL and the game's App ID.
from googlesearch import search

query = f"{game} steampowered.com "
steam_url = list()
for j in search(query, tld="com", num=1, stop=1, pause=1):
    print("URL: ", j)
    steam_url.append(j)

try:
    steam_url = steam_url[0].split('app/')[1]
    steam_appId = steam_url.split('/')[0]
    print("App ID: ", steam_appId)
except IndexError:
    print("Could not parse the steam ID out of the URL. The game is likely not supported on Steam.")
    steam_appId = None
Now, grab some reviews from Steam. The Steam review page loads content infinitely and doesn't allow paging through results via the URL, so a single request is limited to 10 reviews. To get more than 10 reviews, fetch the reviews using five different filters:
- Top rated reviews of all time
- Trending reviews today
- Trending reviews this week
- Trending reviews this month
- Most recent reviews
This will give us a total of 50 reviews to work with.
import requests
from bs4 import BeautifulSoup
import json
def get_steam_reviews(filter, num_reviews=10):
    """
    Fetches Steam reviews for a given filter and number of reviews.

    Args:
        filter (str): The filter type (e.g., 'toprated', 'trendweek').
        num_reviews (int): The desired number of reviews to fetch. Defaults to 10.

    Returns:
        list: A list of dictionaries, each representing a review with
            'author', 'content', 'rating', 'date', and 'hours_played' keys.
    """
    url = f'https://steamcommunity.com/app/{steam_appId}/reviews/?p=1&browsefilter={filter}'
    print("URL: ", url)
    reviews = []

    # Iterate over reviews until we have num_reviews.
    while len(reviews) < num_reviews:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        review_blocks = soup.find_all('div', class_='apphub_Card')  # Find all review cards.

        for block in review_blocks:
            # Author
            author_block = block.find('div', class_='apphub_CardContentAuthorName')  # Fetch author.
            if author_block:
                author = author_block.text.strip()
            # Rating
            rating_block = block.find('div', class_='title')  # Fetch title.
            if rating_block:
                rating = rating_block.text.strip()
            # Review Content
            content_block = block.find('div', class_='apphub_CardTextContent')  # Fetch content.
            if content_block:
                content = content_block.text.strip()
                # Review Date
                date_block = content_block.find('div', class_='date_posted')  # Fetch date.
                if date_block:
                    date = date_block.text.replace('Posted:', '').strip()
            # Total Hours Played
            hours_block = block.find('div', class_='hours')  # Fetch total hours played.
            if hours_block:
                hours_played = hours_block.text.strip()

            reviews.append({'author': author, 'content': content, 'rating': rating,
                            'date': date, 'hours_played': hours_played})
            if len(reviews) >= num_reviews:
                break

    return reviews
topRated_reviews = get_steam_reviews('toprated')
trendWeek_reviews = get_steam_reviews('trendweek')
trendMonth_reviews = get_steam_reviews('trendmonth')
trendDay_reviews = get_steam_reviews('trendday')
mostRecent_reviews = get_steam_reviews('mostrecent')
Concatenate all the reviews into one single list:
all_reviews = topRated_reviews + trendWeek_reviews + trendMonth_reviews + trendDay_reviews + mostRecent_reviews
Write the reviews into a .csv file so you can parse it with the Code Interpreter extension.
import csv

filename = 'reviews.csv'

with open(filename, 'w', newline='') as csvfile:
    # Determine field names (header row).
    fieldnames = all_reviews[0].keys()
    # Create a DictWriter.
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Write the header.
    writer.writeheader()
    # Write the data rows.
    writer.writerows(all_reviews)
Load the reviews into a pandas DataFrame so you can take a look at the contents and inspect the reviews.
import pandas as pd
df = pd.read_csv('reviews.csv')
df.head(10)
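Before handing the file to Code Interpreter, a quick sanity check (a small sketch) confirms how many reviews were collected and how they are rated:
# Quick sanity check on the collected reviews.
print(len(df), "reviews collected")
print(df["rating"].value_counts())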
Let Code Interpreter Do Its Magic¶
Write a helper function to collect all of the assets created by a Vertex AI extension. This will help later when generating the PDF report and when cleaning up the generated files. The function collects the file names of any generated images from the Code Interpreter extension, as well as the text outputs generated by the Vertex AI Search extension.
output_list = []

def is_string(value):
    return isinstance(value, str)

def grab_outs(response):
    # Check if the response is a string from the Search Extension.
    if is_string(response):
        output_list.append(response)
    # Else it's a dict output from the Code Interpreter Extension.
    else:
        for output_file in response['output_files']:
            output_list.append(output_file["name"])  # Grab the filename from the dict output.
You can call the Vertex AI Code Interpreter Extension to generate plots and graphs on your dataset. You can also ask the Code Interpreter extension to take a look at the dataset for you and generate a few ideas for insightful visualizations. The following cell prompts the Code Interpreter extension to save some plot ideas in the ideas.txt file:
response = run_code_interpreter(instructions=f"""
You are given a dataset of reviews. I want you to come up with some ideas for relevant visualization for this dataset.
Create natural language **instructions** and save them into the file ideas.txt.
Please put your ideas as natural language **instructions** into the file ideas.txt.
Do not generate any plots yourself.
""", filenames= ['reviews.csv'])
process_response(response)
You can view the ideas.txt file by expanding the output.
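Because process_response saves output files to the working directory (save_files_locally=True), you can also print ideas.txt directly; this small sketch assumes the file was created by the previous cell:
# Print the plot ideas that Code Interpreter saved locally.
with open("ideas.txt") as f:
    print(f.read())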
Next, ask Code Interpreter to create a plot by running the next cell. You can also experiment with changing this Code Interpreter prompt to attempt one of the ideas in ideas.txt.
response = run_code_interpreter(instructions=f"""
You are given a dataset of reviews. Create a pie chart showing the following:
- How many ratings have 'recommended' vs 'not recommended'?
Save the plot with a descriptive name.
""", filenames= ['reviews.csv'])
process_response(response)
# Grab the output if it looks good.
grab_outs(response)
Easy peasy. But what if you want to generate a more complex plot with the Code Interpreter extension? You can try that with the next cell:
response = run_code_interpreter(instructions=f"""
You are given a dataset of reviews. The hours_played column contains information on the total hours played, in the format '3,650.6 hrs on record' or '219.6 hrs on record'.
Avoid and handle conversion errors, e.g. 'could not convert string to float: '3,650.6''.
Make a plot that shows the relationship between hours played and the count of the ratings 'Not Recommended'.
Put the hours_played into the different buckets 0-50, 50-100, 100-1000, >1000.
Save the plot with a descriptive name.
Make sure Plots have visible numbers or percentages when applicable, and labels.
Make sure to avoid and handle the error 'Expected value of kwarg 'errors' to be one of ['raise', 'ignore']. Supplied value is 'coerce' '.
Use >>> import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) <<< to avoid any FutureWarnings from pandas.
""", filenames= ['reviews.csv'])
process_response(response)
# Grab the output if it looks good.
grab_outs(response)
Step 3: Use the Vertex AI Search Extension to do a Qualitative Analysis of the Reviews¶
To use the Vertex AI Search extension, grant the Vertex AI Extension Service Agent the permission it needs, either by following the UI instructions below or by running the next cell.
To do so in the UI:
- Go to https://console.cloud.google.com/iam-admin/iam
- Make sure you're in the right project.
- Enable the checkbox Include Google-provided role grants. This will show you the active service accounts in your project.
- Locate the service agent with the name Vertex AI Extension Service Agent.
- Click on the pen icon to edit the roles for this service agent.
- Click on Add another role and add Discovery Engine Editor.
- Save the changes.
Alternatively, run the next cell to assign the role to the Service Agent programmatically:
!gcloud config set project {PROJECT_ID}
%%bash -s "$PROJECT_ID"
# Get project number using gcloud.
PROJECT_NUMBER=$(gcloud projects describe $1 --format="value(projectNumber)")
# Service agent email.
SERVICE_AGENT_EMAIL="service-$PROJECT_NUMBER@gcp-sa-vertex-ex.iam.gserviceaccount.com"
# Role to add.
ROLE="roles/discoveryengine.editor"
# Add the role using gcloud CLI (with the correct service agent email).
gcloud projects add-iam-policy-binding $1 \
--member="serviceAccount:$SERVICE_AGENT_EMAIL" \
--role=$ROLE
Set Up Qualitative Review Dataset¶
Grab some more detailed reviews of the game for qualitative analysis. For this, you can use Google Search to get urls of the top 10 results for the game's reviews.
from googlesearch import search
# Search.
query = f"{game} Reviews"
urls = list()
for j in search(query, tld="com", num=10, stop=10, pause=2):
print(j)
urls.append(j)
We want the Vertex AI Search extension to summarize and to answer questions relating to these reviews.
To do this, we need to ingest the contents we want to search over into a Vertex AI Search data store - no worries, the notebook will guide you through the complete setup in the next sections! 🍀
Vertex AI Search allows you to ingest website URLs directly into a Data Store. However, currently this is only supported through the Google Cloud Console.
To ingest the website contents into a data store right from this notebook, we need to put the contents into a Google Cloud Storage bucket.
In our case, let's retrieve all the text content from the websites and save them in .txt files. Compared to using raw .html files this ensures cleaner results, as we're only interested in the textual information from the review sites and can ditch everything else (including unnecessary images and other content).
The following cell lets you grab the text content from the websites and write them into .txt files. Then, these files will be uploaded to your GCS bucket, following the file name pattern website_text_{idx}.txt
.
import requests
import os
from bs4 import BeautifulSoup
from google.cloud import storage

def url_txt_to_gcs(id, url, filename, bucket_name):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract all text content.
    all_text = soup.get_text(separator='\n', strip=True)

    # Save to .txt file.
    with open(filename, "w", encoding='utf-8') as file:
        file.write(id + "\n" + all_text)

    # Upload.
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    file_path = os.path.join(filename)
    blob.upload_from_filename(file_path)
    print(f"File uploaded to gs://{bucket_name}/{filename}")

# Upload the website content .txt files into GCS.
txt_files = []
for idx, url in enumerate(urls):
    id = "doc-" + str(idx)
    filename = f"website_text_{idx}.txt"
    txt_files.append(f"website_text_{idx}.txt")
    url_txt_to_gcs(id, url, filename, GCS_BUCKET)
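As an optional check (a small sketch using the same storage client pattern), you can list the uploaded .txt files in your bucket before ingesting them:
# List the website .txt files that were uploaded to the bucket.
client = storage.Client(project=PROJECT_ID)
for blob in client.list_blobs(GCS_BUCKET):
    if blob.name.endswith(".txt"):
        print(blob.name)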
Create a Vertex AI Search Data Store and Ingest Your Files¶
The Vertex AI Search extension needs a Data Store and Vertex AI Search App to run. You can learn more about Data Stores and Vertex AI Search Apps here.
Therefore, we need to do the following steps:
- Create a Vertex AI Search data store.
- Ingest our website .txt files into the data store.
- Connect a Vertex AI Search App to the data store.
The following cells will help you with this setup:
# @markdown Specify an id for your datastore. It should only use lowercase letters.
data_store_id = "gamereview-extensions" # @param {type:"string"}
Use the following bash command to ✨create✨ your Vertex AI Search data store:
%%bash -s "$PROJECT_ID" "$data_store_id"
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: $1" \
"https://discoveryengine.googleapis.com/v1alpha/projects/$1/locations/global/collections/default_collection/dataStores?dataStoreId=$2" \
-d '{
"displayName": "GameReview-Extensions-Store",
"industryVertical": "GENERIC",
"solutionTypes": ["SOLUTION_TYPE_SEARCH"],
"contentConfig": "CONTENT_REQUIRED",
}'
🎉 Your data store is all set! You can inspect it under: https://console.cloud.google.com/gen-app-builder/data-stores
Now you just need to ✨ingest✨ your .txt files with the website contents into it by running the cell below.
This process can take somewhere between 5-10 mins. The cell will finish running once the ingestion is done.
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine
from typing import Optional

def import_documents_sample(
    project_id: str,
    location: str,
    data_store_id: str,
    gcs_uri: Optional[str] = None,
) -> str:
    """Imports documents into a Vertex AI data store from GCS.

    This function imports documents into a specified data store within Vertex AI
    Agent Builder from a GCS bucket. It uses the incremental reconciliation
    mode, which adds new documents and updates existing ones.

    Args:
        project_id: The ID of the Google Cloud project.
        location: The region where the data store is located (e.g., "us-central1").
        data_store_id: The ID of the data store.
        gcs_uri: The GCS URI of the documents to import (e.g., "gs://my-bucket/docs/*.txt").

    Returns:
        str: The name of the long-running operation that imports the documents.

    Raises:
        google.api_core.exceptions.GoogleAPICallError: If the API call fails.
    """
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )

    # Create a client.
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    # The full resource name of the search engine branch.
    # e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri], data_schema="content"
        ),
        # Options: `FULL`, `INCREMENTAL`
        reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
    )

    # Make the request.
    operation = client.import_documents(request=request)

    print(f"Waiting for operation to complete: {operation.operation.name}")
    response = operation.result()

    # Once the operation is complete, get information from operation metadata.
    metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

    # Handle the response.
    print(response)
    print(metadata)

    return operation.operation.name

gcs_uri = f"gs://{GCS_BUCKET}/*.txt"  # Grabs all the .txt files we generated.
import_documents_sample(PROJECT_ID, 'global', data_store_id, gcs_uri)
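If you want to confirm the import succeeded (a sketch; it assumes the ingestion operation above has completed), you can list a few documents from the data store branch using the same DocumentServiceClient and branch path:
# Optional: list a few ingested documents to verify the import.
doc_client = discoveryengine.DocumentServiceClient()
branch = doc_client.branch_path(
    project=PROJECT_ID,
    location="global",
    data_store=data_store_id,
    branch="default_branch",
)
for doc in list(doc_client.list_documents(parent=branch))[:3]:
    print(doc.id)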
Connect Data Store to a Vertex AI Search App¶
The following cell lets you create a Vertex AI Search App to ✨connect✨ to your newly created data store. For the Vertex AI Search Extension to work, we need to enable Advanced Features, including Enterprise features by setting "searchTier": "SEARCH_TIER_ENTERPRISE"
and Advanced LLM Features by setting "searchAddOns": ["SEARCH_ADD_ON_LLM"]
in the code cell below.
These settings will be set automatically by running the next cell.
%%bash -s "$PROJECT_ID" "$data_store_id"
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: $1" \
"https://discoveryengine.googleapis.com/v1/projects/$1/locations/global/collections/default_collection/engines?engineId=$2" \
-d '{
"displayName": "game-review-engine",
"dataStoreIds": ["'$2'"],
"solutionType": "SOLUTION_TYPE_SEARCH",
"searchEngineConfig": {
"searchTier": "SEARCH_TIER_ENTERPRISE",
"searchAddOns": ["SEARCH_ADD_ON_LLM"]
}
}'
Set up the Vertex AI Search Extension¶
Your data store and search app are all set. Now you just need to create an instance of the Vertex AI Search Extension by running the cell below.
# Construct an object that points to the relevant data store.
DATASTORE = f"projects/{PROJECT_ID}/locations/global/collections/default_collection/dataStores/{data_store_id}/servingConfigs/default_search"
# Instantiate extension.
extension_vertex_ai_search = extensions.Extension.from_hub(
"vertex_ai_search",
runtime_config={
"vertex_ai_search_runtime_config": {
"serving_config_name": DATASTORE,
}
})
extension_vertex_ai_search
The following is a helper function. You can let Vertex AI Search generate an answer for your prompt directly, but for a more descriptive response you can retrieve the segment matches provided by the search app and let Gemini generate an answer from the segments.
from vertexai.preview.generative_models import GenerativeModel, Part
import vertexai.preview.generative_models as generative_models

model = GenerativeModel("gemini-1.0-pro-001")

def get_vertexSearch_response(QUERY, mode):
    """Queries Vertex AI Search and generates a response using either Vertex AI Search or Gemini.

    This function takes a query and a mode as input. It first sends the query to Vertex AI Search.
    Depending on the specified mode, it either:
    - Returns the extractive answers directly from Vertex AI Search (mode='vertex').
    - Uses the extractive segments from Vertex AI Search as context for Gemini to generate a more
      comprehensive response (mode='gemini').

    Args:
        QUERY: The query string to send to Vertex AI Search.
        mode: The response generation mode, either 'vertex' or 'gemini'.

    Returns:
        str: The generated response, either from Vertex AI Search or Gemini.

    Raises:
        ValueError: If the `mode` is not 'vertex' or 'gemini'.
        vertexai.preview.generative_models.errors.GenerativeModelError: If the Gemini API call fails.
    """
    vertex_ai_search_response = extension_vertex_ai_search.execute(
        operation_id="search",
        operation_params={"query": QUERY},
    )

    # Let the Vertex AI Search Extension generate a response.
    if mode == 'vertex':
        list_extractive_answers = []
        for i in vertex_ai_search_response:
            list_extractive_answers.append(i["extractive_answers"][0])
        return list_extractive_answers

    # Let Gemini generate a response over the Vertex AI Search Extension segments.
    elif mode == 'gemini':
        list_extractive_segments = []
        for i in vertex_ai_search_response:
            list_extractive_segments.append(i["extractive_segments"][0])

        prompt = f"""
        Prompt: {QUERY};
        Contents: {str(list_extractive_segments)}
        """

        res = model.generate_content(
            prompt,
            generation_config={
                "max_output_tokens": 2048,
                "temperature": 0.1,
                "top_p": 1
            },
            safety_settings={
                generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
                generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
                generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
                generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
            },
            stream=False,
        )
        return res.text
Use the Vertex AI Search Extension to Answer Questions and Retrieve Summaries¶
Now you can run the Vertex AI Search Extension. The cell below demonstrates an output from Vertex AI Search without Gemini.
ㅤ
❗NOTE - if you are facing the following error:
FailedPrecondition: 400 Cannot use enterprise edition features (website search, multi-modal search, extractive answers/segments, etc.) in a standard edition search engine...
when running the cell below, simply wait a few minutes and try to run the cell again. That means the settings from the Vertex AI Search App creation have not yet propagated to the system (setting propagation may take up to 15 minutes to take effect after creating the search app).❗
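If you'd rather not re-run the cell manually, a small retry wrapper (a sketch; it assumes the propagation error surfaces as google.api_core.exceptions.FailedPrecondition) can wait and retry for you:
import time
from google.api_core.exceptions import FailedPrecondition

def query_with_retry(query, mode, retries=5, wait_seconds=60):
    # Retry while the enterprise-tier settings are still propagating.
    for attempt in range(retries):
        try:
            return get_vertexSearch_response(query, mode)
        except FailedPrecondition as e:
            print(f"Search app not ready yet ({attempt + 1}/{retries}): {e}")
            time.sleep(wait_seconds)
    raise RuntimeError("Search app settings did not propagate in time.")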
QUERY = f"What are some negative review points for {game}?" # @param {type:"string"}
search_res = get_vertexSearch_response(QUERY, mode='vertex')
search_res
The following cell highlights the differences between the pure Vertex AI Search Extension output above, and the hybrid response generated with Gemini below:
QUERY = f"List 10 positive review points for {game}"
response = get_vertexSearch_response(QUERY, mode='gemini')
print(response)
# Grab the output for report generation.
grab_outs(response)
Looks good. Collect more information from the website contents by giving the extension some more prompts:
QUERY = f"List 10 negative review points for {game}"
response = get_vertexSearch_response(QUERY, mode='gemini')
print(response)
# Grab the output for report generation.
grab_outs(response)
QUERY = f"Provide a summary description of the game {game}"
response = get_vertexSearch_response(QUERY, mode='gemini')
print(response)
# Grab the output for report generation.
grab_outs(response)
Step 4: Populate Your Results Into a PDF Report¶
Now it's time to put everything together. You have collected the generated responses (both images and texts) from Vertex AI Code Interpreter and Search Extensions.
output_list
Next you need to fetch the image filenames from the output_list:
imgs_files = []
other_files = []
txt_outs = []

for element in output_list:
    if ".png" in element or ".jpg" in element or ".jpeg" in element:
        # Ignore images with code_execution in the filename (these are duplicates).
        if "code_execution" in element:
            other_files.append(element)
        else:
            # Grab image filenames.
            imgs_files.append(element)
    else:
        # Get text outputs.
        txt_outs.append(element)
Generate the Report With the Vertex AI Code Interpreter Extension¶
With the collected text outputs and the images, you can ask the Code Interpreter extension to generate a compelling PDF Report. For this, let it generate a .html file first - you can convert it to PDF in the next cells.
imgs_files
response = run_code_interpreter(instructions=f"""
You are a report generator. Given a list of filenames and strings, create an interesting report in html language and save it to report.html.
The report revolves around reviews for the game {game}.
Structure the report with proper headings. Don't use 'String' as a heading.
Write the whole report in natural language. You are allowed to use bullet points.
Start the report with a summary of the game {game}.
Embed the png images directly in the html and include image descriptions.
And string contents:
{txt_outs}
""", filenames=imgs_files)
process_response(response)
Convert the html to a .pdf file and save it as report.pdf
:
import xhtml2pdf.pisa as pisa

with open("report.html") as infile, open("report.pdf", "w+b") as outfile:
    pisa.CreatePDF(infile, outfile)
Your report.pdf is now generated and saved in your working directory.
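A quick check (a small sketch) confirms the PDF was written and has a non-trivial size:
import os

# Confirm the PDF report exists in the working directory.
assert os.path.exists("report.pdf"), "report.pdf was not created"
print(f"report.pdf: {os.path.getsize('report.pdf')} bytes")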
[OPTIONAL] Step 5: Google Workspace APIs (Outside Colab)¶
If you are skipping this optional section, you should still go to the "Cleaning Up" section at the end if you want to remove files and GCP resources created by this notebook.
This section shows how you can use the Workspace APIs to store your generated PDF report in your Google Drive and send the report as an attachment via Gmail.
🚨 As mentioned in the beginning of this notebook, using the Workspace APIs requires setting up an OAuth consent screen and going through a web-based authentication flow that many remote notebook environments, including Colab and Jupyterlab don't support out-of-the-box. If you want to run through the optional section, make sure you are running this notebook in an environment that can open a webpage that you can interact with, like a local development environment.🚨
For this, you need to configure the Google Workspace API and credentials first. You can check out the Python Quick Start Guide for more details. If you've followed this notebook so far just follow these steps to complete the configuration:
ㅤ
👣 Steps for setting up the scopes:
- Go to the OAuth consent screen in your project.
- For User type select External, then click Create.
- Complete the app registration form by adding an app name, and adding your email to the user support email & developer contact information, then click Save and Continue.
- Click on Add or Remove Scopes.
- In the filter search bar of the selected scopes window, search for drive and enable the scope https://www.googleapis.com/auth/drive
- Now search for Gmail and enable the scope https://www.googleapis.com/auth/gmail.send
- Click on Save and Continue.
- In the Test Users window, add your own Google email address as a User by clicking Add Users, then click on Save and Continue.
- Review your app registration summary. To make changes, click Edit. If the app registration looks OK, click Back to Dashboard.
ㅤ
👣 Steps for retrieving authorized credentials:
- Go to Credentials in the GCP console.
- Click Create Credentials > OAuth client ID.
- Click Application type > Desktop app.
- In the Name field, type a name for the credential. This name is only shown in the Google Cloud console.
- Click Create. The OAuth client created screen appears, showing your new Client ID and Client secret.
- Click OK. The newly created credential appears under OAuth 2.0 Client IDs.
- Save the downloaded JSON file as credentials.json, and move the file to your working directory.
After that, you can run the following cell to get your creds variable by parsing the credentials.json file:
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2 import credentials
import os

SCOPES = ['https://mail.google.com/', 'https://www.googleapis.com/auth/gmail.send', 'https://www.googleapis.com/auth/drive']

creds = None
# The token file typically stores credentials for reuse.
token_file = 'token.json'

# Check if authorized credentials exist.
if os.path.exists(token_file):
    creds = credentials.Credentials.from_authorized_user_file(token_file, SCOPES)

# If not, or if the credentials are invalid, trigger the authorization flow.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            "credentials.json", SCOPES
        )
        creds = flow.run_local_server(port=0)
    # Save the credentials for the next run.
    with open("token.json", "w") as token:
        token.write(creds.to_json())
Uploading Report to Google Drive¶
This section lets you upload the generated PDF report to your Google Drive. It will first create a new folder for you (specify the folder name in the next cell) and upload the PDF file to that folder.
# @markdown Provide the folder name on Google Drive where the PDF should be saved into:
folder_name = 'extensions-demo' # @param {type:"string"}
Let's create the Google Drive API Service:
drive_service = build('drive', 'v3', credentials=creds)
The following function lets you create a new folder in Google Drive:
import os
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def create_folder(folder_name, drive_service):
    """Creates a folder in Google Drive.

    This function uses the Google Drive API to create a new folder with the specified name.

    Args:
        folder_name: The name of the folder to create.
        drive_service: An authorized Google Drive API service object.

    Returns:
        str: The ID of the newly created folder.
    """
    file_metadata = {
        'name': folder_name,
        'mimeType': 'application/vnd.google-apps.folder'
    }
    folder = drive_service.files().create(body=file_metadata, fields='id').execute()
    return folder.get('id')

# Create your folder.
folder_id = create_folder(folder_name, drive_service)
Lastly, upload your report.pdf to your new Google Drive Folder. The next function will help you upload a specified file to your newly created folder:
def upload_file(file_path, folder_id, drive_service):
    """Uploads a file to a specific folder in Google Drive.

    This function uses the Google Drive API to upload a file from the local filesystem
    to a specified folder in Google Drive. It automatically determines the appropriate
    MIME type based on the file extension.

    Args:
        file_path: The path to the file to upload.
        folder_id: The ID of the folder to upload the file to.
        drive_service: An authorized Google Drive API service object.

    Returns:
        str: The ID of the uploaded file.
    """
    file_metadata = {
        'name': os.path.basename(file_path),
        'parents': [folder_id]
    }

    # Determine the MIME type based on the file extension.
    extension = os.path.splitext(file_path)[1].lower()
    if extension in ['.jpg', '.jpeg', '.png']:
        mime_type = 'image/jpeg'  # Adjust for other image types if needed.
    elif extension == '.pdf':
        mime_type = 'application/pdf'
    else:
        mime_type = 'application/octet-stream'  # Generic fallback.

    media = MediaFileUpload(file_path, mimetype=mime_type, resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    print(f'File uploaded to Drive: {file.get("id")}')
    return file.get("id")

# Upload the file to the Google Drive folder.
file_id = upload_file('report.pdf', folder_id, drive_service)
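Optionally (a small sketch using the same Drive service), you can fetch a browser link to the uploaded report; webViewLink is a standard Drive v3 file field:
# Fetch a link to the uploaded report that you can open in the browser.
uploaded = drive_service.files().get(fileId=file_id, fields="webViewLink").execute()
print(uploaded["webViewLink"])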
Sending the Report via Gmail¶
The following sections show how to attach the generated PDF report to an email and send it to a recipient with the Gmail API.
Grab the contents of the PDF report:
import os

def read_pdf_file(filename):
    with open(filename, 'rb') as f:
        pdf_data = f.read()
    return pdf_data

pdf_filename = "report.pdf"  # Path to your PDF in the working directory.
pdf_data = read_pdf_file(pdf_filename)
The following function parses the PDF contents into a raw message for the e-mail attachment:
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders
import base64

def create_message_with_attachment(sender, to, subject, body, filename, attachment):
    message = MIMEMultipart()
    message['to'] = to
    message['from'] = sender
    message['subject'] = subject

    msg_body = MIMEText(body, 'plain')
    message.attach(msg_body)

    part = MIMEBase('application', 'octet-stream')  # For PDFs.
    part.set_payload(attachment)
    encoders.encode_base64(part)
    part.add_header('Content-Disposition', f'attachment; filename={filename}')
    message.attach(part)

    raw_message = base64.urlsafe_b64encode(message.as_bytes()).decode()
    return {'raw': raw_message}
Setting Up E-mail Configuration¶
Provide the recipient email address in the next cell.
# Provide the details for constructing your e-mail.
recipient = 'recipient@domain.com' #@param {type: 'string'}
Send the E-mail¶
📧 Now you can send the e-mail with the attached PDF report:
from googleapiclient.discovery import build
# Build the Gmail API service object.
service = build('gmail', 'v1', credentials=creds)
# Provide the details for constructing your e-mail.
subject = f"{game} Review Analysis Report"
body = f"Attached is the Report on the Review Analysis for {game}"
# Construct e-mail.
message = create_message_with_attachment('me', recipient,
subject, body,
pdf_filename, pdf_data)
# Send e-mail.
service.users().messages().send(userId='me', body=message).execute()
print("Email sent!")
🧹 Cleaning up¶
Clean up resources created in this notebook.
Remove the extensions instances created in this notebook by running the cell below:
extension_code_interpreter.delete()
extension_vertex_ai_search.delete()
You can run the next cell to get a list of all other remaining Vertex AI Extension Instances in your environment:
extensions.Extension.list()
Optionally, you can uncomment the following code block to delete all active extensions in your project, by using the IDs above to clean up:
# clean_ids = []
# for element in extensions.Extension.list():
#     clean_ids.append(str(element).split("extensions/")[1])

# for id in clean_ids:
#     extension = extensions.Extension(id)
#     extension.delete()
Run the cells below to delete your GCS bucket by first deleting all the files in it and then deleting the bucket itself:
❗❗❗ Only run the below cells if you created a new bucket just for this notebook ❗❗❗
from google.cloud import storage

def empty_bucket(bucket_name):
    """Deletes all objects in the specified GCS bucket."""
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs()  # List all blobs (objects).
    for blob in blobs:
        blob.delete()  # Delete each blob.
    print(f"Bucket {bucket_name} emptied.")

## Empty the bucket by deleting all files in it.
empty_bucket(GCS_BUCKET)

## Create a client object.
client = storage.Client(project=PROJECT_ID)
## Get the bucket object.
bucket = client.get_bucket(GCS_BUCKET)
## Delete the bucket.
bucket.delete()
print(f"Bucket {GCS_BUCKET} deleted successfully.")
Now, delete all the assets generated by the Vertex AI extensions. First, get the filenames:
files = imgs_files + other_files
for i in range(10):
    files.append(f'website_text_{i}.txt')
files.append('report.html')
files.append('report.pdf')
files.append('reviews.csv')
files.append('ideas.txt')

files
Next, delete the files:
import os

for file in files:
    try:
        os.remove(file)
    except FileNotFoundError as e:
        print(e)
        print('Skipping.')
If you ran the optional section, delete your newly created Google Drive folder and the file in it:
from googleapiclient.discovery import build
# Delete the file with file_id
drive_service.files().delete(fileId=file_id).execute()
print(f"File with ID {file_id} deleted.")
# Delete the folder with folder_id
drive_service.files().delete(fileId=folder_id).execute()
print(f"Folder with ID {folder_id} deleted.")
Delete your Google Cloud CLI ADC Configuration, if you no longer need it, by running:
$ gcloud config configurations delete CONFIG_NAME
❗❗❗ Don't forget to delete any other created assets if you don't need them, e.g. the Vertex AI data store and search app (you need to delete them from the Google Cloud Console).
- Your Vertex AI Search app: https://console.cloud.google.com/gen-app-builder/apps
- Your Vertex AI Search data store: https://console.cloud.google.com/gen-app-builder/data-stores