# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Defining custom attributes based on URL patterns in Vertex AI Search Website Datastores¶
Author(s) | Hossein Mansour
Reviewer(s) | Ismail Najim, Rajesh Thallam
Last updated | 2024-08-09: The first draft
Overview¶
In this notebook, we demonstrate how to create custom attributes based on URL patterns in Vertex AI Search Website datastores.
These custom attributes will act similarly to metadata from page source and can be used for different purposes such as improving recall and precision, influencing results via boosting and filtering, and including additional context to be retrieved together with the documents.
You can find more information about different types of metadata here.
Custom attributes based on URL patterns are particularly helpful in cases where adjusting page source to include relevant information is not feasible due to a need to keep that information private or when organizational complexities make it difficult to influence the page source content (e.g., content being managed by a third party).
Custom attributes can be used, in lieu of page source metadata, in conjunction with page source metadata, or to override poor quality page content via post-processing (e.g., a Title_Override custom attribute to override the actual page title for certain URLs).
Note that basic URL-based boosting and filtering can be done directly, without custom attributes. Custom attributes are intended for more advanced use cases.
If the custom attribute is made searchable, it can be used to implicitly influence retrieval and ranking of the page by providing additional information such as tags and related topics.
We will perform the following steps:
- [Prerequisite] Creating a Vertex AI Search Website Datastore and Search App
- Setting Schema and URL mapping for Custom Attributes
- Getting Schema and URL mapping to confirm this is what we want
- Searching the Datastore and demonstrating how custom attributes can be used for filtering
- Clean up
Please refer to the official documentation for the definition of Datastores and Apps and their relationships to one another.
REST API is used throughout this notebook. Please consult the official documentation for alternative ways to achieve the same goal, namely Client libraries and RPC.
Vertex AI Search¶
Vertex AI Search (VAIS) is a fully-managed platform, powered by large language models, that lets you build AI-enabled search and recommendation experiences for your public or private websites or mobile applications.
VAIS can handle a diverse set of data sources including structured, unstructured, and website data, as well as data from third-party applications such as Jira, Salesforce, and Confluence.
VAIS also has built-in integration with LLMs which enables you to provide answers to complex questions, grounded in your data.
Using this Notebook¶
If you're running outside of Colab, depending on your environment you may need to install pip packages that are included in the Colab environment by default but are not part of the Python Standard Library. Outside of Colab you'll also notice comments in code cells that look like #@something, these trigger special Colab functionality but don't change the behavior of the notebook.
This tutorial uses the following Google Cloud services and resources:
- Service Usage API
- Discovery Engine API
This notebook has been tested in the following environment:
- Python version = 3.10.12
- google.cloud.storage = 2.8.0
- google.auth = 2.27.0
Getting Started¶
The following steps are necessary to run this notebook, no matter what notebook environment you're using.
If you're entirely new to Google Cloud, get started here
Google Cloud Project Setup¶
- Select or create a Google Cloud project. When you first create an account, you get a $300 free credit towards your compute/storage costs
- Make sure that billing is enabled for your project
- Enable the Service Usage API
- Enable the Cloud Storage API
- Enable the Discovery Engine API for your project
Google Cloud Permissions¶
Ideally you should have the Owner role for your project to run this notebook. If that is not an option, you need at least the following roles:
- roles/serviceusage.serviceUsageAdmin to enable APIs
- roles/iam.serviceAccountAdmin to modify service agent permissions
- roles/discoveryengine.admin to modify Discovery Engine assets
Setup Environment¶
Authentication¶
If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.
If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login in a shell on the machine running the notebook kernel is sufficient.
More authentication options are discussed here.
# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print("Authenticated")
from google.auth import default
from google.auth.transport.requests import AuthorizedSession
creds, _ = default()
authed_session = AuthorizedSession(creds)
Import Libraries¶
import json
import pprint
import time
Configure environment¶
The Location of a Datastore is set at creation time, and the same Location must be specified when querying the Datastore. global is typically recommended unless you have a particular reason to use a regional Datastore.
You can find more information regarding the Location of Datastores and associated limitations here.
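Note that regional Datastores are served from location-prefixed API hosts, while the calls in this notebook assume the global endpoint. As a small illustration (the helper name is our own, following the endpoint scheme documented for the Discovery Engine API):

```python
def discoveryengine_endpoint(location: str) -> str:
    """Return the Discovery Engine API host for a Datastore location.

    Global Datastores use the default host; regional ones ("us", "eu")
    are addressed via a location-prefixed host.
    """
    if location == "global":
        return "https://discoveryengine.googleapis.com"
    return f"https://{location}-discoveryengine.googleapis.com"

discoveryengine_endpoint("us")  # → "https://us-discoveryengine.googleapis.com"
```

If you pick a regional Location, adjust the host in the REST calls below accordingly.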
VAIS_BRANCH is the branch of VAIS to use. At the time of writing this notebook, URL mapping for Custom Attributes is only available in v1alpha of the Discovery Engine API.
INCLUDE_URL_PATTERN is the pattern of a website to be included in the datastore, e.g. "www.example.com/", "www.example.com/abc/".
Note that you need to verify the ownership of a domain to be able to index it.
PROJECT_ID = '' # @param {type: 'string'}
DATASTORE_ID = '' # @param {type: 'string'}
APP_ID = '' # @param {type: 'string'}
LOCATION = "global" # @param ["global", "us", "eu"]
VAIS_BRANCH = "v1alpha" # @param {type: 'string'}
INCLUDE_URL_PATTERN = "" # @param {type: 'string'}
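As an optional convenience (not part of the original setup), a small helper can flag settings that are still empty before any API call is made; check_config is a hypothetical name:

```python
def check_config(**settings) -> list:
    """Return the names of settings whose values are still empty."""
    return [name for name, value in settings.items() if not value]

# Example: with an empty INCLUDE_URL_PATTERN, that name is reported.
check_config(PROJECT_ID="my-project", DATASTORE_ID="my-ds", INCLUDE_URL_PATTERN="")
# → ['INCLUDE_URL_PATTERN']
```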
Step 1. [Prerequisite] Create a Website Search Datastore and App¶
In this section, we will programmatically create a VAIS Advanced Website Datastore and App. You can achieve the same goal with a few clicks in the UI.
If you already have an Advanced Website Datastore available, you can skip this section.
Helper functions to issue basic search on a Datastore or an App¶
def search_by_datastore(project_id: str, location: str, datastore_id: str, query: str):
    """Searches a datastore using the provided query."""
    response = authed_session.post(
        f'https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/servingConfigs/default_search:search',
        headers={
            'Content-Type': 'application/json',
        },
        json={
            "query": query,
            "pageSize": 1
        },
    )
    return response
def search_by_app(project_id: str, location: str, app_id: str, query: str):
    """Searches an app using the provided query."""
    response = authed_session.post(
        f'https://discoveryengine.googleapis.com/v1/projects/{project_id}/locations/{location}/collections/default_collection/engines/{app_id}/servingConfigs/default_config:search',
        headers={
            'Content-Type': 'application/json',
        },
        json={
            "query": query,
            "pageSize": 1
        },
    )
    return response
Helper functions to check whether a Datastore or an App already exists¶
def datastore_exists(project_id: str, location: str, datastore_id: str) -> bool:
    """Check if a datastore exists."""
    response = search_by_datastore(project_id, location, datastore_id, "test")
    status_code = response.status_code
    # A 400 response is expected as the URL pattern needs to be set first
    if status_code == 200 or status_code == 400:
        return True
    if status_code == 404:
        return False
    raise Exception(f"Error: {status_code}")
def app_exists(project_id: str, location: str, app_id: str) -> bool:
    """Check if an App exists."""
    response = search_by_app(project_id, location, app_id, "test")
    status_code = response.status_code
    if status_code == 200:
        return True
    if status_code == 404:
        return False
    raise Exception(f"Error: {status_code}")
Helper functions to create a Datastore or an App¶
def create_website_datastore(vais_branch: str, project_id: str, location: str, datastore_id: str) -> int:
    """Create a website datastore."""
    payload = {
        "displayName": datastore_id,
        "industryVertical": "GENERIC",
        "solutionTypes": ["SOLUTION_TYPE_SEARCH"],
        "contentConfig": "PUBLIC_WEBSITE",
    }
    header = {"X-Goog-User-Project": project_id, "Content-Type": "application/json"}
    es_endpoint = f"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores?dataStoreId={datastore_id}"
    response = authed_session.post(es_endpoint, data=json.dumps(payload), headers=header)
    if response.status_code == 200:
        print(f"The creation of Datastore {datastore_id} is initiated.")
        print("It may take a few minutes for the Datastore to become available")
    else:
        print(f"Failed to create Datastore {datastore_id}")
        # response.text is a property, not a method
        print(response.text)
    return response.status_code
def create_app(vais_branch: str, project_id: str, location: str, datastore_id: str, app_id: str) -> int:
    """Create a search app."""
    payload = {
        "displayName": app_id,
        "dataStoreIds": [datastore_id],
        "solutionType": "SOLUTION_TYPE_SEARCH",
        "searchEngineConfig": {
            "searchTier": "SEARCH_TIER_ENTERPRISE",
            "searchAddOns": ["SEARCH_ADD_ON_LLM"],
        }
    }
    header = {"X-Goog-User-Project": project_id, "Content-Type": "application/json"}
    es_endpoint = f"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/engines?engineId={app_id}"
    response = authed_session.post(es_endpoint, data=json.dumps(payload), headers=header)
    if response.status_code == 200:
        print(f"The creation of App {app_id} is initiated.")
        print("It may take a few minutes for the App to become available")
    else:
        print(f"Failed to create App {app_id}")
        print(response.json())
    return response.status_code
Create a Datastore with the provided ID if it doesn't exist¶
if datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):
    print(f"Datastore {DATASTORE_ID} already exists.")
else:
    create_website_datastore(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID)
[Optional] Check if the Datastore is created successfully¶
The Datastore is polled to track when it becomes available.
This may take a few minutes
while not datastore_exists(PROJECT_ID, LOCATION, DATASTORE_ID):
    print(f"Datastore {DATASTORE_ID} is still being created.")
    time.sleep(30)
print(f"Datastore {DATASTORE_ID} is created successfully.")
Create an App with the provided ID if it doesn't exist¶
The App will be connected to a Datastore with the ID provided earlier in this notebook
if app_exists(PROJECT_ID, LOCATION, APP_ID):
    print(f"App {APP_ID} already exists.")
else:
    create_app(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID, APP_ID)
[Optional] Check if the App is created successfully¶
The App is polled to track when it becomes available.
This may take a few minutes
while not app_exists(PROJECT_ID, LOCATION, APP_ID):
    print(f"App {APP_ID} is still being created.")
    time.sleep(30)
print(f"App {APP_ID} is created successfully.")
Upgrade an existing Website Datastore to an Advanced Website Datastore¶
def upgrade_to_advanced(vais_branch: str, project_id: str, location: str, datastore_id: str) -> int:
    """Upgrade the website search datastore to advanced."""
    header = {"X-Goog-User-Project": project_id}
    es_endpoint = f"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{datastore_id}/siteSearchEngine:enableAdvancedSiteSearch"
    response = authed_session.post(es_endpoint, headers=header)
    if response.status_code == 200:
        print(f"Datastore {datastore_id} upgraded to Advanced Website Search")
    else:
        print(f"Failed to upgrade Datastore {datastore_id}")
        # response.text is a property, not a method
        print(response.text)
    return response.status_code
upgrade_to_advanced(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID)
Set the URLs to Include/Exclude in the Index¶
You can set up to 500 Include and Exclude URL patterns for Advanced website search Datastores.
This function sets a single URL pattern to be included every time it gets executed.
The field type in the payload indicates whether the provided URI pattern should be included or excluded. Here we only use INCLUDE.
The INCLUDE and EXCLUDE URL patterns specified with this function are incremental. You also have options to Delete, List, Batch Create, etc.
For this example, we index http://cloud.google.com/generative-ai-app-builder/*
Note that you need to verify the ownership of a domain to be able to index it.
def include_url_patterns(vais_branch: str, project_id: str, location: str, datastore_id: str, include_url_patterns) -> int:
    """Set an include URL pattern for the Datastore."""
    payload = {
        "providedUriPattern": include_url_patterns,
        "type": "INCLUDE",
    }
    header = {"X-Goog-User-Project": project_id, "Content-Type": "application/json"}
    es_endpoint = f"https://discoveryengine.googleapis.com/{vais_branch}/projects/{project_id}/locations/{location}/dataStores/{datastore_id}/siteSearchEngine/targetSites"
    response = authed_session.post(es_endpoint, data=json.dumps(payload), headers=header)
    if response.status_code == 200:
        print("URL patterns successfully set")
        print("Depending on the size of your domain, the initial indexing may take from minutes to hours")
    else:
        print(f"Failed to set URL patterns for the Datastore {datastore_id}")
        # response.text is a property, not a method
        print(response.text)
    return response.status_code
include_url_patterns(VAIS_BRANCH, PROJECT_ID, LOCATION, DATASTORE_ID, INCLUDE_URL_PATTERN)
Step 2. Schema and URL mapping for Custom Attributes¶
Set the Schema and URL mapping¶
In this example we use VAIS REST API documentation as the source for the datastore. For the mapping we add "REST" tags to all branches of REST documentation. We also add an additional tag to identify each branch (i.e. V1, V1alpha, V1beta). The schema and URL mapping should follow this formatting.
Separately, we identify pages under Samples with a corresponding tag.
As mentioned above, you can only index a website you own; as a result, your mapping will be different from the ones used in this example.
Note that each successful mapping request overrides the previous ones (i.e. mappings are not incremental).
header = {"X-Goog-User-Project": PROJECT_ID}
es_endpoint = f"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/siteSearchEngine:setUriPatternDocumentData"
json_data = {
    "documentDataMap": {
        "https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/*": {
            "Topic": ["Rest", "V1"]
        },
        "https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1alpha/*": {
            "Topic": ["Rest", "V1alpha"]
        },
        "https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1beta/*": {
            "Topic": ["Rest", "V1beta"]
        },
        "https://cloud.google.com/generative-ai-app-builder/docs/samples*": {
            "Topic": ["Samples"]
        },
    },
    "schema": {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "properties": {
            "Topic": {
                "items": {
                    "indexable": True,
                    "retrievable": True,
                    "searchable": True,
                    "type": "string",
                },
                "type": "array",
            }
        },
        "type": "object",
    },
}
set_schema_response = authed_session.post(es_endpoint, headers=header, json=json_data)
print(json.dumps(set_schema_response.json(), indent=1))
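Because each successful setUriPatternDocumentData call replaces the entire mapping, one way to add a pattern without dropping existing entries is to fetch the current documentDataMap, merge in the new entries, and set the merged result. A minimal sketch of the merge step only; merge_document_data_map is our own helper name:

```python
def merge_document_data_map(current: dict, new_entries: dict) -> dict:
    """Merge new URL-pattern entries into an existing documentDataMap.

    Entries in new_entries win on conflict; all other existing mappings
    are preserved, so a follow-up set call does not silently drop them.
    """
    merged = dict(current)
    merged.update(new_entries)
    return merged

current_map = {
    "https://cloud.google.com/generative-ai-app-builder/docs/samples*": {"Topic": ["Samples"]}
}
new_map = merge_document_data_map(
    current_map,
    {"https://cloud.google.com/generative-ai-app-builder/docs/reference/rest/v1/*": {"Topic": ["Rest", "V1"]}},
)
# new_map now contains both patterns.
```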
Get the Schema and URL mapping¶
Get the Schema and URL mapping to ensure it is updated according to your expectations.
header = {"X-Goog-User-Project": PROJECT_ID}
es_endpoint = f"https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/siteSearchEngine:getUriPatternDocumentData"
get_schema_response = authed_session.get(es_endpoint, headers=header)
print(json.dumps(get_schema_response.json(), indent=1))
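To spot-check the response programmatically rather than reading the raw JSON, a small helper (a hypothetical name, assuming the response body carries the documentDataMap field set above) can list each pattern with its attribute names:

```python
def summarize_mapping(response_body: dict) -> dict:
    """Map each URI pattern to its attribute names for a quick overview."""
    data_map = response_body.get("documentDataMap", {})
    return {pattern: sorted(attrs) for pattern, attrs in data_map.items()}

example = {
    "documentDataMap": {
        "https://cloud.google.com/generative-ai-app-builder/docs/samples*": {"Topic": ["Samples"]}
    }
}
summarize_mapping(example)
# → {'https://cloud.google.com/generative-ai-app-builder/docs/samples*': ['Topic']}
```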
Step 3. Run queries with and without a metadata filter¶
Search Parameters¶
- QUERY: Used to query VAIS.
- PAGE_SIZE: The maximum number of results retrieved from VAIS.
QUERY = '' # @param {type: 'string'}
PAGE_SIZE = 5 # @param {type: 'integer'}
Search Without Filter¶
Given that the Topic custom attribute is made retrievable in the Schema, you will get it back in the response when applicable.
Custom attributes are included in the structData field of the result.
search_response = authed_session.post(
f'https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search',
headers={
'Content-Type': 'application/json'
},
json={
"query": QUERY,
"pageSize": PAGE_SIZE},
)
print(json.dumps(search_response.json(), indent=1))
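Rather than scanning the raw JSON, the custom attribute can be pulled out of each result. A sketch, assuming the standard search-response shape (results → document → structData); extract_topics is our own helper name:

```python
def extract_topics(search_json: dict) -> list:
    """Return (document id, Topic values) pairs from a search response."""
    pairs = []
    for result in search_json.get("results", []):
        doc = result.get("document", {})
        topics = doc.get("structData", {}).get("Topic", [])
        pairs.append((doc.get("id", ""), topics))
    return pairs

sample = {"results": [{"document": {"id": "d1", "structData": {"Topic": ["Rest", "V1alpha"]}}}]}
extract_topics(sample)
# → [('d1', ['Rest', 'V1alpha'])]
```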
Search with Filter¶
Now we apply a filter so that a search only returns results from the V1alpha branch of the REST documentation. The filter and expected results will be different based on the domain included in your website datastore.
We could also use this indexable field for other purposes such as Boosting, if desired.
search_response = authed_session.post(
f'https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}/servingConfigs/default_search:search',
headers={
'Content-Type': 'application/json'
},
json={
"query": QUERY,
"filter": "Topic: ANY(\"V1alpha\")",
"pageSize": PAGE_SIZE},
)
print(json.dumps(search_response.json(), indent=1))
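The same attribute can drive boosting instead of hard filtering. A sketch of the request body only, using the boostSpec / conditionBoostSpecs shape from the Discovery Engine search API (the boost value 0.8 is an arbitrary example; boost ranges from -1 to 1):

```python
# Request body favoring V1alpha pages without excluding everything else.
boosted_request = {
    "query": "import documents",
    "pageSize": 5,
    "boostSpec": {
        "conditionBoostSpecs": [
            {
                # Same filter syntax as the filtered search above.
                "condition": 'Topic: ANY("V1alpha")',
                "boost": 0.8,
            }
        ]
    },
}
```

This body would replace the json= argument of the search call above; matching documents are ranked higher rather than being the only results returned.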
Clean up¶
Delete the Search App¶
Delete the App if you no longer need it.
Alternatively, you can follow these instructions to delete an App from the UI.
response = authed_session.delete(
f'https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/engines/{APP_ID}',
headers={
"X-Goog-User-Project": PROJECT_ID
}
)
print(response.text)
Delete the Datastore¶
Delete the Datastore if you no longer need it.
Alternatively, you can follow these instructions to delete a Datastore from the UI.
response = authed_session.delete(
f'https://discoveryengine.googleapis.com/{VAIS_BRANCH}/projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection/dataStores/{DATASTORE_ID}',
headers={
"X-Goog-User-Project": PROJECT_ID
}
)
print(response.text)