In [1]:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


Author(s)	Ken Lee
Reviewer(s)	Abhishek Bhagwat
Last updated	2024-10-07

Deploying a chatbot or retrieval-augmented generation (RAG) application to real users produces a wealth of valuable data. User queries reveal what users need, which products they engage with, and how well the chatbot itself performs. This data is crucial both for understanding your users and for continuously evaluating your deployed system.

This notebook demonstrates how to leverage Gemini to accelerate the analysis and summarization of real user queries from a production RAG system or chatbot. By analyzing these queries, we can identify a representative set of questions to form an evaluation dataset, establishing a foundation for continuous evaluation.

This process aims to answer the following questions:

What general categories of questions are users asking? What problems are they encountering?
What topics are prevalent in user conversations?
What sentiments are users expressing?

Inspired by a Weights and Biases article, this notebook extends those concepts by utilizing Gemini's capabilities. Gemini's large context window allows for rapid exploratory data analysis (EDA) of clustered questions, even with extensive datasets, facilitating efficient metadata extraction and informed selection of an evaluation dataset. This, in turn, enables the construction of a robust and representative evaluation set for the RAG system.

🎬 Getting Started¶

The following steps are necessary to run this notebook, no matter what notebook environment you're using.

If you're entirely new to Google Cloud, get started here.

Google Cloud Permissions¶

To run the complete notebook, including the optional section, you need the Owner role for your project.

If you want to skip the optional section, you need at least the following roles:

roles/serviceusage.serviceUsageAdmin to enable APIs
roles/iam.serviceAccountAdmin to modify service agent permissions
roles/aiplatform.user to use AI Platform components

Install Vertex AI SDK and Other Required Packages¶

In [ ]:

!pip install -qqq llama-index \
llama-index-llms-vertex \
llama-index-embeddings-vertex \
python-louvain \
tiktoken \
aiofiles \
annotated-types \
python-fasthtml

Restart Runtime¶

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [ ]:

# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Authenticate¶

If you're using Colab, run the code in the next cell. Follow the popups and authenticate with an account that has access to your Google Cloud project.

If you're running this notebook somewhere besides Colab, make sure your environment has the right Google Cloud access. If that's a new concept to you, consider looking into Application Default Credentials for your local environment and initializing the Google Cloud CLI. In many cases, running gcloud auth application-default login in a shell on the machine running the notebook kernel is sufficient.

More authentication options are discussed here.

In [ ]:

# Colab authentication.
import sys

if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
    print('Authenticated')

Set Google Cloud project information and Initialize Vertex AI SDK¶

To get started using Vertex AI, you must have an existing Google Cloud project and enable the Vertex AI API.

Learn more about setting up a project and a development environment.

Make sure to change PROJECT_ID in the next cell. You can leave REGION at its default value unless you have a specific reason to change it.

In [2]:

import vertexai

PROJECT_ID = "<enter-your-project-id>"
REGION = "us-central1"
CSV_PATH = "./curate_evals_example.csv"
TEST_RUN = False
CLUSTERING_NEIGHBORHOOD_SIZE = 5

vertexai.init(
    project=PROJECT_ID,
    location=REGION
)

Prepare the dataset¶

For this demo, we use a hypothetical dataset of questions about Google Cloud services.

In [3]:

import pandas as pd
import numpy as np

if TEST_RUN:
  # Tiny in-memory dataset; the column is named "Question" to match the CSV
  # schema that the rest of the notebook reads from.
  df = pd.DataFrame({"Question": ["What is RAG?", "What is life?", "What is football?", "Who am I?"],
                     "answer": ["Retrieval Augmented Generation", "Love", "National Football League", "Human"]})
else:
  df = pd.read_csv(CSV_PATH)

In [4]:

df

Out[4]:

	Topic	Question
0	Compute Engine	How can I create a virtual machine instance on...
1	Compute Engine	"What are the different machine types availabl...
2	Compute Engine	"Can you explain the different pricing options...
3	Compute Engine	"How do I connect to my Compute Engine instanc...
4	Compute Engine	"What are preemptible instances, and how can t...
...	...	...
95	Cost Management	"How can I track and manage my Google Cloud co...
96	Cost Management	"What are the different pricing models for Goo...
97	Cost Management	"How can I optimize my Google Cloud costs?"
98	Cost Management	"What tools are available for cost management ...
99	Cost Management	"How can I set budgets and alerts for my Googl...

100 rows × 2 columns

Dataset Preprocessing¶

Real-world RAG systems see anomalous search queries: you will often encounter single-word queries or typos. In this step, we preprocess and clean the dataset to remove the following types of queries:

Very short and very long queries
Near duplicates
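The cells below apply only the length filter. Near-duplicate removal could be sketched as follows. This is a minimal sketch, not the notebook's own method: it assumes that two questions count as near duplicates when, after lowercasing and stripping punctuation, their `difflib.SequenceMatcher` ratio reaches a hypothetical threshold of 0.9. The helper names `normalize` and `drop_near_duplicates` are illustrative.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def drop_near_duplicates(questions: List[str], threshold: float = 0.9) -> List[str]:
    """Keep the first occurrence of each group of near-identical questions.

    The pairwise comparison is O(n^2), which is acceptable for small query logs.
    """
    kept: List[str] = []
    kept_normalized: List[str] = []
    for question in questions:
        candidate = normalize(question)
        is_duplicate = any(
            SequenceMatcher(None, candidate, seen).ratio() >= threshold
            for seen in kept_normalized
        )
        if not is_duplicate:
            kept.append(question)
            kept_normalized.append(candidate)
    return kept


sample = [
    "How do I connect to my Compute Engine instance using SSH?",
    "how do i connect to my compute engine instance using ssh",
    "How can I optimize my Google Cloud costs?",
]
deduplicated = drop_near_duplicates(sample)
```

Here the second question differs from the first only in case and punctuation, so it is dropped. For large logs, comparing embeddings instead of raw strings would likely scale better, at the cost of an extra embedding pass.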

In [5]:

df["question_len"] = df["Question"].str.len()

In [6]:

# Discard questions with too few or too many characters
df = df[(df.question_len > 5) & (df.question_len < 1000)]

In [7]:

df

Out[7]:

	Topic	Question	question_len
0	Compute Engine	How can I create a virtual machine instance on...	63
1	Compute Engine	"What are the different machine types availabl...	115
2	Compute Engine	"Can you explain the different pricing options...	77
3	Compute Engine	"How do I connect to my Compute Engine instanc...	59
4	Compute Engine	"What are preemptible instances, and how can t...	65
...	...	...	...
95	Cost Management	"How can I track and manage my Google Cloud co...	51
96	Cost Management	"What are the different pricing models for Goo...	66
97	Cost Management	"How can I optimize my Google Cloud costs?"	43
98	Cost Management	"What tools are available for cost management ...	63
99	Cost Management	"How can I set budgets and alerts for my Googl...	64

100 rows × 3 columns

Visualize distribution of question lengths¶

In [8]:

df.question_len.hist(bins=25)

Out[8]:

<Axes: >

[Figure: histogram of question lengths (question_len), 25 bins]

Generating the embeddings for the questions¶

Vertex AI embedding models can generate embeddings optimized for various task types, such as document retrieval, question answering, and fact verification. A task type is a label that tells the model which intended use case to optimize the embeddings for.

In this example, we set the task type to RETRIEVAL_DOCUMENT, which generates embeddings optimized for information retrieval.

Read more about the task types offered by Vertex AI embedding models here.

In [9]:

import asyncio
from tqdm.asyncio import tqdm_asyncio
from typing import List, Optional, Tuple
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
from google.cloud import storage
from vertexai.generative_models import GenerativeModel

async def embed_text_async(
    model: TextEmbeddingModel,
    texts: List[str] = ["banana muffins? ", "banana bread? banana muffins?"],
    task: str = "RETRIEVAL_DOCUMENT",
    dimensionality: Optional[int] = 768,):
    """Embed a batch of texts and return one vector (list of floats) per text."""
    # Wrap each text with the task type so the model optimizes for it
    inputs = [TextEmbeddingInput(text, task) for text in texts]
    kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
    # Pass the wrapped inputs (not the raw strings) so the task type is applied
    embeddings = await model.get_embeddings_async(inputs, **kwargs)
    return [embedding.values for embedding in embeddings]

# embedding model to use
model_name = "text-embedding-005"
embedding_model = TextEmbeddingModel.from_pretrained(model_name)

# embed questions from the dataset asynchronously, one request per question
embedded_qs = await tqdm_asyncio.gather(*[embed_text_async(embedding_model,
                                        [x["Question"]]) for i, x in df.iterrows()])

100%|██████████| 100/100 [00:00<00:00, 302.49it/s]

In [10]:

embedded_qs_flattened = [q[0] for q in embedded_qs]

Cluster the Questions¶

While various clustering algorithms can be applied, Louvain community detection is a particularly suitable choice for this task due to its speed and effectiveness.
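Louvain works by greedily moving nodes between communities to increase modularity, a score that compares the number of edges inside each community with the number expected in a random graph with the same node degrees. The following is a small sketch of that score on a toy graph, in plain Python. The function name `modularity` and the toy edge list are illustrative; the notebook itself relies on `community_louvain.best_partition` rather than this code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def modularity(edges: List[Tuple[int, int]], partition: Dict[int, int]) -> float:
    """Modularity Q = sum over communities c of (L_c / m) - (d_c / (2m))^2.

    m is the number of edges, L_c the number of edges with both ends in
    community c, and d_c the total degree of the nodes in c. Assumes an
    undirected, unweighted graph without self-loops.
    """
    m = len(edges)
    internal_edges = defaultdict(int)
    total_degree = defaultdict(int)
    for u, v in edges:
        total_degree[partition[u]] += 1
        total_degree[partition[v]] += 1
        if partition[u] == partition[v]:
            internal_edges[partition[u]] += 1
    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {node: 0 for node in range(6)}

q_two = modularity(edges, two_clusters)  # 5/14, about 0.357
q_one = modularity(edges, one_cluster)   # 0.0
```

Splitting the graph into its two triangles scores higher than lumping every node into one community, which is the kind of partition Louvain searches for.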

Vector-based Retrieval Clustering¶

Store your embedded question set in a vector index
Query the vector index with each question in the dataset, retrieving a neighborhood of the top-k most similar questions around the query question.
Form a graph of questions by adding an edge between the query question and each of the retrieved questions
Perform Louvain or Leiden community detection on the graph to create clusters of questions
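The steps above can be sketched without any external service. This is a toy illustration under stated assumptions: hand-written 2-D vectors stand in for real embeddings, a brute-force cosine-similarity search stands in for the vector index, and connected components stand in for Louvain or Leiden community detection. On real data the graph is usually one connected piece, so a proper community detection algorithm is still needed. All names here are hypothetical.

```python
import math
from typing import Dict, List, Set

# Toy "embeddings": two groups of questions pointing in different directions
vectors: Dict[str, List[float]] = {
    "create a VM": [1.0, 0.0],
    "connect to a VM": [0.9, 0.1],
    "resize a VM disk": [0.95, 0.05],
    "track cloud costs": [0.0, 1.0],
    "set a budget alert": [0.1, 0.9],
    "optimize my bill": [0.05, 0.95],
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: str, k: int) -> List[str]:
    """Brute-force stand-in for a vector index query (the query itself is included)."""
    ranked = sorted(
        vectors,
        key=lambda other: cosine(vectors[query], vectors[other]),
        reverse=True,
    )
    return ranked[:k]


# Query with every question and link it to each retrieved neighbor
graph: Dict[str, Set[str]] = {name: set() for name in vectors}
for name in vectors:
    for neighbor in top_k(name, k=3):
        if neighbor != name:  # skip the self-match
            graph[name].add(neighbor)
            graph[neighbor].add(name)


def connected_components(adjacency: Dict[str, Set[str]]) -> List[Set[str]]:
    """Simplified stand-in for community detection on the similarity graph."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


clusters = connected_components(graph)
```

With a neighborhood size of 3, each question links only to the two other questions in its own group, so the VM questions and the cost questions end up in separate clusters.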

In [11]:

from llama_index.core import (
    VectorStoreIndex,
    Settings,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
    Document
)
from llama_index.llms.vertex import Vertex
from llama_index.embeddings.vertex import VertexTextEmbedding
from vertexai.generative_models import HarmCategory, HarmBlockThreshold
import networkx as nx
from community import community_louvain # pip install python-louvain
import google.auth
import google.auth.transport.requests

credentials = google.auth.default()[0]
request = google.auth.transport.requests.Request()
credentials.refresh(request)


query_list = df["Question"].tolist()
query_docs = [Document(text=t) for t in query_list] # To make it LlamaIndex compatible
embed_model = VertexTextEmbedding(credentials=credentials, model_name="text-embedding-005")
llm = Vertex(model="gemini-2.0-flash-001",
             temperature=0.2,
             max_tokens=8192,
             safety_settings={
                    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
                    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
                    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
                    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        }
)
Settings.llm = llm
Settings.embed_model = embed_model


# Form a local vector index with all our questions
vector_index = VectorStoreIndex.from_documents(query_docs)
vector_retriever = vector_index.as_retriever(similarity_top_k=CLUSTERING_NEIGHBORHOOD_SIZE)


# Create a similarity graph
G = nx.Graph()

# Get a neighborhood of similar questions by querying the vector index
similar_texts = await tqdm_asyncio.gather(*[vector_retriever.aretrieve(text) for i, text in enumerate(query_list)])

for i, text in enumerate(query_list):
  for s in similar_texts[i]:
    G.add_edge(text, s.text)

100%|██████████| 100/100 [00:01<00:00, 77.81it/s]

In [12]:

# Apply Louvain Community Detection
partition = community_louvain.best_partition(G)
df["cluster_idx"] = df["Question"].map(partition)

In [13]:

grouped_df = pd.DataFrame(df.groupby("cluster_idx")['Question'].apply(list)).reset_index()

In [14]:

grouped_df

Out[14]:

	cluster_idx	Question
0	0	[How can I create a virtual machine instance o...
1	1	["What is BigQuery, and how can I use it to an...
2	2	["What are the different tools available for d...
3	3	["I need to increase the storage space on my C...
4	4	["My application is experiencing performance i...
5	5	["What are preemptible instances, and how can ...
6	6	["What is Cloud Load Balancing, and how does i...
7	7	["I need to transfer a large amount of data to...
8	8	["I'm trying to train a machine learning model...
9	9	["I'm concerned about the security of my sensi...

In [15]:

grouped_df["num_questions"] = grouped_df["Question"].apply(len)
grouped_df

Out[15]:

	cluster_idx	Question	num_questions
0	0	[How can I create a virtual machine instance o...	9
1	1	["What is BigQuery, and how can I use it to an...	5
2	2	["What are the different tools available for d...	12
3	3	["I need to increase the storage space on my C...	11
4	4	["My application is experiencing performance i...	8
5	5	["What are preemptible instances, and how can ...	13
6	6	["What is Cloud Load Balancing, and how does i...	12
7	7	["I need to transfer a large amount of data to...	6
8	8	["I'm trying to train a machine learning model...	4
9	9	["I'm concerned about the security of my sensi...	20

Analyze Clusters Using Gemini¶

We can use Gemini to extract summaries, topics, representative questions, sentiment, or any other required information from each cluster. This lets us quickly identify higher-level patterns across user questions and understand the different problems users encounter.

In [16]:

from vertexai.generative_models import GenerativeModel, GenerationConfig
from vertexai.generative_models import HarmCategory, HarmBlockThreshold
from llama_index.core.program import LLMTextCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import Annotated
from enum import Enum
from annotated_types import Len

num_clusters = grouped_df.shape[0]

class Sentiment(Enum):
  POSITIVE = "positive"
  NEGATIVE = "negative"
  NEUTRAL = "neutral"

class ClusterSummary(BaseModel):
  '''A cluster summary, list of topics, most representative questions, and sentiment associated with a cluster of questions from chat sessions.'''
  summary_desc: str
  topics: List[str]
  most_representative_qs: Annotated[List[str], Len(3, 8)]
  sentiment: Sentiment


boring_prompt = """Please provide a brief summary which captures the nature of the given cluster of questions below in the form of "Questions concerning ____".
                  \n Cluster questions:
                  \n {questions_list}
                  \n The cluster titles should not be generic, such as "Google Cloud AI" or "Gemini".
                  \n They need to be specific in order to distinguish the clusters from others which may be similar.
                  \n Also include a list of topic phrases which the questions address, the most representative questions of the cluster, and an overall sentiment. Be sure to follow a consistent format."""

movie_prompt = """You are an expert movie producer for famous movies.
                  \n Please provide a quippy movie title which captures the essence of the given cluster of questions below.
                  \n Example:
                  \n How does RAG work on Vertex?
                  \n Where can I find documentation on Vertex AI Generative model API?
                  \n What are the pitfalls of Gemini vs. Gemma?
                  \n Answer:
                  \n movie title: "Into the Vertex"
                  \n representative qs: How does RAG work on Vertex?
                  \n topics: Vertex AI, Vertex AI Generative Model
                  \n sentiment: neutral
                  \n Cluster questions:
                  \n {questions_list}
                  \n Also include a list of topic phrases which the questions address, the most representative questions of the cluster, and an overall sentiment. Be sure to follow a consistent format. """

async def summarize_cluster(questions: List[str]):
  questions_list = "\n".join(questions)
  llm_program = LLMTextCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(ClusterSummary),
        prompt_template_str=boring_prompt,
        verbose=True,
    )
  try:
    cluster_summary = await llm_program.acall(questions_list=questions_list)
  except Exception as e:
    print(e)
    return None
  return cluster_summary

In [17]:

# Summarize each cluster individually
cluster_summaries = await tqdm_asyncio.gather(*[summarize_cluster(q["Question"]) for idx, q in grouped_df.iterrows()])

100%|██████████| 10/10 [00:05<00:00,  1.70it/s]

In [18]:

cluster_summaries

Out[18]:

[ClusterSummary(summary_desc='Questions concerning the practical usage and troubleshooting of Google Compute Engine virtual machine instances, including instance creation, selection, connection, deletion recovery, clustering for high availability, firewall issues, and pricing.', topics=['Google Compute Engine', 'Virtual Machine Instances', 'Instance Creation', 'Machine Types', 'Pricing', 'SSH Connection', 'Instance Deletion Recovery', 'High Availability Clustering', 'Firewall Troubleshooting', 'Discounts'], most_representative_qs=['How can I create a virtual machine instance on Compute Engine?', 'What are the different machine types available on Compute Engine, and how do I choose the right one for my needs?', 'Can you explain the different pricing options for Compute Engine instances?', 'How do I connect to my Compute Engine instance using SSH?', 'I accidentally deleted my Compute Engine instance. How can I recover it?', 'I want to set up a cluster of Compute Engine instances for high availability. Can you guide me through the process?', "I'm having trouble connecting to my Virtual Machine instance. I think there's a firewall issue. How can I troubleshoot this?"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning the practical application and optimization of BigQuery for large dataset analysis, including data loading, query performance, visualization, and machine learning integration.', topics=['BigQuery', 'large datasets', 'data analysis', 'data loading', 'query optimization', 'performance', 'visualization', 'Data Studio', 'machine learning'], most_representative_qs=['What is BigQuery, and how can I use it to analyze large datasets?', 'I have a large dataset that I want to analyze using BigQuery. How can I load my data into BigQuery?', 'My BigQuery queries are taking a long time to run. How can I optimize my queries for better performance?', 'I want to visualize my data in BigQuery using Data Studio. How can I connect Data Studio to my BigQuery dataset?', 'How can I use machine learning with BigQuery to gain insights from my data?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc="Questions concerning the practical application of Google Cloud's Machine Learning and Data Processing tools for building, deploying, and monitoring models.", topics=['Data Processing on Google Cloud', 'Data Pipelines', 'Data Visualization', 'Real-time Data Processing', 'Pre-trained Models for Image Recognition', 'AutoML', 'Machine Learning Model Deployment', 'Machine Learning Model Monitoring', 'AI and ML Services on Google Cloud'], most_representative_qs=['How can I use Google Cloud to build a data pipeline?', 'How can I use Google Cloud to visualize my data?', 'How can I use AutoML to build a machine learning model without writing any code?', 'I need to monitor the performance of my deployed machine learning model. What tools are available on Google Cloud?', 'How can I use Google Cloud to build a machine learning model?', 'How can I use Google Cloud to deploy my machine learning model?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning Google Cloud database services, particularly storage, migration, scaling, and performance optimization for Cloud SQL and Cloud Spanner.', topics=['Google Cloud Storage', 'Database Services', 'Cloud SQL', 'Cloud Spanner', 'Database Migration', 'Database Scaling', 'Storage Capacity', 'Database Replication', 'Query Performance', 'Database Security'], most_representative_qs=['What database services are available on Google Cloud?', 'What is the difference between Cloud SQL and Cloud Spanner?', 'How do I migrate my existing database to Google Cloud?', 'How can I scale my database on Google Cloud?', 'My Cloud SQL database is running out of storage space. How can I increase the storage capacity?', 'I need to replicate my Cloud SQL database to another region for disaster recovery. How can I set up database replication?', "I'm experiencing slow query performance on my Cloud Spanner database. How can I optimize my database and queries?"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning the monitoring, troubleshooting, and optimization of application performance on Google Cloud Platform, particularly focusing on diagnosing latency, utilizing logging and monitoring tools, and understanding specific services like Compute Engine and Cloud Run.', topics=['application performance', 'troubleshooting', 'optimization', 'Compute Engine', 'network latency', 'Google Kubernetes Engine', 'Cloud Logging', 'Cloud Monitoring', 'Cloud Run'], most_representative_qs=['My application is experiencing performance issues. How can I troubleshoot and optimize my Compute Engine instance?', 'My application is experiencing high latency. Could it be a networking issue? How can I diagnose and resolve network latency problems?', 'I want to monitor the performance of my applications running on Google Kubernetes Engine. What tools can I use?', 'How can I monitor the performance and logs of my Cloud Run services?', 'What is Cloud Monitoring, and how does it work?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning cost optimization strategies and troubleshooting unexpected expenses within Google Cloud Platform.', topics=['Preemptible Instances', 'Cost Optimization', 'Unused Resources', 'Cloud Storage Costs', 'High CPU Usage', 'Cost Allocation', 'Budgeting', 'Cost Analysis', 'Resource Management'], most_representative_qs=['What are preemptible instances, and how can they save me money?', "I'm getting billed for a Compute Engine instance that I'm not using. How can I identify and shut down unused instances?", 'My Cloud Storage costs are higher than expected. How can I analyze my usage and optimize my storage costs?', 'My Google Cloud bill is higher than expected this month. How can I identify the source of the increased cost?', 'I want to track the cost of my Google Cloud resources by department. How can I set up cost allocation?', 'What are some best practices for optimizing my Google Cloud costs?'], sentiment=<Sentiment.NEGATIVE: 'negative'>),
 ClusterSummary(summary_desc='Questions concerning the automation of application deployment, management, and scaling on Google Cloud, particularly focusing on serverless technologies like Cloud Functions and Cloud Run.', topics=['Cloud Load Balancing', 'application deployment automation', 'Google Cloud resource management', 'task automation on Google Cloud', 'serverless computing', 'serverless platforms on Google Cloud', 'serverless application development and deployment', 'Cloud Functions', 'Google Cloud Run', 'containerized applications', 'API development with Cloud Functions', 'Cloud Function timeout limits', 'Cloud Run deployment configuration'], most_representative_qs=['I need to automate the deployment of my applications on Google Cloud. What tools and services can I use?', 'What tools are available for managing my Google Cloud resources?', 'How can I automate tasks on Google Cloud?', 'What is serverless computing, and what are its benefits?', 'What serverless platforms are available on Google Cloud?', 'How can I build and deploy a serverless application on Google Cloud?', 'What is Cloud Functions, and how does it work?', 'How can I use Google Cloud Run to deploy containerized applications?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning the practical aspects of using Google Cloud Storage, such as data transfer methods, file recovery, access management, and pricing.', topics=['data transfer', 'file recovery', 'public access', 'data upload', 'data access', 'pricing'], most_representative_qs=["I need to transfer a large amount of data to Google Cloud Storage. What's the most efficient way to do this?", 'I accidentally deleted some files from my Cloud Storage bucket. How can I recover them?', 'I want to make my data in Cloud Storage available to the public. How can I configure public access?', 'How much does it cost to store data in Google Cloud Storage?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning practical challenges and usage of Vertex AI for machine learning tasks.', topics=['Vertex AI', 'Machine Learning', 'Model Training', 'Error Troubleshooting', 'Model Deployment', 'API', 'IAM Configuration'], most_representative_qs=["I'm trying to train a machine learning model on Vertex AI, but I'm getting errors. How can I troubleshoot these errors?", 'I want to deploy my trained machine learning model as an API. How can I do this using Vertex AI?', 'What is Vertex AI, and how can I use it?', "I'm having trouble configuring IAM to use Vertex AI. What do I do?"], sentiment=<Sentiment.NEUTRAL: 'neutral'>),
 ClusterSummary(summary_desc='Questions concerning securing Google Cloud resources and infrastructure, particularly focusing on networking, access control, and data protection.', topics=['Cloud Storage Security', 'Virtual Private Cloud (VPC)', 'Firewall Rules', 'Network Connectivity', 'Network Security', 'Data Security', 'Identity and Access Management (IAM)', 'Multi-Factor Authentication', 'Security Best Practices', 'Security Monitoring', 'Vulnerability Remediation', 'Secure Application Development'], most_representative_qs=['What are the security features of Google Cloud Storage?', 'How do I create a Virtual Private Cloud (VPC) on Google Cloud?', 'What are firewalls, and how do I configure them in Google Cloud?', 'How can I connect my on-premises network to Google Cloud?', 'How can I secure my applications and data on Google Cloud?', 'What is Identity and Access Management (IAM), and how does it work?', 'How can I implement multi-factor authentication on Google Cloud?', 'What are security best practices for Google Cloud?'], sentiment=<Sentiment.NEUTRAL: 'neutral'>)]

In [19]:

just_summaries = [c.summary_desc if c else None for c in cluster_summaries]

In [20]:

df_grouped_by_cluster = df.groupby("cluster_idx").agg("count")
df_grouped_by_cluster["cluster_summary"] = cluster_summaries
df_grouped_by_cluster["just_summary"] = just_summaries
df_grouped_by_cluster["questions_list"] = grouped_df["Question"]

In [21]:

from functools import partial
from random import sample
from typing import List

from fasthtml.common import *
from fasthtml.fastapp import *
from fasthtml.components import Zero_md

# Load Tailwind CSS and daisyUI from their CDNs to style the cards.
tlink = Script(src="https://cdn.tailwindcss.com")
dlink = Link(rel="stylesheet", href="https://cdn.jsdelivr.net/npm/daisyui@4.11.1/dist/full.min.css")
app = FastHTML(hdrs=(dlink, tlink))


def Markdown(md, css=''):
    # Wrap a Markdown string in a <zero-md> element with optional extra CSS.
    css_template = Template(Style(css), data_append=True)
    return Zero_md(css_template, Script(md, type="text/markdown"))


def MarkdownWOutBackground(md: str):
    # Same as Markdown, but without the default background and text colors.
    css = '.markdown-body {background-color: unset !important; color: unset !important;} .markdown-body table {color: black !important;}'
    markdown_wout_background = partial(Markdown, css=css)
    return markdown_wout_background(md)


def stat_card(num_questions: int):
    # Small daisyUI "stat" block showing how many questions a cluster contains.
    return Div(
        Div('Total Questions', cls='stat-title'),
        Div(f'{num_questions}', cls='stat-value'),
        cls='stat'
    )


def cluster_card(cluster_summary: ClusterSummary, questions_list: List[str]):
    # Map the cluster sentiment to a daisyUI badge color.
    if cluster_summary.sentiment == Sentiment.NEGATIVE:
        badge_color = "error"
    elif cluster_summary.sentiment == Sentiment.NEUTRAL:
        badge_color = "neutral"
    else:
        badge_color = "success"
    return Div(
        Div(
            H2(cluster_summary.summary_desc, cls='card-title'),
            Div(
                stat_card(len(questions_list)),
                Div(cluster_summary.sentiment, cls=f'badge badge-{badge_color}'),
                cls="flex flex-row items-center"
            ),
            H4("Representative Questions:", cls="font-bold"),
            Ul(
                *[Li(q) for q in cluster_summary.most_representative_qs],
                cls='list-disc list-inside mt-2'
            ),
            H4("Topics Discussed:", cls="font-bold"),
            Ul(
                *[Li(t) for t in cluster_summary.topics],
                cls='list-disc list-inside mt-2'
            ),
            cls='card-body'
        ),
        cls='card bg-base-100 shadow-xl'
    )


@app.get("/")
def cluster_analysis():
    # Render one card per cluster in a two-column grid.
    return Div(
        *[cluster_card(c, q) for c, q in zip(cluster_summaries, df_grouped_by_cluster["questions_list"])],
        cls="grid grid-cols-2 gap-2"
    )

Gemini-generated Cluster Analysis

In [22]:

from starlette.testclient import TestClient
client = TestClient(app)
r = client.get("/")
show(r.content)

Out[22]:

FastHTML page

Questions concerning the practical usage and troubleshooting of Google Compute Engine virtual machine instances, including instance creation, selection, connection, deletion recovery, clustering for high availability, firewall issues, and pricing.

Total Questions

9

Sentiment.NEUTRAL

Representative Questions:

How can I create a virtual machine instance on Compute Engine?
What are the different machine types available on Compute Engine, and how do I choose the right one for my needs?
Can you explain the different pricing options for Compute Engine instances?
How do I connect to my Compute Engine instance using SSH?
I accidentally deleted my Compute Engine instance. How can I recover it?
I want to set up a cluster of Compute Engine instances for high availability. Can you guide me through the process?
I'm having trouble connecting to my Virtual Machine instance. I think there's a firewall issue. How can I troubleshoot this?

Topics Discussed:

Google Compute Engine
Virtual Machine Instances
Instance Creation
Machine Types
Pricing
SSH Connection
Instance Deletion Recovery
High Availability Clustering
Firewall Troubleshooting
Discounts

Questions concerning the practical application and optimization of BigQuery for large dataset analysis, including data loading, query performance, visualization, and machine learning integration.

Total Questions

5

Sentiment.NEUTRAL

Representative Questions:

What is BigQuery, and how can I use it to analyze large datasets?
I have a large dataset that I want to analyze using BigQuery. How can I load my data into BigQuery?
My BigQuery queries are taking a long time to run. How can I optimize my queries for better performance?
I want to visualize my data in BigQuery using Data Studio. How can I connect Data Studio to my BigQuery dataset?
How can I use machine learning with BigQuery to gain insights from my data?

Topics Discussed:

BigQuery
large datasets
data analysis
data loading
query optimization
performance
visualization
Data Studio
machine learning

Questions concerning the practical application of Google Cloud's Machine Learning and Data Processing tools for building, deploying, and monitoring models.

Total Questions

12

Sentiment.NEUTRAL

Representative Questions:

How can I use Google Cloud to build a data pipeline?
How can I use Google Cloud to visualize my data?
How can I use AutoML to build a machine learning model without writing any code?
I need to monitor the performance of my deployed machine learning model. What tools are available on Google Cloud?
How can I use Google Cloud to build a machine learning model?
How can I use Google Cloud to deploy my machine learning model?

Topics Discussed:

Data Processing on Google Cloud
Data Pipelines
Data Visualization
Real-time Data Processing
Pre-trained Models for Image Recognition
AutoML
Machine Learning Model Deployment
Machine Learning Model Monitoring
AI and ML Services on Google Cloud

Questions concerning Google Cloud database services, particularly storage, migration, scaling, and performance optimization for Cloud SQL and Cloud Spanner.

Total Questions

11

Sentiment.NEUTRAL

Representative Questions:

What database services are available on Google Cloud?
What is the difference between Cloud SQL and Cloud Spanner?
How do I migrate my existing database to Google Cloud?
How can I scale my database on Google Cloud?
My Cloud SQL database is running out of storage space. How can I increase the storage capacity?
I need to replicate my Cloud SQL database to another region for disaster recovery. How can I set up database replication?
I'm experiencing slow query performance on my Cloud Spanner database. How can I optimize my database and queries?

Topics Discussed:

Google Cloud Storage
Database Services
Cloud SQL
Cloud Spanner
Database Migration
Database Scaling
Storage Capacity
Database Replication
Query Performance
Database Security

Questions concerning the monitoring, troubleshooting, and optimization of application performance on Google Cloud Platform, particularly focusing on diagnosing latency, utilizing logging and monitoring tools, and understanding specific services like Compute Engine and Cloud Run.

Total Questions

8

Sentiment.NEUTRAL

Representative Questions:

My application is experiencing performance issues. How can I troubleshoot and optimize my Compute Engine instance?
My application is experiencing high latency. Could it be a networking issue? How can I diagnose and resolve network latency problems?
I want to monitor the performance of my applications running on Google Kubernetes Engine. What tools can I use?
How can I monitor the performance and logs of my Cloud Run services?
What is Cloud Monitoring, and how does it work?

Topics Discussed:

application performance
troubleshooting
optimization
Compute Engine
network latency
Google Kubernetes Engine
Cloud Logging
Cloud Monitoring
Cloud Run

Questions concerning cost optimization strategies and troubleshooting unexpected expenses within Google Cloud Platform.

Total Questions

13

Sentiment.NEGATIVE

Representative Questions:

What are preemptible instances, and how can they save me money?
I'm getting billed for a Compute Engine instance that I'm not using. How can I identify and shut down unused instances?
My Cloud Storage costs are higher than expected. How can I analyze my usage and optimize my storage costs?
My Google Cloud bill is higher than expected this month. How can I identify the source of the increased cost?
I want to track the cost of my Google Cloud resources by department. How can I set up cost allocation?
What are some best practices for optimizing my Google Cloud costs?

Topics Discussed:

Preemptible Instances
Cost Optimization
Unused Resources
Cloud Storage Costs
High CPU Usage
Cost Allocation
Budgeting
Cost Analysis
Resource Management

Questions concerning the automation of application deployment, management, and scaling on Google Cloud, particularly focusing on serverless technologies like Cloud Functions and Cloud Run.

Total Questions

12

Sentiment.NEUTRAL

Representative Questions:

I need to automate the deployment of my applications on Google Cloud. What tools and services can I use?
What tools are available for managing my Google Cloud resources?
How can I automate tasks on Google Cloud?
What is serverless computing, and what are its benefits?
What serverless platforms are available on Google Cloud?
How can I build and deploy a serverless application on Google Cloud?
What is Cloud Functions, and how does it work?
How can I use Google Cloud Run to deploy containerized applications?

Topics Discussed:

Cloud Load Balancing
application deployment automation
Google Cloud resource management
task automation on Google Cloud
serverless computing
serverless platforms on Google Cloud
serverless application development and deployment
Cloud Functions
Google Cloud Run
containerized applications
API development with Cloud Functions
Cloud Function timeout limits
Cloud Run deployment configuration

Questions concerning the practical aspects of using Google Cloud Storage, such as data transfer methods, file recovery, access management, and pricing.

Total Questions

6

Sentiment.NEUTRAL

Representative Questions:

I need to transfer a large amount of data to Google Cloud Storage. What's the most efficient way to do this?
I accidentally deleted some files from my Cloud Storage bucket. How can I recover them?
I want to make my data in Cloud Storage available to the public. How can I configure public access?
How much does it cost to store data in Google Cloud Storage?

Topics Discussed:

data transfer
file recovery
public access
data upload
data access
pricing

Questions concerning practical challenges and usage of Vertex AI for machine learning tasks.

Total Questions

4

Sentiment.NEUTRAL

Representative Questions:

I'm trying to train a machine learning model on Vertex AI, but I'm getting errors. How can I troubleshoot these errors?
I want to deploy my trained machine learning model as an API. How can I do this using Vertex AI?
What is Vertex AI, and how can I use it?
I'm having trouble configuring IAM to use Vertex AI. What do I do?

Topics Discussed:

Vertex AI
Machine Learning
Model Training
Error Troubleshooting
Model Deployment
API
IAM Configuration

Questions concerning securing Google Cloud resources and infrastructure, particularly focusing on networking, access control, and data protection.

Total Questions

20

Sentiment.NEUTRAL

Representative Questions:

What are the security features of Google Cloud Storage?
How do I create a Virtual Private Cloud (VPC) on Google Cloud?
What are firewalls, and how do I configure them in Google Cloud?
How can I connect my on-premises network to Google Cloud?
How can I secure my applications and data on Google Cloud?
What is Identity and Access Management (IAM), and how does it work?
How can I implement multi-factor authentication on Google Cloud?
What are security best practices for Google Cloud?

Topics Discussed:

Cloud Storage Security
Virtual Private Cloud (VPC)
Firewall Rules
Network Connectivity
Network Security
Data Security
Identity and Access Management (IAM)
Multi-Factor Authentication
Security Best Practices
Security Monitoring
Vulnerability Remediation
Secure Application Development

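Sample Questions from Each Cluster to create the Eval Dataset

There are two ways to pick evaluation questions from the clusters:

We can sample randomly from each cluster, in proportion to the cluster's size.
Or we can take the questions that Gemini identified as most representative of each cluster.

It is probably worth sitting down with a subject matter expert (SME) to compare both sets: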

In [24]:

# question_len holds the per-cluster row count produced by agg("count")
total_questions = df_grouped_by_cluster['question_len'].sum()

# Calculate each cluster's share of all questions
df_grouped_by_cluster['cluster_fraction'] = df_grouped_by_cluster['question_len'] / total_questions

# Sample without replacement from a cluster's questions, in proportion to its share.
# int() rounds down, so the overall total can fall slightly below num_samples.
def sample_questions(row, num_samples):
    return np.random.choice(row['questions_list'],
                            size=int(num_samples * row['cluster_fraction']),
                            replace=False).tolist()

# Specify the total number of samples you want
total_samples = 50

# Apply the sampling function to each row
df_grouped_by_cluster['proportional_sampled_questions'] = df_grouped_by_cluster.apply(lambda row: sample_questions(row, total_samples), axis=1)

# Move cluster_idx from the index back into a regular column
df_grouped_by_cluster = df_grouped_by_cluster.reset_index()

# For each cluster, pair the cluster summary with every sampled question
unrolled_proportional_df = df_grouped_by_cluster.apply(lambda x: pd.Series({
    'cluster_title': [x["just_summary"]] * len(x['proportional_sampled_questions']),
    'sampled_question': x['proportional_sampled_questions']
}), axis=1)

# Explode both list columns so each sampled question gets its own row, then reset the index
unrolled_proportional_df = pd.concat([unrolled_proportional_df['cluster_title'].explode(),
                         unrolled_proportional_df['sampled_question'].explode()],
                        axis=1).reset_index(drop=True)

In [25]:

unrolled_proportional_df

Out[25]:

	cluster_title	sampled_question
0	Questions concerning the practical usage and t...	"What are the different machine types availabl...
1	Questions concerning the practical usage and t...	"I accidentally deleted my Compute Engine inst...
2	Questions concerning the practical usage and t...	"Are there any discounts or sustained use disc...
3	Questions concerning the practical usage and t...	"I'm having trouble connecting to my Virtual M...
4	Questions concerning the practical application...	"How can I use machine learning with BigQuery ...
5	Questions concerning the practical application...	"What is BigQuery, and how can I use it to ana...
6	Questions concerning the practical application...	"What is Dataflow, and how does it work?"
7	Questions concerning the practical application...	"What are the different tools available for da...
8	Questions concerning the practical application...	"How can I use Google Cloud to build a machine...
9	Questions concerning the practical application...	"I need to monitor the performance of my deplo...
10	Questions concerning the practical application...	"How can I use AutoML to build a machine learn...
11	Questions concerning the practical application...	"How can I use Google Cloud to visualize my da...
12	Questions concerning Google Cloud database ser...	"What are the different storage options availa...
13	Questions concerning Google Cloud database ser...	"What database services are available on Googl...
14	Questions concerning Google Cloud database ser...	"I need to increase the storage space on my Co...
15	Questions concerning Google Cloud database ser...	"How can I scale my database on Google Cloud?"
16	Questions concerning Google Cloud database ser...	"How do I migrate my existing database to Goog...
17	Questions concerning the monitoring, troublesh...	"How can I monitor the performance and logs of...
18	Questions concerning the monitoring, troublesh...	"My application is experiencing performance is...
19	Questions concerning the monitoring, troublesh...	"What is Cloud Logging, and how can I use it t...
20	Questions concerning the monitoring, troublesh...	"I'm trying to troubleshoot an issue with my a...
21	Questions concerning cost optimization strateg...	"What are some best practices for optimizing m...
22	Questions concerning cost optimization strateg...	"My Cloud Storage costs are higher than expect...
23	Questions concerning cost optimization strateg...	"I'm not using some of my Google Cloud resourc...
24	Questions concerning cost optimization strateg...	"How can I track and manage my Google Cloud co...
25	Questions concerning cost optimization strateg...	"What tools are available for cost management ...
26	Questions concerning cost optimization strateg...	"My Google Cloud bill is higher than expected ...
27	Questions concerning the automation of applica...	"My Cloud Function is timing out. How can I in...
28	Questions concerning the automation of applica...	"What is serverless computing, and what are it...
29	Questions concerning the automation of applica...	"How can I use Google Cloud Run to deploy cont...
30	Questions concerning the automation of applica...	"I need to deploy a containerized application ...
31	Questions concerning the automation of applica...	"What is Cloud Functions, and how does it work?"
32	Questions concerning the automation of applica...	"I want to build a simple API using Cloud Func...
33	Questions concerning the practical aspects of ...	"How can I upload data to Google Cloud Storage?"
34	Questions concerning the practical aspects of ...	"I want to make my data in Cloud Storage avail...
35	Questions concerning the practical aspects of ...	"I need to transfer a large amount of data to ...
36	Questions concerning practical challenges and ...	"I'm trying to train a machine learning model ...
37	Questions concerning practical challenges and ...	"I'm having trouble configuring IAM to use Ver...
38	Questions concerning securing Google Cloud res...	"How can I improve the security of my network ...
39	Questions concerning securing Google Cloud res...	"What is Identity and Access Management (IAM),...
40	Questions concerning securing Google Cloud res...	"I want to ensure that only authorized users c...
41	Questions concerning securing Google Cloud res...	"I'm concerned about the security of my sensit...
42	Questions concerning securing Google Cloud res...	"What are firewalls, and how do I configure th...
43	Questions concerning securing Google Cloud res...	"How can I connect my on-premises network to G...
44	Questions concerning securing Google Cloud res...	"What are the security features of Google Clou...
45	Questions concerning securing Google Cloud res...	"I want to connect my Cloud Function to a Clou...
46	Questions concerning securing Google Cloud res...	"I want to ensure that my network traffic is s...
47	Questions concerning securing Google Cloud res...	"What are security best practices for Google C...
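
The table above has 48 rows rather than the 50 requested, because `int(num_samples * row['cluster_fraction'])` rounds every cluster's quota down. If hitting the exact total matters, one option is a largest-remainder allocation: floor each quota, then give the leftover samples to the clusters with the largest fractional parts. The following is a minimal sketch of that idea, not part of the original notebook; the helper name `allocate_samples` is hypothetical, and the cluster sizes are copied from the "Total Questions" values in the cards above.

```python
import math


def allocate_samples(cluster_sizes, total_samples):
    """Split total_samples across clusters in proportion to their sizes.

    Hypothetical helper: floors each quota, then hands the leftover samples
    to the clusters with the largest fractional remainders.
    """
    total = sum(cluster_sizes)
    quotas = [total_samples * size / total for size in cluster_sizes]
    counts = [math.floor(q) for q in quotas]
    leftover = total_samples - sum(counts)
    # Rank clusters by fractional remainder, largest first (ties keep input order).
    by_remainder = sorted(
        range(len(quotas)), key=lambda i: quotas[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    # Never ask for more questions than a cluster actually contains.
    return [min(c, s) for c, s in zip(counts, cluster_sizes)]


# Cluster sizes as shown in the "Total Questions" cards above.
cluster_sizes = [9, 5, 12, 11, 8, 13, 12, 6, 4, 20]

# Quotas as computed by the notebook's int() truncation.
floored = [int(50 * s / sum(cluster_sizes)) for s in cluster_sizes]
balanced = allocate_samples(cluster_sizes, 50)

print(sum(floored), sum(balanced))  # prints: 48 50
```

The resulting per-cluster counts could then be passed as `size` to `np.random.choice` in place of the truncated value.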

In [26]:

# Pull the representative questions (and their count) out of each ClusterSummary
df_grouped_by_cluster["gemini_representative_questions_len"] = df_grouped_by_cluster["cluster_summary"].apply(lambda x: len(x.most_representative_qs))
df_grouped_by_cluster["gemini_representative_questions"] = df_grouped_by_cluster["cluster_summary"].apply(lambda x: x.most_representative_qs)

# For each cluster, pair the cluster summary with every representative question
unrolled_gemini_df = df_grouped_by_cluster.apply(lambda x: pd.Series({
    'cluster_title': [x["just_summary"]] * len(x['gemini_representative_questions']),
    'representative_question': x['gemini_representative_questions']
}), axis=1)

# Explode both list columns so each question gets its own row, then reset the index
unrolled_gemini_df = pd.concat([unrolled_gemini_df['cluster_title'].explode(),
                         unrolled_gemini_df['representative_question'].explode()],
                        axis=1).reset_index(drop=True)

In [27]:

unrolled_gemini_df

Out[27]:

	cluster_title	representative_question
0	Questions concerning the practical usage and t...	How can I create a virtual machine instance on...
1	Questions concerning the practical usage and t...	What are the different machine types available...
2	Questions concerning the practical usage and t...	Can you explain the different pricing options ...
3	Questions concerning the practical usage and t...	How do I connect to my Compute Engine instance...
4	Questions concerning the practical usage and t...	I accidentally deleted my Compute Engine insta...
5	Questions concerning the practical usage and t...	I want to set up a cluster of Compute Engine i...
6	Questions concerning the practical usage and t...	I'm having trouble connecting to my Virtual Ma...
7	Questions concerning the practical application...	What is BigQuery, and how can I use it to anal...
8	Questions concerning the practical application...	I have a large dataset that I want to analyze ...
9	Questions concerning the practical application...	My BigQuery queries are taking a long time to ...
10	Questions concerning the practical application...	I want to visualize my data in BigQuery using ...
11	Questions concerning the practical application...	How can I use machine learning with BigQuery t...
12	Questions concerning the practical application...	How can I use Google Cloud to build a data pip...
13	Questions concerning the practical application...	How can I use Google Cloud to visualize my data?
14	Questions concerning the practical application...	How can I use AutoML to build a machine learni...
15	Questions concerning the practical application...	I need to monitor the performance of my deploy...
16	Questions concerning the practical application...	How can I use Google Cloud to build a machine ...
17	Questions concerning the practical application...	How can I use Google Cloud to deploy my machin...
18	Questions concerning Google Cloud database ser...	What database services are available on Google...
19	Questions concerning Google Cloud database ser...	What is the difference between Cloud SQL and C...
20	Questions concerning Google Cloud database ser...	How do I migrate my existing database to Googl...
21	Questions concerning Google Cloud database ser...	How can I scale my database on Google Cloud?
22	Questions concerning Google Cloud database ser...	My Cloud SQL database is running out of storag...
23	Questions concerning Google Cloud database ser...	I need to replicate my Cloud SQL database to a...
24	Questions concerning Google Cloud database ser...	I'm experiencing slow query performance on my ...
25	Questions concerning the monitoring, troublesh...	My application is experiencing performance iss...
26	Questions concerning the monitoring, troublesh...	My application is experiencing high latency. C...
27	Questions concerning the monitoring, troublesh...	I want to monitor the performance of my applic...
28	Questions concerning the monitoring, troublesh...	How can I monitor the performance and logs of ...
29	Questions concerning the monitoring, troublesh...	What is Cloud Monitoring, and how does it work?
30	Questions concerning cost optimization strateg...	What are preemptible instances, and how can th...
31	Questions concerning cost optimization strateg...	I'm getting billed for a Compute Engine instan...
32	Questions concerning cost optimization strateg...	My Cloud Storage costs are higher than expecte...
33	Questions concerning cost optimization strateg...	My Google Cloud bill is higher than expected t...
34	Questions concerning cost optimization strateg...	I want to track the cost of my Google Cloud re...
35	Questions concerning cost optimization strateg...	What are some best practices for optimizing my...
36	Questions concerning the automation of applica...	I need to automate the deployment of my applic...
37	Questions concerning the automation of applica...	What tools are available for managing my Googl...
38	Questions concerning the automation of applica...	How can I automate tasks on Google Cloud?
39	Questions concerning the automation of applica...	What is serverless computing, and what are its...
40	Questions concerning the automation of applica...	What serverless platforms are available on Goo...
41	Questions concerning the automation of applica...	How can I build and deploy a serverless applic...
42	Questions concerning the automation of applica...	What is Cloud Functions, and how does it work?
43	Questions concerning the automation of applica...	How can I use Google Cloud Run to deploy conta...
44	Questions concerning the practical aspects of ...	I need to transfer a large amount of data to G...
45	Questions concerning the practical aspects of ...	I accidentally deleted some files from my Clou...
46	Questions concerning the practical aspects of ...	I want to make my data in Cloud Storage availa...
47	Questions concerning the practical aspects of ...	How much does it cost to store data in Google ...
48	Questions concerning practical challenges and ...	I'm trying to train a machine learning model o...
49	Questions concerning practical challenges and ...	I want to deploy my trained machine learning m...
50	Questions concerning practical challenges and ...	What is Vertex AI, and how can I use it?
51	Questions concerning practical challenges and ...	I'm having trouble configuring IAM to use Vert...
52	Questions concerning securing Google Cloud res...	What are the security features of Google Cloud...
53	Questions concerning securing Google Cloud res...	How do I create a Virtual Private Cloud (VPC) ...
54	Questions concerning securing Google Cloud res...	What are firewalls, and how do I configure the...
55	Questions concerning securing Google Cloud res...	How can I connect my on-premises network to Go...
56	Questions concerning securing Google Cloud res...	How can I secure my applications and data on G...
57	Questions concerning securing Google Cloud res...	What is Identity and Access Management (IAM), ...
58	Questions concerning securing Google Cloud res...	How can I implement multi-factor authenticatio...
59	Questions concerning securing Google Cloud res...	What are security best practices for Google Cl...

Save Results to CSV¶

We still need to obtain ground truth answers for these questions.
However, we can be confident that the effort is going toward relevant, representative questions.

In [28]:

unrolled_gemini_df.to_csv("representative_eval_questions.csv")
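
To prepare the saved questions for ground truth collection, one option is to add an empty answer column that reviewers can fill in. The following is a minimal sketch and not part of the original notebook. The column names `cluster_description`, `question`, and `ground_truth_answer` and the helper `add_ground_truth_column` are hypothetical, and a small stand-in DataFrame is used in place of `unrolled_gemini_df`.

```python
import io

import pandas as pd


def add_ground_truth_column(
    df: pd.DataFrame, column: str = "ground_truth_answer"
) -> pd.DataFrame:
    """Return a copy of df with an empty text column for reviewers to fill in."""
    out = df.copy()
    if column not in out.columns:
        out[column] = ""
    return out


# Stand-in for unrolled_gemini_df; these column names are assumptions.
sample_df = pd.DataFrame(
    {
        "cluster_description": [
            "Questions concerning cost optimization",
            "Questions concerning cost optimization",
        ],
        "question": [
            "What are preemptible instances?",
            "How can I track the cost of my resources?",
        ],
    }
)

template_df = add_ground_truth_column(sample_df)

# Written to an in-memory buffer here; in the notebook you would pass a file path.
buffer = io.StringIO()
# index=False avoids an extra unnamed index column when the file is reloaded.
template_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing with `index=False` is a design choice for this sketch: the call above saves the index as well, which shows up as an extra unnamed column when the CSV is read back with `pd.read_csv`.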

Conclusion¶

With this notebook, you can take a large set of user queries from a RAG system and get immediate insight into the types of questions people are asking, with Gemini describing and analyzing useful clusters of queries. This analysis can inform decisions about how to improve the RAG system, or it may highlight issues in the business or product beyond what the chatbot can address. Finally, you can sample queries from these clusters to obtain a representative set of evaluation questions for continuously evaluating the RAG system over time.

The next step is to take this set of representative questions, obtain ground truth answers from users or subject matter experts, and then evaluate performance using a service like Vertex AI Evaluation Service.