RAG on GKE
This is a sample to deploy a Retrieval Augmented Generation (RAG) application on GKE.
The latest recommended release is branch release-1.1.
What is RAG?
RAG is a popular approach for boosting the accuracy of LLM responses, particularly for domain-specific or private data sets.
RAG uses a semantically searchable knowledge base (such as a vector search index) to retrieve snippets relevant to a given prompt and passes them to the LLM as additional context. Augmenting the knowledge base with additional data is typically cheaper than fine-tuning and scales better when incorporating current events and other rapidly changing data.
RAG on GKE Architecture
- A GKE service endpoint serving Hugging Face TGI inference using mistral-7b.
- A Cloud SQL pgvector instance with vector embeddings generated from an input dataset.
- A Ray cluster running on GKE that runs jobs to generate embeddings and populate the vector DB.
- A Jupyter notebook running on GKE that reads the dataset using GCS fuse driver integrations and runs a Ray job to populate the vector DB.
- A front end chat interface running on GKE that prompts the inference server with context from the vector DB.
This tutorial walks you through installing the RAG infrastructure in a GCP project, generating vector embeddings for a sample Kaggle Netflix shows dataset and prompting the LLM with context.
Prerequisites
Install tooling (required)
Install the following on your computer:
- kubectl
- Terraform
- Helm
- gcloud
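As a quick sanity check before continuing, the sketch below reports which of the four tools are not yet on your PATH. The helper name missing_tools is hypothetical and not part of the repository; it only tests that each command can be found, not that its version is suitable.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which of the tutorial's prerequisites still need installing.
missing="$(missing_tools kubectl terraform helm gcloud)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```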
Bring your own cluster (optional)
By default, this tutorial creates a cluster on your behalf. We highly recommend following the default settings.
If you prefer to manage your own cluster, set create_cluster = false and make sure the network_name is set to your cluster's network in the Installation section. Creating a long-running cluster may be better for development, allowing you to iterate on Terraform components without recreating the cluster every time.
Use gcloud to create a GKE Autopilot cluster. Note that RAG requires the latest Autopilot features, available on the latest versions of 1.28 and 1.29.
gcloud container clusters create-auto rag-cluster \
--location us-central1 \
--cluster-version 1.28
Bring your own VPC (optional)
By default, this tutorial creates a new network on your behalf with Private Service Connect already enabled. We highly recommend following the default settings.
If you prefer to use your own VPC, set create_network = false in the Installation section. This also requires enabling Private Service Connect for your VPC. Without Private Service Connect, the RAG components cannot connect to the vector DB.
Installation
This section sets up the RAG infrastructure in your GCP project using Terraform.
NOTE: Terraform keeps state metadata in a local file called terraform.tfstate. Deleting the file may cause some resources to not be cleaned up correctly even if you delete the cluster. We suggest using terraform destroy before reapplying/reinstalling.
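One way to act on the note above is to check for a leftover state file before reinstalling. This is a minimal sketch: the helper name has_local_state is hypothetical, and it assumes the default local backend, where the state lives next to the Terraform files.

```shell
# Return success when a local Terraform state file exists in the given directory
# (defaults to the current directory).
has_local_state() {
  [ -f "${1:-.}/terraform.tfstate" ]
}

if has_local_state .; then
  echo "Found terraform.tfstate: run 'terraform destroy --var-file workloads.tfvars' before reinstalling."
else
  echo "No local state file found in this directory."
fi
```

Run it from the directory where you ran terraform apply. A missing file only means there is no local state in that directory; it does not prove that no resources exist.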
- cd ai-on-gke/applications/rag
- Edit workloads.tfvars to set your project ID, location, cluster name, and GCS bucket name. Ensure the gcs_bucket name is globally unique (add a random suffix). Optionally, make the following changes:
  - (Recommended) Enable authenticated access for JupyterHub, frontend chat and Ray dashboard services.
  - (Optional) Set a custom kubernetes_namespace where all k8s resources will be created.
  - (Optional) Set autopilot_cluster = false to deploy using GKE Standard.
  - (Optional) Set create_cluster = false if you are bringing your own cluster. If using a GKE Standard cluster, ensure it has an L4 nodepool with autoscaling and node autoprovisioning enabled. You can simplify setup by following the Terraform instructions in infrastructure/README.md.
  - (Optional) Set create_network = false if you are bringing your own VPC. Ensure your VPC has Private Service Connect enabled as described above.
- Run terraform init
- Run terraform apply --var-file workloads.tfvars
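The steps above ask for a globally unique gcs_bucket name with a random suffix. A minimal sketch of one way to generate such a name follows; the helper name unique_bucket_name and the rag-data prefix are illustrative assumptions, not values from the repository.

```shell
# Build a bucket name of the form <prefix>-<8 hex chars> from 4 random bytes.
unique_bucket_name() {
  prefix="$1"
  suffix="$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')"
  printf '%s-%s\n' "$prefix" "$suffix"
}

# Print a candidate name to paste into workloads.tfvars as the gcs_bucket value.
unique_bucket_name rag-data
```

A random suffix makes a collision unlikely but does not guarantee the name is free; terraform apply still reports an error if the bucket already exists.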
Generate vector embeddings for the dataset
This section generates the vector embeddings for your input dataset. Currently, the default dataset is Netflix shows. We will use a Jupyter notebook to run a Ray job that generates the embeddings and stores them in the pgvector instance created above.
Set your namespace, cluster name and location (from workloads.tfvars):
export NAMESPACE=rag
export CLUSTER_LOCATION=us-east4
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
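Several later commands expand ${NAMESPACE}, ${CLUSTER_NAME} and ${CLUSTER_LOCATION}, so an unset variable tends to fail in confusing ways. The sketch below checks them up front. The helper name require_vars is hypothetical, and the exported values are the example values from this section; replace them with your own.

```shell
# Return non-zero and name each variable that is unset or empty.
require_vars() {
  missing_count=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing_count=1
    fi
  done
  return "$missing_count"
}

# Example values from this tutorial.
export NAMESPACE=rag
export CLUSTER_LOCATION=us-east4
export CLUSTER_NAME=rag-cluster

if require_vars NAMESPACE CLUSTER_LOCATION CLUSTER_NAME; then
  echo "Ready to run: gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}"
fi
```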
- Connect and log in to JupyterHub:
  - If IAP is disabled (jupyter_add_auth = false):
    - Port forward to the JupyterHub service: kubectl port-forward service/proxy-public -n ${NAMESPACE} 8081:80 &
    - Go to localhost:8081 in a browser.
    - Log in with these credentials:
      - username: admin
      - password: run terraform output jupyterhub_password to fetch the password value
  - If IAP is enabled (jupyter_add_auth = true):
    - Fetch the domain: terraform output jupyterhub_uri
    - If you used a custom domain, ensure you configured your DNS as described in the "Configure authenticated access via IAP" section.
    - Verify the domain status is Active: kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Note: This can take up to 20 minutes to propagate.
    - Once the domain status is Active, go to the domain in a browser and log in with your Google credentials.
    - To add additional users to your JupyterHub application, go to Google Cloud Platform IAP, select the rag/proxy-public service and add principals with the role IAP-secured Web App User.
- Load the notebook:
  - Once logged in to JupyterHub, choose the CPU preset with Default storage.
  - Click [File] -> [Open From URL] and paste: https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/applications/rag/example_notebooks/rag-kaggle-ray-sql-interactive.ipynb
- Configure Kaggle:
  - Create a Kaggle account.
  - Generate an API token. See further instructions. This token is used in the notebook to access the Kaggle Netflix shows dataset.
  - Replace the variables in the first cell of the notebook with your Kaggle credentials (found in the kaggle.json file created while generating the API token):
    - KAGGLE_USERNAME
    - KAGGLE_KEY
- Generate vector embeddings: Run all the cells in the notebook to generate vector embeddings for the Netflix shows dataset (https://www.kaggle.com/datasets/shivamb/netflix-shows) via a Ray job and store them in the pgvector CloudSQL instance.
  - When the last cell succeeds, the vector embeddings have been generated and you can launch the frontend chat interface. Note that the Ray job can take up to 10 minutes to finish.
  - Ray may take several minutes to create the runtime environment. During this time, the job will appear to be missing (e.g. Status message: PENDING).
  - Connect to the Ray dashboard to check the job status or logs:
    - If IAP is disabled (ray_dashboard_add_auth = false):
      - Port forward to the Ray dashboard service: kubectl port-forward -n ${NAMESPACE} service/ray-cluster-kuberay-head-svc 8265:8265
      - Go to localhost:8265 in a browser.
    - If IAP is enabled (ray_dashboard_add_auth = true):
      - Fetch the domain: terraform output ray-dashboard-managed-cert
      - If you used a custom domain, ensure you configured your DNS as described in the "Configure authenticated access via IAP" section.
      - Verify the domain status is Active: kubectl get managedcertificates ray-dashboard-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
      - Note: This can take up to 20 minutes to propagate.
      - Once the domain status is Active, go to the domain in a browser and log in with your Google credentials.
      - To add additional users to your Ray dashboard, go to Google Cloud Platform IAP, select the rag/ray-cluster-kuberay-head-svc service and add principals with the role IAP-secured Web App User.
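The Kaggle step above asks you to copy two values out of kaggle.json into the notebook's first cell. A minimal sketch for printing them follows. The helper name kaggle_field is hypothetical, the ~/.kaggle/kaggle.json path is an assumption about where the token was saved, and the sed expression only handles the flat, string-valued layout of that file; it is not a general JSON parser.

```shell
# Print the string value of one top-level field from a flat JSON file such as kaggle.json.
kaggle_field() {
  file="$1"; field="$2"
  sed -n 's/.*"'"$field"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file"
}

# Assumed location of the token file; set KAGGLE_JSON to override it.
KAGGLE_JSON="${KAGGLE_JSON:-$HOME/.kaggle/kaggle.json}"
if [ -f "$KAGGLE_JSON" ]; then
  echo "KAGGLE_USERNAME = $(kaggle_field "$KAGGLE_JSON" username)"
  echo "KAGGLE_KEY      = $(kaggle_field "$KAGGLE_JSON" key)"
fi
```

The key is printed to the terminal, so avoid running this where the output is logged or shared.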
Launch the frontend chat interface
- Connect to the frontend:
  - If IAP is disabled (frontend_add_auth = false):
    - Port forward to the frontend service: kubectl port-forward service/rag-frontend -n ${NAMESPACE} 8080:8080 &
    - Go to localhost:8080 in a browser.
  - If IAP is enabled (frontend_add_auth = true):
    - Fetch the domain: terraform output frontend_uri
    - If you used a custom domain, ensure you configured your DNS as described in the "Configure authenticated access via IAP" section.
    - Verify the domain status is Active: kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - Note: This can take up to 20 minutes to propagate.
    - Once the domain status is Active, go to the domain in a browser and log in with your Google credentials.
    - To add additional users to your frontend application, go to Google Cloud Platform IAP, select the rag/rag-frontend service and add principals with the role IAP-secured Web App User.
- Prompt the LLM:
  - Start chatting! This will fetch context related to your prompt from the vector embeddings in the pgvector CloudSQL instance, augment the original prompt with the context, and query the inference model (mistral-7b) with the augmented prompt.
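Because a managed certificate can take up to 20 minutes to become Active, polling can be more convenient than re-running the status command by hand. The sketch below is one possible approach: wait_for_output and cert_status are hypothetical helper names, and the 40 attempts at 30-second intervals are an arbitrary choice that roughly matches the 20-minute figure.

```shell
# Re-run a command until its output equals the expected value or attempts run out.
# Usage: wait_for_output EXPECTED MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for_output() {
  expected="$1"; max_attempts="$2"; sleep_seconds="$3"
  shift 3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Treat a failing command the same as empty output and keep polling.
    actual="$("$@" 2>/dev/null)" || actual=""
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Status lookup for a managed certificate, using the kubectl command from this tutorial.
cert_status() {
  kubectl get managedcertificates "$1" -n "${NAMESPACE}" \
    --output jsonpath='{.status.domainStatus[0].status}'
}

# Example (not run here): poll every 30 seconds for up to about 20 minutes.
# wait_for_output Active 40 30 cert_status frontend-managed-cert && echo "Certificate is Active"
```

The same call works for jupyter-managed-cert and ray-dashboard-managed-cert by changing the last argument.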
Configure authenticated access via IAP (recommended)
We recommend you configure authenticated access via IAP for your services.
1) Make sure the OAuth Consent Screen is configured for your project. Ensure User type is set to Internal.
2) Make sure the policy for Restrict Load Balancer Creation Based on Load Balancer Types allows EXTERNAL_HTTP_HTTPS.
3) Set the following variables in workloads.tfvars:
   * jupyter_add_auth = true
   * frontend_add_auth = true
   * ray_dashboard_add_auth = true
4) Allowlist principals for your services via jupyter_members_allowlist, frontend_members_allowlist and ray_dashboard_members_allowlist.
5) Configure custom domain names via jupyter_domain, frontend_domain and ray_dashboard_domain for your services.
6) Configure DNS records for your custom domains:
   - Register a domain on Google Cloud Domains or use a domain registrar of your choice.
   - Set up your DNS service to point to the public IP:
     * Run terraform output frontend_ip_address to get the public IP address of the frontend, and add an A record in your DNS configuration to point to the public IP address.
     * Run terraform output jupyterhub_ip_address to get the public IP address of JupyterHub, and add an A record in your DNS configuration to point to the public IP address.
     * Run terraform output ray_dashboard_ip_address to get the public IP address of the Ray dashboard, and add an A record in your DNS configuration to point to the public IP address.
   - Add an A record: If the DNS service of your domain is managed by a Google Cloud DNS managed zone, there are two options to add the A record:
     1. Go to https://console.cloud.google.com/net-services/dns/zones, select the zone and click ADD STANDARD, fill in your domain name and public IP address.
     2. Run gcloud dns record-sets create <domain address>. --zone=<zone name> --type="A" --ttl=<ttl in seconds> --rrdatas="<public ip address>"
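The gcloud command in the last option needs the domain written with a trailing dot, which is easy to miss. The sketch below assembles the command as text so you can review it before running it. The helper name dns_a_record_cmd and the example domain, zone, TTL and IP address are all placeholders, not values from the repository.

```shell
# Print the gcloud command that adds an A record for DOMAIN in ZONE.
# Usage: dns_a_record_cmd DOMAIN ZONE TTL_SECONDS IP_ADDRESS
dns_a_record_cmd() {
  domain="$1"; zone="$2"; ttl="$3"; ip="$4"
  # The record name must be fully qualified, so add the trailing dot if absent.
  case "$domain" in
    *.) fqdn="$domain" ;;
    *)  fqdn="$domain." ;;
  esac
  printf 'gcloud dns record-sets create %s --zone=%s --type=A --ttl=%s --rrdatas=%s\n' \
    "$fqdn" "$zone" "$ttl" "$ip"
}

# Placeholder values; substitute your own domain, zone and the IP from terraform output.
dns_a_record_cmd rag.example.com my-zone 300 203.0.113.10
```

Since the function only prints, nothing changes in Cloud DNS until you run the printed command yourself.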
Cleanup
- Run terraform destroy --var-file="workloads.tfvars"
  - Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
Troubleshooting
Set your namespace, cluster name and location (from workloads.tfvars):
export NAMESPACE=rag
export CLUSTER_LOCATION=us-central1
export CLUSTER_NAME=rag-cluster
Connect to the GKE cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${CLUSTER_LOCATION}
- Troubleshoot Ray job failures:
  - If the Ray actors fail to be scheduled, it could be due to a stockout or quota issue.
    - Run kubectl get pods -n ${NAMESPACE} -l app.kubernetes.io/name=kuberay. There should be a Ray head and Ray worker pod in Running state. If your Ray pods aren't running, it's likely due to quota or stockout issues. Check that your project and selected cluster_location have L4 GPU capacity.
  - Often, retrying the Ray job submission (the last cell of the notebook) helps.
  - The Ray job may take 15-20 minutes to run the first time due to environment setup.
- Troubleshoot IAP login issues:
  - Verify the cert is Active:
    - For JupyterHub: kubectl get managedcertificates jupyter-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
    - For the frontend: kubectl get managedcertificates frontend-managed-cert -n ${NAMESPACE} --output jsonpath='{.status.domainStatus[0].status}'
  - Verify users are allowlisted for JupyterHub or frontend services:
    - JupyterHub: Go to Google Cloud Platform IAP, select the rag/proxy-public service and check if the user has role IAP-secured Web App User.
    - Frontend: Go to Google Cloud Platform IAP, select the rag/rag-frontend service and check if the user has role IAP-secured Web App User.
  - Org error:
    - The OAuth Consent Screen has User type set to Internal by default, which means principals external to the org your project is in cannot log in. To add external principals, change User type to External.
- Troubleshoot terraform apply failures:
  - Inference server (mistral) fails to deploy:
    - This usually indicates a stockout/quota issue. Verify your project and chosen cluster_location have L4 capacity.
  - GCS bucket already exists:
    - GCS bucket names have to be globally unique; pick a different name with a random suffix.
  - Cloud SQL instance already exists:
    - Ensure the cloudsql_instance name doesn't already exist in your project.
  - GMP operator webhook connection refused:
    - This is a rare, transient error. Run terraform apply again to resume deployment.
- Troubleshoot terraform destroy failures:
  - Network deletion issue: terraform destroy fails to delete the network due to a known issue in the GCP provider. For now, the workaround is to manually delete it.
- Troubleshoot error: Repo model mistralai/Mistral-7B-Instruct-v0.1 is gated. You must be authenticated to access it. for the pod of deployment mistral-7b-instruct.
  The error occurs because the RAG deployment uses Mistral-7B-instruct, which is now a gated model on Hugging Face. Deployments fail because they require a Hugging Face authentication token, which is not part of the current workflow.
  While we work on a long-term fix, use the following workaround:
  - Use the guide as a reference to create an access token.
  - Go to the model card in Hugging Face and click "Agree and access repository".
  - Create a secret called hf-secret with the Hugging Face credential in the namespace where your mistral-7b-instruct deployment is running.
  - Add the following entry to env within the deployment mistral-7b-instruct via kubectl edit:
    - name: HUGGING_FACE_HUB_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-secret
          key: hf_api_token
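The env entry above reads the key hf_api_token from a Secret named hf-secret. A minimal sketch of creating a matching Secret with kubectl create secret generic follows. The function name create_hf_secret and the HF_TOKEN and KUBECTL variables are assumptions introduced for this example; only the Secret name and key come from the entry above.

```shell
# Create the hf-secret Secret with the key hf_api_token, matching the env entry above.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
create_hf_secret() {
  "${KUBECTL:-kubectl}" create secret generic hf-secret \
    --from-literal=hf_api_token="${HF_TOKEN:?set HF_TOKEN to your Hugging Face access token}" \
    -n "${NAMESPACE:?set NAMESPACE to the deployment namespace}"
}

# Example (not run here), after exporting HF_TOKEN and NAMESPACE:
# create_hf_secret
```

Passing the token with --from-literal leaves it in your shell environment, so consider unsetting HF_TOKEN once the Secret exists.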