
KServe on GKE Autopilot

KServe is a highly scalable, standards-based platform for model inference on Kubernetes. Installing KServe on GKE Autopilot can be challenging due to the security policies enforced by Autopilot. This tutorial will guide you step by step through the process of installing KServe in a GKE Autopilot cluster.

Additionally, this tutorial includes an example of serving Gemma 2 with vLLM in KServe, demonstrating how to use GPU resources in KServe on Google Kubernetes Engine (GKE).

Before you begin

  1. Ensure you have a Google Cloud project with billing enabled and the GKE API enabled.

  2. Ensure you have the following tools installed on your workstation (see the commands below for one way to set them up):

     - gcloud CLI
     - kubectl
     - helm
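
If you still need to enable the GKE API or install kubectl, the commands below are one way to do it, assuming your gcloud installation supports components (helm can be installed by following the instructions on helm.sh):

# Enable the GKE API for your project
gcloud services enable container.googleapis.com

# Install kubectl as a gcloud component
gcloud components install kubectl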

Set up your GKE Cluster

  1. Set the default environment variables:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export CLUSTER_NAME=kserve-demo
  2. Create a GKE Autopilot cluster:
gcloud container clusters create-auto ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID \
    --workload-policies=allow-net-admin

# Get credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} \
--region ${REGION} \
--project ${PROJECT_ID}
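
As a quick sanity check, you can confirm that kubectl can reach the new cluster with any simple read-only command, for example:

# Verify connectivity to the cluster
kubectl cluster-info
kubectl get namespaces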

If you're using an existing cluster, ensure it is updated to allow net admin permissions. This is necessary for the installation of Istio later on:

gcloud container clusters update ${CLUSTER_NAME} \
--region=${REGION} \
--project=$PROJECT_ID \
--workload-policies=allow-net-admin 

Install KServe

KServe relies on Knative, which in turn requires a networking layer. In this tutorial, we use Istio, the networking layer most commonly used with Knative.

  1. Install Knative
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-core.yaml

Note: You will see warnings that Autopilot mutated the CRDs during this tutorial. These warnings are safe to ignore.
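
Before moving on, it can help to wait until the Knative core deployments are available; one way to do this is with kubectl wait:

# Wait for the Knative Serving deployments (activator, autoscaler, controller, webhook) to become available
kubectl wait --for=condition=Available deployment --all -n knative-serving --timeout=300s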

  2. Install Istio
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
kubectl create namespace istio-system
helm install istio-base istio/base -n istio-system --set defaultRevision=default
helm install istiod istio/istiod -n istio-system --wait
helm install istio-ingressgateway istio/gateway -n istio-system

# Verify the installation
kubectl get deployments -n istio-system

# Example Output
NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
istio-ingressgateway   1/1     1            1           17h
istiod                 1/1     1            1           20h
  3. Install Knative-Istio
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.15.1/net-istio.yaml

# Verify the installation
kubectl get pods -n knative-serving

# Example Output
NAME                                    READY   STATUS    RESTARTS      AGE
activator-749cf94f87-b7p9n              1/1     Running   0             17m
autoscaler-5c764b5f7d-m8zvk             1/1     Running   1 (14m ago)   17m
controller-5649f5bbb7-wvlmk             1/1     Running   4 (13m ago)   17m
net-istio-controller-7f8dfbddb7-d8cmq   1/1     Running   0             18s
net-istio-webhook-54ffc96585-cpgfl      2/2     Running   0             18s
webhook-64c67b4fc-smdtl                 1/1     Running   3 (13m ago)   17m
  4. Install DNS. In this tutorial we use Magic DNS; to configure a real DNS instead, follow the Configure DNS steps in the Knative documentation.
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.15.1/serving-default-domain.yaml
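
Once the default-domain job has completed, you can inspect the config-domain ConfigMap; it should contain an sslip.io entry based on your ingress gateway's external IP:

# Inspect the domain configured by Magic DNS
kubectl get configmap config-domain -n knative-serving -o yaml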
  5. Install cert-manager, which is required to provision webhook certs for a production-grade installation.
helm repo add jetstack https://charts.jetstack.io && helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.3 --set crds.enabled=true --set global.leaderElection.namespace=cert-manager
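
You can verify the installation by checking that the cert-manager pods are running:

# Verify the cert-manager installation
kubectl get pods -n cert-manager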
  6. Install KServe and the KServe cluster serving runtimes
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve.yaml

# Wait until kserve-controller-manager is ready
kubectl rollout status deployment kserve-controller-manager -n kserve

# Install cluster runtimes
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0-rc0/kserve-cluster-resources.yaml

# View these runtimes
kubectl get ClusterServingRuntimes -n kserve
  7. To request accelerators (GPUs) for your Google Kubernetes Engine (GKE) Autopilot workloads, a nodeSelector is used in the manifest. Therefore, we enable the nodeSelector and tolerations features in Knative, which are disabled by default:
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled", "kubernetes.podspec-tolerations":"enabled"}}'

# Restart the Knative webhook so it picks up the new config, for example:
kubectl get pods -n knative-serving
# Find the webhook pod and delete it so it restarts with the updated config.
kubectl delete pod webhook-64c67b4fc-nmzwt -n knative-serving
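
Alternatively, instead of deleting the pod by name, you can restart the webhook deployment, which achieves the same result:

# Restart the Knative webhook deployment so it consumes the updated config-features ConfigMap
kubectl rollout restart deployment webhook -n knative-serving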

After successfully installing KServe, you can explore various examples such as a first inference service, canary rollout, inference batcher, and autoscaling. In the next step, we'll demonstrate how to deploy Gemma 2 using vLLM in KServe on GKE Autopilot.

Deploy Gemma 2 served with vLLM

  1. Generate a Hugging Face access token by following the steps in the Hugging Face documentation. Specify a Name of your choice and a Role of at least Read.

  2. Make sure you have accepted the terms of use for Gemma 2 on Hugging Face.

  3. Create a Kubernetes Secret containing the Hugging Face token:

kubectl create namespace kserve-test

# Specify your Hugging Face token.
export HF_TOKEN=XXX

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
    name: hf-secret
    namespace: kserve-test
type: Opaque
stringData:
    hf_api_token: ${HF_TOKEN}
EOF
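
You can confirm that the Secret was created (the token value itself stays base64-encoded inside the Secret):

# Verify that the secret exists in the kserve-test namespace
kubectl get secret hf-secret -n kserve-test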

  4. Create the inference service
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-gemma2
  namespace: kserve-test
spec:
  predictor:
    nodeSelector:
      cloud.google.com/gke-accelerator: nvidia-l4
      cloud.google.com/gke-accelerator-count: "1"
    model:
      modelFormat:
        name: huggingface
      args:
        - --enable_docs_url=True
        - --model_name=gemma2
        - --model_id=google/gemma-2-2b
      env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-secret
            key: hf_api_token
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF

Wait for the service to be ready:

kubectl get inferenceservice huggingface-gemma2 -n kserve-test
kubectl get pods -n kserve-test

# Replace pod_name with the correct pod name.
kubectl events --for pod/POD_NAME -n kserve-test --watch
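
Note that on Autopilot, GPU nodes are provisioned on demand, so the first rollout can take a while. One way to block until the service reports Ready (assuming your kubectl version supports waiting on custom resources) is:

# Wait until the InferenceService reports the Ready condition (GPU node provisioning may take several minutes)
kubectl wait --for=condition=Ready inferenceservice/huggingface-gemma2 -n kserve-test --timeout=30m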

Test the Inference Service

  1. Find the URL returned by kubectl get inferenceservice:
URL=$(kubectl get inferenceservice huggingface-gemma2 -n kserve-test -o jsonpath='{.status.url}')

# URL should look like this:
http://huggingface-gemma2.kserve-test.34.121.87.225.sslip.io
  2. Open the Swagger UI at $URL/docs

  3. Try the OpenAI chat API with the example input below. Click Execute to see the response.

{
    "model": "gemma2",
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant that speaks like Shakespeare."
        },
        {
            "role": "user",
            "content": "Write a poem about colors"
        }
    ],
    "max_tokens": 30,
    "stream": false
}
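
If you prefer the command line over the Swagger UI, you can send the same request with curl. The path below, /openai/v1/chat/completions, is the OpenAI-compatible route exposed by recent KServe Hugging Face runtimes; check the Swagger UI if your version uses a different path:

# Send the chat completion request directly to the InferenceService
curl -s ${URL}/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma2",
    "messages": [
      {"role": "system", "content": "You are an assistant that speaks like Shakespeare."},
      {"role": "user", "content": "Write a poem about colors"}
    ],
    "max_tokens": 30,
    "stream": false
  }'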

Clean up

Delete the GKE cluster.

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=$REGION \
    --project=$PROJECT_ID