NVIDIA NIM on GKE
Before you begin
- Get access to NVIDIA NIMs
[!IMPORTANT] Before you proceed, ensure you have an NVIDIA AI Enterprise (NVAIE) license to access the NIMs. To get started, go to build.nvidia.com and provide your company email address.
- In the Google Cloud console, on the project selector page, select or create a project with billing enabled
- Ensure you have the following tools installed on your workstation:
- gcloud CLI
- kubectl
- git
- jq
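To confirm the tools are on your PATH before continuing, a quick sanity check might look like this (version output will vary):
gcloud version
kubectl version --client
git --version
jq --version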
- Enable the required APIs:
gcloud services enable \
container.googleapis.com \
file.googleapis.com
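If you want to verify that both APIs are now active, one option is to list the enabled services and filter for them:
gcloud services list --enabled | grep -E 'container|file'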
Set up your GKE cluster
- Choose your region and set your project and machine variables:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export ZONE=${REGION?}-b
export MACH=a2-highgpu-1g
export GPU_TYPE=nvidia-tesla-a100
export GPU_COUNT=1
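Before creating the GPU node pool later on, it can be worth confirming that the chosen accelerator is actually offered in your zone; a possible check using the standard gcloud filter syntax:
gcloud compute accelerator-types list --filter="zone:${ZONE?} name:${GPU_TYPE?}"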
- Create a GKE cluster:
gcloud container clusters create nim-demo --location ${REGION?} \
--workload-pool ${PROJECT_ID?}.svc.id.goog \
--enable-image-streaming \
--enable-ip-alias \
--node-locations ${ZONE?} \
--addons=GcpFilestoreCsiDriver \
--machine-type n2d-standard-4 \
--enable-autoscaling \
--num-nodes 1 --min-nodes 1 --max-nodes 5 \
--ephemeral-storage-local-ssd=count=2
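Once the cluster is up, point kubectl at it so the later steps target the right cluster:
gcloud container clusters get-credentials nim-demo --location ${REGION?}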
- Create a node pool:
gcloud container node-pools create ${MACH?}-node-pool --cluster nim-demo \
--accelerator type=${GPU_TYPE?},count=${GPU_COUNT?},gpu-driver-version=latest \
--machine-type ${MACH?} \
--ephemeral-storage-local-ssd=count=${GPU_COUNT?} \
--enable-autoscaling --enable-image-streaming \
--num-nodes=1 --min-nodes=1 --max-nodes=3 \
--node-locations ${ZONE?} \
--region ${REGION?} \
--spot
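GPU nodes can take a few minutes to provision and become Ready. One way to watch for them, using the accelerator label that GKE applies to GPU nodes:
kubectl get nodes -l cloud.google.com/gke-accelerator=${GPU_TYPE?} -w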
Set up access to NVIDIA NIMs and prepare the environment
- Get your NGC API key from NGC and export it:
export NGC_CLI_API_KEY="<YOUR_API_KEY>"
[!NOTE] If you have not set up NGC, see NGC Setup to get your access key and begin using NGC.
- As part of the NGC setup, configure the NGC CLI:
ngc config set
- Ensure you have access to the repository by listing the models:
ngc registry model list
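The full listing can be long; if you are just checking for a specific model family, piping through grep is a quick filter (the model name here is only an example):
ngc registry model list | grep -i llama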
- Create a Kubernetes namespace:
kubectl create namespace nim
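Depending on your chart configuration, the cluster may also need an image pull secret for nvcr.io in this namespace. The chart used below is typically able to create the needed secrets when model.ngcAPIKey is set; if your configuration does not, a standard docker-registry secret looks like this (registry-secret is an arbitrary name):
kubectl create secret docker-registry registry-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_CLI_API_KEY \
--namespace nim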
Deploy a PVC to persist the model
- Create a PVC to persist the model weights (recommended for deployments with more than one replica). Save the following YAML as pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store-pvc
  namespace: nim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: standard-rwx
- Apply the PVC:
kubectl apply -f pvc.yaml
[!NOTE] This PVC will dynamically provision a PV with the necessary storage to persist model weights across replicas of your pods.
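To check the claim's status (depending on the storage class's volume binding mode, it may report Pending until a pod first mounts it):
kubectl get pvc model-store-pvc --namespace nim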
Deploy the NIM with the generated engine using a Helm chart
- Clone the nim-deploy repository:
git clone https://github.com/NVIDIA/nim-deploy.git
cd nim-deploy/helm
- Deploy the chart with a minimal configuration:
helm --namespace nim install demo-nim nim-llm/ \
--set model.ngcAPIKey=$NGC_CLI_API_KEY \
--set persistence.enabled=true \
--set persistence.existingClaim=model-store-pvc
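The first startup can take a while because the model weights are downloaded into the PVC. You can watch progress with something like the following (substitute the actual pod name from the first command):
kubectl get pods --namespace nim -w
kubectl logs --namespace nim <pod-name> -f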
Test the NIM
- Expose the service locally with a port-forward:
kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000
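Note that port-forward runs in the foreground, so either leave it running in a separate terminal or background it while you test; one simple pattern:
kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000 &
PF_PID=$!
# ... run the test requests below ...
kill $PF_PID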
- Send a test prompt (A100):
curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "messages": [
    {
      "content": "You are a polite and respectful poet.",
      "role": "system"
    },
    {
      "content": "Write a limerick about the wonders of GPUs and Kubernetes?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "max_tokens": 256,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "frequency_penalty": 0.0
}' | jq '.choices[0].message.content' -
- Browse the API by navigating to http://localhost:8000/docs
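As a final check, the NIM exposes an OpenAI-compatible model listing endpoint; the id returned should match the model name used in the request above:
curl -s http://localhost:8000/v1/models | jq '.data[].id'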