NVIDIA NIM on GKE

Before you begin

  1. Get access to NVIDIA NIMs

    [!IMPORTANT] Before you proceed further, ensure you have an NVIDIA AI Enterprise (NVAIE) license to access the NIMs. To get started, go to build.nvidia.com and provide your company email address.

  2. In the Google Cloud console, on the project selector page, select or create a new project with billing enabled.

  3. Ensure you have the following tools installed on your workstation (a quick check is shown after this list):

     - gcloud CLI
     - kubectl
     - git
     - jq
     - ngc

  4. Enable the required APIs:

gcloud services enable \
  container.googleapis.com \
  file.googleapis.com
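
A quick way to confirm the tools above are on your PATH (a minimal sketch; extend the list to match your setup):

# report any of the required CLI tools that are not installed
for tool in gcloud kubectl git jq ngc; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done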

Set up your GKE Cluster

  1. Choose your region and set your project and machine variables:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export ZONE=${REGION?}-b
export MACH=a2-highgpu-1g
export GPU_TYPE=nvidia-tesla-a100
export GPU_COUNT=1
  2. Create a GKE cluster:
gcloud container clusters create nim-demo --location ${REGION?} \
  --workload-pool=${PROJECT_ID?}.svc.id.goog \
  --enable-image-streaming \
  --enable-ip-alias \
  --node-locations ${ZONE?} \
  --addons=GcpFilestoreCsiDriver \
  --machine-type n2d-standard-4 \
  --enable-autoscaling \
  --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --ephemeral-storage-local-ssd=count=2
  3. Create a node pool:
gcloud container node-pools create ${MACH?}-node-pool --cluster nim-demo \
  --accelerator type=${GPU_TYPE?},count=${GPU_COUNT?},gpu-driver-version=latest \
  --machine-type ${MACH?} \
  --ephemeral-storage-local-ssd=count=${GPU_COUNT?} \
  --enable-autoscaling --enable-image-streaming \
  --num-nodes=1 --min-nodes=1 --max-nodes=3 \
  --node-locations ${ZONE?} \
  --region ${REGION?} \
  --spot
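
Before running the kubectl commands that follow, point kubectl at the new cluster and confirm the nodes are up:

gcloud container clusters get-credentials nim-demo --location ${REGION?}
kubectl get nodes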

Set up access to NVIDIA NIMs and prepare the environment

  1. Get your NGC_API_KEY from NGC:
export NGC_CLI_API_KEY="<YOUR_API_KEY>"

[!NOTE] If you have not set up NGC, see NGC Setup to get your access key and begin using NGC.

  2. As part of the NGC setup, set your configs:
ngc config set
  3. Ensure you have access to the repository by listing the models:
ngc registry model list
  4. Create a Kubernetes namespace:
kubectl create namespace nim
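
Depending on how you configure the chart later, the cluster may also need registry credentials to pull NIM images from nvcr.io. A minimal sketch (the secret name registry-secret is an assumption; reference it from your Helm values if your chart expects an image pull secret):

# $oauthtoken is the literal username NGC expects for API-key auth
kubectl create secret -n nim docker-registry registry-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=$NGC_CLI_API_KEY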

Deploy a PVC to persist the model

  1. Create a PVC to persist the model weights (recommended for deployments with more than one replica). Save the following YAML as pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store-pvc
  namespace: nim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: standard-rwx
  2. Apply the PVC:
kubectl apply -f pvc.yaml

[!NOTE] This PVC will dynamically provision a PV with the necessary storage to persist model weights across replicas of your pods.
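
You can confirm the claim with the command below; with some storage classes the volume is only provisioned once the first pod mounts it, so a Pending status can be normal at this stage:

kubectl get pvc model-store-pvc -n nim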

Deploy the NIM with the generated engine using a Helm chart

  1. Clone the nim-deploy repository:
git clone https://github.com/NVIDIA/nim-deploy.git
cd nim-deploy/helm
  2. Deploy the chart with a minimal configuration:
helm --namespace nim install demo-nim nim-llm/ \
  --set model.ngcAPIKey=$NGC_CLI_API_KEY \
  --set persistence.enabled=true \
  --set persistence.existingClaim=model-store-pvc
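
The first start can take a while, since the model weights are downloaded into the PVC. You can watch progress with the commands below (the label selector is an assumption; check kubectl get pods -n nim --show-labels for the labels your chart version applies):

kubectl get pods -n nim --watch
kubectl logs -f -n nim -l app.kubernetes.io/name=nim-llm   # label is an assumption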

Test the NIM

  1. Expose the service:
kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000
  2. Send a test prompt (A100):
curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful poet.",
      "role": "system"
    },
    {
      "content": "Write a limerick about the wonders of GPUs and Kubernetes?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "max_tokens": 256,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "frequency_penalty": 0.0
}' | jq '.choices[0].message.content' -
  3. Browse the API by navigating to http://localhost:8000/docs.
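
Because the NIM exposes an OpenAI-compatible API, you can also list the served models as a quick smoke test (assumes the port-forward above is still active):

curl -s http://localhost:8000/v1/models | jq '.data[].id'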