NVIDIA NIM on GKE
Before you begin
- Get access to NVIDIA NIMs
[!IMPORTANT] Before you proceed, ensure you have an NVIDIA AI Enterprise (NVAIE) license to access the NIMs. To get started, go to build.nvidia.com and provide your company email address.
- In the Google Cloud console, on the project selector page, select or create a project with billing enabled
- Ensure you have the following tools installed on your workstation:
- gcloud CLI
- kubectl
- git
- jq
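To confirm the tools are on your PATH before continuing, a quick sanity check might look like this (version output will vary):
gcloud version
kubectl version --client
git --version
jq --version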
- Enable the required APIs:
gcloud services enable \
container.googleapis.com \
file.googleapis.com
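If you want to verify that both APIs are now active, one option is to list the enabled services and filter for them:
gcloud services list --enabled | grep -E 'container|file'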
Set up your GKE cluster
- Choose your region and set your project and machine variables:
export PROJECT_ID=$(gcloud config get project)
export REGION=us-central1
export ZONE=${REGION?}-b
export MACH=a2-highgpu-1g
export GPU_TYPE=nvidia-tesla-a100
export GPU_COUNT=1
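Before creating the GPU node pool later on, it can be worth confirming that the chosen accelerator is actually offered in your zone; a possible check using the standard gcloud filter syntax:
gcloud compute accelerator-types list --filter="zone:${ZONE?} name:${GPU_TYPE?}"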
- Create a GKE cluster:
gcloud container clusters create nim-demo --location ${REGION?} \
--workload-pool ${PROJECT_ID?}.svc.id.goog \
--enable-image-streaming \
--enable-ip-alias \
--node-locations ${ZONE?} \
--addons=GcpFilestoreCsiDriver \
--machine-type n2d-standard-4 \
--enable-autoscaling \
--num-nodes 1 --min-nodes 1 --max-nodes 5 \
--ephemeral-storage-local-ssd=count=2
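Once the cluster is up, point kubectl at it so the later steps target the right cluster:
gcloud container clusters get-credentials nim-demo --location ${REGION?}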
- Create a node pool:
gcloud container node-pools create ${MACH?}-node-pool --cluster nim-demo \
--accelerator type=${GPU_TYPE?},count=${GPU_COUNT?},gpu-driver-version=latest \
--machine-type ${MACH?} \
--ephemeral-storage-local-ssd=count=${GPU_COUNT?} \
--enable-autoscaling --enable-image-streaming \
--num-nodes=1 --min-nodes=1 --max-nodes=3 \
--node-locations ${ZONE?} \
--region ${REGION?} \
--spot
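GPU nodes can take a few minutes to provision and become Ready. One way to watch for them, using the accelerator label that GKE applies to GPU nodes:
kubectl get nodes -l cloud.google.com/gke-accelerator=${GPU_TYPE?} -w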
Set up access to NVIDIA NIMs and prepare the environment
- Get your NGC API key from NGC and export it:
export NGC_CLI_API_KEY="<YOUR_API_KEY>"
[!NOTE] If you have not set up NGC, see NGC Setup to get your access key and begin using NGC.
- As part of the NGC setup, configure the NGC CLI:
ngc config set
- Ensure you have access to the repository by listing the models:
ngc registry model list
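The full listing can be long; if you are just checking for a specific model family, piping through grep is a quick filter (the model name here is only an example):
ngc registry model list | grep -i llama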
- Create a Kubernetes namespace:
kubectl create namespace nim
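Depending on your chart configuration, the cluster may also need an image pull secret for nvcr.io in this namespace. The chart used below is typically able to create the needed secrets when model.ngcAPIKey is set; if your configuration does not, a standard docker-registry secret looks like this (registry-secret is an arbitrary name):
kubectl create secret docker-registry registry-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_CLI_API_KEY \
--namespace nim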
Deploy a PVC to persist the model
- Create a PVC to persist the model weights (recommended for deployments with more than one replica). Save the following YAML as pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-store-pvc
  namespace: nim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: standard-rwx
- Apply the PVC:
kubectl apply -f pvc.yaml
[!NOTE] This PVC will dynamically provision a PV with the necessary storage to persist model weights across replicas of your pods.
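To check the claim's status (depending on the storage class's volume binding mode, it may report Pending until a pod first mounts it):
kubectl get pvc model-store-pvc --namespace nim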
Deploy the NIM with the generated engine using a Helm chart
- Clone the nim-deploy repository:
git clone https://github.com/NVIDIA/nim-deploy.git
cd nim-deploy/helm
- Deploy the chart with a minimal configuration:
helm --namespace nim install demo-nim nim-llm/ \
--set model.ngcAPIKey=$NGC_CLI_API_KEY \
--set persistence.enabled=true \
--set persistence.existingClaim=model-store-pvc
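The first startup can take a while because the model weights are downloaded into the PVC. You can watch progress with something like the following (substitute the actual pod name from the first command):
kubectl get pods --namespace nim -w
kubectl logs --namespace nim <pod-name> -f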
Test the NIM
- Expose the service locally with a port-forward:
kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000
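Note that port-forward runs in the foreground, so either leave it running in a separate terminal or background it while you test; one simple pattern:
kubectl port-forward --namespace nim services/demo-nim-nim-llm 8000 &
PF_PID=$!
# ... run the test requests below ...
kill $PF_PID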
- Send a test prompt (A100):
curl -X 'POST' \
'http://localhost:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "messages": [
    {
      "content": "You are a polite and respectful poet.",
      "role": "system"
    },
    {
      "content": "Write a limerick about the wonders of GPUs and Kubernetes?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "max_tokens": 256,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "frequency_penalty": 0.0
}' | jq '.choices[0].message.content' -
- Browse the API by navigating to http://localhost:8000/docs
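As a final check, the NIM exposes an OpenAI-compatible model listing endpoint; the id returned should match the model name used in the request above:
curl -s http://localhost:8000/v1/models | jq '.data[].id'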