Running distributed ML training workloads on GKE using Indexed Jobs
In this guide you will run a distributed ML training workload on GKE using an Indexed Job.
Specifically, you will train a handwritten digit image classifier on the classic MNIST dataset using PyTorch. The training computation will be distributed across 4 GPU nodes in a GKE cluster.
Prerequisites
- A Google Cloud account set up.
- The gcloud command line tool installed and configured to use your GCP project.
- The kubectl command line tool installed.
- docker installed.
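Before you start, you can quickly sanity-check these prerequisites with a few standard commands, for example:
gcloud config get-value project
kubectl version --client
docker --version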
1. Create a standard GKE cluster
Run the command:
gcloud container clusters create demo --zone us-central1-c
You should see output indicating the cluster is being created (this can take around 10 minutes).
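Once the cluster is ready, fetch its credentials so that kubectl can talk to it:
gcloud container clusters get-credentials demo --zone us-central1-c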
2. Create a GPU node pool
You can choose any supported GPU type, paired with a machine type that supports it; see the GKE docs for more details. In this example, we are using NVIDIA Tesla T4 GPUs with the N1 machine family.
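If you want to double-check which accelerator types are available in your zone before creating the node pool, you can optionally list them:
gcloud compute accelerator-types list --filter="zone:us-central1-c"
Then create the GPU node pool: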
gcloud container node-pools create gpu-pool \
--accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=LATEST \
--machine-type n1-standard-4 \
--zone us-central1-c --cluster demo \
--node-locations us-central1-c \
--num-nodes 4
Creating this GPU node pool will take a few minutes.
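Once the node pool is ready, you can verify that the 4 GPU nodes have registered with the cluster by selecting on the accelerator label GKE applies to them:
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4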
3. Build and push the Docker image to GCR
Make a local copy of the mnist.py file, which defines a simple convolutional neural network along with the training logic that trains it on the classic MNIST dataset.
Next, make a local copy of the Dockerfile and run the following commands to build the container image and push it to your GCR repository:
export PROJECT_ID=<your GCP project ID>
docker build -t pytorch-mnist-gpu -f Dockerfile .
docker tag pytorch-mnist-gpu gcr.io/$PROJECT_ID/pytorch-mnist-gpu:latest
docker push gcr.io/$PROJECT_ID/pytorch-mnist-gpu:latest
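If you don't already have the Dockerfile handy, a minimal sketch of what it might look like is shown below. This is an assumption rather than the exact file used here; in particular, the base image tag is just one CUDA-enabled PyTorch build and you may prefer a different one.
# Hypothetical Dockerfile sketch - swap the base image for the CUDA-enabled PyTorch build you prefer.
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /workspace
# Copy the training script into the image; torchrun is invoked by the Job manifest, so no ENTRYPOINT is required here.
COPY mnist.py .
Also, if the docker push fails with an authentication error, you may need to configure docker credentials for GCR once:
gcloud auth configure-docker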
4. Define an Indexed Job and Headless Service
In the YAML below, we configure an Indexed Job to run 4 pods, one for each GPU node, and use torchrun to kick off distributed training of the CNN model on the MNIST dataset. The training job will use one T4 GPU on each node in the node pool.
We also define a headless Service which selects the pods owned by this Indexed Job. This triggers the creation of the DNS records the pods need to communicate with each other over the network via hostnames.
Copy the YAML below into a local file mnist.yaml, and be sure to replace <PROJECT_ID> with your GCP project ID in the container image.
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: pytorchworker
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorchworker
spec:
  backoffLimit: 0
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - operator: "Exists"
        key: nvidia.com/gpu
      containers:
      - name: pytorch
        image: gcr.io/<PROJECT_ID>/pytorch-mnist-gpu:latest
        imagePullPolicy: Always
        resources:
          limits:
            # Request the node's GPU so the device plugin exposes the T4 to the container.
            nvidia.com/gpu: 1
        ports:
        - containerPort: 3389
        env:
        - name: MASTER_ADDR
          value: pytorchworker-0.headless-svc
        - name: MASTER_PORT
          value: "3389"
        - name: PYTHONUNBUFFERED
          value: "1"
        - name: LOGLEVEL
          value: "INFO"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command:
        - bash
        - -xc
        - |
          printenv
          torchrun --rdzv_id=123 --nnodes=4 --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$RANK mnist.py --epochs=1 --log-interval=1
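Before creating anything in the cluster, you can optionally check that the manifest parses cleanly:
kubectl apply --dry-run=client -f mnist.yaml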
5. Run the training job
Run the following command to create the Kubernetes resources defined above and start the training job:
kubectl apply -f mnist.yaml
You should see 4 pods created (note the container image is large and may take a few minutes to pull before the container starts running):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
pytorchworker-0-bbsmk 0/1 ContainerCreating 0 15s
pytorchworker-1-92tbl 0/1 ContainerCreating 0 15s
pytorchworker-2-nbrgf 0/1 ContainerCreating 0 15s
pytorchworker-3-rsrdf 0/1 ContainerCreating 0 15s
6. Observe training logs
Once the pods transition from the ContainerCreating status to the Running status, you can observe the training progress by examining the pod logs.
$ kubectl logs -f pytorchworker-1-92tbl
+ torchrun --rdzv_id=123 --nnodes=4 --nproc_per_node=1 --master_addr=pytorchworker-0.headless-svc --master_port=3389 --node_rank=1 mnist.py --epochs=1 --log-interval=1
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 90162259.46it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 33279036.76it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 23474415.33it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 19165521.90it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Train Epoch: 1 [0/60000 (0%)] Loss: 2.297087
Train Epoch: 1 [64/60000 (0%)] Loss: 2.550339
Train Epoch: 1 [128/60000 (1%)] Loss: 2.361300
...
Train Epoch: 1 [14912/60000 (99%)] Loss: 0.051500
Train Epoch: 1 [5616/60000 (100%)] Loss: 0.209231
235it [00:36, 6.51it/s]
Test set: Average loss: 0.0825, Accuracy: 9720/10000 (97%)
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0015289783477783203 seconds
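Once all 4 workers finish successfully, the Job itself is marked complete; you can confirm this with:
kubectl get job pytorchworker
When you are done experimenting, you can clean up by deleting the Job and, if you no longer need it, the cluster:
kubectl delete -f mnist.yaml
gcloud container clusters delete demo --zone us-central1-c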