Pipelines on Google Cloud

Instructions for customizing and using Kubeflow Pipelines on Google Cloud

1 - Connecting to Kubeflow Pipelines on Google Cloud using the SDK

How to connect to different Kubeflow Pipelines installations on Google Cloud using the Kubeflow Pipelines SDK

This guide describes how to connect to your Kubeflow Pipelines cluster on Google Cloud using the Kubeflow Pipelines SDK.

Before you begin

How SDK connects to Kubeflow Pipelines API

Kubeflow Pipelines includes an API service named ml-pipeline-ui, which is deployed in the same Kubernetes namespace in which you deployed Kubeflow Pipelines.

The Kubeflow Pipelines SDK can send REST API requests to this API service, but the SDK needs to know the hostname to connect to the API service.

If the hostname can be accessed without authentication, connecting is simple. For example, you can use kubectl port-forward to access it via localhost:

# The Kubeflow Pipelines API service and the UI are available at
# http://localhost:3000 without an authentication check.
# Change the namespace if you deployed Kubeflow Pipelines in a different
# namespace.
$ kubectl port-forward svc/ml-pipeline-ui 3000:80 --namespace kubeflow

Then initialize the SDK client against the forwarded port:

import kfp
client = kfp.Client(host='http://localhost:3000')

When deploying Kubeflow Pipelines on Google Cloud, a public endpoint for this API service is auto-configured for you, but this public endpoint has security checks to protect your cluster from unauthorized access.

The following sections introduce how to authenticate your SDK requests to connect to Kubeflow Pipelines via the public endpoint.

Connecting to Kubeflow Pipelines standalone or AI Platform Pipelines

Refer to Connecting to AI Platform Pipelines using the Kubeflow Pipelines SDK for both Kubeflow Pipelines standalone and AI Platform Pipelines.

Kubeflow Pipelines standalone deployments also show up in AI Platform Pipelines. They have the name “pipeline” by default, but you can customize the name by overriding the appName parameter in params.env when deploying Kubeflow Pipelines standalone.

Connecting to Kubeflow Pipelines in a full Kubeflow deployment

A full Kubeflow deployment on Google Cloud uses an Identity-Aware Proxy (IAP) to manage access to the public Kubeflow endpoint. The steps below let you connect to Kubeflow Pipelines in a full Kubeflow deployment with authentication through IAP.

  1. Find out your IAP OAuth 2.0 client ID.

    You or your cluster admin followed Set up OAuth for Cloud IAP to deploy your full Kubeflow deployment on Google Cloud. You need the OAuth client ID created in that step.

    You can browse all of your existing OAuth client IDs in the Credentials page of Google Cloud Console.

  2. Create another SDK OAuth Client ID for authenticating Kubeflow Pipelines SDK users. Follow the steps to set up a client ID to authenticate from a desktop app. Make a note of the client ID and client secret. This client ID and secret can be shared among all SDK users, because a separate login step is still required later.

  3. To connect to Kubeflow Pipelines public endpoint, initiate SDK client like the following:

    import kfp
    client = kfp.Client(host='https://<KF_NAME>.endpoints.<PROJECT>.cloud.goog/pipeline',
        client_id='<AAAAAAAAAAAAAAAAAAAAAA>.apps.googleusercontent.com',
        other_client_id='<BBBBBBBBBBBBBBBBBBB>.apps.googleusercontent.com',
        other_client_secret='<CCCCCCCCCCCCCCCCCCCC>')
    
    • Pass your IAP OAuth client ID found in step 1 to client_id argument.
    • Pass your SDK OAuth client ID and secret created in step 2 to other_client_id and other_client_secret arguments.
  4. When you initialize the SDK client for the first time, you will be asked to log in. The Kubeflow Pipelines SDK stores the obtained credentials in $HOME/.config/kfp/credentials.json. You do not need to log in again unless you manually delete the credentials file.

    To use the SDK from cron tasks where you cannot log in manually, you can copy the credentials file `$HOME/.config/kfp/credentials.json` to another machine. However, keep the credentials safe and never expose them to third parties.
    
  5. After login, you can use the client.

    print(client.list_pipelines())
    
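As a sketch of the caching behavior from step 4 (the path comes from this guide; the helper function and the JSON check are illustrative, not the SDK's own logic):

```python
import json
from pathlib import Path

# Path where the Kubeflow Pipelines SDK caches OAuth credentials after the
# first login (see step 4 above).
CREDENTIALS_PATH = Path.home() / ".config" / "kfp" / "credentials.json"

def has_cached_credentials(path: Path = CREDENTIALS_PATH) -> bool:
    """Return True if a readable JSON credentials cache exists at `path`."""
    try:
        json.loads(path.read_text())
        return True
    except (OSError, json.JSONDecodeError):
        return False
```

If this returns True on a machine (for example, a cron host you copied the file to), SDK calls there should not prompt for login again.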

Troubleshooting

  • Error “Failed to authorize with API resource references: there is no user identity header” when using SDK methods.

    Direct access to the API service without authentication works for Kubeflow Pipelines standalone, AI Platform Pipelines, and Kubeflow 1.0 or earlier.

    However, it fails authorization checks for Kubeflow Pipelines with multi-user isolation in the full Kubeflow deployment starting from Kubeflow 1.1. Multi-user isolation requires all API access to authenticate as a user. Refer to Kubeflow Pipelines Multi-user isolation documentation for more details.

2 - Authenticating Pipelines to Google Cloud

Authentication and authorization to Google Cloud in Pipelines

This page describes authentication for Kubeflow Pipelines to Google Cloud. The options listed below have different tradeoffs; choose the one that fits your use case.

  • Configuring your cluster to access Google Cloud using the Compute Engine default service account with the “cloud-platform” scope is easier to set up than the other options. However, this approach grants excessive permissions, so it is not suitable if you need workload permission separation.
  • Workload Identity takes more effort to set up, but allows fine-grained permission control. It is recommended for production use cases.
  • Google service account keys stored as Kubernetes secrets is the legacy approach and is no longer recommended in Google Kubernetes Engine. However, it’s the only option to use Google Cloud APIs when your cluster is an Anthos or on-premises cluster.

Before you begin

There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that authentication support and cluster setup instructions will vary depending on the method you used to install Kubeflow Pipelines.

  • For Kubeflow Pipelines standalone, you can compare and choose from all 3 options.
  • For full Kubeflow starting from Kubeflow 1.1, Workload Identity is the recommended and default option.
  • For AI Platform Pipelines, Compute Engine default service account is the only supported option.

Compute Engine default service account

This is good for trying out Kubeflow Pipelines, because it is easy to set up.

However, it does not support permission separation for workloads in the cluster. Any workload in the cluster will be able to call any Google Cloud APIs in the chosen scope.

Cluster setup to use Compute Engine default service account

By default, your Google Kubernetes Engine nodes use Compute Engine default service account. If you allowed cloud-platform scope when creating the cluster, Kubeflow Pipelines can authenticate to Google Cloud and manage resources in your project without further configuration.

Use one of the following options to create a Google Kubernetes Engine cluster that uses the Compute Engine default service account:

  • If you followed instructions in Setting up AI Platform Pipelines and checked Allow access to the following Cloud APIs, your cluster is already using Compute Engine default service account.
  • In the Google Cloud Console UI, you can enable it under Create a Kubernetes cluster -> default-pool -> Security -> Access Scopes -> Allow full access to all Cloud APIs.
  • Using gcloud CLI, you can enable it with --scopes cloud-platform like the following:
gcloud container clusters create <cluster-name> \
  --scopes cloud-platform

Please refer to gcloud container clusters create command documentation for other available options.

Authoring pipelines to use default service account

Pipelines don’t need any specific changes to authenticate to Google Cloud; they use the default service account transparently.

However, you must update existing pipelines that use the use_gcp_secret KFP SDK operator: remove the use_gcp_secret usage to let your pipeline authenticate to Google Cloud using the default service account.

Securing the cluster with fine-grained Google Cloud permission control

Workload Identity

Workload Identity is the recommended way for your Google Kubernetes Engine applications to consume services provided by Google APIs. You accomplish this by configuring a Kubernetes service account to act as a Google service account. Any Pods running as the Kubernetes service account then use the Google service account to authenticate to cloud services.

Referenced from Workload Identity Documentation. Please read this doc for:

  • A detailed introduction to Workload Identity.
  • Instructions to enable it on your cluster.
  • Whether its limitations affect your adoption.

Terminology

This document distinguishes between Kubernetes service accounts (KSAs) and Google service accounts (GSAs). KSAs are Kubernetes resources, while GSAs are specific to Google Cloud. Other documentation usually refers to both of them as just “service accounts”.

Authoring pipelines to use Workload Identity

Pipelines don’t need any specific changes to authenticate to Google Cloud. With Workload Identity, pipelines run as the Google service account that is bound to the KSA.

However, existing pipelines that use the use_gcp_secret KFP SDK operator must remove that usage to pick up the bound GSA. You can also continue to use use_gcp_secret in a cluster with Workload Identity enabled; use_gcp_secret takes precedence for those workloads.

Cluster setup to use Workload Identity for Full Kubeflow

Starting from Kubeflow 1.1, Kubeflow Pipelines supports multi-user isolation. Therefore, pipeline runs are executed in user namespaces using the default-editor KSA. The default-editor KSA is auto-bound to the GSA specified in the user profile, which defaults to a shared GSA ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com.

If you want to bind the default-editor KSA with a different GSA for a specific namespace, refer to the In-cluster authentication to Google Cloud guide.
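For reference, the member string used in such a roles/iam.workloadIdentityUser binding follows a fixed format. The helper below (ours, for illustration; not part of any SDK) builds it:

```python
def workload_identity_member(project_id: str, namespace: str, ksa: str) -> str:
    """Build the IAM member string that binds a Kubernetes service account
    (KSA) to a Google service account (GSA) via Workload Identity."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"

# Example: the member for the default-editor KSA in a user profile namespace.
member = workload_identity_member("my-project", "my-profile", "default-editor")
```

This member is granted the roles/iam.workloadIdentityUser role on the target GSA.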

Additionally, the Kubeflow Pipelines UI, visualization, and TensorBoard server instances are deployed in your user namespace using the default-editor KSA. Therefore, when visualizing results in the Pipelines UI, these servers fetch artifacts from Google Cloud Storage using the permissions of the same GSA you configured for the namespace.

Cluster setup to use Workload Identity for Pipelines Standalone

1. Create your cluster with Workload Identity enabled
  • In the Google Cloud Console UI, you can enable Workload Identity under Create a Kubernetes cluster -> Security -> Enable Workload Identity.

  • Using gcloud CLI, you can enable it with:

gcloud beta container clusters create <cluster-name> \
  --release-channel regular \
  --workload-pool=<project-id>.svc.id.goog


2. Deploy Kubeflow Pipelines

Deploy via Pipelines Standalone as usual.

3. Bind Workload Identities for KSAs used by Kubeflow Pipelines

The following helper bash scripts bind Workload Identities for KSAs used by Kubeflow Pipelines:

  • gcp-workload-identity-setup.sh helps you create GSAs and bind them to KSAs used by pipelines workloads. This script provides an interactive command line dialog with explanation messages.
  • wi-utils.sh alternatively provides minimal utility bash functions that let you customize your setup. The minimal utilities make it easy to read and use programmatically.

For example, to get a default setup using gcp-workload-identity-setup.sh, you can run:

$ curl -O https://raw.githubusercontent.com/kubeflow/pipelines/master/manifests/kustomize/gcp-workload-identity-setup.sh
$ chmod +x ./gcp-workload-identity-setup.sh
$ ./gcp-workload-identity-setup.sh
# This prints the command's usage example and introduction.
# Then you can run the command with required parameters.
# Command output will tell you which GSAs and Workload Identity bindings have been
# created.
4. Configure IAM permissions of used GSAs

If you used gcp-workload-identity-setup.sh to bind Workload Identities for your cluster, you can simply add the following IAM bindings:

  • Give the GSA <cluster-name>-kfp-system@<project-id>.iam.gserviceaccount.com the Storage Object Viewer role to let the UI load data from Cloud Storage in the same project.
  • Give the GSA <cluster-name>-kfp-user@<project-id>.iam.gserviceaccount.com any permissions your pipelines need. For quick tryouts, you can give it the Project Editor role, which grants broad permissions.

If you configured bindings by yourself, here are Google Cloud permission requirements for KFP KSAs:

  • Pipelines use the pipeline-runner KSA. Configure IAM permissions of the GSA bound to this KSA to allow pipelines to use Google Cloud APIs.
  • The Pipelines UI uses the ml-pipeline-ui KSA, and the Pipelines Visualization Server uses the ml-pipeline-visualizationserver KSA. If you need to view artifacts and visualizations stored in Google Cloud Storage (GCS) from the Pipelines UI, add the Storage Object Viewer role (or the minimal required permission) to their bound GSAs.

Google service account keys stored as Kubernetes secrets

It is recommended to use Workload Identity for easier and more secure management, but you can also choose to use GSA keys.

Authoring pipelines to use GSA keys

Each pipeline step describes a container that is run independently. If you want to grant access for a single step to use one of your service accounts, you can use kfp.gcp.use_gcp_secret(). Examples for how to use this function can be found in the Kubeflow examples repo.

Cluster setup to use use_gcp_secret for Full Kubeflow

Starting from Kubeflow 1.1, the user-gcp-sa secret is no longer deployed for you. We recommend using Workload Identity instead.

For Kubeflow 1.0 or earlier, you don’t need to do anything; the full Kubeflow deployment has already created the user-gcp-sa secret for you.

Cluster setup to use use_gcp_secret for Pipelines Standalone

Pipelines Standalone requires you to manually set up the user-gcp-sa secret used by use_gcp_secret.

Instructions to set up the secret:

  1. First, create a key for your Google service account (refer to the Google Cloud documentation for more information):

    gcloud iam service-accounts keys create application_default_credentials.json \
      --iam-account [SA-NAME]@[PROJECT-ID].iam.gserviceaccount.com
    
  2. Run:

    kubectl create secret -n [your-namespace] generic user-gcp-sa \
      --from-file=user-gcp-sa.json=application_default_credentials.json
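Before creating the secret, you can sanity-check that the downloaded file really is a service account key: GSA key files are JSON documents with a "type" of "service_account" and a "client_email" field. The helper below is ours, for illustration only:

```python
import json

def looks_like_gsa_key(path: str) -> bool:
    """Light sanity check that a file is a Google service account key JSON."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return data.get("type") == "service_account" and "client_email" in data
```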
    

3 - Upgrading

How to upgrade your Kubeflow Pipelines deployment on Google Cloud

Before you begin

There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that upgrade support and instructions will vary depending on the method you used to install Kubeflow Pipelines.

| Installation \ Features | In-place upgrade | Reinstallation on the same cluster | Reinstallation on a different cluster | User customizations across upgrades (via Kustomize) |
| --- | --- | --- | --- | --- |
| Standalone | | ⚠️ Data is deleted by default. | | |
| Standalone (managed storage) | | | | |
| full Kubeflow (>= v1.1) | | | | Needs documentation |
| full Kubeflow (< v1.1) | | | | |
| AI Platform Pipelines | | | | |
| AI Platform Pipelines (managed storage) | | | | |

Notes:

  • When you deploy Kubeflow Pipelines with managed storage on Google Cloud, your pipeline’s metadata and artifacts are stored in Cloud Storage and Cloud SQL. Using managed storage makes it easier to manage, back up, and restore Kubeflow Pipelines data.

Kubeflow Pipelines Standalone

Upgrade Support for Kubeflow Pipelines Standalone is in Beta.

Upgrading Kubeflow Pipelines Standalone introduces how to upgrade in-place.

Full Kubeflow

On Google Cloud, the full Kubeflow deployment follows the package pattern starting from Kubeflow 1.1.

The package pattern enables you to upgrade the full Kubeflow in-place while keeping user customizations — refer to the Upgrade Kubeflow on Google Cloud documentation for instructions.

However, there’s no current support to upgrade from Kubeflow 1.0 or earlier to Kubeflow 1.1 while keeping Kubeflow Pipelines data. This may change in the future, so provide your feedback in kubeflow/pipelines#4346 on GitHub.

AI Platform Pipelines

Upgrade Support for AI Platform Pipelines is in Alpha.

Below are the steps that describe how to upgrade your AI Platform Pipelines instance while keeping existing data:

For instances without managed storage:

  1. Delete your AI Platform Pipelines instance WITHOUT selecting Delete cluster. The persisted artifacts and database data are stored in persistent volumes in the cluster. They are kept by default when you do not delete the cluster.
  2. Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same Google Kubernetes Engine cluster, namespace, and application name. Persisted data will be automatically picked up during reinstallation.

For instances with managed storage:

  1. Delete your AI Platform Pipelines instance.
  2. If you are upgrading from Kubeflow Pipelines 0.5.1, note that the Cloud Storage bucket is a required field starting from version 1.0.0. Previously deployed instances should already be using an existing bucket. Browse your Cloud Storage buckets to find its name and provide it in the next step.
  3. Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same application name and managed storage options as before. You can freely install it in any cluster and namespace (not necessarily the same as before), because persisted artifacts and database data are stored in managed storage (Cloud Storage and Cloud SQL) and will be automatically picked up during reinstallation.

4 - Enabling GPU and TPU

Enable GPU and TPU for Kubeflow Pipelines on Google Kubernetes Engine (GKE)

This page describes how to enable GPU or TPU for a pipeline on Google Kubernetes Engine by using the Pipelines DSL language.

Prerequisites

To enable GPU and TPU on your Kubeflow cluster, follow the instructions on how to customize the Google Kubernetes Engine cluster for Kubeflow before setting up the cluster.

Configure ContainerOp to consume GPUs

After enabling the GPU, the Kubeflow setup script installs a default GPU pool of type nvidia-tesla-k80 with autoscaling enabled. The following code consumes 2 GPUs in a ContainerOp.

import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)

The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      nvidia.com/gpu: "2"

If the cluster has multiple node pools with different GPU types, you can specify the GPU type with the following code.

import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
gpu_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p4')

The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      nvidia.com/gpu: "2"
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-p4
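To make the mapping from DSL calls to Pod spec concrete, here is a small helper (ours, for illustration only) that mirrors the two compiled specs above:

```python
from typing import Optional

def gpu_pod_spec(gpu_limit: int, accelerator: Optional[str] = None) -> dict:
    """Sketch of the Pod spec fields produced by set_gpu_limit() and,
    optionally, add_node_selector_constraint('cloud.google.com/gke-accelerator', ...)."""
    spec = {"container": {"resources": {"limits": {"nvidia.com/gpu": str(gpu_limit)}}}}
    if accelerator is not None:
        spec["nodeSelector"] = {"cloud.google.com/gke-accelerator": accelerator}
    return spec
```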

See the GPU tutorial for a complete example of building a Kubeflow pipeline that uses GPUs.

Check the Google Kubernetes Engine GPU guide to learn more about GPU settings.

Configure ContainerOp to consume TPUs

Use the following code to configure ContainerOp to consume TPUs on Google Kubernetes Engine:

import kfp.dsl as dsl
import kfp.gcp as gcp
tpu_op = dsl.ContainerOp(name='tpu-op', ...).apply(gcp.use_tpu(
  tpu_cores = 8, tpu_resource = 'v2', tf_version = '1.12'))

The code above uses 8 v2 TPU cores with TensorFlow version 1.12, and compiles into the following Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      cloud-tpus.google.com/v2: "8"
metadata:
  annotations:
    tf-version.cloud-tpus.google.com: "1.12"
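The mapping performed by gcp.use_tpu can be sketched with a helper (ours, illustrative) that mirrors the compiled spec above:

```python
def tpu_pod_fields(tpu_cores: int, tpu_resource: str, tf_version: str) -> dict:
    """Mirror the resource limit and Pod annotation shown above for
    gcp.use_tpu(tpu_cores, tpu_resource, tf_version)."""
    return {
        "limits": {f"cloud-tpus.google.com/{tpu_resource}": str(tpu_cores)},
        "annotations": {"tf-version.cloud-tpus.google.com": tf_version},
    }
```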

To learn more, see an example pipeline that uses a preemptible node pool with TPU or GPU.

See the Google Kubernetes Engine TPU Guide to learn more about TPU settings.

5 - Using Preemptible VMs and GPUs on Google Cloud

Configuring preemptible VMs and GPUs for Kubeflow Pipelines on Google Cloud

This document describes how to configure preemptible virtual machines (preemptible VMs) and GPUs on preemptible VM instances (preemptible GPUs) for your workflows running on Kubeflow Pipelines on Google Cloud.

Introduction

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.

GPUs attached to preemptible instances (preemptible GPUs) work like normal GPUs but persist only for the life of the instance.

Using preemptible VMs and GPUs can reduce costs on Google Cloud. In addition to using preemptible VMs, your Google Kubernetes Engine (GKE) cluster can autoscale based on current workloads.

This guide assumes that you have already deployed Kubeflow Pipelines. If not, follow the guide to deploying Kubeflow on Google Cloud.

Before you start

The variables defined in this page can be found in kubeflow-distribution/kubeflow/env.sh. They should have the same values that you set when deploying Kubeflow.

Using preemptible VMs with Kubeflow Pipelines

In summary, the steps to schedule a pipeline to run on preemptible VMs are as follows:

  1. Create a node pool in your cluster that contains preemptible VMs.
  2. Configure your pipelines to run on the preemptible VMs.

The following sections contain more detail about the above steps.

1. Create a node pool with preemptible VMs

Create a preemptible-nodepool.yaml file as below and fill in all placeholder values (KF_NAME, KF_PROJECT, LOCATION):

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  labels:
    kf-name: KF_NAME # kpt-set: ${name}
  name: PREEMPTIBLE_CPU_POOL
  namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
  location: LOCATION # kpt-set: ${location}
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 5
  nodeConfig:
    machineType: n1-standard-4
    diskSizeGb: 100
    diskType: pd-standard
    preemptible: true
    taint:
    - effect: NO_SCHEDULE
      key: preemptible
      value: "true"
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"
    serviceAccountRef:
      external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    metadata:
      disable-legacy-endpoints: "true"
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: KF_NAME # kpt-set: ${name}
    namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}

Where:

  • PREEMPTIBLE_CPU_POOL is the name of the node pool.
  • KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
  • KF_PROJECT is the name of your Kubeflow Google Cloud project.
  • LOCATION is the zone or region of this node pool, for example: us-west1-b.
  • KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account. Replace KF_NAME and KF_PROJECT with the values above; this is the VM service account created during your Kubeflow cluster deployment.
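If you want to double-check the placeholder substitution for the service account, the pattern can be expressed as a one-liner (helper name ours):

```python
def vm_service_account(kf_name: str, kf_project: str) -> str:
    """Fill in the KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com pattern."""
    return f"{kf_name}-vm@{kf_project}.iam.gserviceaccount.com"
```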

Apply the nodepool patch file above by running:

kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} apply -f <path-to-nodepool-file>/preemptible-nodepool.yaml

For Kubeflow Pipelines standalone only

Alternatively, if you are on Kubeflow Pipelines standalone or AI Platform Pipelines, you can run the following command to create a node pool:

gcloud container node-pools create PREEMPTIBLE_CPU_POOL \
    --cluster=CLUSTER_NAME \
    --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com

For example:

gcloud container node-pools create preemptible-cpu-pool \
    --cluster=user-4-18 \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com

2. Schedule your pipeline to run on the preemptible VMs

After configuring a node pool with preemptible VMs, you must configure your pipelines to run on the preemptible VMs.

In the DSL code for your pipeline, add the following to the ContainerOp instance:

.apply(gcp.use_preemptible_nodepool())

The above function works for both methods of generating a ContainerOp.

Note:

  • Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
  • If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a parameter hard_constraint. When the hard_constraint is True, the system will strictly schedule the task in preemptible VMs. When the hard_constraint is False, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.

For example:

import kfp.dsl as dsl
import kfp.gcp as gcp

class FlipCoinOp(dsl.ContainerOp):
  """Flip a coin and output heads or tails randomly."""

  def __init__(self):
    super(FlipCoinOp, self).__init__(
      name='Flip',
      image='python:alpine3.6',
      command=['sh', '-c'],
      arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                 'else \'tails\'; print(result)" | tee /tmp/output'],
      file_outputs={'output': '/tmp/output'})

@dsl.pipeline(
  name='pipeline flip coin',
  description='shows how to use dsl.Condition.'
)
def flipcoin():
  flip = FlipCoinOp().apply(gcp.use_preemptible_nodepool())

if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(flipcoin, __file__ + '.zip')
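The hard_constraint behavior described in the note above can be sketched as a tiny decision function (ours, for illustration only):

```python
def pick_node_pool(preemptible_available: bool, hard_constraint: bool) -> str:
    """Sketch of use_preemptible_nodepool()'s documented scheduling behavior:
    with hard_constraint=True the task is strictly scheduled on preemptible
    VMs; with hard_constraint=False it falls back to normal VMs when no
    preemptible VM is available."""
    if hard_constraint:
        return "preemptible"  # strict: only preemptible VMs are acceptable
    return "preemptible" if preemptible_available else "normal"
```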

Using preemptible GPUs with Kubeflow Pipelines

This guide assumes that you have already deployed Kubeflow Pipelines. In summary, the steps to schedule a pipeline to run with preemptible GPUs are as follows:

  1. Make sure you have enough GPU quota.
  2. Create a node pool in your Google Kubernetes Engine cluster that contains preemptible VMs with preemptible GPUs.
  3. Configure your pipelines to run on the preemptible VMs with preemptible GPUs.

The following sections contain more detail about the above steps.

1. Make sure you have enough GPU quota

Add GPU quota to your Google Cloud project. The Google Cloud documentation lists the availability of GPUs across regions. To check the available quota for resources in your project, go to the Quotas page in the Google Cloud Console.

2. Create a node pool of preemptible VMs with preemptible GPUs

Create a preemptible-gpu-nodepool.yaml file as below and fill in all placeholder values:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  labels:
    kf-name: KF_NAME # kpt-set: ${name}
  name: KF_NAME-containernodepool-gpu
  namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
  location: LOCATION # kpt-set: ${location}
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 5
  nodeConfig:
    machineType: n1-standard-4
    diskSizeGb: 100
    diskType: pd-standard
    preemptible: true
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"
    serviceAccountRef:
      external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    guestAccelerator:
    - type: "nvidia-tesla-k80"
      count: 1
    metadata:
      disable-legacy-endpoints: "true"
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: KF_NAME # kpt-set: ${name}
    namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}

Where:

  • KF_NAME-containernodepool-gpu is the name of the node pool.
  • KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
  • KF_PROJECT is the name of your Kubeflow Google Cloud project.
  • LOCATION is the zone or region of this node pool, for example: us-west1-b.
  • KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account. Replace KF_NAME and KF_PROJECT with the values above; this is the VM service account created during your Kubeflow cluster deployment.

For Kubeflow Pipelines standalone only

Alternatively, if you are on Kubeflow Pipelines standalone or AI Platform Pipelines, you can run the following command to create a node pool:

gcloud container node-pools create PREEMPTIBLE_GPU_POOL \
    --cluster=CLUSTER_NAME \
    --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT

For example:

gcloud container node-pools create preemptible-gpu-pool \
    --cluster=user-4-18 \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com \
    --accelerator=type=nvidia-tesla-t4,count=2

3. Schedule your pipeline to run on the preemptible VMs with preemptible GPUs

In the DSL code for your pipeline, add the following to the ContainerOp instance:

.apply(gcp.use_preemptible_nodepool())

The above function works for both methods of generating a ContainerOp.

Note:

  • Call .set_gpu_limit(#NUM_GPUs, GPU_VENDOR) on your ContainerOp to specify the GPU limit (for example, 1) and vendor (for example, 'nvidia').
  • Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
  • If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a parameter hard_constraint. When the hard_constraint is True, the system will strictly schedule the task in preemptible VMs. When the hard_constraint is False, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.

For example:

import kfp.dsl as dsl
import kfp.gcp as gcp

class FlipCoinOp(dsl.ContainerOp):
  """Flip a coin and output heads or tails randomly."""

  def __init__(self):
    super(FlipCoinOp, self).__init__(
      name='Flip',
      image='python:alpine3.6',
      command=['sh', '-c'],
      arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                 'else \'tails\'; print(result)" | tee /tmp/output'],
      file_outputs={'output': '/tmp/output'})

@dsl.pipeline(
  name='pipeline flip coin',
  description='shows how to use dsl.Condition.'
)
def flipcoin():
  flip = FlipCoinOp().set_gpu_limit(1, 'nvidia').apply(gcp.use_preemptible_nodepool())

if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(flipcoin, __file__ + '.zip')

Debugging

Run the following command if your node pool didn’t show up or had errors during provisioning:

kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} describe containernodepool -l kf-name=${KF_NAME}
