1 - Deployment

Instructions for deploying Kubeflow on Google Cloud

1.1 - Overview

Full fledged Kubeflow deployment on Google Cloud

This guide describes how to deploy Kubeflow and a series of Kubeflow components on Google Kubernetes Engine (GKE).

Features

Kubeflow deployed on Google Cloud includes the following:

  1. Full-fledged multi-user Kubeflow running on Google Kubernetes Engine.
  2. Cluster Autoscaler with automatic resizing of the node pool.
  3. Cloud Endpoint integrated with Identity-aware Proxy (IAP).
  4. GPU and Cloud TPU accelerated nodes available for your Machine Learning (ML) workloads.
  5. Cloud Logging for easy debugging and troubleshooting.
  6. Other managed services offered by Google Cloud, such as Cloud Storage, Cloud SQL, Anthos Service Mesh, Identity and Access Management (IAM), Config Controller, and so on.

Kubeflow on Google Cloud: central dashboard

Figure 1. User interface of full-fledged Kubeflow deployment on Google Cloud.

Management cluster

Kubeflow on Google Cloud employs a management cluster, which lets you manage Google Cloud resources via Config Controller. The management cluster is independent from the Kubeflow cluster and manages Kubeflow clusters. You can also use a management cluster from a different Google Cloud project, by assigning owner permissions to the associated service account.

Kubeflow on Google Cloud: clusters hierarchy

Figure 2. Example of Kubeflow on Google Cloud deployment.

Deployment process

To set up a Kubeflow environment on Google Cloud, complete these steps:

  1. Set up Google Cloud project.
  2. Set up OAuth client.
  3. Deploy Management cluster.
  4. Deploy Kubeflow cluster.

For debugging approaches to common issues encountered during these deployment steps, see troubleshooting deployments to find common issues and debugging approaches. If the issue isn’t included in the list of commonly encountered issues, report a bug at googlecloudplatform/kubeflow-distribution.

Next steps

1.2 - Setting up GCP Project

Creating a Google Cloud project for your Kubeflow deployment

In order to deploy Kubeflow on Google Cloud, you need to set up a Google Cloud project and enable necessary APIs for the deployment.

Setting up a project

Follow these steps to set up your Google Cloud project:

  • Select or create a project on the Google Cloud Console. If you plan to use different Google Cloud projects for Management Cluster and Kubeflow Clusters: create one Management project for Management Cluster, and create one or more Kubeflow projects for Kubeflow Clusters.

  • Make sure that you have the Owner role for the project in Cloud IAM (Identity and Access Management). The deployment process creates various service accounts with appropriate roles in order to enable seamless integration with Google Cloud services. This process requires that you have the owner role for the project in order to deploy Kubeflow.

  • Make sure that billing is enabled for your project. Refer to Enable billing for a project.

  • Enable the following APIs by running the following command in a Cloud Shell or local terminal (needs to be authenticated via gcloud auth login):

    gcloud services enable \
      serviceusage.googleapis.com \
      compute.googleapis.com \
      container.googleapis.com \
      iam.googleapis.com \
      servicemanagement.googleapis.com \
      cloudresourcemanager.googleapis.com \
      ml.googleapis.com \
      iap.googleapis.com \
      sqladmin.googleapis.com \
      meshconfig.googleapis.com \
      krmapihosting.googleapis.com \
      servicecontrol.googleapis.com \
      endpoints.googleapis.com \
      cloudbuild.googleapis.com
    

Alternatively, you can these APIs can be enabled via Google Cloud Console:

* [Service Usage API](https://cloud.google.com/service-usage/docs/reference/rest)
* [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com)
* [Kubernetes Engine API](https://console.cloud.google.com/apis/library/container.googleapis.com)
* [Identity and Access Management (IAM) API](https://console.cloud.google.com/apis/library/iam.googleapis.com)
* [Service Management API](https://console.cloud.google.com/apis/api/servicemanagement.googleapis.com)
* [Cloud Resource Manager API](https://console.developers.google.com/apis/library/cloudresourcemanager.googleapis.com)
* [AI Platform Training & Prediction API](https://console.developers.google.com/apis/library/ml.googleapis.com)
* [Cloud Identity-Aware Proxy API](https://console.cloud.google.com/apis/library/iap.googleapis.com)
* [Cloud Build API](https://console.cloud.google.com/apis/library/cloudbuild.googleapis.com)
* [Cloud SQL Admin API](https://console.cloud.google.com/apis/library/sqladmin.googleapis.com)
* [Config Controller (KRM API Hosting API)](https://console.cloud.google.com/apis/library/krmapihosting.googleapis.com)
* [Service Control API](https://console.cloud.google.com/apis/library/servicecontrol.googleapis.com)
* [Google Cloud Endpoints](https://console.cloud.google.com/apis/library/endpoints.googleapis.com)
  • If you are using the Google Cloud Free Program or the 12-month trial period with $300 credit, note that the free tier does not offer enough resources for default full Kubeflow installation. You need to upgrade to a paid account.

    For more information, see the following issues:

    Read the Google Cloud Resource quotas to understand quotas on resource usage that Compute Engine enforces, and to learn how to check and increase your quotas.

  • Initialize your project to prepare it for Anthos Service Mesh installation:

    PROJECT_ID=<YOUR_PROJECT_ID>
    
    curl --request POST \
      --header "Authorization: Bearer $(gcloud auth print-access-token)" \
      --data '' \
      https://meshconfig.googleapis.com/v1alpha1/projects/${PROJECT_ID}:initialize
    

    Refer to Anthos Service Mesh documentation for details.

    If you encounter a Workload Identity Pool does not exist error, refer to the following issue:

    • kubeflow/website #2121 describes that creating and then removing a temporary Kubernetes cluster may be needed for projects that haven’t had a cluster set up beforehand.

You do not need a running Google Kubernetes Engine cluster. The deployment process creates a cluster for you.

Next steps

1.3 - Setting up OAuth client

Creating an OAuth client for Cloud IAP on Google Cloud

If you want to use Cloud Identity-Aware Proxy (Cloud IAP) when deploying Kubeflow on Google Cloud, then you must follow these instructions to create an OAuth client for use with Kubeflow.

Cloud IAP is recommended for production deployments or deployments with access to sensitive data.

Follow the steps below to create an OAuth client ID that identifies Cloud IAP when requesting access to a user’s email account. Kubeflow uses the email address to verify the user’s identity.

  1. Set up your OAuth consent screen:

    • In the Application name box, enter the name of your application. The example below uses the name “Kubeflow”.

    • Under Support email, select the email address that you want to display as a public contact. You must use either your email address or a Google Group that you own.

    • If you see Authorized domains, enter

      <project>.cloud.goog
      
      • where <project> is your Google Cloud project ID.
      • If you are using your own domain, such as acme.com, you should add that as well
      • The Authorized domains option appears only for certain project configurations. If you don’t see the option, then there’s nothing you need to set.
    • Click Save.

    • Here’s an example of the completed form:
      OAuth consent screen

  2. On the credentials screen:

    • Click Create credentials, and then click OAuth client ID.
    • Under Application type, select Web application.
    • In the Name box enter any name for your OAuth client ID. This is not the name of your application nor the name of your Kubeflow deployment. It’s just a way to help you identify the OAuth client ID.
  3. Click Create. A dialog box appears, like the one below:

    OAuth consent screen

  4. Copy the client ID shown in the dialog box, because you need the client ID in the next step.

  5. On the Create credentials screen, find your newly created OAuth credential and click the pencil icon to edit it:

    OAuth consent screen

  6. In the Authorized redirect URIs box, enter the following (if it’s not already present in the list of authorized redirect URIs):

    https://iap.googleapis.com/v1/oauth/clientIds/<CLIENT_ID>:handleRedirect
    
    • <CLIENT_ID> is the OAuth client ID that you copied from the dialog box in step four. It looks like XXX.apps.googleusercontent.com.
    • Note that the URI is not dependent on the Kubeflow deployment or endpoint. Multiple Kubeflow deployments can share the same OAuth client without the need to modify the redirect URIs.
  7. Press Enter/Return to add the URI. Check that the URI now appears as a confirmed item under Authorized redirect URIs. (The URI should no longer be editable.)

    Here’s an example of the completed form:

    OAuth credentials

  8. Click Save.

  9. Make note that you can find your OAuth client credentials in the credentials section of the Google Cloud Console. You need to retrieve the client ID and client secret later when you’re ready to enable Cloud IAP.

Next steps

1.4 - Deploying Management cluster

Setting up a management cluster on Google Cloud

This guide describes how to setup a management cluster which you will use to deploy one or more instances of Kubeflow.

The management cluster is used to run Cloud Config Connector. Cloud Config Connector is a Kubernetes addon that allows you to manage Google Cloud resources through Kubernetes.

While the management cluster can be deployed in the same project as your Kubeflow cluster, typically you will want to deploy it in a separate project used for administering one or more Kubeflow instances, because it will run with escalated permissions to create Google Cloud resources in the managed projects.

Optionally, the cluster can be configured with Anthos Config Management to manage Google Cloud infrastructure using GitOps.

Deployment steps

Install the required tools

  1. gcloud components

    gcloud components install kubectl kustomize kpt anthoscli beta
    gcloud components update
    # If the output said the Cloud SDK component manager is disabled for installation, copy the command from output and run it.
    

    You can install specific version of kubectl by following instruction (Example: Install kubectl on Linux). Latest patch version of kubectl from v1.17 to v1.19 works well too.

    Note: Starting from Kubeflow 1.4, it requires kpt v1.0.0-beta.6 or above to operate in googlecloudplatform/kubeflow-distribution repository. gcloud hasn’t caught up with this kpt version yet, install kpt separately from https://github.com/GoogleContainerTools/kpt/tags for now. Note that kpt requires docker to be installed.

Fetch googlecloudplatform/kubeflow-distribution package

The management cluster manifests live in GitHub repository googlecloudplatform/kubeflow-distribution, use the following commands to pull Kubeflow manifests:

  1. Clone the GitHub repository and check out the latest manifests:

    git clone https://github.com/googlecloudplatform/kubeflow-distribution.git 
    cd kubeflow-distribution
    git checkout master
    

    Alternatively, you can get the package by using kpt:

    # Check out the latest Kubeflow
    kpt pkg get https://github.com/googlecloudplatform/kubeflow-distribution.git@master kubeflow-distribution
    cd kubeflow-distribution
    
  2. Go to kubeflow-distribution/management directory for Management cluster configurations.

    cd management
    

Configure Environment Variables

Fill in environment variables in kubeflow-distribution/management/env.sh as followed:

MGMT_PROJECT=<the project where you deploy your management cluster>
MGMT_NAME=<name of your management cluster>
LOCATION=<location of your management cluster, use either us-central1 or us-east1>

And run:

source env.sh

This guide assumes the following convention:

  • The ${MGMT_PROJECT} environment variable contains the Google Cloud project ID where management cluster is deployed to.

  • ${MGMT_NAME} is the cluster name of your management cluster and the prefix for other Google Cloud resources created in the deployment process. Management cluster should be a different cluster from your Kubeflow cluster.

    Note, ${MGMT_NAME} should

    • start with a lowercase letter
    • only contain lowercase letters, numbers and -
    • end with a number or a letter
    • contain no more than 18 characters
  • The ${LOCATION} environment variable contains the location of your management cluster. you can choose between regional or zonal, see Available regions and zones.

Configure kpt setter values

Use kpt to set values for the name, project, and location of your management cluster. Run the following command:

bash kpt-set.sh

Note, you can find out which setters exist in a package and what their current values are by running the following command:

kpt fn eval -i list-setters:v0.1 ./manifests

Prerequisite for Config Controller

In order to deploy Google Cloud Services like Kubernetes resources, we need to create a management cluster with Config Controller installed. Follow Before you begin to create default network if not existed. Make sure to use ${MGMT_PROJECT} for PROJECT_ID.

Deploy Management Cluster

  1. Deploy the management cluster by applying cluster resources:

    make create-cluster
    
  2. Create a kubectl context for the management cluster, it will be named ${MGMT_NAME}:

    make create-context
    
  3. Grant permission to Config Controller service account:

    make grant-owner-permission
    

    Config Controller has created a default service account, this step grants owner permission to this service account in order to allow Config Controller to manage Google Cloud resources. Refer to Config Controller setup.

Understanding the deployment process

This section gives you more details about the configuration and deployment process, so that you can customize your management cluster if necessary.

Config Controller

Management cluster is a tool for managing Google Cloud services like KRM, for example: Google Kubernetes Engine container cluster, MySQL database, etc. And you can use one Management cluster for multiple Kubeflow clusters, across multiple Google Cloud projects. This capability is offered by Config Connector.

Starting with Kubeflow 1.5, we leveraged the managed version of Config Connector, which is called Config Controller. Therefore, The Management cluster is the Config Controller cluster deployed using Config Controller setup process. Note that you can create only one Management cluster within a Google Cloud project, and you usually need just one Management cluster.

Management cluster layout

Inside the Config Controller, we manage Google Cloud resources in namespace mode. That means one namespace is responsible to manage Google Cloud resources deployed to the Google Cloud project with the same name. Your management cluster contains following namespaces:

  1. config-control
  2. namespace with the same name as your Kubeflow clusters’ Google Cloud project name

config-control is the default namespace which is installed while creating Management cluster, you have granted the default service account (like service-<management-project-id>@gcp-sa-yakima.iam.gserviceaccount.com) within this project to manage Config Connector. It is the prerequisite for managing resources in other Google Cloud projects.

namespace with the same name as your Kubeflow clusters' Google Cloud project name is the resource pool for Kubeflow cluster’s Google Cloud project. For each Kubeflow Google Cloud project, you will have service account with pattern kcc-<kf-project-name>@<management-project-name>.iam.gserviceaccount.com in config-control namespace, and it needs to have owner permission to ${KF_PROJECT}, you will perform this step during Deploy Kubeflow cluster. After setup, your Google Cloud resources in Kubeflow cluster project will be deployed to the namespace with name ${KF_PROJECT} in the management cluster.

Your management cluster directory contains the following file:

  • Makefile is a file that defines rules to automate deployment process. You can refer to GNU make documentation for more introduction. The Makefile we provide is designed to be user maintainable. You are encouraged to read, edit and maintain it to suit your own deployment customization needs.

Debug

If you encounter issue creating Google Cloud resources using Config Controller. You can list resources in the ${KF_PROJECT} namespace of management cluster to learn about the detail. Learn more with Monitoring your resources

kubectl --context=${MGMT_NAME} get all -n ${KF_PROJECT}

# If you want to check the service account creation status
kubectl --context=${MGMT_NAME} get IAMServiceAccount -n ${KF_PROJECT}
kubectl --context=${MGMT_NAME} get IAMServiceAccount <service-account-name> -n ${KF_PROJECT} -oyaml

FAQs

  • Where is kfctl?

    kfctl is no longer being used to apply resources for Google Cloud, because required functionalities are now supported by generic tools including Make, Kustomize, kpt, and Cloud Config Connector.

  • Why do we use an extra management cluster to manage Google Cloud resources?

    The management cluster is very lightweight cluster that runs Cloud Config Connector. Cloud Config Connector makes it easier to configure Google Cloud resources using YAML and Kustomize.

For a more detailed explanation of the drastic changes happened in Kubeflow v1.1 on Google Cloud, read googlecloudplatform/kubeflow-distribution #123.

Next steps

1.5 - Deploying Kubeflow cluster

Instructions for using kubectl and kpt to deploy Kubeflow on Google Cloud

This guide describes how to use kubectl and kpt to deploy Kubeflow on Google Cloud.

Deployment steps

Prerequisites

Before installing Kubeflow on the command line:

  1. You must have created a management cluster and installed Config Connector.

    • If you don’t have a management cluster follow the instructions

    • Your management cluster will need a namespace setup to administer the Google Cloud project where Kubeflow will be deployed. This step will be included in later step of current page.

  2. You need to use Linux or Cloud Shell for ASM installation. Currently ASM installation doesn’t work on macOS because it comes with an old version of bash.

  3. Make sure that your Google Cloud project meets the minimum requirements described in the project setup guide.

  4. Follow the guide setting up OAuth credentials to create OAuth credentials for Cloud Identity-Aware Proxy (Cloud IAP).

Install the required tools

  1. Install gcloud.

  2. Install gcloud components

    gcloud components install kubectl kustomize kpt anthoscli beta
    gcloud components update
    

    You can install specific version of kubectl by following instruction (Example: Install kubectl on Linux). Latest patch version of kubectl from v1.17 to v1.19 works well too.

    Note: Starting from Kubeflow 1.4, it requires kpt v1.0.0-beta.6 or above to operate in googlecloudplatform/kubeflow-distribution repository. gcloud hasn’t caught up with this kpt version yet, install kpt separately from https://github.com/GoogleContainerTools/kpt/tags for now. Note that kpt requires docker to be installed.

    Note: You also need to install required tools for ASM installation tool install_asm.

Fetch googlecloudplatform/kubeflow-distribution and upstream packages

  1. If you have already installed Management cluster, you have googlecloudplatform/kubeflow-distribution locally. You just need to run cd kubeflow to access Kubeflow cluster manifests. Otherwise, you can run the following commands:

    # Check out the latest Kubeflow
    git clone https://github.com/googlecloudplatform/kubeflow-distribution.git 
    cd kubeflow-distribution
    git checkout master
    

    Alternatively, you can get the package by using kpt:

    # Check out the latest Kubeflow
    kpt pkg get https://github.com/googlecloudplatform/kubeflow-distribution.git@master kubeflow-distribution
    cd kubeflow-distribution
    
  2. Run the following command to pull upstream manifests from kubeflow/manifests repository.

    # Visit Kubeflow cluster related manifests
    cd kubeflow
    
    bash ./pull-upstream.sh
    

Environment Variables

Log in to gcloud. You only need to run this command once:

gcloud auth login
  1. Review and fill all the environment variables in kubeflow-distribution/kubeflow/env.sh, they will be used by kpt later on, and some of them will be used in this deployment guide. Review the comment in env.sh for the explanation for each environment variable. After defining these environment variables, run:

    source env.sh
    
  2. Set environment variables with OAuth Client ID and Secret for IAP:

    export CLIENT_ID=<Your CLIENT_ID>
    export CLIENT_SECRET=<Your CLIENT_SECRET>
    

kpt setter config

Run the following commands to configure kpt setter for your Kubeflow cluster:

bash ./kpt-set.sh

Everytime you change environment variables, make sure you run the command above to apply kpt setter change to all packages. Otherwise, kustomize build will not be able to pick up new changes.

Note, you can find out which setters exist in a package and their current values by running the following commands:

kpt fn eval -i list-setters:v0.1 ./apps
kpt fn eval -i list-setters:v0.1 ./common

You can learn more about list-setters in kpt documentation.

Authorize Cloud Config Connector for each Kubeflow project

In the Management cluster deployment we created the Google Cloud service account serviceAccount:kcc-${KF_PROJECT}@${MGMT_PROJECT}.iam.gserviceaccount.com this is the service account that Config Connector will use to create any Google Cloud resources in ${KF_PROJECT}. You need to grant this Google Cloud service account sufficient privileges to create the desired resources in Kubeflow project. You only need to perform steps below once for each Kubeflow project, but make sure to do it even when KF_PROJECT and MGMT_PROJECT are the same project.

The easiest way to do this is to grant the Google Cloud service account owner permissions on one or more projects.

  1. Set the Management environment variable if you haven’t:

    MGMT_PROJECT=<the project where you deploy your management cluster>
    MGMT_NAME=<the kubectl context name for management cluster>
    
  2. Apply ConfigConnectorContext for ${KF_PROJECT} in management cluster:

    make apply-kcc
    

Configure Kubeflow

Make sure you are using KF_PROJECT in the gcloud CLI tool:

gcloud config set project ${KF_PROJECT}

Deploy Kubeflow

To deploy Kubeflow, run the following command:

make apply
  • If deployment returns an error due to missing resources in serving.kserve.io API group, rerun make apply. This is due to a race condition between CRD and runtime resources in KServe.

  • If resources can’t be created because webhook.cert-manager.io is unavailable wait and then rerun make apply

  • If resources can’t be created with an error message like:

    error: unable to recognize ".build/application/app.k8s.io_v1beta1_application_application-controller-kubeflow.yaml": no matches for kind "Application" in version "app.k8s.io/v1beta1”
    

    This issue occurs when the CRD endpoint isn’t established in the Kubernetes API server when the CRD’s custom object is applied. This issue is expected and can happen multiple times for different kinds of resource. To resolve this issue, try running make apply again.

Check your deployment

Follow these steps to verify the deployment:

  1. When the deployment finishes, check the resources installed in the namespace kubeflow in your new cluster. To do this from the command line, first set your kubectl credentials to point to the new cluster:

    gcloud container clusters get-credentials "${KF_NAME}" --zone "${ZONE}" --project "${KF_PROJECT}"
    

    Then, check what’s installed in the kubeflow namespace of your Google Kubernetes Engine cluster:

    kubectl -n kubeflow get all
    

Access the Kubeflow user interface (UI)

To access the Kubeflow central dashboard, follow these steps:

  1. Use the following command to grant yourself the IAP-secured Web App User role:

    gcloud projects add-iam-policy-binding "${KF_PROJECT}" --member=user:<EMAIL> --role=roles/iap.httpsResourceAccessor
    

    Note, you need the IAP-secured Web App User role even if you are already an owner or editor of the project. IAP-secured Web App User role is not implied by the Project Owner or Project Editor roles.

  2. Enter the following URI into your browser address bar. It can take 20 minutes for the URI to become available: https://${KF_NAME}.endpoints.${KF_PROJECT}.cloud.goog/

    You can run the following command to get the URI for your deployment:

    kubectl -n istio-system get ingress
    NAME            HOSTS                                                      ADDRESS         PORTS   AGE
    envoy-ingress   your-kubeflow-name.endpoints.your-gcp-project.cloud.goog   34.102.232.34   80      5d13h
    

    The following command sets an environment variable named HOST to the URI:

    export HOST=$(kubectl -n istio-system get ingress envoy-ingress -o=jsonpath={.spec.rules[0].host})
    

Notes:

  • It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name.
  • If you own or manage the domain or a subdomain with Cloud DNS then you can configure this process to be much faster. Check kubeflow/kubeflow#731.

Understanding the deployment process

This section gives you more details about the kubectl, kustomize, config connector configuration and deployment process, so that you can customize your Kubeflow deployment if necessary.

Application layout

Your Kubeflow application directory kubeflow-distribution/kubeflow contains the following files and directories:

  • Makefile is a file that defines rules to automate deployment process. You can refer to GNU make documentation for more introduction. The Makefile we provide is designed to be user maintainable. You are encouraged to read, edit and maintain it to suit your own deployment customization needs.

  • apps, common, contrib are a series of independent components directory containing kustomize packages for deploying Kubeflow components. The structure is to align with upstream kubeflow/manifests.

  • build is a directory that will contain the hydrated manifests outputted by the make rules, each component will have its own build directory. You can customize the build path when calling make command.

Source Control

It is recommended that you check in your entire local repository into source control.

Checking in build is recommended so you can easily see differences by git diff in manifests before applying them.

Google Cloud service accounts

The kfctl deployment process creates three service accounts in your Google Cloud project. These service accounts follow the principle of least privilege. The service accounts are:

  • ${KF_NAME}-admin is used for some admin tasks like configuring the load balancers. The principle is that this account is needed to deploy Kubeflow but not needed to actually run jobs.
  • ${KF_NAME}-user is intended to be used by training jobs and models to access Google Cloud resources (Cloud Storage, BigQuery, etc.). This account has a much smaller set of privileges compared to admin.
  • ${KF_NAME}-vm is used only for the virtual machine (VM) service account. This account has the minimal permissions needed to send metrics and logs to Stackdriver.

Upgrade Kubeflow

Refer to Upgrading Kubeflow cluster.

Next steps

1.6 - Upgrade Kubeflow

Upgrading your Kubeflow installation on Google Cloud

Before you start

To better understand upgrade process, you should read the following sections first:

This guide assumes the following settings:

  • The ${MGMT_DIR} and ${MGMT_NAME} environment variables are the same as in Management cluster setup.
  • The ${KF_NAME}, ${CLIENT_ID} and ${CLIENT_SECRET} environment variables are the same as in Deploy using kubectl and kpt.
  • The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/kubeflow-distribution/kubeflow/.

General upgrade instructions

Starting from Kubeflow v1.5, we have integrated with Config Controller. You don’t need to manually upgrade Management cluster any more, since it managed by Upgrade Config Controller.

Starting from Kubeflow v1.3, we have reworked on the structure of googlecloudplatform/kubeflow-distribution repository. All resources are located in kubeflow-distribution/management directory. Upgrade to Management cluster v1.3 is not supported.

Before Kubeflow v1.3, both management cluster and Kubeflow cluster follow the same instance and upstream folder convention. To upgrade, you’ll typically need to update packages in upstream to the new version and repeat the make apply-<subcommand> commands in their respective deployment process.

However, specific upgrades might need manual actions below.

Upgrading management cluster

Upgrading management cluster before 1.5

It is strongly recommended to use source control to keep a copy of your working repository for recording changes at each step.

Due to the refactoring of kubeflow/manifests repository, the way we depend on googlecloudplatform/kubeflow-distribution has changed drastically. This section suits for upgrading from Kubeflow 1.3 to higher.

  1. The instructions below assume that your current working directory is

    cd "${MGMT_DIR}"
    
  2. Use your management cluster’s kubectl context:

    # Look at all your contexts
    kubectl config get-contexts
    # Select your management cluster's context
    kubectl config use-context "${MGMT_NAME}"
    # Verify the context connects to the cluster properly
    kubectl get namespace
    

    If you are using a different environment, you can always reconfigure the context by:

    make create-context
    
  3. Check your existing config connector version:

    # For Kubeflow v1.3, it should be 1.46.0
    $ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
    1.46.0
    
  4. Merge the content from new Kubeflow version of googlecloudplatform/kubeflow-distribution

    WORKING_BRANCH=<your-github-working-branch>
    VERSION_TAG=<targeted-kubeflow-version-tag-on-github>
    git checkout -b "${WORKING_BRANCH}"
    git remote add upstream https://github.com/googlecloudplatform/kubeflow-distribution.git # This is one time only.
    git fetch upstream 
    git merge "${VERSION_TAG}"
    
  5. Make sure your build directory (./build by default) is checked in to source control (git).

  6. Run the following command to hydrate Config Connector resources:

    make hydrate-kcc
    
  7. Compare the difference on your source control tracking after making hydration change. If they are addition or modification only, proceed to next step. If it includes deletion, you need to use kubectl delete to manually clean up the deleted resource for cleanup.

  8. After confirmation, run the following command to apply new changes:

    make apply-kcc
    
  9. Check version has been upgraded after applying new Config Connector resource:

    $ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
    

Upgrade management cluster from v1.1 to v1.2

  1. The instructions below assume that your current working directory is

    cd "${MGMT_DIR}"
    
  2. Use your management cluster’s kubectl context:

    # Look at all your contexts
    kubectl config get-contexts
    # Select your management cluster's context
    kubectl config use-context "${MGMT_NAME}"
    # Verify the context connects to the cluster properly
    kubectl get namespace
    

    If you are using a different environment, you can always reconfigure the context by:

    make create-context
    
  3. Check your existing config connector version:

    # For Kubeflow v1.1, it should be 1.15.1
    $ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
    1.15.1
    
  4. Uninstall the old config connector in the management cluster:

    kubectl delete sts,deploy,po,svc,roles,clusterroles,clusterrolebindings --all-namespaces -l cnrm.cloud.google.com/system=true --wait=true
    kubectl delete validatingwebhookconfiguration abandon-on-uninstall.cnrm.cloud.google.com --ignore-not-found --wait=true
    kubectl delete validatingwebhookconfiguration validating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
    kubectl delete mutatingwebhookconfiguration mutating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
    

    These commands uninstall the config connector without removing your resources.

  5. Replace your ./Makefile with the version in Kubeflow v1.2.0: https://github.com/googlecloudplatform/kubeflow-distribution/blob/v1.2.0/management/Makefile.

    If you made any customizations in ./Makefile, you should merge your changes with the upstream version. We’ve refactored the Makefile to move substantial commands into the upstream package, so hopefully future upgrades won’t require a manual merge of the Makefile.

  6. Update ./upstream/management package:

    make update
    
  7. Use kpt to set user values:

    kpt cfg set -R . name ${MGMT_NAME}
    kpt cfg set -R . gcloud.core.project ${MGMT_PROJECT}
    kpt cfg set -R . location ${LOCATION}
    

    Note, you can find out which setters exist in a package and what there current values are by:

    kpt cfg list-setters .
    
  8. Apply upgraded config connector:

    make apply-kcc
    

    Note, you can optionally also run make apply-cluster, but it should be the same as your existing management cluster.

  9. Check that your config connector upgrade is successful:

    # For Kubeflow v1.2, it should be 1.29.0
    $ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
    1.29.0
    

Upgrading Kubeflow cluster

To upgrade from specific versions of Kubeflow, you may need to take certain manual actions — refer to specific sections in the guidelines below.

General instructions for upgrading Kubeflow cluster

  1. The instructions below assume that:

    • Your current working directory is:

      cd ${KF_DIR}
      
    • Your kubectl uses a context that connects to your Kubeflow cluster

      # List your existing contexts
      kubectl config get-contexts
      # Use the context that connects to your Kubeflow cluster
      kubectl config use-context ${KF_NAME}
      
  2. Merge the new version of googlecloudplatform/kubeflow-distribution (example: v1.3.1), you don’t need to do it again if you have already done so during management cluster upgrade.

    WORKING_BRANCH=<your-github-working-branch>
    VERSION_TAG=<targeted-kubeflow-version-tag-on-github>
    git checkout -b "${WORKING_BRANCH}"
    git remote add upstream https://github.com/googlecloudplatform/kubeflow-distribution.git # This is one time only.
    git fetch upstream 
    git merge "${VERSION_TAG}"
    
  3. Change the KUBEFLOW_MANIFESTS_VERSION in ./pull-upstream.sh with the targeted kubeflow version same as $VERSION_TAG. Run the following commands to pull new changes from upstream kubeflow/manifests.

    bash ./pull-upstream.sh
    
  4. (Optional) If you only want to upgrade some of Kubeflow components, you can comment non-upgrade components in kubeflow/config.yaml file. Commands below will only apply the remaining components.

  5. Make sure you have checked in build folders for each component. The following command will change them so you can compare for difference.

    make hydrate
    
  6. Once you confirm the changes are ready to apply, run the following command to upgrade Kubeflow cluster:

    make apply
    

Upgrade Kubeflow cluster to v1.6

Starting from Kubeflow v1.6.0:

  • Component with deprecated API versions were upgraded to support Google Kubernetes Engine v1.22. If you would like to upgrade your Google Kubernetes Engine cluster, follow GCP instructions.
  • ASM was upgraded to v1.14. Follow the instructions on how to upgrade ASM (Anthos Service Mesh). If you want to use ASM version prior to 1.11, refer to the legacy instructions.
  • Knative was upgraded to v1.2. Follow Knative instructions to check current version and see if the update includes any breaking changes.
  • Cert-manager was upgraded to v1.5. To check your current version and see if the update includes any breaking changes, follow the cert-manager instructions.
  • Deprecated kfserving component was removed. To upgrade to KServe, follow the KServe Migration guide.

Upgrade Kubeflow cluster to v1.5

In Kubeflow v1.5.1 we use ASM v1.13. See how to upgrade ASM. To use ASM versions prior to 1.11, follow the legacy instructions.

Starting from Kubeflow v1.5, Kubeflow manifests have included KServe as an independent component from kfserving, Google Cloud distribution has switched over from kfserving to KServe for default installed components. If you want to upgrade Kubeflow while keeping kfsering, you can comment KServe and uncomment kfserving in kubeflow-distribution/kubeflow/config.yaml file. If you want to upgrade to KServe, follow the KServe Migration guide.

Upgrade Kubeflow cluster to v1.3

Due to the refactoring of kubeflow/manifests repository, the way we depend on googlecloudplatform/kubeflow-distribution has changed drastically. Upgrade to Kubeflow cluster v1.3 is not supported. And individual component upgrade has been deferred to its corresponding repository for support.

Upgrade Kubeflow cluster from v1.1 to v1.2

  1. The instructions below assume

    • Your current working directory is:

      cd ${KF_DIR}
      
    • Your kubectl uses a context that connects to your Kubeflow cluster:

      # List your existing contexts
      kubectl config get-contexts
      # Use the context that connects to your Kubeflow cluster
      kubectl config use-context ${KF_NAME}
      
  2. (Recommended) Replace your ./Makefile with the version in Kubeflow v1.2.0: https://github.com/googlecloudplatform/kubeflow-distribution/blob/v1.2.0/kubeflow/Makefile.

    If you made any customizations in ./Makefile, you should merge your changes with the upstream version.

    This step is recommended, because we introduced usability improvements and fixed compatibility for newer Kustomize versions (while still being compatible with Kustomize v3.2.1) to the Makefile. However, the deployment process is backward-compatible, so this is recommended, but not required.

  3. Update ./upstream/manifests package:

    make update
    
  4. Before applying new resources, you need to delete some immutable resources that were updated in this release:

    kubectl delete statefulset kfserving-controller-manager -n kubeflow --wait
    kubectl delete crds experiments.kubeflow suggestions.kubeflow trials.kubeflow
    

    WARNING: This step deletes all Katib running resources.

    Refer to a github comment in the v1.2 release issue for more details.

  5. Redeploy:

    make apply
    

    To evaluate the changes before deploying them you can:

    1. Run make hydrate.
    2. Compare the contents of .build with a historic version with tools like git diff.

Upgrade ASM (Anthos Service Mesh)

If you want to upgrade ASM instead of the Kubeflow components, refer to [kubeflow/common/asm/README.md](https://github.com/googlecloudplatform/kubeflow-distribution/blob/master/kubeflow/asm/README.md for the latest instructions on upgrading ASM. Detailed explanation is listed below. Note: if you are going to upgrade major or minor version of ASM, it is recommended to read the official ASM upgrade documentation before proceeding with the steps below.

Install a new ASM workload

In order to use the new ASM version, we need to download the corresponding ASM configuration package and asmcli script. Get a list of available ASM packages and the corresponding asmcli scripts by running the following command:

curl https://storage.googleapis.com/csm-artifacts/asm/ASMCLI_VERSIONS

It should return a list of ASM versions that can be installed with asmcli script. To install older versions, refer to the legacy instructions. The returned list will have a format of ${ASM_PACKAGE_VERSION}:${ASMCLI_SCRIPT_VERSION}. For example, in the following output:

...
1.13.2-asm.5+config2:asmcli_1.13.2-asm.5-config2
1.13.2-asm.5+config1:asmcli_1.13.2-asm.5-config1
1.13.2-asm.2+config2:asmcli_1.13.2-asm.2-config2
1.13.2-asm.2+config1:asmcli_1.13.2-asm.2-config1
1.13.1-asm.1+config1:asmcli_1.13.1-asm.1-config1
...

record 1.13.2-asm.5+config2:asmcli_1.13.2-asm.5-config2 corresponds to:

ASM_PACKAGE_VERSION=1.13.2-asm.5+config2
ASMCLI_SCRIPT_VERSION=asmcli_1.13.2-asm.5-config2

You need to set these two values in kubeflow/asm/Makefile. Then, run the following command in kubeflow/asm directory to install the new ASM. Note, the old ASM will not be uninstalled.

make apply

Once installed successfully, you can see istiod Deployment in your cluster with name in pattern istiod-asm-VERSION-REVISION. For example, istiod-asm-1132-5 would correspond to ASM version 1.13.2-asm.5.

Upgrade other Kubeflow components to use new ASM

There are multiple Kubeflow components with ASM namespace label, including user created namespaces. To upgrade them at once, change the following line in kubeflow/env.sh with the new ASM version asm-VERSION-REVISION, like asm-1132-5.

export ASM_LABEL=asm-1132-5

Then run the following commands in kubeflow/ directory to configure the environmental variables:

source env.sh

Run the following command to configure kpt setter:

bash kpt-set.sh

Examine the change using source control after running the following command:

make hydrate

Refer to Deploying and redeploying workloads for the complete steps to adopt the new ASM version. As part of the instructions, you can run the following command to update namespaces’ labels across the cluster:

make apply

(Optional) Uninstall the old ASM workload

Once you validated that new ASM installation and sidecar-injection for Kubeflow components are working as expected. You can Complete the transition to the new ASM or Rollback to the old ASM as instructed in Deploy and Redeploy workloads.

1.7 - Monitoring Cloud IAP Setup

Instructions for monitoring and troubleshooting Cloud IAP

Cloud Identity-Aware Proxy (Cloud IAP) is the recommended solution for accessing your Kubeflow deployment from outside the cluster, when running Kubeflow on Google Cloud.

This document is a step-by-step guide to ensuring that your IAP-secured endpoint is available, and to debugging problems that may cause the endpoint to be unavailable.

Introduction

When deploying Kubeflow using the command-line interface, you choose the authentication method you want to use. One of the options is Cloud IAP. This document assumes that you have already deployed Kubeflow.

Kubeflow uses the Google-managed certificate to provide an SSL certificate for the Kubeflow Ingress.

Cloud IAP gives you the following benefits:

  • Users can log in in using their Google Cloud accounts.
  • You benefit from Google’s security expertise to protect your sensitive workloads.

Monitoring your Cloud IAP setup

Follow these instructions to monitor your Cloud IAP setup and troubleshoot any problems:

  1. Examine the Ingress and Google Cloud Build (GCB) load balancer to make sure it is available:

    kubectl -n istio-system describe ingress
    
    Name:             envoy-ingress
    Namespace:        kubeflow
    Address:          35.244.132.160
    Default backend:  default-http-backend:80 (10.20.0.10:8080)
    Annotations:
    ...
    Events:
       Type     Reason     Age                 From                     Message
       ----     ------     ----                ----                     -------
       Normal   ADD        12m                 loadbalancer-controller  kubeflow/envoy-ingress
       Warning  Translate  12m (x10 over 12m)  loadbalancer-controller  error while evaluating the ingress spec: could not find service "kubeflow/envoy"
       Warning  Translate  12m (x2 over 12m)   loadbalancer-controller  error while evaluating the ingress spec: error getting BackendConfig for port "8080" on service "kubeflow/envoy", err: no BackendConfig for service port exists.
       Warning  Sync       12m                 loadbalancer-controller  Error during sync: Error running backend syncing routine: received errors when updating backend service: googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
     googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
       Normal  CREATE  11m  loadbalancer-controller  ip: 35.244.132.160
    ...
    

    There should be an annotation indicating that we are using managed certificate:

    annotation:
      networking.gke.io/managed-certificates: gke-certificate
    

    Any problems with creating the load balancer are reported as Kubernetes events in the results of the above describe command.

    • If the address isn’t set then there was a problem creating the load balancer.

    • The CREATE event indicates the load balancer was successfully created on the specified IP address.

    • The most common error is running out of Google Cloud resource quota. To fix this problem, you must either increase the quota for the relevant resource on your Google Cloud project or delete some existing resources.

  2. Verify that a managed certificate resource is generated:

    kubectl describe -n istio-system managedcertificate gke-certificate
    

    The status field should have information about the current status of the Certificate. Eventually, certificate status should be Active.

  3. Wait for the load balancer to report the back ends as healthy:

    kubectl describe -n istio-system ingress envoy-ingress
    
    ...
    Annotations:
     kubernetes.io/ingress.global-static-ip-name:  kubeflow-ip
     kubernetes.io/tls-acme:                       true
     certmanager.k8s.io/issuer:                    letsencrypt-prod
     ingress.kubernetes.io/backends:               {"k8s-be-31380--5e1566252944dfdb":"HEALTHY","k8s-be-32133--5e1566252944dfdb":"HEALTHY"}
    ...
    

    Both backends should be reported as healthy. It can take several minutes for the load balancer to consider the back ends healthy.

    The service with port 31380 is the one that handles Kubeflow traffic. (31380 is the default port of the service istio-ingressgateway.)

    If the backend is unhealthy, check the pods in istio-system:

    • kubectl get pods -n istio-system
    • The istio-ingressgateway-XX pods should be running
    • Check the logs of pod backend-updater-0, iap-enabler-XX to see if there is any error
    • Follow the steps here to check the load balancer and backend service on Google Cloud.
  4. Try accessing Cloud IAP at the fully qualified domain name in your web browser:

    https://<your-fully-qualified-domain-name>     
    

    If you get SSL errors when you log in, this typically means that your SSL certificate is still propagating. Wait a few minutes and try again. SSL propagation can take up to 10 minutes.

    If you do not see a login prompt and you get a 404 error, the configuration of Cloud IAP is not yet complete. Keep retrying for up to 10 minutes.

  5. If you get an error Error: redirect_uri_mismatch after logging in, this means the list of OAuth authorized redirect URIs does not include your domain.

    The full error message looks like the following example and includes the relevant links:

    The redirect URI in the request, https://<my_kubeflow>.endpoints.<my_project>.cloud.goog/_gcp_gatekeeper/authenticate, does not match the ones authorized for the OAuth client. 	
    To update the authorized redirect URIs, visit: https://console.developers.google.com/apis/credentials/oauthclient/22222222222-7meeee7a9a76jvg54j0g2lv8lrsb4l8g.apps.googleusercontent.com?project=22222222222	
    

    Follow the link in the error message to find the OAuth credential being used and add the redirect URI listed in the error message to the list of authorized URIs. For more information, read the guide to setting up OAuth for Cloud IAP.

Next steps

1.8 - Deleting Kubeflow

Deleting Kubeflow from Google Cloud using the command line interface (CLI)

This page explains how to delete your Kubeflow cluster or management cluster on Google Cloud.

Before you start

This guide assumes the following settings:

  • For Management cluster: The ${MGMT_PROJECT}, ${MGMT_DIR} and ${MGMT_NAME} environment variables are the same as in Deploy Management cluster.

  • For Kubeflow cluster: The ${KF_PROJECT}, ${KF_NAME} and ${MGMTCTXT} environment variables are the same as in Deploy Kubeflow cluster.

  • The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/kubeflow-distribution/kubeflow/.

Deleting your Kubeflow cluster

  1. To delete the applications running in the Kubeflow namespace, remove that namespace:

    kubectl delete namespace kubeflow
    
  2. To delete the cluster and all Google Cloud resources, run the following commands:

    cd "${KF_DIR}"
    make delete
    

    Warning: this will delete the persistent disks storing metadata. If you want to preserve the disks, don’t run this command; instead selectively delete only those resources you want to delete.

Clean up your management cluster

The following instructions introduce how to clean up all resources created when installing management cluster in management project, and when using management cluster to manage Google Cloud resources in managed Kubeflow projects.

Delete or keep managed Google Cloud resources

There are Google Cloud resources managed by Config Connector in the management cluster after you deploy Kubeflow clusters with this management cluster.

To delete all the managed Google Cloud resources, delete the managed project namespace:

kubectl config use-context "${MGMTCTXT}"
kubectl delete namespace --wait "${KF_PROJECT}"

To keep all the managed Google Cloud resources, you can delete the management cluster directly.

If you need fine-grained control, refer to Config Connector: Keeping resources after deletion for more details.

After deleting Config Connector resources for a managed project, you can revoke IAM permission that let the management cluster manage the project:

gcloud projects remove-iam-policy-binding "${KF_PROJECT}" \
    "--member=serviceAccount:${MGMT_NAME}-cnrm-system@${MGMT_PROJECT}.iam.gserviceaccount.com" \
    --role=roles/owner

Delete management cluster

To delete the Google service account and the management cluster:

cd "${MGMT_DIR}"
make delete-cluster

Starting from Kubeflow v1.5, Google Cloud distribution has switched to Config Controller for Google-managed Management cluster. You can learn more detail by reading Delete your Config Controller.

Note, after deleting the management cluster, all the managed Google Cloud resources will be kept. You will be responsible for managing them by yourself. If you want to delete the managed Google Cloud resources, make sure to delete resources in the ${KF_PROJECT} namespace in the management cluster first. You can learn more about the ${KF_PROJECT} namespace in kubeflow-distribution/kubeflow/kcc folder.

You can create a management cluster to manage them again if you apply the same Config Connector resources. Refer to Managing and deleting resources - Acquiring an existing resource.

2 - Pipelines on Google Cloud

Instructions for customizing and using Kubeflow Pipelines on Google Cloud

2.1 - Connecting to Kubeflow Pipelines on Google Cloud using the SDK

How to connect to different Kubeflow Pipelines installations on Google Cloud using the Kubeflow Pipelines SDK

This guide describes how to connect to your Kubeflow Pipelines cluster on Google Cloud using the Kubeflow Pipelines SDK.

Before you begin

How SDK connects to Kubeflow Pipelines API

Kubeflow Pipelines includes an API service named ml-pipeline-ui. The ml-pipeline-ui API service is deployed in the same Kubernetes namespace you deployed Kubeflow Pipelines in.

The Kubeflow Pipelines SDK can send REST API requests to this API service, but the SDK needs to know the hostname to connect to the API service.

If the hostname can be accessed without authentication, it’s very simple to connect to it. For example, you can use kubectl port-forward to access it via localhost:

# The Kubeflow Pipelines API service and the UI is available at
# http://localhost:3000 without authentication check.
$ kubectl port-forward svc/ml-pipeline-ui 3000:80 --namespace kubeflow
# Change the namespace if you deployed Kubeflow Pipelines in a different
# namespace.
import kfp
client = kfp.Client(host='http://localhost:3000')

When deploying Kubeflow Pipelines on Google Cloud, a public endpoint for this API service is auto-configured for you, but this public endpoint has security checks to protect your cluster from unauthorized access.

The following sections introduce how to authenticate your SDK requests to connect to Kubeflow Pipelines via the public endpoint.

Connecting to Kubeflow Pipelines standalone or AI Platform Pipelines

Refer to Connecting to AI Platform Pipelines using the Kubeflow Pipelines SDK for both Kubeflow Pipelines standalone and AI Platform Pipelines.

Kubeflow Pipelines standalone deployments also show up in AI Platform Pipelines. They have the name “pipeline” by default, but you can customize the name by overriding the appName parameter in params.env when deploying Kubeflow Pipelines standalone.

Connecting to Kubeflow Pipelines in a full Kubeflow deployment

A full Kubeflow deployment on Google Cloud uses an Identity-Aware Proxy (IAP) to manage access to the public Kubeflow endpoint. The steps below let you connect to Kubeflow Pipelines in a full Kubeflow deployment with authentication through IAP.

  1. Find out your IAP OAuth 2.0 client ID.

    You or your cluster admin followed Set up OAuth for Cloud IAP to deploy your full Kubeflow deployment on Google Cloud. You need the OAuth client ID created in that step.

    You can browse all of your existing OAuth client IDs in the Credentials page of Google Cloud Console.

  2. Create another SDK OAuth Client ID for authenticating Kubeflow Pipelines SDK users. Follow the steps to set up a client ID to authenticate from a desktop app. Take a note of the client ID and client secret. This client ID and secret can be shared among all SDK users, because a separate login step is still needed below.

  3. To connect to Kubeflow Pipelines public endpoint, initiate SDK client like the following:

    import kfp
    client = kfp.Client(host='https://<KF_NAME>.endpoints.<PROJECT>.cloud.goog/pipeline',
        client_id='<AAAAAAAAAAAAAAAAAAAAAA>.apps.googleusercontent.com',
        other_client_id='<BBBBBBBBBBBBBBBBBBB>.apps.googleusercontent.com',
        other_client_secret='<CCCCCCCCCCCCCCCCCCCC>')
    
    • Pass your IAP OAuth client ID found in step 1 to client_id argument.
    • Pass your SDK OAuth client ID and secret created in step 2 to other_client_id and other_client_secret arguments.
  4. When you init the SDK client for the first time, you will be asked to log in. The Kubeflow Pipelines SDK stores obtained credentials in $HOME/.config/kfp/credentials.json. You do not need to log in again unless you manually delete the credentials file.

    To use the SDK from cron tasks where you cannot log in manually, you can copy the credentials file in `$HOME/.config/kfp/credentials.json` to another machine.
    However, you should keep the credentials safe and never expose it to
    third parties.
    
  5. After login, you can use the client.

    print(client.list_pipelines())
    

Troubleshooting

  • Error “Failed to authorize with API resource references: there is no user identity header” when using SDK methods.

    Direct access to the API service without authentication works for Kubeflow Pipelines standalone, AI Platform Pipelines, and Kubeflow 1.0 or earlier.

    However, it fails authorization checks for Kubeflow Pipelines with multi-user isolation in the full Kubeflow deployment starting from Kubeflow 1.1. Multi-user isolation requires all API access to authenticate as a user. Refer to Kubeflow Pipelines Multi-user isolation documentation for more details.

2.2 - Authenticating Pipelines to Google Cloud

Authentication and authorization to Google Cloud in Pipelines

This page describes authentication for Kubeflow Pipelines to Google Cloud. Available options listed below have different tradeoffs. You should choose the one that fits your use-case.

  • Configuring your cluster to access Google Cloud using Compute Engine default service account with the “cloud-platform” scope is easier to set up than the other options. However, this approach grants excessive permissions. Therefore, it is not suitable if you need workload permission separation.
  • Workload Identity takes more efforts to set up, but allows fine-grained permission control. It is recommended for production use-cases.
  • Google service account keys stored as Kubernetes secrets is the legacy approach and no longer recommended in Google Kubernetes Engine. However, it’s the only option to use Google Cloud APIs when your cluster is an anthos or on-prem cluster.

Before you begin

There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that authentication support and cluster setup instructions will vary depending on the method you used to install Kubeflow Pipelines.

  • For Kubeflow Pipelines standalone, you can compare and choose from all 3 options.
  • For full Kubeflow starting from Kubeflow 1.1, Workload Identity is the recommended and default option.
  • For AI Platform Pipelines, Compute Engine default service account is the only supported option.

Compute Engine default service account

This is good for trying out Kubeflow Pipelines, because it is easy to set up.

However, it does not support permission separation for workloads in the cluster. Any workload in the cluster will be able to call any Google Cloud APIs in the chosen scope.

Cluster setup to use Compute Engine default service account

By default, your Google Kubernetes Engine nodes use Compute Engine default service account. If you allowed cloud-platform scope when creating the cluster, Kubeflow Pipelines can authenticate to Google Cloud and manage resources in your project without further configuration.

Use one of the following options to create a Google Kubernetes Engine cluster that uses the Compute Engine default service account:

  • If you followed instructions in Setting up AI Platform Pipelines and checked Allow access to the following Cloud APIs, your cluster is already using Compute Engine default service account.
  • In Google Cloud Console UI, you can enable it in Create a Kubernetes cluster -> default-pool -> Security -> Access Scopes -> Allow full access to all Cloud APIs like the following:
  • Using gcloud CLI, you can enable it with --scopes cloud-platform like the following:
gcloud container clusters create <cluster-name> \
  --scopes cloud-platform

Please refer to gcloud container clusters create command documentation for other available options.

Authoring pipelines to use default service account

Pipelines don’t need any specific changes to authenticate to Google Cloud, it will use the default service account transparently.

However, you must update existing pipelines that use the use_gcp_secret kfp sdk operator. Remove the use_gcp_secret usage to let your pipeline authenticate to Google Cloud using the default service account.

Securing the cluster with fine-grained Google Cloud permission control

Workload Identity

Workload Identity is the recommended way for your Google Kubernetes Engine applications to consume services provided by Google APIs. You accomplish this by configuring a Kubernetes service account to act as a Google service account. Any Pods running as the Kubernetes service account then use the Google service account to authenticate to cloud services.

Referenced from Workload Identity Documentation. Please read this doc for:

  • A detailed introduction to Workload Identity.
  • Instructions to enable it on your cluster.
  • Whether its limitations affect your adoption.

Terminology

This document distinguishes between Kubernetes service accounts (KSAs) and Google service accounts (GSAs). KSAs are Kubernetes resources, while GSAs are specific to Google Cloud. Other documentation usually refers to both of them as just “service accounts”.

Authoring pipelines to use Workload Identity

Pipelines don’t need any specific changes to authenticate to Google Cloud. With Workload Identity, pipelines run as the Google service account that is bound to the KSA.

However, existing pipelines that use use_gcp_secret kfp sdk operator need to remove the use_gcp_secret usage to use the bound GSA. You can also continue to use use_gcp_secret in a cluster with Workload Identity enabled and use_gcp_secret will take precedence for those workloads.

Cluster setup to use Workload Identity for Full Kubeflow

Starting from Kubeflow 1.1, Kubeflow Pipelines supports multi-user isolation. Therefore, pipeline runs are executed in user namespaces using the default-editor KSA. The default-editor KSA is auto-bound to the GSA specified in the user profile, which defaults to a shared GSA ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com.

If you want to bind the default-editor KSA with a different GSA for a specific namespace, refer to the In-cluster authentication to Google Cloud guide.

Additionally, the Kubeflow Pipelines UI, visualization, and TensorBoard server instances are deployed in your user namespace using the default-editor KSA. Therefore, to visualize results in the Pipelines UI, they can fetch artifacts in Google Cloud Storage using permissions of the same GSA you configured for this namespace.

Cluster setup to use Workload Identity for Pipelines Standalone

1. Create your cluster with Workload Identity enabled
  • In Google Cloud Console UI, you can enable Workload Identity in Create a Kubernetes cluster -> Security -> Enable Workload Identity like the following:

  • Using gcloud CLI, you can enable it with:

gcloud beta container clusters create <cluster-name> \
  --release-channel regular \
  --workload-pool=project-id.svc.id.goog

References:

2. Deploy Kubeflow Pipelines

Deploy via Pipelines Standalone as usual.

3. Bind Workload Identities for KSAs used by Kubeflow Pipelines

The following helper bash scripts bind Workload Identities for KSAs used by Kubeflow Pipelines:

  • gcp-workload-identity-setup.sh helps you create GSAs and bind them to KSAs used by pipelines workloads. This script provides an interactive command line dialog with explanation messages.
  • wi-utils.sh alternatively provides minimal utility bash functions that let you customize your setup. The minimal utilities make it easy to read and use programmatically.

For example, to get a default setup using gcp-workload-identity-setup.sh, you can

$ curl -O https://raw.githubusercontent.com/kubeflow/pipelines/master/manifests/kustomize/gcp-workload-identity-setup.sh
$ chmod +x ./gcp-workload-identity-setup.sh
$ ./gcp-workload-identity-setup.sh
# This prints the command's usage example and introduction.
# Then you can run the command with required parameters.
# Command output will tell you which GSAs and Workload Identity bindings have been
# created.
4. Configure IAM permissions of used GSAs

If you used gcp-workload-identity-setup.sh to bind Workload Identities for your cluster, you can simply add the following IAM bindings:

  • Give GSA <cluster-name>-kfp-system@<project-id>.iam.gserviceaccount.com Storage Object Viewer role to let UI load data in GCS in the same project.
  • Give GSA <cluster-name>-kfp-user@<project-id>.iam.gserviceaccount.com any permissions your pipelines need. For quick tryouts, you can give it Project Editor role for all permissions.

If you configured bindings by yourself, here are Google Cloud permission requirements for KFP KSAs:

  • Pipelines use pipeline-runner KSA. Configure IAM permissions of the GSA bound to this KSA to allow pipelines use Google Cloud APIs.
  • Pipelines UI uses ml-pipeline-ui KSA. Pipelines Visualization Server uses ml-pipeline-visualizationserver KSA. If you need to view artifacts and visualizations stored in Google Cloud Storage (GCS) from pipelines UI, you should add Storage Object Viewer permission (or the minimal required permission) to their bound GSAs.

Google service account keys stored as Kubernetes secrets

It is recommended to use Workload Identity for easier and secure management, but you can also choose to use GSA keys.

Authoring pipelines to use GSA keys

Each pipeline step describes a container that is run independently. If you want to grant access for a single step to use one of your service accounts, you can use kfp.gcp.use_gcp_secret(). Examples for how to use this function can be found in the Kubeflow examples repo.

Cluster setup to use use_gcp_secret for Full Kubeflow

From Kubeflow 1.1, there’s no longer a user-gcp-sa secrets deployed for you. Recommend using Workload Identity instead.

For Kubeflow 1.0 or earlier, you don’t need to do anything. Full Kubeflow deployment has already deployed the user-gcp-sa secret for you.

Cluster setup to use use_gcp_secret for Pipelines Standalone

Pipelines Standalone require your manual setup for the user-gcp-sa secret used by use_gcp_secret.

Instructions to set up the secret:

  1. First download the GCE VM service account token (refer to Google Cloud documentation for more information):

    gcloud iam service-accounts keys create application_default_credentials.json \
      --iam-account [SA-NAME]@[PROJECT-ID].iam.gserviceaccount.com
    
  2. Run:

    kubectl create secret -n [your-namespace] generic user-gcp-sa \
      --from-file=user-gcp-sa.json=application_default_credentials.json
    

2.3 - Upgrading

How to upgrade your Kubeflow Pipelines deployment on Google Cloud

Before you begin

There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that upgrade support and instructions will vary depending on the method you used to install Kubeflow Pipelines.

Installation \ Features In-place upgrade Reinstallation on the same cluster Reinstallation on a different cluster User customizations across upgrades (via Kustomize)
Standalone ⚠️ Data is deleted by default.
Standalone (managed storage)
full Kubeflow (>= v1.1) Needs documentation
full Kubeflow (< v1.1)
AI Platform Pipelines
AI Platform Pipelines (managed storage)

Notes:

  • When you deploy Kubeflow Pipelines with managed storage on Google Cloud, you pipeline’s metadata and artifacts are stored in Cloud Storage and Cloud SQL. Using managed storage makes it easier to manage, back up, and restore Kubeflow Pipelines data.

Kubeflow Pipelines Standalone

Upgrade Support for Kubeflow Pipelines Standalone is in Beta.

Upgrading Kubeflow Pipelines Standalone introduces how to upgrade in-place.

Full Kubeflow

On Google Cloud, the full Kubeflow deployment follows the package pattern starting from Kubeflow 1.1.

The package pattern enables you to upgrade the full Kubeflow in-place while keeping user customizations — refer to the Upgrade Kubeflow on Google Cloud documentation for instructions.

However, there’s no current support to upgrade from Kubeflow 1.0 or earlier to Kubeflow 1.1 while keeping Kubeflow Pipelines data. This may change in the future, so provide your feedback in kubeflow/pipelines#4346 on GitHub.

AI Platform Pipelines

Upgrade Support for AI Platform Pipelines is in Alpha.

Below are the steps that describe how to upgrade your AI Platform Pipelines instance while keeping existing data:

For instances without managed storage:

  1. Delete your AI Platform Pipelines instance WITHOUT selecting Delete cluster. The persisted artifacts and database data are stored in persistent volumes in the cluster. They are kept by default when you do not delete the cluster.
  2. Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same Google Kubernetes Engine cluster, namespace, and application name. Persisted data will be automatically picked up during reinstallation.

For instances with managed storage:

  1. Delete your AI Platform Pipelines instance.
  2. If you are upgrading from Kubeflow Pipelines 0.5.1, note that the Cloud Storage bucket is a required starting from 1.0.0. Previously deployed instances should be using a bucket named like “-”. Browse your Cloud Storage buckets to find your existing bucket name and provide it in the next step.
  3. Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same application name and managed storage options as before. You can freely install it in any cluster and namespace (not necessarily the same as before), because persisted artifacts and database data are stored in managed storages (Cloud Storage and Cloud SQL), and will be automatically picked up during reinstallation.

2.4 - Enabling GPU and TPU

Enable GPU and TPU for Kubeflow Pipelines on Google Kubernetes Engine (GKE)

This page describes how to enable GPU or TPU for a pipeline on Google Kubernetes Engine by using the Pipelines DSL language.

Prerequisites

To enable GPU and TPU on your Kubeflow cluster, follow the instructions on how to customize the Google Kubernetes Engine cluster for Kubeflow before setting up the cluster.

Configure ContainerOp to consume GPUs

After enabling the GPU, the Kubeflow setup script installs a default GPU pool with type nvidia-tesla-k80 with auto-scaling enabled. The following code consumes 2 GPUs in a ContainerOp.

import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)

The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      nvidia.com/gpu: "2"

If the cluster has multiple node pools with different GPU types, you can specify the GPU type by the following code.

import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
gpu_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p4')

The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      nvidia.com/gpu: "2"
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-p4

See GPU tutorial for a complete example to build a Kubeflow pipeline that uses GPUs.

Check the Google Kubernetes Engine GPU guide to learn more about GPU settings.

Configure ContainerOp to consume TPUs

Use the following code to configure ContainerOp to consume TPUs on Google Kubernetes Engine:

import kfp.dsl as dsl
import kfp.gcp as gcp
tpu_op = dsl.ContainerOp(name='tpu-op', ...).apply(gcp.use_tpu(
  tpu_cores = 8, tpu_resource = 'v2', tf_version = '1.12'))

The above code uses 8 v2 TPUs with TF version to be 1.12. The code above will be compiled into Kubernetes Pod spec:

container:
  ...
  resources:
    limits:
      cloud-tpus.google.com/v2: "8"
  metadata:
    annotations:
      tf-version.cloud-tpus.google.com: "1.12"

To learn more, see an example pipeline that uses a preemptible node pool with TPU or GPU..

See the Google Kubernetes Engine TPU Guide to learn more about TPU settings.

2.5 - Using Preemptible VMs and GPUs on Google Cloud

Configuring preemptible VMs and GPUs for Kubeflow Pipelines on Google Cloud

This document describes how to configure preemptible virtual machines (preemptible VMs) and GPUs on preemptible VM instances (preemptible GPUs) for your workflows running on Kubeflow Pipelines on Google Cloud.

Introduction

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.

GPUs attached to preemptible instances (preemptible GPUs) work like normal GPUs but persist only for the life of the instance.

Using preemptible VMs and GPUs can reduce costs on Google Cloud. In addition to using preemptible VMs, your Google Kubernetes Engine (GKE) cluster can autoscale based on current workloads.

This guide assumes that you have already deployed Kubeflow Pipelines. If not, follow the guide to deploying Kubeflow on Google Cloud.

Before you start

The variables defined in this page can be found in kubeflow-distribution/kubeflow/env.sh. They are the same value as you set based on your Kubeflow deployment.

Using preemptible VMs with Kubeflow Pipelines

In summary, the steps to schedule a pipeline to run on preemptible VMs are as follows:

  1. Create a node pool in your cluster that contains preemptible VMs.
  2. Configure your pipelines to run on the preemptible VMs.

The following sections contain more detail about the above steps.

1. Create a node pool with preemptible VMs

Create a preemptible-nodepool.yaml as below and fulfill all placerholder content KF_NAME, KF_PROJECT, LOCATION:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  labels:
    kf-name: KF_NAME # kpt-set: ${name}
  name: PREEMPTIBLE_CPU_POOL
  namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
  location: LOCATION # kpt-set: ${location}
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 5
  nodeConfig:
    machineType: n1-standard-4
    diskSizeGb: 100
    diskType: pd-standard
    preemptible: true
    taint:
    - effect: NO_SCHEDULE
      key: preemptible
      value: "true"
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"
    serviceAccountRef:
      external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    metadata:
      disable-legacy-endpoints: "true"
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: KF_NAME # kpt-set: ${name}
    namespace: KF_PROJECT # kpt-set: ${name}

Where:

  • PREEMPTIBLE_CPU_POOL is the name of the node pool.
  • KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
  • KF_PROJECT is the name of your Kubeflow Google Cloud project.
  • LOCATION is the region of this nodepool, for example: us-west1-b.
  • KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account, replace the KF_NAME and KF_PROJECT using the value above in this pattern, you can get vm service account you have already created in Kubeflow cluster deployment

Apply the nodepool patch file above by running:

kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} apply -f <path-to-nodepool-file>/preemptible-nodepool.yaml

For Kubeflow Pipelines standalone only

Alternatively, if you are on Kubeflow Pipelines standalone, or AI Platform Pipelines, you can run this command to create node pool:

gcloud container node-pools create PREEMPTIBLE_CPU_POOL \
    --cluster=CLUSTER_NAME \
      --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
      --preemptible \
      --node-taints=preemptible=true:NoSchedule \
      --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com

Below is an example of command:

gcloud container node-pools create preemptible-cpu-pool \
  --cluster=user-4-18 \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com

2. Schedule your pipeline to run on the preemptible VMs

After configuring a node pool with preemptible VMs, you must configure your pipelines to run on the preemptible VMs.

In the DSL code for your pipeline, add the following to the ContainerOp instance:

.apply(gcp.use_preemptible_nodepool())

The above function works for both methods of generating the ContainerOp:

Note:

  • Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
  • If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a parameter hard_constraint. When the hard_constraint is True, the system will strictly schedule the task in preemptible VMs. When the hard_constraint is False, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.

For example:

import kfp.dsl as dsl
import kfp.gcp as gcp

class FlipCoinOp(dsl.ContainerOp):
  """Flip a coin and output heads or tails randomly."""

  def __init__(self):
    super(FlipCoinOp, self).__init__(
      name='Flip',
      image='python:alpine3.6',
      command=['sh', '-c'],
      arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                 'else \'tails\'; print(result)" | tee /tmp/output'],
      file_outputs={'output': '/tmp/output'})

@dsl.pipeline(
  name='pipeline flip coin',
  description='shows how to use dsl.Condition.'
)

def flipcoin():
  flip = FlipCoinOp().apply(gcp.use_preemptible_nodepool())

if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(flipcoin, __file__ + '.zip')

Using preemptible GPUs with Kubeflow Pipelines

This guide assumes that you have already deployed Kubeflow Pipelines. In summary, the steps to schedule a pipeline to run with preemptible GPUs are as follows:

  1. Make sure you have enough GPU quota.
  2. Create a node pool in your Google Kubernetes Engine cluster that contains preemptible VMs with preemptible GPUs.
  3. Configure your pipelines to run on the preemptible VMs with preemptible GPUs.

The following sections contain more detail about the above steps.

1. Make sure you have enough GPU quota

Add GPU quota to your Google Cloud project. The Google Cloud documentation lists the availability of GPUs across regions. To check the available quota for resources in your project, go to the Quotas page in the Google Cloud Console.

2. Create a node pool of preemptible VMs with preemptible GPUs

Create a preemptible-gpu-nodepool.yaml as below and fulfill all placerholder content:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  labels:
    kf-name: KF_NAME # kpt-set: ${name}
  name: KF_NAME-containernodepool-gpu
  namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
  location: LOCATION # kpt-set: ${location}
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 5
  nodeConfig:
    machineType: n1-standard-4
    diskSizeGb: 100
    diskType: pd-standard
    preemptible: true
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"
    serviceAccountRef:
      external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
    guestAccelerator:
    - type: "nvidia-tesla-k80"
      count: 1
    metadata:
      disable-legacy-endpoints: "true"
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: KF_NAME # kpt-set: ${name}
    namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}

Where:

  • PREEMPTIBLE_CPU_POOL is the name of the node pool.
  • KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
  • KF_PROJECT is the name of your Kubeflow Google Cloud project.
  • LOCATION is the region of this nodepool, for example: us-west1-b.
  • KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account, replace the KF_NAME and KF_PROJECT using the value above in this pattern, you can get vm service account you have already created in Kubeflow cluster deployment.

For Kubeflow Pipelines standalone only

Alternatively, if you are on Kubeflow Pipelines standalone, or AI Platform Pipelines, you can run this command to create node pool:

gcloud container node-pools create PREEMPTIBLE_GPU_POOL \
    --cluster=CLUSTER_NAME \
    --enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com \
    --accelerator=type=GPU_TYPE,count=GPU_COUNT

Below is an example of command:

gcloud container node-pools create preemptible-gpu-pool \
    --cluster=user-4-18 \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com \
    --accelerator=type=nvidia-tesla-t4,count=2

3. Schedule your pipeline to run on the preemptible VMs with preemptible GPUs

In the DSL code for your pipeline, add the following to the ContainerOp instance:

.apply(gcp.use_preemptible_nodepool()

The above function works for both methods of generating the ContainerOp:

Note:

  • Call .set_gpu_limit(#NUM_GPUs, GPU_VENDOR) on your ContainerOp to specify the GPU limit (for example, 1) and vendor (for example, 'nvidia').
  • Call .set_retry(#NUM_RETRY) on your ContainerOp to retry the task after the task is preempted.
  • If you modified the node taint when creating the node pool, pass the same node toleration to the use_preemptible_nodepool() function.
  • use_preemptible_nodepool() also accepts a parameter hard_constraint. When the hard_constraint is True, the system will strictly schedule the task in preemptible VMs. When the hard_constraint is False, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.

For example:

import kfp.dsl as dsl
import kfp.gcp as gcp

class FlipCoinOp(dsl.ContainerOp):
  """Flip a coin and output heads or tails randomly."""

  def __init__(self):
    super(FlipCoinOp, self).__init__(
      name='Flip',
      image='python:alpine3.6',
      command=['sh', '-c'],
      arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
                 'else \'tails\'; print(result)" | tee /tmp/output'],
      file_outputs={'output': '/tmp/output'})

@dsl.pipeline(
  name='pipeline flip coin',
  description='shows how to use dsl.Condition.'
)

def flipcoin():
  flip = FlipCoinOp().set_gpu_limit(1, 'nvidia').apply(gcp.use_preemptible_nodepool())
if __name__ == '__main__':
  import kfp.compiler as compiler
  compiler.Compiler().compile(flipcoin, __file__ + '.zip')

Debugging

Run the following command if your nodepool didn’t show up or has error during provisioning:

kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} describe containernodepool -l kf-name=${KF_NAME}

Next steps

3 - Customize Kubeflow on Google Cloud

Tailoring a Google Kubernetes Engine deployment of Kubeflow

This guide describes how to customize your deployment of Kubeflow on Google Kubernetes Engine (GKE) on Google Cloud.

Before you start

The variables defined in this page can be found in kubeflow-distribution/kubeflow/env.sh. They are the same value as you set based on your Kubeflow deployment.

Customizing Kubeflow before deployment

The Kubeflow deployment process is divided into two steps, hydrate and apply, so that you can modify your configuration before deploying your Kubeflow cluster.

Follow the guide to deploying Kubeflow on Google Cloud. You can add your patches in corresponding component folder, and include those patches in kustomization.yaml file. Learn more about the usage of kustomize. You can also find the existing kustomization in googlecloudplatform/kubeflow-distribution as example. After adding the patches, you can run make hydrate to validate the resulting resources. Finally, you can run make apply to deploy the customized Kubeflow.

Customizing an existing deployment

You can also customize an existing Kubeflow deployment. In that case, this guide assumes that you have already followed the guide to deploying Kubeflow on Google Cloud and have deployed Kubeflow to a Google Kubernetes Engine cluster.

Before you start

This guide assumes the following settings:

  • The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/kubeflow-distribution/kubeflow/.

    export KF_DIR=<path to your Kubeflow application directory>
    cd "${KF_DIR}"
    
  • Make sure your environment variables are set up for the Kubeflow cluster you want to customize. For further background about the settings, see the guide to deploying Kubeflow with the CLI.

Customizing Google Cloud resources

To customize Google Cloud resources, such as your Kubernetes Engine cluster, you can modify the Deployment settings starting in ${KF_DIR}/common/cnrm.

This folder contains multiple dependencies on sibling directories for Google Cloud resources. So you can start from here by reviewing kustomization.yaml. Depends on the type of Google Cloud resources you want to customize, you can add patches in corresponding directory.

  1. Make sure you checkin the existing resources in /build folder to source control.

  2. Add the patches in corresponding directory, and update kustomization.yaml to include patches.

  3. Run make hydrate to build new resources in /build folder.

  4. Carefully examine the result resources in /build folder. If the customization is addition only, you can run make apply to directly patch the resources.

  5. It is possible that you are modifying immutable resources. In this case, you will need to delete existing resource and applying new resources. Note that this might mean lost of your service and data, please execute carefully. General approach to delete and deploy Google Cloud resources:

    1. Revert to old resources in /build using source control.

    2. Carefully delete the resource you need to delete by using kubectl delete.

    3. Rebuild and apply new Google Cloud resources

    cd common/cnrm
    NAME=$(NAME) KFCTXT=$(KFCTXT) LOCATION=$(LOCATION) PROJECT=$(PROJECT) make apply
    

Customizing Kubeflow resources

You can use kustomize to customize Kubeflow. Make sure that you have the minimum required version of kustomize: 2.0.3 or later. For more information about kustomize in Kubeflow, see how Kubeflow uses kustomize.

To customize the Kubernetes resources running within the cluster, you can modify the kustomize manifests in corresponding component under ${KF_DIR}.

For example, to modify settings for the Jupyter web app:

  1. Open ${KF_DIR}/apps/jupyter/jupyter-web-app/kustomization.yaml in a text editor.

  2. Review the file’s inclusion of deployment-patch.yaml, and add your modification to deployment-patch.yaml based on the original content in ${KF_DIR}/apps/jupyter/jupyter-web-app/upstream/base/deployment.yaml. For example: change volumeMounts’s mountPath if you need to customize it.

  3. Verify the output resources in /build folder using Makefile"

    cd "${KF_DIR}"
    make hydrate
    
  4. Redeploy Kubeflow using Makefile:

    cd "${KF_DIR}"
    make apply
    

Common customizations

Add users to Kubeflow

You must grant each user the minimal permission scope that allows them to connect to the Kubernetes cluster.

For Google Cloud, you should grant the following Cloud Identity and Access Management (IAM) roles.

In the following commands, replace [PROJECT] with your Google Cloud project and replace [EMAIL] with the user’s email address:

  • To access the Kubernetes cluster, the user needs the Kubernetes Engine Cluster Viewer role:

    gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/container.clusterViewer
    
  • To access the Kubeflow UI through IAP, the user needs the IAP-secured Web App User role:

    gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/iap.httpsResourceAccessor
    

    Note, you need to grant the user IAP-secured Web App User role even if the user is already an owner or editor of the project. IAP-secured Web App User role is not implied by the Project Owner or Project Editor roles.

  • To be able to run gcloud container clusters get-credentials and see logs in Cloud Logging (formerly Stackdriver), the user needs viewer access on the project:

    gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/viewer
    

Alternatively, you can also grant these roles on the IAM page in the Cloud Console. Make sure you are in the same project as your Kubeflow deployment.

Add GPU nodes to your cluster

To add GPU accelerators to your Kubeflow cluster, you have the following options:

  • Pick a Google Cloud zone that provides NVIDIA Tesla K80 Accelerators (nvidia-tesla-k80).
  • Or disable node-autoprovisioning in your Kubeflow cluster.
  • Or change your node-autoprovisioning configuration.

To see which accelerators are available in each zone, run the following command:

gcloud compute accelerator-types list

Create the ContainerNodePool resource adopting GPU, for example, create a new file containernodepool-gpu.yaml file and fulfill the value KUBEFLOW-NAME, KF-PROJECT, LOCATION based on your Kubeflow deployment:

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  labels:
    kf-name: KF_NAME # kpt-set: ${name}
  name: containernodepool-gpu
  namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
  location: LOCATION # kpt-set: ${location}
  initialNodeCount: 1
  autoscaling:
    minNodeCount: 0
    maxNodeCount: 5
  nodeConfig:
    machineType: n1-standard-4
    diskSizeGb: 100
    diskType: pd-standard
    preemptible: true
    oauthScopes:
    - "https://www.googleapis.com/auth/logging.write"
    - "https://www.googleapis.com/auth/monitoring"
    - "https://www.googleapis.com/auth/devstorage.read_only"
    guestAccelerator:
    - type: "nvidia-tesla-k80"
      count: 1
    metadata:
      disable-legacy-endpoints: "true"
  management:
    autoRepair: true
    autoUpgrade: true
  clusterRef:
    name: KF_NAME # kpt-set: ${name}
    namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}

Note that the metadata:name must be unique in your Kubeflow project. Because the management cluster uses this as ID and your Google Cloud project as a namespace to identify a node pool.

Apply the node pool patch file above by running:

kubectl --context="${MGMTCTXT}" --namespace="${KF_PROJECT}" apply -f <path-to-gpu-nodepool-file>

After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you. To deploy the installation DaemonSet, run the following command:

kubectl --context="${KF_NAME}" apply -f https://raw.githubusercontent.com/googlecloudplatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

To disable node-autoprovisioning, edit ${KF_DIR}/common/cluster/upstream/cluster.yaml to set enabled to false:

    ...
    clusterAutoscaling:
      enabled: false
      autoProvisioningDefaults:
    ...

Add Cloud TPUs to your cluster

Note: The following instruction should be used when creating Google Kubernetes Engine cluster, because the TPU enablement flag enableTpu is immutable once cluster is created. You need to create new cluster if existing cluster doesn’t have TPU enabled.

Set enableTpu:true in ${KF_DIR}/common/cluster/upstream/cluster.yaml and enable alias IP (VPC-native traffic routing):

apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
...
spec:
  ...
  enableTpu: true
  networkingMode: VPC_NATIVE
  networkRef:
    name: containercluster-dep-vpcnative
  subnetworkRef:
    name: containercluster-dep-vpcnative
  ipAllocationPolicy:
    servicesSecondaryRangeName: servicesrange
    clusterSecondaryRangeName: clusterrange
  ...
...
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeNetwork
metadata:
  name: containercluster-dep-vpcnative
spec:
  routingMode: REGIONAL
  autoCreateSubnetworks: false
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSubnetwork
metadata:
  name: containercluster-dep-vpcnative
spec:
  ipCidrRange: 10.2.0.0/16
  region: us-west1
  networkRef:
    name: containercluster-dep-vpcnative
  secondaryIpRange:
  - rangeName: servicesrange
    ipCidrRange: 10.3.0.0/16
  - rangeName: clusterrange
    ipCidrRange: 10.4.0.0/16

You can learn more at Creating a new cluster with Cloud TPU support, and view an example Vpc Native Container Cluster config connector yaml file.

More customizations

Refer to the navigation panel on the left of these docs for more customizations, including using your own domain and more.

4 - Using Your Own Domain

Using a custom domain with Kubeflow on Google Cloud

This guide assumes you have already set up Kubeflow on Google Cloud. If you haven’t done so, follow the guide to getting started with Kubeflow on Google Cloud.

Using your own domain

If you want to use your own domain instead of ${KF_NAME}.endpoints.${PROJECT}.cloud.goog, follow these instructions after building your cluster:

  1. Remove the substitution hostname in the Kptfile.

    kpt cfg delete-subst instance hostname
    
  2. Create a new setter hostname in the Kptfile.

    kpt cfg create-setter instance/ hostname --field "data.hostname" --value ""
    
  3. Configure new setter with your own domain.

    kpt cfg set ./instance hostname <enter your domain here>
    
  4. Apply the changes.

    make apply-kubeflow
    
  5. Check Ingress to verify that your domain was properly configured.

    kubectl -n istio-system describe ingresses
    
  6. Get the address of the static IP address created.

    IPNAME=${KF_NAME}-ip
    gcloud compute addresses describe ${IPNAME} --global
    
  7. Use your DNS provider to map the fully qualified domain specified in the third step to the above IP address.

5 - Authenticating Kubeflow to Google Cloud

Authentication and authorization to Google Cloud

This page describes in-cluster and local authentication for Kubeflow Google Cloud deployments.

In-cluster authentication

Starting from Kubeflow v0.6, you consume Kubeflow from custom namespaces (that is, namespaces other than kubeflow). The kubeflow namespace is only for running Kubeflow system components. Individual jobs and model deployments run in separate namespaces.

Google Kubernetes Engine (GKE) workload identity

Starting in v0.7, Kubeflow uses the new Google Kubernetes Engine feature: workload identity. This is the recommended way to access Google Cloud APIs from your Google Kubernetes Engine cluster. You can configure a Kubernetes service account (KSA) to act as a Google Cloud service account (GSA).

If you deployed Kubeflow following the Google Cloud instructions, then the profiler controller automatically binds the “default-editor” service account for every profile namespace to a default Google Cloud service account created during kubeflow deployment. The Kubeflow deployment process also creates a default profile for the cluster admin.

For more info about profiles see the Multi-user isolation page.

Here is an example profile spec:

apiVersion: kubeflow/v1beta1
kind: Profile
spec:
  plugins:
  - kind: WorkloadIdentity
    spec:
      gcpServiceAccount: ${SANAME}@${PROJECT}.iam.gserviceaccount.com
...

You can verify that there is a KSA called default-editor and that it has an annotation of the corresponding GSA:

kubectl -n ${PROFILE_NAME} describe serviceaccount default-editor

...
Name:        default-editor
Annotations: iam.gke.io/gcp-service-account: ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com
...

You can double check that GSA is also properly set up:

gcloud --project=${PROJECT} iam service-accounts get-iam-policy ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com

When a pod uses KSA default-editor, it can access Google Cloud APIs with the role granted to the GSA.

Provisioning custom Google service accounts in namespaces: When creating a profile, you can specify a custom Google Cloud service account for the namespace to control which Google Cloud resources are accessible.

Prerequisite: you must have permission to edit your Google Cloud project’s IAM policy and to create a profile custom resource (CR) in your Kubeflow cluster.

  1. if you don’t already have a Google Cloud service account you want to use, create a new one. For example: user1-gcp@<project-id>.iam.gserviceaccount.com:
gcloud iam service-accounts create user1-gcp@<project-id>.iam.gserviceaccount.com
  1. You can bind roles to the Google Cloud service account to allow access to the desired Google Cloud resources. For example to run BigQuery job, you can grant access like so:
gcloud projects add-iam-policy-binding <project-id> \
      --member='serviceAccount:user1-gcp@<project-id>.iam.gserviceaccount.com' \
      --role='roles/bigquery.jobUser'
  1. Grant owner permission of service account user1-gcp@<project-id>.iam.gserviceaccount.com to cluster account <cluster-name>-admin@<project-id>.iam.gserviceaccount.com:
gcloud iam service-accounts add-iam-policy-binding \
      user1-gcp@<project-id>.iam.gserviceaccount.com \
      --member='serviceAccount:<cluster-name>-admin@<project-id>.iam.gserviceaccount.com' --role='roles/owner'
  1. Manually create a profile for user1 and specify the Google Cloud service account to bind in plugins field:
apiVersion: kubeflow/v1beta1
kind: Profile
metadata:
  name: profileName   # replace with the name of the profile (the user's namespace name)
spec:
  owner:
    kind: User
    name: user1@email.com   # replace with the email of the user
  plugins:
  - kind: WorkloadIdentity
    spec:
      gcpServiceAccount: user1-gcp@project-id.iam.gserviceaccount.com

Note: The profile controller currently doesn’t perform any access control checks to see whether the user creating the profile should be able to use the Google Cloud service account. As a result, any user who can create a profile can get access to any service account for which the admin controller has owner permissions. We will improve this in subsequent releases.

You can find more details on workload identity in the Google Kubernetes Engine documentation.

Authentication from Kubeflow Pipelines

Starting from Kubeflow v1.1, Kubeflow Pipelines supports multi-user isolation. Therefore, pipeline runs are executed in user namespaces also using the default-editor KSA.

Additionally, the Kubeflow Pipelines UI, visualization, and TensorBoard server instances are deployed in your user namespace using the default-editor KSA. Therefore, to visualize results in the Pipelines UI, they can fetch artifacts in Google Cloud Storage using permissions of the same GSA you configured for this namespace.

For more details, refer to Authenticating Pipelines to Google Cloud.


Local authentication

gcloud

Use the gcloud tool to interact with Google Cloud on the command line. You can use the gcloud command to set up Google Kubernetes Engine (GKE) clusters, and interact with other Google services.

Logging in

You have two options for authenticating the gcloud command:

  • You can use a user account to authenticate using a Google account (typically Gmail). You can register a user account using gcloud auth login, which brings up a browser window to start the familiar Google authentication flow.

  • You can create a service account within your Google Cloud project. You can then download a .json key file associated with the account, and run the gcloud auth activate-service-account command to authenticate your gcloud session.

You can find more information in the Google Cloud docs.

Listing active accounts

You can run the following command to verify you are authenticating with the expected account:

gcloud auth list

In the output of the command, an asterisk denotes your active account.

Viewing IAM roles

Permissions are handled in Google Cloud using IAM Roles. These roles define which resources your account can read or write to. Provided you have the necessary permissions, you can check which roles were assigned to your account using the following gcloud command:

PROJECT_ID=your-gcp-project-id-here

gcloud projects get-iam-policy $PROJECT_ID --flatten="bindings[].members" \
    --format='table(bindings.role)' \
    --filter="bindings.members:$(gcloud config list account --format 'value(core.account)')"

You can view and modify roles through the Google Cloud IAM console.

You can find more information about IAM in the Google Cloud docs.


kubectl

The kubectl tool is used for interacting with a Kubernetes cluster through the command line.

Connecting to a cluster using a Google Cloud account

If you set up your Kubernetes cluster using Google Kubernetes Engine, you can authenticate with the cluster using a Google Cloud account. The following commands fetch the credentials for your cluster and save them to your local kubeconfig file:

CLUSTER_NAME=your-gke-cluster
ZONE=your-gcp-zone

gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE

You can find more information in the Google Cloud docs.

Changing active clusters

If you work with multiple Kubernetes clusters, you may have multiple contexts saved in your local kubeconfig file. You can view the clusters you have saved by run the following command:

kubectl config get-contexts

You can view which cluster is currently being controlled by kubectl with the following command:

CONTEXT_NAME=your-new-context

kubectl config set-context $CONTEXT_NAME

You can find more information in the Kubernetes docs.

Checking RBAC permissions

Like GKE IAM, Kubernetes permissions are typically handled with a “role-based authorization control” (RBAC) system. Each Kubernetes service account has a set of authorized roles associated with it. If your account doesn’t have the right roles assigned to it, certain tasks fail.

You can check if an account has the proper permissions to run a command by building a query structured as kubectl auth can-i [VERB] [RESOURCE] --namespace [NAMESPACE]. For example, the following command verifies that your account has permissions to create deployments in the kubeflow namespace:

kubectl auth can-i create deployments --namespace kubeflow

You can find more information in the Kubernetes docs.

Adding RBAC permissions

If you find you are missing a permission you need, you can grant the missing roles to your service account using Kubernetes resources.

  • Roles describe the permissions you want to assign. For example, verbs: ["create"], resources:["deployments"]
  • RoleBindings define a mapping between the Role, and a specific service account

By default, Roles and RoleBindings apply only to resources in a specific namespace, but there are also ClusterRoles and ClusterRoleBindings that can grant access to resources cluster-wide

You can find more information in the Kubernetes docs.

Next steps

See the troubleshooting guide for help with diagnosing and fixing issues you may encounter with Kubeflow on Google Cloud

6 - Securing Your Clusters

How to secure Kubeflow clusters using private Google Kubernetes Engine

Currently we are collecting interest for supporting private Kubeflow cluster deployment. Please upvote to Support private Google Kubernetes Engine cluster on Google Cloud feature request if it fits your use case.

7 - Troubleshooting Deployments on Google Cloud

Help fixing problems on Google Kubernetes Engine and Google Cloud

This guide helps diagnose and fix issues you may encounter with Kubeflow on Google Kubernetes Engine (GKE) and Google Cloud.

Before you start

This guide covers troubleshooting specifically for Kubeflow deployments on Google Cloud.

For more help, search for resolved issues on GitHub or create a new one in the Kubeflow on Google Cloud repository.

This guide assumes the following settings:

  • The ${KF_DIR} environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example, /opt/kubeflow-distribution/kubeflow/.

    export KF_DIR=<path to your Kubeflow application directory>
    
  • The ${CONFIG_FILE} environment variable contains the path to your Kubeflow configuration file.

    export CONFIG_FILE=${KF_DIR}/kfctl_gcp_iap.v1.0.2.yaml
    

    Or:

    export CONFIG_FILE=${KF_DIR}/kfctl_gcp_basic_auth.v1.0.2.yaml
    
  • The ${KF_NAME} environment variable contains the name of your Kubeflow deployment. You can find the name in your ${CONFIG_FILE} configuration file, as the value for the metadata.name key.

    export KF_NAME=<the name of your Kubeflow deployment>
    
  • The ${PROJECT} environment variable contains the ID of your Google Cloud project. You can find the project ID in your ${CONFIG_FILE} configuration file, as the value for the project key.

    export PROJECT=<your Google Cloud project ID>
    
  • The ${ZONE} environment variable contains the Google Cloud zone where your Kubeflow resources are deployed.

    export ZONE=<your Google Cloud zone>
    
  • For further background about the above settings, see the guide to deploying Kubeflow with the CLI.

Troubleshooting Kubeflow deployment on Google Cloud

Here are some tips for troubleshooting Google Cloud.

  • Make sure you are a Google Cloud project owner.
  • Make sure you are using HTTPS.
  • Check project quota page to see if any service’s current usage reached quota limit, increase them as needed.
  • Check deployment manager page and see if there’s a failed deployment.
  • Check if endpoint is up: do DNS lookup against your Cloud Identity-Aware Proxy (Cloud IAP) URL and see if it resolves to the correct IP address.
  • Check if certificate succeeded: kubectl describe certificates -n istio-system should give you certificate status.
  • Check ingress status: kubectl describe ingress -n istio-system
  • Check if endpoint entry is created. There should be one entry with name <deployment>.endpoints.<project>.cloud.goog
    • If endpoint entry doesn’t exist, check kubectl describe cloudendpoint -n istio-system
  • If using IAP: make sure you added https://<deployment>.endpoints.<project>.cloud.goog/_gcp_gatekeeper/authenticate as an authorized redirect URI for the OAUTH credentials used to create the deployment.
  • If using IAP: see the guide to monitoring your Cloud IAP setup.
  • See the sections below for troubleshooting specific problems.
  • Please report a bug if you can’t resolve the problem by following the above steps.

DNS name not registered

This section provides troubleshooting information for problems creating a DNS entry for your ingress. The ingress is a K8s resource that creates a Google Cloud loadbalancer to enable http(s) access to Kubeflow web services from outside the cluster. This section assumes you are using Cloud Endpoints and a DNS name of the following pattern

https://${KF_NAME}.endpoints.${PROJECT}.cloud.goog

Symptoms:

  • When you access the URL in Chrome you get the error: server IP address could not be found

  • nslookup for the domain name doesn’t return the IP address associated with the ingress

    nslookup ${KF_NAME}.endpoints.${PROJECT}.cloud.goog
    Server:   127.0.0.1
    Address:  127.0.0.1#53
    
    ** server can't find ${KF_NAME}.endpoints.${PROJECT}.cloud.goog: NXDOMAIN
    

Troubleshooting

  1. Check the cloudendpoints resource

    kubectl get cloudendpoints -o yaml ${KF_NAME}
    kubectl describe cloudendpoints ${KF_NAME}
    
    • Check if there are errors indicating problems creating the endpoint
  2. The status of the cloudendpoints object will contain the cloud operation used to register the operation

    • For example

       status:
         config: ""
         configMapHash: ""
         configSubmit: operations/serviceConfigs.jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog:43fe6c6f-eb9c-41d0-ac85-b547fc3e6e38
         endpoint: jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog
         ingressIP: 35.227.243.83
         jwtAudiences: null
         lastAppliedSig: 4f3b903a06a683b380bf1aac1deca72792472429
         observedGeneration: 1
         stateCurrent: ENDPOINT_SUBMIT_PENDING
      
  • You can check the status of the operation by running:

    gcloud --project=${PROJECT} endpoints operations describe ${OPERATION}
    
    • Operation is everything after operations/ in the configSubmit field

404 Page Not Found When Accessing Central Dashboard

This section provides troubleshooting information for 404s, page not found, being return by the central dashboard which is served at

https://${KUBEFLOW_FQDN}/
  • KUBEFLOW_FQDN is your project’s OAuth web app URI domain name <name>.endpoints.<project>.cloud.goog
  • Since we were able to sign in this indicates the Ambassador reverse proxy is up and healthy we can confirm this is the case by running the following command
kubectl -n ${NAMESPACE} get pods -l service=envoy

NAME                     READY     STATUS    RESTARTS   AGE
envoy-76774f8d5c-lx9bd   2/2       Running   2          4m
envoy-76774f8d5c-ngjnr   2/2       Running   2          4m
envoy-76774f8d5c-sg555   2/2       Running   2          4m
  • Try other services to see if they’re accessible for example

    https://${KUBEFLOW_FQDN}/whoami
    https://${KUBEFLOW_FQDN}/tfjobs/ui
    https://${KUBEFLOW_FQDN}/hub
    
  • If other services are accessible then we know its a problem specific to the central dashboard and not ingress

  • Check that the centraldashboard is running

    kubectl get pods -l app=centraldashboard
    NAME                                READY     STATUS    RESTARTS   AGE
    centraldashboard-6665fc46cb-592br   1/1       Running   0          7h
    
  • Check a service for the central dashboard exists

    kubectl get service -o yaml centraldashboard
    
  • Check that an Ambassador route is properly defined

    kubectl get service centraldashboard -o jsonpath='{.metadata.annotations.getambassador\.io/config}'
    
    apiVersion: ambassador/v0
      kind:  Mapping
      name: centralui-mapping
      prefix: /
      rewrite: /
      service: centraldashboard.kubeflow,
    
  • Check the logs of Ambassador for errors. See if there are errors like the following indicating an error parsing the route.If you are using the new Stackdriver Kubernetes monitoring you can use the following filter in the stackdriver console

     resource.type="k8s_container"
     resource.labels.location=${ZONE}
     resource.labels.cluster_name=${CLUSTER}
     metadata.userLabels.service="ambassador"
    "could not parse YAML"
    

502 Server Error

A 502 usually means traffic isn’t even making it to the envoy reverse proxy. And it usually indicates the loadbalancer doesn’t think any backends are healthy.

  • In Cloud Console select Network Services -> Load Balancing
    • Click on the load balancer (the name should contain the name of the ingress)

    • The exact name can be found by looking at the ingress.kubernetes.io/url-map annotation on your ingress object

      URLMAP=$(kubectl --namespace=${NAMESPACE} get ingress envoy-ingress -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/url-map}')
      echo ${URLMAP}
      
    • Click on your loadbalancer

    • This will show you the backend services associated with the load balancer

      • There is 1 backend service for each K8s service the ingress rule routes traffic too

      • The named port will correspond to the NodePort a service is using

        NODE_PORT=$(kubectl --namespace=${NAMESPACE} get svc envoy -o jsonpath='{.spec.ports[0].nodePort}')
        BACKEND_NAME=$(gcloud compute --project=${PROJECT} backend-services list --filter=name~k8s-be-${NODE_PORT}- --format='value(name)')
        gcloud compute --project=${PROJECT} backend-services get-health --global ${BACKEND_NAME}
        
    • Make sure the load balancer reports the backends as healthy

      • If the backends aren’t reported as healthy check that the pods associated with the K8s service are up and running

      • Check that health checks are properly configured

        • Click on the health check associated with the backend service for envoy
        • Check that the path is /healthz and corresponds to the path of the readiness probe on the envoy pods
        • See K8s docs for important information about how health checks are determined from readiness probes.
      • Check firewall rules to ensure traffic isn’t blocked from the Google Cloud loadbalancer

        • The firewall rule should be added automatically by the ingress but its possible it got deleted if you have some automatic firewall policy enforcement. You can recreate the firewall rule if needed with a rule like this

          gcloud compute firewall-rules create $NAME \
          --project $PROJECT \
          --allow tcp:$PORT \
          --target-tags $NODE_TAG \
          --source-ranges 130.211.0.0/22,35.191.0.0/16
          
        • To get the node tag

          # From the Kubernetes Engine cluster get the name of the managed instance group
          gcloud --project=$PROJECT container clusters --zone=$ZONE describe $CLUSTER
          # Get the template associated with the MIG
          gcloud --project=kubeflow-rl compute instance-groups managed describe --zone=${ZONE} ${MIG_NAME}
          # Get the instance tags from the template
          gcloud --project=kubeflow-rl compute instance-templates describe ${TEMPLATE_NAME}
          

          For more info see Google Cloud HTTP health check docs

    • In Stackdriver Logging look at the Cloud Http Load Balancer logs

      • Logs are labeled with the forwarding rule
      • The forwarding rules are available via the annotations on the ingress
        ingress.kubernetes.io/forwarding-rule
        ingress.kubernetes.io/https-forwarding-rule
        
    • Verify that requests are being properly routed within the cluster

    • Connect to one of the envoy proxies

      kubectl exec -ti `kubectl get pods --selector=service=envoy -o jsonpath='{.items[0].metadata.name}'` /bin/bash
      
    • Install curl in the pod

      apt-get update && apt-get install -y curl
      
    • Verify access to the whoami app

      curl -L -s -i http://envoy:8080/noiap/whoami
      
    • If this doesn’t return a 200 OK response; then there is a problem with the K8s resources

      • Check the pods are running
      • Check services are pointing at the points (look at the endpoints for the various services)

GKE Certificate Fails To Be Provisioned

A common symptom of your certificate failing to be provisioned is SSL errors like ERR_SSL_VERSION_OR_CIPHER_MISMATCH when you try to access the Kubeflow https endpoint.

To troubleshoot check the status of your Google Kubernetes Engine managed certificate

kubectl -n istio-system describe managedcertificate

If the certificate is in status FailedNotVisible then it means Google Cloud failed to provision the certificate because it could not verify that you owned the domain by doing an ACME challenge. In order for Google Cloud to provision your certificate

  1. Your ingress must be created in order to associated a Google Cloud Load Balancer(GCLB) with the IP address for your endpoint
  2. There must be a DNS entry mapping your domain name to the IP.

If there is a problem preventing either of the above then Google Cloud will be unable to provision your certificate and eventually enter the permanent failure state FailedNotVisible indicating your endpoint isn’t accessible. The most common cause is the ingress can’t be created because the K8s secret containing OAuth credentials doesn’t exist.

To fix this you must first resolve the underlying problems preventing your ingress or DNS entry from being created. Once the underlying problem has been fixed you can follow the steps below to force a new certificate to be generated.

You can fix the certificate by performing the following steps to delete the existing certificate and create a new one.

  1. Get the name of the Google Cloud certificate

    kubectl -n istio-system describe managedcertificate gke-certificate
    
    • The status will contain Certificate Name which will start with mcrt make a note of this.
  2. Delete the ingress

    kubectl -n istio-system delete ingress envoy-ingress
    
  3. Ensure the certificate was deleted

    gcloud --project=${PROJECT} compute ssl-certificates list
    
    • Make sure the certificate obtained in the first step no longer exists
  4. Reapply kubeflow in order to recreate the ingress and certificate

    • If you deployed with kfctl rerun kfctl apply
    • If you deployed using the Google Cloud blueprint rerun make apply-kubeflow
  5. Monitor the certificate to make sure it can be provisioned

    kubectl --context=gcp-private-0527 -n istio-system describe managedcertificate gke-certificate
    
  6. Since the ingress has been recreated we need to restart the pods that configure it

    kubectl -n istio-system delete pods -l service=backend-updater
    kubectl -n istio-system delete pods -l service=iap-enabler
    

Problems with SSL certificate from Let’s Encrypt

As of Kubeflow 1.0, Kubeflow should be using Google Kubernetes Engine Managed Certificates and no longer using Let’s Encrypt.

See the guide to monitoring your Cloud IAP setup.

Envoy pods crash-looping: root cause is backend quota exceeded

If your logs show the Envoy pods crash-looping, the root cause may be that you have exceeded your quota for some backend services such as loadbalancers. This is particularly likely if you have multiple, differently named deployments in the same Google Cloud project using Cloud IAP.

The error

The error looks like this for the pod’s Envoy container:

kubectl logs -n kubeflow envoy-79ff8d86b-z2snp envoy
[2019-01-22 00:19:44.400][1][info][main] external/envoy/source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2019-01-22 00:19:44.400][1][critical][main] external/envoy/source/server/server.cc:71] error initializing configuration '/etc/envoy/envoy-config.json': unable to read file: /etc/envoy/envoy-config.json

And the Cloud IAP container shows a message like this:

Waiting for backend id PROJECT=<your-project> NAMESPACE=kubeflow SERVICE=envoy filter=name~k8s-be-30352-...

Diagnosing the cause

You can verify the cause of the problem by entering the following command:

kubectl -n istio-system describe ingress

Look for something like this in the output:

Events:
  Type     Reason  Age                  From                     Message
  ----     ------  ----                 ----                     -------
  Warning  Sync    14m (x193 over 19h)  loadbalancer-controller  Error during sync: googleapi: Error 403: Quota 'BACKEND_SERVICES' exceeded. Limit: 5.0 globally., quotaExceeded

Fixing the problem

If you have any redundant Kubeflow deployments, you can delete them using the Deployment Manager.

Alternatively, you can request more backend services quota on the Google Cloud Console.

  1. Go to the quota settings for backend services on the Google Cloud Console.
  2. Click EDIT QUOTAS. A quota editing form opens on the right of the screen.
  3. Follow the form instructions to apply for more quota.

Legacy networks are not supported

Cloud Filestore and Google Kubernetes Engine try to use the network named default by default. For older projects, this will be a legacy network which is incompatible with Cloud Filestore and newer Google Kubernetes Engine features like private clusters. This will manifest as the error “default is invalid; legacy networks are not supported” when deploying Kubeflow.

Here’s an example error when deploying Cloud Filestore:

ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1533189457517-5726d7cfd19c9-e1b0b0b5-58ca11b8]: errors:
- code: RESOURCE_ERROR
  location: /deployments/jl-0801-b-gcfs/resources/filestore
  message: '{"ResourceType":"gcp-types/file-v1beta1:projects.locations.instances","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"network
    default is invalid; legacy networks are not supported.","status":"INVALID_ARGUMENT","statusMessage":"Bad
    Request","requestPath":"https://file.googleapis.com/v1beta1/projects/cloud-ml-dev/locations/us-central1-a/instances","httpMethod":"POST"}}'

To fix this we can create a new network:

cd ${KF_DIR}
cp .cache/master/deployment/gke/deployment_manager_configs/network.* \
   ./gcp_config/

Edit network.yaml to set the name for the network.

Edit gcfs.yaml to use the name of the newly created network.

Apply the changes.

cd ${KF_DIR}
kfctl apply -V -f ${CONFIG}

Changing the OAuth client used by IAP

If you need to change the OAuth client used by IAP, you can run the following commands to replace the Kubernetes secret containing the ID and secret.

kubectl -n kubeflow delete secret kubeflow-oauth
kubectl -n kubeflow create secret generic kubeflow-oauth \
       --from-literal=client_id=${CLIENT_ID} \
       --from-literal=client_secret=${CLIENT_SECRET}

Troubleshooting SSL certificate errors

This section describes how to enable service management API to avoid managed certificates failure.

To check your certificate:

  1. Run the following command:

    kubectl -n istio-system describe managedcertificate gke-certificate
    

    Make sure the certificate status is either Active or Provisioning which means it is not ready. For more details on certificate status, refer to the certificate statuses descriptions section. Also, make sure the domain name is correct.

  2. Run the following command to look for the errors using the certificate name from the previous step:

    gcloud beta --project=${PROJECT} compute ssl-certificates describe --global ${CERTIFICATE_NAME}
    
  3. Run the following command:

    kubectl -n istio-system get ingress envoy-ingress -o yaml
    

    Make sure of the following:

    • networking.gke.io/managed-certificates annotation value points to the name of the Kubernetes managed certificate resource and is gke-certificate;

    • public IP address that is displayed in the status is assigned. See the example of IP address below:

      status:
        loadBalancer:
          ingress:
           - ip: 35.186.212.202
      
    • DNS entry for the domain has propagated. To verify this, use the following nslookup command example:

      `nslookup ${DOMAIN}`
      
    • domain name is the fully qualified domain name which be the host value in the ingress. See the example below:

      ${KF_APP_NAME}.endpoints.${PROJECT}.cloud.goog
      

    Note that managed certificates cannot provision the certificate if the DNS lookup does not work properly.

8 - Kubeflow On-premises on Anthos

Running Kubeflow across on-premises and cloud environments with Anthos

Introduction

Anthos is a hybrid and multi-cloud application platform developed and supported by Google. Anthos is built on open source technologies, including Kubernetes, Istio, and Knative.

Using Anthos, you can create a consistent setup across your on-premises and cloud environments, helping you to automate policy and security at scale.

We are collecting interest for Kubeflow on Google Cloud On Prem. You can subscribe to the GitHub issue googlecloudplatform/kubeflow-distribution#138.

Next steps

While waiting for a response from the support team, you may like to deploy Kubeflow on Google Cloud.

9 - Changelog

Kubeflow on Google Cloud Changelog

1.7.0

Release notes

Changes:

  • 🔼 Upgraded upstream Manifests to v1.7.0.
  • 🔼 Upgraded Kubeflow Pipelines to v2.0.0-alpha.7.
  • 🔼 Upgraded KNative to v1.8.5 (#404).
  • 🔼 Upgraded cert-manager to v1.10.2 (#405).
  • 🔼 Upgraded ASM to v1.16.2 (#406).
  • 🔼 Upgraded KServe to v0.10 (#408).
  • 🔨 Fixed ASM deployment issue (#413, #419)
  • 🔨 Fixed user header issue in KServe web-app (#414)
  • 🧪 Validated deployment using GKE 1.23, GKE 1.24, GKE 1.25, and GKE 1.26.

1.6.1

Release notes

Changes:

  • 🔼 Upgraded upstream Manifests to v1.6.1.
  • 🔼 Upgraded pipelines to v2.0.0-alpha.6 (fixes #392).
  • 🔼 Updated MySQL to 8.0 (#391).
  • 🔨 Fixed ASM deployment issue (#389)
  • 🔨 Minor improvements of deployment process.
  • 🧪 Validated deployment using GKE 1.22.

1.6.0:

Release notes

Changes:

  • 🔼 Upgraded upstream Manifests to v1.6.0.
  • 🔼 Upgraded ASM to v1.14 (#385).
  • 🔼 Upgraded Knative to v1.2 (#373).
  • 🔼 Upgraded cert-manager to v1.5 (#372).
  • 🔼 Upgraded pipelines to v2.0.0-alpha.4.
  • 🔼 Upgraded APIs to support GKE 1.22 (#349).
  • 🔨 Improved deployment stability (#371, #376, #384, #386).
  • 🚚 Removed deprecated kfserving, cloud-endpoints, and application manifests (#375, #377).
  • 🧪 Validated deployment using GKE 1.21 and GKE 1.22.

1.5.1

Release notes

Changes:

  • 🔼 Upgraded ASM to v1.13.
  • 🔨 Fixed KServe issues with dashboard (#362) and directory (#361).
  • 🚚 Increased the maximum length of Kubeflow cluster name (#359).
  • 🚚 Moved RequestAuthentication policy creation to iap-enabler to improve GitOps friendliness (#364).
  • 🧪 Validated deployment using GKE 1.21.11.

1.5.0:

Release notes

Changes:

  • 🔼 Upgrade Kubeflow components versions as listed in components versions table
  • 🚀 Integrated with Config Controller, simplified management cluster maintenance cost, there is no need to manually upgrade Config Connector CRD.
  • 🚚 Switch from kfserving to KServe as default serving component, you can switch back to kfserving in config.yaml.
  • 🔨 Fixed cloudsqlproxy issue with livenessProbe configuration.
  • 🧪 Validated deployment using GKE 1.20.12.

1.4.1

Release notes

Changes:

Changes on top of v1.4.0:

  • 🔼 Upgrade: Integrate with Kubeflow 1.4.1 manifests (kubeflow/manifests#2084)
  • 🔨 Fix: Change cloud endpoint images destination (#343)
  • 🔨 Fix: Use yq4 in iap-ingress Makefile.

1.4.0:

Release notes

Changes:

  • 🔼 Upgrade Kubeflow components versions as listed in components versions table
  • 🚢 Removed GKE 1.18 image version and k8s runtime pin, now GKE version is default to STABLE channel.
  • 🌊 Set Emissary Executor as default Argo Workflow executor for Kubeflow Pipelines.
  • 🔼 Upgraded kpt versions from 0.X.X to 1.0.0-beta.6.
  • 🔼 Upgraded yq from v3 to v4.
  • 🔼 Upgraded ASM (Anthos Service Mesh) to 1.10.4-asm.6.
  • 🚀 Unblocked KFSserving usage by removing commonLabels from kustomization patch (#298 and #324).
  • 🔗 Integrated with KFServing Web App UI.
  • 🔗 Integrated with unified operator: training-operator.
  • 🪁 Simplified deployment: Removed requirement for independent installation of yq, jq, kustomize, kpt.