Kubeflow on Google Cloud
- 1: Deployment
- 1.1: Overview
- 1.2: Setting up GCP Project
- 1.3: Setting up OAuth client
- 1.4: Deploying Management cluster
- 1.5: Deploying Kubeflow cluster
- 1.6: Upgrade Kubeflow
- 1.7: Monitoring Cloud IAP Setup
- 1.8: Deleting Kubeflow
- 2: Pipelines on Google Cloud
- 2.1: Connecting to Kubeflow Pipelines on Google Cloud using the SDK
- 2.2: Authenticating Pipelines to Google Cloud
- 2.3: Upgrading
- 2.4: Enabling GPU and TPU
- 2.5: Using Preemptible VMs and GPUs on Google Cloud
- 3: Customize Kubeflow on Google Cloud
- 4: Using Your Own Domain
- 5: Authenticating Kubeflow to Google Cloud
- 6: Securing Your Clusters
- 7: Troubleshooting Deployments on Google Cloud
- 8: Kubeflow On-premises on Anthos
- 9: Changelog
1 - Deployment
1.1 - Overview
This guide describes how to deploy Kubeflow and a series of Kubeflow components on Google Kubernetes Engine (GKE).
Features
Kubeflow deployed on Google Cloud includes the following:
- Full-fledged multi-user Kubeflow running on Google Kubernetes Engine.
- Cluster Autoscaler with automatic resizing of the node pool.
- Cloud Endpoint integrated with Identity-aware Proxy (IAP).
- GPU and Cloud TPU accelerated nodes available for your Machine Learning (ML) workloads.
- Cloud Logging for easy debugging and troubleshooting.
- Other managed services offered by Google Cloud, such as Cloud Storage, Cloud SQL, Anthos Service Mesh, Identity and Access Management (IAM), Config Controller, and so on.
Figure 1. User interface of full-fledged Kubeflow deployment on Google Cloud.
Management cluster
Kubeflow on Google Cloud employs a management cluster, which lets you manage Google Cloud resources via Config Controller. The management cluster is independent of the Kubeflow cluster and is used to manage Kubeflow clusters. You can also use a management cluster from a different Google Cloud project by assigning owner permissions to the associated service account.
Figure 2. Example of Kubeflow on Google Cloud deployment.
Deployment process
To set up a Kubeflow environment on Google Cloud, complete these steps:
- Set up Google Cloud project.
- Set up OAuth client.
- Deploy Management cluster.
- Deploy Kubeflow cluster.
For common issues encountered during these deployment steps and their debugging approaches, see troubleshooting deployments. If your issue isn't included in the list of commonly encountered issues, report a bug at googlecloudplatform/kubeflow-distribution.
Next steps
- Deploy Kubeflow Cluster.
- Run a full ML workflow on Kubeflow by using the end-to-end MNIST notebook.
1.2 - Setting up GCP Project
In order to deploy Kubeflow on Google Cloud, you need to set up a Google Cloud project and enable necessary APIs for the deployment.
Setting up a project
Follow these steps to set up your Google Cloud project:
-
Select or create a project on the Google Cloud Console. If you plan to use different Google Cloud projects for Management Cluster and Kubeflow Clusters: create one Management project for Management Cluster, and create one or more Kubeflow projects for Kubeflow Clusters.
-
Make sure that you have the Owner role for the project in Cloud IAM (Identity and Access Management). The deployment process creates various service accounts with appropriate roles in order to enable seamless integration with Google Cloud services. This process requires that you have the owner role for the project in order to deploy Kubeflow.
-
Make sure that billing is enabled for your project. Refer to Enable billing for a project.
-
Enable the following APIs by running the following command in a Cloud Shell or local terminal (needs to be authenticated via
gcloud auth login
):
gcloud services enable \
  serviceusage.googleapis.com \
  compute.googleapis.com \
  container.googleapis.com \
  iam.googleapis.com \
  servicemanagement.googleapis.com \
  cloudresourcemanager.googleapis.com \
  ml.googleapis.com \
  iap.googleapis.com \
  sqladmin.googleapis.com \
  meshconfig.googleapis.com \
  krmapihosting.googleapis.com \
  servicecontrol.googleapis.com \
  endpoints.googleapis.com \
  cloudbuild.googleapis.com
Alternatively, you can enable these APIs via the Google Cloud Console (a quick way to verify they are enabled is shown at the end of this section):
* [Service Usage API](https://cloud.google.com/service-usage/docs/reference/rest)
* [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com)
* [Kubernetes Engine API](https://console.cloud.google.com/apis/library/container.googleapis.com)
* [Identity and Access Management (IAM) API](https://console.cloud.google.com/apis/library/iam.googleapis.com)
* [Service Management API](https://console.cloud.google.com/apis/api/servicemanagement.googleapis.com)
* [Cloud Resource Manager API](https://console.developers.google.com/apis/library/cloudresourcemanager.googleapis.com)
* [AI Platform Training & Prediction API](https://console.developers.google.com/apis/library/ml.googleapis.com)
* [Cloud Identity-Aware Proxy API](https://console.cloud.google.com/apis/library/iap.googleapis.com)
* [Cloud Build API](https://console.cloud.google.com/apis/library/cloudbuild.googleapis.com)
* [Cloud SQL Admin API](https://console.cloud.google.com/apis/library/sqladmin.googleapis.com)
* [Config Controller (KRM API Hosting API)](https://console.cloud.google.com/apis/library/krmapihosting.googleapis.com)
* [Service Control API](https://console.cloud.google.com/apis/library/servicecontrol.googleapis.com)
* [Google Cloud Endpoints](https://console.cloud.google.com/apis/library/endpoints.googleapis.com)
-
If you are using the Google Cloud Free Program or the 12-month trial period with $300 credit, note that the free tier does not offer enough resources for a default full Kubeflow installation. You need to upgrade to a paid account.
For more information, see the following issues:
- kubeflow/website #1065 reports the problem.
- kubeflow/kubeflow #3936 requests a Kubeflow configuration to work with a free trial project.
Read the Google Cloud Resource quotas to understand quotas on resource usage that Compute Engine enforces, and to learn how to check and increase your quotas.
-
Initialize your project to prepare it for Anthos Service Mesh installation:
PROJECT_ID=<YOUR_PROJECT_ID>
curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data '' \
  https://meshconfig.googleapis.com/v1alpha1/projects/${PROJECT_ID}:initialize
Refer to Anthos Service Mesh documentation for details.
If you encounter a
Workload Identity Pool does not exist
error, refer to the following issue:
- kubeflow/website #2121 describes that creating and then removing a temporary Kubernetes cluster may be needed for projects that haven't had a cluster set up beforehand.
You do not need a running Google Kubernetes Engine cluster. The deployment process creates a cluster for you.
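As an optional sanity check before moving on, you can confirm that the APIs you enabled earlier are now active for the project. This is a generic gcloud listing command and only spot-checks a few of the services:
# Optional: confirm a few of the required APIs show up as enabled.
gcloud services list --enabled --project="${PROJECT_ID}" | grep -E 'container|iap|krmapihosting'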
Next steps
- Set up an OAuth credential to use Cloud Identity-Aware Proxy (Cloud IAP). Cloud IAP is recommended for production deployments or deployments with access to sensitive data.
- Set up Management Cluster to deploy and manage Kubeflow clusters.
- Deploy Kubeflow using kubectl, kustomize and kpt.
1.3 - Setting up OAuth client
Set up OAuth Consent Screen and Client Credential
If you want to use Cloud Identity-Aware Proxy (Cloud IAP) when deploying Kubeflow on Google Cloud, then you must follow these instructions to create an OAuth client for use with Kubeflow.
Cloud IAP is recommended for production deployments or deployments with access to sensitive data.
Follow the steps below to create an OAuth client ID that identifies Cloud IAP when requesting access to a user’s email account. Kubeflow uses the email address to verify the user’s identity.
-
Set up your OAuth consent screen:
-
In the Application name box, enter the name of your application. The example below uses the name “Kubeflow”.
-
Under Support email, select the email address that you want to display as a public contact. You must use either your email address or a Google Group that you own.
-
If you see Authorized domains, enter
<project>.cloud.goog
- where <project> is your Google Cloud project ID.
- If you are using your own domain, such as acme.com, you should add that as well
- The Authorized domains option appears only for certain project configurations. If you don’t see the option, then there’s nothing you need to set.
-
Click Save.
-
Here’s an example of the completed form:
-
-
On the credentials screen:
- Click Create credentials, and then click OAuth client ID.
- Under Application type, select Web application.
- In the Name box enter any name for your OAuth client ID. This is not the name of your application nor the name of your Kubeflow deployment. It’s just a way to help you identify the OAuth client ID.
-
Click Create. A dialog box appears, like the one below:
-
Copy the client ID shown in the dialog box, because you need the client ID in the next step.
-
On the Create credentials screen, find your newly created OAuth credential and click the pencil icon to edit it:
-
In the Authorized redirect URIs box, enter the following (if it’s not already present in the list of authorized redirect URIs):
https://iap.googleapis.com/v1/oauth/clientIds/<CLIENT_ID>:handleRedirect
<CLIENT_ID>
is the OAuth client ID that you copied from the dialog box in step four. It looks like XXX.apps.googleusercontent.com
.
- Note that the URI is not dependent on the Kubeflow deployment or endpoint. Multiple Kubeflow deployments can share the same OAuth client without the need to modify the redirect URIs.
-
Press Enter/Return to add the URI. Check that the URI now appears as a confirmed item under Authorized redirect URIs. (The URI should no longer be editable.)
Here’s an example of the completed form:
-
Click Save.
-
Make note that you can find your OAuth client credentials in the credentials section of the Google Cloud Console. You need to retrieve the client ID and client secret later when you’re ready to enable Cloud IAP.
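When you reach the Kubeflow cluster deployment step later in this guide, you will export these values as environment variables, for example (the placeholders stand for the ID and secret you just created):
export CLIENT_ID=<Your CLIENT_ID>
export CLIENT_SECRET=<Your CLIENT_SECRET>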
Next steps
- Set up your management cluster.
- Grant your users the IAP-secured Web App User IAM role so they can access the Kubeflow console through IAP.
1.4 - Deploying Management cluster
This guide describes how to set up a management cluster, which you will use to deploy one or more instances of Kubeflow.
The management cluster is used to run Cloud Config Connector. Cloud Config Connector is a Kubernetes addon that allows you to manage Google Cloud resources through Kubernetes.
While the management cluster can be deployed in the same project as your Kubeflow cluster, typically you will want to deploy it in a separate project used for administering one or more Kubeflow instances, because it will run with escalated permissions to create Google Cloud resources in the managed projects.
Optionally, the cluster can be configured with Anthos Config Management to manage Google Cloud infrastructure using GitOps.
Deployment steps
Install the required tools
-
gcloud components install kubectl kustomize kpt anthoscli beta
gcloud components update
# If the output says the Cloud SDK component manager is disabled for installation, copy the command from the output and run it.
You can install a specific version of kubectl by following its installation instructions (example: Install kubectl on Linux). The latest patch version of kubectl from
v1.17
to v1.19
works well too.
Note: Starting from Kubeflow 1.4, it requires
kpt v1.0.0-beta.6
or above to operate ingooglecloudplatform/kubeflow-distribution
repository. gcloud hasn't caught up with this kpt version yet, so install kpt separately from https://github.com/GoogleContainerTools/kpt/tags for now (a download sketch is shown below). Note that kpt requires Docker to be installed.
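For example, on Linux you could download a kpt release binary directly from GitHub. This is only a sketch: the version and asset name below are assumptions, so adjust them to match the release you pick from the kpt releases page.
# Sketch: install a kpt v1.x binary on Linux (adjust version and asset name as needed).
KPT_VERSION=v1.0.0-beta.6
curl -L -o kpt "https://github.com/GoogleContainerTools/kpt/releases/download/${KPT_VERSION}/kpt_linux_amd64"
chmod +x kpt
sudo mv kpt /usr/local/bin/
kpt version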
Fetch googlecloudplatform/kubeflow-distribution package
The management cluster manifests live in the googlecloudplatform/kubeflow-distribution GitHub repository. Use the following commands to pull the Kubeflow manifests:
-
Clone the GitHub repository and check out the latest manifests:
git clone https://github.com/googlecloudplatform/kubeflow-distribution.git
cd kubeflow-distribution
git checkout master
Alternatively, you can get the package by using
kpt
:
# Check out the latest Kubeflow
kpt pkg get https://github.com/googlecloudplatform/kubeflow-distribution.git@master kubeflow-distribution
cd kubeflow-distribution
-
Go to
kubeflow-distribution/management
directory for Management cluster configurations.
cd management
Tip
To manage the management cluster continuously, it is recommended to check the management configuration directory into source control. For example, MGMT_DIR=~/kubeflow-distribution/management/
.
Configure Environment Variables
Fill in environment variables in kubeflow-distribution/management/env.sh
as follows:
MGMT_PROJECT=<the project where you deploy your management cluster>
MGMT_NAME=<name of your management cluster>
LOCATION=<location of your management cluster, use either us-central1 or us-east1>
And run:
source env.sh
This guide assumes the following convention:
-
The
${MGMT_PROJECT}
environment variable contains the Google Cloud project ID where the management cluster is deployed.
-
${MGMT_NAME}
is the cluster name of your management cluster and the prefix for other Google Cloud resources created in the deployment process. The management cluster should be a different cluster from your Kubeflow cluster.
Note,
${MGMT_NAME}
should
- start with a lowercase letter
- only contain lowercase letters, numbers and -
- end with a number or a letter
- contain no more than 18 characters
-
The
${LOCATION}
environment variable contains the location of your management cluster. You can choose between regional and zonal; see Available regions and zones. (A filled-in example of env.sh is shown after this list.)
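For example, a filled-in env.sh could look like the following; the project ID and cluster name here are placeholders, not values defined by this guide:
MGMT_PROJECT=my-admin-project   # placeholder Google Cloud project ID
MGMT_NAME=kf-mgmt-cluster       # lowercase letters, numbers and -, at most 18 characters
LOCATION=us-central1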
Configure kpt setter values
Use kpt to set values for the name, project, and location of your management cluster. Run the following command:
bash kpt-set.sh
Note, you can find out which setters exist in a package and what their current values are by running the following command:
kpt fn eval -i list-setters:v0.1 ./manifests
Prerequisite for Config Controller
In order to deploy Google Cloud services as Kubernetes resources, you need to create a management cluster with Config Controller installed. Follow Before you begin to create the default network if it does not already exist (a sketch of the gcloud commands is shown below). Make sure to use ${MGMT_PROJECT}
for PROJECT_ID.
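If you are unsure whether the management project already has a default network, you can check with gcloud. The creation command below is only a sketch and is subject to your organization's network policies:
# Check for an existing "default" network in the management project.
gcloud compute networks list --project="${MGMT_PROJECT}"
# Sketch: create an auto-mode default network if one does not exist.
gcloud compute networks create default --subnet-mode=auto --project="${MGMT_PROJECT}"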
Deploy Management Cluster
-
Deploy the management cluster by applying cluster resources:
make create-cluster
-
Create a kubectl context for the management cluster; it will be named
${MGMT_NAME}
:make create-context
-
Grant permission to Config Controller service account:
make grant-owner-permission
Config Controller has created a default service account; this step grants the owner permission to that service account so that Config Controller can manage Google Cloud resources. Refer to Config Controller setup.
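To verify that the grant took effect, you can inspect the project's IAM policy. The filter below assumes the default Config Controller service account naming (gcp-sa-yakima, as described later on this page); adjust it if your service account differs:
# Confirm the Config Controller service account holds a role on the project.
gcloud projects get-iam-policy "${MGMT_PROJECT}" \
  --flatten="bindings[].members" \
  --filter="bindings.members:gcp-sa-yakima" \
  --format="table(bindings.role, bindings.members)"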
Understanding the deployment process
This section gives you more details about the configuration and deployment process, so that you can customize your management cluster if necessary.
Config Controller
The management cluster is a tool for managing Google Cloud services through KRM resources, for example a Google Kubernetes Engine container cluster or a MySQL database. You can use one management cluster for multiple Kubeflow clusters, across multiple Google Cloud projects. This capability is offered by Config Connector.
Starting with Kubeflow 1.5, we leverage the managed version of Config Connector, called Config Controller. Therefore, the management cluster is the Config Controller cluster deployed using the Config Controller setup process. Note that you can create only one management cluster within a Google Cloud project, and you usually need just one.
Management cluster layout
Inside the Config Controller, we manage Google Cloud resources in namespace mode. That means one namespace is responsible for managing the Google Cloud resources deployed to the Google Cloud project with the same name. Your management cluster contains the following namespaces:
- config-control
- namespace with the same name as your Kubeflow clusters’ Google Cloud project name
config-control
is the default namespace, which is installed while creating the management cluster. You have granted the default service account (like service-<management-project-id>@gcp-sa-yakima.iam.gserviceaccount.com
)
within this project to manage Config Connector. It is the prerequisite for managing resources in other Google Cloud projects.
namespace with the same name as your Kubeflow clusters' Google Cloud project name
is the resource pool for Kubeflow cluster’s Google Cloud project.
For each Kubeflow Google Cloud project, you will have a service account with the pattern kcc-<kf-project-name>@<management-project-name>.iam.gserviceaccount.com
in config-control
namespace, and it needs to have owner permission to ${KF_PROJECT}
, you will perform this step during Deploy Kubeflow cluster. After setup, your Google Cloud resources in Kubeflow cluster project will be deployed to the namespace with name ${KF_PROJECT}
in the management cluster.
Your management cluster directory contains the following file:
- Makefile is a file that defines rules to automate the deployment process. Refer to the GNU make documentation for an introduction. The Makefile we provide is designed to be user-maintainable; you are encouraged to read, edit, and maintain it to suit your own deployment customization needs.
Debug
If you encounter issues creating Google Cloud resources using Config Controller, you can list the resources in the ${KF_PROJECT}
namespace of the management cluster to learn more details.
Learn more with Monitoring your resources
kubectl --context=${MGMT_NAME} get all -n ${KF_PROJECT}
# If you want to check the service account creation status
kubectl --context=${MGMT_NAME} get IAMServiceAccount -n ${KF_PROJECT}
kubectl --context=${MGMT_NAME} get IAMServiceAccount <service-account-name> -n ${KF_PROJECT} -oyaml
FAQs
-
Where is
kfctl
?kfctl
is no longer being used to apply resources for Google Cloud, because required functionalities are now supported by generic tools including Make, Kustomize, kpt, and Cloud Config Connector. -
Why do we use an extra management cluster to manage Google Cloud resources?
The management cluster is a very lightweight cluster that runs Cloud Config Connector. Cloud Config Connector makes it easier to configure Google Cloud resources using YAML and Kustomize.
For a more detailed explanation of the drastic changes that happened in Kubeflow v1.1 on Google Cloud, read googlecloudplatform/kubeflow-distribution #123.
Next steps
- Deploy Kubeflow using kubectl, kustomize and kpt.
1.5 - Deploying Kubeflow cluster
This guide describes how to use kubectl
and kpt to
deploy Kubeflow on Google Cloud.
Deployment steps
Prerequisites
Before installing Kubeflow on the command line:
-
You must have created a management cluster and installed Config Connector.
-
If you don’t have a management cluster follow the instructions
-
Your management cluster needs a namespace set up to administer the Google Cloud project where Kubeflow will be deployed. This is covered in a later step on this page.
-
-
You need to use Linux or Cloud Shell for ASM installation. Currently ASM installation doesn’t work on macOS because it comes with an old version of bash.
-
Make sure that your Google Cloud project meets the minimum requirements described in the project setup guide.
-
Follow the guide setting up OAuth credentials to create OAuth credentials for Cloud Identity-Aware Proxy (Cloud IAP).
- Unfortunately Google Kubernetes Engine’s BackendConfig currently doesn’t support creating IAP OAuth clients programmatically.
Install the required tools
-
Install gcloud.
-
Install gcloud components
gcloud components install kubectl kustomize kpt anthoscli beta
gcloud components update
You can install a specific version of kubectl by following its installation instructions (example: Install kubectl on Linux). The latest patch version of kubectl from
v1.17
to v1.19
works well too.
Note: Starting from Kubeflow 1.4, it requires
kpt v1.0.0-beta.6
or above to operate ingooglecloudplatform/kubeflow-distribution
repository. gcloud hasn’t caught up with this kpt version yet, install kpt separately from https://github.com/GoogleContainerTools/kpt/tags for now. Note that kpt requires docker to be installed.Note: You also need to install required tools for ASM installation tool
install_asm
.
Fetch googlecloudplatform/kubeflow-distribution and upstream packages
-
If you have already installed Management cluster, you have
googlecloudplatform/kubeflow-distribution
locally. You just need to runcd kubeflow
to access Kubeflow cluster manifests. Otherwise, you can run the following commands:
# Check out the latest Kubeflow
git clone https://github.com/googlecloudplatform/kubeflow-distribution.git
cd kubeflow-distribution
git checkout master
Alternatively, you can get the package by using
kpt
:
# Check out the latest Kubeflow
kpt pkg get https://github.com/googlecloudplatform/kubeflow-distribution.git@master kubeflow-distribution
cd kubeflow-distribution
-
Run the following command to pull upstream manifests from
kubeflow/manifests
repository.
# Visit Kubeflow cluster related manifests
cd kubeflow
bash ./pull-upstream.sh
Environment Variables
Log in to gcloud. You only need to run this command once:
gcloud auth login
-
Review and fill all the environment variables in
kubeflow-distribution/kubeflow/env.sh
, they will be used bykpt
later on, and some of them will be used in this deployment guide. Review the comment inenv.sh
for the explanation of each environment variable. After defining these environment variables, run:
source env.sh
-
Set environment variables with OAuth Client ID and Secret for IAP:
export CLIENT_ID=<Your CLIENT_ID>
export CLIENT_SECRET=<Your CLIENT_SECRET>
Note
Do not omit the export, because scripts triggered by make need these environment variables. Do not check these two environment variables into source control; they are secrets.
kpt setter config
Run the following commands to configure kpt setter for your Kubeflow cluster:
bash ./kpt-set.sh
Every time you change environment variables, make sure you run the command above to apply the kpt setter changes to all packages. Otherwise, kustomize build will not pick up the new changes.
Note, you can find out which setters exist in a package and their current values by running the following commands:
kpt fn eval -i list-setters:v0.1 ./apps
kpt fn eval -i list-setters:v0.1 ./common
You can learn more about list-setters
in kpt documentation.
Authorize Cloud Config Connector for each Kubeflow project
In the Management cluster deployment we created the Google Cloud service account serviceAccount:kcc-${KF_PROJECT}@${MGMT_PROJECT}.iam.gserviceaccount.com
this is the service account that Config Connector will use to create any Google Cloud resources in ${KF_PROJECT}
. You need to grant this Google Cloud service account sufficient privileges to create the desired resources in the Kubeflow project.
You only need to perform the steps below once for each Kubeflow project, but make sure to do so even when KF_PROJECT and MGMT_PROJECT are the same project.
The easiest way to do this is to grant the Google Cloud service account owner permissions on one or more projects.
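A minimal sketch of such a grant, using the service account pattern described above, could look like the following; treat it as an illustration rather than the exact command your setup requires:
# Sketch: grant the per-project Config Connector service account owner on the Kubeflow project.
gcloud projects add-iam-policy-binding "${KF_PROJECT}" \
  --member="serviceAccount:kcc-${KF_PROJECT}@${MGMT_PROJECT}.iam.gserviceaccount.com" \
  --role=roles/owner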
-
Set the Management environment variable if you haven’t:
MGMT_PROJECT=<the project where you deploy your management cluster>
MGMT_NAME=<the kubectl context name for management cluster>
-
Apply ConfigConnectorContext for
${KF_PROJECT}
in management cluster:make apply-kcc
Configure Kubeflow
Make sure you are using KF_PROJECT in the gcloud CLI tool:
gcloud config set project ${KF_PROJECT}
Deploy Kubeflow
To deploy Kubeflow, run the following command:
make apply
-
If deployment returns an error due to missing resources in
serving.kserve.io
API group, rerunmake apply
. This is due to a race condition between CRD and runtime resources in KServe.- This issue is being tracked in googlecloudplatform/kubeflow-distribution#384
-
If resources can’t be created because
webhook.cert-manager.io
is unavailable wait and then rerunmake apply
- This issue is being tracked in kubeflow/manifests#1234
-
If resources can’t be created with an error message like:
error: unable to recognize ".build/application/app.k8s.io_v1beta1_application_application-controller-kubeflow.yaml": no matches for kind "Application" in version "app.k8s.io/v1beta1"
This issue occurs when the CRD endpoint isn't established in the Kubernetes API server when the CRD's custom object is applied. This issue is expected and can happen multiple times for different kinds of resources. To resolve this issue, try running
make apply
again.
Check your deployment
Follow these steps to verify the deployment:
-
When the deployment finishes, check the resources installed in the namespace
kubeflow
in your new cluster. To do this from the command line, first set yourkubectl
credentials to point to the new cluster:gcloud container clusters get-credentials "${KF_NAME}" --zone "${ZONE}" --project "${KF_PROJECT}"
Then, check what’s installed in the
kubeflow
namespace of your Google Kubernetes Engine cluster:kubectl -n kubeflow get all
Access the Kubeflow user interface (UI)
To access the Kubeflow central dashboard, follow these steps:
-
Use the following command to grant yourself the IAP-secured Web App User role:
gcloud projects add-iam-policy-binding "${KF_PROJECT}" --member=user:<EMAIL> --role=roles/iap.httpsResourceAccessor
Note, you need the
IAP-secured Web App User
role even if you are already an owner or editor of the project.IAP-secured Web App User
role is not implied by theProject Owner
orProject Editor
roles. -
Enter the following URI into your browser address bar. It can take 20 minutes for the URI to become available:
https://${KF_NAME}.endpoints.${KF_PROJECT}.cloud.goog/
You can run the following command to get the URI for your deployment:
kubectl -n istio-system get ingress
NAME            HOSTS                                                      ADDRESS         PORTS   AGE
envoy-ingress   your-kubeflow-name.endpoints.your-gcp-project.cloud.goog   34.102.232.34   80      5d13h
The following command sets an environment variable named
HOST
to the URI:export HOST=$(kubectl -n istio-system get ingress envoy-ingress -o=jsonpath={.spec.rules[0].host})
Notes:
- It can take 20 minutes for the URI to become available. Kubeflow needs to provision a signed SSL certificate and register a DNS name.
- If you own or manage the domain or a subdomain with Cloud DNS then you can configure this process to be much faster. Check kubeflow/kubeflow#731.
Understanding the deployment process
This section gives you more details about the kubectl, kustomize, config connector configuration and deployment process, so that you can customize your Kubeflow deployment if necessary.
Application layout
Your Kubeflow application directory kubeflow-distribution/kubeflow
contains the following files and
directories:
-
Makefile is a file that defines rules to automate the deployment process. Refer to the GNU make documentation for an introduction. The Makefile we provide is designed to be user-maintainable; you are encouraged to read, edit, and maintain it to suit your own deployment customization needs.
-
apps, common, contrib are a series of independent component directories containing kustomize packages for deploying Kubeflow components. The structure aligns with upstream kubeflow/manifests.
-
googlecloudplatform/kubeflow-distribution repository only stores
kustomization.yaml
andpatches
for Google Cloud specific resources. -
./pull_upstream.sh
will pullkubeflow/manifests
and store manifests inupstream
folder of each component used in this guide. The googlecloudplatform/kubeflow-distribution repository doesn't store a copy of the upstream manifests.
-
-
build is a directory that will contain the hydrated manifests outputted by the
make
rules, each component will have its own build directory. You can customize the build path when callingmake
command.
Source Control
It is recommended that you check in your entire local repository into source control.
Checking in the build directory is recommended so you can easily see the differences with git diff
in manifests before applying them.
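For example, a typical review loop before applying changes could look like this, assuming the default build output directory described above:
# Regenerate hydrated manifests and review the diff before running make apply.
make hydrate
git status build/
git diff build/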
Google Cloud service accounts
The deployment process creates three service accounts in your Google Cloud project. These service accounts follow the principle of least privilege (you can list them with the command shown after this list). The service accounts are:
${KF_NAME}-admin
is used for some admin tasks like configuring the load balancers. The principle is that this account is needed to deploy Kubeflow but not needed to actually run jobs.${KF_NAME}-user
is intended to be used by training jobs and models to access Google Cloud resources (Cloud Storage, BigQuery, etc.). This account has a much smaller set of privileges compared toadmin
.${KF_NAME}-vm
is used only for the virtual machine (VM) service account. This account has the minimal permissions needed to send metrics and logs to Stackdriver.
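You can list these service accounts with gcloud; the grep pattern simply matches the ${KF_NAME} prefix:
# List the service accounts created for this Kubeflow deployment.
gcloud iam service-accounts list --project="${KF_PROJECT}" | grep "${KF_NAME}-"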
Upgrade Kubeflow
Refer to Upgrading Kubeflow cluster.
Next steps
- Run a full ML workflow on Kubeflow, using the end-to-end MNIST tutorial or the GitHub issue summarization Pipelines example.
- Learn how to delete your Kubeflow deployment using the CLI.
- To add users to Kubeflow, go to a dedicated section in Customizing Kubeflow on Google Cloud.
- To tailor your Kubeflow deployment on Google Cloud, go to Customizing Kubeflow on Google Cloud.
- For troubleshooting Kubeflow deployments on Google Cloud, go to the Troubleshooting deployments guide.
1.6 - Upgrade Kubeflow
Before you start
To better understand the upgrade process, you should read the following sections first:
- Understanding the deployment process for management cluster
- Understanding the deployment process for Kubeflow cluster
This guide assumes the following settings:
- The
${MGMT_DIR}
and${MGMT_NAME}
environment variables are the same as in Management cluster setup. - The
${KF_NAME}
,${CLIENT_ID}
and${CLIENT_SECRET}
environment variables are the same as in Deploy using kubectl and kpt. - The
${KF_DIR}
environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example,/opt/kubeflow-distribution/kubeflow/
.
General upgrade instructions
Starting from Kubeflow v1.5, we have integrated with Config Controller. You no longer need to manually upgrade the management cluster, since it is managed as described in Upgrade Config Controller.
Starting from Kubeflow v1.3, we have reworked the structure of the googlecloudplatform/kubeflow-distribution
repository. All resources are located in kubeflow-distribution/management
directory. Upgrading the management cluster to v1.3 is not supported.
Before Kubeflow v1.3, both management cluster and Kubeflow cluster follow the same instance
and upstream
folder convention. To upgrade, you’ll typically need to update packages in upstream
to the new version and repeat the make apply-<subcommand>
commands in their respective deployment process.
However, specific upgrades might need manual actions below.
Upgrading management cluster
Upgrading management cluster before 1.5
It is strongly recommended to use source control to keep a copy of your working repository for recording changes at each step.
Due to the refactoring of kubeflow/manifests
repository, the way we depend on googlecloudplatform/kubeflow-distribution
has changed drastically. This section applies to upgrades from Kubeflow 1.3 or later.
-
The instructions below assume that your current working directory is
cd "${MGMT_DIR}"
-
Use your management cluster’s kubectl context:
# Look at all your contexts
kubectl config get-contexts
# Select your management cluster's context
kubectl config use-context "${MGMT_NAME}"
# Verify the context connects to the cluster properly
kubectl get namespace
If you are using a different environment, you can always reconfigure the context by:
make create-context
-
Check your existing config connector version:
# For Kubeflow v1.3, it should be 1.46.0
$ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
1.46.0
-
Merge the content from new Kubeflow version of
googlecloudplatform/kubeflow-distribution
WORKING_BRANCH=<your-github-working-branch>
VERSION_TAG=<targeted-kubeflow-version-tag-on-github>
git checkout -b "${WORKING_BRANCH}"
git remote add upstream https://github.com/googlecloudplatform/kubeflow-distribution.git # This is one time only.
git fetch upstream
git merge "${VERSION_TAG}"
-
Make sure your build directory (
./build
by default) is checked in to source control (git). -
Run the following command to hydrate Config Connector resources:
make hydrate-kcc
-
Compare the differences in your source control tracking after making the hydration change. If there are only additions or modifications, proceed to the next step. If the change includes deletions, you need to use
kubectl delete
to manually clean up the deleted resources.
After confirmation, run the following command to apply new changes:
make apply-kcc
-
Check that the version has been upgraded after applying the new Config Connector resources:
$ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
Upgrade management cluster from v1.1 to v1.2
-
The instructions below assume that your current working directory is
cd "${MGMT_DIR}"
-
Use your management cluster’s kubectl context:
# Look at all your contexts
kubectl config get-contexts
# Select your management cluster's context
kubectl config use-context "${MGMT_NAME}"
# Verify the context connects to the cluster properly
kubectl get namespace
If you are using a different environment, you can always reconfigure the context by:
make create-context
-
Check your existing config connector version:
# For Kubeflow v1.1, it should be 1.15.1
$ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
1.15.1
-
Uninstall the old config connector in the management cluster:
kubectl delete sts,deploy,po,svc,roles,clusterroles,clusterrolebindings --all-namespaces -l cnrm.cloud.google.com/system=true --wait=true
kubectl delete validatingwebhookconfiguration abandon-on-uninstall.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete validatingwebhookconfiguration validating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
kubectl delete mutatingwebhookconfiguration mutating-webhook.cnrm.cloud.google.com --ignore-not-found --wait=true
These commands uninstall the config connector without removing your resources.
-
Replace your
./Makefile
with the version in Kubeflowv1.2.0
: https://github.com/googlecloudplatform/kubeflow-distribution/blob/v1.2.0/management/Makefile.If you made any customizations in
./Makefile
, you should merge your changes with the upstream version. We’ve refactored the Makefile to move substantial commands into the upstream package, so hopefully future upgrades won’t require a manual merge of the Makefile. -
Update
./upstream/management
package:make update
-
Use kpt to set user values:
kpt cfg set -R . name ${MGMT_NAME}
kpt cfg set -R . gcloud.core.project ${MGMT_PROJECT}
kpt cfg set -R . location ${LOCATION}
Note, you can find out which setters exist in a package and what their current values are by running:
kpt cfg list-setters .
-
Apply upgraded config connector:
make apply-kcc
Note, you can optionally also run
make apply-cluster
, but it should be the same as your existing management cluster. -
Check that your config connector upgrade is successful:
# For Kubeflow v1.2, it should be 1.29.0
$ kubectl get namespace cnrm-system -ojsonpath='{.metadata.annotations.cnrm\.cloud\.google\.com\/version}'
1.29.0
Upgrading Kubeflow cluster
DISCLAIMERS
To upgrade from specific versions of Kubeflow, you may need to take certain manual actions; refer to the specific sections in the guidelines below.
General instructions for upgrading Kubeflow cluster
-
The instructions below assume that:
-
Your current working directory is:
cd ${KF_DIR}
-
Your kubectl uses a context that connects to your Kubeflow cluster
# List your existing contexts
kubectl config get-contexts
# Use the context that connects to your Kubeflow cluster
kubectl config use-context ${KF_NAME}
-
-
Merge the new version of
googlecloudplatform/kubeflow-distribution
(example: v1.3.1); you don't need to do this again if you have already done so during the management cluster upgrade.
WORKING_BRANCH=<your-github-working-branch>
VERSION_TAG=<targeted-kubeflow-version-tag-on-github>
git checkout -b "${WORKING_BRANCH}"
git remote add upstream https://github.com/googlecloudplatform/kubeflow-distribution.git # This is one time only.
git fetch upstream
git merge "${VERSION_TAG}"
-
Change the
KUBEFLOW_MANIFESTS_VERSION
in./pull-upstream.sh
with the targeted kubeflow version same as$VERSION_TAG
. Run the following commands to pull new changes from upstreamkubeflow/manifests
.bash ./pull-upstream.sh
-
(Optional) If you only want to upgrade some of the Kubeflow components, you can comment out the components you don't want to upgrade in
kubeflow/config.yaml
file. Commands below will only apply the remaining components. -
Make sure you have checked in
build
folders for each component. The following command will change them so that you can compare the differences:
make hydrate
-
Once you confirm the changes are ready to apply, run the following command to upgrade Kubeflow cluster:
make apply
Note
Kubeflow on Google Cloud doesn't guarantee that the upgrade of each Kubeflow component always works with the general upgrade guide here. Refer to the corresponding repository in the Kubeflow org for upgrade support.
Upgrade Kubeflow cluster to v1.6
Starting from Kubeflow v1.6.0:
- Components with deprecated API versions were upgraded to support Google Kubernetes Engine v1.22. If you would like to upgrade your Google Kubernetes Engine cluster, follow GCP instructions.
- ASM was upgraded to v1.14. Follow the instructions on how to upgrade ASM (Anthos Service Mesh). If you want to use ASM version prior to 1.11, refer to the legacy instructions.
- Knative was upgraded to v1.2. Follow Knative instructions to check current version and see if the update includes any breaking changes.
- Cert-manager was upgraded to v1.5. To check your current version and see if the update includes any breaking changes, follow the cert-manager instructions.
- Deprecated kfserving component was removed. To upgrade to KServe, follow the KServe Migration guide.
Upgrade Kubeflow cluster to v1.5
In Kubeflow v1.5.1 we use ASM v1.13. See how to upgrade ASM. To use ASM versions prior to 1.11, follow the legacy instructions.
Starting from Kubeflow v1.5, the Kubeflow manifests include KServe as an independent component from kfserving, and the Google Cloud distribution has switched from kfserving to KServe for the default installed components. If you want to upgrade Kubeflow while keeping kfserving, you can comment out KServe and uncomment kfserving in the kubeflow-distribution/kubeflow/config.yaml
file. If you want to upgrade to KServe, follow the KServe Migration guide.
Upgrade Kubeflow cluster to v1.3
Due to the refactoring of kubeflow/manifests
repository, the way we depend on googlecloudplatform/kubeflow-distribution
has changed drastically. Upgrading the Kubeflow cluster to v1.3 is not supported, and individual component upgrades have been deferred to their corresponding repositories for support.
Upgrade Kubeflow cluster from v1.1 to v1.2
-
The instructions below assume
-
Your current working directory is:
cd ${KF_DIR}
-
Your kubectl uses a context that connects to your Kubeflow cluster:
# List your existing contexts
kubectl config get-contexts
# Use the context that connects to your Kubeflow cluster
kubectl config use-context ${KF_NAME}
-
-
(Recommended) Replace your
./Makefile
with the version in Kubeflowv1.2.0
: https://github.com/googlecloudplatform/kubeflow-distribution/blob/v1.2.0/kubeflow/Makefile.If you made any customizations in
./Makefile
, you should merge your changes with the upstream version.
This step is recommended because we introduced usability improvements and fixed compatibility with newer Kustomize versions (while remaining compatible with Kustomize v3.2.1) in the Makefile. However, the deployment process is backward-compatible, so this step is recommended but not required.
-
Update
./upstream/manifests
package:make update
-
Before applying new resources, you need to delete some immutable resources that were updated in this release:
kubectl delete statefulset kfserving-controller-manager -n kubeflow --wait
kubectl delete crds experiments.kubeflow suggestions.kubeflow trials.kubeflow
WARNING: This step deletes all Katib running resources.
Refer to a github comment in the v1.2 release issue for more details.
-
Redeploy:
make apply
To evaluate the changes before deploying them, you can:
- Run make hydrate.
- Compare the contents of .build with a historic version, using tools like git diff.
Upgrade ASM (Anthos Service Mesh)
If you want to upgrade ASM instead of the Kubeflow components, refer to [kubeflow/common/asm/README.md](https://github.com/googlecloudplatform/kubeflow-distribution/blob/master/kubeflow/asm/README.md) for the latest instructions on upgrading ASM. A detailed explanation is given below. Note: if you are going to upgrade the major or minor version of ASM, it is recommended to read the official ASM upgrade documentation before proceeding with the steps below.
Install a new ASM workload
In order to use the new ASM version, we need to download the corresponding ASM configuration package and asmcli
script. Get a list of available ASM packages and the corresponding asmcli
scripts by running the following command:
curl https://storage.googleapis.com/csm-artifacts/asm/ASMCLI_VERSIONS
It should return a list of ASM versions that can be installed with the asmcli script. To install older versions, refer to the legacy instructions. The returned list will have a format of ${ASM_PACKAGE_VERSION}:${ASMCLI_SCRIPT_VERSION}
. For example, in the following output:
...
1.13.2-asm.5+config2:asmcli_1.13.2-asm.5-config2
1.13.2-asm.5+config1:asmcli_1.13.2-asm.5-config1
1.13.2-asm.2+config2:asmcli_1.13.2-asm.2-config2
1.13.2-asm.2+config1:asmcli_1.13.2-asm.2-config1
1.13.1-asm.1+config1:asmcli_1.13.1-asm.1-config1
...
record 1.13.2-asm.5+config2:asmcli_1.13.2-asm.5-config2 corresponds to:
ASM_PACKAGE_VERSION=1.13.2-asm.5+config2
ASMCLI_SCRIPT_VERSION=asmcli_1.13.2-asm.5-config2
You need to set these two values in kubeflow/asm/Makefile. Then, run the following command in kubeflow/asm
directory to install the new ASM. Note, the old ASM will not be uninstalled.
make apply
Once installed successfully, you can see istiod Deployment
in your cluster with name in pattern istiod-asm-VERSION-REVISION
. For example, istiod-asm-1132-5
would correspond to ASM version 1.13.2-asm.5.
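To see which istiod revisions are currently installed in your cluster (and therefore which asm-VERSION-REVISION suffixes are available), you can list the deployments in the istio-system namespace and look for the istiod- prefix:
# List istiod deployments; the revision suffix maps to the ASM version.
kubectl -n istio-system get deployments | grep istiod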
Upgrade other Kubeflow components to use new ASM
There are multiple Kubeflow components with the ASM namespace label, including user-created namespaces. To upgrade them all at once, change the following line in kubeflow/env.sh
with the new ASM version asm-VERSION-REVISION
, like asm-1132-5
.
export ASM_LABEL=asm-1132-5
Then run the following commands in kubeflow/
directory to configure the environment variables:
source env.sh
Run the following command to configure kpt setter:
bash kpt-set.sh
Examine the change using source control after running the following command:
make hydrate
Refer to Deploying and redeploying workloads for the complete steps to adopt the new ASM version. As part of the instructions, you can run the following command to update namespaces’ labels across the cluster:
make apply
(Optional) Uninstall the old ASM workload
Once you have validated that the new ASM installation and sidecar injection for Kubeflow components are working as expected, you can Complete the transition to the new ASM or Rollback to the old ASM as instructed in Deploy and Redeploy workloads.
1.7 - Monitoring Cloud IAP Setup
Cloud Identity-Aware Proxy (Cloud IAP) is the recommended solution for accessing your Kubeflow deployment from outside the cluster, when running Kubeflow on Google Cloud.
This document is a step-by-step guide to ensuring that your IAP-secured endpoint is available, and to debugging problems that may cause the endpoint to be unavailable.
Introduction
When deploying Kubeflow using the command-line interface, you choose the authentication method you want to use. One of the options is Cloud IAP. This document assumes that you have already deployed Kubeflow.
Kubeflow uses the Google-managed certificate to provide an SSL certificate for the Kubeflow Ingress.
Cloud IAP gives you the following benefits:
- Users can log in using their Google Cloud accounts.
- You benefit from Google’s security expertise to protect your sensitive workloads.
Monitoring your Cloud IAP setup
Follow these instructions to monitor your Cloud IAP setup and troubleshoot any problems:
-
Examine the Ingress and Google Cloud Build (GCB) load balancer to make sure it is available:
kubectl -n istio-system describe ingress

Name:             envoy-ingress
Namespace:        kubeflow
Address:          35.244.132.160
Default backend:  default-http-backend:80 (10.20.0.10:8080)
Annotations:      ...
Events:
  Type     Reason     Age                 From                     Message
  ----     ------     ----                ----                     -------
  Normal   ADD        12m                 loadbalancer-controller  kubeflow/envoy-ingress
  Warning  Translate  12m (x10 over 12m)  loadbalancer-controller  error while evaluating the ingress spec: could not find service "kubeflow/envoy"
  Warning  Translate  12m (x2 over 12m)   loadbalancer-controller  error while evaluating the ingress spec: error getting BackendConfig for port "8080" on service "kubeflow/envoy", err: no BackendConfig for service port exists.
  Warning  Sync       12m                 loadbalancer-controller  Error during sync: Error running backend syncing routine: received errors when updating backend service: googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady googleapi: Error 400: The resource 'projects/code-search-demo/global/backendServices/k8s-be-32230--bee2fc38fcd6383f' is not ready, resourceNotReady
  Normal   CREATE     11m                 loadbalancer-controller  ip: 35.244.132.160
...
There should be an annotation indicating that we are using a managed certificate:
annotation: networking.gke.io/managed-certificates: gke-certificate
Any problems with creating the load balancer are reported as Kubernetes events in the results of the above
describe
command.
-
If the address isn’t set then there was a problem creating the load balancer.
-
The
CREATE
event indicates the load balancer was successfully created on the specified IP address. -
The most common error is running out of Google Cloud resource quota. To fix this problem, you must either increase the quota for the relevant resource on your Google Cloud project or delete some existing resources.
-
-
Verify that a managed certificate resource is generated:
kubectl describe -n istio-system managedcertificate gke-certificate
The status field should have information about the current status of the Certificate. Eventually, certificate status should be
Active
. -
Wait for the load balancer to report the back ends as healthy:
kubectl describe -n istio-system ingress envoy-ingress

...
Annotations:
  kubernetes.io/ingress.global-static-ip-name: kubeflow-ip
  kubernetes.io/tls-acme: true
  certmanager.k8s.io/issuer: letsencrypt-prod
  ingress.kubernetes.io/backends: {"k8s-be-31380--5e1566252944dfdb":"HEALTHY","k8s-be-32133--5e1566252944dfdb":"HEALTHY"}
...
Both backends should be reported as healthy. It can take several minutes for the load balancer to consider the back ends healthy.
The service with port
31380
is the one that handles Kubeflow traffic. (31380 is the default port of the serviceistio-ingressgateway
.)If the backend is unhealthy, check the pods in
istio-system
:kubectl get pods -n istio-system
- The
istio-ingressgateway-XX
pods should be running - Check the logs of pod
backend-updater-0
,iap-enabler-XX
to see if there is any error - Follow the steps here to check the load balancer and backend service on Google Cloud.
-
Try accessing Cloud IAP at the fully qualified domain name in your web browser:
https://<your-fully-qualified-domain-name>
If you get SSL errors when you log in, this typically means that your SSL certificate is still propagating. Wait a few minutes and try again. SSL propagation can take up to 10 minutes.
If you do not see a login prompt and you get a 404 error, the configuration of Cloud IAP is not yet complete. Keep retrying for up to 10 minutes.
-
If you get an error
Error: redirect_uri_mismatch
after logging in, this means the list of OAuth authorized redirect URIs does not include your domain.
The full error message looks like the following example and includes the relevant links:
The redirect URI in the request, https://<my_kubeflow>.endpoints.<my_project>.cloud.goog/_gcp_gatekeeper/authenticate, does not match the ones authorized for the OAuth client. To update the authorized redirect URIs, visit: https://console.developers.google.com/apis/credentials/oauthclient/22222222222-7meeee7a9a76jvg54j0g2lv8lrsb4l8g.apps.googleusercontent.com?project=22222222222
Follow the link in the error message to find the OAuth credential being used and add the redirect URI listed in the error message to the list of authorized URIs. For more information, read the guide to setting up OAuth for Cloud IAP.
Next steps
- The Google Kubernetes Engine troubleshooting guide for Kubeflow.
- How to customize Kubeflow cluster and add users to the cluster.
- Google Cloud guide to Cloud IAP.
1.8 - Deleting Kubeflow
This page explains how to delete your Kubeflow cluster or management cluster on Google Cloud.
Before you start
This guide assumes the following settings:
-
For Management cluster: The
${MGMT_PROJECT}
,${MGMT_DIR}
and${MGMT_NAME}
environment variables are the same as in Deploy Management cluster. -
For Kubeflow cluster: The
${KF_PROJECT}
,${KF_NAME}
and${MGMTCTXT}
environment variables are the same as in Deploy Kubeflow cluster. -
The
${KF_DIR}
environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example,/opt/kubeflow-distribution/kubeflow/
.
Deleting your Kubeflow cluster
-
To delete the applications running in the Kubeflow namespace, remove that namespace:
kubectl delete namespace kubeflow
-
To delete the cluster and all Google Cloud resources, run the following commands:
cd "${KF_DIR}"
make delete
Warning: this will delete the persistent disks storing metadata. If you want to preserve the disks, don’t run this command; instead selectively delete only those resources you want to delete.
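If you want to review which persistent disks exist before deciding, you can list them first; this is a read-only check:
# Review the persistent disks in the Kubeflow project before deleting anything.
gcloud compute disks list --project="${KF_PROJECT}"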
Clean up your management cluster
The following instructions describe how to clean up all resources created when installing the management cluster in the management project, and when using the management cluster to manage Google Cloud resources in managed Kubeflow projects.
Delete or keep managed Google Cloud resources
There are Google Cloud resources managed by Config Connector in the management cluster after you deploy Kubeflow clusters with this management cluster.
To delete all the managed Google Cloud resources, delete the managed project namespace:
kubectl config use-context "${MGMTCTXT}"
kubectl delete namespace --wait "${KF_PROJECT}"
To keep all the managed Google Cloud resources, you can delete the management cluster directly.
If you need fine-grained control, refer to Config Connector: Keeping resources after deletion for more details.
After deleting the Config Connector resources for a managed project, you can revoke the IAM permission that lets the management cluster manage the project:
gcloud projects remove-iam-policy-binding "${KF_PROJECT}" \
"--member=serviceAccount:${MGMT_NAME}-cnrm-system@${MGMT_PROJECT}.iam.gserviceaccount.com" \
--role=roles/owner
Delete management cluster
To delete the Google service account and the management cluster:
cd "${MGMT_DIR}"
make delete-cluster
Starting from Kubeflow v1.5, Google Cloud distribution has switched to Config Controller for Google-managed Management cluster. You can learn more detail by reading Delete your Config Controller.
Note, after deleting the management cluster, all the managed Google Cloud
resources will be kept. You will be responsible for managing them by yourself.
If you want to delete the managed Google Cloud resources, make sure to delete resources in the ${KF_PROJECT}
namespace in the management cluster first.
You can learn more about the ${KF_PROJECT}
namespace in kubeflow-distribution/kubeflow/kcc
folder.
You can create a management cluster to manage them again if you apply the same Config Connector resources. Refer to Managing and deleting resources - Acquiring an existing resource.
2 - Pipelines on Google Cloud
2.1 - Connecting to Kubeflow Pipelines on Google Cloud using the SDK
This guide describes how to connect to your Kubeflow Pipelines cluster on Google Cloud using the Kubeflow Pipelines SDK.
Before you begin
- You need a Kubeflow Pipelines deployment on Google Cloud using one of the installation options.
- Install the Kubeflow Pipelines SDK.
How SDK connects to Kubeflow Pipelines API
Kubeflow Pipelines includes an API service named ml-pipeline-ui
. The
ml-pipeline-ui
API service is deployed in the same Kubernetes namespace you
deployed Kubeflow Pipelines in.
The Kubeflow Pipelines SDK can send REST API requests to this API service, but the SDK needs to know the hostname to connect to the API service.
If the hostname can be accessed without authentication, it’s very simple to
connect to it. For example, you can use kubectl port-forward
to access it via
localhost:
# The Kubeflow Pipelines API service and the UI is available at
# http://localhost:3000 without authentication check.
$ kubectl port-forward svc/ml-pipeline-ui 3000:80 --namespace kubeflow
# Change the namespace if you deployed Kubeflow Pipelines in a different
# namespace.
import kfp
client = kfp.Client(host='http://localhost:3000')
When deploying Kubeflow Pipelines on Google Cloud, a public endpoint for this API service is auto-configured for you, but this public endpoint has security checks to protect your cluster from unauthorized access.
The following sections introduce how to authenticate your SDK requests to connect to Kubeflow Pipelines via the public endpoint.
Connecting to Kubeflow Pipelines standalone or AI Platform Pipelines
Refer to Connecting to AI Platform Pipelines using the Kubeflow Pipelines SDK for both Kubeflow Pipelines standalone and AI Platform Pipelines.
Kubeflow Pipelines standalone deployments also show up in AI Platform Pipelines. They have the
name “pipeline” by default, but you can customize the name by overriding
the appName
parameter in params.env
when deploying Kubeflow Pipelines standalone.
Connecting to Kubeflow Pipelines in a full Kubeflow deployment
A full Kubeflow deployment on Google Cloud uses an Identity-Aware Proxy (IAP) to manage access to the public Kubeflow endpoint. The steps below let you connect to Kubeflow Pipelines in a full Kubeflow deployment with authentication through IAP.
-
Find out your IAP OAuth 2.0 client ID.
You or your cluster admin followed Set up OAuth for Cloud IAP to deploy your full Kubeflow deployment on Google Cloud. You need the OAuth client ID created in that step.
You can browse all of your existing OAuth client IDs in the Credentials page of Google Cloud Console.
-
Create another SDK OAuth Client ID for authenticating Kubeflow Pipelines SDK users. Follow the steps to set up a client ID to authenticate from a desktop app. Take a note of the client ID and client secret. This client ID and secret can be shared among all SDK users, because a separate login step is still needed below.
-
To connect to the Kubeflow Pipelines public endpoint, initialize the SDK client like the following:
import kfp
client = kfp.Client(host='https://<KF_NAME>.endpoints.<PROJECT>.cloud.goog/pipeline',
                    client_id='<AAAAAAAAAAAAAAAAAAAAAA>.apps.googleusercontent.com',
                    other_client_id='<BBBBBBBBBBBBBBBBBBB>.apps.googleusercontent.com',
                    other_client_secret='<CCCCCCCCCCCCCCCCCCCC>')
- Pass your IAP OAuth client ID found in step 1 to
client_id
argument. - Pass your SDK OAuth client ID and secret created in step 2 to
other_client_id
andother_client_secret
arguments.
- Pass your IAP OAuth client ID found in step 1 to
-
When you initialize the SDK client for the first time, you will be asked to log in. The Kubeflow Pipelines SDK stores the obtained credentials in
$HOME/.config/kfp/credentials.json
. You do not need to log in again unless you manually delete the credentials file.
To use the SDK from cron tasks where you cannot log in manually, you can copy the credentials file `$HOME/.config/kfp/credentials.json` to another machine. However, you should keep the credentials safe and never expose them to third parties.
-
After login, you can use the client.
print(client.list_pipelines())
Troubleshooting
-
Error “Failed to authorize with API resource references: there is no user identity header” when using SDK methods.
Direct access to the API service without authentication works for Kubeflow Pipelines standalone, AI Platform Pipelines, and Kubeflow 1.0 or earlier.
However, it fails authorization checks for Kubeflow Pipelines with multi-user isolation in the full Kubeflow deployment starting from Kubeflow 1.1. Multi-user isolation requires all API access to authenticate as a user. Refer to Kubeflow Pipelines Multi-user isolation documentation for more details.
2.2 - Authenticating Pipelines to Google Cloud
This page describes authentication for Kubeflow Pipelines to Google Cloud. The options listed below have different tradeoffs; choose the one that fits your use case.
- Configuring your cluster to access Google Cloud using the Compute Engine default service account with the "cloud-platform" scope is easier to set up than the other options. However, this approach grants excessive permissions, so it is not suitable if you need workload permission separation.
- Workload Identity takes more effort to set up, but allows fine-grained permission control. It is recommended for production use cases.
- Google service account keys stored as Kubernetes secrets are the legacy approach and are no longer recommended in Google Kubernetes Engine. However, this is the only option for using Google Cloud APIs when your cluster is an Anthos or on-premises cluster.
Before you begin
There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that authentication support and cluster setup instructions will vary depending on the method you used to install Kubeflow Pipelines.
- For Kubeflow Pipelines standalone, you can compare and choose from all 3 options.
- For full Kubeflow starting from Kubeflow 1.1, Workload Identity is the recommended and default option.
- For AI Platform Pipelines, Compute Engine default service account is the only supported option.
Compute Engine default service account
This is good for trying out Kubeflow Pipelines, because it is easy to set up.
However, it does not support permission separation for workloads in the cluster. Any workload in the cluster will be able to call any Google Cloud APIs in the chosen scope.
Cluster setup to use Compute Engine default service account
By default, your Google Kubernetes Engine nodes use the Compute Engine default service account. If you allowed the cloud-platform
scope when creating the cluster,
Kubeflow Pipelines can authenticate to Google Cloud and manage resources in your project without further configuration.
Use one of the following options to create a Google Kubernetes Engine cluster that uses the Compute Engine default service account:
- If you followed the instructions in Setting up AI Platform Pipelines and checked Allow access to the following Cloud APIs, your cluster is already using the Compute Engine default service account.
- In the Google Cloud Console UI, you can enable it in Create a Kubernetes cluster -> default-pool -> Security -> Access Scopes -> Allow full access to all Cloud APIs.
- Using the gcloud CLI, you can enable it with --scopes cloud-platform like the following:
gcloud container clusters create <cluster-name> \
--scopes cloud-platform
Please refer to gcloud container clusters create command documentation for other available options.
Authoring pipelines to use default service account
Pipelines don’t need any specific changes to authenticate to Google Cloud; they use the default service account transparently.
However, you must update existing pipelines that use the use_gcp_secret kfp sdk operator. Remove the use_gcp_secret
usage to let your pipeline authenticate to Google Cloud using the default service account.
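For illustration, here is a minimal sketch (not taken from the official samples) of a pipeline step that relies on the default service account. The bucket name, component, and pipeline names are hypothetical; it assumes the kfp v1 SDK and that the google-cloud-storage package can be installed into the step's container.
import kfp.dsl as dsl
from kfp.components import func_to_container_op

def count_blobs(bucket: str) -> int:
    # No explicit credentials and no use_gcp_secret: the client falls back to
    # the node's Compute Engine default service account.
    from google.cloud import storage
    client = storage.Client()
    return len(list(client.list_blobs(bucket)))

# Hypothetical component built from the function above; the library is
# installed into the step container at runtime.
count_blobs_op = func_to_container_op(
    count_blobs, packages_to_install=['google-cloud-storage'])

@dsl.pipeline(name='default-sa-sketch')
def pipeline(bucket: str = 'your-bucket'):
    count_blobs_op(bucket)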
Securing the cluster with fine-grained Google Cloud permission control
Workload Identity
Workload Identity is the recommended way for your Google Kubernetes Engine applications to consume services provided by Google APIs. You accomplish this by configuring a Kubernetes service account to act as a Google service account. Any Pods running as the Kubernetes service account then use the Google service account to authenticate to cloud services.
Referenced from Workload Identity Documentation. Please read this doc for:
- A detailed introduction to Workload Identity.
- Instructions to enable it on your cluster.
- Whether its limitations affect your adoption.
Terminology
This document distinguishes between Kubernetes service accounts (KSAs) and Google service accounts (GSAs). KSAs are Kubernetes resources, while GSAs are specific to Google Cloud. Other documentation usually refers to both of them as just “service accounts”.
Authoring pipelines to use Workload Identity
Pipelines don’t need any specific changes to authenticate to Google Cloud. With Workload Identity, pipelines run as the Google service account that is bound to the KSA.
However, existing pipelines that use the use_gcp_secret kfp SDK operator need to remove the use_gcp_secret
usage to use the bound GSA.
You can also continue to use use_gcp_secret
in a cluster with Workload Identity enabled and use_gcp_secret
will take precedence for those workloads.
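As a hedged sketch of what this migration looks like in kfp v1 DSL code (the container image is a placeholder, not a real component):
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(name='workload-identity-sketch')
def pipeline():
    # Hypothetical training image.
    step = dsl.ContainerOp(name='train',
                           image='gcr.io/your-project/train:latest')
    # Before, with a GSA key stored as a Kubernetes secret:
    #     step.apply(gcp.use_gcp_secret('user-gcp-sa'))
    # After, with Workload Identity: no credential-specific code is needed;
    # the step runs as the GSA bound to the KSA used by your Kubeflow
    # Pipelines installation (for example, default-editor in full Kubeflow).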
Cluster setup to use Workload Identity for Full Kubeflow
Starting from Kubeflow 1.1, Kubeflow Pipelines supports multi-user isolation. Therefore, pipeline runs are executed in user namespaces using the default-editor
KSA. The default-editor
KSA is auto-bound to the GSA specified in the user profile, which defaults to a shared GSA ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com
.
If you want to bind the default-editor
KSA with a different GSA for a specific namespace, refer to the In-cluster authentication to Google Cloud guide.
Additionally, the Kubeflow Pipelines UI, visualization, and TensorBoard server instances are deployed in your user namespace using the default-editor
KSA. Therefore, to visualize results in the Pipelines UI, these services fetch artifacts from Google Cloud Storage using the permissions of the same GSA you configured for this namespace.
Cluster setup to use Workload Identity for Pipelines Standalone
1. Create your cluster with Workload Identity enabled
-
In Google Cloud Console UI, you can enable Workload Identity in
Create a Kubernetes cluster -> Security -> Enable Workload Identity
like the following: -
Using
gcloud
CLI, you can enable it with:
gcloud beta container clusters create <cluster-name> \
--release-channel regular \
--workload-pool=project-id.svc.id.goog
2. Deploy Kubeflow Pipelines
Deploy via Pipelines Standalone as usual.
3. Bind Workload Identities for KSAs used by Kubeflow Pipelines
The following helper bash scripts bind Workload Identities for KSAs used by Kubeflow Pipelines:
- gcp-workload-identity-setup.sh helps you create GSAs and bind them to KSAs used by pipelines workloads. This script provides an interactive command line dialog with explanation messages.
- wi-utils.sh alternatively provides minimal utility bash functions that let you customize your setup. The minimal utilities make it easy to read and use programmatically.
For example, to get a default setup using gcp-workload-identity-setup.sh
, you can run:
$ curl -O https://raw.githubusercontent.com/kubeflow/pipelines/master/manifests/kustomize/gcp-workload-identity-setup.sh
$ chmod +x ./gcp-workload-identity-setup.sh
$ ./gcp-workload-identity-setup.sh
# This prints the command's usage example and introduction.
# Then you can run the command with required parameters.
# Command output will tell you which GSAs and Workload Identity bindings have been
# created.
4. Configure IAM permissions of used GSAs
If you used gcp-workload-identity-setup.sh
to bind Workload Identities for your cluster, you can simply add the following IAM bindings:
- Give GSA
<cluster-name>-kfp-system@<project-id>.iam.gserviceaccount.com
Storage Object Viewer
role to let UI load data in GCS in the same project. - Give GSA
<cluster-name>-kfp-user@<project-id>.iam.gserviceaccount.com
any permissions your pipelines need. For quick tryouts, you can give it the Project Editor
role for all permissions.
If you configured bindings by yourself, here are Google Cloud permission requirements for KFP KSAs:
- Pipelines use
pipeline-runner
KSA. Configure IAM permissions of the GSA bound to this KSA to allow pipelines to use Google Cloud APIs. - Pipelines UI uses
ml-pipeline-ui
KSA. Pipelines Visualization Server uses ml-pipeline-visualizationserver
KSA. If you need to view artifacts and visualizations stored in Google Cloud Storage (GCS) from pipelines UI, you should add Storage Object Viewer permission (or the minimal required permission) to their bound GSAs.
Google service account keys stored as Kubernetes secrets
It is recommended to use Workload Identity for easier and more secure management, but you can also choose to use GSA keys.
Authoring pipelines to use GSA keys
Each pipeline step describes a
container that is run independently. If you want to grant access for a single step to use
one of your service accounts, you can use
kfp.gcp.use_gcp_secret()
.
Examples for how to use this function can be found in the
Kubeflow examples repo.
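As a brief, hedged sketch of applying this operator in kfp v1 DSL (the step's image and command are placeholders; the secret name user-gcp-sa matches the setup described below):
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(name='gsa-key-sketch')
def pipeline():
    step = dsl.ContainerOp(
        name='list-buckets',
        image='google/cloud-sdk:alpine',
        command=['sh', '-c', 'gsutil ls'],
    )
    # Mount the user-gcp-sa Kubernetes secret into this step only and point
    # GOOGLE_APPLICATION_CREDENTIALS at the key file inside it.
    step.apply(gcp.use_gcp_secret('user-gcp-sa'))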
Cluster setup to use use_gcp_secret for Full Kubeflow
From Kubeflow 1.1, there’s no longer a user-gcp-sa
secret deployed for you. We recommend using Workload Identity instead.
For Kubeflow 1.0 or earlier, you don’t need to do anything. Full Kubeflow deployment has already deployed the user-gcp-sa
secret for you.
Cluster setup to use use_gcp_secret for Pipelines Standalone
Pipelines Standalone requires manual setup of the user-gcp-sa
secret used by use_gcp_secret
.
Instructions to set up the secret:
-
First, download a key for the GCE VM service account (refer to Google Cloud documentation for more information):
gcloud iam service-accounts keys create application_default_credentials.json \ --iam-account [SA-NAME]@[PROJECT-ID].iam.gserviceaccount.com
-
Run:
kubectl create secret -n [your-namespace] generic user-gcp-sa \ --from-file=user-gcp-sa.json=application_default_credentials.json
2.3 - Upgrading
Before you begin
There are various options on how to install Kubeflow Pipelines in the Installation Options for Kubeflow Pipelines guide. Be aware that upgrade support and instructions will vary depending on the method you used to install Kubeflow Pipelines.
Upgrade-related feature matrix
Installation \ Features | In-place upgrade | Reinstallation on the same cluster | Reinstallation on a different cluster | User customizations across upgrades (via Kustomize) |
---|---|---|---|---|
Standalone | ✅ | ⚠️ Data is deleted by default. | ❌ | |
Standalone (managed storage) | ✅ | ✅ | ✅ | ✅ |
full Kubeflow (>= v1.1) | ✅ | ❌ | Needs documentation | ✅ |
full Kubeflow (< v1.1) | ❌ | ❌ | | |
AI Platform Pipelines | | ✅ | | |
AI Platform Pipelines (managed storage) | | ✅ | ✅ | |
Notes:
- When you deploy Kubeflow Pipelines with managed storage on Google Cloud, your pipeline’s metadata and artifacts are stored in Cloud Storage and Cloud SQL. Using managed storage makes it easier to manage, back up, and restore Kubeflow Pipelines data.
Kubeflow Pipelines Standalone
Upgrade Support for Kubeflow Pipelines Standalone is in Beta.
Upgrading Kubeflow Pipelines Standalone introduces how to upgrade in-place.
Full Kubeflow
On Google Cloud, the full Kubeflow deployment follows the package pattern starting from Kubeflow 1.1.
The package pattern enables you to upgrade the full Kubeflow in-place while keeping user customizations; refer to the Upgrade Kubeflow on Google Cloud documentation for instructions.
However, there’s no current support to upgrade from Kubeflow 1.0 or earlier to Kubeflow 1.1 while keeping Kubeflow Pipelines data. This may change in the future, so provide your feedback in kubeflow/pipelines#4346 on GitHub.
AI Platform Pipelines
Upgrade Support for AI Platform Pipelines is in Alpha.
Warning
Kubeflow Pipelines Standalone deployments also show up in the AI Platform Pipelines dashboard. DO NOT follow the instructions below if you deployed Kubeflow Pipelines using the standalone deployment, because data is deleted by default when a Kubeflow Pipelines Standalone deployment is deleted.

Below are the steps that describe how to upgrade your AI Platform Pipelines instance while keeping existing data:
For instances without managed storage:
- Delete your AI Platform Pipelines instance WITHOUT selecting Delete cluster. The persisted artifacts and database data are stored in persistent volumes in the cluster. They are kept by default when you do not delete the cluster.
- Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same Google Kubernetes Engine cluster, namespace, and application name. Persisted data will be automatically picked up during reinstallation.
For instances with managed storage:
- Delete your AI Platform Pipelines instance.
- If you are upgrading from Kubeflow Pipelines 0.5.1, note that a Cloud Storage bucket is required starting from 1.0.0. Previously deployed instances should already be using such a bucket; browse your Cloud Storage buckets to find your existing bucket name and provide it in the next step.
- Reinstall Kubeflow Pipelines from the Google Cloud Marketplace using the same application name and managed storage options as before. You can freely install it in any cluster and namespace (not necessarily the same as before), because persisted artifacts and database data are stored in managed storage (Cloud Storage and Cloud SQL) and will be automatically picked up during reinstallation.
2.4 - Enabling GPU and TPU
This page describes how to enable GPU or TPU for a pipeline on Google Kubernetes Engine by using the Pipelines DSL language.
Prerequisites
To enable GPU and TPU on your Kubeflow cluster, follow the instructions on how to customize the Google Kubernetes Engine cluster for Kubeflow before setting up the cluster.
Configure ContainerOp to consume GPUs
After enabling the GPU, the Kubeflow setup script installs a default GPU pool with type nvidia-tesla-k80 with auto-scaling enabled. The following code consumes 2 GPUs in a ContainerOp.
import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
The code above will be compiled into Kubernetes Pod spec:
container:
...
resources:
limits:
nvidia.com/gpu: "2"
If the cluster has multiple node pools with different GPU types, you can specify the GPU type by the following code.
import kfp.dsl as dsl
gpu_op = dsl.ContainerOp(name='gpu-op', ...).set_gpu_limit(2)
gpu_op.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-p4')
The code above will be compiled into Kubernetes Pod spec:
container:
...
resources:
limits:
nvidia.com/gpu: "2"
nodeSelector:
cloud.google.com/gke-accelerator: nvidia-tesla-p4
See GPU tutorial for a complete example to build a Kubeflow pipeline that uses GPUs.
Check the Google Kubernetes Engine GPU guide to learn more about GPU settings.
Configure ContainerOp to consume TPUs
Use the following code to configure ContainerOp to consume TPUs on Google Kubernetes Engine:
import kfp.dsl as dsl
import kfp.gcp as gcp
tpu_op = dsl.ContainerOp(name='tpu-op', ...).apply(gcp.use_tpu(
tpu_cores = 8, tpu_resource = 'v2', tf_version = '1.12'))
The above code uses 8 v2 TPUs with TensorFlow version 1.12. The code above will be compiled into the following Kubernetes Pod spec:
container:
...
resources:
limits:
cloud-tpus.google.com/v2: "8"
metadata:
annotations:
tf-version.cloud-tpus.google.com: "1.12"
To learn more, see an example pipeline that uses a preemptible node pool with TPU or GPU.
See the Google Kubernetes Engine TPU Guide to learn more about TPU settings.
2.5 - Using Preemptible VMs and GPUs on Google Cloud
This document describes how to configure preemptible virtual machines (preemptible VMs) and GPUs on preemptible VM instances (preemptible GPUs) for your workflows running on Kubeflow Pipelines on Google Cloud.
Introduction
Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.
GPUs attached to preemptible instances (preemptible GPUs) work like normal GPUs but persist only for the life of the instance.
Using preemptible VMs and GPUs can reduce costs on Google Cloud. In addition to using preemptible VMs, your Google Kubernetes Engine (GKE) cluster can autoscale based on current workloads.
This guide assumes that you have already deployed Kubeflow Pipelines. If not, follow the guide to deploying Kubeflow on Google Cloud.
Before you start
The variables defined in this page can be found in kubeflow-distribution/kubeflow/env.sh. They have the same values that you set for your Kubeflow deployment.
Using preemptible VMs with Kubeflow Pipelines
In summary, the steps to schedule a pipeline to run on preemptible VMs are as follows:
- Create a node pool in your cluster that contains preemptible VMs.
- Configure your pipelines to run on the preemptible VMs.
The following sections contain more detail about the above steps.
1. Create a node pool with preemptible VMs
Create a preemptible-nodepool.yaml
as below and fill in all placeholder content KF_NAME
, KF_PROJECT
, LOCATION
:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
labels:
kf-name: KF_NAME # kpt-set: ${name}
name: PREEMPTIBLE_CPU_POOL
namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
location: LOCATION # kpt-set: ${location}
initialNodeCount: 1
autoscaling:
minNodeCount: 0
maxNodeCount: 5
nodeConfig:
machineType: n1-standard-4
diskSizeGb: 100
diskType: pd-standard
preemptible: true
taint:
- effect: NO_SCHEDULE
key: preemptible
value: "true"
oauthScopes:
- "https://www.googleapis.com/auth/logging.write"
- "https://www.googleapis.com/auth/monitoring"
- "https://www.googleapis.com/auth/devstorage.read_only"
serviceAccountRef:
external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
metadata:
disable-legacy-endpoints: "true"
management:
autoRepair: true
autoUpgrade: true
clusterRef:
name: KF_NAME # kpt-set: ${name}
namespace: KF_PROJECT # kpt-set: ${name}
Where:
- PREEMPTIBLE_CPU_POOL is the name of the node pool.
- KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
- KF_PROJECT is the name of your Kubeflow Google Cloud project.
- LOCATION is the zone or region of this node pool, for example: us-west1-b.
- KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account. Replace KF_NAME and KF_PROJECT in this pattern with the values above to get the VM service account that was already created during the Kubeflow cluster deployment.
Apply the nodepool patch file above by running:
kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} apply -f <path-to-nodepool-file>/preemptible-nodepool.yaml
For Kubeflow Pipelines standalone only
Alternatively, if you are on Kubeflow Pipelines standalone or AI Platform Pipelines, you can run this command to create a node pool:
gcloud container node-pools create PREEMPTIBLE_CPU_POOL \
--cluster=CLUSTER_NAME \
--enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com
Below is an example command:
gcloud container node-pools create preemptible-cpu-pool \
--cluster=user-4-18 \
--enable-autoscaling --max-nodes=4 --min-nodes=0 \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com
2. Schedule your pipeline to run on the preemptible VMs
After configuring a node pool with preemptible VMs, you must configure your pipelines to run on the preemptible VMs.
In the DSL code for
your pipeline, add the following to the ContainerOp
instance:
.apply(gcp.use_preemptible_nodepool())
The above function works for both methods of generating the ContainerOp
:
- The
ContainerOp
generated from kfp.components.func_to_container_op
. - The
ContainerOp
generated from the task factory function, which is loaded by components.load_component_from_url
.
Note:
- Call
.set_retry(#NUM_RETRY)
on yourContainerOp
to retry the task after the task is preempted. - If you modified the
node taint
when creating the node pool, pass the same node toleration to the
use_preemptible_nodepool()
function. use_preemptible_nodepool()
also accepts a parameterhard_constraint
. When thehard_constraint
isTrue
, the system will strictly schedule the task in preemptible VMs. When thehard_constraint
isFalse
, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.
For example:
import kfp.dsl as dsl
import kfp.gcp as gcp
class FlipCoinOp(dsl.ContainerOp):
"""Flip a coin and output heads or tails randomly."""
def __init__(self):
super(FlipCoinOp, self).__init__(
name='Flip',
image='python:alpine3.6',
command=['sh', '-c'],
arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
'else \'tails\'; print(result)" | tee /tmp/output'],
file_outputs={'output': '/tmp/output'})
@dsl.pipeline(
name='pipeline flip coin',
description='shows how to use dsl.Condition.'
)
def flipcoin():
flip = FlipCoinOp().apply(gcp.use_preemptible_nodepool())
if __name__ == '__main__':
import kfp.compiler as compiler
compiler.Compiler().compile(flipcoin, __file__ + '.zip')
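The following hedged sketch combines the notes above: it builds the step with kfp.components.func_to_container_op, retries after preemption, and passes a custom toleration plus the hard_constraint flag. The taint key my-preemptible is an assumption for illustration; use whatever taint you actually set on your node pool.
import kfp.dsl as dsl
import kfp.gcp as gcp
from kfp.components import func_to_container_op
from kubernetes.client import V1Toleration

def train():
    print('training on a preemptible VM')

train_op = func_to_container_op(train)

@dsl.pipeline(name='preemptible-options-sketch')
def pipeline():
    step = train_op()
    # Tolerate the custom taint (assumed key: my-preemptible) and require
    # preemptible nodes only.
    step.apply(gcp.use_preemptible_nodepool(
        toleration=V1Toleration(effect='NoSchedule',
                                key='my-preemptible',
                                operator='Equal',
                                value='true'),
        hard_constraint=True))
    # Re-run the task up to 3 times if the VM is preempted.
    step.set_retry(3)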
Using preemptible GPUs with Kubeflow Pipelines
This guide assumes that you have already deployed Kubeflow Pipelines. In summary, the steps to schedule a pipeline to run with preemptible GPUs are as follows:
- Make sure you have enough GPU quota.
- Create a node pool in your Google Kubernetes Engine cluster that contains preemptible VMs with preemptible GPUs.
- Configure your pipelines to run on the preemptible VMs with preemptible GPUs.
The following sections contain more detail about the above steps.
1. Make sure you have enough GPU quota
Add GPU quota to your Google Cloud project. The Google Cloud documentation lists the availability of GPUs across regions. To check the available quota for resources in your project, go to the Quotas page in the Google Cloud Console.
2. Create a node pool of preemptible VMs with preemptible GPUs
Create a preemptible-gpu-nodepool.yaml
as below and fill in all placeholder content:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
labels:
kf-name: KF_NAME # kpt-set: ${name}
name: KF_NAME-containernodepool-gpu
namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
location: LOCATION # kpt-set: ${location}
initialNodeCount: 1
autoscaling:
minNodeCount: 0
maxNodeCount: 5
nodeConfig:
machineType: n1-standard-4
diskSizeGb: 100
diskType: pd-standard
preemptible: true
oauthScopes:
- "https://www.googleapis.com/auth/logging.write"
- "https://www.googleapis.com/auth/monitoring"
- "https://www.googleapis.com/auth/devstorage.read_only"
serviceAccountRef:
external: KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com # kpt-set: ${name}-vm@${gcloud.core.project}.iam.gserviceaccount.com
guestAccelerator:
- type: "nvidia-tesla-k80"
count: 1
metadata:
disable-legacy-endpoints: "true"
management:
autoRepair: true
autoUpgrade: true
clusterRef:
name: KF_NAME # kpt-set: ${name}
namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
Where:
- KF_NAME-containernodepool-gpu is the name of the node pool.
- KF_NAME is the name of the Kubeflow Google Kubernetes Engine cluster.
- KF_PROJECT is the name of your Kubeflow Google Cloud project.
- LOCATION is the zone or region of this node pool, for example: us-west1-b.
- KF_NAME-vm@KF_PROJECT.iam.gserviceaccount.com is your service account. Replace KF_NAME and KF_PROJECT in this pattern with the values above to get the VM service account that was already created during the Kubeflow cluster deployment.
For Kubeflow Pipelines standalone only
Alternatively, if you are on Kubeflow Pipelines standalone or AI Platform Pipelines, you can run this command to create a node pool:
gcloud container node-pools create PREEMPTIBLE_GPU_POOL \
--cluster=CLUSTER_NAME \
--enable-autoscaling --max-nodes=MAX_NODES --min-nodes=MIN_NODES \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=DEPLOYMENT_NAME-vm@PROJECT_NAME.iam.gserviceaccount.com \
--accelerator=type=GPU_TYPE,count=GPU_COUNT
Below is an example command:
gcloud container node-pools create preemptible-gpu-pool \
--cluster=user-4-18 \
--enable-autoscaling --max-nodes=4 --min-nodes=0 \
--preemptible \
--node-taints=preemptible=true:NoSchedule \
--service-account=user-4-18-vm@ml-pipeline-project.iam.gserviceaccount.com \
--accelerator=type=nvidia-tesla-t4,count=2
3. Schedule your pipeline to run on the preemptible VMs with preemptible GPUs
In the DSL code for
your pipeline, add the following to the ContainerOp
instance:
.apply(gcp.use_preemptible_nodepool())
The above function works for both methods of generating the ContainerOp
:
- The
ContainerOp
generated from kfp.components.func_to_container_op
. - The
ContainerOp
generated from the task factory function, which is loaded by components.load_component_from_url
.
Note:
- Call
.set_gpu_limit(#NUM_GPUs, GPU_VENDOR)
on yourContainerOp
to specify the GPU limit (for example,1
) and vendor (for example,'nvidia'
). - Call
.set_retry(#NUM_RETRY)
on yourContainerOp
to retry the task after the task is preempted. - If you modified the
node taint
when creating the node pool, pass the same node toleration to the
use_preemptible_nodepool()
function. use_preemptible_nodepool()
also accepts a parameterhard_constraint
. When thehard_constraint
isTrue
, the system will strictly schedule the task in preemptible VMs. When thehard_constraint
isFalse
, the system will try to schedule the task in preemptible VMs. If it cannot find the preemptible VMs, or the preemptible VMs are busy, the system will schedule the task in normal VMs.
For example:
import kfp.dsl as dsl
import kfp.gcp as gcp
class FlipCoinOp(dsl.ContainerOp):
"""Flip a coin and output heads or tails randomly."""
def __init__(self):
super(FlipCoinOp, self).__init__(
name='Flip',
image='python:alpine3.6',
command=['sh', '-c'],
arguments=['python -c "import random; result = \'heads\' if random.randint(0,1) == 0 '
'else \'tails\'; print(result)" | tee /tmp/output'],
file_outputs={'output': '/tmp/output'})
@dsl.pipeline(
name='pipeline flip coin',
description='shows how to use dsl.Condition.'
)
def flipcoin():
flip = FlipCoinOp().set_gpu_limit(1, 'nvidia').apply(gcp.use_preemptible_nodepool())
if __name__ == '__main__':
import kfp.compiler as compiler
compiler.Compiler().compile(flipcoin, __file__ + '.zip')
Debugging
Run the following command if your node pool didn’t show up or had an error during provisioning:
kubectl --context=${MGMTCTXT} --namespace=${KF_PROJECT} describe containernodepool -l kf-name=${KF_NAME}
Next steps
- Explore further options for customizing Kubeflow on Google Cloud.
- See how to build pipelines with the SDK.
3 - Customize Kubeflow on Google Cloud
This guide describes how to customize your deployment of Kubeflow on Google Kubernetes Engine (GKE) on Google Cloud.
Before you start
The variables defined in this page can be found in kubeflow-distribution/kubeflow/env.sh. They have the same values that you set for your Kubeflow deployment.
Customizing Kubeflow before deployment
The Kubeflow deployment process is divided into two steps, hydrate and apply, so that you can modify your configuration before deploying your Kubeflow cluster.
Follow the guide to deploying Kubeflow on Google Cloud. You can add your patches in corresponding component folder, and include those patches in kustomization.yaml
file. Learn more about the usage of kustomize. You can also find the existing kustomization in googlecloudplatform/kubeflow-distribution as example. After adding the patches, you can run make hydrate
to validate the resulting resources. Finally, you can run make apply
to deploy the customized Kubeflow.
Customizing an existing deployment
You can also customize an existing Kubeflow deployment. In that case, this guide assumes that you have already followed the guide to deploying Kubeflow on Google Cloud and have deployed Kubeflow to a Google Kubernetes Engine cluster.
Before you start
This guide assumes the following settings:
-
The
${KF_DIR}
environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example,/opt/kubeflow-distribution/kubeflow/
.export KF_DIR=<path to your Kubeflow application directory> cd "${KF_DIR}"
-
Make sure your environment variables are set up for the Kubeflow cluster you want to customize. For further background about the settings, see the guide to deploying Kubeflow with the CLI.
Customizing Google Cloud resources
To customize Google Cloud resources, such as your Kubernetes Engine cluster, you can
modify the Deployment settings starting in ${KF_DIR}/common/cnrm
.
This folder contains multiple dependencies on sibling directories for Google Cloud resources. So you can start from here by reviewing kustomization.yaml
. Depending on the type of Google Cloud resources you want to customize, you can add patches in the corresponding directory.
-
Make sure you check in the existing resources in the
/build
folder to source control. -
Add the patches in corresponding directory, and update
kustomization.yaml
to include patches. -
Run
make hydrate
to build new resources in/build
folder. -
Carefully examine the result resources in
/build
folder. If the customization is addition only, you can run make apply
to directly patch the resources. -
It is possible that you are modifying immutable resources. In this case, you need to delete the existing resources and apply new ones. Note that this might mean loss of your service and data, so proceed carefully. The general approach to delete and redeploy Google Cloud resources is:
-
Revert to old resources in
/build
using source control. -
Carefully delete the resource you need to delete by using
kubectl delete
. -
Rebuild and apply new Google Cloud resources
cd common/cnrm
NAME=$(NAME) KFCTXT=$(KFCTXT) LOCATION=$(LOCATION) PROJECT=$(PROJECT) make apply
-
Customizing Kubeflow resources
You can use kustomize to customize Kubeflow. Make sure that you have the minimum required version of kustomize: 2.0.3 or later. For more information about kustomize in Kubeflow, see how Kubeflow uses kustomize.
To customize the Kubernetes resources running within the cluster, you can modify
the kustomize manifests in corresponding component under ${KF_DIR}
.
For example, to modify settings for the Jupyter web app:
-
Open
${KF_DIR}/apps/jupyter/jupyter-web-app/kustomization.yaml
in a text editor. -
Review the file’s inclusion of
deployment-patch.yaml
, and add your modification to deployment-patch.yaml
based on the original content in ${KF_DIR}/apps/jupyter/jupyter-web-app/upstream/base/deployment.yaml
. For example: change volumeMounts
’s mountPath
if you need to customize it. -
Verify the output resources in
/build
folder usingMakefile
"cd "${KF_DIR}" make hydrate
-
Redeploy Kubeflow using
Makefile
:cd "${KF_DIR}" make apply
Common customizations
Add users to Kubeflow
You must grant each user the minimal permission scope that allows them to connect to the Kubernetes cluster.
For Google Cloud, you should grant the following Cloud Identity and Access Management (IAM) roles.
In the following commands, replace [PROJECT]
with your Google Cloud project and replace [EMAIL]
with the user’s email address:
-
To access the Kubernetes cluster, the user needs the Kubernetes Engine Cluster Viewer role:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/container.clusterViewer
-
To access the Kubeflow UI through IAP, the user needs the IAP-secured Web App User role:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/iap.httpsResourceAccessor
Note, you need to grant the user
IAP-secured Web App User
role even if the user is already an owner or editor of the project. The IAP-secured Web App User
role is not implied by the Project Owner
or Project Editor
roles. -
To be able to run
gcloud container clusters get-credentials
and see logs in Cloud Logging (formerly Stackdriver), the user needs viewer access on the project:
gcloud projects add-iam-policy-binding [PROJECT] --member=user:[EMAIL] --role=roles/viewer
Alternatively, you can also grant these roles on the IAM page in the Cloud Console. Make sure you are in the same project as your Kubeflow deployment.
Add GPU nodes to your cluster
To add GPU accelerators to your Kubeflow cluster, you have the following options:
- Pick a Google Cloud zone that provides NVIDIA Tesla K80 Accelerators
(
nvidia-tesla-k80
). - Or disable node-autoprovisioning in your Kubeflow cluster.
- Or change your node-autoprovisioning configuration.
To see which accelerators are available in each zone, run the following command:
gcloud compute accelerator-types list
Create the ContainerNodePool resource adopting GPU, for example, create a new file containernodepool-gpu.yaml
file and fulfill the value KUBEFLOW-NAME
, KF-PROJECT
, LOCATION
based on your Kubeflow deployment:
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
labels:
kf-name: KF_NAME # kpt-set: ${name}
name: containernodepool-gpu
namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
spec:
location: LOCATION # kpt-set: ${location}
initialNodeCount: 1
autoscaling:
minNodeCount: 0
maxNodeCount: 5
nodeConfig:
machineType: n1-standard-4
diskSizeGb: 100
diskType: pd-standard
preemptible: true
oauthScopes:
- "https://www.googleapis.com/auth/logging.write"
- "https://www.googleapis.com/auth/monitoring"
- "https://www.googleapis.com/auth/devstorage.read_only"
guestAccelerator:
- type: "nvidia-tesla-k80"
count: 1
metadata:
disable-legacy-endpoints: "true"
management:
autoRepair: true
autoUpgrade: true
clusterRef:
name: KF_NAME # kpt-set: ${name}
namespace: KF_PROJECT # kpt-set: ${gcloud.core.project}
Note that the metadata:name
must be unique in your Kubeflow project, because the management cluster uses it as the ID and your Google Cloud project as the namespace to identify the node pool.
Apply the node pool patch file above by running:
kubectl --context="${MGMTCTXT}" --namespace="${KF_PROJECT}" apply -f <path-to-gpu-nodepool-file>
After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you. To deploy the installation DaemonSet, run the following command:
kubectl --context="${KF_NAME}" apply -f https://raw.githubusercontent.com/googlecloudplatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
To disable node-autoprovisioning, edit ${KF_DIR}/common/cluster/upstream/cluster.yaml
to set
enabled
to false
:
...
clusterAutoscaling:
enabled: false
autoProvisioningDefaults:
...
Add Cloud TPUs to your cluster
Note: The following instruction should be used when creating the Google Kubernetes Engine cluster, because the TPU enablement flag enableTpu
is immutable once the cluster is created. You need to create a new cluster if the existing cluster doesn’t have TPU enabled.
Set enableTpu:true
in ${KF_DIR}/common/cluster/upstream/cluster.yaml
and enable alias IP (VPC-native traffic routing):
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
...
spec:
...
enableTpu: true
networkingMode: VPC_NATIVE
networkRef:
name: containercluster-dep-vpcnative
subnetworkRef:
name: containercluster-dep-vpcnative
ipAllocationPolicy:
servicesSecondaryRangeName: servicesrange
clusterSecondaryRangeName: clusterrange
...
...
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeNetwork
metadata:
name: containercluster-dep-vpcnative
spec:
routingMode: REGIONAL
autoCreateSubnetworks: false
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSubnetwork
metadata:
name: containercluster-dep-vpcnative
spec:
ipCidrRange: 10.2.0.0/16
region: us-west1
networkRef:
name: containercluster-dep-vpcnative
secondaryIpRange:
- rangeName: servicesrange
ipCidrRange: 10.3.0.0/16
- rangeName: clusterrange
ipCidrRange: 10.4.0.0/16
You can learn more at Creating a new cluster with Cloud TPU support, and view an example Vpc Native Container Cluster config connector yaml file.
More customizations
Refer to the navigation panel on the left of these docs for more customizations, including using your own domain and more.
4 - Using Your Own Domain
This guide assumes you have already set up Kubeflow on Google Cloud. If you haven’t done so, follow the guide to getting started with Kubeflow on Google Cloud.
Using your own domain
If you want to use your own domain instead of ${KF_NAME}.endpoints.${PROJECT}.cloud.goog, follow these instructions after building your cluster:
-
Remove the substitution
hostname
in the Kptfile.
kpt cfg delete-subst instance hostname
-
Create a new setter
hostname
in the Kptfile.
kpt cfg create-setter instance/ hostname --field "data.hostname" --value ""
-
Configure new setter with your own domain.
kpt cfg set ./instance hostname <enter your domain here>
-
Apply the changes.
make apply-kubeflow
-
Check Ingress to verify that your domain was properly configured.
kubectl -n istio-system describe ingresses
-
Get the address of the static IP address created.
IPNAME=${KF_NAME}-ip
gcloud compute addresses describe ${IPNAME} --global
-
Use your DNS provider to map the fully qualified domain specified in the third step to the above IP address.
5 - Authenticating Kubeflow to Google Cloud
This page describes in-cluster and local authentication for Kubeflow Google Cloud deployments.
In-cluster authentication
Starting from Kubeflow v0.6, you consume Kubeflow from custom namespaces (that is, namespaces other than kubeflow
).
The kubeflow
namespace is only for running Kubeflow system components. Individual jobs and model deployments
run in separate namespaces.
Google Kubernetes Engine (GKE) workload identity
Starting in v0.7, Kubeflow uses the new Google Kubernetes Engine feature: workload identity. This is the recommended way to access Google Cloud APIs from your Google Kubernetes Engine cluster. You can configure a Kubernetes service account (KSA) to act as a Google Cloud service account (GSA).
If you deployed Kubeflow following the Google Cloud instructions, then the profile controller automatically binds the “default-editor” service account for every profile namespace to a default Google Cloud service account created during the Kubeflow deployment. The Kubeflow deployment process also creates a default profile for the cluster admin.
For more info about profiles see the Multi-user isolation page.
Here is an example profile spec:
apiVersion: kubeflow/v1beta1
kind: Profile
spec:
plugins:
- kind: WorkloadIdentity
spec:
gcpServiceAccount: ${SANAME}@${PROJECT}.iam.gserviceaccount.com
...
You can verify that there is a KSA called default-editor and that it has an annotation of the corresponding GSA:
kubectl -n ${PROFILE_NAME} describe serviceaccount default-editor
...
Name: default-editor
Annotations: iam.gke.io/gcp-service-account: ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com
...
You can double-check that the GSA is also properly set up:
gcloud --project=${PROJECT} iam service-accounts get-iam-policy ${KFNAME}-user@${PROJECT}.iam.gserviceaccount.com
When a pod uses the default-editor KSA, it can access Google Cloud APIs with the roles granted to the GSA.
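As a quick, hedged way to confirm this from inside a notebook or pod in the profile namespace (assuming the requests package is available there), you can ask the GKE metadata server which service account your credentials resolve to:
import requests

# With Workload Identity, this should print the GSA annotated on the
# default-editor KSA rather than the node's service account.
resp = requests.get(
    'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email',
    headers={'Metadata-Flavor': 'Google'})
print(resp.text)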
Provisioning custom Google service accounts in namespaces: When creating a profile, you can specify a custom Google Cloud service account for the namespace to control which Google Cloud resources are accessible.
Prerequisite: you must have permission to edit your Google Cloud project’s IAM policy and to create a profile custom resource (CR) in your Kubeflow cluster.
- If you don’t already have a Google Cloud service account you want to use, create a new one. For example:
user1-gcp@<project-id>.iam.gserviceaccount.com
:
gcloud iam service-accounts create user1-gcp --project=<project-id>
- You can bind roles to the Google Cloud service account to allow access to the desired Google Cloud resources. For example, to run a BigQuery job, you can grant access like so:
gcloud projects add-iam-policy-binding <project-id> \
--member='serviceAccount:user1-gcp@<project-id>.iam.gserviceaccount.com' \
--role='roles/bigquery.jobUser'
- Grant
owner
permission of service account user1-gcp@<project-id>.iam.gserviceaccount.com
to cluster account <cluster-name>-admin@<project-id>.iam.gserviceaccount.com
:
gcloud iam service-accounts add-iam-policy-binding \
user1-gcp@<project-id>.iam.gserviceaccount.com \
--member='serviceAccount:<cluster-name>-admin@<project-id>.iam.gserviceaccount.com' --role='roles/owner'
- Manually create a profile for user1 and specify the Google Cloud service account to bind in
plugins
field:
apiVersion: kubeflow/v1beta1
kind: Profile
metadata:
name: profileName # replace with the name of the profile (the user's namespace name)
spec:
owner:
kind: User
name: user1@email.com # replace with the email of the user
plugins:
- kind: WorkloadIdentity
spec:
gcpServiceAccount: user1-gcp@project-id.iam.gserviceaccount.com
Note: The profile controller currently doesn’t perform any access control checks to see whether the user creating the profile should be able to use the Google Cloud service account. As a result, any user who can create a profile can get access to any service account for which the admin controller has owner permissions. We will improve this in subsequent releases.
You can find more details on workload identity in the Google Kubernetes Engine documentation.
Authentication from Kubeflow Pipelines
Starting from Kubeflow v1.1, Kubeflow Pipelines supports multi-user isolation. Therefore, pipeline runs are executed in user namespaces also using the default-editor
KSA.
Additionally, the Kubeflow Pipelines UI, visualization, and TensorBoard server instances are deployed in your user namespace using the default-editor
KSA. Therefore, to visualize results in the Pipelines UI, these services fetch artifacts from Google Cloud Storage using the permissions of the same GSA you configured for this namespace.
For more details, refer to Authenticating Pipelines to Google Cloud.
Local authentication
gcloud
Use the gcloud
tool to interact with Google Cloud on the command line.
You can use the gcloud
command to set up Google Kubernetes Engine (GKE) clusters,
and interact with other Google services.
Logging in
You have two options for authenticating the gcloud
command:
-
You can use a user account to authenticate using a Google account (typically Gmail). You can register a user account using
gcloud auth login
, which brings up a browser window to start the familiar Google authentication flow. -
You can create a service account within your Google Cloud project. You can then download a
.json
key file associated with the account, and run thegcloud auth activate-service-account
command to authenticate yourgcloud
session.
You can find more information in the Google Cloud docs.
Listing active accounts
You can run the following command to verify you are authenticating with the expected account:
gcloud auth list
In the output of the command, an asterisk denotes your active account.
Viewing IAM roles
Permissions are handled in Google Cloud using IAM Roles. These roles define which resources your account can read or write to. Provided you have the necessary permissions, you can check which roles were assigned to your account using the following gcloud command:
PROJECT_ID=your-gcp-project-id-here
gcloud projects get-iam-policy $PROJECT_ID --flatten="bindings[].members" \
--format='table(bindings.role)' \
--filter="bindings.members:$(gcloud config list account --format 'value(core.account)')"
You can view and modify roles through the Google Cloud IAM console.
You can find more information about IAM in the Google Cloud docs.
kubectl
The kubectl
tool is used for interacting with a Kubernetes cluster through the command line.
Connecting to a cluster using a Google Cloud account
If you set up your Kubernetes cluster using Google Kubernetes Engine, you can authenticate with the cluster using a Google Cloud account.
The following commands fetch the credentials for your cluster and save them to your local
kubeconfig
file:
CLUSTER_NAME=your-gke-cluster
ZONE=your-gcp-zone
gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE
You can find more information in the Google Cloud docs.
Changing active clusters
If you work with multiple Kubernetes clusters, you may have multiple contexts saved in your local
kubeconfig
file.
You can view the clusters you have saved by running the following command:
kubectl config get-contexts
You can change which cluster is currently being controlled by kubectl
with the following command:
CONTEXT_NAME=your-new-context
kubectl config use-context $CONTEXT_NAME
You can find more information in the Kubernetes docs.
Checking RBAC permissions
Like GKE IAM, Kubernetes permissions are typically handled with a “role-based access control” (RBAC) system. Each Kubernetes service account has a set of authorized roles associated with it. If your account doesn’t have the right roles assigned to it, certain tasks fail.
You can check if an account has the proper permissions to run a command by building a query structured as
kubectl auth can-i [VERB] [RESOURCE] --namespace [NAMESPACE]
. For example, the following command verifies
that your account has permissions to create deployments in the kubeflow
namespace:
kubectl auth can-i create deployments --namespace kubeflow
You can find more information in the Kubernetes docs.
Adding RBAC permissions
If you find you are missing a permission you need, you can grant the missing roles to your service account using Kubernetes resources.
- Roles describe the permissions you want to assign. For example,
verbs: ["create"], resources:["deployments"]
- RoleBindings define a mapping between the
Role
, and a specific service account
By default, Roles
and RoleBindings
apply only to resources in a specific namespace, but there are also
ClusterRoles
and ClusterRoleBindings
that can grant access to resources cluster-wide
You can find more information in the Kubernetes docs.
Next steps
See the troubleshooting guide for help with diagnosing and fixing issues you may encounter with Kubeflow on Google Cloud
6 - Securing Your Clusters
Currently we are collecting interest in supporting private Kubeflow cluster deployment. Please upvote the Support private Google Kubernetes Engine cluster on Google Cloud feature request if it fits your use case.
7 - Troubleshooting Deployments on Google Cloud
Out of date
This guide contains outdated information pertaining to Kubeflow 1.0 and needs to be updated for Kubeflow 1.1.

This guide helps diagnose and fix issues you may encounter with Kubeflow on Google Kubernetes Engine (GKE) and Google Cloud.
Before you start
This guide covers troubleshooting specifically for Kubeflow deployments on Google Cloud.
For more help, search for resolved issues on GitHub or create a new one in the Kubeflow on Google Cloud repository.
This guide assumes the following settings:
-
The
${KF_DIR}
environment variable contains the path to your Kubeflow application directory, which holds your Kubeflow configuration files. For example,/opt/kubeflow-distribution/kubeflow/
.
export KF_DIR=<path to your Kubeflow application directory>
-
The
${CONFIG_FILE}
environment variable contains the path to your Kubeflow configuration file.
export CONFIG_FILE=${KF_DIR}/kfctl_gcp_iap.v1.0.2.yaml
Or:
export CONFIG_FILE=${KF_DIR}/kfctl_gcp_basic_auth.v1.0.2.yaml
-
The
${KF_NAME}
environment variable contains the name of your Kubeflow deployment. You can find the name in your${CONFIG_FILE}
configuration file, as the value for themetadata.name
key.
export KF_NAME=<the name of your Kubeflow deployment>
-
The
${PROJECT}
environment variable contains the ID of your Google Cloud project. You can find the project ID in your${CONFIG_FILE}
configuration file, as the value for theproject
key.
export PROJECT=<your Google Cloud project ID>
-
The
${ZONE}
environment variable contains the Google Cloud zone where your Kubeflow resources are deployed.
export ZONE=<your Google Cloud zone>
-
For further background about the above settings, see the guide to deploying Kubeflow with the CLI.
Troubleshooting Kubeflow deployment on Google Cloud
Here are some tips for troubleshooting Google Cloud.
- Make sure you are a Google Cloud project owner.
- Make sure you are using HTTPS.
- Check the project quota page to see if any service’s current usage has reached its quota limit, and increase quotas as needed.
- Check the Deployment Manager page and see if there’s a failed deployment.
- Check if the endpoint is up: do a DNS lookup against your Cloud Identity-Aware Proxy (Cloud IAP) URL and see if it resolves to the correct IP address.
- Check if certificate succeeded:
kubectl describe certificates -n istio-system
should give you certificate status. - Check ingress status:
kubectl describe ingress -n istio-system
- Check if endpoint entry is created. There should be one entry with name
<deployment>.endpoints.<project>.cloud.goog
- If endpoint entry doesn’t exist, check
kubectl describe cloudendpoint -n istio-system
- If endpoint entry doesn’t exist, check
- If using IAP: make sure you added
https://<deployment>.endpoints.<project>.cloud.goog/_gcp_gatekeeper/authenticate
as an authorized redirect URI for the OAUTH credentials used to create the deployment. - If using IAP: see the guide to monitoring your Cloud IAP setup.
- See the sections below for troubleshooting specific problems.
- Please report a bug if you can’t resolve the problem by following the above steps.
DNS name not registered
This section provides troubleshooting information for problems creating a DNS entry for your ingress. The ingress is a K8s resource that creates a Google Cloud loadbalancer to enable http(s) access to Kubeflow web services from outside the cluster. This section assumes you are using Cloud Endpoints and a DNS name of the following pattern
https://${KF_NAME}.endpoints.${PROJECT}.cloud.goog
Symptoms:
-
When you access the URL in Chrome you get the error: server IP address could not be found
-
nslookup for the domain name doesn’t return the IP address associated with the ingress
nslookup ${KF_NAME}.endpoints.${PROJECT}.cloud.goog Server: 127.0.0.1 Address: 127.0.0.1#53 ** server can't find ${KF_NAME}.endpoints.${PROJECT}.cloud.goog: NXDOMAIN
Troubleshooting
-
Check the
cloudendpoints
resourcekubectl get cloudendpoints -o yaml ${KF_NAME} kubectl describe cloudendpoints ${KF_NAME}
- Check if there are errors indicating problems creating the endpoint
-
The status of the
cloudendpoints
object will contain the cloud operation used to register the operation-
For example
status: config: "" configMapHash: "" configSubmit: operations/serviceConfigs.jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog:43fe6c6f-eb9c-41d0-ac85-b547fc3e6e38 endpoint: jlewi-1218-001.endpoints.cloud-ml-dev.cloud.goog ingressIP: 35.227.243.83 jwtAudiences: null lastAppliedSig: 4f3b903a06a683b380bf1aac1deca72792472429 observedGeneration: 1 stateCurrent: ENDPOINT_SUBMIT_PENDING
-
-
You can check the status of the operation by running:
gcloud --project=${PROJECT} endpoints operations describe ${OPERATION}
- Operation is everything after
operations/
in theconfigSubmit
field
- Operation is everything after
404 Page Not Found When Accessing Central Dashboard
This section provides troubleshooting information for 404 (page not found) errors returned by the central dashboard, which is served at
https://${KUBEFLOW_FQDN}/
- KUBEFLOW_FQDN is your project’s OAuth web app URI domain name
<name>.endpoints.<project>.cloud.goog
- Since we were able to sign in, the Ambassador reverse proxy is up and healthy. We can confirm this by running the following command:
kubectl -n ${NAMESPACE} get pods -l service=envoy
NAME READY STATUS RESTARTS AGE
envoy-76774f8d5c-lx9bd 2/2 Running 2 4m
envoy-76774f8d5c-ngjnr 2/2 Running 2 4m
envoy-76774f8d5c-sg555 2/2 Running 2 4m
-
Try other services to see if they’re accessible for example
https://${KUBEFLOW_FQDN}/whoami https://${KUBEFLOW_FQDN}/tfjobs/ui https://${KUBEFLOW_FQDN}/hub
-
If other services are accessible, then we know it’s a problem specific to the central dashboard and not the ingress.
-
Check that the centraldashboard is running
kubectl get pods -l app=centraldashboard NAME READY STATUS RESTARTS AGE centraldashboard-6665fc46cb-592br 1/1 Running 0 7h
-
Check a service for the central dashboard exists
kubectl get service -o yaml centraldashboard
-
Check that an Ambassador route is properly defined
kubectl get service centraldashboard -o jsonpath='{.metadata.annotations.getambassador\.io/config}' apiVersion: ambassador/v0 kind: Mapping name: centralui-mapping prefix: / rewrite: / service: centraldashboard.kubeflow,
-
Check the logs of Ambassador for errors. See if there are errors like the following, indicating an error parsing the route. If you are using the new Stackdriver Kubernetes monitoring, you can use the following filter in the Stackdriver console:
resource.type="k8s_container" resource.labels.location=${ZONE} resource.labels.cluster_name=${CLUSTER} metadata.userLabels.service="ambassador" "could not parse YAML"
502 Server Error
A 502 usually means traffic isn’t even making it to the envoy reverse proxy. And it usually indicates the loadbalancer doesn’t think any backends are healthy.
- In Cloud Console select Network Services -> Load Balancing
-
Click on the load balancer (the name should contain the name of the ingress)
-
The exact name can be found by looking at the
ingress.kubernetes.io/url-map
annotation on your ingress objectURLMAP=$(kubectl --namespace=${NAMESPACE} get ingress envoy-ingress -o jsonpath='{.metadata.annotations.ingress\.kubernetes\.io/url-map}') echo ${URLMAP}
-
Click on your loadbalancer
-
This will show you the backend services associated with the load balancer
-
There is 1 backend service for each K8s service the ingress rule routes traffic to.
-
The named port will correspond to the NodePort a service is using
NODE_PORT=$(kubectl --namespace=${NAMESPACE} get svc envoy -o jsonpath='{.spec.ports[0].nodePort}') BACKEND_NAME=$(gcloud compute --project=${PROJECT} backend-services list --filter=name~k8s-be-${NODE_PORT}- --format='value(name)') gcloud compute --project=${PROJECT} backend-services get-health --global ${BACKEND_NAME}
-
-
Make sure the load balancer reports the backends as healthy
-
If the backends aren’t reported as healthy check that the pods associated with the K8s service are up and running
-
Check that health checks are properly configured
- Click on the health check associated with the backend service for envoy
- Check that the path is /healthz and corresponds to the path of the readiness probe on the envoy pods
- See K8s docs for important information about how health checks are determined from readiness probes.
-
Check firewall rules to ensure traffic isn’t blocked from the Google Cloud loadbalancer
-
The firewall rule should be added automatically by the ingress, but it’s possible it got deleted if you have some automatic firewall policy enforcement. You can recreate the firewall rule if needed with a rule like this:
gcloud compute firewall-rules create $NAME \ --project $PROJECT \ --allow tcp:$PORT \ --target-tags $NODE_TAG \ --source-ranges 130.211.0.0/22,35.191.0.0/16
-
To get the node tag
# From the Kubernetes Engine cluster get the name of the managed instance group gcloud --project=$PROJECT container clusters --zone=$ZONE describe $CLUSTER # Get the template associated with the MIG gcloud --project=kubeflow-rl compute instance-groups managed describe --zone=${ZONE} ${MIG_NAME} # Get the instance tags from the template gcloud --project=kubeflow-rl compute instance-templates describe ${TEMPLATE_NAME}
For more info see Google Cloud HTTP health check docs
-
-
-
In Stackdriver Logging look at the Cloud Http Load Balancer logs
- Logs are labeled with the forwarding rule
- The forwarding rules are available via the annotations on the ingress
ingress.kubernetes.io/forwarding-rule ingress.kubernetes.io/https-forwarding-rule
-
Verify that requests are being properly routed within the cluster
-
Connect to one of the envoy proxies
kubectl exec -ti `kubectl get pods --selector=service=envoy -o jsonpath='{.items[0].metadata.name}'` /bin/bash
-
Install curl in the pod
apt-get update && apt-get install -y curl
-
Verify access to the whoami app
curl -L -s -i http://envoy:8080/noiap/whoami
-
If this doesn’t return a 200 OK response; then there is a problem with the K8s resources
- Check the pods are running
- Check services are pointing at the pods (look at the endpoints for the various services)
-
GKE Certificate Fails To Be Provisioned
A common symptom of your certificate failing to be provisioned is SSL errors like ERR_SSL_VERSION_OR_CIPHER_MISMATCH when you try to access the Kubeflow https endpoint.
To troubleshoot, check the status of your Google Kubernetes Engine managed certificate:
kubectl -n istio-system describe managedcertificate
If the certificate is in status FailedNotVisible, Google Cloud failed to provision the certificate because it could not verify that you own the domain by doing an ACME challenge. In order for Google Cloud to provision your certificate:
- Your ingress must be created in order to associate a Google Cloud Load Balancer (GCLB) with the IP address for your endpoint.
- There must be a DNS entry mapping your domain name to that IP.
If there is a problem preventing either of the above, Google Cloud will be unable to provision your certificate and it will eventually enter the permanent failure state FailedNotVisible, indicating your endpoint isn’t accessible. The most common cause is that the ingress can’t be created because the K8s secret containing the OAuth credentials doesn’t exist.
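A quick way to check both conditions is to compare the IP address on the ingress with the address your domain resolves to. This is only a sketch and assumes the ingress is named envoy-ingress in the istio-system namespace and that ${DOMAIN} holds your endpoint’s fully qualified domain name.
# IP address the GCLB assigned to the ingress (empty until the ingress is reconciled).
kubectl -n istio-system get ingress envoy-ingress \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'; echo
# IP address the domain currently resolves to; the two values should match.
nslookup ${DOMAIN}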
To fix this, you must first resolve the underlying problem preventing your ingress or DNS entry from being created. Once that is done, follow the steps below to delete the existing certificate and force a new one to be provisioned.
1. Get the name of the Google Cloud certificate:
   kubectl -n istio-system describe managedcertificate gke-certificate
   - The status will contain Certificate Name, which will start with mcrt; make a note of this.
2. Delete the ingress:
   kubectl -n istio-system delete ingress envoy-ingress
3. Ensure the certificate was deleted:
   gcloud --project=${PROJECT} compute ssl-certificates list
   - Make sure the certificate obtained in the first step no longer exists.
4. Reapply Kubeflow in order to recreate the ingress and certificate:
   - If you deployed with kfctl, rerun kfctl apply.
   - If you deployed using the Google Cloud blueprint, rerun make apply-kubeflow.
5. Monitor the certificate to make sure it can be provisioned (a polling sketch follows these steps):
   kubectl -n istio-system describe managedcertificate gke-certificate
6. Since the ingress has been recreated, restart the pods that configure it:
   kubectl -n istio-system delete pods -l service=backend-updater
   kubectl -n istio-system delete pods -l service=iap-enabler
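If you would rather watch just the certificate state than rerun describe by hand, a loop like the one below works; it assumes the ManagedCertificate resource exposes its state in status.certificateStatus, which is the case for recent versions of the GKE managed-certificate controller.
# Poll the certificate status until it leaves Provisioning
# (expected progression: Provisioning -> Active).
while true; do
  kubectl -n istio-system get managedcertificate gke-certificate \
    -o jsonpath='{.status.certificateStatus}'; echo
  sleep 30
done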
Problems with SSL certificate from Let’s Encrypt
As of Kubeflow 1.0, Kubeflow uses Google Kubernetes Engine managed certificates and no longer uses Let’s Encrypt.
See the guide to monitoring your Cloud IAP setup.
Envoy pods crash-looping: root cause is backend quota exceeded
If your logs show the Envoy pods crash-looping, the root cause may be that you have exceeded your quota for some backend services such as loadbalancers. This is particularly likely if you have multiple, differently named deployments in the same Google Cloud project using Cloud IAP.
The error
The error looks like this for the pod’s Envoy container:
kubectl logs -n kubeflow envoy-79ff8d86b-z2snp envoy
[2019-01-22 00:19:44.400][1][info][main] external/envoy/source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2019-01-22 00:19:44.400][1][critical][main] external/envoy/source/server/server.cc:71] error initializing configuration '/etc/envoy/envoy-config.json': unable to read file: /etc/envoy/envoy-config.json
And the Cloud IAP container shows a message like this:
Waiting for backend id PROJECT=<your-project> NAMESPACE=kubeflow SERVICE=envoy filter=name~k8s-be-30352-...
Diagnosing the cause
You can verify the cause of the problem by entering the following command:
kubectl -n istio-system describe ingress
Look for something like this in the output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Sync 14m (x193 over 19h) loadbalancer-controller Error during sync: googleapi: Error 403: Quota 'BACKEND_SERVICES' exceeded. Limit: 5.0 globally., quotaExceeded
Fixing the problem
If you have any redundant Kubeflow deployments, you can delete them using the Deployment Manager.
Alternatively, you can request more backend services quota on the Google Cloud Console.
- Go to the quota settings for backend services on the Google Cloud Console.
- Click EDIT QUOTAS. A quota editing form opens on the right of the screen.
- Follow the form instructions to apply for more quota.
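Before or after requesting more quota, you can check how close the project is to the BACKEND_SERVICES limit. The grep below is just one convenient way to pull the relevant entry out of the project description.
# Show the BACKEND_SERVICES quota limit and current usage for the project.
gcloud compute project-info describe --project=${PROJECT} \
  --format="yaml(quotas)" | grep -B 1 -A 1 "BACKEND_SERVICES"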
Legacy networks are not supported
Cloud Filestore and Google Kubernetes Engine try to use the network named default by default. For older projects, this will be a legacy network, which is incompatible with Cloud Filestore and newer Google Kubernetes Engine features like private clusters. This manifests as the error “default is invalid; legacy networks are not supported” when deploying Kubeflow.
Here’s an example error when deploying Cloud Filestore:
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1533189457517-5726d7cfd19c9-e1b0b0b5-58ca11b8]: errors:
- code: RESOURCE_ERROR
location: /deployments/jl-0801-b-gcfs/resources/filestore
message: '{"ResourceType":"gcp-types/file-v1beta1:projects.locations.instances","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"network
default is invalid; legacy networks are not supported.","status":"INVALID_ARGUMENT","statusMessage":"Bad
Request","requestPath":"https://file.googleapis.com/v1beta1/projects/cloud-ml-dev/locations/us-central1-a/instances","httpMethod":"POST"}}'
To fix this we can create a new network:
cd ${KF_DIR}
cp .cache/master/deployment/gke/deployment_manager_configs/network.* \
./gcp_config/
Edit network.yaml to set the name for the network.
Edit gcfs.yaml to use the name of the newly created network.
Apply the changes:
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG}
Changing the OAuth client used by IAP
If you need to change the OAuth client used by IAP, you can run the following commands to replace the Kubernetes secret containing the ID and secret.
kubectl -n kubeflow delete secret kubeflow-oauth
kubectl -n kubeflow create secret generic kubeflow-oauth \
--from-literal=client_id=${CLIENT_ID} \
--from-literal=client_secret=${CLIENT_SECRET}
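To confirm that the new values landed in the secret, you can decode the stored client ID (the client secret is deliberately left out here). Depending on how the IAP components cache the secret, you may also need to restart the iap-enabler pods, as in the certificate section above, for the change to take effect.
# Print the OAuth client ID currently stored in the kubeflow-oauth secret.
kubectl -n kubeflow get secret kubeflow-oauth \
  -o jsonpath='{.data.client_id}' | base64 --decode; echo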
Troubleshooting SSL certificate errors
This section describes how to enable the Service Management API to avoid managed certificate failures (a sketch for checking and enabling it follows the steps below).
To check your certificate:
1. Run the following command:
   kubectl -n istio-system describe managedcertificate gke-certificate
   Make sure the certificate status is either Active or Provisioning (the latter means it is not ready yet). For more details on certificate status, refer to the certificate statuses descriptions section. Also, make sure the domain name is correct.
2. Run the following command to look for errors, using the certificate name from the previous step:
   gcloud beta --project=${PROJECT} compute ssl-certificates describe --global ${CERTIFICATE_NAME}
3. Run the following command:
   kubectl -n istio-system get ingress envoy-ingress -o yaml
   Make sure of the following:
   - the networking.gke.io/managed-certificates annotation value points to the name of the Kubernetes managed certificate resource and is gke-certificate;
   - the public IP address that is displayed in the status is assigned. See the example below:
     status:
       loadBalancer:
         ingress:
         - ip: 35.186.212.202
   - the DNS entry for the domain has propagated. To verify this, use the following nslookup command example:
     nslookup ${DOMAIN}
   - the domain name is the fully qualified domain name, which should be the host value in the ingress. See the example below:
     ${KF_APP_NAME}.endpoints.${PROJECT}.cloud.goog
   Note that managed certificates cannot provision the certificate if the DNS lookup does not work properly.
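As noted at the top of this section, the Cloud Endpoints DNS entry and the managed certificate depend on the Service Management API being enabled in the project. One way to check, and to enable it if it is missing, is sketched below; enabling the related Service Control and Endpoints APIs alongside it is an assumption based on the usual Cloud Endpoints setup rather than a requirement stated in this guide.
# Check whether the Service Management API is already enabled.
gcloud services list --enabled --project=${PROJECT} | grep servicemanagement
# Enable it (and the related Endpoints APIs) if the line above prints nothing.
gcloud services enable servicemanagement.googleapis.com \
  servicecontrol.googleapis.com endpoints.googleapis.com \
  --project=${PROJECT}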
8 - Kubeflow On-premises on Anthos
Introduction
Anthos is a hybrid and multi-cloud application platform developed and supported by Google. Anthos is built on open source technologies, including Kubernetes, Istio, and Knative.
Using Anthos, you can create a consistent setup across your on-premises and cloud environments, helping you to automate policy and security at scale.
We are collecting interest for Kubeflow on Google Cloud On Prem. You can subscribe to the GitHub issue googlecloudplatform/kubeflow-distribution#138.
Next steps
While waiting for a response, you may want to deploy Kubeflow on Google Cloud.
9 - Changelog
1.8.0
Changes:
- Upgraded upstream Manifests to v1.8.0.
1.7.1
Changes:
- Fixed make apply deployment issue (#425, #426, #427).
- Validated deployment using GKE 1.25.8.
1.7.0
Changes:
- Upgraded upstream Manifests to v1.7.0.
- Upgraded Kubeflow Pipelines to v2.0.0-alpha.7.
- Upgraded KNative to v1.8.5 (#404).
- Upgraded cert-manager to v1.10.2 (#405).
- Upgraded ASM to v1.16.2 (#406).
- Upgraded KServe to v0.10 (#408).
- Fixed ASM deployment issue (#413, #419).
- Fixed user header issue in KServe web-app (#414).
- Validated deployment using GKE 1.23, GKE 1.24, GKE 1.25, and GKE 1.26.
1.6.1
Changes:
- Upgraded upstream Manifests to v1.6.1.
- Upgraded pipelines to v2.0.0-alpha.6 (fixes #392).
- Updated MySQL to 8.0 (#391).
- Fixed ASM deployment issue (#389).
- Minor improvements of the deployment process.
- Validated deployment using GKE 1.22.
1.6.0
Changes:
- Upgraded upstream Manifests to v1.6.0.
- Upgraded ASM to v1.14 (#385).
- Upgraded Knative to v1.2 (#373).
- Upgraded cert-manager to v1.5 (#372).
- Upgraded pipelines to v2.0.0-alpha.4.
- Upgraded APIs to support GKE 1.22 (#349).
- Improved deployment stability (#371, #376, #384, #386).
- Removed deprecated kfserving, cloud-endpoints, and application manifests (#375, #377).
- Validated deployment using GKE 1.21 and GKE 1.22.
1.5.1
Changes:
- Upgraded ASM to v1.13.
- Fixed KServe issues with dashboard (#362) and directory (#361).
- Increased the maximum length of Kubeflow cluster name (#359).
- Moved RequestAuthentication policy creation to iap-enabler to improve GitOps friendliness (#364).
- Validated deployment using GKE 1.21.11.
1.5.0
Changes:
- Upgraded Kubeflow component versions as listed in the components versions table.
- Integrated with Config Controller, which simplifies management cluster maintenance; there is no need to manually upgrade the Config Connector CRD.
- Switched from kfserving to KServe as the default serving component; you can switch back to kfserving in config.yaml.
- Fixed cloudsqlproxy issue with livenessProbe configuration.
- Validated deployment using GKE 1.20.12.
1.4.1
Changes on top of v1.4.0:
- Upgrade: Integrate with Kubeflow 1.4.1 manifests (kubeflow/manifests#2084).
- Fix: Change cloud endpoint images destination (#343).
- Fix: Use yq4 in iap-ingress Makefile.
1.4.0
Changes:
- Upgraded Kubeflow component versions as listed in the components versions table.
- Removed the GKE 1.18 image version and k8s runtime pin; the GKE version now defaults to the STABLE channel.
- Set Emissary Executor as the default Argo Workflow executor for Kubeflow Pipelines.
- Upgraded kpt from 0.X.X to 1.0.0-beta.6.
- Upgraded yq from v3 to v4.
- Upgraded ASM (Anthos Service Mesh) to 1.10.4-asm.6.
- Unblocked KFServing usage by removing commonLabels from the kustomization patch (#298 and #324).
- Integrated with the KFServing Web App UI.
- Integrated with the unified operator: training-operator.
- Simplified deployment: removed the requirement for independent installation of yq, jq, kustomize, and kpt.