Kubeflow Pipelines - GitHub Issue Summarization
Introduction
Note: This tutorial is deprecated. It will be updated soon.
Kubeflow is an open-source project that supports a machine learning stack on Kubernetes, making deployments of ML workflows simple, portable, and scalable.
Kubeflow Pipelines is a component of Kubeflow that makes it easy to compose, deploy, and manage end-to-end machine learning workflows. See the Kubeflow Pipelines documentation for more detail.
This codelab walks you through creating your own Kubeflow deployment and running a Kubeflow Pipelines workflow for model training and serving, both from the Pipelines UI and from a Jupyter notebook.
What does a Kubeflow deployment look like?
A Kubeflow deployment is:
- Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
- Scalable - Can utilize fluctuating resources and is only constrained by the number of resources allocated to the Kubernetes cluster.
- Composable - Lets you combine independent steps of an ML workflow, drawn from a curated set of components, into complete pipelines.
What you'll build
In this lab, you will build a web app that summarizes GitHub issues using Kubeflow Pipelines to train and serve a model. Upon completion, your infrastructure will contain:
- A Kubernetes Engine cluster with default Kubeflow installation
- A pipeline that performs distributed training of a Tensor2Tensor model on GPUs
- A serving container that provides predictions from the trained model
- A UI that interprets the predictions to provide summarizations for GitHub issues
- A notebook that creates a pipeline from scratch using the Kubeflow Pipelines SDK
What you'll learn
The pipeline you will build trains a Tensor2Tensor model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model using TensorFlow Serving. The final step in the pipeline launches a web app, which interacts with the TF Serving instance to get model predictions.
- How to set up a Kubeflow cluster using GKE
- How to build and run ML workflows using Kubeflow Pipelines
- How to define and run pipelines from within a Kubeflow JupyterHub notebook
What you'll need
- A basic understanding of Kubernetes
- An active GCP project for which you have Owner permissions
- A GitHub account
- Access to the Google Cloud Shell, available in the Google Cloud Platform (GCP) Console
This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.
Project setup
Google Cloud Platform (GCP) organizes resources into projects. This allows you to collect all the related resources for a single application in one place. For this lab, you'll need a GCP project with billing enabled.
Set up the environment
Note: For simplicity, these instructions assume you're using the Cloud Shell, but you could also do the lab from your laptop after installing the necessary packages. If you're on your laptop, use of a virtual environment (e.g. Conda or virtualenv) is highly recommended.
Cloud Shell
Visit the GCP Console in the browser and log in with your project credentials.
Then, click the "Activate Cloud Shell" icon in the top right of the console to start up a Cloud Shell (if one is not already opened for you).
Set your GitHub token
This codelab calls the GitHub API to retrieve publicly available data. To prevent rate limiting, especially at events where a large number of anonymous requests are sent to the GitHub APIs, set up an access token with no permissions. This simply authorizes you as an individual rather than an anonymous user.
- Navigate to https://github.com/settings/tokens and generate a new token with no scopes.
- Save it somewhere safe. If you lose it, you will need to delete it and create a new one.
- Set the GITHUB_TOKEN environment variable:
export GITHUB_TOKEN=<token>
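If you want to confirm the token is being picked up, a quick check against GitHub's rate-limit endpoint works. The snippet below is only an illustration and not part of the codelab; it assumes the requests package is installed in your environment.

import os
import requests  # assumed to be installed; not a codelab requirement

# Authenticated requests get a much higher GitHub API rate limit than anonymous ones.
token = os.environ["GITHUB_TOKEN"]
resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": "token {}".format(token)},
)
resp.raise_for_status()
print(resp.json()["rate"])  # e.g. {'limit': 5000, 'remaining': 4999, 'reset': ...}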
Set your GCP project ID and cluster name
To find your project ID, visit the GCP Console's Home panel. If the screen is empty, click on Yes at the prompt to create a dashboard.
In the Cloud Shell terminal, run these commands to set the deployment name, project ID, and zone. For the zone, pick one where nvidia-tesla-k80 GPUs are available.
export DEPLOYMENT_NAME=kubeflow
export PROJECT_ID=<your_project_id>
export ZONE=<your-zone>
gcloud config set project ${PROJECT_ID}
gcloud config set compute/zone ${ZONE}
Create a storage bucket
Note: Bucket names must be unique across all of GCP, not just your organization
Create a Cloud Storage bucket for storing pipeline files. Fill in a new, unique bucket name and issue the "mb" (make bucket) command:
export BUCKET_NAME=kubeflow-${PROJECT_ID}
gsutil mb gs://${BUCKET_NAME}
Alternatively, you can create a bucket via the GCP Console.
Install the Kubeflow Pipelines SDK
Run the following command to install the Kubeflow Pipelines SDK:
pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.7/kfp.tar.gz --upgrade
We'll use this SDK a bit later in the lab.
Note: Python 3.5 or higher is required. If you're running in a Conda Python 3 environment, use pip instead of pip3.
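To double-check that the SDK installed correctly before moving on, you can run a quick import check from a Python 3 interpreter. This is only a sanity check, not a required lab step:

import pkg_resources

import kfp  # should import cleanly if the SDK installed correctly

print(pkg_resources.get_distribution('kfp').version)  # expect 0.1.7 for this lab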
Optional: Pin useful dashboards
In the GCP console, you can pin the Kubernetes Engine and Storage dashboards for easier access.
Create a Kubeflow cluster
Create a cluster
Create a managed Kubernetes cluster on Kubernetes Engine by visiting the Kubeflow Click-to-Deploy site in your browser and signing in with your GCP account.
Fill in the following values in the resulting form:
- Project: Enter your GCP $PROJECT_ID in the top field.
- Deployment name: Use the default value kubeflow, or set $DEPLOYMENT_NAME to a different value and use it here. Note that this value must be unique within the project.
- GKE Zone: Use the value you have set for $ZONE, selecting it from the pulldown.
- Kubeflow Version: v0.4.1
- Check the Skip IAP box
Set up kubectl to use your new cluster's credentials
When the cluster has been instantiated, connect your environment to the Kubernetes Engine cluster by running the following command in your Cloud Shell:
gcloud container clusters get-credentials ${DEPLOYMENT_NAME} \
--project ${PROJECT_ID} \
--zone ${ZONE}
This configures your kubectl context so that you can interact with your cluster. To verify the connection, run the following command:
kubectl get nodes -o wide
You should see two nodes listed, both with a status of "Ready", and other information about node age, version, external IP address, OS image, kernel version, and container runtime.
Add a GPU node pool to the cluster
Run the following command to create an accelerator node pool:
gcloud container node-pools create accel \
--project ${PROJECT_ID} \
--zone ${ZONE} \
--cluster ${DEPLOYMENT_NAME} \
--accelerator type=nvidia-tesla-k80,count=4 \
--num-nodes 1 \
--machine-type n1-highmem-8 \
--disk-size=220GB \
--scopes cloud-platform \
--verbosity error
Note: This can take a few minutes.
Install the NVIDIA drivers on these nodes by applying a DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml
View the Kubeflow central dashboard
Once the cluster setup is complete, port-forward to view the Kubeflow central dashboard. Click the "Cloud Shell" button in the launcher:
This will open a tab showing your new cluster's services details. Click the Port Forwarding button towards the bottom of the page, then click the Run in Cloud Shell button in the resulting popup window.
A Cloud Shell window will start up, with the command to port-forward pre-populated at the prompt. Press return to run the command in the Cloud Shell, then click the Open in web preview button that appears on the services page.
This will launch the Kubeflow dashboard in a new browser tab.
Run a pipeline from the Pipelines dashboard
Pipeline description
The pipeline you will run has three steps (a sketch of how a pipeline like this is expressed with the Pipelines SDK follows the list):
- It starts by training a Tensor2Tensor model using preprocessed data. (More accurately, this step starts from an existing model checkpoint, then trains for a few hundred more steps; fully training it would take too long.) When it finishes, it exports the model in a form suitable for serving by TensorFlow Serving.
- The next step in the pipeline deploys a TensorFlow Serving instance using that model.
- The last step launches a web app for interacting with the served model to retrieve predictions.
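For orientation, here is a minimal sketch of how a three-step pipeline like this can be written with the Kubeflow Pipelines DSL (kfp.dsl, as of the 0.1.x SDK used in this lab). The container images, arguments, and parameter defaults below are placeholders for illustration, not the ones used by the gh_summ.py definition you'll download next.

import kfp.dsl as dsl

@dsl.pipeline(
    name='gh-summ-sketch',
    description='Illustrative three-step pipeline: train, serve, web app.')
def gh_summ_sketch(
        project=dsl.PipelineParam(name='project', value='YOUR_PROJECT_ID'),
        github_token=dsl.PipelineParam(name='github-token', value='YOUR_TOKEN'),
        working_dir=dsl.PipelineParam(name='working-dir', value='gs://YOUR_BUCKET/kubecon')):
    # Step 1: resume training from a checkpoint, then export the model for serving.
    train = dsl.ContainerOp(
        name='train',
        image='gcr.io/YOUR_PROJECT/t2t-train:latest',  # placeholder image
        arguments=['--working-dir', working_dir, '--github-token', github_token])
    # Step 2: deploy a TF Serving instance that loads the exported model.
    serve = dsl.ContainerOp(
        name='serve',
        image='gcr.io/YOUR_PROJECT/deploy-tf-serve:latest',  # placeholder image
        arguments=['--model-dir', working_dir])
    serve.after(train)
    # Step 3: launch the web app that queries the TF Serving instance.
    webapp = dsl.ContainerOp(
        name='webapp',
        image='gcr.io/YOUR_PROJECT/deploy-webapp:latest',  # placeholder image
        arguments=['--project', project])
    webapp.after(serve)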
Download and compile the pipeline
To download the script containing the pipeline definition, execute this command:
curl -O https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/gh_summ.py
Compile the pipeline definition file by running it:
python3 gh_summ.py
You will see the file gh_summ.py.tar.gz appear as a result.
Note: If you get an error, make sure you have installed the Pipelines SDK and are using Python 3. If you're running in a Conda Python 3 environment, use python instead of python3.
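Running the script produces the archive because a pipeline definition typically ends by invoking the SDK compiler on the pipeline function, along these lines (gh_summ_sketch is the illustrative function from the earlier sketch; check the downloaded gh_summ.py for its actual function name):

import kfp.compiler as compiler

# Compile the pipeline function into the archive that the Pipelines UI accepts.
compiler.Compiler().compile(gh_summ_sketch, 'gh_summ.py.tar.gz')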
Upload the compiled pipeline
From the Kubeflow dashboard, click the Pipeline Dashboard link to navigate to the Kubeflow Pipelines web UI. Click on Upload pipeline, and select Import by URL. Paste in the following URL, which points to the same pipeline that you just compiled.
https://github.com/kubeflow/examples/raw/master/github_issue_summarization/pipelines/example_pipelines/gh_summ.py.tar.gz
Give the pipeline a name (e.g. gh_summ).
Run the pipeline
Click on the uploaded pipeline in the list (this lets you view the pipeline's static graph), then click on Start an experiment to create a new Experiment using the pipeline.
Give the Experiment a name (e.g. the same name as the pipeline, gh_summ), then click Next to create it.
An Experiment is composed of multiple Runs. In Cloud Shell, execute these commands to gather the values to enter into the UI as parameters for the first Run:
gcloud config get-value project
echo ${GITHUB_TOKEN}
echo "gs://${BUCKET_NAME}/kubecon"
Give the Run a name (e.g. gh_summ-1) and fill in three parameter fields:
- project
- github-token
- working-dir
After filling in the fields, click Create.
Note: The pipeline will take approximately 15 minutes to complete.
Once the pipeline run is launched, you can click on an individual step in the run to get more information about it, including viewing its pod logs.
View the pipeline definition
While the pipeline is running, take a closer look at how it is put together and what it is doing.
View TensorBoard
The first step in the pipeline performs training and generates a model. Once this step is complete, view Artifacts and click the blue Start TensorBoard button; then, once it's ready, click Open TensorBoard.
View the web app and make some predictions
The last step in the pipeline deploys a web app, which provides a UI for querying the trained model - served via TF Serving - to make predictions. After the pipeline completes, connect to the web app by visiting the Kubeflow central dashboard page and appending /webapp/ to the end of the URL. (The trailing slash is required.)
You should see something like this:
Click the Populate Random Issue button to retrieve a block of text. Click on Generate Title to call the trained model and display a prediction.
Note: It can take a few seconds to display a summary; for this lab we're not using GPUs for the TensorFlow Serving instance.
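Under the hood, the web app forwards your issue text to the TF Serving instance and renders the returned prediction. For reference, a typical TF Serving REST request looks roughly like the sketch below; the host, port, model name, and input field are placeholders, and the codelab's web app may use the gRPC API instead.

import json

import requests  # assumed to be installed; not a codelab requirement

# Placeholder endpoint: TF Serving's REST API is conventionally exposed on port 8501
# at /v1/models/<model-name>:predict.
SERVING_URL = 'http://tf-serving-host:8501/v1/models/ghsumm:predict'

issue_body = 'training job crashes when the GPU node pool has no available accelerators'
payload = {'instances': [{'input': issue_body}]}

resp = requests.post(SERVING_URL, data=json.dumps(payload))
resp.raise_for_status()
print(resp.json())  # the predicted title is returned under the 'predictions' key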
Run a pipeline from a Jupyter notebook
Create a JupyterHub instance
You can also interactively define and run Kubeflow Pipelines from a Jupyter notebook. To create a notebook, navigate to the JupyterHub link on the central Kubeflow dashboard.
The first time you visit JupyterHub, you'll be asked to log in. You can use any username and password you like (just remember what you entered). You will then be prompted to spawn an instance. Select the TensorFlow 1.12 CPU image from the pulldown menu as shown below, then click the Spawn button, which creates a new pod in your cluster.
Note: JupyterHub will take 3-5 minutes to become available. You can view the status of the container on the Kubernetes Engine -> Workloads section of the GCP Console.
Download a notebook
Once JupyterHub becomes available, open a terminal.
In the Terminal window, run:
cd work
curl -O https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/pipelines-kubecon.ipynb
Return to the JupyterHub home screen, navigate to the work folder, and open the notebook you just downloaded.
Execute the notebook
In the Setup section, find the second command cell (it starts with import kfp). Fill in your own values for the environment variables WORKING_DIR, PROJECT_NAME, and GITHUB_TOKEN, then execute the notebook one step at a time.
Follow the instructions in the notebook for the remainder of the lab.
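The notebook drives Kubeflow Pipelines programmatically through the SDK's client. The pattern it follows looks roughly like the sketch below; the host address, names, and parameter values here are placeholders, and the notebook itself contains the exact calls to use.

import kfp

# When running inside the cluster, the Pipelines API is typically reachable via the
# ml-pipeline service; the address below is a placeholder.
client = kfp.Client(host='ml-pipeline.kubeflow.svc.cluster.local:8888')

exp = client.create_experiment('gh-summ-notebook')  # create an Experiment to hold runs
run = client.run_pipeline(
    exp.id,                    # experiment to attach the run to
    'gh-summ-from-notebook',   # run name
    'gh_summ.py.tar.gz',       # compiled pipeline package
    params={
        'project': 'YOUR_PROJECT_ID',
        'github-token': 'YOUR_GITHUB_TOKEN',
        'working-dir': 'gs://YOUR_BUCKET/kubecon',
    })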
Clean up
Destroy the cluster
Note: Cluster deletion can take a few minutes to complete
To remove all resources created by the Click-to-Deploy launcher, navigate to Deployment Manager in the GCP Console and delete the $DEPLOYMENT_NAME deployment.
Remove the GitHub token
If you have no further use for it, navigate to https://github.com/settings/tokens and remove the generated token.
Conclusion
Congratulations -- you've created ML workflows using Kubeflow Pipelines!