{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Copyright 2019 The Kubeflow Authors. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Kubeflow Pipelines local development quickstart\n", "\n", "This notebook demonstrates how to:\n", "\n", "* Author components with the lightweight method and with `ContainerOp` based on existing images.\n", "* Author pipelines.\n", "\n", "**Note: Make sure that you have Docker installed in the local environment.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# PROJECT_ID is used to construct the Docker image registry path. We will use Google Container Registry,\n", "# but any other accessible registry works as well.\n", "PROJECT_ID='Your-Gcp-Project-Id'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install the Kubeflow Pipelines SDK and create the output folder for compiled pipelines.\n", "!pip3 install kfp --upgrade\n", "!mkdir -p tmp/pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1\n", "# Two ways to author a component to list blobs in a GCS bucket\n", "A pipeline is composed of one or more components. In this section, you will build a single component that lists the blobs in a GCS bucket. 
Then you build a pipeline that consists of this component. There are two ways to author a component. The following sections walk through each of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Create a lightweight Python component from a Python function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Define the component function\n", "The requirements for the component function:\n", "* The function must be stand-alone: it cannot use code declared outside the function definition, and any imports must happen inside the function body.\n", "* The function can only import packages that are available in the base image.\n", "* If the function operates on numbers, the parameters must have type hints. Supported types are `int`, `float`, and `bool`. Everything else is passed as a string (`str`).\n", "* To build a component with multiple output values, use Python’s `typing.NamedTuple` type hint syntax." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def list_blobs(bucket_name: str) -> str:\n", "    '''Lists all the blobs in the bucket.'''\n", "    # Imports must live inside the function because the component runs stand-alone.\n", "    import subprocess\n", "    import sys\n", "\n", "    # Install the GCS client at run time; the base image may not include it.\n", "    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'google-cloud-storage'])\n", "    from google.cloud import storage\n", "\n", "    storage_client = storage.Client()\n", "    bucket = storage_client.get_bucket(bucket_name)\n", "    list_blobs_response = bucket.list_blobs()\n", "    # Return the blob names as one comma-separated string.\n", "    blobs = ','.join([blob.name for blob in list_blobs_response])\n", "    print(blobs)\n", "    return blobs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Create a lightweight Python component" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import kfp.components as comp\n", "\n", "# Converts the function to a lightweight Python component.\n", "list_blobs_op = comp.func_to_container_op(list_blobs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Define the pipeline\n", "Note that to access Google Cloud Storage, the pipeline must be able to authenticate to GCP. 
Refer to [Authenticating Pipelines to GCP](https://www.kubeflow.org/docs/gke/authentication-pipelines/) for details." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import kfp.dsl as dsl\n", "\n", "# Defines the pipeline.\n", "@dsl.pipeline(name='List GCS blobs', description='Lists GCS blobs.')\n", "def pipeline_func(bucket_name):\n", "    list_blobs_task = list_blobs_op(bucket_name)\n", "    # Use the following commented code instead if you want to authenticate\n", "    # with a Google service account (GSA) key.\n", "    #\n", "    # from kfp.gcp import use_gcp_secret\n", "    # list_blobs_task = list_blobs_op(bucket_name).apply(use_gcp_secret('user-gcp-sa'))\n", "    # Same for below.\n", "\n", "# Compile the pipeline to a file.\n", "import kfp.compiler as compiler\n", "compiler.Compiler().compile(pipeline_func, 'tmp/pipelines/list_blobs.pipeline.tar.gz')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Wrap an existing Docker container image using `ContainerOp`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Create a Docker container\n", "Create your own container image that includes your program. If your component creates outputs to be fed as inputs to downstream components, each output must be written as a string to a separate local text file inside the container. For example, if a trainer component needs to output the trained model path, it can write the path to a local file `/output.txt`. Keep the string written to an output file small. If it is much larger than 100 kB, save the output to external persistent storage and pass the storage path to the next component instead.\n", "\n", "Start by entering the value of your Google Cloud Platform Project ID." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell creates a file `app.py` that contains a Python script. 
The script takes a GCS bucket name as an input argument, gets the list of blobs in that bucket, prints the list, and also writes it to an output file." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# Create folders if they don't exist.\n", "mkdir -p tmp/components/list-gcs-blobs\n", "\n", "# Create the Python file that lists GCS blobs.\n", "cat > ./tmp/components/list-gcs-blobs/app.py < ./tmp/components/list-gcs-blobs/Dockerfile < ./tmp/components/list-gcs-blobs/build_image.sh < ./tmp/components/view-input/app.py < ./tmp/components/view-input/Dockerfile < ./tmp/components/view-input/build_image.sh <