{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "-pxRW-6f_uNq"
},
"source": [
"# Build a Pipeline\n",
"> A tutorial on building pipelines to orchestrate your ML workflow\n",
"\n",
"\n",
"A Kubeflow pipeline is a portable and scalable definition of a machine learning\n",
"(ML) workflow. Each step in your ML workflow, such as preparing data or\n",
"training a model, is an instance of a pipeline component. This document\n",
"provides an overview of pipeline concepts and best practices, and instructions\n",
"describing how to build an ML pipeline.\n",
"\n",
"## Before you begin\n",
"\n",
"1. Run the following command to install the Kubeflow Pipelines SDK. If you run this command in a Jupyter\n",
"   notebook, restart the kernel after installing the SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "04mM73j7nWJ-"
},
"outputs": [],
"source": [
"!pip install kfp --upgrade"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8KExWR1i_7Ur"
},
"source": [
"2. Import the `kfp` and `kfp.components` packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TLAhMbMG_M3A"
},
"outputs": [],
"source": [
"import kfp\n",
"import kfp.components as comp"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "deBrhgzrD3Fr"
},
"source": [
"## Understanding pipelines\n",
"\n",
"A Kubeflow pipeline is a portable and scalable definition of an ML workflow,\n",
"based on containers. A pipeline is composed of a set of input parameters and a\n",
"list of the steps in this workflow. Each step in a pipeline is an instance of a\n",
"component, which is represented as an instance of\n",
"[`ContainerOp`][container-op].\n",
"\n",
"You can use pipelines to:\n",
"\n",
"* Orchestrate repeatable ML workflows.\n",
"* Accelerate experimentation by running a workflow with different sets of\n",
"  hyperparameters.\n",
"\n",
"### Understanding pipeline components\n",
"\n",
"A pipeline component is a containerized application that performs one step in a\n",
"pipeline's workflow. Pipeline components are defined in\n",
"[component specifications][component-spec], which define the following:\n",
"\n",
"* The component's interface, its inputs and outputs.\n",
"* The component's implementation, the container image and the command to\n",
"  execute.\n",
"* The component's metadata, such as the name and description of the\n",
"  component.\n",
"\n",
"You can build components by [defining a component specification for a\n",
"containerized application][component-dev], or you can [use the Kubeflow\n",
"Pipelines SDK to generate a component specification for a Python\n",
"function][python-function-component]. You can also [reuse prebuilt components\n",
"in your pipeline][prebuilt-components]. \n",
"\n",
"### Understanding the pipeline graph\n",
"\n",
"Each step in your pipeline's workflow is an instance of a component. When\n",
"you define your pipeline, you specify the source of each step's inputs. A\n",
"step's inputs can be set from the pipeline's input arguments, from\n",
"constants, or from the outputs of other steps in the pipeline. Kubeflow\n",
"Pipelines uses these dependencies to define your pipeline's workflow as\n",
"a graph.\n",
"\n",
"For example, consider a pipeline with the following steps: ingest data,\n",
"generate statistics, preprocess data, and train a model. The following\n",
"describes the data dependencies between each step.\n",
"\n",
"* **Ingest data**: This step loads data from an external source, which is\n",
"  specified using a pipeline argument, and it outputs a dataset. Since\n",
"  this step does not depend on the output of any other steps, this step\n",
"  can run first.\n",
"* **Generate statistics**: This step uses the ingested dataset to generate\n",
"  and output a set of statistics. Since this step depends on the dataset\n",
"  produced by the ingest data step, it must run after the ingest data step.\n",
"* **Preprocess data**: This step preprocesses the ingested dataset and\n",
"  transforms the data into a preprocessed dataset. Since this step depends\n",
"  on the dataset produced by the ingest data step, it must run after the\n",
"  ingest data step.\n",
"* **Train a model**: This step trains a model using the preprocessed dataset,\n",
"  the generated statistics, and pipeline parameters, such as the learning\n",
"  rate. Since this step depends on the preprocessed data and the generated\n",
"  statistics, it must run after both the preprocess data and generate\n",
"  statistics steps are complete.\n",
"\n",
"Since the generate statistics and preprocess data steps both depend on the\n",
"ingested data, the generate statistics and preprocess data steps can run in\n",
"parallel. All other steps are executed once their data dependencies are\n",
"available.\n",
"\n",
"## Designing your pipeline\n",
"\n",
"When designing your pipeline, think about how to split your ML workflow into\n",
"pipeline components. The process of splitting an ML workflow into pipeline\n",
"components is similar to the process of splitting a monolithic script into\n",
"testable functions. The following rules can help you define the components\n",
"that you need to build your pipeline.\n",
"\n",
"* Components should have a single responsibility. Having a single\n",
"  responsibility makes it easier to test and reuse a component. For example,\n",
"  if you have a component that loads data, you can reuse it for similar\n",
"  tasks that load data. A component that both loads and transforms a\n",
"  dataset is less reusable, since you can use it only when you need to\n",
"  load and transform that dataset.\n",
"\n",
"* Reuse components when possible. Kubeflow Pipelines provides [components for\n",
"  common pipeline tasks and for access to cloud\n",
"  services][prebuilt-components].\n",
"\n",
"* Consider what you need to know to debug your pipeline and research the\n",
"  lineage of the models that your pipeline produces. Kubeflow Pipelines\n",
"  stores the inputs and outputs of each pipeline step. By interrogating the\n",
"  artifacts produced by a pipeline run, you can better understand the\n",
"  variations in model quality between runs or track down bugs in your\n",
"  workflow.\n",
"\n",
"In general, you should design your components with composability in mind.\n",
"\n",
"Pipelines are composed of component instances, also called steps. Steps can\n",
"define their inputs as depending on the output of another step. The\n",
"dependencies between steps define the pipeline workflow graph.\n",
"\n",
"### Building pipeline components\n",
"\n",
"Kubeflow pipeline components are containerized applications that perform a\n",
"step in your ML workflow. Here are the ways that you can define pipeline\n",
"components:\n",
"\n",
"* If you have a containerized application that you want to use as a\n",
"  pipeline component, create a component specification to define this\n",
"  container image as a pipeline component.\n",
"\n",
"  This option provides the flexibility to include code written in any\n",
"  language in your pipeline, so long as you can package the application\n",
"  as a container image. Learn more about [building pipeline\n",
"  components][component-dev].\n",
"\n",
"* If your component code can be expressed as a Python function, [evaluate if\n",
"  your component can be built as a Python function-based\n",
"  component][python-function-component]. The Kubeflow Pipelines SDK makes it\n",
"  easier to build lightweight Python function-based components by saving you\n",
"  the effort of creating a component specification.\n",
"\n",
"Whenever possible, [reuse prebuilt components][prebuilt-components] to save\n",
"yourself the effort of building custom components.\n",
"\n",
"The example in this guide demonstrates how to build a pipeline that uses a\n",
"Python function-based component and reuses a prebuilt component.\n",
"\n",
"### Understanding how data is passed between components\n",
"\n",
"When Kubeflow Pipelines runs a component, a container image is started in a\n",
"Kubernetes Pod and your component’s inputs are passed in as command-line\n",
"arguments. When your component has finished, the component's outputs are\n",
"returned as files.\n",
"\n",
"In your component's specification, you define the component's inputs and outputs\n",
"and how the inputs and output paths are passed to your program as command-line\n",
"arguments. You can pass small inputs, such as short strings or numbers, to your\n",
"component by value. Large inputs, such as datasets, must be passed to your\n",
"component as file paths. Outputs are written to the paths that Kubeflow\n",
"Pipelines provides.\n",
"\n",
"Python function-based components make it easier to build pipeline components\n",
"by building the component specification for you. Python function-based\n",
"components also handle the complexity of passing inputs into your component\n",
"and passing your function’s outputs back to your pipeline.\n",
"\n",
"Learn more about how [Python function-based components handle inputs and\n",
"outputs][python-function-component-data-passing]. \n",
"\n",
"## Getting started building a pipeline\n",
"\n",
"The following sections demonstrate how to get started building a Kubeflow\n",
"pipeline by walking through the process of converting a Python script into\n",
"a pipeline.\n",
"\n",
"### Design your pipeline\n",
"\n",
"The following steps walk through some of the design decisions you may face\n",
"when designing a pipeline.\n",
"\n",
"1. Evaluate the process. In the following example, a Python function downloads\n",
"   a zipped tar file (`.tar.gz`) that contains several CSV files from a\n",
"   public website. The function extracts the CSV files and then merges them\n",
"   into a single file.\n",
"\n",
"[container-op]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.dsl.html#kfp.dsl.ContainerOp\n",
"[component-spec]: https://www.kubeflow.org/docs/components/pipelines/reference/component-spec/\n",
"[python-function-component]: https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/\n",
"[component-dev]: https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/\n",
"[python-function-component-data-passing]: https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#understanding-how-data-is-passed-between-components\n",
"[prebuilt-components]: https://www.kubeflow.org/docs/examples/shared-resources/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Vn9MXolH_2BG"
},
"outputs": [],
"source": [
"import glob\n",
"import pandas as pd\n",
"import tarfile\n",
"import urllib.request\n",
"\n",
"def download_and_merge_csv(url: str, output_csv: str):\n",
"    with urllib.request.urlopen(url) as res:\n",
"        tarfile.open(fileobj=res, mode=\"r|gz\").extractall('data')\n",
"    df = pd.concat(\n",
"        [pd.read_csv(csv_file, header=None)\n",
"         for csv_file in glob.glob('data/*.csv')])\n",
"    df.to_csv(output_csv, index=False, header=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cWmF17kyIKGF"
},
"source": [
"2. Run the following Python command to test the function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "he6MK5x1Fwbk"
},
"outputs": [],
"source": [
"download_and_merge_csv(\n",
"    url='https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz',\n",
"    output_csv='merged_data.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Run the following to print the first few rows of the\n",
"   merged CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!head merged_data.csv"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yT6Di92BOrNQ"
},
"source": [
"4. Design your pipeline. For example, consider the following pipeline designs.\n",
"\n",
"   * Implement the pipeline using a single step. In this case, the pipeline\n",
"     contains one component that works similarly to the example function.\n",
"     This is a straightforward function, and implementing a single-step\n",
"     pipeline is a reasonable approach in this case.\n",
" \n",
"     The downside of this approach is that the zipped tar file would not be\n",
"     an artifact of your pipeline runs. Not having this artifact available\n",
"     could make it harder to debug this component in production.\n",
" \n",
"   * Implement this as a two-step pipeline. The first step downloads a file\n",
"     from a website. The second step extracts the CSV files from a zipped\n",
"     tar file and merges them into a single file.\n",
"\n",
"     This approach has a few benefits:\n",
"\n",
"     * You can reuse the [Web Download component][web-download-component]\n",
"       to implement the first step.\n",
"     * Each step has a single responsibility, which makes the components\n",
"       easier to reuse.\n",
"     * The zipped tar file is an artifact of the first pipeline step.\n",
"       This means that you can examine this artifact when debugging\n",
"       pipelines that use this component.\n",
"\n",
"   This example implements a two-step pipeline.\n",
"\n",
"### Build your pipeline components\n",
"\n",
"1. Build your pipeline components. This example modifies the initial script to\n",
"   extract the contents of a zipped tar file, merge the CSV files that were\n",
"   contained in the zipped tar file, and return the merged CSV file.\n",
"\n",
"   This example builds a Python function-based component. You can also package\n",
"   your component's code as a Docker container image and define the component\n",
"   using a ComponentSpec.\n",
"\n",
"   In this case, the following modifications were required to the original\n",
"   function.\n",
"\n",
"   * The file download logic was removed. The path to the zipped tar file\n",
"     is passed as an argument to this function.\n",
"   * The import statements were moved inside of the function. Python\n",
"     function-based components require standalone Python functions. This\n",
"     means that any required import statements must be defined within the\n",
"     function, and any helper functions must be defined within the function.\n",
"     Learn more about [building Python function-based\n",
"     components][python-function-components].\n",
"   * The function's arguments are decorated with the\n",
"     [`kfp.components.InputPath`][input-path] and the\n",
"     [`kfp.components.OutputPath`][output-path] annotations. These\n",
"     annotations let Kubeflow Pipelines know to provide the path to the\n",
"     zipped tar file and to create a path where your function stores the\n",
"     merged CSV file.\n",
"\n",
"   The following example shows the updated `merge_csv` function.\n",
"\n",
"[web-download-component]: https://github.com/kubeflow/pipelines/blob/master/components/web/Download/component.yaml\n",
"[python-function-components]: https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/\n",
"[input-path]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html?highlight=inputpath#kfp.components.InputPath\n",
"[output-path]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html?highlight=outputpath#kfp.components.OutputPath"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NB3eNHmNCN2C"
},
"outputs": [],
"source": [
"def merge_csv(file_path: comp.InputPath('Tarball'),\n",
"              output_csv: comp.OutputPath('CSV')):\n",
"    import glob\n",
"    import pandas as pd\n",
"    import tarfile\n",
"\n",
"    tarfile.open(name=file_path, mode=\"r|gz\").extractall('data')\n",
"    df = pd.concat(\n",
"        [pd.read_csv(csv_file, header=None)\n",
"         for csv_file in glob.glob('data/*.csv')])\n",
"    df.to_csv(output_csv, index=False, header=False)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WATSaHZkJySg"
},
"source": [
"2. Use [`kfp.components.create_component_from_func`][create_component_from_func]\n",
"   to return a factory function that you can use to create pipeline steps.\n",
"   This example also specifies the base container image to run this function\n",
"   in, the path to save the component specification to, and a list of PyPI\n",
"   packages that need to be installed in the container at runtime.\n",
"\n",
"[create_component_from_func]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.create_component_from_func"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RDVQ8QjWOniD"
},
"outputs": [],
"source": [
"create_step_merge_csv = kfp.components.create_component_from_func(\n",
"    func=merge_csv,\n",
"    output_component_file='component.yaml', # This is optional. It saves the component spec for future use.\n",
"    base_image='python:3.7',\n",
"    packages_to_install=['pandas==1.1.4'])"
]
},
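{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the component specification was saved to `component.yaml`, you can\n",
"share it and load it back later without the original Python source, for\n",
"example (a sketch, using the file path chosen above):\n",
"\n",
"```python\n",
"loaded_merge_csv_op = kfp.components.load_component_from_file('component.yaml')\n",
"```"
]
},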
{
"cell_type": "markdown",
"metadata": {
"id": "j9Axem9HPHP2"
},
"source": [
"### Build your pipeline\n",
"\n",
"1. Use [`kfp.components.load_component_from_url`][load_component_from_url]\n",
"   to load the component specification YAML for any components that you are\n",
"   reusing in this pipeline.\n",
"\n",
"[load_component_from_url]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html?highlight=load_component_from_url#kfp.components.load_component_from_url"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QDzFCaGQa_oR"
},
"outputs": [],
"source": [
"web_downloader_op = kfp.components.load_component_from_url(\n",
"    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p4bIwiHhbACy"
},
"source": [
"2. Define your pipeline as a Python function.\n",
"\n",
"   Your pipeline function's arguments define your pipeline's parameters. Use\n",
"   pipeline parameters to experiment with different hyperparameters, such as\n",
"   the learning rate used to train a model, or pass run-level inputs, such as\n",
"   the path to an input file, into a pipeline run.\n",
"\n",
"   Use the factory functions created by\n",
"   `kfp.components.create_component_from_func` and\n",
"   `kfp.components.load_component_from_url` to create your pipeline's tasks.\n",
"   The inputs to the component factory functions can be pipeline parameters,\n",
"   the outputs of other tasks, or a constant value. In this case, the\n",
"   `web_downloader_task` task uses the `url` pipeline parameter, and the\n",
"   `merge_csv_task` uses the `data` output of the `web_downloader_task`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WsyKJeBOTlkz"
},
"outputs": [],
"source": [
"# Define a pipeline and create a task from a component:\n",
"def my_pipeline(url):\n",
"    web_downloader_task = web_downloader_op(url=url)\n",
"    merge_csv_task = create_step_merge_csv(file=web_downloader_task.outputs['data'])\n",
"    # The outputs of the merge_csv_task can be referenced using the\n",
"    # merge_csv_task.outputs dictionary: merge_csv_task.outputs['output_csv']"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OT3O_2GgVKoT"
},
"source": [
"### Compile and run your pipeline\n",
"\n",
"After defining the pipeline in Python as described in the preceding section, use one of the following options to compile the pipeline and submit it to the Kubeflow Pipelines service.\n",
"\n",
"#### Option 1: Compile and then upload in the UI\n",
"\n",
"1. Run the following to compile your pipeline and save it as `pipeline.yaml`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "U0Ll8ve2WNUo"
},
"outputs": [],
"source": [
"kfp.compiler.Compiler().compile(\n",
"    pipeline_func=my_pipeline,\n",
"    package_path='pipeline.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Upload and run your `pipeline.yaml` using the Kubeflow Pipelines user interface.\n",
"See the guide to [getting started with the UI][quickstart].\n",
"\n",
"[quickstart]: https://www.kubeflow.org/docs/components/pipelines/overview/quickstart"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jNLI1-_bfEky"
},
"source": [
"#### Option 2: Run the pipeline using the Kubeflow Pipelines SDK client\n",
"\n",
"1. Create an instance of the [`kfp.Client` class][kfp-client] by following the steps in [connecting to Kubeflow Pipelines using the SDK client][connect-api].\n",
"\n",
"[kfp-client]: https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.client.html#kfp.Client\n",
"[connect-api]: https://www.kubeflow.org/docs/components/pipelines/sdk/connect-api"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = kfp.Client() # change arguments accordingly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Run the pipeline using the `kfp.Client` instance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jRNHZpfnVJ0h"
},
"outputs": [],
"source": [
"client.create_run_from_pipeline_func(\n",
"    my_pipeline,\n",
"    arguments={\n",
"        'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'\n",
"    })"
]
},
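{
"cell_type": "markdown",
"metadata": {},
"source": [
"`create_run_from_pipeline_func` returns a handle to the submitted run. As a\n",
"sketch (assuming the `client` created above), you can block until the run\n",
"finishes with `kfp.Client.wait_for_run_completion`:\n",
"\n",
"```python\n",
"run = client.create_run_from_pipeline_func(\n",
"    my_pipeline,\n",
"    arguments={'url': 'https://storage.googleapis.com/ml-pipeline-playground/iris-csv-files.tar.gz'})\n",
"client.wait_for_run_completion(run.run_id, timeout=3600)  # seconds\n",
"```"
]
},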
{
"cell_type": "markdown",
"metadata": {
"id": "pnhZm12y_wvc"
},
"source": [
"\n",
"## Next steps\n",
"\n",
"* Learn about advanced pipeline features, such as [authoring recursive\n",
"  components][recursion] and [using conditional execution in a\n",
"  pipeline][conditional] (see the sketch after this list).\n",
"* Learn how to [manipulate Kubernetes resources in a\n",
"  pipeline][k8s-resources] (Experimental).\n",
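"\n",
"As a taste of conditional execution, the following sketch uses\n",
"`kfp.dsl.Condition` with hypothetical components, so that a step runs only\n",
"when an upstream step's output matches a condition. See the [conditional]\n",
"sample for the full treatment.\n",
"\n",
"```python\n",
"import kfp\n",
"from kfp import dsl\n",
"\n",
"# Hypothetical components for illustration only.\n",
"def evaluate(threshold: float) -> str:\n",
"    return 'deploy' if threshold < 0.5 else 'skip'\n",
"\n",
"def deploy():\n",
"    print('deploying model')\n",
"\n",
"evaluate_op = kfp.components.create_component_from_func(evaluate)\n",
"deploy_op = kfp.components.create_component_from_func(deploy)\n",
"\n",
"def conditional_pipeline(threshold: float = 0.3):\n",
"    eval_task = evaluate_op(threshold=threshold)\n",
"    # The deploy step runs only when the evaluate step outputs 'deploy'.\n",
"    with dsl.Condition(eval_task.output == 'deploy'):\n",
"        deploy_op()\n",
"```\n",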
"\n",
"[conditional]: https://github.com/kubeflow/pipelines/blob/master/samples/tutorials/DSL%20-%20Control%20structures/DSL%20-%20Control%20structures.py\n",
"[recursion]: https://www.kubeflow.org/docs/components/pipelines/sdk/dsl-recursion/\n",
"[k8s-resources]: https://www.kubeflow.org/docs/components/pipelines/sdk/manipulate-resources/"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "Copy of build-pipelines.ipynb",
"provenance": [],
"toc_visible": true
},
"environment": {
"name": "tf2-2-3-gpu.2-3.m56",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/tf2-2-3-gpu.2-3:m56"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}