+++
title = "Build Reusable Components"
description = "A detailed tutorial on creating components that you can use in various pipelines"
weight = 3
+++

This page describes how to author a reusable component that you can
load and run in Kubeflow Pipelines. A reusable component is a pre-implemented
standalone component that is easy to add as a step in any pipeline.

If you're new to
pipelines, see the conceptual guides to [pipelines](/docs/pipelines/concepts/pipeline/)
and [components](/docs/pipelines/concepts/component/).

## Summary

Below is a summary of the steps involved in creating and using a component:

1. Write the program that contains your component's logic. The program must
   use specific methods to pass data to and from the component.
1. Containerize the program.
1. Write a component specification in YAML format that describes the
   component for the Kubeflow Pipelines system.
1. Use the Kubeflow Pipelines SDK to load and run the component in your
   pipeline.

The rest of this page gives some explanation about input and output data,
followed by detailed descriptions of the above steps.

## Passing the data to and from the containerized program

When planning to write a component you need to think about how the component
communicates with upstream and downstream components. That is, how it consumes
input data and produces output data.

### Summary

For small pieces of data (smaller than 512 kibibyte (KiB)):

* Inputs: Read the value from a command-line argument.
* Outputs: Write the value to a local file, using a path provided as a
  command-line argument.

For bigger pieces of data (larger than 512 KiB) or for a storage-specific
component:

* Inputs: **Read the data URI** from a file provided as a command-line argument.
  Then **read the data** from that URI.
* Outputs: Upload the data to the URI provided as a command-line argument. Then
  write that URI to a local file, using a path provided as a command-line
  argument.

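To make the small-data pattern concrete, here is a minimal, hypothetical sketch. The argument names `--param` and `--output-path` are illustrative only and are not part of the tutorial's program; the point is simply that the input value arrives as a command-line argument and the output is written to a file path that is also supplied as a command-line argument.

```python
#!/usr/bin/env python3
# Minimal sketch of the small-data pattern. Argument names are illustrative.
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('--param', type=int, default=100)
parser.add_argument('--output-path', type=str, required=True)
args = parser.parse_args()

result = args.param * 2  # Placeholder for the component's real logic

# The system chooses the output path at runtime, so create its parent directory.
Path(args.output_path).parent.mkdir(parents=True, exist_ok=True)
Path(args.output_path).write_text(str(result))
```
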
### More about input data

There are several ways to make input data available to a program running inside
a container:

* **Small pieces of data** (smaller than 512 kibibyte (KiB)): Pass the data
  content as a command-line argument:

  ```
  program.py --param 100
  ```

* **Bigger data** (larger than 512 KiB): Kubeflow Pipelines doesn't provide a
  way of transferring larger pieces of data to the container running the
  program. Instead, the program (or the wrapper script) should receive data
  URIs instead of the data itself and then access the data from the URIs. For
  example:

  ```
  program.py --train-uri https://server.edu/datasets/1/train.tsv \
             --eval-uri https://server.edu/datasets/1/eval.tsv
  program.py --train-gcs-uri gs://bucket/datasets/1/train.tsv
  program.py --big-query-table my_table
  ```

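As an illustration of the big-data pattern, the sketch below (not part of the tutorial's program) receives a `--train-uri` argument and fetches the data itself before processing it. It assumes the data is reachable over plain HTTPS; the argument and file names are illustrative.

```python
#!/usr/bin/env python3
# Illustrative sketch: receive a data URI as an argument and fetch the data,
# rather than receiving the data itself on the command line.
import argparse
import urllib.request

parser = argparse.ArgumentParser()
parser.add_argument('--train-uri', type=str, required=True)
args = parser.parse_args()

# Download the dataset referenced by the URI to a local file for processing.
local_path, _ = urllib.request.urlretrieve(args.train_uri, 'train.tsv')
with open(local_path) as f:
    print(sum(1 for _ in f), 'training rows')
```
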
### More about output data

The program must write the output data to some location and inform the system
about that location so that the system can pass the data between steps.
You should provide the paths to your output data as command-line arguments.
That is, you should not hardcode the paths.

You can choose a suitable storage solution for your output data. Options include
the following:

* [Google Cloud Storage](https://cloud.google.com/storage/docs/) is the
  recommended default storage solution for writing output.
* For structured data you can use
  [BigQuery](https://cloud.google.com/bigquery/docs/).

You must provide the specific URI/path or table name to which to write the
results.

The program should do the following:

* Upload the data to your chosen storage system.
* Pass out a URI pointing to the data, by writing that URI to a file and
  instructing the system to pick it up and treat it as the value of a particular
  component output.

Note that the example below accepts both a URI for uploading the data into, and
a file path to write that URI to.

```
program.py --out-model-uri gs://bucket/163/output_model \
           --out-model-uri-file /outputs/output_model_uri/data
```

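A minimal sketch of this output pattern is shown below. It is illustrative rather than part of the tutorial: it assumes the `gsutil` CLI is available in the container and that a hypothetical local file `model.bin` has already been produced by the program's real work.

```python
#!/usr/bin/env python3
# Sketch of the output pattern: upload the data to the URI received as an
# argument, then write that URI to the local file path the system provides.
# Assumes the gsutil CLI is installed and a local model.bin already exists.
import argparse
import subprocess
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument('--out-model-uri', type=str, required=True)
parser.add_argument('--out-model-uri-file', type=str, required=True)
args = parser.parse_args()

# Upload the locally produced artifact to the destination URI.
subprocess.run(['gsutil', 'cp', 'model.bin', args.out_model_uri], check=True)

# Tell the system where the data ended up by writing the URI to the given file.
Path(args.out_model_uri_file).parent.mkdir(parents=True, exist_ok=True)
Path(args.out_model_uri_file).write_text(args.out_model_uri)
```
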
Why should the program output the URI it has just received as an input argument?
The reason is that the URIs specified in the pipeline are usually not the real
URIs, but rather URI templates containing UIDs. The system resolves the URIs at
runtime when the containerized program starts. Only the containerized program
sees the fully-resolved URI.

Below is an example of such a URI:

```
gs://my-bucket/{{workflow.uid}}/{{pod.name}}/data
```

In cases where the program cannot control the URI/ID of the created object (for
example, where the URI is generated by the outside system), the program should
just accept the file path to write the resulting URI/ID:

```
program.py --out-model-uri-file /outputs/output_model_uri/data
```

<a id="future-proof"></a>
### Future-proofing your code

The following guidelines help you avoid the need to modify the program code in
the near future or have different versions for different storage systems.

If the program has access to the TensorFlow package, you can use
[`tf.gfile`](https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile)
to read and write files. The `tf.gfile` module supports both local and Cloud
Storage paths.

If you cannot use `tf.gfile`, a solution is to read inputs from and write
outputs to local files, then add a storage-specific wrapper that downloads and
uploads the data from/to a specific storage solution such as
[Cloud Storage](https://cloud.google.com/storage/docs/) or
[Amazon S3](https://aws.amazon.com/s3/).
For example, create a wrapper script that uses the [`gsutil
cp`](https://cloud.google.com/storage/docs/gsutil/commands/cp) command to
download the input data before running the main program and to upload the output
data after the program finishes.

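For instance, a hypothetical Python wrapper along the following lines could stage data with `gsutil cp` around an existing program that only understands local files. The wrapped program name `main.py`, the argument names, and the temporary paths are all illustrative assumptions, not part of the tutorial's code.

```python
#!/usr/bin/env python3
# Hypothetical wrapper: download the input with gsutil, run the existing
# program against local files, then upload the output. Names are illustrative.
import argparse
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument('--input-uri', type=str, required=True)
parser.add_argument('--output-uri', type=str, required=True)
args = parser.parse_args()

# Stage the input locally before the main program runs.
subprocess.run(['gsutil', 'cp', args.input_uri, '/tmp/input.tsv'], check=True)

# The wrapped program reads and writes only local files.
subprocess.run(['python3', 'main.py',
                '--input', '/tmp/input.tsv',
                '--output', '/tmp/output.tsv'], check=True)

# Upload the result after the program finishes.
subprocess.run(['gsutil', 'cp', '/tmp/output.tsv', args.output_uri], check=True)
```
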
## Writing the program code

This section describes an example program that has two inputs (for small and
large pieces of data) and one output. The programming language in this example
is Python 3.

### program.py

```python
#!/usr/bin/env python3
import argparse
import os
from pathlib import Path
from tensorflow import gfile # Supports both local paths and Cloud Storage (GCS) or S3

# Function doing the actual work
def do_work(input1_file, output1_file, param1):
    for x in range(param1):
        line = next(input1_file, None)
        if not line:
            break
        _ = output1_file.write(line)

# Defining and parsing the command-line arguments
parser = argparse.ArgumentParser(description='My program description')
parser.add_argument('--input1-path', type=str, help='Path of the local file or GCS blob containing the Input 1 data.')
parser.add_argument('--param1', type=int, default=100, help='Parameter 1.')
parser.add_argument('--output1-path', type=str, help='Path of the local file or GCS blob where the Output 1 data should be written.')
parser.add_argument('--output1-path-file', type=str, help='Path of the local file where the Output 1 URI data should be written.')
args = parser.parse_args()

gfile.MakeDirs(os.path.dirname(args.output1_path))
# Opening the input/output files and performing the actual work
with gfile.Open(args.input1_path, 'r') as input1_file, gfile.Open(args.output1_path, 'w') as output1_file:
    do_work(input1_file, output1_file, args.param1)

# Writing args.output1_path to a file so that it will be passed to downstream tasks
Path(args.output1_path_file).parent.mkdir(parents=True, exist_ok=True)
Path(args.output1_path_file).write_text(args.output1_path)
```

The command line invocation of this program is:

```
python3 program.py --input1-path <URI to Input 1 data> \
                   --param1 <value of Param1 input> --output1-path <URI for Output 1 data> \
                   --output1-path-file <local file path for the Output 1 URI>
```

You need to pass the `URI for Output 1 data` forward so that the downstream
steps can access the URI. The program writes the URI to a local file and tells
the system to grab it and expose it as an output. You should avoid hard-coding
any paths, so the program receives the local file path through the
`--output1-path-file` command-line argument.

## Writing a Dockerfile to containerize your application

You need a [Docker](https://docs.docker.com/get-started/) container image that
packages your program.

The instructions on creating container images are not specific to Kubeflow
Pipelines. To make things easier for you, this section provides some guidelines
on standard container creation. You can use any procedure
of your choice to create the Docker containers.

Your [Dockerfile](https://docs.docker.com/engine/reference/builder/) must
contain all program code, including the wrapper, and the dependencies (operating
system packages, Python packages, etc.).

Ensure you have write access to a container registry where you can push
the container image. Examples include
[Google Container Registry](https://cloud.google.com/container-registry/docs/)
and [Docker Hub](https://hub.docker.com/).

Think of a name for your container image. This guide uses the name
`gcr.io/my-org/my-image`.

### Example Dockerfile

```
ARG BASE_IMAGE_TAG=1.12.0-py3
FROM tensorflow/tensorflow:$BASE_IMAGE_TAG
RUN python3 -m pip install keras
COPY ./src /pipelines/component/src
```

Create a `build_image.sh` script (see the example below) to build the container
image based on the Dockerfile and push it to your chosen container repository.
Run the `build_image.sh` script to build and push the image.

Best practice: After pushing the image, get the strict image name with digest,
and use the strict image name for reproducibility.

### Example build_image.sh:

```bash
#!/bin/bash -e
image_name=gcr.io/my-org/my-image # Specify the image name here
image_tag=latest
full_image_name=${image_name}:${image_tag}
base_image_tag=1.12.0-py3

cd "$(dirname "$0")"
docker build --build-arg BASE_IMAGE_TAG=${base_image_tag} -t "${full_image_name}" .
docker push "$full_image_name"

# Output the strict image name (which contains the sha256 image digest)
docker inspect --format="{{index .RepoDigests 0}}" "${full_image_name}"
```

Make your script executable:

```
chmod +x build_image.sh
```

## Writing your component definition file

You need a component specification in YAML format that describes the
component for the Kubeflow Pipelines system.

For the complete definition of a Kubeflow Pipelines component, see the
[component specification](/docs/pipelines/reference/component-spec/).
However, for this tutorial you don't need to know the full schema of the
component specification. The tutorial provides only the information that is
relevant to building this component.

Start writing the component definition (`component.yaml`) by specifying your
container image in the component's implementation section:

```
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f # Name of a container image that you've pushed to a container repo.
```

Complete the component's implementation section based on your (wrapper) program:

```
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3, /pipelines/component/src/program.py, # Path of the program inside the container
      --input1-path, <URI to Input 1 data>,
      --param1, <value of Param1 input>,
      --output1-path, <URI template for Output 1 data>,
      --output1-path-file, <local file path for the Output 1 URI>,
    ]
```

The `command` section still contains some dummy placeholders (in angle
brackets). Let's replace them with real placeholders. A *placeholder* represents
a command-line argument that is replaced with some value or path before the
program is executed. In `component.yaml`, you specify the placeholders using
YAML's mapping syntax to distinguish them from the verbatim strings. This
tutorial uses the following two placeholders:

* `{inputValue: Some input name}`

  This placeholder is replaced by the **value** of the argument to the
  specified input. This is useful for small pieces of input data.

* `{outputPath: Some output name}`

  This placeholder is replaced by the auto-generated **local path** where the
  program should write its output data. This instructs the system to read the
  content of the file and store it as the value of the specified output.

As well as putting real placeholders in the command line, you need to add
corresponding input and output specifications to the inputs and outputs
sections. The input/output specification contains the input name, type,
description and default value. Only the name is required. The input and output
names are free-form strings, but be careful with the YAML syntax and use quotes
if necessary. The input/output names do not need to be the same as the
command-line flags, which are usually quite short.

Replace the placeholders as follows:

+ Replace `<URI to Input 1 data>` with `{inputValue: Input 1 URI}` and
  add `Input 1 URI` to the inputs section. URIs are small, so we're passing
  them in as command-line arguments.
+ Replace `<value of Param1 input>` with `{inputValue: Parameter 1}` and add
  `Parameter 1` to the inputs section. Integers are small, so we're passing
  them in as command-line arguments.
+ Replace `<URI template for Output 1 data>` with `{inputValue: Output 1 URI
  template}` and add `Output 1 URI template` to the **inputs** section. This
  looks very confusing: you're adding an output URI into the inputs section.
  The reason is that currently you must manually pass in URIs, so this
  is an input, not an output.
+ Replace `<local file path for the Output 1 URI>` with `{outputPath: Output
  1 URI}` and add `Output 1 URI` to the **outputs** section. Again, this looks
  quite confusing: you now have both an input and an output called `Output 1
  URI`. (Note that you can use different names.) The reason is that the URI is
  *passed through*: it's passed to the task as an input and is then output from
  the task, so that downstream tasks have access to it.

After replacing the placeholders and adding inputs/outputs, your
`component.yaml` looks like this:

```
inputs: #List of input specs. Each input spec is a map.
- {name: Input 1 URI}
- {name: Parameter 1}
- {name: Output 1 URI template}
outputs:
- {name: Output 1 URI}
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    command: [
      python3, /pipelines/component/src/program.py,
      --input1-path,
      {inputValue: Input 1 URI},           # Refers to the "Input 1 URI" input
      --param1,
      {inputValue: Parameter 1},           # Refers to the "Parameter 1" input
      --output1-path,
      {inputValue: Output 1 URI template}, # Refers to "Output 1 URI template" input
      --output1-path-file,
      {outputPath: Output 1 URI},          # Refers to the "Output 1 URI" output
    ]
```

The above component specification is sufficient, but you should add more
metadata to make it more useful. The example below includes the following
additions:

* Component name and description.
* For each input and output: description, default value, and type.

Final version of `component.yaml`:

```
name: Do dummy work
description: Performs some dummy work.
inputs:
- {name: Input 1 URI, type: GCSPath, description: 'GCS path to Input 1'}
- {name: Parameter 1, type: Integer, default: '100', description: 'Parameter 1 description'} # The default values must be specified as YAML strings.
- {name: Output 1 URI template, type: GCSPath, description: 'GCS path template for Output 1'}
outputs:
- {name: Output 1 URI, type: GCSPath, description: 'GCS path for Output 1'}
implementation:
  container:
    image: gcr.io/my-org/my-image@sha256:a172..752f
    command: [
      python3, /pipelines/component/src/program.py,
      --input1-path, {inputValue: Input 1 URI},
      --param1, {inputValue: Parameter 1},
      --output1-path, {inputValue: Output 1 URI template},
      --output1-path-file, {outputPath: Output 1 URI},
    ]
```

## Build your component into a pipeline with the Kubeflow Pipelines SDK

Here is a sample pipeline that shows how to load a component and use it to
compose a pipeline.

```python
import kfp
import os

component_root = '.'  # Directory containing the component.yaml file (adjust as needed).

# Load the component by calling load_component_from_file or load_component_from_url
# To load the component, the pipeline author only needs to have access to the component.yaml file.
# The Kubernetes cluster executing the pipeline needs access to the container image specified in the component.
dummy_op = kfp.components.load_component_from_file(os.path.join(component_root, 'component.yaml'))
# dummy_op = kfp.components.load_component_from_url('http://....../component.yaml')

# dummy_op is now a "factory function" that accepts the arguments for the component's inputs
# and produces a task object (e.g. ContainerOp instance).
# Inspect the dummy_op function in Jupyter Notebook by typing "dummy_op(" and pressing Shift+Tab
# You can also get help by writing help(dummy_op) or dummy_op? or dummy_op??
# The signature of the dummy_op function corresponds to the inputs section of the component.
# Some tweaks are performed to make the signature valid and pythonic:
# 1) All inputs with default values will come after the inputs without default values
# 2) The input names are converted to pythonic names (spaces and symbols replaced
#    with underscores and letters lowercased).

# Define a pipeline and create a task from a component:
@kfp.dsl.pipeline(name='My pipeline', description='')
def my_pipeline():
    dummy1_task = dummy_op(
        # Input name "Input 1 URI" is converted to pythonic parameter name "input_1_uri"
        input_1_uri='gs://my-bucket/datasets/train.tsv',
        parameter_1='100',
        # You must use Argo placeholders ("{{workflow.uid}}" and "{{pod.name}}")
        # to guarantee that the outputs from different pipeline runs and tasks write
        # to unique locations and do not overwrite each other.
        output_1_uri='gs://my-bucket/{{workflow.uid}}/{{pod.name}}/output_1/data',
    ).apply(kfp.gcp.use_gcp_secret('user-gcp-sa'))
    # To access GCS, you must configure the container to have access to a
    # GCS secret that grants required access to the bucket.

    # The outputs of the dummy1_task can be referenced using the
    # dummy1_task.outputs dictionary.
    # ! The output names are converted to lowercased dashed names.

    # Pass the outputs of the dummy1_task to some other component
    dummy2_task = dummy_op(
        input_1_uri=dummy1_task.outputs['output-1-uri'],
        parameter_1='200',
        output_1_uri='gs://my-bucket/{{workflow.uid}}/{{pod.name}}/output_1/data',
    ).apply(kfp.gcp.use_gcp_secret('user-gcp-sa'))
    # To access GCS, you must configure the container to have access to a
    # GCS secret that grants required access to the bucket.

# This pipeline can be compiled, uploaded and submitted for execution.
```

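To finish, you can compile the pipeline to a package or submit it directly from the pipeline function. The snippet below continues the sample above and is a sketch only: it assumes the KFP v1 SDK (which provides `kfp.compiler` and `kfp.Client`) and a Kubeflow Pipelines endpoint that the client can reach; your environment may require a host URL and credentials.

```python
# Sketch (assumes the KFP v1 SDK and a reachable Kubeflow Pipelines endpoint).
# Compile the pipeline defined above into a package that you can upload via the UI:
kfp.compiler.Compiler().compile(my_pipeline, 'my_pipeline.zip')

# Alternatively, submit a run directly; connection details depend on your setup.
client = kfp.Client()
client.create_run_from_pipeline_func(my_pipeline, arguments={})
```
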
## Organizing the component files

This section provides a recommended way to organize the component files. There
is no requirement that you must organize the files in this way. However, using
the standard organization makes it possible to reuse the same scripts for
testing, image building and component versioning.
See this
[sample component](https://github.com/kubeflow/pipelines/tree/master/components/sample/keras/train_classifier)
for a real-life component example.

```
components/<component group>/<component name>/

    src/*           # Component source code files
    tests/*         # Unit tests
    run_tests.sh    # Small script that runs the tests
    README.md       # Documentation. Move to docs/ if multiple files needed

    Dockerfile      # Dockerfile to build the component container image
    build_image.sh  # Small script that runs docker build and docker push

    component.yaml  # Component definition in YAML format
```

## Next steps

* Consolidate what you've learned by reading the
  [best practices](/docs/pipelines/sdk/best-practices) for designing and
  writing components.
* See the [index of reusable components](/docs/pipelines/reusable-components/).