pipelines/components/tfx/SchemaGen
Alexey Volkov fa3b3043c6
Components - Added support for Dataflow in TFX components (#3684)
* Components - Added support for Dataflow in TFX components

To use Dataflow, pass beam_pipeline_args to a component.
```
transformer_op(
    ...,
    beam_pipeline_args = [
        '--runner=DataflowRunner',
        '--experiments=shuffle_mode=auto',
        '--project=' + project_id,
        '--temp_location=' + gcs_bucket + '/tmp',
        '--region=' + gcp_region,
        '--disk_size_gb=50',
    ],
)
```

These components use URI-based I/O because TFX with Beam's DataflowRunner supports only GCS URIs for inputs and outputs. With URI-based I/O, the user must specify every output URI explicitly (e.g. `CsvExampleGen(..., output_examples_uri=...)`); do not forget to do so. The `kfp.dsl.EXECUTION_ID_PLACEHOLDER` object can help construct execution-unique URIs, but if a component has multiple output URIs, each one needs its own distinct prefix.
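
A minimal sketch of one way to build such URIs, giving every output of a component its own path under a shared execution ID. The helper `make_output_uris`, the bucket path, and the output names are hypothetical and not part of the components. `EXECUTION_ID` is a plain-string stand-in so the snippet runs without kfp installed; in a real pipeline you would presumably pass `kfp.dsl.EXECUTION_ID_PLACEHOLDER` instead.

```python
# Stand-in string (an assumption for illustration). In a real pipeline,
# pass kfp.dsl.EXECUTION_ID_PLACEHOLDER as execution_id instead.
EXECUTION_ID = '<execution-id>'


def make_output_uris(root_uri, component_name, output_names,
                     execution_id=EXECUTION_ID):
    """Return a dict mapping each output name to its own distinct URI.

    All URIs share the same execution ID, and the output name is used
    as the part that differs between outputs of one component.
    """
    base = root_uri.rstrip('/')
    return {
        name: '{}/{}/{}/{}'.format(base, component_name, execution_id, name)
        for name in output_names
    }


# Hypothetical bucket and output names, for illustration only.
uris = make_output_uris(
    'gs://my-bucket/tfx/',
    'SchemaGen',
    ['schema', 'stats'],
)
for name, uri in sorted(uris.items()):
    print(name, uri)
```

Each resulting value could then be passed as the matching `output_..._uri=` argument of a component.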

There is a bug in TFX+Beam that prevents using DataflowRunner, but these components contain a workaround. The workaround can be removed when the fixed version of TFX is released: ddb01c0242

* Added the TFX on KFP Dataflow sample

* Updated the README.md file

* Enabled the blessing output of the Evaluator

The Evaluator does not always write to that URI, but for components with URI-based I/O this does not matter.

* Fixed the indent in YAML

* Addressed the review feedback

* Updated the sample after the component changes

* Fixed the Dataflow casing in the sample name

* Using channel_utils.unwrap_channel_dict

* Updated the sample pipeline

* Shortened the .get expressions

* Updated the sample
2020-05-06 13:37:08 -07:00
..
with_URI_IO Components - Added support for Dataflow in TFX components (#3684) 2020-05-06 13:37:08 -07:00
component.py Components - Upgraded the TFX components to 0.21.4 (#3641) 2020-04-29 01:40:24 -07:00
component.yaml Components - Upgraded the TFX components to 0.21.4 (#3641) 2020-04-29 01:40:24 -07:00