TFX Pipeline Example

This sample walks through running the TFX Taxi Application Example on a Kubeflow Pipelines cluster.

Overview

This pipeline demonstrates TFX capabilities at scale. It uses a public BigQuery dataset and GCP services to preprocess the data (Dataflow) and train the model (Cloud ML Engine). The trained model is then deployed to the Cloud ML Engine Prediction service.

Setup

Enable the Dataflow API for your GKE cluster's project: https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview

Create a local Python 3.5 conda environment

conda create -n tfx-kfp pip python=3.5.3

then activate it:

conda activate tfx-kfp

Install TFX and Kubeflow Pipelines SDK

pip3 install 'tfx==0.14.0' --upgrade
pip3 install 'kfp>=0.1.31' --upgrade
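
After installing, it can help to confirm that both packages resolved in the active environment. The snippet below is an optional, hypothetical check (not part of the sample) using pkg_resources, which ships with setuptools; the package names are the PyPI names used above.

```python
# Hypothetical post-install check: print the versions that actually resolved.
# pkg_resources ships with setuptools, so no extra installs are needed.
import pkg_resources

def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return pkg_resources.get_distribution(name).version
    except pkg_resources.DistributionNotFound:
        return None

for pkg in ('tfx', 'kfp'):
    print(pkg, installed_version(pkg))
```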

Upload the utility code to your storage bucket. You can modify this code if needed for a different dataset.

gsutil cp utils/taxi_utils.py gs://my-bucket/<path>/

If gsutil does not work, try tensorflow.gfile:

from tensorflow import gfile
gfile.Copy('utils/taxi_utils.py', 'gs://<my bucket>/<path>/taxi_utils.py')
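
Whichever copy method you use, the destination must be a well-formed GCS path. GCS object names always use forward slashes, so if you assemble the path in Python, posixpath.join is safer than os.path.join (which would emit backslashes on Windows). gcs_join below is a hypothetical helper, not part of TFX or this sample:

```python
import posixpath

def gcs_join(bucket, *parts):
    # GCS object names always use '/', regardless of the local OS separator.
    return posixpath.join('gs://' + bucket, *parts)

print(gcs_join('my-bucket', 'tfx', 'utils', 'taxi_utils.py'))
# gs://my-bucket/tfx/utils/taxi_utils.py
```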

Configure the TFX Pipeline

Modify the pipeline configurations in

TFX Example.ipynb

  • Set _input_bucket to the GCS directory where you've copied taxi_utils.py, i.e. gs://<my bucket>/<path>/
  • Set _output_bucket to the GCS directory where you want the results to be written
  • Set the GCP project ID (replace my-gcp-project). Note that it should be the project ID, not the project name.
  • The original BigQuery dataset has 100M rows, which can take time to process. Modify the selection criteria (% of records) to run a sample test.

Compile and run the pipeline

Run the notebook.

This will generate a file named chicago_taxi_pipeline_kubeflow.tar.gz. Upload this file to the Pipelines cluster and create a run.
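
Before uploading, you can sanity-check the compiled package with the standard library's tarfile module. list_package is a hypothetical helper; it simply lists the archive's members so you can confirm the package was written correctly:

```python
import tarfile

def list_package(path):
    """Return the member names inside a compiled pipeline package (.tar.gz)."""
    with tarfile.open(path, 'r:gz') as tar:
        return tar.getnames()

# Example (assumes the notebook has already produced the archive):
# print(list_package('chicago_taxi_pipeline_kubeflow.tar.gz'))
```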