{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TFX on KubeFlow Pipelines Example\n",
"\n",
"This notebook should be run inside a KF Pipelines cluster.\n",
"\n",
"### Install TFX and KFP packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip3 install 'tfx==0.15.0' --upgrade\n",
"!python3 -m pip install 'kfp>=0.1.35' --quiet"
]
},
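{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check: confirm that the installed packages import cleanly. If the imports fail or report stale versions, restart the notebook kernel so the upgraded packages are picked up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Confirm the installed package versions (restart the kernel if these look stale).\n",
"import tfx\n",
"import kfp\n",
"print('TFX version:', tfx.__version__)\n",
"print('KFP version:', kfp.__version__)"
]
},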
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enable DataFlow API for your GKE cluster\n",
"<https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview>\n"
]
},
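{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, the API can be enabled from the notebook with `gcloud` (a sketch that assumes the Cloud SDK is installed and authenticated in this environment):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enable the Dataflow API via the gcloud CLI (assumes gcloud is installed and authenticated).\n",
"!gcloud services enable dataflow.googleapis.com"
]
},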
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get the TFX repo with sample pipeline\n"
]
},
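{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the clone step (assumes `git` is available in the notebook environment); it changes into the package directory so the relative `examples/penguin/...` path used below resolves:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Clone the TFX repository to get the penguin sample pipeline and trainer module.\n",
"!git clone https://github.com/tensorflow/tfx\n",
"# Work from the package directory that contains examples/penguin/.\n",
"%cd tfx/tfx"
]
},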
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true,
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"# Directory and data locations (uses Google Cloud Storage).\n",
"import os\n",
"_input_bucket = '<your gcs bucket>'\n",
"_output_bucket = '<your gcs bucket>'\n",
"_pipeline_root = os.path.join(_output_bucket, 'tfx')\n",
"\n",
"# Google Cloud Platform project id to use when deploying this pipeline.\n",
"_project_id = '<your project id>'"
]
},
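{
"cell_type": "markdown",
"metadata": {},
"source": [
"An optional guard that fails fast if the placeholders above were not replaced:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fail fast if any placeholder above was left unchanged.\n",
"for name, value in [('_input_bucket', _input_bucket),\n",
"                    ('_output_bucket', _output_bucket),\n",
"                    ('_project_id', _project_id)]:\n",
"    assert not value.startswith('<'), name + ' still holds a placeholder value'\n",
"assert _input_bucket.startswith('gs://'), '_input_bucket must be a gs:// URI'\n",
"assert _output_bucket.startswith('gs://'), '_output_bucket must be a gs:// URI'"
]
},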
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# copy the trainer code to a storage bucket as the TFX pipeline will need that code file in GCS\n",
"from tensorflow.compat.v1 import gfile\n",
"gfile.Copy('examples/penguin/penguin_utils_cloud_tuner.py', _input_bucket + '/penguin_utils_cloud_tuner.py')"
]
},
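{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally verify that the module landed in GCS before building the pipeline (reuses the `gfile` import and `_input_bucket` from the cells above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the trainer module is now readable at its GCS destination.\n",
"dest = _input_bucket + '/penguin_utils_cloud_tuner.py'\n",
"assert gfile.Exists(dest), 'trainer module not found in GCS: ' + dest"
]
},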
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure the TFX pipeline example\n",
"\n",
"Reload this cell by running the load command to get the pipeline configuration file\n",
"```\n",
"%load https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/penguin_pipeline_kubeflow.py\n",
"```\n",
"\n",
"Configure:\n",
"- Set `_input_bucket` to the GCS directory where you've copied penguin_utils_cloud_tuner.py. I.e. gs://<my bucket>/<path>/\n",
"- Set `_output_bucket` to the GCS directory where you've want the results to be written\n",
"- Set GCP project ID (replace my-gcp-project). Note that it should be project ID, not project name.\n",
"\n",
"The dataset in BigQuery has 100M rows, you can change the query parameters in WHERE clause to limit the number of rows used.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%load https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/penguin_pipeline_kubeflow.py"
]
},
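{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the loaded (and edited) script above compiles the pipeline into a package file. A quick check before submitting (assumes, per the submission cell below, that the package is named `penguin_kubeflow.tar.gz`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Confirm the compiled pipeline package exists before submitting it.\n",
"import os\n",
"assert os.path.exists('penguin_kubeflow.tar.gz'), 'pipeline package not found; run the compilation cell above first'"
]
},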
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit pipeline for execution on the Kubeflow cluster"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import kfp\n",
"\n",
"run_result = kfp.Client(\n",
" host=None # replace with Kubeflow Pipelines endpoint if this notebook is run outside of the Kubeflow cluster.\n",
").create_run_from_pipeline_package('penguin_kubeflow.tar.gz', arguments={})"
]
}
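,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally block until the run finishes. A minimal sketch using the KFP client's `wait_for_run_completion` method (available in the kfp 1.x SDK); adjust the timeout as needed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Wait for the submitted run to finish (up to one hour) and print its final status.\n",
"run_detail = kfp.Client().wait_for_run_completion(run_result.run_id, timeout=3600)\n",
"print('Run status:', run_detail.run.status)"
]
}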
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}