{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Kubeflow Pipelines: GitHub Issue Summarization using Tensor2Tensor\n",
"\n",
"This notebook assumes that you have already set up a GKE cluster with CAIP Pipelines (Hosted KFP) installed, with the addition of a GPU-enabled node pool, as per this codelab: [g.co/codelabs/kubecon18](https://g.co/codelabs/kubecon18).\n",
"\n",
"In this notebook, we will show how to:\n",
"\n",
"* Interactively define a Kubeflow Pipeline using the Pipelines Python SDK\n",
"* Submit and run the pipeline\n",
"* Add a step to the pipeline\n",
"\n",
"This example pipeline trains a [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor/) model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model to [TensorFlow Serving](https://github.com/tensorflow/serving). \n",
"The final step in the pipeline launches a web app that interacts with the TF-Serving instance to get model predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do some installations and imports, and set some variables. Set the `WORKING_DIR` to a path under the Cloud Storage bucket you created earlier. You may need to restart your kernel after the KFP SDK update."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -U kfp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restart kernel after the pip install\n",
"import IPython\n",
"\n",
"IPython.Application.instance().kernel.do_shutdown(True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp # the Pipelines SDK. \n",
"from kfp import compiler\n",
"import kfp.dsl as dsl\n",
"import kfp.gcp as gcp\n",
"import kfp.components as comp\n",
"from kfp.dsl.types import Integer, GCSPath, String\n",
"\n",
"import kfp.notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define some pipeline input variables. \n",
"WORKING_DIR = 'gs://YOUR_GCS_BUCKET/t2t/notebooks' # Such as gs://bucket/object/path\n",
"\n",
"PROJECT_NAME = 'YOUR_PROJECT'\n",
"GITHUB_TOKEN = 'YOUR_GITHUB_TOKEN' # optional; used for prediction, to grab issue data from GH\n",
"\n",
"DEPLOY_WEBAPP = 'false' # change this to 'true' to deploy a new version of the webapp part of the pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiate the KFP client and create an *Experiment* in the Kubeflow Pipelines system\n",
"\n",
"Next we'll instantiate a KFP client object with the `host` info from your Hosted KFP installation. To do this, go to the Pipelines dashboard in the Cloud Console and click on the \"Settings\" gear for the KFP installation that you want to use. In the popup window, look for the \"Connect to this Kubeflow Pipelines instance...\" text and copy the `client = kfp.Client(...)` line below it, then edit the following cell to use that line."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# CHANGE THIS to the host info for your KFP cluster installation\n",
"client = kfp.Client(host='xxxxxxxx-dot-us-centralx.pipelines.googleusercontent.com')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Kubeflow Pipelines system requires an \"Experiment\" to group pipeline runs. You can create a new experiment, or call `client.list_experiments()` to get existing ones. (This also serves as a check that your client is set up properly.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.list_experiments()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"exp = client.create_experiment(name='t2t_notebook')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define a Pipeline\n",
"\n",
"Authoring a pipeline is much like authoring a normal Python function: the pipeline function describes the topology of the pipeline, and the pipeline components (steps) are container-based. For this pipeline, we're using a mix of predefined components loaded from their [component definition files](https://www.kubeflow.org/docs/pipelines/sdk/component-development/), and some components defined via [the `dsl.ContainerOp` constructor](https://www.kubeflow.org/docs/pipelines/sdk/build-component/). For this codelab, we've prebuilt all the components' containers.\n",
"\n",
"While not used in this pipeline, there are other ways to build Kubeflow Pipelines components as well, including converting standalone Python functions to containers via [`kfp.components.func_to_container_op(func)`](https://www.kubeflow.org/docs/pipelines/sdk/lightweight-python-components/), as sketched in the cell below. You can read more [here](https://www.kubeflow.org/docs/pipelines/sdk/).\n"
]
},
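{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make that last point concrete, the next cell is a minimal sketch (not used by the rest of this notebook) of turning a standalone Python function into a component with `kfp.components.func_to_container_op`. The `add` function and the `python:3.7` base image are just placeholders chosen for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative 'lightweight' component -- not part of the pipeline defined below.\n",
"def add(a: float, b: float) -> float:\n",
"    \"\"\"Return the sum of two numbers.\"\"\"\n",
"    return a + b\n",
"\n",
"# Wrap the function as a containerized pipeline op (the base image is an arbitrary choice here).\n",
"add_op = comp.func_to_container_op(add, base_image='python:3.7')"
]
},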
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pipeline has several steps:\n",
"\n",
"- An existing model checkpoint is copied to your bucket.\n",
"- Dataset metadata is logged to the Kubeflow metadata server.\n",
"- A [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor/) model is trained using preprocessed data. (Training starts from the existing model checkpoint copied in the first step, then trains for a few hundred more steps -- it would take too long to fully train it now.) When it finishes, it exports the model in a form suitable for serving by [TensorFlow Serving](https://github.com/tensorflow/serving/).\n",
"- Training metadata is logged to the metadata server.\n",
"- The next step in the pipeline deploys a TensorFlow Serving instance using that model.\n",
"- The last step launches a web app for interacting with the served model to retrieve predictions.\n",
"\n",
"We'll first define some constants and load some components from their definition files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"COPY_ACTION = 'copy_data'\n",
"TRAIN_ACTION = 'train'\n",
"DATASET = 'dataset'\n",
"MODEL = 'model'\n",
"\n",
"copydata_op = comp.load_component_from_url(\n",
"  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/datacopy_component.yaml' # pylint: disable=line-too-long\n",
"  )\n",
"\n",
"train_op = comp.load_component_from_url(\n",
"  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/train_component.yaml' # pylint: disable=line-too-long\n",
"  )\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll define the pipeline itself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@dsl.pipeline(\n",
"  name='Github issue summarization',\n",
"  description='Demonstrate Tensor2Tensor-based training and TF-Serving'\n",
")\n",
"def gh_summ(\n",
"  train_steps: 'Integer' = 2019300,\n",
"  project: str = 'YOUR_PROJECT_HERE',\n",
"  github_token: str = 'YOUR_GITHUB_TOKEN_HERE',\n",
"  working_dir: 'GCSPath' = 'gs://YOUR_GCS_DIR_HERE',\n",
"  checkpoint_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/model_output_tbase.bak2019000/',\n",
"  deploy_webapp: str = 'true',\n",
"  data_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/t2t_data_gh_all/'\n",
"  ):\n",
"\n",
"  copydata = copydata_op(\n",
"    data_dir=data_dir,\n",
"    checkpoint_dir=checkpoint_dir,\n",
"    model_dir='%s/%s/model_output' % (working_dir, dsl.RUN_ID_PLACEHOLDER),\n",
"    action=COPY_ACTION,\n",
"    )\n",
"\n",
"  train = train_op(\n",
"    data_dir=data_dir,\n",
"    model_dir=copydata.outputs['copy_output_path'],\n",
"    action=TRAIN_ACTION, train_steps=train_steps,\n",
"    deploy_webapp=deploy_webapp\n",
"    )\n",
"\n",
"  serve = dsl.ContainerOp(\n",
"    name='serve',\n",
"    image='gcr.io/google-samples/ml-pipeline-kubeflow-tfserve:v5',\n",
"    arguments=[\"--model_name\", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),\n",
"        \"--model_path\", train.outputs['train_output_path']\n",
"        ]\n",
"    )\n",
"\n",
"  train.set_gpu_limit(1)\n",
"\n",
"  with dsl.Condition(train.outputs['launch_server'] == 'true'):\n",
"    webapp = dsl.ContainerOp(\n",
"      name='webapp',\n",
"      image='gcr.io/google-samples/ml-pipeline-webapp-launcher:v7ap',\n",
"      arguments=[\"--model_name\", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),\n",
"          \"--github_token\", github_token]\n",
"\n",
"      )\n",
"    webapp.after(serve)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit an experiment *run*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compiler.Compiler().compile(gh_summ, 'ghsumm.tar.gz')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The call below will run the compiled pipeline. We won't actually do that now, but instead we'll add a new step to the pipeline, then run it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You'd uncomment this call to actually run the pipeline. \n",
"# run = client.run_pipeline(exp.id, 'ghsumm', 'ghsumm.tar.gz',\n",
"#   params={'working_dir': WORKING_DIR,\n",
"#     'github_token': GITHUB_TOKEN,\n",
"#     'project': PROJECT_NAME})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add a step to the pipeline\n",
"\n",
"Next, let's add a new step to the pipeline above. As currently defined, the pipeline accesses a directory of pre-processed data as input to training. Let's see how we could include the pre-processing as part of the pipeline. \n",
"\n",
"We're going to cheat a bit: processing the full dataset would take too long for this workshop, so we'll use a smaller sample. For that reason, you won't actually make use of the generated data from this step (we'll stick to using the full dataset for training), but it shows how you could do so given more time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we'll define the new pipeline step, which runs the data preprocessing in a prebuilt container."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the new data preprocessing pipeline step.\n",
"def preproc_op(data_dir, project):\n",
"  return dsl.ContainerOp(\n",
"    name='datagen',\n",
"    image='gcr.io/google-samples/ml-pipeline-t2tproc',\n",
"    arguments=[\"--data-dir\", data_dir, \"--project\", project]\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modify the pipeline to add the new step\n",
"\n",
"Now, we'll redefine the pipeline to add the new step. We're reusing the component ops defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Then define a new Pipeline. It's almost the same as the original one, \n",
"# but with the addition of the data processing step.\n",
"\n",
"@dsl.pipeline(\n",
"  name='Github issue summarization',\n",
"  description='Demonstrate Tensor2Tensor-based training and TF-Serving'\n",
")\n",
"def gh_summ2(\n",
"  train_steps: 'Integer' = 2019300,\n",
"  project: str = 'YOUR_PROJECT_HERE',\n",
"  github_token: str = 'YOUR_GITHUB_TOKEN_HERE',\n",
"  working_dir: 'GCSPath' = 'YOUR_GCS_DIR_HERE',\n",
"  checkpoint_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/model_output_tbase.bak2019000/',\n",
"  deploy_webapp: str = 'true',\n",
"  data_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/t2t_data_gh_all/'\n",
"  ):\n",
"\n",
"  # The new pre-processing op.\n",
"  preproc = preproc_op(project=project,\n",
"    data_dir=('%s/%s/gh_data' % (working_dir, dsl.RUN_ID_PLACEHOLDER)))\n",
"\n",
"  copydata = copydata_op(\n",
"    data_dir=data_dir,\n",
"    checkpoint_dir=checkpoint_dir,\n",
"    model_dir='%s/%s/model_output' % (working_dir, dsl.RUN_ID_PLACEHOLDER),\n",
"    action=COPY_ACTION,\n",
"    )\n",
"\n",
"  train = train_op(\n",
"    data_dir=data_dir,\n",
"    model_dir=copydata.outputs['copy_output_path'],\n",
"    action=TRAIN_ACTION, train_steps=train_steps,\n",
"    deploy_webapp=deploy_webapp\n",
"    )\n",
"  train.after(preproc)\n",
"\n",
"  serve = dsl.ContainerOp(\n",
"    name='serve',\n",
"    image='gcr.io/google-samples/ml-pipeline-kubeflow-tfserve:v5',\n",
"    arguments=[\"--model_name\", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),\n",
"        \"--model_path\", train.outputs['train_output_path']\n",
"        ]\n",
"    )\n",
"\n",
"  train.set_gpu_limit(1)\n",
"\n",
"  with dsl.Condition(train.outputs['launch_server'] == 'true'):\n",
"    webapp = dsl.ContainerOp(\n",
"      name='webapp',\n",
"      image='gcr.io/google-samples/ml-pipeline-webapp-launcher:v7ap',\n",
"      arguments=[\"--model_name\", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),\n",
"          \"--github_token\", github_token]\n",
"\n",
"      )\n",
"    webapp.after(serve)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compile the new pipeline definition and submit the run"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compiler.Compiler().compile(gh_summ2, 'ghsumm2.tar.gz')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"run = client.run_pipeline(exp.id, 'ghsumm2', 'ghsumm2.tar.gz',\n",
"  params={'working_dir': WORKING_DIR,\n",
"    'github_token': GITHUB_TOKEN,\n",
"    'deploy_webapp': DEPLOY_WEBAPP,\n",
"    'project': PROJECT_NAME})"
]
},
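{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can have the notebook block until the run finishes. The next cell is a minimal sketch of that, assuming the `run` object returned above; the 7200-second timeout is an arbitrary choice. It's left commented out since training takes a while."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment to wait for the submitted run to complete (or time out), then print its status.\n",
"# result = client.wait_for_run_completion(run.id, timeout=7200)\n",
"# print(result.run.status)"
]
},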
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should be able to see your newly defined pipeline run in the Pipelines dashboard."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The new pipeline has the following structure: the new `datagen` preprocessing step runs ahead of training (alongside the checkpoint-copy step), followed by the training, serving, and conditional web app steps.\n",
"\n",
"You can view the pipeline's graph, and watch it run, from the Pipelines dashboard."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When this new pipeline finishes running, you'll be able to see the generated processed data files in GCS under the path `WORKING_DIR/<run_id>/gh_data` (this is the `data_dir` passed to the `datagen` step above). There isn't time in the workshop to pre-process the full dataset, but if there were, we could define the pipeline to read from that generated directory for its training input."
]
},
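{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like to check for those generated files from the notebook once the run has finished, a listing along these lines should work. This is just a sketch: it assumes the `WORKING_DIR` variable set earlier, and that `gsutil` is installed and authenticated for your project."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List everything written under WORKING_DIR, including the <run_id>/gh_data output.\n",
"!gsutil ls -r {WORKING_DIR}"
]
},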
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----------------------------\n",
"Copyright 2018 Google LLC\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"you may not use this file except in compliance with the License.\n",
"You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
"Unless required by applicable law or agreed to in writing, software\n",
"distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"See the License for the specific language governing permissions and\n",
"limitations under the License."
]
}
],
"metadata": {
"environment": {
"name": "tf2-2-2-gpu.2-2.m48",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/tf2-2-2-gpu.2-2:m48"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}