mirror of https://github.com/kubeflow/examples.git
Github Issue Summarization - Train using TFJob (#55)
* Github Issue Summarization - Train using TFJob
* Create a Dockerfile to build the image for tf-job
* Create a manifest to deploy the tf-job
* Create instructions on how to do all of this

Fixes https://github.com/kubeflow/examples/issues/43

* Address comments
* Add gcloud commands
* Add ks app
* Update Dockerfile base image
* Python train.py fixes
* Remove tfjob.yaml as it is replaced by ksonnet app
* Remove plot_model_history as it is not required for tfjob training
* Don't change WORKDIR
* Address reviewer comments
* Fix links
* Fix lint issues using yapf
* Sort imports
This commit is contained in:
parent 41372c9314
commit b24152cf06
@@ -27,7 +27,9 @@ By the end of this tutorial, you should learn how to:
 ## Steps:

 1. [Setup a Kubeflow cluster](setup_a_kubeflow_cluster.md)
-1. [Training the model](training_the_model.md)
+1. Training the model. You can train the model either using Jupyter Notebook or using TFJob.
+1. [Training the model using a Jupyter Notebook](training_the_model.md)
+1. [Training the model using TFJob](training_the_model_tfjob.md)
 1. [Serving the model](serving_the_model.md)
 1. [Querying the model](querying_the_model.md)
 1. [Teardown](teardown.md)
@@ -0,0 +1,3 @@
FROM gcr.io/kubeflow-images-staging/tensorflow-1.6.0-notebook-cpu
COPY tf-job/train.py /workdir/train.py
COPY seq2seq_utils.py /workdir/seq2seq_utils.py
@@ -0,0 +1,39 @@
apiVersion: 0.1.0
gitVersion:
  commitSha: 40285d8a14f1ac5787e405e1023cf0c07f6aa28c
  refSpec: master
kind: ksonnet.io/registry
libraries:
  apache:
    path: apache
    version: master
  efk:
    path: efk
    version: master
  mariadb:
    path: mariadb
    version: master
  memcached:
    path: memcached
    version: master
  mongodb:
    path: mongodb
    version: master
  mysql:
    path: mysql
    version: master
  nginx:
    path: nginx
    version: master
  node:
    path: node
    version: master
  postgres:
    path: postgres
    version: master
  redis:
    path: redis
    version: master
  tomcat:
    path: tomcat
    version: master
@@ -0,0 +1,18 @@
apiVersion: 0.1.0
environments:
  default:
    destination:
      namespace: namespace
      server: https://1.2.3.4
    k8sVersion: v1.7.0
    path: default
kind: ksonnet.io/app
name: ks-app
registries:
  incubator:
    gitVersion:
      commitSha: 40285d8a14f1ac5787e405e1023cf0c07f6aa28c
      refSpec: master
    protocol: github
    uri: github.com/ksonnet/parts/tree/master/incubator
version: 0.0.1
@@ -0,0 +1,13 @@
{
  global: {
    // User-defined global parameters; accessible to all component and environments, Ex:
    // replicas: 4,
  },
  components: {
    // Component-level parameters, defined initially from 'ks prototype use ...'
    // Each object below should correspond to a component in the components/ directory
    tfjob: {

    },
  },
}
@@ -0,0 +1,7 @@
local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components["tfjob"];
local k = import "k.libsonnet";

local tfjob = import "tfjob.libsonnet";

std.prune(k.core.v1.list.new([tfjob.parts(params)]))
@@ -0,0 +1,67 @@
{
  parts(params):: {
    apiVersion: "kubeflow.org/v1alpha1",
    kind: "TFJob",
    metadata: {
      name: "tf-job-issue-summarization",
      namespace: params.namespace,
    },
    spec: {
      replicaSpecs: [
        {
          replicas: 1,
          template: {
            spec: {
              containers: [
                {
                  image: params.image,
                  name: "tensorflow",
                  volumeMounts: [
                    {
                      name: "gcp-credentials",
                      mountPath: "/secret/gcp-credentials",
                      readOnly: true,
                    },
                  ],
                  command: [
                    "python",
                  ],
                  args: [
                    "/workdir/train.py",
                    "--sample_size=" + params.sample_size,
                    "--input_data_gcs_bucket=" + params.input_data_gcs_bucket,
                    "--input_data_gcs_path=" + params.input_data_gcs_path,
                    "--output_model_gcs_bucket=" + params.output_model_gcs_bucket,
                    "--output_model_gcs_path=" + params.output_model_gcs_path,
                  ],
                  env: [
                    {
                      name: "GOOGLE_APPLICATION_CREDENTIALS",
                      value: "/secret/gcp-credentials/key.json",
                    },
                  ],
                },
              ],
              volumes: [
                {
                  name: "gcp-credentials",
                  secret: {
                    secretName: "gcp-credentials",
                  },
                },
              ],
              restartPolicy: "OnFailure",
            },
          },
          tfReplicaType: "MASTER",
        },
      ],
      terminationPolicy: {
        chief: {
          replicaIndex: 0,
          replicaName: "MASTER",
        },
      },
    },
  },
}
@@ -0,0 +1,4 @@
local components = std.extVar("__ksonnet/components");
components {
  // Insert user-specified overrides here.
}
@@ -0,0 +1,7 @@
local base = import "base.libsonnet";
local k = import "k.libsonnet";

base {
  // Insert user-specified overrides here. For example if a component is named "nginx-deployment", you might have something like:
  //   "nginx-deployment"+: k.deployment.mixin.metadata.labels({foo: "bar"})
}
@@ -0,0 +1,10 @@
local params = import "../../components/params.libsonnet";
params {
  components+: {
    // Insert component parameter overrides here. Ex:
    // guestbook +: {
    //   name: "guestbook-dev",
    //   replicas: params.global.replicas,
    // },
  },
}
@@ -0,0 +1,80 @@
local k8s = import "k8s.libsonnet";

local apps = k8s.apps;
local core = k8s.core;
local extensions = k8s.extensions;

local hidden = {
  mapContainers(f):: {
    local podContainers = super.spec.template.spec.containers,
    spec+: {
      template+: {
        spec+: {
          // IMPORTANT: This overwrites the 'containers' field
          // for this deployment.
          containers: std.map(f, podContainers),
        },
      },
    },
  },

  mapContainersWithName(names, f)::
    local nameSet =
      if std.type(names) == "array"
      then std.set(names)
      else std.set([names]);
    local inNameSet(name) = std.length(std.setInter(nameSet, std.set([name]))) > 0;
    self.mapContainers(
      function(c)
        if std.objectHas(c, "name") && inNameSet(c.name)
        then f(c)
        else c
    ),
};

k8s {
  apps:: apps {
    v1beta1:: apps.v1beta1 {
      local v1beta1 = apps.v1beta1,

      daemonSet:: v1beta1.daemonSet {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },

      deployment:: v1beta1.deployment {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
    },
  },

  core:: core {
    v1:: core.v1 {
      list:: {
        new(items)::
          { apiVersion: "v1" } +
          { kind: "List" } +
          self.items(items),

        items(items):: if std.type(items) == "array" then { items+: items } else { items+: [items] },
      },
    },
  },

  extensions:: extensions {
    v1beta1:: extensions.v1beta1 {
      local v1beta1 = extensions.v1beta1,

      daemonSet:: v1beta1.daemonSet {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },

      deployment:: v1beta1.deployment {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
    },
  },
}
(Diffs for two additional files suppressed because they are too large.)
@@ -0,0 +1,205 @@
"""Train the github-issue-summarization model.
train.py trains the github-issue-summarization model.

It reads the input data from GCS in a zip file format.
--input_data_gcs_bucket and --input_data_gcs_path specify
the location of input data.

It writes the model back to GCS.
--output_model_gcs_bucket and --output_model_gcs_path specify
the location of output.

It also has parameters which control the training, like
--learning_rate and --sample_size

"""
import argparse
import logging
import zipfile

from google.cloud import storage  # pylint: disable=no-name-in-module
import dill as dpickle
import numpy as np
import pandas as pd
from keras import optimizers
from keras.layers import GRU, BatchNormalization, Dense, Embedding, Input
from keras.models import Model
from sklearn.model_selection import train_test_split

from ktext.preprocess import processor
from seq2seq_utils import load_encoder_inputs, load_text_processor


def main():  # pylint: disable=too-many-statements
  # Parsing flags.
  parser = argparse.ArgumentParser()
  parser.add_argument("--sample_size", type=int, default=2000000)
  parser.add_argument("--learning_rate", default="0.001")

  parser.add_argument(
      "--input_data_gcs_bucket", type=str, default="kubeflow-examples")
  parser.add_argument(
      "--input_data_gcs_path",
      type=str,
      default="github-issue-summarization-data/github-issues.zip")

  parser.add_argument(
      "--output_model_gcs_bucket", type=str, default="kubeflow-examples")
  parser.add_argument(
      "--output_model_gcs_path",
      type=str,
      default="github-issue-summarization-data/output_model.h5")

  parser.add_argument(
      "--output_body_preprocessor_dpkl",
      type=str,
      default="body_preprocessor.dpkl")
  parser.add_argument(
      "--output_title_preprocessor_dpkl",
      type=str,
      default="title_preprocessor.dpkl")
  parser.add_argument(
      "--output_train_title_vecs_npy", type=str, default="train_title_vecs.npy")
  parser.add_argument(
      "--output_train_body_vecs_npy", type=str, default="train_body_vecs.npy")
  parser.add_argument("--output_model_h5", type=str, default="output_model.h5")

  args = parser.parse_args()
  logging.info(args)

  learning_rate = float(args.learning_rate)

  pd.set_option('display.max_colwidth', 500)

  # Download the training data from GCS.
  bucket = storage.Bucket(storage.Client(), args.input_data_gcs_bucket)
  storage.Blob(args.input_data_gcs_path,
               bucket).download_to_filename('github-issues.zip')

  zip_ref = zipfile.ZipFile('github-issues.zip', 'r')
  zip_ref.extractall('.')
  zip_ref.close()

  # Read in the data, sampling 2M rows by default (for speed of tutorial).
  traindf, testdf = train_test_split(
      pd.read_csv('github_issues.csv').sample(n=args.sample_size), test_size=.10)

  # Print stats about the shape of the data.
  logging.info('Train: %d rows %d columns', traindf.shape[0], traindf.shape[1])
  logging.info('Test: %d rows %d columns', testdf.shape[0], testdf.shape[1])

  train_body_raw = traindf.body.tolist()
  train_title_raw = traindf.issue_title.tolist()

  # Clean, tokenize, and apply padding / truncating such that each document
  # length = 70. Also, retain only the top 8,000 words in the vocabulary and set
  # the remaining words to 1, which becomes the common index for rare words.
  body_pp = processor(keep_n=8000, padding_maxlen=70)
  train_body_vecs = body_pp.fit_transform(train_body_raw)

  logging.info('Example original body: %s', train_body_raw[0])
  logging.info('Example body after pre-processing: %s', train_body_vecs[0])

  # Instantiate a text processor for the titles, with some different parameters.
  title_pp = processor(
      append_indicators=True, keep_n=4500, padding_maxlen=12, padding='post')

  # Process the title data.
  train_title_vecs = title_pp.fit_transform(train_title_raw)

  logging.info('Example original title: %s', train_title_raw[0])
  logging.info('Example title after pre-processing: %s', train_title_vecs[0])

  # Save the preprocessors.
  with open(args.output_body_preprocessor_dpkl, 'wb') as f:
    dpickle.dump(body_pp, f)

  with open(args.output_title_preprocessor_dpkl, 'wb') as f:
    dpickle.dump(title_pp, f)

  # Save the processed data.
  np.save(args.output_train_title_vecs_npy, train_title_vecs)
  np.save(args.output_train_body_vecs_npy, train_body_vecs)

  _, doc_length = load_encoder_inputs(
      args.output_train_body_vecs_npy)

  num_encoder_tokens, body_pp = load_text_processor(
      args.output_body_preprocessor_dpkl)
  num_decoder_tokens, title_pp = load_text_processor(
      args.output_title_preprocessor_dpkl)

  # Arbitrarily set latent dimension for embedding and hidden units.
  latent_dim = 300

  ###############
  # Encoder Model.
  ###############
  encoder_inputs = Input(shape=(doc_length,), name='Encoder-Input')

  # Word embedding for encoder (ex: Issue Body).
  x = Embedding(
      num_encoder_tokens, latent_dim, name='Body-Word-Embedding',
      mask_zero=False)(encoder_inputs)
  x = BatchNormalization(name='Encoder-Batchnorm-1')(x)

  # We do not need the `encoder_output`, just the hidden state.
  _, state_h = GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

  # Encapsulate the encoder as a separate entity so we can just
  # encode without decoding if we want to.
  encoder_model = Model(
      inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')

  seq2seq_encoder_out = encoder_model(encoder_inputs)

  ################
  # Decoder Model.
  ################
  decoder_inputs = Input(
      shape=(None,), name='Decoder-Input')  # for teacher forcing

  # Word embedding for decoder (ex: Issue Titles).
  dec_emb = Embedding(
      num_decoder_tokens,
      latent_dim,
      name='Decoder-Word-Embedding',
      mask_zero=False)(decoder_inputs)
  dec_bn = BatchNormalization(name='Decoder-Batchnorm-1')(dec_emb)

  # Set up the decoder, using `decoder_state_input` as initial state.
  decoder_gru = GRU(
      latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
  decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
  x = BatchNormalization(name='Decoder-Batchnorm-2')(decoder_gru_output)

  # Dense layer for prediction.
  decoder_dense = Dense(
      num_decoder_tokens, activation='softmax', name='Final-Output-Dense')
  decoder_outputs = decoder_dense(x)

  ################
  # Seq2Seq Model.
  ################

  seq2seq_Model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

  seq2seq_Model.compile(
      optimizer=optimizers.Nadam(lr=learning_rate),
      loss='sparse_categorical_crossentropy')

  seq2seq_Model.summary()

  #############
  # Save model.
  #############
  seq2seq_Model.save(args.output_model_h5)

  ######################
  # Upload model to GCS.
  ######################
  bucket = storage.Bucket(storage.Client(), args.output_model_gcs_bucket)
  storage.Blob(args.output_model_gcs_path, bucket).upload_from_filename(
      args.output_model_h5)


if __name__ == '__main__':
  main()
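
The flags above also make it possible to smoke-test the script outside the cluster before building the image. A minimal sketch, assuming the Python dependencies are installed locally, `GOOGLE_APPLICATION_CREDENTIALS` points at a key with access to the buckets, and a deliberately small `--sample_size`; the output bucket placeholder is illustrative:

```commandline
# Hypothetical local smoke test; flag names mirror the argparse definitions above.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
python train.py \
  --sample_size=200 \
  --learning_rate=0.001 \
  --input_data_gcs_bucket=kubeflow-examples \
  --input_data_gcs_path=github-issue-summarization-data/github-issues.zip \
  --output_model_gcs_bucket=<your-bucket> \
  --output_model_gcs_path=github-issue-summarization-data/output_model.h5
```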
@@ -0,0 +1,92 @@
# Training the model using TFJob

Kubeflow offers a TensorFlow job controller for Kubernetes. This allows you to run your distributed TensorFlow training
job on a Kubernetes cluster. For this training job, we will read our training data from GCS and write our output model
back to GCS.
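
Before submitting a job, it can help to confirm that the tf-operator is running in your cluster. A quick check, assuming the v1alpha1 TFJob API used by this example and the `name=tf-job-operator` label applied by the standard Kubeflow deployment:

```commandline
# Both commands assume the Kubeflow tf-operator has already been deployed.
kubectl get crd tfjobs.kubeflow.org
kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator
```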

## Create the image for training

The [tf-job](notebooks/tf-job) directory contains the necessary files to create an image for training. The [train.py](notebooks/tf-job/train.py) file contains the training code. Here is how you can create the image and push it to gcr.io:

```commandline
cd notebooks/
docker build . -t gcr.io/agwl-kubeflow/tf-job-issue-summarization:latest
gcloud docker -- push gcr.io/agwl-kubeflow/tf-job-issue-summarization:latest
```
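
If you want to confirm the push succeeded, one option is to list the tags stored in the registry; this assumes you have read access to the `gcr.io/agwl-kubeflow` project used above:

```commandline
# Lists the tags available for the image pushed above.
gcloud container images list-tags gcr.io/agwl-kubeflow/tf-job-issue-summarization
```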

## GCS Service account

* Create a service account which will be used to read and write data from the GCS bucket.

* Give the service account the `roles/storage.admin` role so that it can access GCS buckets.

* Download its key as a JSON file and create a secret named `gcp-credentials` with the key `key.json`:

```commandline
SERVICE_ACCOUNT=github-issue-summarization
PROJECT=kubeflow-example-project # The GCP Project name
gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \
  --display-name "GCP Service Account for use with kubeflow examples"

gcloud projects add-iam-policy-binding ${PROJECT} --member \
  serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin

KEY_FILE=/home/agwl/secrets/${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com.json
gcloud iam service-accounts keys create ${KEY_FILE} \
  --iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com

kubectl --namespace=${NAMESPACE} create secret generic gcp-credentials --from-file=key.json="${KEY_FILE}"
```
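
To double-check that the secret was created with the expected key, you can describe it; the secret name and namespace match the `kubectl create secret` command above:

```commandline
# Should show a single data entry named key.json.
kubectl describe secret gcp-credentials --namespace=${NAMESPACE}
```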

## Run the TFJob using your image

[tf-job](notebooks/tf-job) contains a ksonnet app ([ks-app](notebooks/tf-job/ks-app)) to deploy the TFJob.

Create an environment to deploy the ksonnet app:

```commandline
cd notebooks/tf-job/ks-app
ks env add tfjob --namespace ${NAMESPACE}
```

Set the appropriate params for the tfjob component:

```commandline
ks param set tfjob namespace ${NAMESPACE} --env=tfjob

# The image pushed in the previous step
ks param set tfjob image "gcr.io/agwl-kubeflow/tf-job-issue-summarization:latest" --env=tfjob

# Sample size for training
ks param set tfjob sample_size 100000 --env=tfjob

# Set the input and output GCS bucket locations
ks param set tfjob input_data_gcs_bucket "kubeflow-examples" --env=tfjob
ks param set tfjob input_data_gcs_path "github-issue-summarization-data/github-issues.zip" --env=tfjob
ks param set tfjob output_model_gcs_bucket "kubeflow-examples" --env=tfjob
ks param set tfjob output_model_gcs_path "github-issue-summarization-data/output_model.h5" --env=tfjob
```
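
Before applying, you can render the manifest these params produce and check the image, args, and secret mount. A quick preview using ksonnet's `show` command against the `tfjob` environment created above:

```commandline
# Prints the generated TFJob manifest without creating anything in the cluster.
ks show tfjob -c tfjob
```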

Deploy the app:

```commandline
ks apply tfjob -c tfjob
```

In a while you should see a new pod with the label `tf_job_name=tf-job-issue-summarization`:

```commandline
kubectl get pods -n=${NAMESPACE} -ltf_job_name=tf-job-issue-summarization
```
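
You can also inspect the TFJob resource itself, which reports the state of its replicas; the resource names here assume the v1alpha1 TFJob CRD:

```commandline
kubectl get tfjobs -n=${NAMESPACE}
kubectl describe tfjob tf-job-issue-summarization -n=${NAMESPACE}
```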

You can view the logs of the tf-job operator using:

```commandline
kubectl logs -f $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}')
```

You can view the actual training logs using:

```commandline
kubectl logs -f $(kubectl get pods -n=${NAMESPACE} -ltf_job_name=tf-job-issue-summarization -o=jsonpath='{.items[0].metadata.name}')
```
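
Once training finishes, one way to confirm that the model was uploaded is to list the object at the output location set in the params above; substitute your own bucket and path if you changed them:

```commandline
gsutil ls -l gs://kubeflow-examples/github-issue-summarization-data/output_model.h5
```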