mirror of https://github.com/kubeflow/examples.git
Create a notebook for mnist E2E on GCP (#723)
* A notebook to run the mnist E2E example on GCP.
  This fixes a number of issues with the example:
  * Use ISTIO instead of Ambassador to add reverse proxy routes.
  * The training job needs to be updated to run in a profile-created namespace in order to have the required service accounts.
    * See kubeflow/examples#713
    * Running inside a notebook hosted on Kubeflow should ensure the user is working inside an appropriately set up namespace.
  * With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server.
    * A short-term fix was to not include the ISTIO sidecar.
    * In the future we can add an appropriate ISTIO RBAC policy.
  * Using a notebook allows us to eliminate the use of kustomize.
    * This resolves kubeflow/examples#713, which required people to use an old version of kustomize.
    * Rather than using kustomize we can use Python f-strings to write the YAML specs and then easily substitute in user-specific values (see the sketch after the commit message).
    * This should be more informative; it avoids introducing kustomize, and users can see the resource specs.
  * I've opted to make the notebook GCP specific. I think it's less confusing to users to have separate notebooks focused on specific platforms rather than one notebook with a lot of caveats about what to do under different conditions.
  * I've deleted the kustomize overlays for GCS since we don't want users to use them anymore.
  * I used fairing and kaniko to eliminate the use of docker to build the images, so that everything can run from a notebook running inside the cluster.
  * k8s_util.py has some reusable functions for filling in user-specific details (e.g. low-level calls to K8s APIs).
* Change the mnist test to just run the notebook.
* Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable.
* Fix lint.
* Update for lint.
* A notebook to run the mnist E2E example.
  Related to: kubeflow/website#1553
  1. Use fairing to build the model.
  2. Construct the YAML spec directly in the notebook.
  3. Use the TFJob python SDK.
* Fix the ISTIO rule.
* Fix UI and serving; need to update TF Serving to match the version trained on.
* Get the IAP endpoint.
* Start writing some helper python functions for K8s.
* Commit before switching from replace to delete.
* Create a library to bulk create objects.
* Cleanup.
* Add back k8s_util.py
* Delete train.yaml; this shouldn't have been added.
* Update the notebook image.
* Refactor code into k8s_util; print out links.
* Clean up the notebook. Should be working E2E.
* Added section to get logs from Stackdriver.
* Add comment about profile.
* Latest.
* Override mnist_gcp.ipynb with mnist.ipynb
  I accidentally put my latest changes in mnist.ipynb even though that file was deleted.
* More fixes.
* Resolve some conflicts from the rebase; override with changes on remote branch.
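A minimal sketch of the f-string pattern described above, assuming a hypothetical profile namespace and training image; the TFJob fields shown are illustrative, not the notebook's actual spec:

```python
import yaml

# User-specific values the notebook substitutes into the spec.
namespace = "kubeflow-jlewi"  # hypothetical profile namespace
train_image = "gcr.io/my-project/mnist/model:latest"  # hypothetical image

# Write the resource spec with an f-string instead of a kustomize overlay.
train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {train_image}
"""

# The spec stays plain YAML, so users can read exactly what will be applied.
print(yaml.safe_load(train_spec)["metadata"]["name"])  # -> mnist-train
```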
This commit is contained in: parent b9a7719f29, commit cc93a80420
@@ -56,7 +56,10 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use "--disable=all --enable=classes
# --disable=W"
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use
#
# Kubeflow disables string-interpolation because we are starting to use f
# style strings
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use,logging-fstring-interpolation

[REPORTS]
@@ -1,5 +1,6 @@
#This container contains your model and any helper scripts specific to your model.
FROM tensorflow/tensorflow:1.7.0
# When building the image inside mnist.ipynb the base docker image will be overwritten
FROM tensorflow/tensorflow:1.15.2-py3

ADD model.py /opt/model.py
RUN chmod +x /opt/model.py
@@ -19,6 +19,8 @@
# To override variables do
# make ${TARGET} ${VAR}=${VALUE}
#
#
# TODO(jlewi): We should probably switch to Skaffold and Tekton

# IMG is the base path for images.
# Individual images will be
mnist/README.md (229 changed lines)
@@ -3,6 +3,8 @@
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [MNIST on Kubeflow](#mnist-on-kubeflow)
- [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp)
- [MNIST on other platforms](#mnist-on-other-platforms)
- [Prerequisites](#prerequisites)
  - [Deploy Kubeflow](#deploy-kubeflow)
  - [Local Setup](#local-setup)
@@ -13,21 +15,17 @@
- [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster)
- [Training your model](#training-your-model)
  - [Local storage](#local-storage)
  - [Using GCS](#using-gcs)
  - [Using S3](#using-s3)
- [Monitoring](#monitoring)
  - [Tensorboard](#tensorboard)
    - [Local storage](#local-storage-1)
    - [Using GCS](#using-gcs-1)
    - [Using S3](#using-s3-1)
  - [Deploying TensorBoard](#deploying-tensorboard)
- [Serving the model](#serving-the-model)
  - [GCS](#gcs)
  - [S3](#s3)
  - [Local storage](#local-storage-2)
- [Web Front End](#web-front-end)
  - [Connecting via port forwarding](#connecting-via-port-forwarding)
  - [Using IAP on GCP](#using-iap-on-gcp)
- [Conclusion and Next Steps](#conclusion-and-next-steps)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -37,6 +35,45 @@

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.

Follow the version of the guide that is specific to how you have deployed Kubeflow:

1. [MNIST on Kubeflow on GCP](#gcp)
1. [MNIST on other platforms](#other)

<a id=gcp></a>
# MNIST on Kubeflow on GCP

Follow these instructions to run the MNIST tutorial on GCP:

1. Follow the [GCP instructions](https://www.kubeflow.org/docs/gke/deploy/) to deploy Kubeflow with IAP

1. Launch a Jupyter notebook

   * The tutorial has been tested using the Jupyter TensorFlow 1.15 image

1. Launch a terminal in Jupyter and clone the kubeflow examples repo

   ```
   git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
   ```

   * **Tip** When you start a terminal in Jupyter, run the command `bash` to start
     a bash terminal which is much more friendly than the default shell

   * **Tip** You can change the URL from '/tree' to '/lab' to switch to using JupyterLab

1. Open the notebook `mnist/mnist_gcp.ipynb`

1. Follow the notebook to train and deploy MNIST on Kubeflow

<a id=other></a>
# MNIST on other platforms

The tutorial is currently not up to date for Kubeflow 1.0. Please check these issues:

* [kubeflow/examples#724](https://github.com/kubeflow/examples/issues/724) for AWS
* [kubeflow/examples#725](https://github.com/kubeflow/examples/issues/725) for other platforms

## Prerequisites

Before we get started there are a few requirements.
@@ -166,100 +203,6 @@ And to check the logs
kubectl logs mnist-train-local-chief-0
```


#### Using GCS

In this section we describe how to save the model to Google Cloud Storage (GCS).

Storing the model in GCS has these advantages:

* The model is readily available after the job finishes
* We can run distributed training

  * Distributed training requires a storage system accessible to all the machines

Enter the `training/GCS` directory from the `mnist` application directory:

```
cd training/GCS
```

Set an environment variable that points to your GCP project ID:

```
PROJECT=<your project id>
```

Create a bucket on GCS to store our model. The name must be unique across all GCS buckets:

```
BUCKET=distributed-$(date +%s)
gsutil mb gs://$BUCKET/
```

Give the job a different name (to distinguish it from your job which didn't use GCS):

```
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist
```

Optionally, if you want to use your custom training image, configure it as below:

```
kustomize edit set image training-image=$DOCKER_URL
```

Next we configure it to run distributed by setting the number of parameter servers and workers to use; `numPs` sets the number of parameter servers and `numWorkers` sets the number of workers:

```
../base/definition.sh --numPs 1 --numWorkers 2
```

Set the training parameters, such as training steps, batch size and learning rate:

```
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
```

Now we need to configure the parameters that tell the code to save the model to GCS:

```
MODEL_PATH=my-model
kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET}/${MODEL_PATH}
kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export
```

Build a yaml file for the `TFJob` specification based on your kustomize config:

```
kustomize build . > mnist-training.yaml
```

Then, in `mnist-training.yaml`, search for this line: `namespace: kubeflow`.
Edit it to **replace `kubeflow` with the name of your user profile namespace**,
which will probably have the form `kubeflow-<username>`. (If you're not sure what this
namespace is called, you can find it in the top menubar of the Kubeflow Central
Dashboard.)

After you've updated the namespace, apply the `TFJob` specification to the
Kubeflow cluster:

```
kubectl apply -f mnist-training.yaml
```

You can then check the job status:

```
kubectl get tfjobs -n <your-user-namespace> -o yaml mnist-train-dist
```

And to check the logs:

```
kubectl logs -n <your-user-namespace> -f mnist-train-dist-chief-0
```

#### Using S3

To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as Kubernetes secrets and the variables will be passed in as environment variables. Modify the below values to suit your environment.
@@ -426,27 +369,6 @@ kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/m
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt
```


#### Using GCS

Enter the `monitoring/GCS` directory from the `mnist` application directory:

```
cd monitoring/GCS
```

Configure TensorBoard to point to your model location:

```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR}
```

Assuming you followed the directions above, if you used GCS you can use the following value:

```
LOGDIR=gs://${BUCKET}/${MODEL_PATH}
```

#### Using S3

Enter the `monitoring/S3` directory from the `mnist` application directory.
@@ -551,64 +473,6 @@ The model code will export the model in saved model format which is suitable for
To serve the model follow the instructions below. The instructions vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomizations as a convenience for setting relevant environment variables.


### GCS

Here we show how to serve the model when it is stored on GCS. This assumes that when you trained the model you set `exportDir` to a GCS URI; if not, you can always copy it to GCS using `gsutil`.

Check that a model was exported:

```
EXPORT_DIR=gs://${BUCKET}/${MODEL_PATH}/export
gsutil ls -r ${EXPORT_DIR}
```

The output should look something like:

```
${EXPORT_DIR}/1547100373/saved_model.pb
${EXPORT_DIR}/1547100373/variables/:
${EXPORT_DIR}/1547100373/variables/
${EXPORT_DIR}/1547100373/variables/variables.data-00000-of-00001
${EXPORT_DIR}/1547100373/variables/variables.index
```

The number `1547100373` is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you save a model to the same location as a previous location.

Enter the `serving/GCS` directory from the `mnist` application directory:

```
cd serving/GCS
```

Set a different name for the TF Serving deployment:

```
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-gcs-dist
```

Set your model path:

```
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${EXPORT_DIR}
```

Deploy it, and run a service to make the deployment accessible to other pods in the cluster:

```
kustomize build . | kubectl apply -f -
```

You can check the deployment by running:

```
kubectl describe deployments mnist-gcs-dist
```

The service should make the `mnist-gcs-dist` deployment accessible over port 9000:

```
kubectl describe service mnist-gcs-dist
```

### S3

We can also serve the model when it is stored on S3. This assumes that when you trained the model you set `exportDir` to an S3
@@ -799,16 +663,7 @@ POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{
kubectl port-forward ${POD_NAME} 8080:5000
```

You should now be able to open up the web app at your localhost: [Local Storage](http://localhost:8080), [GCS](http://localhost:8080/?addr=mnist-gcs-dist), or [S3](http://localhost:8080/?addr=mnist-s3-serving).


### Using IAP on GCP

If you are using GCP and have set up IAP then you can access the web UI at:

```
https://${DEPLOYMENT}.endpoints.${PROJECT}.cloud.goog/${NAMESPACE}/mnist/
```

You should now be able to open up the web app at your localhost: [Local Storage](http://localhost:8080) or [S3](http://localhost:8080/?addr=mnist-s3-serving).

## Conclusion and Next Steps
@@ -0,0 +1,147 @@
"""Some utilities for working with Kubernetes.

TODO: These should probably be replaced by functions in fairing.
"""
import logging
import re
import yaml

from kubernetes import client as k8s_client
from kubernetes.client import rest as k8s_rest

def camel_to_snake(name):
  # e.g. "VirtualService" -> "virtual_service"
  name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
  return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

K8S_CREATE = "K8S_CREATE"
K8S_REPLACE = "K8S_REPLACE"
K8S_CREATE_OR_REPLACE = "K8S_CREATE_OR_REPLACE"

def _get_result_name(result):
  # For custom objects the result is a dict but for other objects
  # it's a python class.
  if isinstance(result, dict):
    result_name = result["metadata"]["name"]
    result_namespace = result["metadata"]["namespace"]
  else:
    result_name = result.metadata.name
    result_namespace = result.metadata.namespace

  return result_namespace, result_name

def apply_k8s_specs(specs, mode=K8S_CREATE): # pylint: disable=too-many-branches,too-many-statements
  """Run apply on the provided Kubernetes specs.

  Args:
    specs: A list of strings or dicts providing the YAML specs to
      apply.

    mode: (Optional) Mode indicates how the resources should be created.
      K8S_CREATE - Use the create verb. Works with generateName.
      K8S_REPLACE - Issue a delete of existing resources before doing a create.
      K8S_CREATE_OR_REPLACE - Try to create an object; if it already exists,
        replace it.
  """
  # TODO(jlewi): How should we handle patching existing updates?

  results = []

  if mode not in [K8S_CREATE, K8S_CREATE_OR_REPLACE, K8S_REPLACE]:
    raise ValueError(f"Unknown mode {mode}")

  for s in specs:
    spec = s
    if not isinstance(spec, dict):
      spec = yaml.safe_load(spec)

    name = spec["metadata"]["name"]
    namespace = spec["metadata"]["namespace"]
    kind = spec["kind"]
    kind_snake = camel_to_snake(kind)

    plural = spec["kind"].lower() + "s"

    result = None
    if "/" not in spec["apiVersion"]:
      group = None
      version = spec["apiVersion"]
    else:
      group, version = spec["apiVersion"].split("/", 1)

    if group is None or group.lower() == "apps":
      if group is None:
        api = k8s_client.CoreV1Api()
      else:
        api = k8s_client.AppsV1Api()

      create_method_name = f"create_namespaced_{kind_snake}"
      create_method_args = [namespace, spec]

      # Replace is implemented as a delete of the existing resource
      # followed by a create.
      replace_method_name = f"delete_namespaced_{kind_snake}"
      replace_method_args = [name, namespace]

    else:
      api = k8s_client.CustomObjectsApi()

      create_method_name = "create_namespaced_custom_object"
      create_method_args = [group, version, namespace, plural, spec]

      delete_options = k8s_client.V1DeleteOptions()
      replace_method_name = "delete_namespaced_custom_object"
      replace_method_args = [group, version, namespace, plural, name, delete_options]

    create_method = getattr(api, create_method_name)
    replace_method = getattr(api, replace_method_name)

    if mode in [K8S_CREATE, K8S_CREATE_OR_REPLACE]:
      try:
        result = create_method(*create_method_args)
        result_namespace, result_name = _get_result_name(result)
        logging.info(f"Created {kind} {result_namespace}.{result_name}")
        results.append(result)
        continue
      except k8s_rest.ApiException as e:
        # 409 (conflict) indicates the resource already exists.
        if e.status == 409 and mode == K8S_CREATE_OR_REPLACE:
          pass
        else:
          raise

    # Using replace didn't work for virtualservices so we explicitly delete
    # and then issue a create.
    result = replace_method(*replace_method_args)
    logging.info(f"Deleted {kind} {namespace}.{name}")

    # Now recreate it.
    result = create_method(*create_method_args)
    result_namespace, result_name = _get_result_name(result)
    logging.info(f"Created {kind} {result_namespace}.{result_name}")
    results.append(result)

  return results

def get_iap_endpoint():
  """Return the URL of the IAP endpoint."""
  extensions = k8s_client.ExtensionsV1beta1Api()
  kf_ingress = None

  try:
    kf_ingress = extensions.read_namespaced_ingress("envoy-ingress", "istio-system")
  except k8s_rest.ApiException as e:
    if e.status == 403:
      logging.warning("The service account doesn't have sufficient privileges "
                      "to get the istio-system ingress. "
                      "You will have to manually enter the Kubeflow endpoint. "
                      "To make this function work ask someone with cluster "
                      "privileges to create an appropriate "
                      "clusterrolebinding by running a command.\n"
                      "kubectl create --namespace=istio-system rolebinding "
                      "--clusterrole=kubeflow-view "
                      "--serviceaccount=${NAMESPACE}:default-editor "
                      "${NAMESPACE}-istio-view")
      return ""

    raise

  return f"https://{kf_ingress.spec.rules[0].host}"
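A minimal sketch of driving `apply_k8s_specs` from the notebook, assuming it runs in-cluster and that k8s_util.py is importable from the working directory; the ConfigMap contents are illustrative:

```python
from kubernetes import config as k8s_config

import k8s_util

# Inside a Kubeflow notebook the pod's service account provides credentials.
k8s_config.load_incluster_config()

# Any YAML string (or dict) with apiVersion, kind, and metadata works.
configmap_spec = """apiVersion: v1
kind: ConfigMap
metadata:
  name: mnist-demo-config
  namespace: kubeflow-user  # hypothetical profile namespace
data:
  trainSteps: "200"
"""

# Create the object; if it already exists, delete and recreate it.
results = k8s_util.apply_k8s_specs([configmap_spec],
                                   mode=k8s_util.K8S_CREATE_OR_REPLACE)
print(k8s_util.get_iap_endpoint())  # "" if the service account lacks access
```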
(File diff suppressed because it is too large.)
@@ -1,8 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../base

configurations:
- params.yaml
@@ -0,0 +1,57 @@
"""Some routines to set up the notebook.

This is separated out from util.py because this module installs some of the pip packages
that util depends on.
"""

import sys
import logging
import os
import subprocess
from importlib import reload

from pathlib import Path

TF_OPERATOR_COMMIT = "9238906"

def notebook_setup():
  # Install the SDK
  logging.basicConfig(format='%(message)s')
  logging.getLogger().setLevel(logging.INFO)

  home = str(Path.home())

  logging.info("pip installing requirements.txt")
  subprocess.check_call(["pip3", "install", "--user", "-r", "requirements.txt"])

  clone_dir = os.path.join(home, "git_tf-operator")
  if not os.path.exists(clone_dir):
    logging.info("Cloning the tf-operator repo")
    subprocess.check_call(["git", "clone", "https://github.com/kubeflow/tf-operator.git",
                           clone_dir])
  logging.info(f"Checkout kubeflow/tf-operator @{TF_OPERATOR_COMMIT}")
  subprocess.check_call(["git", "checkout", TF_OPERATOR_COMMIT], cwd=clone_dir)

  logging.info("Configure docker credentials")
  subprocess.check_call(["gcloud", "auth", "configure-docker", "--quiet"])
  if os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
    logging.info("Activating service account")
    subprocess.check_call(["gcloud", "auth", "activate-service-account",
                           "--key-file=" +
                           os.getenv("GOOGLE_APPLICATION_CREDENTIALS"),
                           "--quiet"])
  # Installing the python packages locally doesn't appear to have them automatically
  # added to the path, so we need to manually add the directory.
  local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")
  tf_operator_py_path = os.path.join(clone_dir, "sdk/python")

  for p in [local_py_path, tf_operator_py_path]:
    if p not in sys.path:
      logging.info("Adding %s to python path", p)
      # Insert at front because we want to override any installed packages
      sys.path.insert(0, p)

  # Force a reload of kubeflow; since kubeflow is a multi-namespace module,
  # if we've loaded up some new kubeflow subpackages we need to force a reload to see them.
  import kubeflow
  reload(kubeflow)
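A notebook would then bootstrap itself in its first cell with something like this sketch (assuming the file above is importable as `notebook_setup` from the notebook's working directory):

```python
# First cell of the notebook: install dependencies and fix up sys.path
# before importing anything that depends on them.
import notebook_setup

notebook_setup.notebook_setup()

# Imports of fairing and the TFJob SDK should happen only after this runs.
```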
@@ -0,0 +1,2 @@
git+git://github.com/kubeflow/fairing.git@9b0d4ed4796ba349ac6067bbd802ff1d6454d015
retrying==1.3.3

@@ -1,5 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../base
@@ -1,11 +1,11 @@
kind: ConfigMap
apiVersion: v1
metadata:
  name: mnist-deploy-config
  namespace: kubeflow
data:
  monitoring_config.txt: |-
    prometheus_config: {
      enable: true,
      path: "/monitoring/prometheus/metrics"
    }
kind: ConfigMap
metadata:
  name: mnist-deploy-config
  namespace: kubeflow
@@ -1,116 +0,0 @@
import os
import pytest

def pytest_addoption(parser):

  parser.addoption(
    "--tfjob_name", help="Name for the TFJob.",
    type=str, default="mnist-test-" + os.getenv('BUILD_ID'))

  parser.addoption(
    "--namespace", help=("The namespace to run in. This should correspond to "
                         "a namespace associated with a Kubeflow namespace."),
    type=str, default="kubeflow-kf-ci-v1-user")

  parser.addoption(
    "--repos", help="The repos to checkout; leave blank to use defaults",
    type=str, default="")

  parser.addoption(
    "--trainer_image", help="TFJob training image",
    type=str, default="gcr.io/kubeflow-examples/mnist/model:build-" + os.getenv('BUILD_ID'))

  parser.addoption(
    "--train_steps", help="train steps for mnist testing",
    type=str, default="200")

  parser.addoption(
    "--batch_size", help="batch size for mnist training",
    type=str, default="100")

  parser.addoption(
    "--learning_rate", help="mnist learning rate",
    type=str, default="0.01")

  parser.addoption(
    "--num_ps", help="The number of parameter servers",
    type=str, default="1")

  parser.addoption(
    "--num_workers", help="The number of workers",
    type=str, default="2")

  parser.addoption(
    "--model_dir", help="Path for model saving",
    type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID'))

  parser.addoption(
    "--export_dir", help="Path for model exporting",
    type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID'))

  parser.addoption(
    "--deploy_name", help="Name for the service deployment",
    type=str, default="mnist-test-" + os.getenv('BUILD_ID'))

  parser.addoption(
    "--master", action="store", default="", help="IP address of GKE master")

  parser.addoption(
    "--service", action="store", default="mnist-test-" + os.getenv('BUILD_ID'),
    help="The name of the mnist K8s service")

@pytest.fixture
def master(request):
  return request.config.getoption("--master")

@pytest.fixture
def namespace(request):
  return request.config.getoption("--namespace")

@pytest.fixture
def service(request):
  return request.config.getoption("--service")

@pytest.fixture
def tfjob_name(request):
  return request.config.getoption("--tfjob_name")

@pytest.fixture
def repos(request):
  return request.config.getoption("--repos")

@pytest.fixture
def trainer_image(request):
  return request.config.getoption("--trainer_image")

@pytest.fixture
def train_steps(request):
  return request.config.getoption("--train_steps")

@pytest.fixture
def batch_size(request):
  return request.config.getoption("--batch_size")

@pytest.fixture
def learning_rate(request):
  return request.config.getoption("--learning_rate")

@pytest.fixture
def num_ps(request):
  return request.config.getoption("--num_ps")

@pytest.fixture
def num_workers(request):
  return request.config.getoption("--num_workers")

@pytest.fixture
def model_dir(request):
  return request.config.getoption("--model_dir")

@pytest.fixture
def export_dir(request):
  return request.config.getoption("--export_dir")

@pytest.fixture
def deploy_name(request):
  return request.config.getoption("--deploy_name")
@@ -1,84 +0,0 @@
"""Test deploying the mnist model.

This file tests that we can deploy the model.

It is an integration test as it depends on having access to
a Kubeflow deployment to deploy on. It also depends on having a model.

Python Path Requirements:
  kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
    * Provides utilities for testing

Manually running the test
  pytest deploy_test.py \
    name=mnist-deploy-test-${BUILD_ID} \
    namespace=${namespace} \
    modelBasePath=${modelDir} \
    exportDir=${modelDir} \

"""

import logging
import os
import pytest

from kubernetes.config import kube_config
from kubernetes import client as k8s_client

from kubeflow.testing import util


def test_deploy(record_xml_attribute, deploy_name, namespace, model_dir, export_dir):

  util.set_pytest_junit(record_xml_attribute, "test_deploy")

  util.maybe_activate_service_account()

  app_dir = os.path.join(os.path.dirname(__file__), "../serving/GCS")
  app_dir = os.path.abspath(app_dir)
  logging.info("--app_dir not set defaulting to: %s", app_dir)

  # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue:
  # https://github.com/kubernetes-sigs/kustomize/issues/1295
  kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \
           'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64'
  util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir)
  util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir)

  # TODO (@jinchihe): kubectl needs to be upgraded to 1.14.0 due to below issue:
  # Invalid object doesn't have additional properties ...
  kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \
           'release/v1.14.0/bin/linux/amd64/kubectl'
  util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir)
  util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir)

  # Configure custom parameters using kustomize
  configmap = 'mnist-map-serving'
  util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir)
  util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
            '--from-literal=name' + '=' + deploy_name], cwd=app_dir)

  util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
            '--from-literal=modelBasePath=' + model_dir], cwd=app_dir)
  util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
            '--from-literal=exportDir=' + export_dir], cwd=app_dir)

  # Apply the components
  util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir)
  util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir)

  kube_config.load_kube_config()
  api_client = k8s_client.ApiClient()
  util.wait_for_deployment(api_client, namespace, deploy_name, timeout_minutes=4)

  # We don't delete the resources. We depend on the namespace being
  # garbage collected.

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  pytest.main()
@@ -1,123 +0,0 @@
"""Test mnist_client.

This file tests that we can send predictions to the model
using REST.

It is an integration test as it depends on having access to
a deployed model.

We use the pytest framework because
1. It can output results in junit format for prow/gubernator
2. It has good support for configuring tests using command line arguments
   (https://docs.pytest.org/en/latest/example/simple.html)

Python Path Requirements:
  kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
    * Provides utilities for testing

Manually running the test
1. Configure your KUBECONFIG file to point to the desired cluster
"""

import json
import logging
import os
import subprocess
import requests
from retrying import retry
import six

from kubernetes.config import kube_config
from kubernetes import client as k8s_client

import pytest

from kubeflow.testing import util

def is_retryable_result(r):
  if r.status_code == requests.codes.NOT_FOUND:
    message = "Request to {0} returned 404".format(r.url)
    logging.error(message)
    return True

  return False

@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000,
       stop_max_delay=5*60*1000,
       retry_on_result=is_retryable_result)
def send_request(*args, **kwargs):
  # We don't use util.run because that ends up including the access token
  # in the logs
  token = subprocess.check_output(["gcloud", "auth", "print-access-token"])
  if six.PY3 and hasattr(token, "decode"):
    token = token.decode()
  token = token.strip()

  headers = {
    "Authorization": "Bearer " + token,
  }

  if "headers" not in kwargs:
    kwargs["headers"] = {}

  kwargs["headers"].update(headers)

  r = requests.post(*args, **kwargs)

  return r

@pytest.mark.xfail
def test_predict(master, namespace, service):
  app_credentials = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
  if app_credentials:
    print("Activate service account")
    util.run(["gcloud", "auth", "activate-service-account",
              "--key-file=" + app_credentials])

  if not master:
    print("--master not set; using kubeconfig")
    # util.load_kube_config appears to hang on python3
    kube_config.load_kube_config()
    api_client = k8s_client.ApiClient()
    host = api_client.configuration.host
    print("host={0}".format(host))
    master = host.rsplit("/", 1)[-1]

  this_dir = os.path.dirname(__file__)
  test_data = os.path.join(this_dir, "test_data", "instances.json")
  with open(test_data) as hf:
    instances = json.load(hf)

  # We proxy the request through the APIServer so that we can connect
  # from outside the cluster.
  url = ("https://{master}/api/v1/namespaces/{namespace}/services/{service}:8500"
         "/proxy/v1/models/mnist:predict").format(
           master=master, namespace=namespace, service=service)
  logging.info("Request: %s", url)
  r = send_request(url, json=instances, verify=False)

  if r.status_code != requests.codes.OK:
    msg = "Request to {0} exited with status code: {1} and content: {2}".format(
      url, r.status_code, r.content)
    logging.error(msg)
    raise RuntimeError(msg)

  content = r.content
  if six.PY3 and hasattr(content, "decode"):
    content = content.decode()
  result = json.loads(content)
  assert len(result["predictions"]) == 1
  predictions = result["predictions"][0]
  assert "classes" in predictions
  assert "predictions" in predictions
  assert len(predictions["predictions"]) == 10
  logging.info("URL %s returned; %s", url, content)

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  pytest.main()
@@ -1,792 +0,0 @@
{
  "instances": [
    {
      "x": [
        (784 flattened 28x28 grayscale pixel values in [0.0, 1.0] for a single MNIST digit; elided)
      ]
    }
  ]
}
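If this test data ever needs to be regenerated, a sketch of producing an equivalent payload (assumes numpy is available; the input tensor name `x` matches the deleted file above):

```python
import json

import numpy as np

# Build a TF Serving REST payload shaped like the deleted instances.json:
# one instance whose "x" field is a flattened 28x28 grayscale image in [0, 1].
image = np.zeros((28, 28), dtype=np.float32)  # stand-in for a real digit
payload = {"instances": [{"x": image.reshape(-1).tolist()}]}

with open("instances.json", "w") as hf:
  json.dump(payload, hf, indent=2)
```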
@@ -1,142 +0,0 @@
"""Test training using TFJob.

This file tests that we can submit the job
and that the job runs to completion.

It is an integration test as it depends on having access to
a Kubeflow deployment to submit the TFJob to.

Python Path Requirements:
  kubeflow/tf-operator/py - https://github.com/kubeflow/tf-operator
    * Provides utilities for testing TFJobs
  kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
    * Provides utilities for testing

Manually running the test
  pytest tfjob_test.py \
    tfjob_name=tfjobs-test-${BUILD_ID} \
    namespace=${test_namespace} \
    trainer_image=${training_image} \
    train_steps=10 \
    batch_size=10 \
    learning_rate=0.01 \
    num_ps=1 \
    num_workers=2 \
    model_dir=${model_dir} \
    export_dir=${model_dir} \

"""

import json
import logging
import os
import pytest

from kubernetes.config import kube_config
from kubernetes import client as k8s_client
from kubeflow.tf_operator import tf_job_client #pylint: disable=no-name-in-module

from kubeflow.testing import util

def test_training(record_xml_attribute, tfjob_name, namespace, trainer_image, num_ps, #pylint: disable=too-many-arguments
                  num_workers, train_steps, batch_size, learning_rate, model_dir, export_dir):

  util.set_pytest_junit(record_xml_attribute, "test_mnist")

  util.maybe_activate_service_account()

  app_dir = os.path.join(os.path.dirname(__file__), "../training/GCS")
  app_dir = os.path.abspath(app_dir)
  logging.info("--app_dir not set defaulting to: %s", app_dir)

  # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue:
  # https://github.com/kubernetes-sigs/kustomize/issues/1295
  kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \
           'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64'
  util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir)
  util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir)

  # TODO (@jinchihe): kubectl needs to be upgraded to 1.14.0 due to below issue:
  # Invalid object doesn't have additional properties ...
  kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \
           'release/v1.14.0/bin/linux/amd64/kubectl'
  util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir)
  util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir)

  # Configure custom parameters using kustomize
  util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir)
  util.run(['kustomize', 'edit', 'set', 'image', 'training-image=' + trainer_image], cwd=app_dir)

  util.run(['../base/definition.sh', '--numPs', num_ps], cwd=app_dir)
  util.run(['../base/definition.sh', '--numWorkers', num_workers], cwd=app_dir)

  training_config = {
    "name": tfjob_name,
    "trainSteps": train_steps,
    "batchSize": batch_size,
    "learningRate": learning_rate,
    "modelDir": model_dir,
    "exportDir": export_dir,
  }

  configmap = 'mnist-map-training'
  for key, value in training_config.items():
    util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
              '--from-literal=' + key + '=' + value], cwd=app_dir)

  # Create the TFJob.
  util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir)
  util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir)
  logging.info("Created job %s in namespace %s", tfjob_name, namespace)

  kube_config.load_kube_config()
  api_client = k8s_client.ApiClient()

  # Wait for the job to complete.
  logging.info("Waiting for job to finish.")
  results = tf_job_client.wait_for_job(
    api_client,
    namespace,
    tfjob_name,
    status_callback=tf_job_client.log_status)
  logging.info("Final TFJob:\n %s", json.dumps(results, indent=2))

  # Check for errors creating pods and services. Can potentially
  # help debug failed test runs.
  creation_failures = tf_job_client.get_creation_failures_from_tfjob(
    api_client, namespace, results)
  if creation_failures:
    logging.warning(creation_failures)

  if not tf_job_client.job_succeeded(results):
    failure = "Job {0} in namespace {1} in status {2}".format( # pylint: disable=attribute-defined-outside-init
      tfjob_name, namespace, results.get("status", {}))
    logging.error(failure)

    # If the TFJob failed, print out the pod logs for debugging.
    pod_names = tf_job_client.get_pod_names(
      api_client, namespace, tfjob_name)
    logging.info("The pod names:\n %s", pod_names)

    core_api = k8s_client.CoreV1Api(api_client)

    for pod in pod_names:
      logging.info("Getting logs of Pod %s.", pod)
      try:
        pod_logs = core_api.read_namespaced_pod_log(pod, namespace)
        logging.info("The logs of Pod %s log:\n %s", pod, pod_logs)
      except k8s_client.rest.ApiException as e:
        logging.info("Exception when calling CoreV1Api->read_namespaced_pod_log: %s\n", e)
    return

  # We don't delete the jobs. We rely on TTLSecondsAfterFinished
  # to delete old jobs. Leaving jobs around should make it
  # easier to debug.

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  pytest.main()
@@ -1,11 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../base

images:
- name: training-image
  newName: gcr.io/kubeflow-examples/mnist/model
  newTag: build-1202842504546750464
@@ -27,7 +27,7 @@ from tensorflow.examples.tutorials.mnist import input_data
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

from PIL import Image
from PIL import Image  # pylint: disable=wrong-import-order


def get_prediction(image, server_host='127.0.0.1', server_port=9000,
@@ -55,6 +55,7 @@ workflows:
  - postsubmit
  include_dirs:
  - xgboost_synthetic/*
  - mnist/*
  - py/kubeflow/examples/create_e2e_workflow.py

# E2E test for various notebooks

@@ -67,17 +68,7 @@ workflows:
  - postsubmit
  include_dirs:
  - xgboost_synthetic/*
  - mnist/*
  - py/kubeflow/examples/create_e2e_workflow.py
  kwargs:
    cluster_pattern: kf-v1-(?!n\d\d)

# E2E test for mnist example
- py_func: kubeflow.examples.create_e2e_workflow.create_workflow
  name: mnist
  job_types:
  - periodic
  - presubmit
  - postsubmit
  include_dirs:
  - mnist/*
  - py/kubeflow/examples/create_e2e_workflow.py
@@ -261,82 +261,23 @@ class Builder:
                                   "xgboost_synthetic",
                                   "testing")

  def _build_tests_dag_mnist(self):
    """Build the dag for the set of tests to run mnist TFJob tests."""

    task_template = self._build_task_template()

    # ***************************************************************************
    # Build mnist image
    step_name = "build-image"
    train_image_base = "gcr.io/kubeflow-examples/mnist"
    train_image_tag = "build-" + PROW_DICT['BUILD_ID']
    command = ["/bin/bash",
               "-c",
               "gcloud auth activate-service-account --key-file=$(GOOGLE_APPLICATION_CREDENTIALS) \
               && make build-gcb IMG=" + train_image_base + " TAG=" + train_image_tag,
              ]
    # Test mnist
    step_name = "mnist"
    command = ["pytest", "mnist_gcp_test.py",
               # Increase the log level so that info level log statements show up.
               "--log-cli-level=info",
               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
               # Test timeout in seconds.
               "--timeout=1800",
               "--junitxml=" + self.artifacts_dir + "/junit_mnist-gcp-test.xml",
              ]

    dependencies = []
    build_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
                                  command, dependencies)
    build_step["container"]["workingDir"] = os.path.join(self.src_dir, "mnist")

    # ***************************************************************************
    # Test mnist TFJob
    step_name = "tfjob-test"
    # Use python2 to run the test to avoid dependency errors.
    command = ["python2", "-m", "pytest", "tfjob_test.py",
               # Increase the log level so that info level log statements show up.
               "--log-cli-level=info",
               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
               # Test timeout in seconds.
               "--timeout=1800",
               "--junitxml=" + self.artifacts_dir + "/junit_tfjob-test.xml",
              ]

    dependencies = [build_step['name']]
    tfjob_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
                                  command, dependencies)
    tfjob_step["container"]["workingDir"] = os.path.join(self.src_dir,
                                                         "mnist",
                                                         "testing")

    # ***************************************************************************
    # Test mnist deploy
    step_name = "deploy-test"
    command = ["python2", "-m", "pytest", "deploy_test.py",
               # Increase the log level so that info level log statements show up.
               "--log-cli-level=info",
               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
               # Test timeout in seconds.
               "--timeout=1800",
               "--junitxml=" + self.artifacts_dir + "/junit_deploy-test.xml",
              ]

    dependencies = [tfjob_step["name"]]
    deploy_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
                                   command, dependencies)
    deploy_step["container"]["workingDir"] = os.path.join(self.src_dir,
                                                          "mnist",
                                                          "testing")

    # ***************************************************************************
    # Test mnist predict
    step_name = "predict-test"
    command = ["pytest", "predict_test.py",
               # Increase the log level so that info level log statements show up.
               "--log-cli-level=info",
               "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
               # Test timeout in seconds.
               "--timeout=1800",
               "--junitxml=" + self.artifacts_dir + "/junit_predict-test.xml",
              ]

    dependencies = [deploy_step["name"]]
    predict_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
                                    command, dependencies)
    predict_step["container"]["workingDir"] = os.path.join(self.src_dir,
                                                           "mnist",
                                                           "testing")
    mnist_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
                                  command, dependencies)
    mnist_step["container"]["workingDir"] = os.path.join(
      self.src_dir, "py/kubeflow/examples/notebook_tests")

  def _build_exit_dag(self):
    """Build the exit handler dag"""
@@ -432,8 +373,6 @@ class Builder:
    # Run a dag of tests
    if self.test_target_name.startswith("notebooks"):
      self._build_tests_dag_notebooks()
    elif self.test_target_name == "mnist":
      self._build_tests_dag_mnist()
    else:
      raise RuntimeError('Invalid test_target_name ' + self.test_target_name)
@@ -0,0 +1,34 @@
import pytest

def pytest_addoption(parser):
  parser.addoption(
    "--name", help="Name for the job. If not specified one will be created "
    "automatically.", type=str, default="")
  parser.addoption(
    "--namespace", help=("The namespace to run in. This should correspond to "
                         "a namespace associated with a Kubeflow profile."),
    type=str,
    default="kubeflow-kf-ci-v1-user")
  parser.addoption(
    "--image", help="Notebook image to use.", type=str,
    default="gcr.io/kubeflow-images-public/"
            "tensorflow-1.15.2-notebook-cpu:1.0.0")
  parser.addoption(
    "--repos", help="The repos to checkout; leave blank to use defaults.",
    type=str, default="")

@pytest.fixture
def name(request):
  return request.config.getoption("--name")

@pytest.fixture
def namespace(request):
  return request.config.getoption("--namespace")

@pytest.fixture
def image(request):
  return request.config.getoption("--image")

@pytest.fixture
def repos(request):
  return request.config.getoption("--repos")
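For reference, a sketch of how the options registered above are passed when a
test module is invoked directly; the module name nb_test.py and the option
values are illustrative, not part of this commit:

import pytest

# Run a notebook test with the custom options defined in conftest.py.
# "nb_test.py" and the option values are placeholders for illustration.
pytest.main([
    "nb_test.py",
    "--log-cli-level=info",
    "--name=",  # empty: let the test generate a name
    "--namespace=kubeflow-kf-ci-v1-user",
    "--image=gcr.io/kubeflow-images-public/"
    "tensorflow-1.15.2-notebook-cpu:1.0.0",
])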
@@ -0,0 +1,58 @@
import logging
import os
import subprocess
import tempfile

import fire

logger = logging.getLogger(__name__)

def prepare_env():
  subprocess.check_call(["pip3", "install", "-U", "papermill"])
  subprocess.check_call(["pip3", "install", "-r", "../requirements.txt"])


def execute_notebook(notebook_path, parameters=None):
  import papermill  # pylint: disable=import-error
  temp_dir = tempfile.mkdtemp()
  notebook_output_path = os.path.join(temp_dir, "out.ipynb")
  papermill.execute_notebook(notebook_path, notebook_output_path,
                             cwd=os.path.dirname(notebook_path),
                             parameters=parameters,
                             log_output=True)
  return notebook_output_path

def run_notebook_test(notebook_path, expected_messages, parameters=None):
  output_path = execute_notebook(notebook_path, parameters=parameters)
  with open(output_path, 'r') as hf:
    actual_output = hf.read()
  for expected_message in expected_messages:
    if expected_message not in actual_output:
      logger.error(actual_output)
      assert False, "Unable to find in output: " + expected_message

class NotebookExecutor:
  @staticmethod
  def test(notebook_path):
    """Test a notebook.

    Args:
      notebook_path: Absolute path of the notebook.
    """
    prepare_env()
    FILE_DIR = os.path.dirname(__file__)

    EXPECTED_MSGS = [
      "Finished upload of",
      "Model export success: mockup-model.dat",
      "Pod started running True",
      "Cluster endpoint: http:",
    ]
    run_notebook_test(notebook_path, EXPECTED_MSGS)

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(message)s|%(pathname)s|%(lineno)d|'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )

  fire.Fire(NotebookExecutor)
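A sketch of how this Fire-based executor is invoked; the module path matches
the command in the job spec below, and the notebook path is illustrative:

import subprocess

# Invoke the executor the same way the papermill job does; the notebook
# path is illustrative.
subprocess.check_call([
    "python3", "-m", "kubeflow.examples.notebook_tests.execute_notebook",
    "test", "/src/kubeflow/examples/mnist/mnist_gcp.ipynb",
])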
@@ -0,0 +1,51 @@
# A batch job to run a notebook using papermill.
# TODO(jlewi): We should switch to using Tekton
apiVersion: batch/v1
kind: Job
metadata:
  name: nb-test
  labels:
    app: nb-test
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        # TODO(jlewi): Do we really want to disable sidecar injection
        # in the test? Would it be better to use istio to mimic what happens
        # in notebooks?
        sidecar.istio.io/inject: "false"
      labels:
        app: nb-test
    spec:
      restartPolicy: Never
      securityContext:
        runAsUser: 0
      initContainers:
      # This init container checks out the source code.
      - command:
        - /usr/local/bin/checkout_repos.sh
        - --repos=kubeflow/examples@$(CHECK_TAG)
        - --src_dir=/src
        name: checkout
        image: gcr.io/kubeflow-ci/test-worker:v20190802-c6f9140-e3b0c4
        volumeMounts:
        - mountPath: /src
          name: src
      containers:
      - env:
        - name: PYTHONPATH
          value: /src/kubeflow/examples/py/
        name: executing-notebooks
        image: execute-image
        command: ["python3", "-m",
                  "kubeflow.examples.notebook_tests.execute_notebook",
                  "test", "/src/kubeflow/examples/mnist/mnist_gcp.ipynb"]
        workingDir: /src/kubeflow/examples/py/kubeflow/examples/notebook_tests
        volumeMounts:
        - mountPath: /src
          name: src
      serviceAccount: default-editor
      volumes:
      - name: src
        emptyDir: {}
@@ -0,0 +1,29 @@
import logging
import os

import pytest

from kubeflow.examples.notebook_tests import nb_test_util
from kubeflow.testing import util

# TODO(jlewi): This test is new; there's some work to be done to make it
# reliable. For now we mark it as expected to fail on presubmits only,
# because expected failures don't show up in test grid and we still want
# signal in postsubmits and periodics.
@pytest.mark.xfail(os.getenv("JOB_TYPE") == "presubmit", reason="Flaky")
def test_mnist_gcp(record_xml_attribute, name, namespace, # pylint: disable=too-many-branches,too-many-statements
                   repos, image):
  '''Generate the Job and submit it.'''
  util.set_pytest_junit(record_xml_attribute, "test_mnist_gcp")
  nb_test_util.run_papermill_job(name, namespace, repos, image)


if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  pytest.main()
@@ -0,0 +1,80 @@
"""Some utilities for running notebook tests."""

import datetime
import logging
import uuid
import yaml

from kubernetes import client as k8s_client
from kubeflow.testing import argo_build_util
from kubeflow.testing import util

def run_papermill_job(name, namespace, # pylint: disable=too-many-branches,too-many-statements
                      repos, image):
  """Generate a K8s job to run a notebook using papermill.

  Args:
    name: Name for the K8s job.
    namespace: The namespace where the job should run.
    repos: (Optional) Which repos to checkout; if not specified, tries
      to infer them from the Prow environment variables.
    image: The docker image to run the notebook in.
  """

  util.maybe_activate_service_account()

  with open("job.yaml") as hf:
    job = yaml.safe_load(hf)

  # We need to checkout the correct version of the code in presubmits and
  # postsubmits, so we check the Prow environment variables to get the
  # appropriate values.
  # See
  # https://github.com/kubernetes/test-infra/blob/45246b09ed105698aa8fb928b7736d14480def29/prow/jobs.md#job-environment-variables
  if not repos:
    repos = argo_build_util.get_repo_from_prow_env()

  logging.info("Repos set to %s", repos)
  job["spec"]["template"]["spec"]["initContainers"][0]["command"] = [
    "/usr/local/bin/checkout_repos.sh",
    "--repos=" + repos,
    "--src_dir=/src",
    "--depth=all",
  ]
  job["spec"]["template"]["spec"]["containers"][0]["image"] = image
  util.load_kube_config(persist_config=False)

  if name:
    job["metadata"]["name"] = name
  else:
    job["metadata"]["name"] = ("notebook-test-" +
                               datetime.datetime.now().strftime("%H%M%S")
                               + "-" + uuid.uuid4().hex[0:3])
  name = job["metadata"]["name"]

  job["metadata"]["namespace"] = namespace

  # Create an API client object to talk to the K8s master.
  api_client = k8s_client.ApiClient()
  batch_api = k8s_client.BatchV1Api(api_client)

  logging.info("Creating job:\n%s", yaml.dump(job))
  actual_job = batch_api.create_namespaced_job(job["metadata"]["namespace"],
                                               job)
  logging.info("Created job %s.%s:\n%s", namespace, name,
               yaml.safe_dump(actual_job.to_dict()))

  final_job = util.wait_for_job(api_client, namespace, name,
                                timeout=datetime.timedelta(minutes=30))

  logging.info("Final job:\n%s", yaml.safe_dump(final_job.to_dict()))

  if not final_job.status.conditions:
    raise RuntimeError("Job {0}.{1} did not complete".format(namespace, name))

  last_condition = final_job.status.conditions[-1]

  if last_condition.type not in ["Complete"]:
    logging.error("Job didn't complete successfully")
    raise RuntimeError("Job {0}.{1} failed".format(namespace, name))
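A hedged example of calling the helper above directly; the argument values
are illustrative, and an empty name or repos falls back to the defaults
computed inside the function:

from kubeflow.examples.notebook_tests import nb_test_util

# Values are illustrative; an empty name is auto-generated and empty repos
# are inferred from the Prow environment.
nb_test_util.run_papermill_job(
    name="",
    namespace="kubeflow-kf-ci-v1-user",
    repos="kubeflow/examples@HEAD",
    image="gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0",
)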
@@ -0,0 +1,252 @@
{
  // TODO(https://github.com/ksonnet/ksonnet/issues/222): Taking namespace as an argument
  // is a workaround for the fact that ksonnet doesn't support automatically piping in the
  // namespace from the environment to prototypes.

  // Convert a list of two items into a map representing an environment variable.
  // TODO(jlewi): Should we move this into kubeflow/core/util.libsonnet
  listToMap:: function(v)
    {
      name: v[0],
      value: v[1],
    },

  // Function to turn a comma separated list of prow environment variables into a dictionary.
  parseEnv:: function(v)
    local pieces = std.split(v, ",");
    if v != "" && std.length(pieces) > 0 then
      std.map(
        function(i) $.listToMap(std.split(i, "=")),
        std.split(v, ",")
      )
    else [],

  parts(namespace, name):: {
    // Workflow to run the e2e test.
    e2e(prow_env, bucket):
      // mountPath is the directory where the volume to store the test data
      // should be mounted.
      local mountPath = "/mnt/" + "test-data-volume";
      // testDir is the root directory for all data for a particular test run.
      local testDir = mountPath + "/" + name;
      // outputDir is the directory to sync to GCS to contain the output for this job.
      local outputDir = testDir + "/output";
      local artifactsDir = outputDir + "/artifacts";
      local goDir = testDir + "/go";
      // Source directory where all repos should be checked out.
      local srcRootDir = testDir + "/src";
      // The directory containing the kubeflow/examples repo.
      local srcDir = srcRootDir + "/kubeflow/examples";
      local image = "gcr.io/kubeflow-ci/test-worker";
      // The name of the NFS volume claim to use for test files.
      // local nfsVolumeClaim = "kubeflow-testing";
      local nfsVolumeClaim = "nfs-external";
      // The name to use for the volume to use to contain test data.
      local dataVolume = "kubeflow-test-volume";
      local versionTag = name;
      // The directory within the kubeflow_testing submodule containing
      // py scripts to use.
      local kubeflowExamplesPy = srcDir;
      local kubeflowTestingPy = srcRootDir + "/kubeflow/testing/py";

      local project = "kubeflow-ci";
      // GKE cluster to use.
      // We need to truncate the cluster name to no more than 40 characters
      // because cluster names can be a max of 40 characters.
      // We expect the suffix of the cluster name to be unique salt.
      // We prepend a z because the cluster name must start with an alphanumeric
      // character, and if we cut the prefix we might end up starting with "-"
      // or another character that is invalid as a first character.
      local cluster =
        if std.length(name) > 40 then
          "z" + std.substr(name, std.length(name) - 39, 39)
        else
          name;
      local zone = "us-east1-d";
      local chart = srcDir + "/bin/examples-chart-0.2.1-" + versionTag + ".tgz";
      {
        // Build an Argo template to execute a particular command.
        // step_name: Name for the template.
        // command: List to pass as the container command.
        buildTemplate(step_name, command):: {
          name: step_name,
          container: {
            command: command,
            image: image,
            workingDir: srcDir,
            env: [
              {
                // Add the source directories to the python path.
                name: "PYTHONPATH",
                value: kubeflowExamplesPy + ":" + kubeflowTestingPy,
              },
              {
                // Set the GOPATH.
                name: "GOPATH",
                value: goDir,
              },
              {
                name: "GOOGLE_APPLICATION_CREDENTIALS",
                value: "/secret/gcp-credentials/key.json",
              },
              {
                name: "GIT_TOKEN",
                valueFrom: {
                  secretKeyRef: {
                    name: "github-token",
                    key: "github_token",
                  },
                },
              },
              {
                name: "EXTRA_REPOS",
                value: "kubeflow/testing@HEAD",
              },
            ] + prow_env,
            volumeMounts: [
              {
                name: dataVolume,
                mountPath: mountPath,
              },
              {
                name: "github-token",
                mountPath: "/secret/github-token",
              },
              {
                name: "gcp-credentials",
                mountPath: "/secret/gcp-credentials",
              },
            ],
          },
        },  // buildTemplate

        apiVersion: "argoproj.io/v1alpha1",
        kind: "Workflow",
        metadata: {
          name: name,
          namespace: namespace,
        },
        // TODO(jlewi): Use OnExit to run cleanup steps.
        spec: {
          entrypoint: "e2e",
          volumes: [
            {
              name: "github-token",
              secret: {
                secretName: "github-token",
              },
            },
            {
              name: "gcp-credentials",
              secret: {
                secretName: "kubeflow-testing-credentials",
              },
            },
            {
              name: dataVolume,
              persistentVolumeClaim: {
                claimName: nfsVolumeClaim,
              },
            },
          ],  // volumes
          // onExit specifies the template that should always run when the workflow completes.
          onExit: "exit-handler",
          templates: [
            {
              name: "e2e",
              steps: [
                [{
                  name: "checkout",
                  template: "checkout",
                }],
                [
                  {
                    name: "create-pr-symlink",
                    template: "create-pr-symlink",
                  },
                  // test_py_checks runs all py files matching "_test.py".
                  // This is currently commented out because the only matching tests
                  // are manual tests for some of the examples and/or they require
                  // dependencies (e.g. tensorflow) not in the generic test worker image.
                  //
                  // test_py_checks doesn't have options to exclude specific directories.
                  // Since there are no other tests we just comment it out.
                  //
                  // TODO(https://github.com/kubeflow/testing/issues/240): Modify py_test
                  // so we can exclude specific files.
                  //
                  // {
                  //   name: "py-test",
                  //   template: "py-test",
                  // },
                  // {
                  //   name: "py-lint",
                  //   template: "py-lint",
                  // },
                ],
              ],
            },
            {
              name: "exit-handler",
              steps: [
                [{
                  name: "copy-artifacts",
                  template: "copy-artifacts",
                }],
              ],
            },
            {
              name: "checkout",
              container: {
                command: [
                  "/usr/local/bin/checkout.sh",
                  srcRootDir,
                ],
                env: prow_env + [{
                  name: "EXTRA_REPOS",
                  value: "kubeflow/testing@HEAD",
                }],
                image: image,
                volumeMounts: [
                  {
                    name: dataVolume,
                    mountPath: mountPath,
                  },
                ],
              },
            },  // checkout
            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-test", [
              "python",
              "-m",
              "kubeflow.testing.test_py_checks",
              "--artifacts_dir=" + artifactsDir,
              "--src_dir=" + srcDir,
            ]),  // py-test
            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-lint", [
              "python",
              "-m",
              "kubeflow.testing.test_py_lint",
              "--artifacts_dir=" + artifactsDir,
              "--src_dir=" + srcDir,
            ]),  // py-lint
            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("create-pr-symlink", [
              "python",
              "-m",
              "kubeflow.testing.prow_artifacts",
              "--artifacts_dir=" + outputDir,
              "create_pr_symlink",
              "--bucket=" + bucket,
            ]),  // create-pr-symlink
            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("copy-artifacts", [
              "python",
              "-m",
              "kubeflow.testing.prow_artifacts",
              "--artifacts_dir=" + outputDir,
              "copy_artifacts",
              "--bucket=" + bucket,
            ]),  // copy-artifacts
          ],  // templates
        },
      },  // e2e
  },  // parts
}
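For intuition, a Python equivalent of the listToMap/parseEnv helpers in the
jsonnet workflow above (a sketch, not code from the repo):

def parse_env(v):
    """Turn "A=1,B=2" into [{"name": "A", "value": "1"}, ...], like parseEnv."""
    if not v:
        return []
    return [{"name": name, "value": value}
            for name, value in (piece.split("=", 1) for piece in v.split(","))]

# Example:
# parse_env("REPO_OWNER=kubeflow,REPO_NAME=examples")
# -> [{'name': 'REPO_OWNER', 'value': 'kubeflow'},
#     {'name': 'REPO_NAME', 'value': 'examples'}]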
@@ -1257,7 +1257,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5rc1"
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
@@ -0,0 +1 @@
TODO: We should reuse/share logic in py/kubeflow/examples/notebook_tests