mirror of https://github.com/kubeflow/examples.git
Update training to use Kubeflow 0.4 and add testing. (#465)
* Update training to use Kubeflow 0.4 and add testing.
* To support testing we need to create a ksonnet template to train
the model so we can easily substitute in different parameters during
training.
* We create a ksonnet component for just training; we don't use Argo.
This makes the example much simpler.
* To support S3 we add a generic ksonnet parameter to take environment
variables as a comma separated list of variables. This should make it
easy for users to set the environment variables needed to talk to S3.
This is compatible with the existing Argo workflow which supports S3.
* By default the training job runs non-distributed; this is because running
distributed requires a shared filesystem (e.g. S3/GCS/NFS).
* Update the mnist workflow to correctly build the images.
* We didn't update the workflow in the previous example to actually
build the correct images.
* Update the workflow to run the tfjob_test
* Related to #460 E2E test for mnist.
* Add a parameter to specify a secret that can be used to mount
a secret such as the GCP service account key.
* Update the README with instructions for GCS and S3.
* Remove the instructions about Argo; the Argo workflow is outdated.
Using Argo adds complexity to the example and the thinking is to remove
that to provide a simpler example and to mirror the pytorch example.
* Add a TOC to the README
* Update prerequisite instructions.
* Delete instructions for installing Kubeflow; just link to the
getting started guide.
* Argo CLI should no longer be needed.
* GitHub token shouldn't be needed; I think that was only needed
for ksonnet to pull the registry.
* Fix instructions; access keys shouldn't be stored as ksonnet parameters
as these will get checked into source control.
This commit is contained in:
parent 4dda73afbf
commit ef108dbbcc

mnist/README.md (566 lines changed)

@@ -1,51 +1,43 @@

# Training MNIST using Kubeflow, S3, and Argo.

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model. We will be using Argo to manage the workflow, Tensorflow's S3 support for saving model training info, Tensorboard to visualize the training, and Kubeflow to deploy the Tensorflow operator and serve the model.

- [Training MNIST](#training-mnist)
  - [Prerequisites](#prerequisites)
    - [Kubernetes Cluster Environment](#kubernetes-cluster-environment)
    - [Local Setup](#local-setup)
  - [Modifying existing examples](#modifying-existing-examples)
    - [Prepare model](#prepare-model)
    - [Build and push model image.](#build-and-push-model-image)
  - [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster)
  - [Training your model](#training-your-model)
    - [Local storage](#local-storage)
    - [Using GCS](#using-gcs)
    - [Using S3](#using-s3)
  - [Monitoring](#monitoring)
    - [Tensorboard](#tensorboard)
  - [Using Tensorflow serving](#using-tensorflow-serving)
  - [Conclusion and Next Steps](#conclusion-and-next-steps)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

# Training MNIST

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.

## Prerequisites

Before we get started there are a few requirements.

### Kubernetes Cluster Environment
### Deploy Kubeflow

Your cluster must:

- Be at least version 1.9
- Have access to an S3-compatible object store ([Amazon S3](https://aws.amazon.com/s3/), [Google Storage](https://cloud.google.com/storage/docs/interoperability), [Minio](https://www.minio.io/kubernetes.html))
- Contain 3 nodes of at least 8 cores and 16 GB of RAM.

If using GKE, the following will provision a cluster with the required features:

```
export CLOUDSDK_CONTAINER_USE_CLIENT_CERTIFICATE=True
gcloud alpha container clusters create ${USER} --enable-kubernetes-alpha --machine-type=n1-standard-8 --num-nodes=3 --disk-size=200 --zone=us-west1-a --cluster-version=1.9.3-gke.0 --image-type=UBUNTU
```

If using Azure, the following will provision a cluster with the required features, [using the az cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest):

```
# Create a resource group
az group create -n kubeflowrg -l eastus
# Deploy the cluster
az aks create -n kubeflowaks -g kubeflowrg -l eastus -k 1.9.6 -c 3 -s Standard_NC6
# Authenticate into the cluster
az aks get-credentials -n kubeflowaks -g kubeflowrg
```

NOTE: You must be a Kubernetes admin to follow this guide. If you are not an admin, please contact your local cluster administrator for a client cert, or credentials to pass into the following commands:

```
$ kubectl config set-credentials <username> --username=<admin_username> --password=<admin_password>
$ kubectl config set-context <context_name> --cluster=<cluster_name> --user=<username> --namespace=<namespace>
$ kubectl config use-context <context_name>
```

Follow the [Getting Started Guide](https://www.kubeflow.org/docs/started/getting-started/) to deploy Kubeflow.

### Local Setup

You also need the following command line tools:

- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- [argo](https://github.com/argoproj/argo/blob/master/demo.md#1-download-argo)
- [ksonnet](https://ksonnet.io/#get-started)

To run the client at the end of the example, you must have [requirements.txt](requirements.txt) installed in your active python environment.

@@ -54,11 +46,7 @@ To run the client at the end of the example, you must have [requirements.txt](re

```
pip install -r requirements.txt
```
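
If you prefer isolation, a minimal sketch using Python's built-in `venv` module would be (the environment name is illustrative):

```
# Create and activate a throwaway virtual environment, then install.
python3 -m venv mnist-env
source mnist-env/bin/activate
pip install -r requirements.txt
```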

NOTE: These instructions rely on Github, and may cause issues if behind a firewall with many Github users. Make sure you [generate and export this token](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/):

```
export GITHUB_TOKEN=xxxxxxxx
```

NOTE: These instructions rely on Github, and may cause issues if behind a firewall with many Github users.

## Modifying existing examples

@@ -93,154 +81,389 @@ With our data and workloads ready, now the cluster must be prepared. We will be

In the following instructions we will install our required components to a single namespace. For these instructions we will assume the chosen namespace is `tfworkflow`:

### Deploying Tensorflow Operator and Argo.
### Training your model

We are using the Tensorflow operator to automate the deployment of our distributed model training, and Argo to create the overall training pipeline. The easiest way to install these components on your Kubernetes cluster is by using Kubeflow's ksonnet prototypes.

#### Local storage

Let's start by running the training job on Kubeflow and storing the model in a directory local to the pod, e.g. `/tmp`.
This is useful as a smoke test to ensure everything works. Since `/tmp` is not a filesystem external to the container, all data
is lost once the job finishes. So to make the model available after the job finishes we will need to use an external filesystem
like GCS or S3 as discussed in the next section.

```
NAMESPACE=tfworkflow
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

ks registry add kubeflow github.com/kubeflow/kubeflow/tree/v0.2.4/kubeflow

ks pkg install kubeflow/core@v0.2.4
ks pkg install kubeflow/argo

# Deploy TF Operator and Argo
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks generate argo kubeflow-argo --name=kubeflow-argo --namespace=${NAMESPACE}

ks apply default -c kubeflow-core
ks apply default -c kubeflow-argo

# Switch context for the rest of the example
kubectl config set-context $(kubectl config current-context) --namespace=${NAMESPACE}
cd -

# Create a user for our workflow
kubectl apply -f tf-user.yaml
KSENV=local
cd ks_app
ks env add ${KSENV}
```

You can check to make sure the components have deployed:
Give the job a name to indicate it is running locally

```
$ kubectl get pods -l name=tf-job-operator
NAME                              READY     STATUS    RESTARTS   AGE
tf-job-operator-78757955b-2glvj   1/1       Running   0          1m

$ kubectl get pods -l app=workflow-controller
NAME                                   READY     STATUS    RESTARTS   AGE
workflow-controller-7d8f4bc5df-4zltg   1/1       Running   0          1m

$ kubectl get crd
NAME                    AGE
tfjobs.kubeflow.org     1m
workflows.argoproj.io   1m

$ argo list
NAME   STATUS   AGE   DURATION
ks param set --env=${KSENV} train name mnist-train-local
```

### Creating secrets for our workflow and setting S3 variables.

For fetching and uploading data, our workflow requires S3 credentials and variables. These credentials will be provided as kubernetes secrets, and the variables will be passed into the workflow. Modify the below values to suit your environment.
You can now submit the job

```
export S3_ENDPOINT=s3.us-west-2.amazonaws.com  #replace with your s3 endpoint in a host:port format, e.g. minio:9000
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT}  #use http instead of https for default minio installs
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx
export AWS_REGION=us-west-2
export BUCKET_NAME=mybucket
export S3_USE_HTTPS=1  #set to 0 for default minio installs
export S3_VERIFY_SSL=1  #set to 0 for default minio installs

kubectl create secret generic aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
 --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
ks apply ${KSENV} -c train
```
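
To quickly inspect the submitted job and any events it has generated, a sketch (relies on the TFJob CRD installed with Kubeflow):

```
# Describe the TFJob to see its spec, status, and recent events.
kubectl describe tfjob mnist-train-local
```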

## Defining your training workflow

This is the bulk of the work; let's walk through what is needed:

1. Train the model
1. Export the model
1. Serve the model

Now let's look at how this is represented in our [example workflow](model-train.yaml)

The argo workflow can be daunting, but basically our steps above expand as follows:

1. `get-workflow-info`: Generate and set variables for consumption in the rest of the pipeline.
1. `tensorboard`: Tensorboard is spawned, configured to watch the S3 URL for the training output.
1. `train-model`: A TFJob is spawned taking in variables such as number of workers, what path the datasets are at, which model container image, etc. The model is exported at the end.
1. `serve-model`: Optionally, the model is served.

With our workflow defined, we can now execute it.

## Submitting your training workflow

First we need to set a few variables in our workflow. Make sure to set your docker registry or remove the `IMAGE` parameters in order to use our defaults:
And you can check the job

```
DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here
export S3_DATA_URL=s3://${BUCKET_NAME}/data/mnist/
export S3_TRAIN_BASE_URL=s3://${BUCKET_NAME}/models
export JOB_NAME=myjob-$(uuidgen | cut -c -5 | tr '[:upper:]' '[:lower:]')
export TF_MODEL_IMAGE=${DOCKER_BASE_URL}/mytfmodel:1.7
export TF_WORKER=3
export MODEL_TRAIN_STEPS=200
kubectl get tfjobs -o yaml mnist-train-local
```

Next, submit your workflow.
And to check the logs

```
argo submit model-train.yaml -n ${NAMESPACE} --serviceaccount tf-user \
 -p aws-endpoint-url=${AWS_ENDPOINT_URL} \
 -p s3-endpoint=${S3_ENDPOINT} \
 -p aws-region=${AWS_REGION} \
 -p tf-model-image=${TF_MODEL_IMAGE} \
 -p s3-data-url=${S3_DATA_URL} \
 -p s3-train-base-url=${S3_TRAIN_BASE_URL} \
 -p job-name=${JOB_NAME} \
 -p tf-worker=${TF_WORKER} \
 -p model-train-steps=${MODEL_TRAIN_STEPS} \
 -p s3-use-https=${S3_USE_HTTPS} \
 -p s3-verify-ssl=${S3_VERIFY_SSL} \
 -p namespace=${NAMESPACE}
kubectl logs mnist-train-local-chief-0
```
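
If the chief pod isn't up yet, one way to watch for it (a sketch; TFJob pod names are prefixed with the job name):

```
# Watch pods until the chief appears; names are prefixed with the job name.
kubectl get pods -w | grep mnist-train-local
```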

Your training workflow should now be executing.
Storing the model in a directory inside the container isn't useful because the directory is
lost as soon as the pod is deleted.

So in the next sections we cover saving the model on a suitable filesystem like GCS or S3.

#### Using GCS

In this section we describe how to save the model to Google Cloud Storage (GCS).

Storing the model in GCS has the following advantages:

* The model is readily available after the job finishes
* We can run distributed training
  * Distributed training requires a storage system accessible to all the machines

Let's start by creating an environment to store parameters particular to writing the model to GCS
and running distributed.

You can verify and keep track of your workflow using the argo commands:
```
$ argo list
NAME                STATUS    AGE   DURATION
tf-workflow-h7hwh   Running   1h    1h
KSENV=distributed
cd ks_app
ks env add ${KSENV}
```

$ argo get tf-workflow-h7hwh
Give the job a different name (to distinguish it from your job which didn't use GCS)

```
ks param set --env=${KSENV} train name mnist-train-dist
```

Next we configure it to run distributed by setting the number of parameter servers and workers to use.

```
ks param set --env=${KSENV} train numPs 1
ks param set --env=${KSENV} train numWorkers 2
```

Now we need to configure parameters telling the code to save the model to GCS.

```
ks param set --env=${KSENV} train modelDir gs://${BUCKET}/${MODEL_PATH}
ks param set --env=${KSENV} train exportDir gs://${BUCKET}/${MODEL_PATH}/export
```
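
Here `${BUCKET}` and `${MODEL_PATH}` are placeholders you must set yourself; illustrative values (assumptions, not defaults) might be:

```
# Example values only; use a bucket your service account can write to.
BUCKET=my-kubeflow-bucket
MODEL_PATH=mnist/models/v1
```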

In order to write to GCS we need to supply the TFJob with GCP credentials. We do
this by telling our training code to use a [Google service account](https://cloud.google.com/docs/authentication/production#obtaining_and_providing_service_account_credentials_manually).

If you followed the [getting started guide for GKE](https://www.kubeflow.org/docs/started/getting-started-gke/)
then a number of steps have already been performed for you:

1. We created a Google service account named `${DEPLOYMENT}-user`

   * You can run the following command to list all service accounts in your project

     ```
     gcloud --project=${PROJECT} iam service-accounts list
     ```

1. We stored the private key for this account in a K8s secret named `user-gcp-sa`

   * To see the secrets in your cluster (see also the check after this list)

     ```
     kubectl get secrets
     ```

1. We granted this service account permission to read/write GCS buckets in this project

   * To see the IAM policy you can do

     ```
     gcloud projects get-iam-policy ${PROJECT} --format=yaml
     ```

   * The output should look like

     ```
     bindings:
     ...
     - members:
       - serviceAccount:${DEPLOYMENT}-user@${PROJECT}.iam.gserviceaccount.com
       ...
       role: roles/storage.admin
     ...
     etag: BwV_BqSmSCY=
     version: 1
     ```
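
As the extra check referenced above (a sketch; assumes the secret was created as described), you can confirm the secret holds the expected key file:

```
# The data section should list a user-gcp-sa.json key.
kubectl describe secret user-gcp-sa
```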

To use this service account we perform the following steps:

1. Mount the secret into the pod

   ```
   ks param set --env=${KSENV} train secret user-gcp-sa=/var/secrets
   ```

   * Setting this ksonnet parameter causes a volumeMount and volume to be added to your TFJob
   * To see this you can run

     ```
     ks show ${KSENV} -c train
     ```

   * The output should now include a volumeMount and volume section

     ```
     apiVersion: kubeflow.org/v1beta1
     kind: TFJob
     metadata:
       ...
     spec:
       tfReplicaSpecs:
         Chief:
           ...
           template:
             ...
             spec:
               containers:
               - command:
                 ...
                 volumeMounts:
                 - mountPath: /var/secrets
                   name: user-gcp-sa
                   readOnly: true
                 ...
               volumes:
               - name: user-gcp-sa
                 secret:
                   secretName: user-gcp-sa
               ...
     ```

1. Next we need to set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` so that our code knows
   where to look for the service account key.

   ```
   ks param set --env=${KSENV} train envVariables GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json
   ```

   * If we look at the spec for our job we can see that the environment variable `GOOGLE_APPLICATION_CREDENTIALS` is set.

     ```
     ks show ${KSENV} -c train

     apiVersion: kubeflow.org/v1beta1
     kind: TFJob
     metadata:
       ...
     spec:
       tfReplicaSpecs:
         Chief:
           replicas: 1
           template:
             spec:
               containers:
               - command:
                 ..
                 env:
                 ...
                 - name: GOOGLE_APPLICATION_CREDENTIALS
                   value: /var/secrets/user-gcp-sa.json
                 ...
               ...
           ...
     ```

You can now submit the job

```
ks apply ${KSENV} -c train
```

And you can check the job

```
kubectl get tfjobs -o yaml mnist-train-dist
```

And to check the logs

```
kubectl logs mnist-train-dist-chief-0
```
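
Once the job finishes, a quick way to confirm the exported model landed in GCS (a sketch; assumes `gsutil` is installed and authenticated):

```
# List the exported SavedModel directories.
gsutil ls gs://${BUCKET}/${MODEL_PATH}/export
```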

#### Using S3

**Note** This example isn't working on S3 yet. There is an open issue [#466](https://github.com/kubeflow/examples/issues/466)
to fix that.

To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as kubernetes secrets, and the variables will be passed in as environment variables. Modify the below values to suit your environment.

Give the job a different name (to distinguish it from your job which didn't use GCS)

```
ks param set --env=${KSENV} train name mnist-train-dist
```

Next we configure it to run distributed by setting the number of parameter servers and workers to use.

```
ks param set --env=${KSENV} train numPs 1
ks param set --env=${KSENV} train numWorkers 2
```

Now we need to configure parameters telling the code to save the model to S3.

```
ks param set --env=${KSENV} train modelDir ${S3_MODEL_PATH_URI}
ks param set --env=${KSENV} train exportDir ${S3_MODEL_EXPORT_URI}
```

In order to write to S3 we need to supply the TensorFlow code with AWS credentials; we also need to set
various environment variables configuring access to S3.

1. Define a bunch of environment variables corresponding to your S3 settings; these will be used in subsequent steps (a quick credential sanity check is sketched after these steps)

   ```
   export S3_ENDPOINT=s3.us-west-2.amazonaws.com  #replace with your s3 endpoint in a host:port format, e.g. minio:9000
   export AWS_ENDPOINT_URL=https://${S3_ENDPOINT}  #use http instead of https for default minio installs
   export AWS_ACCESS_KEY_ID=xxxxx
   export AWS_SECRET_ACCESS_KEY=xxxxx
   export AWS_REGION=us-west-2
   export BUCKET_NAME=mybucket
   export S3_USE_HTTPS=1  #set to 0 for default minio installs
   export S3_VERIFY_SSL=1  #set to 0 for default minio installs
   ```

1. Create a K8s secret containing your AWS credentials

   ```
   kubectl create secret generic aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
    --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
   ```

1. Mount the secret into the pod

   ```
   ks param set --env=${KSENV} train secret aws-creds=/var/secrets
   ```

   * Setting this ksonnet parameter causes a volumeMount and volume to be added to your TFJob
   * To see this you can run

     ```
     ks show ${KSENV} -c train
     ```

   * The output should now include a volumeMount and volume section

     ```
     apiVersion: kubeflow.org/v1beta1
     kind: TFJob
     metadata:
       ...
     spec:
       tfReplicaSpecs:
         Chief:
           ...
           template:
             ...
             spec:
               containers:
               - command:
                 ...
                 volumeMounts:
                 - mountPath: /var/secrets
                   name: aws-creds
                   readOnly: true
                 ...
               volumes:
               - name: aws-creds
                 secret:
                   secretName: aws-creds
               ...
     ```

1. Next we need to set a whole bunch of S3 related environment variables so that TensorFlow
   knows how to talk to S3

   ```
   AWSENV="S3_ENDPOINT=${S3_ENDPOINT}"
   AWSENV="${AWSENV},AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}"
   AWSENV="${AWSENV},AWS_REGION=${AWS_REGION}"
   AWSENV="${AWSENV},BUCKET_NAME=${BUCKET_NAME}"
   AWSENV="${AWSENV},S3_USE_HTTPS=${S3_USE_HTTPS}"
   AWSENV="${AWSENV},S3_VERIFY_SSL=${S3_VERIFY_SSL}"

   ks param set --env=${KSENV} train envVariables ${AWSENV}
   ```

   * If we look at the spec for our job we can see that the S3 related environment variables are set.

     ```
     ks show ${KSENV} -c train

     apiVersion: kubeflow.org/v1beta1
     kind: TFJob
     metadata:
       ...
     spec:
       tfReplicaSpecs:
         Chief:
           replicas: 1
           template:
             spec:
               containers:
               - command:
                 ..
                 env:
                 ...
                 - name: AWS_BUCKET
                   value: somebucket
                 ...
               ...
           ...
     ```
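
As the sanity check referenced in step 1 (a sketch; assumes the `aws` CLI is installed locally), you can verify the credentials and endpoint can reach the bucket before submitting the job:

```
# List the bucket via the same endpoint TensorFlow will use.
aws s3 ls s3://${BUCKET_NAME} --endpoint-url ${AWS_ENDPOINT_URL}
```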

You can now submit the job

```
ks apply ${KSENV} -c train
```

And you can check the job

```
kubectl get tfjobs -o yaml mnist-train-dist
```

And to check the logs

```
kubectl logs mnist-train-dist-chief-0
```
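
After the job completes, a similar sketch can confirm the exported model reached S3 (again assuming the `aws` CLI):

```
# List the export location configured via exportDir.
aws s3 ls ${S3_MODEL_EXPORT_URI} --endpoint-url ${AWS_ENDPOINT_URL}
```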

## Monitoring

There are various ways to monitor the workflow/training job. In addition to using `kubectl` to query for the status of `pods`, some basic dashboards are also available.

### Argo UI

The Argo UI is useful for seeing what stage your workflow is in:

```
PODNAME=$(kubectl get pod -l app=argo-ui -n ${NAMESPACE} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 8001:8001
```

You should now be able to visit [http://127.0.0.1:8001](http://127.0.0.1:8001) to see the status of your workflows.

### Tensorboard

TODO: This section needs to be updated

Tensorboard is deployed just before training starts. To connect:

```

@@ -333,36 +556,7 @@ Your model says the above number is... 7!


You can also omit `TF_MNIST_IMAGE_PATH`, and the client will pick a random number from the mnist test data. Run it repeatedly and see how your model fares!

### Disabling Serving

Model serving can be turned off by passing in `-p model-serving=false` to the `model-train.yaml` workflow. Then if you wish to serve your model after training, use the `model-deploy.yaml` workflow. Simply pass in the desired finished argo workflow as an argument:

```
WORKFLOW=<the workflowname>
argo submit model-deploy.yaml -n ${NAMESPACE} -p workflow=${WORKFLOW} --serviceaccount=tf-user
```

## Submitting new argo jobs

If you want to rerun your workflow from scratch, then you will need to provide a new `job-name` to the argo workflow. For example:

```
#We're re-using previous variables except JOB_NAME
export JOB_NAME=myawesomejob

argo submit model-train.yaml -n ${NAMESPACE} --serviceaccount tf-user \
 -p aws-endpoint-url=${AWS_ENDPOINT_URL} \
 -p s3-endpoint=${S3_ENDPOINT} \
 -p aws-region=${AWS_REGION} \
 -p tf-model-image=${TF_MODEL_IMAGE} \
 -p s3-data-url=${S3_DATA_URL} \
 -p s3-train-base-url=${S3_TRAIN_BASE_URL} \
 -p job-name=${JOB_NAME} \
 -p tf-worker=${TF_WORKER} \
 -p model-train-steps=${MODEL_TRAIN_STEPS} \
 -p namespace=${NAMESPACE}
```

## Conclusion and Next Steps

This is an example of what your machine learning pipeline can look like. Feel free to play with the tunables and see if you can increase your model's accuracy (increasing `model-train-steps` can go a long way).
This is an example of what your machine learning can look like. Feel free to play with the tunables and see if you can increase your model's accuracy (increasing `model-train-steps` can go a long way).

@@ -0,0 +1,4 @@
/lib
/.ksonnet/registries
/app.override.yaml
/.ks_environment

@@ -0,0 +1,15 @@
apiVersion: 0.3.0
environments:
  jlewi:
    destination:
      namespace: kubeflow
      server: https://35.196.210.94
    k8sVersion: v1.11.5
    path: jlewi
kind: ksonnet.io/app
name: ks_app
registries:
  incubator:
    protocol: github
    uri: github.com/ksonnet/parts/tree/master/incubator
    version: 0.0.1

@@ -0,0 +1,18 @@
{
  global: {},
  components: {
    train: {
      batchSize: 100,
      envVariables: 'GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json',
      exportDir: 'gs://kubeflow-ci_temp/mnist-jlewi',
      image: 'gcr.io/kubeflow-examples/mnist/model:v20190108-v0.2-137-g38daafa-dirty-911944',
      learningRate: '0.01',
      modelDir: 'gs://kubeflow-ci_temp/mnist-jlewi',
      name: 'mnist-train',
      numPs: 1,
      numWorkers: 2,
      secret: '',
      trainSteps: 200,
    },
  },
}
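
To inspect these values from within the ksonnet app (a sketch; run from the `ks_app` directory):

```
# Show the current parameters for the train component.
ks param list train
```
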
@@ -0,0 +1,117 @@
// Component to train a model.
//
// Parameters are used to control training
// image: Docker image to use
// modelDir: Location to write the model; this can be a local path (e.g. to a PV)
//   or it can be any filesystem URI that TF understands (e.g GCS, S3, HDFS)
// exportDir: Location to export the model
// trainSteps: Number of training steps to run
// batchSize: Batch size
// learningRate: Learning rate
// envVariables: Comma separated list of environment variables to set.
//   Use this to set environment variables needed to configure S3 access.
// numWorkers: Number of workers
// numPs: Number of parameter servers
//
local k = import "k.libsonnet";
local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components.train;

local util = import "util.libsonnet";

// The code currently uses environment variables to control the training.
local trainEnv = [
  {
    name: "TF_MODEL_DIR",
    value: params.modelDir,
  },
  {
    name: "TF_EXPORT_DIR",
    value: params.exportDir,
  },
  {
    name: "TF_TRAIN_STEPS",
    value: std.toString(params.trainSteps),
  },
  {
    name: "TF_BATCH_SIZE",
    value: std.toString(params.batchSize),
  },
  {
    name: "TF_LEARNING_RATE",
    value: std.toString(params.learningRate),
  },
];

// params.secret has the form <secret name>=<mount path>; guard against the
// empty default so indexing the split result doesn't fail.
local secretPieces = std.split(params.secret, "=");
local secretName = if std.length(secretPieces) > 0 then secretPieces[0] else "";
local secretMountPath = if std.length(secretPieces) > 1 then secretPieces[1] else "";

local replicaSpec = {
  containers: [
    {
      command: [
        "/usr/bin/python",
        "/opt/model.py",
      ],
      env: trainEnv + util.parseEnv(params.envVariables),
      image: params.image,
      name: "tensorflow",
      volumeMounts: if secretMountPath != "" then
        [
          {
            name: secretName,
            mountPath: secretMountPath,
            readOnly: true,
          },
        ] else [],
      workingDir: "/opt",
    },
  ],
  volumes:
    if secretName != "" then
      [
        {
          name: secretName,
          secret: {
            secretName: secretName,
          },
        },
      ] else [],
  restartPolicy: "OnFailure",
};


local tfjob = {
  apiVersion: "kubeflow.org/v1beta1",
  kind: "TFJob",
  metadata: {
    name: params.name,
    namespace: env.namespace,
  },
  spec: {
    tfReplicaSpecs: {
      Chief: {
        replicas: 1,
        template: {
          spec: replicaSpec,
        },
      },
      [if params.numWorkers > 0 then "Worker"]: {
        replicas: params.numWorkers,
        template: {
          spec: replicaSpec,
        },
      },
      [if params.numPs > 0 then "Ps"]: {
        replicas: params.numPs,
        template: {
          spec: replicaSpec,
        },
      },
    },
  },
};

k.core.v1.list.new([
  tfjob,
])

@@ -0,0 +1,19 @@
{
  // convert a list of two items into a map representing an environment variable
  // TODO(jlewi): Should we move this into kubeflow/core/util.libsonnet
  listToMap:: function(v)
    {
      name: v[0],
      value: v[1],
    },

  // Function to turn comma separated list of environment variables into a dictionary.
  parseEnv:: function(v)
    local pieces = std.split(v, ",");
    if v != "" && std.length(pieces) > 0 then
      std.map(
        function(i) $.listToMap(std.split(i, "=")),
        std.split(v, ",")
      )
    else [],
}
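
As a quick illustration of `parseEnv` (a sketch; assumes the `jsonnet` CLI is installed and is run from the directory containing `util.libsonnet`):

```
# Evaluate parseEnv on a sample comma separated string.
jsonnet -e '(import "util.libsonnet").parseEnv("A=1,B=2")'
# Expected output (formatting may differ):
# [ { "name": "A", "value": "1" }, { "name": "B", "value": "2" } ]
```
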
@@ -0,0 +1,4 @@
local components = std.extVar("__ksonnet/components");
components + {
  // Insert user-specified overrides here.
}

@@ -0,0 +1,2 @@
{
}

@@ -0,0 +1,9 @@
local base = import "base.libsonnet";
// uncomment if you reference ksonnet-lib
// local k = import "k.libsonnet";
// local deployment = k.apps.v1beta2.deployment;

base + {
  // Insert user-specified overrides here. For example if a component is named "nginx-deployment", you might have something like:
  // "nginx-deployment"+: deployment.mixin.metadata.withLabels({foo: "bar"})
}

@@ -0,0 +1,20 @@
local params = std.extVar('__ksonnet/params');
local globals = import 'globals.libsonnet';
local envParams = params + {
  components+: {
    "mnist-train"+: {
      envVariables: 'GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json',
    },
    train+: {
      name: 'mnist-train-dist',
      secret: 'user-gcp-sa=/var/secrets',
    },
  },
};

{
  components: {
    [x]: envParams.components[x] + globals
    for x in std.objectFields(envParams.components)
  },
}

@@ -27,6 +27,7 @@ import numpy as np
import tensorflow as tf

# Configure model options
# TODO(jlewi): Why environment variables and not command line arguments?
TF_DATA_DIR = os.getenv("TF_DATA_DIR", "/tmp/data/")
TF_MODEL_DIR = os.getenv("TF_MODEL_DIR", None)
TF_EXPORT_DIR = os.getenv("TF_EXPORT_DIR", "mnist/")

@@ -0,0 +1,99 @@

"""Test training using TFJob.
|
||||
|
||||
This file tests that we can submit the job from ksonnet
|
||||
and that the job runs to completion.
|
||||
|
||||
It is an integration test as it depends on having access to
|
||||
a Kubeflow deployment to submit the TFJob to.
|
||||
|
||||
Python Path Requirements:
|
||||
kubeflow/tf-operator/py - https://github.com/kubeflow/tf-operator
|
||||
* Provides utilities for testing TFJobs
|
||||
kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
|
||||
* Provides utilities for testing
|
||||
|
||||
Manually running the test
|
||||
1. Configure your KUBECONFIG file to point to the desired cluster
|
||||
2. Set --params=name=${NAME},namespace=${NAMESPACE}
|
||||
* name should be the name for your job
|
||||
* namespace should be the namespace to use
|
||||
3. To test a new image set the parameter image e.g
|
||||
--params=name=${NAME},namespace=${NAMESPACE},image=${IMAGE}
|
||||
4. To control how long it trains set sample_size and num_epochs
|
||||
--params=numTrainSteps=10,batchSize=10,...
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
|
||||
from kubernetes import client as k8s_client
|
||||
from py import tf_job_client
|
||||
from py import test_runner
|
||||
|
||||
from kubeflow.testing import ks_util
|
||||
from kubeflow.testing import test_util
|
||||
from kubeflow.testing import util
|
||||
|
||||
class TFJobTest(test_util.TestCase):
|
||||
def __init__(self, args):
|
||||
namespace, name, env = test_runner.parse_runtime_params(args)
|
||||
self.app_dir = args.app_dir
|
||||
|
||||
if not self.app_dir:
|
||||
self.app_dir = os.path.join(os.path.dirname(__file__), "..",
|
||||
"ks_app")
|
||||
self.app_dir = os.path.abspath(self.app_dir)
|
||||
logging.info("--app_dir not set defaulting to: %s", self.app_dir)
|
||||
|
||||
self.env = env
|
||||
self.namespace = namespace
|
||||
self.params = args.params
|
||||
self.ks_cmd = ks_util.get_ksonnet_cmd(self.app_dir)
|
||||
super(TFJobTest, self).__init__(class_name="TFJobTest", name=name)
|
||||
|
||||
def test_train(self):
|
||||
# We repeat the test multiple times.
|
||||
# This ensures that if we delete the job we can create a new job with the
|
||||
# same name.
|
||||
api_client = k8s_client.ApiClient()
|
||||
|
||||
component = "train"
|
||||
# Setup the ksonnet app
|
||||
ks_util.setup_ks_app(self.app_dir, self.env, self.namespace, component,
|
||||
self.params)
|
||||
|
||||
|
||||
# Create the TF job
|
||||
util.run([self.ks_cmd, "apply", self.env, "-c", component],
|
||||
cwd=self.app_dir)
|
||||
logging.info("Created job %s in namespaces %s", self.name, self.namespace)
|
||||
|
||||
# Wait for the job to complete.
|
||||
logging.info("Waiting for job to finish.")
|
||||
results = tf_job_client.wait_for_job(
|
||||
api_client,
|
||||
self.namespace,
|
||||
self.name,
|
||||
status_callback=tf_job_client.log_status)
|
||||
logging.info("Final TFJob:\n %s", json.dumps(results, indent=2))
|
||||
|
||||
# Check for errors creating pods and services. Can potentially
|
||||
# help debug failed test runs.
|
||||
creation_failures = tf_job_client.get_creation_failures_from_tfjob(
|
||||
api_client, self.namespace, results)
|
||||
if creation_failures:
|
||||
logging.warning(creation_failures)
|
||||
|
||||
if not tf_job_client.job_succeeded(results):
|
||||
self.failure = "Job {0} in namespace {1} in status {2}".format( # pylint: disable=attribute-defined-outside-init
|
||||
self.name, self.namespace, results.get("status", {}))
|
||||
logging.error(self.failure)
|
||||
return
|
||||
|
||||
# We don't delete the jobs. We rely on TTLSecondsAfterFinished
|
||||
# to delete old jobs. Leaving jobs around should make it
|
||||
# easier to debug.
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_runner.main(module=__name__)
|
||||
|
|
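
A manual invocation might look like the following sketch (flags follow the docstring above; the `--app_dir` value is an assumption about where the ksonnet app lives):

```
python tfjob_test.py \
  --app_dir=../ks_app \
  --params=name=mnist-test,namespace=kubeflow,numTrainSteps=10,batchSize=10
```
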
@@ -15,7 +15,16 @@ local defaultParams = {
  dataVolume: "kubeflow-test-volume",

  // Default step image:
  stepImage: "gcr.io/kubeflow-ci/test-worker:v20181017-bfeaaf5-dirty-4adcd0",
  stepImage: "gcr.io/kubeflow-ci/test-worker:v20190104-f2a1cdf-e3b0c4",

  // Which Kubeflow cluster to use for running TFJobs on.
  kfProject: "kubeflow-ci",
  kfZone: "us-east1-d",
  kfCluster: "kf-v0-4-n00",

  // The bucket where the model should be written
  // This needs to be writable by the GCP service account in the Kubeflow cluster (not the test cluster)
  modelBucket: "kubeflow-ci_temp",
};

local params = defaultParams + overrides;

@@ -56,11 +65,17 @@ local srcRootDir = testDir + "/src";
// The directory containing the kubeflow/kubeflow repo
local srcDir = srcRootDir + "/" + prowDict.REPO_OWNER + "/" + prowDict.REPO_NAME;

// These variables control where the docker images get pushed and what
// tag to use
local imageBase = "gcr.io/kubeflow-ci/github-issue-summarization";
local imageBase = "gcr.io/kubeflow-ci/mnist";
local imageTag = "build-" + prowDict["BUILD_ID"];
local trainerImage = imageBase + "/model:" + imageTag;

// Directory where model should be stored.
local modelDir = "gs://" + params.modelBucket + "/mnist/models/" + prowDict["BUILD_ID"];

// value of KUBECONFIG environment variable. This should be a full path.
local kubeConfig = testDir + "/.kube/kubeconfig";

// Build template is a template for constructing Argo step templates.
//

@@ -88,6 +103,7 @@ local buildTemplate = {
  // The directory within the kubeflow_testing submodule containing
  // py scripts to use.
  local kubeflowTestingPy = srcRootDir + "/kubeflow/testing/py",
  local tfOperatorPy = srcRootDir + "/kubeflow/tf-operator",

  // Actual template for Argo
  argoTemplate: {

@@ -101,7 +117,7 @@
  {
    // Add the source directories to the python path.
    name: "PYTHONPATH",
    value: kubeflowTestingPy,
    value: kubeflowTestingPy + ":" + tfOperatorPy,
  },
  {
    name: "GOOGLE_APPLICATION_CREDENTIALS",

@@ -115,6 +131,12 @@
        key: "github_token",
      },
    },
  },
  {
    // We use a directory in our NFS share to store our kube config.
    // This way we can configure it on a single step and reuse it on subsequent steps.
    name: "KUBECONFIG",
    value: kubeConfig,
  },
] + prowEnv + template.env_vars,
volumeMounts: [

@@ -147,7 +169,7 @@ local dagTemplates = [

    env_vars: [{
      name: "EXTRA_REPOS",
      value: "kubeflow/testing@HEAD",
      value: "kubeflow/testing@HEAD;kubeflow/tf-operator@HEAD",
    }],
  },
  dependencies: null,

@@ -186,24 +208,61 @@
        "TAG=" + imageTag,
      ]]
    ),
    workingDir: srcDir + "/github_issue_summarization",
    workingDir: srcDir + "/mnist",
  },
  dependencies: ["checkout"],
}, // build-images
{
  // Run the python test to train the model
  // Configure KUBECONFIG
  template: buildTemplate {
    name: "train-test",
    name: "get-kubeconfig",
    command: util.buildCommand([
      [
        "gcloud",
        "auth",
        "activate-service-account",
        "--key-file=${GOOGLE_APPLICATION_CREDENTIALS}",
      ],
      [
        "gcloud",
        "--project=" + params.kfProject,
        "container",
        "clusters",
        "get-credentials",
        "--zone=" + params.kfZone,
        params.kfCluster,
      ]]
    ),
    workingDir: srcDir + "/github_issue_summarization",
  },
  dependencies: ["checkout"],
}, // get-kubeconfig
{
  // Run the python test for TFJob
  template: buildTemplate {
    name: "tfjob-test",
    command: [
      "python",
      "train_test.py",
    ],
    // Use the newly built image.
    image: imageBase + "/trainer-estimator:" + imageTag,
    workingDir: "/issues",
      "tfjob_test.py",
      "--artifacts_path=" + artifactsDir,
      "--params=" + std.join(",", [
        "name=mnist-test-" + prowDict["BUILD_ID"],
        "namespace=kubeflow",
        "numTrainSteps=10",
        "batchSize=10",
        "image=" + trainerImage,
        "numPs=1",
        "numWorkers=2",
        "modelDir=" + modelDir,
        "exportDir=" + modelDir,
        "envVariables=GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json",
        "secret=user-gcp-sa=/var/secrets",
      ])],
    workingDir: srcDir + "/mnist/testing",
  },
  dependencies: ["build-images"],
}, // train-test
  dependencies: ["build-images", "get-kubeconfig"],
}, // tfjob-test
// TODO(jlewi): We should add a non-distributed test that just uses the default values.
];

// Dag defines the tasks in the graph

@@ -12,6 +12,11 @@ local envParams = params + {
    name: 'jlewi-gis-search-test-456-0105-104058',
    prow_env: 'JOB_NAME=gis-search-test,JOB_TYPE=presubmit,REPO_NAME=examples,REPO_OWNER=kubeflow,BUILD_NUMBER=0105-104058,BUILD_ID=0105-104058,PULL_NUMBER=456',
  },
  mnist+: {
    namespace: 'kubeflow-test-infra',
    name: 'jlewi-mnist-test-465-0109-050605',
    prow_env: 'JOB_NAME=mnist-test,JOB_TYPE=presubmit,REPO_NAME=examples,REPO_OWNER=kubeflow,BUILD_NUMBER=0109-050605,BUILD_ID=0109-050605,PULL_NUMBER=465',
  },
},
};