* A notebook to run the mnist E2E example on GCP.
This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile-created namespace in order to have the required service accounts
* See kubeflow/examples#713
* Running inside a notebook running on Kubeflow should ensure the user is running inside an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
* A short term fix was to not include the ISTIO side car
* In the future we can add an appropriate ISTIO rbac policy
* Using a notebook allows us to eliminate the use of kustomize
* This resolves kubeflow/examples#713 which required people to use an old version of kustomize
* Rather than using kustomize we can use Python f-strings to write the YAML specs and then easily substitute in user-specific values
* This should be more informative; it avoids introducing kustomize and
users can see the resource specs.
* I've opted to make the notebook GCP specific. I think it's less confusing to users to have separate notebooks focused on specific platforms rather than having one notebook with a lot of caveats about what to do under different conditions
* I've deleted the kustomize overlays for GCS since we don't want users to
use them anymore
* I used fairing and kaniko to eliminate the use of docker to build the images
so that everything can run from a notebook running inside the cluster.
* k8s_util.py has some reusable functions that handle some details for users (e.g. low-level calls to K8s APIs)
* Change the mnist test to just run the notebook
* Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable
* Fix lint.
* Update for lint.
* A notebook to run the mnist E2E example.
Related to: kubeflow/website#1553
* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob python SDK.
* Fix the ISTIO rule.
* Fix UI and serving; need to update TF serving to match version trained on.
* Get the IAP endpoint.
* Start writing some helper python functions for K8s.
* Commit before switching from replace to delete.
* Create a library to bulk create objects.
* Cleanup.
* Add back k8s_util.py
* Delete train.yaml; this shouldn't have been added.
* update the notebook image.
* Refactor code into k8s_util; print out links.
* Clean up the notebook. Should be working E2E.
* Added section to get logs from stackdriver.
* Add comment about profile.
* Latest.
* Override mnist_gcp.ipynb with mnist.ipynb
I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.
* More fixes.
* Resolve some conflicts from the rebase; override with changes on remote branch.
Repository contents:

- data/
- front/
- monitoring/
- serving/
- training/
- web-ui/
- .gitignore
- Dockerfile.kustomize
- Dockerfile.model
- Makefile
- README.md
- image_build.jsonnet
- k8s_util.py
- kustomize-entrypoint.sh
- mnist_gcp.ipynb
- model.py
- notebook_setup.py
- requirements.txt
README.md
- MNIST on Kubeflow
- MNIST on Kubeflow on GCP
- MNIST on other platforms
MNIST on Kubeflow
This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.
Follow the version of the guide that is specific to how you have deployed Kubeflow.
MNIST on Kubeflow on GCP
Follow these instructions to run the MNIST tutorial on GCP:

1. Follow the GCP instructions to deploy Kubeflow with IAP.

2. Launch a Jupyter notebook.

   - The tutorial has been tested using the Jupyter TensorFlow 1.15 image.

3. Launch a terminal in Jupyter and clone the kubeflow examples repo:

   git clone https://github.com/kubeflow/examples.git git_kubeflow-examples

   - Tip: when you start a terminal in Jupyter, run the command bash to start a bash shell, which is much friendlier than the default shell.
   - Tip: you can change the URL from '/tree' to '/lab' to switch to using JupyterLab.

4. Open the notebook mnist/mnist_gcp.ipynb.

5. Follow the notebook to train and deploy MNIST on Kubeflow.
MNIST on other platforms
The tutorial is currently not up to date for Kubeflow 1.0. Please check the following issues:
- kubeflow/examples#724 for AWS
- kubeflow/examples#725 for other platforms
Prerequisites
Before we get started there are a few requirements.
Deploy Kubeflow
Follow the Getting Started Guide to deploy Kubeflow.
Local Setup
You also need the following command line tools:
Note: kustomize v2.0.3 is recommended due to a problem in kustomize v2.1.0.
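If you need to check or install the pinned version, here is a minimal sketch; the release asset name is an assumption based on the kustomize release naming convention, so adjust it for your OS and architecture:

# Check the currently installed version.
kustomize version

# Download the v2.0.3 binary (asset name assumed; Linux amd64 shown).
curl -L -o kustomize https://github.com/kubernetes-sigs/kustomize/releases/download/v2.0.3/kustomize_2.0.3_linux_amd64
chmod +x kustomize
sudo mv kustomize /usr/local/bin/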
GCP Setup
If you are using GCP, you need to enable Workload Identity before executing the steps below.
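As a rough sketch of what enabling Workload Identity on an existing cluster looks like (the cluster, zone, and project names are placeholders, and the gcloud flag has changed names across releases, so check the current GCP docs):

# Enable Workload Identity on an existing cluster (placeholder names).
gcloud container clusters update <your-cluster> \
  --zone=<your-zone> \
  --workload-pool=<your-project>.svc.id.goog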
Modifying existing examples
Many examples online use models that are unconfigurable, or don't work well in distributed mode. We will modify one of these examples to be better suited for distributed training and model serving.
Prepare model
There is a delta between existing distributed mnist examples and what's needed to run well as a TFJob.
Basically, we must:
- Add options in order to make the model configurable.
- Use tf.estimator.train_and_evaluate to enable model exporting and serving.
- Define serving signatures for model serving.
The resulting model is model.py.
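Before building an image, you can sanity-check the options locally; this assumes TensorFlow is installed in your environment and that model.py exposes its configuration through standard command line flags:

# List the configurable options (assumes standard flag parsing in model.py).
python model.py --help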
(Optional) Build and push model image.
With our code ready, we will now build and push the docker image; alternatively, you can skip this step and use the existing image gcr.io/kubeflow-ci/mnist/model:latest.
DOCKER_URL=docker.io/reponame/mytfmodel:tag # Put your docker registry here
docker build . --no-cache -f Dockerfile.model -t ${DOCKER_URL}
docker push ${DOCKER_URL}
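If your registry requires authentication, log docker in before pushing; a sketch assuming the gcloud CLI for GCR, or plain docker login for Docker Hub:

# Authenticate docker to Google Container Registry (assumes gcloud is installed).
gcloud auth configure-docker
# Or, for Docker Hub:
docker login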
Preparing your Kubernetes Cluster
With our data and workloads ready, the cluster must now be prepared. We will deploy the TF Operator and Argo to help manage our training job.
In the following instructions we will install our required components to a single namespace. For these instructions we will assume the chosen namespace is kubeflow.
kubectl config set-context $(kubectl config current-context) --namespace=kubeflow
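Note that this assumes the kubeflow namespace already exists; if it does not, a minimal sketch to create it:

# Create the namespace only if it is missing.
kubectl get namespace kubeflow || kubectl create namespace kubeflow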
Training your model
Local storage
Let's start by running the training job on Kubeflow and storing the model in local storage.
First, refer to the document to create a Persistent Volume (PV) and Persistent Volume Claim (PVC); the PVC name (${PVC_NAME}) will be used by the training and serving pods in the local-mode steps below.
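As a minimal sketch, assuming your cluster has a default StorageClass with dynamic provisioning (so only the claim is needed), you could create the PVC like this; the name mnist-pvc and the size are illustrative:

# Create a PVC for the model and checkpoints (name and size illustrative).
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mnist-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
export PVC_NAME=mnist-pvc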
Enter the training/local directory from the mnist application directory.
cd training/local
Give the job a name to indicate it is running locally
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-local
Point the job at your custom training image
kustomize edit set image training-image=$DOCKER_URL
Optionally, configure it to run distributed by setting the number of parameter servers and workers to use (numPs sets the number of parameter servers and numWorkers the number of workers):
../base/definition.sh --numPs 1 --numWorkers 2
Set the training parameters, such as training steps, batch size and learning rate.
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
To store the exported model and checkpoints, configure the PVC name and mount point.
kustomize edit add configmap mnist-map-training --from-literal=pvcName=${PVC_NAME}
kustomize edit add configmap mnist-map-training --from-literal=pvcMountPath=/mnt
Now we need to configure the parameters that tell the code to save the model to the PVC.
kustomize edit add configmap mnist-map-training --from-literal=modelDir=/mnt
kustomize edit add configmap mnist-map-training --from-literal=exportDir=/mnt/export
You can now submit the job
kustomize build . | kubectl apply -f -
And you can check the job
kubectl get tfjobs mnist-train-local -o yaml
And to check the logs
kubectl logs mnist-train-local-chief-0
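To block until the job finishes rather than polling by hand, you can wait on the job's status condition; a sketch assuming the TFJob controller reports a Succeeded condition:

# Wait for the TFJob to succeed (the condition name is an assumption about
# the TFJob status fields; adjust if your operator version differs).
kubectl wait --for=condition=Succeeded tfjob/mnist-train-local --timeout=15m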
Using S3
To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as Kubernetes secrets and the variables will be passed in as environment variables. Modify the values below to suit your environment.
Enter the training/S3 directory from the mnist application directory.
cd training/S3
Give the job a different name (to distinguish it from the job that didn't use S3)
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist
Optionally, if you want to use your custom training image, configure it as below.
kustomize edit set image training-image=$DOCKER_URL
Next we configure it to run distributed by setting the number of parameter servers and workers to use (numPs sets the number of parameter servers and numWorkers the number of workers):
../base/definition.sh --numPs 1 --numWorkers 2
Set the training parameters, such as training steps, batch size and learning rate.
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
In order to write to S3 we need to supply the TensorFlow code with AWS credentials. We also need to set various environment variables configuring access to S3.
- Define environment variables corresponding to your S3 settings; these will be used in subsequent steps:

  export S3_ENDPOINT=s3.us-west-2.amazonaws.com  # replace with your S3 endpoint in host:port format, e.g. minio:9000
  export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} # use http instead of https for default minio installs
  export AWS_ACCESS_KEY_ID=xxxxx
  export AWS_SECRET_ACCESS_KEY=xxxxx
  export AWS_REGION=us-west-2
  export BUCKET_NAME=mybucket
  export S3_USE_HTTPS=1  # set to 0 for default minio installs
  export S3_VERIFY_SSL=1 # set to 0 for default minio installs
  export S3_MODEL_PATH_URI=s3://${BUCKET_NAME}/model
  export S3_MODEL_EXPORT_URI=s3://${BUCKET_NAME}/export
- Create a K8s secret containing your AWS credentials:

  kustomize edit add secret aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
    --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
- Pass the secrets as environment variables into the pod:

  kustomize edit add configmap mnist-map-training --from-literal=awsAccessKeyIDName=awsAccessKeyID
  kustomize edit add configmap mnist-map-training --from-literal=awsSecretAccessKeyName=awsSecretAccessKey
- Next we need to set a number of S3-related environment variables so that TensorFlow knows how to talk to S3:

  kustomize edit add configmap mnist-map-training --from-literal=S3_ENDPOINT=${S3_ENDPOINT}
  kustomize edit add configmap mnist-map-training --from-literal=AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}
  kustomize edit add configmap mnist-map-training --from-literal=AWS_REGION=${AWS_REGION}
  kustomize edit add configmap mnist-map-training --from-literal=BUCKET_NAME=${BUCKET_NAME}
  kustomize edit add configmap mnist-map-training --from-literal=S3_USE_HTTPS=${S3_USE_HTTPS}
  kustomize edit add configmap mnist-map-training --from-literal=S3_VERIFY_SSL=${S3_VERIFY_SSL}
  kustomize edit add configmap mnist-map-training --from-literal=modelDir=${S3_MODEL_PATH_URI}
  kustomize edit add configmap mnist-map-training --from-literal=exportDir=${S3_MODEL_EXPORT_URI}
- If we look at the spec for our job we can see that the environment variables related to S3 are set:

  kustomize build .

  apiVersion: kubeflow.org/v1beta2
  kind: TFJob
  metadata:
    ...
  spec:
    tfReplicaSpecs:
      Chief:
        replicas: 1
        template:
          spec:
            containers:
            - command:
              ..
              env:
              ...
              - name: S3_ENDPOINT
                value: s3.us-west-2.amazonaws.com
              - name: AWS_ENDPOINT_URL
                value: https://s3.us-west-2.amazonaws.com
              - name: AWS_REGION
                value: us-west-2
              - name: BUCKET_NAME
                value: mybucket
              - name: S3_USE_HTTPS
                value: "1"
              - name: S3_VERIFY_SSL
                value: "1"
              - name: AWS_ACCESS_KEY_ID
                valueFrom:
                  secretKeyRef:
                    key: awsAccessKeyID
                    name: aws-creds-somevalue
              - name: AWS_SECRET_ACCESS_KEY
                valueFrom:
                  secretKeyRef:
                    key: awsSecretAccessKey
                    name: aws-creds-somevalue
              ...
You can now submit the job
kustomize build . | kubectl apply -f -
And you can check the job
kubectl get tfjobs mnist-train-dist -o yaml
And to check the logs
kubectl logs -f mnist-train-dist-chief-0
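While the job runs you can confirm that checkpoints are landing in your bucket; this assumes the AWS CLI is installed and picks up the credentials exported above:

# List checkpoint files as the job writes them.
aws s3 ls ${S3_MODEL_PATH_URI} --recursive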
Monitoring
There are various ways to monitor the workflow/training job. In addition to using kubectl to query for the status of pods, some basic dashboards are also available.
Tensorboard
Local storage
Enter the monitoring/local directory from the mnist application directory.
cd monitoring/local
Configure the PVC name and mount point, and set the log directory:
kustomize edit add configmap mnist-map-monitoring --from-literal=pvcName=${PVC_NAME}
kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/mnt
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt
Using S3
Enter the monitoring/S3 directory from the mnist application directory.
cd monitoring/S3
Assuming you followed the directions above for S3, you can use the following value:
LOGDIR=${S3_MODEL_PATH_URI}
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR}
You need to point TensorBoard to the AWS credentials to access the S3 bucket containing the model.
- Create a K8s secret containing your AWS credentials:

  kustomize edit add secret aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
    --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
- Pass the secrets as environment variables into the pod:

  kustomize edit add configmap mnist-map-monitoring --from-literal=awsAccessKeyIDName=awsAccessKeyID
  kustomize edit add configmap mnist-map-monitoring --from-literal=awsSecretAccessKeyName=awsSecretAccessKey
- Next we need to set a number of S3-related environment variables so that TensorBoard knows how to talk to S3:

  kustomize edit add configmap mnist-map-monitoring --from-literal=S3_ENDPOINT=${S3_ENDPOINT}
  kustomize edit add configmap mnist-map-monitoring --from-literal=AWS_ENDPOINT_URL=${AWS_ENDPOINT_URL}
  kustomize edit add configmap mnist-map-monitoring --from-literal=AWS_REGION=${AWS_REGION}
  kustomize edit add configmap mnist-map-monitoring --from-literal=BUCKET_NAME=${BUCKET_NAME}
  kustomize edit add configmap mnist-map-monitoring --from-literal=S3_USE_HTTPS=${S3_USE_HTTPS}
  kustomize edit add configmap mnist-map-monitoring --from-literal=S3_VERIFY_SSL=${S3_VERIFY_SSL}
- If we look at the spec for the TensorBoard deployment we can see that the environment variables related to S3 are set:

  kustomize build .

  ...
  spec:
    containers:
    - command:
      ..
      env:
      ...
      - name: S3_ENDPOINT
        value: s3.us-west-2.amazonaws.com
      - name: AWS_ENDPOINT_URL
        value: https://s3.us-west-2.amazonaws.com
      - name: AWS_REGION
        value: us-west-2
      - name: BUCKET_NAME
        value: mybucket
      - name: S3_USE_HTTPS
        value: "1"
      - name: S3_VERIFY_SSL
        value: "1"
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            key: awsAccessKeyID
            name: aws-creds-somevalue
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            key: awsSecretAccessKey
            name: aws-creds-somevalue
      ...
Deploying TensorBoard
Now you can deploy TensorBoard
kustomize build . | kubectl apply -f -
To access TensorBoard using port-forwarding
kubectl port-forward service/tensorboard-tb 8090:80
TensorBoard can now be accessed at http://127.0.0.1:8090.
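If the page does not load, check that the deployment finished rolling out first; this assumes the deployment shares the tensorboard-tb name with the service shown above:

# Wait for the TensorBoard deployment to become available
# (deployment name assumed to match the service name).
kubectl rollout status deployment/tensorboard-tb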
Serving the model
The model code will export the model in the SavedModel format, which is suitable for serving with TensorFlow Serving.
To serve the model follow the instructions below. The instructions vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomizations as a convenience for setting relevant environment variables.
S3
We can also serve the model when it is stored on S3. This assumes that when you trained the model you set exportDir to an S3 URI; if not, you can always copy it to S3 using the AWS CLI.
Assuming you followed the directions above, you should have set the following environment variables that will be used in this section:
echo ${S3_MODEL_EXPORT_URI}
echo ${AWS_REGION}
echo ${S3_ENDPOINT}
echo ${S3_USE_HTTPS}
echo ${S3_VERIFY_SSL}
Check that a model was exported to S3:
aws s3 ls ${S3_MODEL_EXPORT_URI} --recursive
The output should look something like
${S3_MODEL_EXPORT_URI}/1547100373/saved_model.pb
${S3_MODEL_EXPORT_URI}/1547100373/variables/
${S3_MODEL_EXPORT_URI}/1547100373/variables/variables.data-00000-of-00001
${S3_MODEL_EXPORT_URI}/1547100373/variables/variables.index
The number 1547100373 is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you export a model to the same location as a previous run.
Enter the serving/S3 folder from the mnist application directory.
cd serving/S3
Set a different name for the TF Serving deployment:
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-s3-serving
Create a K8s secret containing your AWS credentials
kustomize edit add secret aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
--from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
Enable serving from S3 by configuring the following kustomize parameters using the environment variables from above:
kustomize edit add configmap mnist-map-serving --from-literal=s3Enable=1 #This needs to be true for S3 connection to work
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${S3_MODEL_EXPORT_URI}/
kustomize edit add configmap mnist-map-serving --from-literal=S3_ENDPOINT=${S3_ENDPOINT}
kustomize edit add configmap mnist-map-serving --from-literal=AWS_REGION=${AWS_REGION}
kustomize edit add configmap mnist-map-serving --from-literal=S3_USE_HTTPS=${S3_USE_HTTPS}
kustomize edit add configmap mnist-map-serving --from-literal=S3_VERIFY_SSL=${S3_VERIFY_SSL}
kustomize edit add configmap mnist-map-serving --from-literal=AWS_ACCESS_KEY_ID=awsAccessKeyID
kustomize edit add configmap mnist-map-serving --from-literal=AWS_SECRET_ACCESS_KEY=awsSecretAccessKey
If we look at the spec for the TF Serving deployment we can see that the environment variables related to S3 are set:
kustomize build .

...
spec:
  containers:
  - command:
    ..
    env:
    ...
    - name: modelBasePath
      value: s3://mybucket/export/
    - name: s3Enable
      value: "1"
    - name: S3_ENDPOINT
      value: s3.us-west-2.amazonaws.com
    - name: AWS_REGION
      value: us-west-2
    - name: S3_USE_HTTPS
      value: "1"
    - name: S3_VERIFY_SSL
      value: "1"
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: awsAccessKeyID
          name: aws-creds-somevalue
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: awsSecretAccessKey
          name: aws-creds-somevalue
...
Deploy it, and run a service to make the deployment accessible to other pods in the cluster:
kustomize build . | kubectl apply -f -
You can check the deployment by running
kubectl describe deployments mnist-s3-serving
The service should make the mnist-s3-serving deployment accessible over port 9000
kubectl describe service mnist-s3-serving
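Before pointing clients at the server, it can help to wait for the rollout to complete; a minimal sketch:

# Block until the model server deployment is fully rolled out.
kubectl rollout status deployment/mnist-s3-serving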
Local storage
This section shows how to serve the model that was stored in the PVC during training.
Enter the serving/local directory from the mnist application directory.
cd serving/local
Set a different name for the TF Serving deployment:
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-service-local
Mount the PVC; by default the PVC will be mounted at /mnt in the pod.
kustomize edit add configmap mnist-map-serving --from-literal=pvcName=${PVC_NAME}
kustomize edit add configmap mnist-map-serving --from-literal=pvcMountPath=/mnt
Configure the file path for the exported model.
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=/mnt/export
Deploy it, and run a service to make the deployment accessible to other pods in the cluster.
kustomize build . | kubectl apply -f -
You can check the deployment by running
kubectl describe deployments mnist-service-local
The service should make the mnist-service-local deployment accessible over port 9000.
kubectl describe service mnist-service-local
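To reach the model server from outside the cluster for debugging, you can port-forward the service; note that port 9000 is TensorFlow Serving's gRPC port, so you will need a gRPC client rather than curl to actually query it:

# Forward the serving port to localhost (gRPC, not HTTP).
kubectl port-forward service/mnist-service-local 9000:9000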
Web Front End
The example comes with a simple web front end that can be used with your model.
Enter the front directory from the mnist application directory.
cd front
To deploy the web front end:
kustomize build . | kubectl apply -f -
Connecting via port forwarding
To connect to the web app via port-forwarding
POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
kubectl port-forward ${POD_NAME} 8080:5000
You should now be able to open the web app at http://localhost:8080. The front end works the same way whether the model was served from local storage or S3.
Conclusion and Next Steps
This is an example of what your machine learning workflow can look like. Feel free to play with the tunables and see if you can increase your model's accuracy (increasing trainSteps can go a long way).