Create a notebook for mnist E2E on GCP (#723)

* A notebook to run the mnist E2E example on GCP.

This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile created namespace in order to have the required service accounts
     * See kubeflow/examples#713
     * Running the notebook on Kubeflow should ensure the user is
       running inside an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
     * A short term fix was to not include the ISTIO sidecar
     * In the future we can add an appropriate ISTIO RBAC policy

* Using a notebook allows us to eliminate the use of kustomize
  * This resolves kubeflow/examples#713 which required people to use
    an old version of kustomize

  * Rather than using kustomize we can use python f style strings to
    write the YAML specs and then easily substitute in user specific values

  * This should be more informative; it avoids introducing kustomize and
    users can see the resource specs.
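  For example, a minimal sketch of the f-string approach (the name, namespace, and image values below are illustrative placeholders, not the notebook's actual values):

  ```python
  # Build a Kubernetes YAML spec with a Python f-string instead of kustomize.
  # All values below are illustrative placeholders.
  name = "mnist-train"
  namespace = "kubeflow-user"
  image = "gcr.io/my-project/mnist:latest"

  train_spec = f"""apiVersion: kubeflow.org/v1
  kind: TFJob
  metadata:
    name: {name}
    namespace: {namespace}
  spec:
    tfReplicaSpecs:
      Chief:
        replicas: 1
        template:
          spec:
            containers:
              - name: tensorflow
                image: {image}
  """

  print(train_spec)
  ```

  The rendered string can then be passed to `kubectl apply -f -` or to a Kubernetes client library.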

* I've opted to make the notebook GCP specific. I think it's less confusing
  to users to have separate notebooks focused on specific platforms rather
  than having one notebook with a lot of caveats about what to do under
  different conditions

* I've deleted the kustomize overlays for GCS since we don't want users to
  use them anymore

* I used fairing and kaniko to eliminate the use of docker to build the images
  so that everything can run from a notebook running inside the cluster.

* k8s_util.py has some reusable functions that hide low-level details from
  users (e.g. direct calls to the K8s APIs.)

* Change the mnist test to just run the notebook
  * Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable

* Fix lint.

* Update for lint.

* A notebook to run the mnist E2E example.

Related to: kubeflow/website#1553

* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob python SDK.

* Fix the ISTIO rule.

* Fix UI and serving; need to update TF serving to match version trained on.

* Get the IAP endpoint.

* Start writing some helper python functions for K8s.

* Commit before switching from replace to delete.

* Create a library to bulk create objects.

* Cleanup.

* Add back k8s_util.py

* Delete train.yaml; this shouldn't have been added.

* update the notebook image.

* Refactor code into k8s_util; print out links.

* Clean up the notebook. Should be working E2E.

* Added section to get logs from stackdriver.

* Add comment about profile.

* Latest.

* Override mnist_gcp.ipynb with mnist.ipynb

I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.

* More fixes.

* Resolve some conflicts from the rebase; override with changes on remote branch.
This commit is contained in:
Jeremy Lewi 2020-02-16 19:15:28 -08:00 committed by GitHub
parent b9a7719f29
commit cc93a80420
28 changed files with 2576 additions and 1564 deletions


@@ -56,7 +56,10 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use"--disable=all --enable=classes
# --disable=W"
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use
# Kubeflow disables string-interpolation because we are starting to use f
# style strings
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use,logging-fstring-interpolation
[REPORTS]


@@ -1,5 +1,6 @@
#This container contains your model and any helper scripts specific to your model.
FROM tensorflow/tensorflow:1.7.0
# When building the image inside mnist.ipynb the base docker image will be overwritten
FROM tensorflow/tensorflow:1.15.2-py3
ADD model.py /opt/model.py
RUN chmod +x /opt/model.py


@@ -19,6 +19,8 @@
# To override variables do
# make ${TARGET} ${VAR}=${VALUE}
#
#
# TODO(jlewi): We should probably switch to Skaffold and Tekton
# IMG is the base path for images..
# Individual images will be


@@ -3,6 +3,8 @@
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*
- [MNIST on Kubeflow](#mnist-on-kubeflow)
- [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp)
- [MNIST on other platforms](#mnist-on-other-platforms)
- [Prerequisites](#prerequisites)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Local Setup](#local-setup)
@@ -13,21 +15,17 @@
- [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster)
- [Training your model](#training-your-model)
- [Local storage](#local-storage)
- [Using GCS](#using-gcs)
- [Using S3](#using-s3)
- [Monitoring](#monitoring)
- [Tensorboard](#tensorboard)
- [Local storage](#local-storage-1)
- [Using GCS](#using-gcs-1)
- [Using S3](#using-s3-1)
- [Deploying TensorBoard](#deploying-tensorboard)
- [Serving the model](#serving-the-model)
- [GCS](#gcs)
- [S3](#s3)
- [Local storage](#local-storage-2)
- [Web Front End](#web-front-end)
- [Connecting via port forwarding](#connecting-via-port-forwarding)
- [Using IAP on GCP](#using-iap-on-gcp)
- [Conclusion and Next Steps](#conclusion-and-next-steps)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -37,6 +35,45 @@
This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.
Follow the version of the guide that is specific to how you have deployed Kubeflow
1. [MNIST on Kubeflow on GCP](#gcp)
1. [MNIST on other platforms](#other)
<a id=gcp></a>
# MNIST on Kubeflow on GCP
Follow these instructions to run the MNIST tutorial on GCP
1. Follow the [GCP instructions](https://www.kubeflow.org/docs/gke/deploy/) to deploy Kubeflow with IAP
1. Launch a Jupyter notebook
* The tutorial has been tested using the Jupyter Tensorflow 1.15 image
1. Launch a terminal in Jupyter and clone the kubeflow examples repo
```
git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
```
* **Tip** When you start a terminal in Jupyter, run the command `bash` to start
  a bash terminal which is much more friendly than the default shell
* **Tip** You can change the URL from '/tree' to '/lab' to switch to using Jupyterlab
1. Open the notebook `mnist/mnist_gcp.ipynb`
1. Follow the notebook to train and deploy MNIST on Kubeflow
<a id=other></a>
# MNIST on other platforms
The tutorial is currently not up to date for Kubeflow 1.0. Please check the issues
* [kubeflow/examples#724](https://github.com/kubeflow/examples/issues/724) for AWS
* [kubeflow/examples#725](https://github.com/kubeflow/examples/issues/725) for other platforms
## Prerequisites
Before we get started there are a few requirements.
@@ -166,100 +203,6 @@ And to check the logs
kubectl logs mnist-train-local-chief-0
```
#### Using GCS
In this section we describe how to save the model to Google Cloud Storage (GCS).
Storing the model in GCS has the advantages:
* The model is readily available after the job finishes
* We can run distributed training
* Distributed training requires a storage system accessible to all the machines
Enter the `training/GCS` from the `mnist` application directory.
```
cd training/GCS
```
Set an environment variable that points to your GCP project Id
```
PROJECT=<your project id>
```
Create a bucket on GCS to store our model. The name must be unique across all GCS buckets
```
BUCKET=distributed-$(date +%s)
gsutil mb gs://$BUCKET/
```
Give the job a different name (to distinguish it from your job which didn't use GCS)
```
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist
```
Optionally, if you want to use your custom training image, configure that as below.
```
kustomize edit set image training-image=$DOCKER_URL
```
Next we configure it to run distributed by setting the number of parameter servers and workers to use. `numPs` is the number of parameter servers and `numWorkers` is the number of workers.
```
../base/definition.sh --numPs 1 --numWorkers 2
```
Set the training parameters, such as training steps, batch size and learning rate.
```
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
```
Now we need to configure parameters and tell the code to save the model to GCS.
```
MODEL_PATH=my-model
kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET}/${MODEL_PATH}
kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export
```
Build a yaml file for the `TFJob` specification based on your kustomize config:
```
kustomize build . > mnist-training.yaml
```
Then, in `mnist-training.yaml`, search for this line: `namespace: kubeflow`.
Edit it to **replace `kubeflow` with the name of your user profile namespace**,
which will probably have the form `kubeflow-<username>`. (If you're not sure what this
namespace is called, you can find it in the top menubar of the Kubeflow Central
Dashboard.)
After you've updated the namespace, apply the `TFJob` specification to the
Kubeflow cluster:
```
kubectl apply -f mnist-training.yaml
```
You can then check the job status:
```
kubectl get tfjobs -n <your-user-namespace> -o yaml mnist-train-dist
```
And to check the logs:
```
kubectl logs -n <your-user-namespace> -f mnist-train-dist-chief-0
```
#### Using S3
To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as kubernetes secrets and the variables will be passed in as environment variables. Modify the below values to suit your environment.
@@ -426,27 +369,6 @@ kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/m
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt
```
#### Using GCS
Enter the `monitoring/GCS` from the `mnist` application directory.
```
cd monitoring/GCS
```
Configure TensorBoard to point to your model location
```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR}
```
Assuming you followed the directions above, if you used GCS you can use the following value
```
LOGDIR=gs://${BUCKET}/${MODEL_PATH}
```
#### Using S3
Enter the `monitoring/S3` from the `mnist` application directory.
@@ -551,64 +473,6 @@ The model code will export the model in saved model format which is suitable for
To serve the model follow the instructions below. The instructions vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomization as a convenience for setting relevant environment variables.
### GCS
Here we show how to serve the model when it is stored on GCS. This assumes that when you trained the model you set `exportDir` to a GCS URI; if not you can always copy it to GCS using `gsutil`.
Check that a model was exported
```
EXPORT_DIR=gs://${BUCKET}/${MODEL_PATH}/export
gsutil ls -r ${EXPORT_DIR}
```
The output should look something like
```
${EXPORT_DIR}/1547100373/saved_model.pb
${EXPORT_DIR}/1547100373/variables/:
${EXPORT_DIR}/1547100373/variables/
${EXPORT_DIR}/1547100373/variables/variables.data-00000-of-00001
${EXPORT_DIR}/1547100373/variables/variables.index
```
The number `1547100373` is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you save a model to the same location as a previous location.
Enter the `serving/GCS` from the `mnist` application directory.
```
cd serving/GCS
```
Set a different name for the tf-serving.
```
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-gcs-dist
```
Set your model path
```
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${EXPORT_DIR}
```
Deploy it, and run a service to make the deployment accessible to other pods in the cluster
```
kustomize build . |kubectl apply -f -
```
You can check the deployment by running
```
kubectl describe deployments mnist-gcs-dist
```
The service should make the `mnist-gcs-dist` deployment accessible over port 9000
```
kubectl describe service mnist-gcs-dist
```
### S3
We can also serve the model when it is stored on S3. This assumes that when you trained the model you set `exportDir` to a S3
@@ -799,16 +663,7 @@ POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{
kubectl port-forward ${POD_NAME} 8080:5000
```
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [GCS](http://localhost:8080/?addr=mnist-gcs-dist) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
### Using IAP on GCP
If you are using GCP and have set up IAP then you can access the web UI at
```
https://${DEPLOYMENT}.endpoints.${PROJECT}.cloud.goog/${NAMESPACE}/mnist/
```
## Conclusion and Next Steps

mnist/k8s_util.py Normal file (147 lines)
@@ -0,0 +1,147 @@
"""Some utilities for working with Kubernetes.
TODO: These should probably be replaced by functions in fairing.
"""
import logging
import re
import yaml
from kubernetes import client as k8s_client
from kubernetes.client import rest as k8s_rest
def camel_to_snake(name):
name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()
K8S_CREATE = "K8S_CREATE"
K8S_REPLACE = "K8S_REPLACE"
K8S_CREATE_OR_REPLACE = "K8S_CREATE_OR_REPLACE"
def _get_result_name(result):
# For custom objects the result is a dict but for other objects
# its a python class
if isinstance(result, dict):
result_name = result["metadata"]["name"]
result_namespace = result["metadata"]["name"]
else:
result_name = result.metadata.name
result_namespace = result.metadata.namespace
return result_namespace, result_name
def apply_k8s_specs(specs, mode=K8S_CREATE): # pylint: disable=too-many-branches,too-many-statements
"""Run apply on the provided Kubernetes specs.
Args:
specs: A list of strings or dicts providing the YAML specs to
apply.
mode: (Optional): Mode indicates how the resources should be created.
K8S_CREATE - Use the create verb. Works with generateName
K8S_REPLACE - Issue a delete of existing resources before doing a create
K8s_CREATE_OR_REPLACE - Try to create an object; if it already exists
replace it
"""
# TODO(jlewi): How should we handle patching existing updates?
results = []
if mode not in [K8S_CREATE, K8S_CREATE_OR_REPLACE, K8S_REPLACE]:
raise ValueError(f"Unknown mode {mode}")
for s in specs:
spec = s
if not isinstance(spec, dict):
spec = yaml.load(spec)
name = spec["metadata"]["name"]
namespace = spec["metadata"]["namespace"]
kind = spec["kind"]
kind_snake = camel_to_snake(kind)
plural = spec["kind"].lower() + "s"
result = None
if not "/" in spec["apiVersion"]:
group = None
else:
group, version = spec["apiVersion"].split("/", 1)
if group is None or group.lower() == "apps":
if group is None:
api = k8s_client.CoreV1Api()
else:
api = k8s_client.AppsV1Api()
create_method_name = f"create_namespaced_{kind_snake}"
create_method_args = [namespace, spec]
replace_method_name = f"delete_namespaced_{kind_snake}"
replace_method_args = [name, namespace]
else:
api = k8s_client.CustomObjectsApi()
create_method_name = f"create_namespaced_custom_object"
create_method = getattr(api, create_method_name)
create_method_args = [group, version, namespace, plural, spec]
delete_options = k8s_client.V1DeleteOptions()
replace_method_name = f"delete_namespaced_custom_object"
replace_method_args = [group, version, namespace, plural, name, delete_options]
create_method = getattr(api, create_method_name)
replace_method = getattr(api, replace_method_name)
if mode in [K8S_CREATE, K8S_CREATE_OR_REPLACE]:
try:
result = create_method(*create_method_args)
result_namespace, result_name = _get_result_name(result)
logging.info(f"Created {kind} {result_namespace}.{result_name}")
results.append(result)
continue
except k8s_rest.ApiException as e:
# 409 is conflict indicates resource already exists
if e.status == 409 and mode == K8S_CREATE_OR_REPLACE:
pass
else:
raise
# Using replace didn't work for virtualservices so we explicitly delete
# and then issue a create
result = replace_method(*replace_method_args)
logging.info(f"Deleted {kind} {namespace}.{name}")
result = create_method(*create_method_args)
result_namespace, result_name = _get_result_name(result)
logging.info(f"Created {kind} {result_namespace}.{result_name}")
# Now recreate it
results.append(result)
return results
def get_iap_endpoint():
"""Return the URL of the IAP endpoint"""
extensions = k8s_client.ExtensionsV1beta1Api()
kf_ingress = None
try:
kf_ingress = extensions.read_namespaced_ingress("envoy-ingress", "istio-system")
except k8s_rest.ApiException as e:
if e.status == 403:
logging.warning(f"The service account doesn't have sufficient privileges "
f"to get the istio-system ingress. "
f"You will have to manually enter the Kubeflow endpoint. "
f"To make this function work ask someone with cluster "
f"priveleges to create an appropriate "
f"clusterrolebinding by running a command.\n"
f"kubectl create --namespace=istio-system rolebinding "
"--clusterrole=kubeflow-view "
"--serviceaccount=$NAMESPACE}:default-editor "
"${NAMESPACE}-istio-view")
return ""
raise
return f"https://{kf_ingress.spec.rules[0].host}"

mnist/mnist_gcp.ipynb Normal file (1791 lines)
File diff suppressed because it is too large


@@ -1,8 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../base
configurations:
- params.yaml

mnist/notebook_setup.py Normal file (57 lines)
@@ -0,0 +1,57 @@
"""Some routines to setup the notebook.
This is separated out from util.py because this module installs some of the pip packages
that util depends on.
"""
import sys
import logging
import os
import subprocess
from importlib import reload
from pathlib import Path
TF_OPERATOR_COMMIT = "9238906"
def notebook_setup():
# Install the SDK
logging.basicConfig(format='%(message)s')
logging.getLogger().setLevel(logging.INFO)
home = str(Path.home())
logging.info("pip installing requirements.txt")
subprocess.check_call(["pip3", "install", "--user", "-r", "requirements.txt"])
clone_dir = os.path.join(home, "git_tf-operator")
if not os.path.exists(clone_dir):
logging.info("Cloning the tf-operator repo")
subprocess.check_call(["git", "clone", "https://github.com/kubeflow/tf-operator.git",
clone_dir])
logging.info(f"Checkout kubeflow/tf-operator @{TF_OPERATOR_COMMIT}")
subprocess.check_call(["git", "checkout", TF_OPERATOR_COMMIT], cwd=clone_dir)
logging.info("Configure docker credentials")
subprocess.check_call(["gcloud", "auth", "configure-docker", "--quiet"])
if os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
logging.info("Activating service account")
subprocess.check_call(["gcloud", "auth", "activate-service-account",
"--key-file=" +
os.getenv("GOOGLE_APPLICATION_CREDENTIALS"),
"--quiet"])
# Installing the python packages locally doesn't appear to have them automatically
# added the path so we need to manually add the directory
local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")
tf_operator_py_path = os.path.join(clone_dir, "sdk/python")
for p in [local_py_path, tf_operator_py_path]:
if p not in sys.path:
logging.info("Adding %s to python path", p)
# Insert at front because we want to override any installed packages
sys.path.insert(0, p)
# Force a reload of kubeflow; since kubeflow is a multi namespace module
# if we've loaded up some new kubeflow subpackages we need to force a reload to see them.
import kubeflow
reload(kubeflow)
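The sys.path manipulation in notebook_setup can be illustrated in isolation (the directory below is a hypothetical placeholder):

```python
import sys

# Directories inserted at the front of sys.path shadow already-installed
# packages with the same name; this is why notebook_setup inserts rather
# than appends. The path here is a hypothetical placeholder.
local_py_path = "/tmp/example/.local/lib/python3.6/site-packages"

if local_py_path not in sys.path:
    sys.path.insert(0, local_py_path)

print(sys.path[0])
```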

mnist/requirements.txt Normal file (2 lines)
@@ -0,0 +1,2 @@
git+git://github.com/kubeflow/fairing.git@9b0d4ed4796ba349ac6067bbd802ff1d6454d015
retrying==1.3.3


@@ -1,5 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../base


@@ -1,11 +1,11 @@
kind: ConfigMap
apiVersion: v1
metadata:
  name: mnist-deploy-config
  namespace: kubeflow
data:
  monitoring_config.txt: |-
    prometheus_config: {
      enable: true,
      path: "/monitoring/prometheus/metrics"
    }
kind: ConfigMap
metadata:
  name: mnist-deploy-config
  namespace: kubeflow


@@ -1,116 +0,0 @@
import os

import pytest

def pytest_addoption(parser):
  parser.addoption(
      "--tfjob_name", help="Name for the TFJob.",
      type=str, default="mnist-test-" + os.getenv('BUILD_ID'))

  parser.addoption(
      "--namespace", help=("The namespace to run in. This should correspond to"
                           "a namespace associated with a Kubeflow namespace."),
      type=str, default="kubeflow-kf-ci-v1-user")

  parser.addoption(
      "--repos", help="The repos to checkout; leave blank to use defaults",
      type=str, default="")

  parser.addoption(
      "--trainer_image", help="TFJob training image",
      type=str, default="gcr.io/kubeflow-examples/mnist/model:build-" + os.getenv('BUILD_ID'))

  parser.addoption(
      "--train_steps", help="train steps for mnist testing",
      type=str, default="200")

  parser.addoption(
      "--batch_size", help="batch size for mnist training",
      type=str, default="100")

  parser.addoption(
      "--learning_rate", help="mnist learning rate",
      type=str, default="0.01")

  parser.addoption(
      "--num_ps", help="The number of parameter servers",
      type=str, default="1")

  parser.addoption(
      "--num_workers", help="The number of workers",
      type=str, default="2")

  parser.addoption(
      "--model_dir", help="Path for model saving",
      type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID'))

  parser.addoption(
      "--export_dir", help="Path for model exporting",
      type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID'))

  parser.addoption(
      "--deploy_name", help="Name for the service deployment",
      type=str, default="mnist-test-" + os.getenv('BUILD_ID'))

  parser.addoption(
      "--master", action="store", default="", help="IP address of GKE master")

  parser.addoption(
      "--service", action="store", default="mnist-test-" + os.getenv('BUILD_ID'),
      help="The name of the mnist K8s service")

@pytest.fixture
def master(request):
  return request.config.getoption("--master")

@pytest.fixture
def namespace(request):
  return request.config.getoption("--namespace")

@pytest.fixture
def service(request):
  return request.config.getoption("--service")

@pytest.fixture
def tfjob_name(request):
  return request.config.getoption("--tfjob_name")

@pytest.fixture
def repos(request):
  return request.config.getoption("--repos")

@pytest.fixture
def trainer_image(request):
  return request.config.getoption("--trainer_image")

@pytest.fixture
def train_steps(request):
  return request.config.getoption("--train_steps")

@pytest.fixture
def batch_size(request):
  return request.config.getoption("--batch_size")

@pytest.fixture
def learning_rate(request):
  return request.config.getoption("--learning_rate")

@pytest.fixture
def num_ps(request):
  return request.config.getoption("--num_ps")

@pytest.fixture
def num_workers(request):
  return request.config.getoption("--num_workers")

@pytest.fixture
def model_dir(request):
  return request.config.getoption("--model_dir")

@pytest.fixture
def export_dir(request):
  return request.config.getoption("--export_dir")

@pytest.fixture
def deploy_name(request):
  return request.config.getoption("--deploy_name")


@@ -1,84 +0,0 @@
"""Test deploying the mnist model.
This file tests that we can deploy the model.
It is an integration test as it depends on having access to
a Kubeflow deployment to deploy on. It also depends on having a model.
Python Path Requirements:
kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
* Provides utilities for testing
Manually running the test
pytest deploy_test.py \
name=mnist-deploy-test-${BUILD_ID} \
namespace=${namespace} \
modelBasePath=${modelDir} \
exportDir=${modelDir} \
"""
import logging
import os
import pytest
from kubernetes.config import kube_config
from kubernetes import client as k8s_client
from kubeflow.testing import util
def test_deploy(record_xml_attribute, deploy_name, namespace, model_dir, export_dir):
util.set_pytest_junit(record_xml_attribute, "test_deploy")
util.maybe_activate_service_account()
app_dir = os.path.join(os.path.dirname(__file__), "../serving/GCS")
app_dir = os.path.abspath(app_dir)
logging.info("--app_dir not set defaulting to: %s", app_dir)
# TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue:
# https://github.com/kubernetes-sigs/kustomize/issues/1295
kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \
'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64'
util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir)
util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir)
# TODO (@jinchihe): The kubectl need to be upgraded to 1.14.0 due to below issue.
# Invalid object doesn't have additional properties ...
kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \
'release/v1.14.0/bin/linux/amd64/kubectl'
util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir)
util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir)
# Configure custom parameters using kustomize
configmap = 'mnist-map-serving'
util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir)
util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
'--from-literal=name' + '=' + deploy_name], cwd=app_dir)
util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
'--from-literal=modelBasePath=' + model_dir], cwd=app_dir)
util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
'--from-literal=exportDir=' + export_dir], cwd=app_dir)
# Apply the components
util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir)
util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir)
kube_config.load_kube_config()
api_client = k8s_client.ApiClient()
util.wait_for_deployment(api_client, namespace, deploy_name, timeout_minutes=4)
# We don't delete the resources. We depend on the namespace being
# garbage collected.
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO,
format=('%(levelname)s|%(asctime)s'
'|%(pathname)s|%(lineno)d| %(message)s'),
datefmt='%Y-%m-%dT%H:%M:%S',
)
logging.getLogger().setLevel(logging.INFO)
pytest.main()


@@ -1,123 +0,0 @@
"""Test mnist_client.
This file tests that we can send predictions to the model
using REST.
It is an integration test as it depends on having access to
a deployed model.
We use the pytest framework because
1. It can output results in junit format for prow/gubernator
2. It has good support for configuring tests using command line arguments
(https://docs.pytest.org/en/latest/example/simple.html)
Python Path Requirements:
kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
* Provides utilities for testing
Manually running the test
1. Configure your KUBECONFIG file to point to the desired cluster
"""
import json
import logging
import os
import subprocess
import requests
from retrying import retry
import six
from kubernetes.config import kube_config
from kubernetes import client as k8s_client
import pytest
from kubeflow.testing import util
def is_retryable_result(r):
if r.status_code == requests.codes.NOT_FOUND:
message = "Request to {0} returned 404".format(r.url)
logging.error(message)
return True
return False
@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000,
stop_max_delay=5*60*1000,
retry_on_result=is_retryable_result)
def send_request(*args, **kwargs):
# We don't use util.run because that ends up including the access token
# in the logs
token = subprocess.check_output(["gcloud", "auth", "print-access-token"])
if six.PY3 and hasattr(token, "decode"):
token = token.decode()
token = token.strip()
headers = {
"Authorization": "Bearer " + token,
}
if "headers" not in kwargs:
kwargs["headers"] = {}
kwargs["headers"].update(headers)
r = requests.post(*args, **kwargs)
return r
@pytest.mark.xfail
def test_predict(master, namespace, service):
app_credentials = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
if app_credentials:
print("Activate service account")
util.run(["gcloud", "auth", "activate-service-account",
"--key-file=" + app_credentials])
if not master:
print("--master not set; using kubeconfig")
# util.load_kube_config appears to hang on python3
kube_config.load_kube_config()
api_client = k8s_client.ApiClient()
host = api_client.configuration.host
print("host={0}".format(host))
master = host.rsplit("/", 1)[-1]
this_dir = os.path.dirname(__file__)
test_data = os.path.join(this_dir, "test_data", "instances.json")
with open(test_data) as hf:
instances = json.load(hf)
# We proxy the request through the APIServer so that we can connect
# from outside the cluster.
url = ("https://{master}/api/v1/namespaces/{namespace}/services/{service}:8500"
"/proxy/v1/models/mnist:predict").format(
master=master, namespace=namespace, service=service)
logging.info("Request: %s", url)
r = send_request(url, json=instances, verify=False)
if r.status_code != requests.codes.OK:
msg = "Request to {0} exited with status code: {1} and content: {2}".format(
url, r.status_code, r.content)
logging.error(msg)
raise RuntimeError(msg)
content = r.content
if six.PY3 and hasattr(content, "decode"):
content = content.decode()
result = json.loads(content)
assert len(result["predictions"]) == 1
predictions = result["predictions"][0]
assert "classes" in predictions
assert "predictions" in predictions
assert len(predictions["predictions"]) == 10
logging.info("URL %s returned; %s", url, content)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO,
format=('%(levelname)s|%(asctime)s'
'|%(pathname)s|%(lineno)d| %(message)s'),
datefmt='%Y-%m-%dT%H:%M:%S',
)
logging.getLogger().setLevel(logging.INFO)
pytest.main()
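The `send_request` helper above combines two patterns: attaching a bearer token, and retrying on 404 while the model endpoint warms up. A stdlib-only sketch of the retry-on-result behavior that the `retrying` decorator provides (function names and delay parameters here are illustrative):

```python
# Stdlib sketch of retry-on-result with capped exponential backoff,
# mimicking what the `retrying` decorator does for send_request above.
import time

def retry_on_result(func, is_retryable, max_attempts=5,
                    base_delay=0.01, max_delay=0.05):
    delay = base_delay
    for attempt in range(max_attempts):
        result = func()
        if not is_retryable(result):
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return result

# Simulate an endpoint that returns 404 twice before becoming ready.
responses = [404, 404, 200]
status = retry_on_result(lambda: responses.pop(0), lambda r: r == 404)
print(status)  # 200
```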


@ -1,792 +0,0 @@
{
"instances": [
{
"x": [
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.011764707043766975,
0.07058823853731155,
0.07058823853731155,
0.07058823853731155,
0.4941176772117615,
0.5333333611488342,
0.686274528503418,
0.10196079313755035,
0.6509804129600525,
1.0,
0.9686275124549866,
0.49803924560546875,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.11764706671237946,
0.1411764770746231,
0.3686274588108063,
0.6039215922355652,
0.6666666865348816,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.8823530077934265,
0.6745098233222961,
0.9921569228172302,
0.9490196704864502,
0.7647059559822083,
0.250980406999588,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.19215688109397888,
0.9333333969116211,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9843137860298157,
0.364705890417099,
0.32156863808631897,
0.32156863808631897,
0.2196078598499298,
0.15294118225574493,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.07058823853731155,
0.8588235974311829,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.7764706611633301,
0.7137255072593689,
0.9686275124549866,
0.9450981020927429,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.3137255012989044,
0.6117647290229797,
0.41960787773132324,
0.9921569228172302,
0.9921569228172302,
0.803921639919281,
0.04313725605607033,
0.0,
0.16862745583057404,
0.6039215922355652,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.05490196496248245,
0.003921568859368563,
0.6039215922355652,
0.9921569228172302,
0.3529411852359772,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.545098066329956,
0.9921569228172302,
0.7450980544090271,
0.007843137718737125,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.04313725605607033,
0.7450980544090271,
0.9921569228172302,
0.27450981736183167,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.13725490868091583,
0.9450981020927429,
0.8823530077934265,
0.6274510025978088,
0.4235294461250305,
0.003921568859368563,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.3176470696926117,
0.9411765336990356,
0.9921569228172302,
0.9921569228172302,
0.46666669845581055,
0.09803922474384308,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.1764705926179886,
0.729411780834198,
0.9921569228172302,
0.9921569228172302,
0.5882353186607361,
0.10588236153125763,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.062745101749897,
0.364705890417099,
0.988235354423523,
0.9921569228172302,
0.7333333492279053,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.9764706492424011,
0.9921569228172302,
0.9764706492424011,
0.250980406999588,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.18039216101169586,
0.5098039507865906,
0.7176470756530762,
0.9921569228172302,
0.9921569228172302,
0.8117647767066956,
0.007843137718737125,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.15294118225574493,
0.5803921818733215,
0.8980392813682556,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9803922176361084,
0.7137255072593689,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0941176563501358,
0.44705885648727417,
0.8666667342185974,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.7882353663444519,
0.30588236451148987,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.09019608050584793,
0.25882354378700256,
0.8352941870689392,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.7764706611633301,
0.3176470696926117,
0.007843137718737125,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.07058823853731155,
0.6705882549285889,
0.8588235974311829,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.7647059559822083,
0.3137255012989044,
0.03529411926865578,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.21568629145622253,
0.6745098233222961,
0.8862745761871338,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.9568628072738647,
0.5215686559677124,
0.04313725605607033,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.5333333611488342,
0.9921569228172302,
0.9921569228172302,
0.9921569228172302,
0.8313726186752319,
0.529411792755127,
0.5176470875740051,
0.062745101749897,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0
]
}
]
}


@ -1,142 +0,0 @@
"""Test training using TFJob.
This file tests that we can submit the job
and that the job runs to completion.
It is an integration test as it depends on having access to
a Kubeflow deployment to submit the TFJob to.
Python Path Requirements:
kubeflow/tf-operator/py - https://github.com/kubeflow/tf-operator
* Provides utilities for testing TFJobs
kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py
* Provides utilities for testing
Manually running the test
pytest tfjobs_test.py \
tfjob_name=tfjobs-test-${BUILD_ID} \
namespace=${test_namespace} \
trainer_image=${training_image} \
train_steps=10 \
batch_size=10 \
learning_rate=0.01 \
num_ps=1 \
num_workers=2 \
model_dir=${model_dir} \
export_dir=${model_dir} \
"""
import json
import logging
import os
import pytest
from kubernetes.config import kube_config
from kubernetes import client as k8s_client
from kubeflow.tf_operator import tf_job_client #pylint: disable=no-name-in-module
from kubeflow.testing import util
def test_training(record_xml_attribute, tfjob_name, namespace, trainer_image, num_ps, #pylint: disable=too-many-arguments
num_workers, train_steps, batch_size, learning_rate, model_dir, export_dir):
util.set_pytest_junit(record_xml_attribute, "test_mnist")
util.maybe_activate_service_account()
app_dir = os.path.join(os.path.dirname(__file__), "../training/GCS")
app_dir = os.path.abspath(app_dir)
logging.info("--app_dir not set defaulting to: %s", app_dir)
# TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue:
# https://github.com/kubernetes-sigs/kustomize/issues/1295
kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \
'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64'
util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir)
util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir)
# TODO (@jinchihe): kubectl needs to be upgraded to 1.14.0 due to the issue:
# Invalid object doesn't have additional properties ...
kubectlUrl = 'https://storage.googleapis.com/kubernetes-release/' \
'release/v1.14.0/bin/linux/amd64/kubectl'
util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kubectlUrl], cwd=app_dir)
util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir)
# Configure custom parameters using kustomize
util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir)
util.run(['kustomize', 'edit', 'set', 'image', 'training-image=' + trainer_image], cwd=app_dir)
util.run(['../base/definition.sh', '--numPs', num_ps], cwd=app_dir)
util.run(['../base/definition.sh', '--numWorkers', num_workers], cwd=app_dir)
training_config = {
"name": tfjob_name,
"trainSteps": train_steps,
"batchSize": batch_size,
"learningRate": learning_rate,
"modelDir": model_dir,
"exportDir": export_dir,
}
configmap = 'mnist-map-training'
for key, value in training_config.items():
util.run(['kustomize', 'edit', 'add', 'configmap', configmap,
'--from-literal=' + key + '=' + value], cwd=app_dir)
# Create the TFJob.
util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir)
util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir)
logging.info("Created job %s in namespace %s", tfjob_name, namespace)
kube_config.load_kube_config()
api_client = k8s_client.ApiClient()
# Wait for the job to complete.
logging.info("Waiting for job to finish.")
results = tf_job_client.wait_for_job(
api_client,
namespace,
tfjob_name,
status_callback=tf_job_client.log_status)
logging.info("Final TFJob:\n %s", json.dumps(results, indent=2))
# Check for errors creating pods and services. Can potentially
# help debug failed test runs.
creation_failures = tf_job_client.get_creation_failures_from_tfjob(
api_client, namespace, results)
if creation_failures:
logging.warning(creation_failures)
if not tf_job_client.job_succeeded(results):
failure = "Job {0} in namespace {1} in status {2}".format( # pylint: disable=attribute-defined-outside-init
tfjob_name, namespace, results.get("status", {}))
logging.error(failure)
# if the TFJob failed, print out the pod logs for debugging.
pod_names = tf_job_client.get_pod_names(
api_client, namespace, tfjob_name)
logging.info("The pod names:\n %s", pod_names)
core_api = k8s_client.CoreV1Api(api_client)
for pod in pod_names:
logging.info("Getting logs of Pod %s.", pod)
try:
pod_logs = core_api.read_namespaced_pod_log(pod, namespace)
logging.info("The logs of Pod %s:\n %s", pod, pod_logs)
except k8s_client.rest.ApiException as e:
logging.info("Exception when calling CoreV1Api->read_namespaced_pod_log: %s\n", e)
return
# We don't delete the jobs. We rely on TTLSecondsAfterFinished
# to delete old jobs. Leaving jobs around should make it
# easier to debug.
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO,
format=('%(levelname)s|%(asctime)s'
'|%(pathname)s|%(lineno)d| %(message)s'),
datefmt='%Y-%m-%dT%H:%M:%S',
)
logging.getLogger().setLevel(logging.INFO)
pytest.main()
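The loop over `training_config` above turns a plain dict into a series of kustomize configmap edits. The command construction can be sketched standalone (the config values below are made-up examples, not the test's actual parameters):

```python
# Build the `kustomize edit add configmap --from-literal=key=value`
# commands from a config dict, as the test loop above does.
training_config = {
    "name": "mnist-train",
    "trainSteps": "200",
    "batchSize": "100",
}
configmap = "mnist-map-training"
commands = [
    ["kustomize", "edit", "add", "configmap", configmap,
     "--from-literal=" + key + "=" + value]
    for key, value in training_config.items()
]
print(commands[0][-1])  # --from-literal=name=mnist-train
```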


@ -1,11 +0,0 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- ../base
images:
- name: training-image
newName: gcr.io/kubeflow-examples/mnist/model
newTag: build-1202842504546750464


@ -27,7 +27,7 @@ from tensorflow.examples.tutorials.mnist import input_data
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2
- from PIL import Image
+ from PIL import Image # pylint: disable=wrong-import-order
def get_prediction(image, server_host='127.0.0.1', server_port=9000,


@ -55,6 +55,7 @@ workflows:
  - postsubmit
  include_dirs:
  - xgboost_synthetic/*
+ - mnist/*
  - py/kubeflow/examples/create_e2e_workflow.py
# E2E test for various notebooks
@ -67,17 +68,7 @@ workflows:
  - postsubmit
  include_dirs:
  - xgboost_synthetic/*
+ - mnist/*
  - py/kubeflow/examples/create_e2e_workflow.py
  kwargs:
    cluster_pattern: kf-v1-(?!n\d\d)
- # E2E test for mnist example
- - py_func: kubeflow.examples.create_e2e_workflow.create_workflow
-   name: mnist
-   job_types:
-   - periodic
-   - presubmit
-   - postsubmit
-   include_dirs:
-   - mnist/*
-   - py/kubeflow/examples/create_e2e_workflow.py


@ -261,82 +261,23 @@ class Builder:
"xgboost_synthetic",
"testing")
- def _build_tests_dag_mnist(self):
- """Build the dag for the set of tests to run mnist TFJob tests."""
- task_template = self._build_task_template()
- # ***************************************************************************
- # Build mnist image
- step_name = "build-image"
- train_image_base = "gcr.io/kubeflow-examples/mnist"
- train_image_tag = "build-" + PROW_DICT['BUILD_ID']
- command = ["/bin/bash",
- "-c",
- "gcloud auth activate-service-account --key-file=$(GOOGLE_APPLICATION_CREDENTIALS) \
- && make build-gcb IMG=" + train_image_base + " TAG=" + train_image_tag,
- ]
- dependencies = []
- build_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
- command, dependencies)
- build_step["container"]["workingDir"] = os.path.join(self.src_dir, "mnist")
- # ***************************************************************************
- # Test mnist TFJob
- step_name = "tfjob-test"
- # Using python2 to run the test to avoid dependency error.
- command = ["python2", "-m", "pytest", "tfjob_test.py",
- # Increase the log level so that info level log statements show up.
- "--log-cli-level=info",
- "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
- # Test timeout in seconds.
- "--timeout=1800",
- "--junitxml=" + self.artifacts_dir + "/junit_tfjob-test.xml",
- ]
- dependencies = [build_step['name']]
- tfjob_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
- command, dependencies)
- tfjob_step["container"]["workingDir"] = os.path.join(self.src_dir,
- "mnist",
- "testing")
- # ***************************************************************************
- # Test mnist deploy
- step_name = "deploy-test"
- command = ["python2", "-m", "pytest", "deploy_test.py",
- # Increase the log level so that info level log statements show up.
- "--log-cli-level=info",
- "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
- # Test timeout in seconds.
- "--timeout=1800",
- "--junitxml=" + self.artifacts_dir + "/junit_deploy-test.xml",
- ]
- dependencies = [tfjob_step["name"]]
- deploy_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
- command, dependencies)
- deploy_step["container"]["workingDir"] = os.path.join(self.src_dir,
- "mnist",
- "testing")
- # ***************************************************************************
- # Test mnist predict
- step_name = "predict-test"
- command = ["pytest", "predict_test.py",
- # Increase the log level so that info level log statements show up.
- "--log-cli-level=info",
- "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
- # Test timeout in seconds.
- "--timeout=1800",
- "--junitxml=" + self.artifacts_dir + "/junit_predict-test.xml",
- ]
- dependencies = [deploy_step["name"]]
- predict_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
- command, dependencies)
- predict_step["container"]["workingDir"] = os.path.join(self.src_dir,
- "mnist",
- "testing")
+ # ***************************************************************************
+ # Test mnist
+ step_name = "mnist"
+ command = ["pytest", "mnist_gcp_test.py",
+ # Increase the log level so that info level log statements show up.
+ "--log-cli-level=info",
+ "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'",
+ # Test timeout in seconds.
+ "--timeout=1800",
+ "--junitxml=" + self.artifacts_dir + "/junit_mnist-gcp-test.xml",
+ ]
+ dependencies = []
+ mnist_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template,
+ command, dependencies)
+ mnist_step["container"]["workingDir"] = os.path.join(
+ self.src_dir, "py/kubeflow/examples/notebook_tests")
def _build_exit_dag(self):
"""Build the exit handler dag"""
@ -432,8 +373,6 @@ class Builder:
# Run a dag of tests
if self.test_target_name.startswith("notebooks"):
self._build_tests_dag_notebooks()
- elif self.test_target_name == "mnist":
- self._build_tests_dag_mnist()
else:
raise RuntimeError('Invalid test_target_name ' + self.test_target_name)


@ -0,0 +1,34 @@
import pytest
def pytest_addoption(parser):
parser.addoption(
"--name", help="Name for the job. If not specified, one will be created "
"automatically.", type=str, default="")
parser.addoption(
"--namespace", help=("The namespace to run in. This should correspond to "
"a namespace associated with a Kubeflow profile."),
type=str,
default="kubeflow-kf-ci-v1-user")
parser.addoption(
"--image", help="Notebook image to use", type=str,
default="gcr.io/kubeflow-images-public/"
"tensorflow-1.15.2-notebook-cpu:1.0.0")
parser.addoption(
"--repos", help="The repos to checkout; leave blank to use defaults",
type=str, default="")
@pytest.fixture
def name(request):
return request.config.getoption("--name")
@pytest.fixture
def namespace(request):
return request.config.getoption("--namespace")
@pytest.fixture
def image(request):
return request.config.getoption("--image")
@pytest.fixture
def repos(request):
return request.config.getoption("--repos")
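pytest wires these options to tests by matching fixture names to test-function argument names; each fixture simply reads the parsed flag off `request.config`. A self-contained sketch of that lookup, with a plain class standing in for pytest's config object (all names here are illustrative):

```python
# A stand-in for pytest's request.config: fixtures simply forward
# getoption() lookups, so tests receive parsed flag values by name.
class FakeConfig:
    def __init__(self, options):
        self._options = options

    def getoption(self, flag):
        return self._options.get(flag, "")

config = FakeConfig({"--namespace": "kubeflow-kf-ci-v1-user"})

def namespace_fixture(config):
    return config.getoption("--namespace")

print(namespace_fixture(config))  # kubeflow-kf-ci-v1-user
```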


@ -0,0 +1,58 @@
import logging
import os
import subprocess
import tempfile
import fire
logger = logging.getLogger(__name__)
def prepare_env():
subprocess.check_call(["pip3", "install", "-U", "papermill"])
subprocess.check_call(["pip3", "install", "-r", "../requirements.txt"])
def execute_notebook(notebook_path, parameters=None):
import papermill #pylint: disable=import-error
temp_dir = tempfile.mkdtemp()
notebook_output_path = os.path.join(temp_dir, "out.ipynb")
papermill.execute_notebook(notebook_path, notebook_output_path,
cwd=os.path.dirname(notebook_path),
parameters=parameters,
log_output=True)
return notebook_output_path
def run_notebook_test(notebook_path, expected_messages, parameters=None):
output_path = execute_notebook(notebook_path, parameters=parameters)
with open(output_path, 'r') as hf:
actual_output = hf.read()
for expected_message in expected_messages:
if expected_message not in actual_output:
logger.error(actual_output)
assert False, "Unable to find expected message in output: " + expected_message
class NotebookExecutor:
@staticmethod
def test(notebook_path):
"""Test a notebook.
Args:
notebook_path: Absolute path of the notebook.
"""
prepare_env()
FILE_DIR = os.path.dirname(__file__)
EXPECTED_MSGS = [
"Finished upload of",
"Model export success: mockup-model.dat",
"Pod started running True",
"Cluster endpoint: http:",
]
run_notebook_test(notebook_path, EXPECTED_MSGS)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO,
format=('%(levelname)s|%(asctime)s'
'|%(message)s|%(pathname)s|%(lineno)d|'),
datefmt='%Y-%m-%dT%H:%M:%S',
)
fire.Fire(NotebookExecutor)


@ -0,0 +1,51 @@
# A batch job to run a notebook using papermill.
# TODO(jlewi): We should switch to using Tekton
apiVersion: batch/v1
kind: Job
metadata:
name: nb-test
labels:
app: nb-test
spec:
backoffLimit: 1
template:
metadata:
annotations:
# TODO(jlewi): Do we really want to disable sidecar injection
# in the test? Would it be better to use istio to mimic what happens
# in notebooks?
sidecar.istio.io/inject: "false"
labels:
app: nb-test
spec:
restartPolicy: Never
securityContext:
runAsUser: 0
initContainers:
# This init container checks out the source code.
- command:
- /usr/local/bin/checkout_repos.sh
- --repos=kubeflow/examples@$(CHECK_TAG)
- --src_dir=/src
name: checkout
image: gcr.io/kubeflow-ci/test-worker:v20190802-c6f9140-e3b0c4
volumeMounts:
- mountPath: /src
name: src
containers:
- env:
- name: PYTHONPATH
value: /src/kubeflow/examples/py/
- name: executing-notebooks
image: execute-image
command: ["python3", "-m",
"kubeflow.examples.notebook_tests.execute_notebook",
"test", "/src/kubeflow/examples/mnist/mnist_gcp.ipynb"]
workingDir: /src/kubeflow/examples/py/kubeflow/examples/notebook_tests
volumeMounts:
- mountPath: /src
name: src
serviceAccount: default-editor
volumes:
- name: src
emptyDir: {}


@ -0,0 +1,29 @@
import logging
import os
import pytest
from kubeflow.examples.notebook_tests import nb_test_util
from kubeflow.testing import util
# TODO(jlewi): This test is new; there's some work to be done to make it
# reliable. So for now we mark it as expected to fail in presubmits.
# We only mark it as expected to fail on presubmits because expected
# failures don't show up in testgrid, and we want signal in postsubmits
# and periodics.
@pytest.mark.xfail(os.getenv("JOB_TYPE") == "presubmit", reason="Flaky")
def test_mnist_gcp(record_xml_attribute, name, namespace, # pylint: disable=too-many-branches,too-many-statements
repos, image):
'''Generate the Job and submit it.'''
util.set_pytest_junit(record_xml_attribute, "test_mnist_gcp")
nb_test_util.run_papermill_job(name, namespace, repos, image)
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO,
format=('%(levelname)s|%(asctime)s'
'|%(pathname)s|%(lineno)d| %(message)s'),
datefmt='%Y-%m-%dT%H:%M:%S',
)
logging.getLogger().setLevel(logging.INFO)
pytest.main()


@ -0,0 +1,80 @@
"""Some utilities for running notebook tests."""
import datetime
import logging
import uuid
import yaml
from kubernetes import client as k8s_client
from kubeflow.testing import argo_build_util
from kubeflow.testing import util
def run_papermill_job(name, namespace, # pylint: disable=too-many-branches,too-many-statements
repos, image):
"""Generate a K8s job to run a notebook using papermill
Args:
name: Name for the K8s job
namespace: The namespace where the job should run.
repos: (Optional) Which repos to checkout; if not specified tries
to infer based on PROW environment variables.
image: The docker image to run the notebook in.
"""
util.maybe_activate_service_account()
with open("job.yaml") as hf:
job = yaml.safe_load(hf)
# We need to checkout the correct version of the code
# in presubmits and postsubmits. We check the prow environment
# variables to get the appropriate values.
# See
# https://github.com/kubernetes/test-infra/blob/45246b09ed105698aa8fb928b7736d14480def29/prow/jobs.md#job-environment-variables
if not repos:
repos = argo_build_util.get_repo_from_prow_env()
logging.info("Repos set to %s", repos)
job["spec"]["template"]["spec"]["initContainers"][0]["command"] = [
"/usr/local/bin/checkout_repos.sh",
"--repos=" + repos,
"--src_dir=/src",
"--depth=all",
]
job["spec"]["template"]["spec"]["containers"][0]["image"] = image
util.load_kube_config(persist_config=False)
if name:
job["metadata"]["name"] = name
else:
job["metadata"]["name"] = ("notebook-test-" +
datetime.datetime.now().strftime("%H%M%S")
+ "-" + uuid.uuid4().hex[0:3])
name = job["metadata"]["name"]
job["metadata"]["namespace"] = namespace
# Create an API client object to talk to the K8s master.
api_client = k8s_client.ApiClient()
batch_api = k8s_client.BatchV1Api(api_client)
logging.info("Creating job:\n%s", yaml.dump(job))
actual_job = batch_api.create_namespaced_job(job["metadata"]["namespace"],
job)
logging.info("Created job %s.%s:\n%s", namespace, name,
yaml.safe_dump(actual_job.to_dict()))
final_job = util.wait_for_job(api_client, namespace, name,
timeout=datetime.timedelta(minutes=30))
logging.info("Final job:\n%s", yaml.safe_dump(final_job.to_dict()))
if not final_job.status.conditions:
raise RuntimeError("Job {0}.{1} did not complete".format(namespace, name))
last_condition = final_job.status.conditions[-1]
if last_condition.type not in ["Complete"]:
logging.error("Job didn't complete successfully")
raise RuntimeError("Job {0}.{1} failed".format(namespace, name))
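The success check at the end of `run_papermill_job` inspects the Job's status conditions: no conditions means the Job never finished, and anything other than a final `Complete` condition is treated as failure. The same logic in isolation, with plain dicts standing in for kubernetes client objects (a sketch, not the shared library code):

```python
# Decide whether a K8s batch Job succeeded from its status conditions,
# mirroring the check at the end of run_papermill_job above.
def job_succeeded(job_status):
    conditions = job_status.get("conditions") or []
    if not conditions:
        raise RuntimeError("Job did not complete")
    return conditions[-1]["type"] == "Complete"

print(job_succeeded({"conditions": [{"type": "Complete"}]}))  # True
print(job_succeeded({"conditions": [{"type": "Failed"}]}))    # False
```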


@ -0,0 +1,252 @@
{
// TODO(https://github.com/ksonnet/ksonnet/issues/222): Taking namespace as an argument is a work around for the fact that ksonnet
// doesn't support automatically piping in the namespace from the environment to prototypes.
// convert a list of two items into a map representing an environment variable
// TODO(jlewi): Should we move this into kubeflow/core/util.libsonnet
listToMap:: function(v)
{
name: v[0],
value: v[1],
},
// Function to turn comma separated list of prow environment variables into a dictionary.
parseEnv:: function(v)
local pieces = std.split(v, ",");
if v != "" && std.length(pieces) > 0 then
std.map(
function(i) $.listToMap(std.split(i, "=")),
std.split(v, ",")
)
else [],
parts(namespace, name):: {
// Workflow to run the e2e test.
e2e(prow_env, bucket):
// mountPath is the directory where the volume to store the test data
// should be mounted.
local mountPath = "/mnt/" + "test-data-volume";
// testDir is the root directory for all data for a particular test run.
local testDir = mountPath + "/" + name;
// outputDir is the directory to sync to GCS to contain the output for this job.
local outputDir = testDir + "/output";
local artifactsDir = outputDir + "/artifacts";
local goDir = testDir + "/go";
// Source directory where all repos should be checked out
local srcRootDir = testDir + "/src";
// The directory containing the kubeflow/examples repo
local srcDir = srcRootDir + "/kubeflow/examples";
local image = "gcr.io/kubeflow-ci/test-worker";
// The name of the NFS volume claim to use for test files.
// local nfsVolumeClaim = "kubeflow-testing";
local nfsVolumeClaim = "nfs-external";
// The name to use for the volume to use to contain test data.
local dataVolume = "kubeflow-test-volume";
local versionTag = name;
// The directory within the kubeflow_testing submodule containing
// py scripts to use.
local kubeflowExamplesPy = srcDir;
local kubeflowTestingPy = srcRootDir + "/kubeflow/testing/py";
local project = "kubeflow-ci";
// GKE cluster to use
// We need to truncate the cluster to no more than 40 characters because
// cluster names can be a max of 40 characters.
// We expect the suffix of the cluster name to be unique salt.
// We prepend a z because cluster name must start with an alphanumeric character
// and if we cut the prefix we might end up starting with "-" or other invalid
// character for first character.
local cluster =
if std.length(name) > 40 then
"z" + std.substr(name, std.length(name) - 39, 39)
else
name;
local zone = "us-east1-d";
local chart = srcDir + "/bin/examples-chart-0.2.1-" + versionTag + ".tgz";
{
// Build an Argo template to execute a particular command.
// step_name: Name for the template
// command: List to pass as the container command.
buildTemplate(step_name, command):: {
name: step_name,
container: {
command: command,
image: image,
workingDir: srcDir,
env: [
{
// Add the source directories to the python path.
name: "PYTHONPATH",
value: kubeflowExamplesPy + ":" + kubeflowTestingPy,
},
{
// Set the GOPATH
name: "GOPATH",
value: goDir,
},
{
name: "GOOGLE_APPLICATION_CREDENTIALS",
value: "/secret/gcp-credentials/key.json",
},
{
name: "GIT_TOKEN",
valueFrom: {
secretKeyRef: {
name: "github-token",
key: "github_token",
},
},
},
{
name: "EXTRA_REPOS",
value: "kubeflow/testing@HEAD",
},
] + prow_env,
volumeMounts: [
{
name: dataVolume,
mountPath: mountPath,
},
{
name: "github-token",
mountPath: "/secret/github-token",
},
{
name: "gcp-credentials",
mountPath: "/secret/gcp-credentials",
},
],
},
}, // buildTemplate
apiVersion: "argoproj.io/v1alpha1",
kind: "Workflow",
metadata: {
name: name,
namespace: namespace,
},
// TODO(jlewi): Use OnExit to run cleanup steps.
spec: {
entrypoint: "e2e",
volumes: [
{
name: "github-token",
secret: {
secretName: "github-token",
},
},
{
name: "gcp-credentials",
secret: {
secretName: "kubeflow-testing-credentials",
},
},
{
name: dataVolume,
persistentVolumeClaim: {
claimName: nfsVolumeClaim,
},
},
], // volumes
// onExit specifies the template that should always run when the workflow completes.
onExit: "exit-handler",
templates: [
{
name: "e2e",
steps: [
[{
name: "checkout",
template: "checkout",
}],
[
{
name: "create-pr-symlink",
template: "create-pr-symlink",
},
// test_py_checks runs all py files matching "_test.py"
// This is currently commented out because the only matching tests
// are manual tests for some of the examples and/or they require
// dependencies (e.g. tensorflow) not in the generic test worker image.
//
//
// test_py_checks doesn't have options to exclude specific directories.
// Since there are no other tests we just comment it out.
//
// TODO(https://github.com/kubeflow/testing/issues/240): Modify py_test
// so we can exclude specific files.
//
// {
// name: "py-test",
// template: "py-test",
// },
// {
// name: "py-lint",
// template: "py-lint",
//},
],
],
},
{
name: "exit-handler",
steps: [
[{
name: "copy-artifacts",
template: "copy-artifacts",
}],
],
},
{
name: "checkout",
container: {
command: [
"/usr/local/bin/checkout.sh",
srcRootDir,
],
env: prow_env + [{
name: "EXTRA_REPOS",
value: "kubeflow/testing@HEAD",
}],
image: image,
volumeMounts: [
{
name: dataVolume,
mountPath: mountPath,
},
],
},
}, // checkout
$.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-test", [
"python",
"-m",
"kubeflow.testing.test_py_checks",
"--artifacts_dir=" + artifactsDir,
"--src_dir=" + srcDir,
]), // py test
$.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-lint", [
"python",
"-m",
"kubeflow.testing.test_py_lint",
"--artifacts_dir=" + artifactsDir,
"--src_dir=" + srcDir,
]), // py lint
$.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("create-pr-symlink", [
"python",
"-m",
"kubeflow.testing.prow_artifacts",
"--artifacts_dir=" + outputDir,
"create_pr_symlink",
"--bucket=" + bucket,
]), // create-pr-symlink
$.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("copy-artifacts", [
"python",
"-m",
"kubeflow.testing.prow_artifacts",
"--artifacts_dir=" + outputDir,
"copy_artifacts",
"--bucket=" + bucket,
]), // copy-artifacts
], // templates
},
}, // e2e
}, // parts
}
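The `parseEnv` helper at the top of this jsonnet file converts prow's comma-separated `NAME=value` string into a list of environment-variable maps. The same transformation in Python, for illustration only (this sketch splits on the first `=`, a minor difference from jsonnet's `std.split`):

```python
# Python equivalent of the jsonnet parseEnv helper above: turn
# "K1=v1,K2=v2" into [{"name": ..., "value": ...}, ...].
def parse_env(v):
    if not v:
        return []
    return [{"name": name, "value": value}
            for name, value in (piece.split("=", 1) for piece in v.split(","))]

print(parse_env("JOB_NAME=test,BUILD_ID=123"))
# [{'name': 'JOB_NAME', 'value': 'test'}, {'name': 'BUILD_ID', 'value': '123'}]
```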


@ -1257,7 +1257,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.5rc1"
+ "version": "3.6.9"
}
},
"nbformat": 4,


@ -0,0 +1 @@
TODO: We should reuse/share logic in py/kubeflow/examples/notebook_tests