diff --git a/.pylintrc b/.pylintrc index e133caa2..f19c2384 100644 --- a/.pylintrc +++ b/.pylintrc @@ -56,7 +56,10 @@ confidence= # --enable=similarities". If you want to run only the classes checker, but have # no Warning level messages displayed, use"--disable=all --enable=classes # --disable=W" -disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use +# +# Kubeflow disables string-interpolation because we are starting to use f +# style strings +disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use,logging-fstring-interpolation [REPORTS] diff --git a/mnist/Dockerfile.model b/mnist/Dockerfile.model index 9570a2bf..e3a5de54 100644 --- a/mnist/Dockerfile.model +++ b/mnist/Dockerfile.model @@ -1,5 +1,6 @@ #This container contains your model and any helper scripts specific to your model. -FROM tensorflow/tensorflow:1.7.0 +# When building the image inside mnist.ipynb the base docker image will be overwritten +FROM tensorflow/tensorflow:1.15.2-py3 ADD model.py /opt/model.py RUN chmod +x /opt/model.py diff --git a/mnist/Makefile b/mnist/Makefile index a2ef624f..4ba8decd 100755 --- a/mnist/Makefile +++ b/mnist/Makefile @@ -19,6 +19,8 @@ # To override variables do # make ${TARGET} ${VAR}=${VALUE} # +# +# TODO(jlewi): We should probably switch to Skaffold and Tekton # IMG is the base path for images.. 
# Individual images will be diff --git a/mnist/README.md b/mnist/README.md index 920b2a75..37b469af 100644 --- a/mnist/README.md +++ b/mnist/README.md @@ -3,6 +3,8 @@ **Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* - [MNIST on Kubeflow](#mnist-on-kubeflow) +- [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp) +- [MNIST on other platforms](#mnist-on-other-platforms) - [Prerequisites](#prerequisites) - [Deploy Kubeflow](#deploy-kubeflow) - [Local Setup](#local-setup) @@ -13,21 +15,17 @@ - [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster) - [Training your model](#training-your-model) - [Local storage](#local-storage) - - [Using GCS](#using-gcs) - [Using S3](#using-s3) - [Monitoring](#monitoring) - [Tensorboard](#tensorboard) - [Local storage](#local-storage-1) - - [Using GCS](#using-gcs-1) - [Using S3](#using-s3-1) - [Deploying TensorBoard](#deploying-tensorboard) - [Serving the model](#serving-the-model) - - [GCS](#gcs) - [S3](#s3) - [Local storage](#local-storage-2) - [Web Front End](#web-front-end) - [Connecting via port forwarding](#connecting-via-port-forwarding) - - [Using IAP on GCP](#using-iap-on-gcp) - [Conclusion and Next Steps](#conclusion-and-next-steps) @@ -37,6 +35,45 @@ This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model. +Follow the version of the guide that is specific to how you have deployed Kubeflow: + +1. [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp) +1. [MNIST on other platforms](#mnist-on-other-platforms) + + +# MNIST on Kubeflow on GCP + +Follow these instructions to run the MNIST tutorial on GCP: + +1. Follow the [GCP instructions](https://www.kubeflow.org/docs/gke/deploy/) to deploy Kubeflow with IAP + +1. Launch a Jupyter notebook + + * The tutorial has been tested using the Jupyter TensorFlow 1.15 image + +1. Launch a terminal in Jupyter and clone the Kubeflow examples repo + + ``` + git clone https://github.com/kubeflow/examples.git git_kubeflow-examples + ``` + + * **Tip** When you start a terminal in Jupyter, run the command `bash` to start + a bash terminal, which is much friendlier than the default shell + + * **Tip** You can change the URL from '/tree' to '/lab' to switch to using JupyterLab + +1. Open the notebook `mnist/mnist_gcp.ipynb` + +1. Follow the notebook to train and deploy MNIST on Kubeflow + + +# MNIST on other platforms + +The tutorial is currently not up to date for Kubeflow 1.0. Please check these issues: + +* [kubeflow/examples#724](https://github.com/kubeflow/examples/issues/724) for AWS +* [kubeflow/examples#725](https://github.com/kubeflow/examples/issues/725) for other platforms + ## Prerequisites Before we get started there are a few requirements. @@ -166,100 +203,6 @@ And to check the logs kubectl logs mnist-train-local-chief-0 ``` - -#### Using GCS - -In this section we describe how to save the model to Google Cloud Storage (GCS). - -Storing the model in GCS has the advantages: - -* The model is readily available after the job finishes -* We can run distributed training - - * Distributed training requires a storage system accessible to all the machines - -Enter the `training/GCS` from the `mnist` application directory. - -``` -cd training/GCS -``` - -Set an environment variable that points to your GCP project Id ``` -PROJECT= -``` - -Create a bucket on GCS to store our model.
The name must be unique across all GCS buckets -``` -BUCKET=distributed-$(date +%s) -gsutil mb gs://$BUCKET/ -``` - -Give the job a different name (to distinguish it from your job which didn't use GCS) - -``` -kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist -``` - -Optionally, if you want to use your custom training image, configurate that as below. - -``` -kustomize edit set image training-image=$DOCKER_URL -``` - -Next we configure it to run distributed by setting the number of parameter servers and workers to use. The `numPs` means the number of Ps and the `numWorkers` means the number of Worker. - -``` -../base/definition.sh --numPs 1 --numWorkers 2 -``` - -Set the training parameters, such as training steps, batch size and learning rate. - -``` -kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200 -kustomize edit add configmap mnist-map-training --from-literal=batchSize=100 -kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01 -``` - -Now we need to configure parameters and telling the code to save the model to GCS. - -``` -MODEL_PATH=my-model -kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET}/${MODEL_PATH} -kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export -``` - -Build a yaml file for the `TFJob` specification based on your kustomize config: - -``` -kustomize build . > mnist-training.yaml -``` - -Then, in `mnist-training.yaml`, search for this line: `namespace: kubeflow`. -Edit it to **replace `kubeflow` with the name of your user profile namespace**, -which will probably have the form `kubeflow-`. (If you're not sure what this -namespace is called, you can find it in the top menubar of the Kubeflow Central -Dashboard.) - -After you've updated the namespace, apply the `TFJob` specification to the -Kubeflow cluster: - -``` -kubectl apply -f mnist-training.yaml -``` - -You can then check the job status: - -``` -kubectl get tfjobs -n -o yaml mnist-train-dist -``` - -And to check the logs: - -``` -kubectl logs -n -f mnist-train-dist-chief-0 -``` - #### Using S3 To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as kubernetes secrets and the variables will be passed in as environment variables. Modify the below values to suit your environment. @@ -426,27 +369,6 @@ kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/m kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt ``` - -#### Using GCS - -Enter the `monitoring/GCS` from the `mnist` application directory. - -``` -cd monitoring/GCS -``` - -Configure TensorBoard to point to your model location - -``` -kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR} -``` - -Assuming you followed the directions above if you used GCS you can use the following value - -``` -LOGDIR=gs://${BUCKET}/${MODEL_PATH} -``` - #### Using S3 Enter the `monitoring/S3` from the `mnist` application directory. @@ -551,64 +473,6 @@ The model code will export the model in saved model format which is suitable for To serve the model follow the instructions below. The instructins vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomization as a convenience for setting relevant environment variables. -### GCS - -Here we show to serve the model when it is stored on GCS. 
This assumes that when you trained the model you set `exportDir` to a GCS URI; if not you can always copy it to GCS using `gsutil`. - -Check that a model was exported - -``` -EXPORT_DIR=gs://${BUCKET}/${MODEL_PATH}/export -gsutil ls -r ${EXPORT_DIR} -``` - -The output should look something like - -``` -${EXPORT_DIR}/1547100373/saved_model.pb -${EXPORT_DIR}/1547100373/variables/: -${EXPORT_DIR}/1547100373/variables/ -${EXPORT_DIR}/1547100373/variables/variables.data-00000-of-00001 -${EXPORT_DIR}/1547100373/variables/variables.index -``` - -The number `1547100373` is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you save a model to the same location as a previous location. - -Enter the `serving/GCS` from the `mnist` application directory. -``` -cd serving/GCS -``` - -Set a different name for the tf-serving. - -``` -kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-gcs-dist -``` - -Set your model path - -``` -kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${EXPORT_DIR} -``` - -Deploy it, and run a service to make the deployment accessible to other pods in the cluster - -``` -kustomize build . |kubectl apply -f - -``` - -You can check the deployment by running - -``` -kubectl describe deployments mnist-gcs-dist -``` - -The service should make the `mnist-gcs-dist` deployment accessible over port 9000 - -``` -kubectl describe service mnist-gcs-dist -``` - ### S3 We can also serve the model when it is stored on S3. This assumes that when you trained the model you set `exportDir` to a S3 @@ -799,16 +663,7 @@ POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{ kubectl port-forward ${POD_NAME} 8080:5000 ``` -You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [GCS](http://localhost:8080/?addr=mnist-gcs-dist) or [S3](http://localhost:8080/?addr=mnist-s3-serving). - - -### Using IAP on GCP - -If you are using GCP and have set up IAP then you can access the web UI at - -``` -https://${DEPLOYMENT}.endpoints.${PROJECT}.cloud.goog/${NAMESPACE}/mnist/ -``` +You should now be able to open up the web app on localhost. [Local Storage](http://localhost:8080) or [S3](http://localhost:8080/?addr=mnist-s3-serving). ## Conclusion and Next Steps diff --git a/mnist/k8s_util.py b/mnist/k8s_util.py new file mode 100644 index 00000000..af85fe17 --- /dev/null +++ b/mnist/k8s_util.py @@ -0,0 +1,147 @@ +"""Some utilities for working with Kubernetes. + +TODO: These should probably be replaced by functions in fairing. +""" +import logging +import re +import yaml + +from kubernetes import client as k8s_client +from kubernetes.client import rest as k8s_rest + +def camel_to_snake(name): + name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name) + return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower() + +K8S_CREATE = "K8S_CREATE" +K8S_REPLACE = "K8S_REPLACE" +K8S_CREATE_OR_REPLACE = "K8S_CREATE_OR_REPLACE" + +def _get_result_name(result): + # For custom objects the result is a dict but for other objects + # it's a Python class + if isinstance(result, dict): + result_name = result["metadata"]["name"] + result_namespace = result["metadata"]["namespace"] + else: + result_name = result.metadata.name + result_namespace = result.metadata.namespace + + return result_namespace, result_name + +def apply_k8s_specs(specs, mode=K8S_CREATE): # pylint: disable=too-many-branches,too-many-statements + """Run apply on the provided Kubernetes specs.
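+
+  Example (illustrative; assumes `configmap_yaml` is a YAML string for a
+  ConfigMap in a namespace that already exists):
+
+    apply_k8s_specs([configmap_yaml], mode=K8S_CREATE_OR_REPLACE)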
+ + Args: + specs: A list of strings or dicts providing the YAML specs to + apply. + + mode: (Optional): Mode indicates how the resources should be created. + K8S_CREATE - Use the create verb. Works with generateName + K8S_REPLACE - Issue a delete of existing resources before doing a create + K8s_CREATE_OR_REPLACE - Try to create an object; if it already exists + replace it + """ + # TODO(jlewi): How should we handle patching existing updates? + + results = [] + + if mode not in [K8S_CREATE, K8S_CREATE_OR_REPLACE, K8S_REPLACE]: + raise ValueError(f"Unknown mode {mode}") + + for s in specs: + spec = s + if not isinstance(spec, dict): + spec = yaml.load(spec) + + name = spec["metadata"]["name"] + namespace = spec["metadata"]["namespace"] + kind = spec["kind"] + kind_snake = camel_to_snake(kind) + + plural = spec["kind"].lower() + "s" + + result = None + if not "/" in spec["apiVersion"]: + group = None + + else: + group, version = spec["apiVersion"].split("/", 1) + + if group is None or group.lower() == "apps": + if group is None: + api = k8s_client.CoreV1Api() + else: + api = k8s_client.AppsV1Api() + + create_method_name = f"create_namespaced_{kind_snake}" + create_method_args = [namespace, spec] + + replace_method_name = f"delete_namespaced_{kind_snake}" + replace_method_args = [name, namespace] + + else: + api = k8s_client.CustomObjectsApi() + + create_method_name = f"create_namespaced_custom_object" + create_method = getattr(api, create_method_name) + create_method_args = [group, version, namespace, plural, spec] + + delete_options = k8s_client.V1DeleteOptions() + replace_method_name = f"delete_namespaced_custom_object" + replace_method_args = [group, version, namespace, plural, name, delete_options] + + create_method = getattr(api, create_method_name) + replace_method = getattr(api, replace_method_name) + + if mode in [K8S_CREATE, K8S_CREATE_OR_REPLACE]: + try: + result = create_method(*create_method_args) + result_namespace, result_name = _get_result_name(result) + logging.info(f"Created {kind} {result_namespace}.{result_name}") + results.append(result) + continue + except k8s_rest.ApiException as e: + # 409 is conflict indicates resource already exists + if e.status == 409 and mode == K8S_CREATE_OR_REPLACE: + pass + else: + raise + + # Using replace didn't work for virtualservices so we explicitly delete + # and then issue a create + result = replace_method(*replace_method_args) + logging.info(f"Deleted {kind} {namespace}.{name}") + + result = create_method(*create_method_args) + result_namespace, result_name = _get_result_name(result) + logging.info(f"Created {kind} {result_namespace}.{result_name}") + # Now recreate it + results.append(result) + + return results + +def get_iap_endpoint(): + """Return the URL of the IAP endpoint""" + extensions = k8s_client.ExtensionsV1beta1Api() + kf_ingress = None + + try: + kf_ingress = extensions.read_namespaced_ingress("envoy-ingress", "istio-system") + except k8s_rest.ApiException as e: + if e.status == 403: + logging.warning(f"The service account doesn't have sufficient privileges " + f"to get the istio-system ingress. " + f"You will have to manually enter the Kubeflow endpoint. 
" + f"To make this function work ask someone with cluster " + f"priveleges to create an appropriate " + f"clusterrolebinding by running a command.\n" + f"kubectl create --namespace=istio-system rolebinding " + "--clusterrole=kubeflow-view " + "--serviceaccount=$NAMESPACE}:default-editor " + "${NAMESPACE}-istio-view") + return "" + + raise + + return f"https://{kf_ingress.spec.rules[0].host}" diff --git a/mnist/mnist_gcp.ipynb b/mnist/mnist_gcp.ipynb new file mode 100644 index 00000000..37316f98 --- /dev/null +++ b/mnist/mnist_gcp.ipynb @@ -0,0 +1,1791 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# MNIST E2E on Kubeflow on GKE\n", + "\n", + "This example guides you through:\n", + " \n", + " 1. Taking an example TensorFlow model and modifying it to support distributed training\n", + " 1. Serving the resulting model using TFServing\n", + " 1. Deploying and using a web-app that uses the model\n", + " \n", + "## Requirements\n", + "\n", + " * You must be running Kubeflow 1.0 on GKE with IAP\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prepare model\n", + "\n", + "There is a delta between existing distributed mnist examples and what's needed to run well as a TFJob.\n", + "\n", + "Basically, we must:\n", + "\n", + "1. Add options in order to make the model configurable.\n", + "1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n", + "1. Define serving signatures for model serving.\n", + "\n", + "The resulting model is [model.py](model.py)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Verify we have a GCP account\n", + "\n", + "* The cell below checks that this notebook was spawned with credentials to access GCP\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import os\n", + "import uuid\n", + "from importlib import reload\n", + "from oauth2client.client import GoogleCredentials\n", + "credentials = GoogleCredentials.get_application_default()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install Required Libraries\n", + "\n", + "Import the libraries required to train this model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "pip installing requirements.txt\n", + "Checkout kubeflow/tf-operator @9238906\n", + "Configure docker credentials\n" + ] + } + ], + "source": [ + "import notebook_setup\n", + "reload(notebook_setup)\n", + "notebook_setup.notebook_setup()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "import k8s_util\n", + "# Force a reload of kubeflow; since kubeflow is a multi namespace module\n", + "# it looks like doing this in notebook_setup may not be sufficient\n", + "import kubeflow\n", + "reload(kubeflow)\n", + "from kubernetes import client as k8s_client\n", + "from kubernetes import config as k8s_config\n", + "from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n", + "from IPython.core.display import display, HTML\n", + "import yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Configure The Docker Registry For Kubeflow Fairing\n", + "\n", + "* In order to build docker images from your notebook we need a docker registry where the images will be stored\n", + "* Below you set some variables specifying a [GCR container registry](https://cloud.google.com/container-registry/docs/)\n", + "* Kubeflow Fairing provides a utility function to guess the name of your GCP project" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running in project jlewi-dev\n", + "Running in namespace kubeflow-jlewi\n", + "Using docker registry gcr.io/jlewi-dev/fairing-job\n" + ] + } + ], + "source": [ + "from kubernetes import client as k8s_client\n", + "from kubernetes.client import rest as k8s_rest\n", + "from kubeflow import fairing \n", + "from kubeflow.fairing import utils as fairing_utils\n", + "from kubeflow.fairing.builders import append\n", + "from kubeflow.fairing.deployers import job\n", + "from kubeflow.fairing.preprocessors import base as base_preprocessor\n", + "\n", + "# Setting up google container repositories (GCR) for storing output containers\n", + "# You can use any docker container registry istead of GCR\n", + "GCP_PROJECT = fairing.cloud.gcp.guess_project_name()\n", + "DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)\n", + "namespace = fairing_utils.get_current_k8s_namespace()\n", + "\n", + "logging.info(f\"Running in project {GCP_PROJECT}\")\n", + "logging.info(f\"Running in namespace {namespace}\")\n", + "logging.info(f\"Using docker registry {DOCKER_REGISTRY}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use Kubeflow fairing to build the docker image\n", + "\n", + "* You will use kubeflow fairing's kaniko builder to build a docker image that includes all your dependencies\n", + " * You use kaniko because you want to be able to run `pip` to install dependencies\n", + " * Kaniko gives you the flexibility to build images from Dockerfiles" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n", + "# Kaniko image is updated to a newer image than 0.7.0.\n", + "from kubeflow.fairing import constants\n", + "constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\"" + ] + }, + { + "cell_type": "code", 
+ "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "set()" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from kubeflow.fairing.builders import cluster\n", + "\n", + "# output_map is a map of extra files to add to the notebook.\n", + "# It is a map from source location to the location inside the context.\n", + "output_map = {\n", + " \"Dockerfile.model\": \"Dockerfile\",\n", + " \"model.py\": \"model.py\"\n", + "}\n", + "\n", + "\n", + "preprocessor = base_preprocessor.BasePreProcessor(\n", + " command=[\"python\"], # The base class will set this.\n", + " input_files=[],\n", + " path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n", + " output_map=output_map)\n", + "\n", + "preprocessor.preprocess()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Building image using cluster builder.\n", + "Creating docker context: /tmp/fairing_context_n8ikop1c\n", + "Dockerfile already exists in Fairing context, skipping...\n", + "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", + "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", + "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", + "Pod started running True\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ERROR: logging before flag.Parse: E0212 21:28:24.488770 1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value\n", + "\u001b[36mINFO\u001b[0m[0002] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n", + "\u001b[36mINFO\u001b[0m[0002] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n", + "\u001b[36mINFO\u001b[0m[0002] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", + "ERROR: logging before flag.Parse: E0212 21:28:24.983416 1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg\n", + "ERROR: logging before flag.Parse: E0212 21:28:24.989996 1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url\n", + "\u001b[36mINFO\u001b[0m[0002] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n", + "\u001b[36mINFO\u001b[0m[0002] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", + "\u001b[36mINFO\u001b[0m[0003] Built cross stage deps: map[]\n", + "\u001b[36mINFO\u001b[0m[0003] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", + "\u001b[36mINFO\u001b[0m[0003] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n", + "\u001b[36mINFO\u001b[0m[0003] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", + "\u001b[36mINFO\u001b[0m[0003] Using files from context: [/kaniko/buildcontext/model.py]\n", + "\u001b[36mINFO\u001b[0m[0003] Checking for cached layer gcr.io/jlewi-dev/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf...\n", + "\u001b[36mINFO\u001b[0m[0004] Using caching 
version of cmd: RUN chmod +x /opt/model.py\n", + "\u001b[36mINFO\u001b[0m[0004] Skipping unpacking as no commands require it.\n", + "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of full filesystem...\n", + "\u001b[36mINFO\u001b[0m[0004] Using files from context: [/kaniko/buildcontext/model.py]\n", + "\u001b[36mINFO\u001b[0m[0004] ADD model.py /opt/model.py\n", + "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of files...\n", + "\u001b[36mINFO\u001b[0m[0004] RUN chmod +x /opt/model.py\n", + "\u001b[36mINFO\u001b[0m[0004] Found cached layer, extracting to filesystem\n", + "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of files...\n", + "\u001b[36mINFO\u001b[0m[0004] ENTRYPOINT [\"/usr/bin/python\"]\n", + "\u001b[36mINFO\u001b[0m[0004] No files changed in this command, skipping snapshotting.\n", + "\u001b[36mINFO\u001b[0m[0004] CMD [\"/opt/model.py\"]\n", + "\u001b[36mINFO\u001b[0m[0004] No files changed in this command, skipping snapshotting.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Built image gcr.io/jlewi-dev/fairing-job/mnist:24327351\n" + ] + } + ], + "source": [ + "# Use a Tensorflow image as the base image\n", + "# We use a custom Dockerfile \n", + "cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n", + " base_image=\"\", # base_image is set in the Dockerfile\n", + " preprocessor=preprocessor,\n", + " image_name=\"mnist\",\n", + " dockerfile_path=\"Dockerfile\",\n", + " pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],\n", + " context_source=cluster.gcs_context.GCSContextSource())\n", + "cluster_builder.build()\n", + "logging.info(f\"Built image {cluster_builder.image_tag}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create a GCS Bucket\n", + "\n", + "* Create a GCS bucket to store our models and other results.\n", + "* Since we are running in python we use the python client libraries but you could also use the `gsutil` command line" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Bucket jlewi-dev-mnist already exists\n" + ] + } + ], + "source": [ + "from google.cloud import storage\n", + "bucket = f\"{GCP_PROJECT}-mnist\"\n", + "\n", + "client = storage.Client()\n", + "b = storage.Bucket(client=client, name=bucket)\n", + "\n", + "if not b.exists():\n", + " logging.info(f\"Creating bucket {bucket}\")\n", + " b.create()\n", + "else:\n", + " logging.info(f\"Bucket {bucket} already exists\") " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Distributed training\n", + "\n", + "* We will train the model by using TFJob to run a distributed training job" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n", + "num_ps = 1\n", + "num_workers = 2\n", + "model_dir = f\"gs://{bucket}/mnist\"\n", + "export_path = f\"gs://{bucket}/mnist/export\" \n", + "train_steps = 200\n", + "batch_size = 100\n", + "learning_rate = .01\n", + "image = cluster_builder.image_tag\n", + "\n", + "train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n", + "kind: TFJob\n", + "metadata:\n", + " name: {train_name} \n", + "spec:\n", + " tfReplicaSpecs:\n", + " Ps:\n", + " replicas: {num_ps}\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " serviceAccount: default-editor\n", + " 
containers:\n", + " - name: tensorflow\n", + " command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir={model_dir}\n", + " - --tf-export-dir={export_path}\n", + " - --tf-train-steps={train_steps}\n", + " - --tf-batch-size={batch_size}\n", + " - --tf-learning-rate={learning_rate}\n", + " image: {image}\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + " Chief:\n", + " replicas: 1\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " serviceAccount: default-editor\n", + " containers:\n", + " - name: tensorflow\n", + " command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir={model_dir}\n", + " - --tf-export-dir={export_path}\n", + " - --tf-train-steps={train_steps}\n", + " - --tf-batch-size={batch_size}\n", + " - --tf-learning-rate={learning_rate}\n", + " image: {image}\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + " Worker:\n", + " replicas: 1\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " serviceAccount: default-editor\n", + " containers:\n", + " - name: tensorflow\n", + " command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir={model_dir}\n", + " - --tf-export-dir={export_path}\n", + " - --tf-train-steps={train_steps}\n", + " - --tf-batch-size={batch_size}\n", + " - --tf-learning-rate={learning_rate}\n", + " image: {image}\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + "\"\"\" " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create the training job\n", + "\n", + "* You could write the spec to a YAML file and then do `kubectl apply -f {FILE}`\n", + "* Since you are running in jupyter you will use the TFJob client\n", + "* You will run the TFJob in a namespace created by a Kubeflow profile\n", + " * The namespace will be the same namespace you are running the notebook in\n", + " * Creating a profile ensures the namespace is provisioned with service accounts and other resources needed for Kubeflow" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "tf_job_client = tf_job_client_module.TFJobClient()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "TFJob kubeflow-jlewi.mnist-train-2c73 succeeded\n" + ] + } + ], + "source": [ + "tf_job_body = yaml.safe_load(train_spec)\n", + "tf_job = tf_job_client.create(tf_job_body, namespace=namespace) \n", + "\n", + "logging.info(f\"Created job {namespace}.{train_name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check the job\n", + "\n", + "* Above you used the python SDK for TFJob to check the status\n", + "* You can also use kubectl get the status of your job\n", + "* The job conditions will tell you whether the job is running, succeeded or failed" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "apiVersion: kubeflow.org/v1\n", + "kind: TFJob\n", + "metadata:\n", + " creationTimestamp: \"2020-02-12T21:28:31Z\"\n", + " generation: 1\n", + " name: mnist-train-2c73\n", + " namespace: kubeflow-jlewi\n", + " resourceVersion: \"1730369\"\n", + " selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-jlewi/tfjobs/mnist-train-2c73\n", + " uid: 
9e27854c-4dde-11ea-9830-42010a8e016f\n", + "spec:\n", + " tfReplicaSpecs:\n", + " Chief:\n", + " replicas: 1\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " containers:\n", + " - command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", + " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", + " - --tf-train-steps=200\n", + " - --tf-batch-size=100\n", + " - --tf-learning-rate=0.01\n", + " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", + " name: tensorflow\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + " serviceAccount: default-editor\n", + " Ps:\n", + " replicas: 1\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " containers:\n", + " - command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", + " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", + " - --tf-train-steps=200\n", + " - --tf-batch-size=100\n", + " - --tf-learning-rate=0.01\n", + " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", + " name: tensorflow\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + " serviceAccount: default-editor\n", + " Worker:\n", + " replicas: 1\n", + " template:\n", + " metadata:\n", + " annotations:\n", + " sidecar.istio.io/inject: \"false\"\n", + " spec:\n", + " containers:\n", + " - command:\n", + " - python\n", + " - /opt/model.py\n", + " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", + " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", + " - --tf-train-steps=200\n", + " - --tf-batch-size=100\n", + " - --tf-learning-rate=0.01\n", + " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", + " name: tensorflow\n", + " workingDir: /opt\n", + " restartPolicy: OnFailure\n", + " serviceAccount: default-editor\n", + "status:\n", + " completionTime: \"2020-02-12T21:28:53Z\"\n", + " conditions:\n", + " - lastTransitionTime: \"2020-02-12T21:28:31Z\"\n", + " lastUpdateTime: \"2020-02-12T21:28:31Z\"\n", + " message: TFJob mnist-train-2c73 is created.\n", + " reason: TFJobCreated\n", + " status: \"True\"\n", + " type: Created\n", + " - lastTransitionTime: \"2020-02-12T21:28:34Z\"\n", + " lastUpdateTime: \"2020-02-12T21:28:34Z\"\n", + " message: TFJob mnist-train-2c73 is running.\n", + " reason: TFJobRunning\n", + " status: \"False\"\n", + " type: Running\n", + " - lastTransitionTime: \"2020-02-12T21:28:53Z\"\n", + " lastUpdateTime: \"2020-02-12T21:28:53Z\"\n", + " message: TFJob mnist-train-2c73 successfully completed.\n", + " reason: TFJobSucceeded\n", + " status: \"True\"\n", + " type: Succeeded\n", + " replicaStatuses:\n", + " Chief:\n", + " succeeded: 1\n", + " PS:\n", + " succeeded: 1\n", + " Worker:\n", + " succeeded: 1\n", + " startTime: \"2020-02-12T21:28:32Z\"\n" + ] + } + ], + "source": [ + "!kubectl get tfjobs -o yaml {train_name}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get The Logs\n", + "\n", + "* There are two ways to get the logs for the training job\n", + "\n", + " 1. Using kubectl to fetch the pod logs\n", + " * These logs are ephemeral; they will be unavailable when the pod is garbage collected to free up resources\n", + " 1. 
Using stackdriver\n", + " \n", + " * Kubernetes logs are automatically available in stackdriver\n", + " * You can use labels to locate logs for a specific pod\n", + " * In the cell below you use labels for the training job name and process type to locate the logs for a specific pod\n", + " \n", + "* Run the cell below to get a link to stackdriver for your logs" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "Link to: chief logs" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "Link to: worker logs" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "Link to: ps logs" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from urllib.parse import urlencode\n", + "\n", + "for replica in [\"chief\", \"worker\", \"ps\"]: \n", + " logs_filter = f\"\"\"resource.type=\"k8s_container\" \n", + " labels.\"k8s-pod/tf-job-name\" = \"{train_name}\"\n", + " labels.\"k8s-pod/tf-replica-type\" = \"{replica}\" \n", + " resource.labels.container_name=\"tensorflow\" \"\"\"\n", + "\n", + " new_params = {'project': GCP_PROJECT,\n", + " # Logs for last 7 days\n", + " 'interval': 'P7D',\n", + " 'advancedFilter': logs_filter}\n", + "\n", + " query = urlencode(new_params)\n", + "\n", + " url = \"https://console.cloud.google.com/logs/viewer?\" + query\n", + "\n", + " display(HTML(f\"Link to: {replica} logs\"))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deploy TensorBoard\n", + "\n", + "* You will create a Kubernetes Deployment to run TensorBoard\n", + "* TensorBoard will be accessible behind the Kubeflow IAP endpoint" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "tb_name = \"mnist-tensorboard\"\n", + "tb_deploy = f\"\"\"apiVersion: apps/v1\n", + "kind: Deployment\n", + "metadata:\n", + " labels:\n", + " app: mnist-tensorboard\n", + " name: {tb_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " selector:\n", + " matchLabels:\n", + " app: mnist-tensorboard\n", + " template:\n", + " metadata:\n", + " labels:\n", + " app: mnist-tensorboard\n", + " version: v1\n", + " spec:\n", + " serviceAccount: default-editor\n", + " containers:\n", + " - command:\n", + " - /usr/local/bin/tensorboard\n", + " - --logdir={model_dir}\n", + " - --port=80\n", + " image: tensorflow/tensorflow:1.15.2-py3\n", + " name: tensorboard\n", + " ports:\n", + " - containerPort: 80\n", + "\"\"\"\n", + "tb_service = f\"\"\"apiVersion: v1\n", + "kind: Service\n", + "metadata:\n", + " labels:\n", + " app: mnist-tensorboard\n", + " name: {tb_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " ports:\n", + " - name: http-tb\n", + " port: 80\n", + " targetPort: 80\n", + " selector:\n", + " app: mnist-tensorboard\n", + " type: ClusterIP\n", + "\"\"\"\n", + "\n", + "tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n", + "kind: VirtualService\n", + "metadata:\n", + " name: {tb_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " gateways:\n", + " - kubeflow/kubeflow-gateway\n", + " hosts:\n", + " - '*'\n", + " http:\n", + " - match:\n", + " - uri:\n", + " prefix: /mnist/{namespace}/tensorboard/\n", + " rewrite:\n", + " uri: /\n", + " route:\n", + " - destination:\n", + " host: {tb_name}.{namespace}.svc.cluster.local\n", + " 
port:\n", + " number: 80\n", + " timeout: 300s\n", + "\"\"\"\n", + "\n", + "tb_specs = [tb_deploy, tb_service, tb_virtual_service]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/jovyan/git_kubeflow-examples/mnist/k8s_util.py:55: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", + " spec = yaml.load(spec)\n", + "Deleted Deployment kubeflow-jlewi.mnist-tensorboard\n", + "Created Deployment kubeflow-jlewi.mnist-tensorboard\n", + "Deleted Service kubeflow-jlewi.mnist-tensorboard\n", + "Created Service kubeflow-jlewi.mnist-tensorboard\n", + "Deleted VirtualService kubeflow-jlewi.mnist-tensorboard\n", + "Created VirtualService mnist-tensorboard.mnist-tensorboard\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'api_version': 'apps/v1',\n", + " 'kind': 'Deployment',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': 1,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-tensorboard'},\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-tensorboard',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731593',\n", + " 'self_link': '/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-tensorboard',\n", + " 'uid': 'e9750d8b-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'min_ready_seconds': None,\n", + " 'paused': None,\n", + " 'progress_deadline_seconds': 600,\n", + " 'replicas': 1,\n", + " 'revision_history_limit': 10,\n", + " 'selector': {'match_expressions': None,\n", + " 'match_labels': {'app': 'mnist-tensorboard'}},\n", + " 'strategy': {'rolling_update': {'max_surge': '25%',\n", + " 'max_unavailable': '25%'},\n", + " 'type': 'RollingUpdate'},\n", + " 'template': {'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': None,\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-tensorboard',\n", + " 'version': 'v1'},\n", + " 'managed_fields': None,\n", + " 'name': None,\n", + " 'namespace': None,\n", + " 'owner_references': None,\n", + " 'resource_version': None,\n", + " 'self_link': None,\n", + " 'uid': None},\n", + " 'spec': {'active_deadline_seconds': None,\n", + " 'affinity': None,\n", + " 'automount_service_account_token': None,\n", + " 'containers': [{'args': None,\n", + " 'command': ['/usr/local/bin/tensorboard',\n", + " '--logdir=gs://jlewi-dev-mnist/mnist',\n", + " '--port=80'],\n", + " 'env': None,\n", + " 'env_from': None,\n", + " 'image': 'tensorflow/tensorflow:1.15.2-py3',\n", + " 'image_pull_policy': 'IfNotPresent',\n", + " 'lifecycle': None,\n", + " 'liveness_probe': None,\n", + " 'name': 'tensorboard',\n", + " 'ports': [{'container_port': 80,\n", + " 'host_ip': None,\n", + " 'host_port': None,\n", + " 'name': None,\n", + " 'protocol': 'TCP'}],\n", + " 'readiness_probe': None,\n", + " 'resources': {'limits': None,\n", + " 'requests': None},\n", + " 'security_context': None,\n", + " 'stdin': 
None,\n", + " 'stdin_once': None,\n", + " 'termination_message_path': '/dev/termination-log',\n", + " 'termination_message_policy': 'File',\n", + " 'tty': None,\n", + " 'volume_devices': None,\n", + " 'volume_mounts': None,\n", + " 'working_dir': None}],\n", + " 'dns_config': None,\n", + " 'dns_policy': 'ClusterFirst',\n", + " 'enable_service_links': None,\n", + " 'host_aliases': None,\n", + " 'host_ipc': None,\n", + " 'host_network': None,\n", + " 'host_pid': None,\n", + " 'hostname': None,\n", + " 'image_pull_secrets': None,\n", + " 'init_containers': None,\n", + " 'node_name': None,\n", + " 'node_selector': None,\n", + " 'priority': None,\n", + " 'priority_class_name': None,\n", + " 'readiness_gates': None,\n", + " 'restart_policy': 'Always',\n", + " 'runtime_class_name': None,\n", + " 'scheduler_name': 'default-scheduler',\n", + " 'security_context': {'fs_group': None,\n", + " 'run_as_group': None,\n", + " 'run_as_non_root': None,\n", + " 'run_as_user': None,\n", + " 'se_linux_options': None,\n", + " 'supplemental_groups': None,\n", + " 'sysctls': None},\n", + " 'service_account': 'default-editor',\n", + " 'service_account_name': 'default-editor',\n", + " 'share_process_namespace': None,\n", + " 'subdomain': None,\n", + " 'termination_grace_period_seconds': 30,\n", + " 'tolerations': None,\n", + " 'volumes': None}}},\n", + " 'status': {'available_replicas': None,\n", + " 'collision_count': None,\n", + " 'conditions': None,\n", + " 'observed_generation': None,\n", + " 'ready_replicas': None,\n", + " 'replicas': None,\n", + " 'unavailable_replicas': None,\n", + " 'updated_replicas': None}}, {'api_version': 'v1',\n", + " 'kind': 'Service',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-tensorboard'},\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-tensorboard',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731608',\n", + " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-tensorboard',\n", + " 'uid': 'e98fa09f-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'cluster_ip': '10.55.245.113',\n", + " 'external_i_ps': None,\n", + " 'external_name': None,\n", + " 'external_traffic_policy': None,\n", + " 'health_check_node_port': None,\n", + " 'load_balancer_ip': None,\n", + " 'load_balancer_source_ranges': None,\n", + " 'ports': [{'name': 'http-tb',\n", + " 'node_port': None,\n", + " 'port': 80,\n", + " 'protocol': 'TCP',\n", + " 'target_port': 80}],\n", + " 'publish_not_ready_addresses': None,\n", + " 'selector': {'app': 'mnist-tensorboard'},\n", + " 'session_affinity': 'None',\n", + " 'session_affinity_config': None,\n", + " 'type': 'ClusterIP'},\n", + " 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n", + " 'kind': 'VirtualService',\n", + " 'metadata': {'creationTimestamp': '2020-02-12T21:30:38Z',\n", + " 'generation': 1,\n", + " 'name': 'mnist-tensorboard',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'resourceVersion': '1731612',\n", + " 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-jlewi/virtualservices/mnist-tensorboard',\n", + " 'uid': 'e99c4909-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': 
{'gateways': ['kubeflow/kubeflow-gateway'],\n", + " 'hosts': ['*'],\n", + " 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-jlewi/tensorboard/'}}],\n", + " 'rewrite': {'uri': '/'},\n", + " 'route': [{'destination': {'host': 'mnist-tensorboard.kubeflow-jlewi.svc.cluster.local',\n", + " 'port': {'number': 80}}}],\n", + " 'timeout': '300s'}]}}]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Access The TensorBoard UI" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "TensorBoard UI is at https://kf-v1-0210.endpoints.jlewi-dev.cloud.goog/mnist/kubeflow-jlewi/tensorboard/" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "endpoint = k8s_util.get_iap_endpoint() \n", + "if endpoint: \n", + " vs = yaml.safe_load(tb_virtual_service)\n", + " path= vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n", + " tb_endpoint = endpoint + path\n", + " display(HTML(f\"TensorBoard UI is at {tb_endpoint}\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wait For the Training Job to finish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* You can use the TFJob client to wait for it to finish." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=[\"Succeeded\", \"Failed\"], namespace=namespace)\n", + "\n", + "if tf_job_client.is_job_succeeded(train_name, namespace):\n", + " logging.info(f\"TFJob {namespace}.{train_name} succeeded\")\n", + "else:\n", + " raise ValueError(f\"TFJob {namespace}.{train_name} failed\") " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Serve the model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* Deploy the model using tensorflow serving\n", + "* We need to create\n", + " 1. A Kubernetes Deployment\n", + " 1. A Kubernetes service\n", + " 1. (Optional) Create a configmap containing the prometheus monitoring config" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "deploy_name = \"mnist-model\"\n", + "model_base_path = export_path\n", + "\n", + "# The web ui defaults to mnist-service so if you change it you will\n", + "# need to change it in the UI as well to send predictions to the mode\n", + "model_service = \"mnist-service\"\n", + "\n", + "deploy_spec = f\"\"\"apiVersion: apps/v1\n", + "kind: Deployment\n", + "metadata:\n", + " labels:\n", + " app: mnist\n", + " name: {deploy_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " selector:\n", + " matchLabels:\n", + " app: mnist-model\n", + " template:\n", + " metadata:\n", + " # TODO(jlewi): Right now we disable the istio side car because otherwise ISTIO rbac will prevent the\n", + " # UI from sending RPCs to the server. 
We should create an appropriate ISTIO rbac authorization\n", + " # policy to allow traffic from the UI to the model servier.\n", + " # https://istio.io/docs/concepts/security/#target-selectors\n", + " annotations: \n", + " sidecar.istio.io/inject: \"false\"\n", + " labels:\n", + " app: mnist-model\n", + " version: v1\n", + " spec:\n", + " serviceAccount: default-editor\n", + " containers:\n", + " - args:\n", + " - --port=9000\n", + " - --rest_api_port=8500\n", + " - --model_name=mnist\n", + " - --model_base_path={model_base_path}\n", + " - --monitoring_config_file=/var/config/monitoring_config.txt\n", + " command:\n", + " - /usr/bin/tensorflow_model_server\n", + " env:\n", + " - name: modelBasePath\n", + " value: {model_base_path}\n", + " image: tensorflow/serving:1.15.0\n", + " imagePullPolicy: IfNotPresent\n", + " livenessProbe:\n", + " initialDelaySeconds: 30\n", + " periodSeconds: 30\n", + " tcpSocket:\n", + " port: 9000\n", + " name: mnist\n", + " ports:\n", + " - containerPort: 9000\n", + " - containerPort: 8500\n", + " resources:\n", + " limits:\n", + " cpu: \"4\"\n", + " memory: 4Gi\n", + " requests:\n", + " cpu: \"1\"\n", + " memory: 1Gi\n", + " volumeMounts:\n", + " - mountPath: /var/config/\n", + " name: model-config\n", + " volumes:\n", + " - configMap:\n", + " name: {deploy_name}\n", + " name: model-config\n", + "\"\"\"\n", + "\n", + "service_spec = f\"\"\"apiVersion: v1\n", + "kind: Service\n", + "metadata:\n", + " annotations: \n", + " prometheus.io/path: /monitoring/prometheus/metrics\n", + " prometheus.io/port: \"8500\"\n", + " prometheus.io/scrape: \"true\"\n", + " labels:\n", + " app: mnist-model\n", + " name: {model_service}\n", + " namespace: {namespace}\n", + "spec:\n", + " ports:\n", + " - name: grpc-tf-serving\n", + " port: 9000\n", + " targetPort: 9000\n", + " - name: http-tf-serving\n", + " port: 8500\n", + " targetPort: 8500\n", + " selector:\n", + " app: mnist-model\n", + " type: ClusterIP\n", + "\"\"\"\n", + "\n", + "monitoring_config = f\"\"\"kind: ConfigMap\n", + "apiVersion: v1\n", + "metadata:\n", + " name: {deploy_name}\n", + " namespace: {namespace}\n", + "data:\n", + " monitoring_config.txt: |-\n", + " prometheus_config: {{\n", + " enable: true,\n", + " path: \"/monitoring/prometheus/metrics\"\n", + " }}\n", + "\"\"\"\n", + "\n", + "model_specs = [deploy_spec, service_spec, monitoring_config]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Deleted Deployment kubeflow-jlewi.mnist-model\n", + "Created Deployment kubeflow-jlewi.mnist-model\n", + "Deleted Service kubeflow-jlewi.mnist-service\n", + "Created Service kubeflow-jlewi.mnist-service\n", + "Deleted ConfigMap kubeflow-jlewi.mnist-model\n", + "Created ConfigMap kubeflow-jlewi.mnist-model\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'api_version': 'apps/v1',\n", + " 'kind': 'Deployment',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': 1,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist'},\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-model',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731617',\n", + " 'self_link': 
'/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-model',\n", + " 'uid': 'e9add65c-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'min_ready_seconds': None,\n", + " 'paused': None,\n", + " 'progress_deadline_seconds': 600,\n", + " 'replicas': 1,\n", + " 'revision_history_limit': 10,\n", + " 'selector': {'match_expressions': None,\n", + " 'match_labels': {'app': 'mnist-model'}},\n", + " 'strategy': {'rolling_update': {'max_surge': '25%',\n", + " 'max_unavailable': '25%'},\n", + " 'type': 'RollingUpdate'},\n", + " 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': None,\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-model',\n", + " 'version': 'v1'},\n", + " 'managed_fields': None,\n", + " 'name': None,\n", + " 'namespace': None,\n", + " 'owner_references': None,\n", + " 'resource_version': None,\n", + " 'self_link': None,\n", + " 'uid': None},\n", + " 'spec': {'active_deadline_seconds': None,\n", + " 'affinity': None,\n", + " 'automount_service_account_token': None,\n", + " 'containers': [{'args': ['--port=9000',\n", + " '--rest_api_port=8500',\n", + " '--model_name=mnist',\n", + " '--model_base_path=gs://jlewi-dev-mnist/mnist/export',\n", + " '--monitoring_config_file=/var/config/monitoring_config.txt'],\n", + " 'command': ['/usr/bin/tensorflow_model_server'],\n", + " 'env': [{'name': 'modelBasePath',\n", + " 'value': 'gs://jlewi-dev-mnist/mnist/export',\n", + " 'value_from': None}],\n", + " 'env_from': None,\n", + " 'image': 'tensorflow/serving:1.15.0',\n", + " 'image_pull_policy': 'IfNotPresent',\n", + " 'lifecycle': None,\n", + " 'liveness_probe': {'_exec': None,\n", + " 'failure_threshold': 3,\n", + " 'http_get': None,\n", + " 'initial_delay_seconds': 30,\n", + " 'period_seconds': 30,\n", + " 'success_threshold': 1,\n", + " 'tcp_socket': {'host': None,\n", + " 'port': 9000},\n", + " 'timeout_seconds': 1},\n", + " 'name': 'mnist',\n", + " 'ports': [{'container_port': 9000,\n", + " 'host_ip': None,\n", + " 'host_port': None,\n", + " 'name': None,\n", + " 'protocol': 'TCP'},\n", + " {'container_port': 8500,\n", + " 'host_ip': None,\n", + " 'host_port': None,\n", + " 'name': None,\n", + " 'protocol': 'TCP'}],\n", + " 'readiness_probe': None,\n", + " 'resources': {'limits': {'cpu': '4',\n", + " 'memory': '4Gi'},\n", + " 'requests': {'cpu': '1',\n", + " 'memory': '1Gi'}},\n", + " 'security_context': None,\n", + " 'stdin': None,\n", + " 'stdin_once': None,\n", + " 'termination_message_path': '/dev/termination-log',\n", + " 'termination_message_policy': 'File',\n", + " 'tty': None,\n", + " 'volume_devices': None,\n", + " 'volume_mounts': [{'mount_path': '/var/config/',\n", + " 'mount_propagation': None,\n", + " 'name': 'model-config',\n", + " 'read_only': None,\n", + " 'sub_path': None,\n", + " 'sub_path_expr': None}],\n", + " 'working_dir': None}],\n", + " 'dns_config': None,\n", + " 'dns_policy': 'ClusterFirst',\n", + " 'enable_service_links': None,\n", + " 'host_aliases': None,\n", + " 'host_ipc': None,\n", + " 'host_network': None,\n", + " 'host_pid': None,\n", + " 'hostname': None,\n", + " 'image_pull_secrets': None,\n", + " 'init_containers': None,\n", + " 'node_name': None,\n", + " 'node_selector': None,\n", + " 'priority': None,\n", + " 'priority_class_name': None,\n", + " 'readiness_gates': None,\n", + " 
'restart_policy': 'Always',\n", + " 'runtime_class_name': None,\n", + " 'scheduler_name': 'default-scheduler',\n", + " 'security_context': {'fs_group': None,\n", + " 'run_as_group': None,\n", + " 'run_as_non_root': None,\n", + " 'run_as_user': None,\n", + " 'se_linux_options': None,\n", + " 'supplemental_groups': None,\n", + " 'sysctls': None},\n", + " 'service_account': 'default-editor',\n", + " 'service_account_name': 'default-editor',\n", + " 'share_process_namespace': None,\n", + " 'subdomain': None,\n", + " 'termination_grace_period_seconds': 30,\n", + " 'tolerations': None,\n", + " 'volumes': [{'aws_elastic_block_store': None,\n", + " 'azure_disk': None,\n", + " 'azure_file': None,\n", + " 'cephfs': None,\n", + " 'cinder': None,\n", + " 'config_map': {'default_mode': 420,\n", + " 'items': None,\n", + " 'name': 'mnist-model',\n", + " 'optional': None},\n", + " 'csi': None,\n", + " 'downward_api': None,\n", + " 'empty_dir': None,\n", + " 'fc': None,\n", + " 'flex_volume': None,\n", + " 'flocker': None,\n", + " 'gce_persistent_disk': None,\n", + " 'git_repo': None,\n", + " 'glusterfs': None,\n", + " 'host_path': None,\n", + " 'iscsi': None,\n", + " 'name': 'model-config',\n", + " 'nfs': None,\n", + " 'persistent_volume_claim': None,\n", + " 'photon_persistent_disk': None,\n", + " 'portworx_volume': None,\n", + " 'projected': None,\n", + " 'quobyte': None,\n", + " 'rbd': None,\n", + " 'scale_io': None,\n", + " 'secret': None,\n", + " 'storageos': None,\n", + " 'vsphere_volume': None}]}}},\n", + " 'status': {'available_replicas': None,\n", + " 'collision_count': None,\n", + " 'conditions': None,\n", + " 'observed_generation': None,\n", + " 'ready_replicas': None,\n", + " 'replicas': None,\n", + " 'unavailable_replicas': None,\n", + " 'updated_replicas': None}}, {'api_version': 'v1',\n", + " 'kind': 'Service',\n", + " 'metadata': {'annotations': {'prometheus.io/path': '/monitoring/prometheus/metrics',\n", + " 'prometheus.io/port': '8500',\n", + " 'prometheus.io/scrape': 'true'},\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-model'},\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-service',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731639',\n", + " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-service',\n", + " 'uid': 'e9dcfd8c-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'cluster_ip': '10.55.250.62',\n", + " 'external_i_ps': None,\n", + " 'external_name': None,\n", + " 'external_traffic_policy': None,\n", + " 'health_check_node_port': None,\n", + " 'load_balancer_ip': None,\n", + " 'load_balancer_source_ranges': None,\n", + " 'ports': [{'name': 'grpc-tf-serving',\n", + " 'node_port': None,\n", + " 'port': 9000,\n", + " 'protocol': 'TCP',\n", + " 'target_port': 9000},\n", + " {'name': 'http-tf-serving',\n", + " 'node_port': None,\n", + " 'port': 8500,\n", + " 'protocol': 'TCP',\n", + " 'target_port': 8500}],\n", + " 'publish_not_ready_addresses': None,\n", + " 'selector': {'app': 'mnist-model'},\n", + " 'session_affinity': 'None',\n", + " 'session_affinity_config': None,\n", + " 'type': 'ClusterIP'},\n", + " 'status': {'load_balancer': {'ingress': None}}}, {'api_version': 'v1',\n", + " 'binary_data': 
None,\n", + " 'data': {'monitoring_config.txt': 'prometheus_config: {\\n'\n", + " ' enable: true,\\n'\n", + " ' path: \"/monitoring/prometheus/metrics\"\\n'\n", + " '}'},\n", + " 'kind': 'ConfigMap',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': None,\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-model',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731646',\n", + " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/configmaps/mnist-model',\n", + " 'uid': 'e9eeb2f4-4dde-11ea-9830-42010a8e016f'}}]" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deploy the mnist UI\n", + "\n", + "* We will now deploy the UI to visual the mnist results\n", + "* Note: This is using a prebuilt and public docker image for the UI" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "ui_name = \"mnist-ui\"\n", + "ui_deploy = f\"\"\"apiVersion: apps/v1\n", + "kind: Deployment\n", + "metadata:\n", + " name: {ui_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " replicas: 1\n", + " selector:\n", + " matchLabels:\n", + " app: mnist-web-ui\n", + " template:\n", + " metadata:\n", + " labels:\n", + " app: mnist-web-ui\n", + " spec:\n", + " containers:\n", + " - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n", + " name: web-ui\n", + " ports:\n", + " - containerPort: 5000 \n", + " serviceAccount: default-editor\n", + "\"\"\"\n", + "\n", + "ui_service = f\"\"\"apiVersion: v1\n", + "kind: Service\n", + "metadata:\n", + " annotations:\n", + " name: {ui_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " ports:\n", + " - name: http-mnist-ui\n", + " port: 80\n", + " targetPort: 5000\n", + " selector:\n", + " app: mnist-web-ui\n", + " type: ClusterIP\n", + "\"\"\"\n", + "\n", + "ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n", + "kind: VirtualService\n", + "metadata:\n", + " name: {ui_name}\n", + " namespace: {namespace}\n", + "spec:\n", + " gateways:\n", + " - kubeflow/kubeflow-gateway\n", + " hosts:\n", + " - '*'\n", + " http:\n", + " - match:\n", + " - uri:\n", + " prefix: /mnist/{namespace}/ui/\n", + " rewrite:\n", + " uri: /\n", + " route:\n", + " - destination:\n", + " host: {ui_name}.{namespace}.svc.cluster.local\n", + " port:\n", + " number: 80\n", + " timeout: 300s\n", + "\"\"\"\n", + "\n", + "ui_specs = [ui_deploy, ui_service, ui_virtual_service]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Deleted Deployment kubeflow-jlewi.mnist-ui\n", + "Created Deployment kubeflow-jlewi.mnist-ui\n", + "Deleted Service kubeflow-jlewi.mnist-ui\n", + "Created Service kubeflow-jlewi.mnist-ui\n", + "Deleted VirtualService kubeflow-jlewi.mnist-ui\n", + "Created VirtualService mnist-ui.mnist-ui\n" + ] + }, + { + "data": { + "text/plain": [ + "[{'api_version': 'apps/v1',\n", + " 'kind': 
'Deployment',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': 1,\n", + " 'initializers': None,\n", + " 'labels': None,\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-ui',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731648',\n", + " 'self_link': '/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-ui',\n", + " 'uid': 'e9f77ba8-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'min_ready_seconds': None,\n", + " 'paused': None,\n", + " 'progress_deadline_seconds': 600,\n", + " 'replicas': 1,\n", + " 'revision_history_limit': 10,\n", + " 'selector': {'match_expressions': None,\n", + " 'match_labels': {'app': 'mnist-web-ui'}},\n", + " 'strategy': {'rolling_update': {'max_surge': '25%',\n", + " 'max_unavailable': '25%'},\n", + " 'type': 'RollingUpdate'},\n", + " 'template': {'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': None,\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': {'app': 'mnist-web-ui'},\n", + " 'managed_fields': None,\n", + " 'name': None,\n", + " 'namespace': None,\n", + " 'owner_references': None,\n", + " 'resource_version': None,\n", + " 'self_link': None,\n", + " 'uid': None},\n", + " 'spec': {'active_deadline_seconds': None,\n", + " 'affinity': None,\n", + " 'automount_service_account_token': None,\n", + " 'containers': [{'args': None,\n", + " 'command': None,\n", + " 'env': None,\n", + " 'env_from': None,\n", + " 'image': 'gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225',\n", + " 'image_pull_policy': 'IfNotPresent',\n", + " 'lifecycle': None,\n", + " 'liveness_probe': None,\n", + " 'name': 'web-ui',\n", + " 'ports': [{'container_port': 5000,\n", + " 'host_ip': None,\n", + " 'host_port': None,\n", + " 'name': None,\n", + " 'protocol': 'TCP'}],\n", + " 'readiness_probe': None,\n", + " 'resources': {'limits': None,\n", + " 'requests': None},\n", + " 'security_context': None,\n", + " 'stdin': None,\n", + " 'stdin_once': None,\n", + " 'termination_message_path': '/dev/termination-log',\n", + " 'termination_message_policy': 'File',\n", + " 'tty': None,\n", + " 'volume_devices': None,\n", + " 'volume_mounts': None,\n", + " 'working_dir': None}],\n", + " 'dns_config': None,\n", + " 'dns_policy': 'ClusterFirst',\n", + " 'enable_service_links': None,\n", + " 'host_aliases': None,\n", + " 'host_ipc': None,\n", + " 'host_network': None,\n", + " 'host_pid': None,\n", + " 'hostname': None,\n", + " 'image_pull_secrets': None,\n", + " 'init_containers': None,\n", + " 'node_name': None,\n", + " 'node_selector': None,\n", + " 'priority': None,\n", + " 'priority_class_name': None,\n", + " 'readiness_gates': None,\n", + " 'restart_policy': 'Always',\n", + " 'runtime_class_name': None,\n", + " 'scheduler_name': 'default-scheduler',\n", + " 'security_context': {'fs_group': None,\n", + " 'run_as_group': None,\n", + " 'run_as_non_root': None,\n", + " 'run_as_user': None,\n", + " 'se_linux_options': None,\n", + " 'supplemental_groups': None,\n", + " 'sysctls': None},\n", + " 'service_account': 'default-editor',\n", + " 
'service_account_name': 'default-editor',\n", + " 'share_process_namespace': None,\n", + " 'subdomain': None,\n", + " 'termination_grace_period_seconds': 30,\n", + " 'tolerations': None,\n", + " 'volumes': None}}},\n", + " 'status': {'available_replicas': None,\n", + " 'collision_count': None,\n", + " 'conditions': None,\n", + " 'observed_generation': None,\n", + " 'ready_replicas': None,\n", + " 'replicas': None,\n", + " 'unavailable_replicas': None,\n", + " 'updated_replicas': None}}, {'api_version': 'v1',\n", + " 'kind': 'Service',\n", + " 'metadata': {'annotations': None,\n", + " 'cluster_name': None,\n", + " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", + " 'deletion_grace_period_seconds': None,\n", + " 'deletion_timestamp': None,\n", + " 'finalizers': None,\n", + " 'generate_name': None,\n", + " 'generation': None,\n", + " 'initializers': None,\n", + " 'labels': None,\n", + " 'managed_fields': None,\n", + " 'name': 'mnist-ui',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'owner_references': None,\n", + " 'resource_version': '1731664',\n", + " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-ui',\n", + " 'uid': 'ea12ef25-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'cluster_ip': '10.55.250.134',\n", + " 'external_i_ps': None,\n", + " 'external_name': None,\n", + " 'external_traffic_policy': None,\n", + " 'health_check_node_port': None,\n", + " 'load_balancer_ip': None,\n", + " 'load_balancer_source_ranges': None,\n", + " 'ports': [{'name': 'http-mnist-ui',\n", + " 'node_port': None,\n", + " 'port': 80,\n", + " 'protocol': 'TCP',\n", + " 'target_port': 5000}],\n", + " 'publish_not_ready_addresses': None,\n", + " 'selector': {'app': 'mnist-web-ui'},\n", + " 'session_affinity': 'None',\n", + " 'session_affinity_config': None,\n", + " 'type': 'ClusterIP'},\n", + " 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n", + " 'kind': 'VirtualService',\n", + " 'metadata': {'creationTimestamp': '2020-02-12T21:30:39Z',\n", + " 'generation': 1,\n", + " 'name': 'mnist-ui',\n", + " 'namespace': 'kubeflow-jlewi',\n", + " 'resourceVersion': '1731676',\n", + " 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-jlewi/virtualservices/mnist-ui',\n", + " 'uid': 'ea2ac046-4dde-11ea-9830-42010a8e016f'},\n", + " 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n", + " 'hosts': ['*'],\n", + " 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-jlewi/ui/'}}],\n", + " 'rewrite': {'uri': '/'},\n", + " 'route': [{'destination': {'host': 'mnist-ui.kubeflow-jlewi.svc.cluster.local',\n", + " 'port': {'number': 80}}}],\n", + " 'timeout': '300s'}]}}]" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE) \n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Access the web UI\n", + "\n", + "* A reverse proxy route is automatically added to the Kubeflow IAP endpoint\n", + "* The endpoint will be\n", + "\n", + " ```\n", + " http://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/ \n", + " ```\n", + "* You can get the KUBEFLOW_ENDPOINT by running\n", + "\n", + " ```\n", + " KUBEFLOW_ENDPOINT=`kubectl -n istio-system get ingress envoy-ingress -o jsonpath=\"{.spec.rules[0].host}\"`\n", + " ```\n", + " \n", + " * You must run this command with sufficient RBAC permissions to get the ingress.\n", + " \n", + "* If you have sufficient privileges you can run the cell below to get the endpoint. If you don't have sufficient privileges, you can \n", + " grant the appropriate permissions by running the command\n", + " \n", + " ```\n", + " kubectl create --namespace=istio-system rolebinding --clusterrole=kubeflow-view --serviceaccount=${NAMESPACE}:default-editor ${NAMESPACE}-istio-view\n", + " ```" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "mnist UI is at https://kf-v1-0210.endpoints.jlewi-dev.cloud.goog/mnist/kubeflow-jlewi/ui/" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "endpoint = k8s_util.get_iap_endpoint() \n", + "if endpoint: \n", + " vs = yaml.safe_load(ui_virtual_service)\n", + " path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n", + " ui_endpoint = endpoint + path\n", + " display(HTML(f\"mnist UI is at {ui_endpoint}\"))\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/mnist/monitoring/GCS/kustomization.yaml b/mnist/monitoring/GCS/kustomization.yaml deleted file mode 100644 index 96de4bd1..00000000 --- a/mnist/monitoring/GCS/kustomization.yaml +++ /dev/null @@ -1,8 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization - -bases: -- ../base - -configurations: -- params.yaml diff --git a/mnist/notebook_setup.py b/mnist/notebook_setup.py new file mode 100644 index 00000000..acd8950e --- /dev/null +++ b/mnist/notebook_setup.py @@ -0,0 +1,57 @@ +"""Some routines to set up the notebook. + +This is separated out from util.py because this module installs some of the pip packages +that util depends on.
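+ +notebook_setup() is meant to be called once near the top of the notebook, before +util is imported, so that the packages util needs are already installed.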
+""" + +import sys +import logging +import os +import subprocess +from importlib import reload + +from pathlib import Path + +TF_OPERATOR_COMMIT = "9238906" + +def notebook_setup(): + # Install the SDK + logging.basicConfig(format='%(message)s') + logging.getLogger().setLevel(logging.INFO) + + home = str(Path.home()) + + logging.info("pip installing requirements.txt") + subprocess.check_call(["pip3", "install", "--user", "-r", "requirements.txt"]) + + clone_dir = os.path.join(home, "git_tf-operator") + if not os.path.exists(clone_dir): + logging.info("Cloning the tf-operator repo") + subprocess.check_call(["git", "clone", "https://github.com/kubeflow/tf-operator.git", + clone_dir]) + logging.info(f"Checkout kubeflow/tf-operator @{TF_OPERATOR_COMMIT}") + subprocess.check_call(["git", "checkout", TF_OPERATOR_COMMIT], cwd=clone_dir) + + logging.info("Configure docker credentials") + subprocess.check_call(["gcloud", "auth", "configure-docker", "--quiet"]) + if os.getenv("GOOGLE_APPLICATION_CREDENTIALS"): + logging.info("Activating service account") + subprocess.check_call(["gcloud", "auth", "activate-service-account", + "--key-file=" + + os.getenv("GOOGLE_APPLICATION_CREDENTIALS"), + "--quiet"]) + # Installing the python packages locally doesn't appear to have them automatically + # added the path so we need to manually add the directory + local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages") + tf_operator_py_path = os.path.join(clone_dir, "sdk/python") + + for p in [local_py_path, tf_operator_py_path]: + if p not in sys.path: + logging.info("Adding %s to python path", p) + # Insert at front because we want to override any installed packages + sys.path.insert(0, p) + + # Force a reload of kubeflow; since kubeflow is a multi namespace module + # if we've loaded up some new kubeflow subpackages we need to force a reload to see them. + import kubeflow + reload(kubeflow) diff --git a/mnist/requirements.txt b/mnist/requirements.txt new file mode 100644 index 00000000..d02d7288 --- /dev/null +++ b/mnist/requirements.txt @@ -0,0 +1,2 @@ +git+git://github.com/kubeflow/fairing.git@9b0d4ed4796ba349ac6067bbd802ff1d6454d015 +retrying==1.3.3 \ No newline at end of file diff --git a/mnist/serving/GCS/kustomization.yaml b/mnist/serving/GCS/kustomization.yaml deleted file mode 100644 index ccb97fdb..00000000 --- a/mnist/serving/GCS/kustomization.yaml +++ /dev/null @@ -1,5 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization - -bases: -- ../base diff --git a/mnist/serving/base/mnist-deploy-config.yaml b/mnist/serving/base/mnist-deploy-config.yaml index d485d1da..844d55b5 100644 --- a/mnist/serving/base/mnist-deploy-config.yaml +++ b/mnist/serving/base/mnist-deploy-config.yaml @@ -1,11 +1,11 @@ +kind: ConfigMap apiVersion: v1 +metadata: + name: mnist-deploy-config + namespace: kubeflow data: monitoring_config.txt: |- prometheus_config: { enable: true, path: "/monitoring/prometheus/metrics" - } -kind: ConfigMap -metadata: - name: mnist-deploy-config - namespace: kubeflow + } \ No newline at end of file diff --git a/mnist/testing/conftest.py b/mnist/testing/conftest.py deleted file mode 100644 index d0e2cf15..00000000 --- a/mnist/testing/conftest.py +++ /dev/null @@ -1,116 +0,0 @@ -import os -import pytest - -def pytest_addoption(parser): - - parser.addoption( - "--tfjob_name", help="Name for the TFjob.", - type=str, default="mnist-test-" + os.getenv('BUILD_ID')) - - parser.addoption( - "--namespace", help=("The namespace to run in. 
This should correspond to" - "a namespace associated with a Kubeflow namespace."), - type=str, default="kubeflow-kf-ci-v1-user") - - parser.addoption( - "--repos", help="The repos to checkout; leave blank to use defaults", - type=str, default="") - - parser.addoption( - "--trainer_image", help="TFJob training image", - type=str, default="gcr.io/kubeflow-examples/mnist/model:build-" + os.getenv('BUILD_ID')) - - parser.addoption( - "--train_steps", help="train steps for mnist testing", - type=str, default="200") - - parser.addoption( - "--batch_size", help="batch size for mnist trainning", - type=str, default="100") - - parser.addoption( - "--learning_rate", help="mnist learnning rate", - type=str, default="0.01") - - parser.addoption( - "--num_ps", help="The number of PS", - type=str, default="1") - - parser.addoption( - "--num_workers", help="The number of Worker", - type=str, default="2") - - parser.addoption( - "--model_dir", help="Path for model saving", - type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID')) - - parser.addoption( - "--export_dir", help="Path for model exporting", - type=str, default="gs://kubeflow-ci-deployment_ci-temp/mnist/models/" + os.getenv('BUILD_ID')) - - parser.addoption( - "--deploy_name", help="Name for the service deployment", - type=str, default="mnist-test-" + os.getenv('BUILD_ID')) - - parser.addoption( - "--master", action="store", default="", help="IP address of GKE master") - - parser.addoption( - "--service", action="store", default="mnist-test-" + os.getenv('BUILD_ID'), - help="The name of the mnist K8s service") - -@pytest.fixture -def master(request): - return request.config.getoption("--master") - -@pytest.fixture -def namespace(request): - return request.config.getoption("--namespace") - -@pytest.fixture -def service(request): - return request.config.getoption("--service") - -@pytest.fixture -def tfjob_name(request): - return request.config.getoption("--tfjob_name") - -@pytest.fixture -def repos(request): - return request.config.getoption("--repos") - -@pytest.fixture -def trainer_image(request): - return request.config.getoption("--trainer_image") - -@pytest.fixture -def train_steps(request): - return request.config.getoption("--train_steps") - -@pytest.fixture -def batch_size(request): - return request.config.getoption("--batch_size") - -@pytest.fixture -def learning_rate(request): - return request.config.getoption("--learning_rate") - -@pytest.fixture -def num_ps(request): - return request.config.getoption("--num_ps") - -@pytest.fixture -def num_workers(request): - return request.config.getoption("--num_workers") - -@pytest.fixture -def model_dir(request): - return request.config.getoption("--model_dir") - -@pytest.fixture -def export_dir(request): - return request.config.getoption("--export_dir") - -@pytest.fixture -def deploy_name(request): - return request.config.getoption("--deploy_name") diff --git a/mnist/testing/deploy_test.py b/mnist/testing/deploy_test.py deleted file mode 100644 index 0cc4b063..00000000 --- a/mnist/testing/deploy_test.py +++ /dev/null @@ -1,84 +0,0 @@ -"""Test deploying the mnist model. - -This file tests that we can deploy the model. - -It is an integration test as it depends on having access to -a Kubeflow deployment to deploy on. It also depends on having a model. 
- -Python Path Requirements: - kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py - * Provides utilities for testing - -Manually running the test - pytest deploy_test.py \ - name=mnist-deploy-test-${BUILD_ID} \ - namespace=${namespace} \ - modelBasePath=${modelDir} \ - exportDir=${modelDir} \ - -""" - -import logging -import os -import pytest - -from kubernetes.config import kube_config -from kubernetes import client as k8s_client - -from kubeflow.testing import util - - -def test_deploy(record_xml_attribute, deploy_name, namespace, model_dir, export_dir): - - util.set_pytest_junit(record_xml_attribute, "test_deploy") - - util.maybe_activate_service_account() - - app_dir = os.path.join(os.path.dirname(__file__), "../serving/GCS") - app_dir = os.path.abspath(app_dir) - logging.info("--app_dir not set defaulting to: %s", app_dir) - - # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue: - # https://github.com/kubernetes-sigs/kustomize/issues/1295 - kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ - 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' - util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir) - - # TODO (@jinchihe): The kubectl need to be upgraded to 1.14.0 due to below issue. - # Invalid object doesn't have additional properties ... - kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \ - 'release/v1.14.0/bin/linux/amd64/kubectl' - util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir) - - # Configure custom parameters using kustomize - configmap = 'mnist-map-serving' - util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir) - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=name' + '=' + deploy_name], cwd=app_dir) - - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=modelBasePath=' + model_dir], cwd=app_dir) - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=exportDir=' + export_dir], cwd=app_dir) - - # Apply the components - util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir) - util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir) - - kube_config.load_kube_config() - api_client = k8s_client.ApiClient() - util.wait_for_deployment(api_client, namespace, deploy_name, timeout_minutes=4) - - # We don't delete the resources. We depend on the namespace being - # garbage collected. - -if __name__ == "__main__": - logging.basicConfig(level=logging.INFO, - format=('%(levelname)s|%(asctime)s' - '|%(pathname)s|%(lineno)d| %(message)s'), - datefmt='%Y-%m-%dT%H:%M:%S', - ) - logging.getLogger().setLevel(logging.INFO) - pytest.main() diff --git a/mnist/testing/predict_test.py b/mnist/testing/predict_test.py deleted file mode 100644 index 78b9dda5..00000000 --- a/mnist/testing/predict_test.py +++ /dev/null @@ -1,123 +0,0 @@ -"""Test mnist_client. - -This file tests that we can send predictions to the model -using REST. - -It is an integration test as it depends on having access to -a deployed model. - -We use the pytest framework because - 1. It can output results in junit format for prow/gubernator - 2. 
It has good support for configuring tests using command line arguments - (https://docs.pytest.org/en/latest/example/simple.html) - -Python Path Requirements: - kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py - * Provides utilities for testing - -Manually running the test - 1. Configure your KUBECONFIG file to point to the desired cluster -""" - -import json -import logging -import os -import subprocess -import requests -from retrying import retry -import six - -from kubernetes.config import kube_config -from kubernetes import client as k8s_client - -import pytest - -from kubeflow.testing import util - -def is_retryable_result(r): - if r.status_code == requests.codes.NOT_FOUND: - message = "Request to {0} returned 404".format(r.url) - logging.error(message) - return True - - return False - -@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, - stop_max_delay=5*60*1000, - retry_on_result=is_retryable_result) -def send_request(*args, **kwargs): - # We don't use util.run because that ends up including the access token - # in the logs - token = subprocess.check_output(["gcloud", "auth", "print-access-token"]) - if six.PY3 and hasattr(token, "decode"): - token = token.decode() - token = token.strip() - - headers = { - "Authorization": "Bearer " + token, - } - - if "headers" not in kwargs: - kwargs["headers"] = {} - - kwargs["headers"].update(headers) - - r = requests.post(*args, **kwargs) - - return r - -@pytest.mark.xfail -def test_predict(master, namespace, service): - app_credentials = os.getenv("GOOGLE_APPLICATION_CREDENTIALS") - if app_credentials: - print("Activate service account") - util.run(["gcloud", "auth", "activate-service-account", - "--key-file=" + app_credentials]) - - if not master: - print("--master set; using kubeconfig") - # util.load_kube_config appears to hang on python3 - kube_config.load_kube_config() - api_client = k8s_client.ApiClient() - host = api_client.configuration.host - print("host={0}".format(host)) - master = host.rsplit("/", 1)[-1] - - this_dir = os.path.dirname(__file__) - test_data = os.path.join(this_dir, "test_data", "instances.json") - with open(test_data) as hf: - instances = json.load(hf) - - # We proxy the request through the APIServer so that we can connect - # from outside the cluster. 
- url = ("https://{master}/api/v1/namespaces/{namespace}/services/{service}:8500" - "/proxy/v1/models/mnist:predict").format( - master=master, namespace=namespace, service=service) - logging.info("Request: %s", url) - r = send_request(url, json=instances, verify=False) - - if r.status_code != requests.codes.OK: - msg = "Request to {0} exited with status code: {1} and content: {2}".format( - url, r.status_code, r.content) - logging.error(msg) - raise RuntimeError(msg) - - content = r.content - if six.PY3 and hasattr(content, "decode"): - content = content.decode() - result = json.loads(content) - assert len(result["predictions"]) == 1 - predictions = result["predictions"][0] - assert "classes" in predictions - assert "predictions" in predictions - assert len(predictions["predictions"]) == 10 - logging.info("URL %s returned; %s", url, content) - -if __name__ == "__main__": - logging.basicConfig(level=logging.INFO, - format=('%(levelname)s|%(asctime)s' - '|%(pathname)s|%(lineno)d| %(message)s'), - datefmt='%Y-%m-%dT%H:%M:%S', - ) - logging.getLogger().setLevel(logging.INFO) - pytest.main() diff --git a/mnist/testing/test_data/instances.json b/mnist/testing/test_data/instances.json deleted file mode 100644 index 80eb3e62..00000000 --- a/mnist/testing/test_data/instances.json +++ /dev/null @@ -1,792 +0,0 @@ -{ - "instances": [ - { - "x": [ - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.011764707043766975, - 0.07058823853731155, - 0.07058823853731155, - 0.07058823853731155, - 0.4941176772117615, - 0.5333333611488342, - 0.686274528503418, - 0.10196079313755035, - 0.6509804129600525, - 1.0, - 0.9686275124549866, - 0.49803924560546875, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.11764706671237946, - 0.1411764770746231, - 0.3686274588108063, - 0.6039215922355652, - 0.6666666865348816, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.8823530077934265, - 0.6745098233222961, - 0.9921569228172302, - 0.9490196704864502, - 0.7647059559822083, - 0.250980406999588, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.19215688109397888, - 0.9333333969116211, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9843137860298157, - 0.364705890417099, - 0.32156863808631897, - 0.32156863808631897, - 0.2196078598499298, - 0.15294118225574493, - 0.0, - 
0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.07058823853731155, - 0.8588235974311829, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.7764706611633301, - 0.7137255072593689, - 0.9686275124549866, - 0.9450981020927429, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.3137255012989044, - 0.6117647290229797, - 0.41960787773132324, - 0.9921569228172302, - 0.9921569228172302, - 0.803921639919281, - 0.04313725605607033, - 0.0, - 0.16862745583057404, - 0.6039215922355652, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.05490196496248245, - 0.003921568859368563, - 0.6039215922355652, - 0.9921569228172302, - 0.3529411852359772, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.545098066329956, - 0.9921569228172302, - 0.7450980544090271, - 0.007843137718737125, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.04313725605607033, - 0.7450980544090271, - 0.9921569228172302, - 0.27450981736183167, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.13725490868091583, - 0.9450981020927429, - 0.8823530077934265, - 0.6274510025978088, - 0.4235294461250305, - 0.003921568859368563, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.3176470696926117, - 0.9411765336990356, - 0.9921569228172302, - 0.9921569228172302, - 0.46666669845581055, - 0.09803922474384308, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.1764705926179886, - 0.729411780834198, - 0.9921569228172302, - 0.9921569228172302, - 0.5882353186607361, - 0.10588236153125763, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.062745101749897, - 0.364705890417099, - 0.988235354423523, - 0.9921569228172302, - 0.7333333492279053, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.9764706492424011, - 0.9921569228172302, - 0.9764706492424011, - 0.250980406999588, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.18039216101169586, - 0.5098039507865906, - 0.7176470756530762, - 0.9921569228172302, - 0.9921569228172302, - 0.8117647767066956, - 0.007843137718737125, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.15294118225574493, - 0.5803921818733215, - 0.8980392813682556, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9803922176361084, - 0.7137255072593689, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 
0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0941176563501358, - 0.44705885648727417, - 0.8666667342185974, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.7882353663444519, - 0.30588236451148987, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.09019608050584793, - 0.25882354378700256, - 0.8352941870689392, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.7764706611633301, - 0.3176470696926117, - 0.007843137718737125, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.07058823853731155, - 0.6705882549285889, - 0.8588235974311829, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.7647059559822083, - 0.3137255012989044, - 0.03529411926865578, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.21568629145622253, - 0.6745098233222961, - 0.8862745761871338, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.9568628072738647, - 0.5215686559677124, - 0.04313725605607033, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.5333333611488342, - 0.9921569228172302, - 0.9921569228172302, - 0.9921569228172302, - 0.8313726186752319, - 0.529411792755127, - 0.5176470875740051, - 0.062745101749897, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0, - 0.0 - ] - } - ] -} \ No newline at end of file diff --git a/mnist/testing/tfjob_test.py b/mnist/testing/tfjob_test.py deleted file mode 100644 index a61bc0e3..00000000 --- a/mnist/testing/tfjob_test.py +++ /dev/null @@ -1,142 +0,0 @@ -"""Test training using TFJob. - -This file tests that we can submit the job -and that the job runs to completion. - -It is an integration test as it depends on having access to -a Kubeflow deployment to submit the TFJob to. 
- -Python Path Requirements: - kubeflow/tf-operator/py - https://github.com/kubeflow/tf-operator - * Provides utilities for testing TFJobs - kubeflow/testing/py - https://github.com/kubeflow/testing/tree/master/py - * Provides utilities for testing - -Manually running the test - pytest tfjobs_test.py \ - tfjob_name=tfjobs-test-${BUILD_ID} \ - namespace=${test_namespace} \ - trainer_image=${trainning_image} \ - train_steps=10 \ - batch_size=10 \ - learning_rate=0.01 \ - num_ps=1 \ - num_workers=2 \ - model_dir=${model_dir} \ - export_dir=${model_dir} \ -""" - -import json -import logging -import os -import pytest - -from kubernetes.config import kube_config -from kubernetes import client as k8s_client -from kubeflow.tf_operator import tf_job_client #pylint: disable=no-name-in-module - -from kubeflow.testing import util - -def test_training(record_xml_attribute, tfjob_name, namespace, trainer_image, num_ps, #pylint: disable=too-many-arguments - num_workers, train_steps, batch_size, learning_rate, model_dir, export_dir): - - util.set_pytest_junit(record_xml_attribute, "test_mnist") - - util.maybe_activate_service_account() - - app_dir = os.path.join(os.path.dirname(__file__), "../training/GCS") - app_dir = os.path.abspath(app_dir) - logging.info("--app_dir not set defaulting to: %s", app_dir) - - # TODO (@jinchihe) Using kustomize 2.0.3 to work around below issue: - # https://github.com/kubernetes-sigs/kustomize/issues/1295 - kusUrl = 'https://github.com/kubernetes-sigs/kustomize/' \ - 'releases/download/v2.0.3/kustomize_2.0.3_linux_amd64' - util.run(['wget', '-q', '-O', '/usr/local/bin/kustomize', kusUrl], cwd=app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kustomize'], cwd=app_dir) - - # TODO (@jinchihe): The kubectl need to be upgraded to 1.14.0 due to below issue. - # Invalid object doesn't have additional properties ... - kusUrl = 'https://storage.googleapis.com/kubernetes-release/' \ - 'release/v1.14.0/bin/linux/amd64/kubectl' - util.run(['wget', '-q', '-O', '/usr/local/bin/kubectl', kusUrl], cwd=app_dir) - util.run(['chmod', 'a+x', '/usr/local/bin/kubectl'], cwd=app_dir) - - # Configurate custom parameters using kustomize - util.run(['kustomize', 'edit', 'set', 'namespace', namespace], cwd=app_dir) - util.run(['kustomize', 'edit', 'set', 'image', 'training-image=' + trainer_image], cwd=app_dir) - - util.run(['../base/definition.sh', '--numPs', num_ps], cwd=app_dir) - util.run(['../base/definition.sh', '--numWorkers', num_workers], cwd=app_dir) - - trainning_config = { - "name": tfjob_name, - "trainSteps": train_steps, - "batchSize": batch_size, - "learningRate": learning_rate, - "modelDir": model_dir, - "exportDir": export_dir, - } - - configmap = 'mnist-map-training' - for key, value in trainning_config.items(): - util.run(['kustomize', 'edit', 'add', 'configmap', configmap, - '--from-literal=' + key + '=' + value], cwd=app_dir) - - # Created the TFJobs. - util.run(['kustomize', 'build', app_dir, '-o', 'generated.yaml'], cwd=app_dir) - util.run(['kubectl', 'apply', '-f', 'generated.yaml'], cwd=app_dir) - logging.info("Created job %s in namespaces %s", tfjob_name, namespace) - - kube_config.load_kube_config() - api_client = k8s_client.ApiClient() - - # Wait for the job to complete. - logging.info("Waiting for job to finish.") - results = tf_job_client.wait_for_job( - api_client, - namespace, - tfjob_name, - status_callback=tf_job_client.log_status) - logging.info("Final TFJob:\n %s", json.dumps(results, indent=2)) - - # Check for errors creating pods and services. 
Can potentially - # help debug failed test runs. - creation_failures = tf_job_client.get_creation_failures_from_tfjob( - api_client, namespace, results) - if creation_failures: - logging.warning(creation_failures) - - if not tf_job_client.job_succeeded(results): - failure = "Job {0} in namespace {1} in status {2}".format( # pylint: disable=attribute-defined-outside-init - tfjob_name, namespace, results.get("status", {})) - logging.error(failure) - - # if the TFJob failed, print out the pod logs for debugging. - pod_names = tf_job_client.get_pod_names( - api_client, namespace, tfjob_name) - logging.info("The Pods name:\n %s", pod_names) - - core_api = k8s_client.CoreV1Api(api_client) - - for pod in pod_names: - logging.info("Getting logs of Pod %s.", pod) - try: - pod_logs = core_api.read_namespaced_pod_log(pod, namespace) - logging.info("The logs of Pod %s log:\n %s", pod, pod_logs) - except k8s_client.rest.ApiException as e: - logging.info("Exception when calling CoreV1Api->read_namespaced_pod_log: %s\n", e) - return - - # We don't delete the jobs. We rely on TTLSecondsAfterFinished - # to delete old jobs. Leaving jobs around should make it - # easier to debug. - -if __name__ == "__main__": - logging.basicConfig(level=logging.INFO, - format=('%(levelname)s|%(asctime)s' - '|%(pathname)s|%(lineno)d| %(message)s'), - datefmt='%Y-%m-%dT%H:%M:%S', - ) - logging.getLogger().setLevel(logging.INFO) - pytest.main() diff --git a/mnist/training/GCS/kustomization.yaml b/mnist/training/GCS/kustomization.yaml deleted file mode 100644 index 60cf0ebd..00000000 --- a/mnist/training/GCS/kustomization.yaml +++ /dev/null @@ -1,11 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization - -bases: -- ../base - -images: -- name: training-image - newName: gcr.io/kubeflow-examples/mnist/model - newTag: build-1202842504546750464 - diff --git a/mnist/web-ui/mnist_client.py b/mnist/web-ui/mnist_client.py index 723d27c5..cea45763 100644 --- a/mnist/web-ui/mnist_client.py +++ b/mnist/web-ui/mnist_client.py @@ -27,7 +27,7 @@ from tensorflow.examples.tutorials.mnist import input_data from tensorflow_serving.apis import predict_pb2 from tensorflow_serving.apis import prediction_service_pb2 -from PIL import Image +from PIL import Image # pylint: disable=wrong-import-order def get_prediction(image, server_host='127.0.0.1', server_port=9000, diff --git a/prow_config.yaml b/prow_config.yaml index 7c6e8a57..92a9ffc2 100644 --- a/prow_config.yaml +++ b/prow_config.yaml @@ -55,6 +55,7 @@ workflows: - postsubmit include_dirs: - xgboost_synthetic/* + - mnist/* - py/kubeflow/examples/create_e2e_workflow.py # E2E test for various notebooks @@ -67,17 +68,7 @@ workflows: - postsubmit include_dirs: - xgboost_synthetic/* + - mnist/* - py/kubeflow/examples/create_e2e_workflow.py kwargs: cluster_pattern: kf-v1-(?!n\d\d) - - # E2E test for mnist example - - py_func: kubeflow.examples.create_e2e_workflow.create_workflow - name: mnist - job_types: - - periodic - - presubmit - - postsubmit - include_dirs: - - mnist/* - - py/kubeflow/examples/create_e2e_workflow.py diff --git a/py/kubeflow/examples/create_e2e_workflow.py b/py/kubeflow/examples/create_e2e_workflow.py index 1b13bbd5..0bc31672 100644 --- a/py/kubeflow/examples/create_e2e_workflow.py +++ b/py/kubeflow/examples/create_e2e_workflow.py @@ -261,82 +261,23 @@ class Builder: "xgboost_synthetic", "testing") - def _build_tests_dag_mnist(self): - """Build the dag for the set of tests to run mnist TFJob tests.""" - - task_template = self._build_task_template() - # 
*************************************************************************** - # Build mnist image - step_name = "build-image" - train_image_base = "gcr.io/kubeflow-examples/mnist" - train_image_tag = "build-" + PROW_DICT['BUILD_ID'] - command = ["/bin/bash", - "-c", - "gcloud auth activate-service-account --key-file=$(GOOGLE_APPLICATION_CREDENTIALS) \ - && make build-gcb IMG=" + train_image_base + " TAG=" + train_image_tag, - ] + # Test mnist + step_name = "mnist" + command = ["pytest", "mnist_gcp_test.py", + # Increase the log level so that info level log statements show up. + "--log-cli-level=info", + "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'", + # Test timeout in seconds. + "--timeout=1800", + "--junitxml=" + self.artifacts_dir + "/junit_mnist-gcp-test.xml", + ] + dependencies = [] - build_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, - command, dependencies) - build_step["container"]["workingDir"] = os.path.join(self.src_dir, "mnist") - - # *************************************************************************** - # Test mnist TFJob - step_name = "tfjob-test" - # Using python2 to run the test to avoid dependency error. - command = ["python2", "-m", "pytest", "tfjob_test.py", - # Increase the log level so that info level log statements show up. - "--log-cli-level=info", - "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'", - # Test timeout in seconds. - "--timeout=1800", - "--junitxml=" + self.artifacts_dir + "/junit_tfjob-test.xml", - ] - - dependencies = [build_step['name']] - tfjob_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, - command, dependencies) - tfjob_step["container"]["workingDir"] = os.path.join(self.src_dir, - "mnist", - "testing") - - # *************************************************************************** - # Test mnist deploy - step_name = "deploy-test" - command = ["python2", "-m", "pytest", "deploy_test.py", - # Increase the log level so that info level log statements show up. - "--log-cli-level=info", - "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'", - # Test timeout in seconds. - "--timeout=1800", - "--junitxml=" + self.artifacts_dir + "/junit_deploy-test.xml", - ] - - dependencies = [tfjob_step["name"]] - deploy_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, - command, dependencies) - deploy_step["container"]["workingDir"] = os.path.join(self.src_dir, - "mnist", - "testing") - # *************************************************************************** - # Test mnist predict - step_name = "predict-test" - command = ["pytest", "predict_test.py", - # Increase the log level so that info level log statements show up. - "--log-cli-level=info", - "--log-cli-format='%(levelname)s|%(asctime)s|%(pathname)s|%(lineno)d| %(message)s'", - # Test timeout in seconds. 
- "--timeout=1800", - "--junitxml=" + self.artifacts_dir + "/junit_predict-test.xml", - ] - - dependencies = [deploy_step["name"]] - predict_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, - command, dependencies) - predict_step["container"]["workingDir"] = os.path.join(self.src_dir, - "mnist", - "testing") + mnist_step = self._build_step(step_name, self.workflow, TESTS_DAG_NAME, task_template, + command, dependencies) + mnist_step["container"]["workingDir"] = os.path.join( + self.src_dir, "py/kubeflow/examples/notebook_tests") def _build_exit_dag(self): """Build the exit handler dag""" @@ -432,8 +373,6 @@ class Builder: # Run a dag of tests if self.test_target_name.startswith("notebooks"): self._build_tests_dag_notebooks() - elif self.test_target_name == "mnist": - self._build_tests_dag_mnist() else: raise RuntimeError('Invalid test_target_name ' + self.test_target_name) diff --git a/py/kubeflow/examples/notebook_tests/conftest.py b/py/kubeflow/examples/notebook_tests/conftest.py new file mode 100644 index 00000000..17bb294c --- /dev/null +++ b/py/kubeflow/examples/notebook_tests/conftest.py @@ -0,0 +1,34 @@ +import pytest + +def pytest_addoption(parser): + parser.addoption( + "--name", help="Name for the job. If not specified one was created " + "automatically", type=str, default="") + parser.addoption( + "--namespace", help=("The namespace to run in. This should correspond to" + "a namespace associated with a Kubeflow namespace."), + type=str, + default="kubeflow-kf-ci-v1-user") + parser.addoption( + "--image", help="Notebook image to use", type=str, + default="gcr.io/kubeflow-images-public/" + "tensorflow-1.15.2-notebook-cpu:1.0.0") + parser.addoption( + "--repos", help="The repos to checkout; leave blank to use defaults", + type=str, default="") + +@pytest.fixture +def name(request): + return request.config.getoption("--name") + +@pytest.fixture +def namespace(request): + return request.config.getoption("--namespace") + +@pytest.fixture +def image(request): + return request.config.getoption("--image") + +@pytest.fixture +def repos(request): + return request.config.getoption("--repos") diff --git a/py/kubeflow/examples/notebook_tests/execute_notebook.py b/py/kubeflow/examples/notebook_tests/execute_notebook.py new file mode 100644 index 00000000..53ee772c --- /dev/null +++ b/py/kubeflow/examples/notebook_tests/execute_notebook.py @@ -0,0 +1,58 @@ +import fire +import tempfile +import logging +import os +import subprocess + +logger = logging.getLogger(__name__) + +def prepare_env(): + subprocess.check_call(["pip3", "install", "-U", "papermill"]) + subprocess.check_call(["pip3", "install", "-r", "../requirements.txt"]) + + +def execute_notebook(notebook_path, parameters=None): + import papermill #pylint: disable=import-error + temp_dir = tempfile.mkdtemp() + notebook_output_path = os.path.join(temp_dir, "out.ipynb") + papermill.execute_notebook(notebook_path, notebook_output_path, + cwd=os.path.dirname(notebook_path), + parameters=parameters, + log_output=True) + return notebook_output_path + +def run_notebook_test(notebook_path, expected_messages, parameters=None): + output_path = execute_notebook(notebook_path, parameters=parameters) + actual_output = open(output_path, 'r').read() + for expected_message in expected_messages: + if not expected_message in actual_output: + logger.error(actual_output) + assert False, "Unable to find from output: " + expected_message + +class NotebookExecutor: + @staticmethod + def test(notebook_path): + """Test a notebook. 
+ + Args: + notebook_path: Absolute path of the notebook. + """ + prepare_env() + FILE_DIR = os.path.dirname(__file__) + + EXPECTED_MGS = [ + "Finished upload of", + "Model export success: mockup-model.dat", + "Pod started running True", + "Cluster endpoint: http:", + ] + run_notebook_test(notebook_path, EXPECTED_MGS) + +if __name__ == "__main__": + logging.basicConfig(level=logging.INFO, + format=('%(levelname)s|%(asctime)s' + '|%(message)s|%(pathname)s|%(lineno)d|'), + datefmt='%Y-%m-%dT%H:%M:%S', + ) + + fire.Fire(NotebookExecutor) diff --git a/py/kubeflow/examples/notebook_tests/job.yaml b/py/kubeflow/examples/notebook_tests/job.yaml new file mode 100644 index 00000000..4b1b284f --- /dev/null +++ b/py/kubeflow/examples/notebook_tests/job.yaml @@ -0,0 +1,51 @@ +# A batch job to run a notebook using papermill. +# TODO(jlewi): We should switch to using Tekton +apiVersion: batch/v1 +kind: Job +metadata: + name: nb-test + labels: + app: nb-test +spec: + backoffLimit: 1 + template: + metadata: + annotations: + # TODO(jlewi): Do we really want to disable sidecar injection + # in the test? Would it be better to use istio to mimic what happens + # in notebooks? + sidecar.istio.io/inject: "false" + labels: + app: nb-test + spec: + restartPolicy: Never + securityContext: + runAsUser: 0 + initContainers: + # This init container checks out the source code. + - command: + - /usr/local/bin/checkout_repos.sh + - --repos=kubeflow/examples@$(CHECK_TAG) + - --src_dir=/src + name: checkout + image: gcr.io/kubeflow-ci/test-worker:v20190802-c6f9140-e3b0c4 + volumeMounts: + - mountPath: /src + name: src + containers: + - env: + - name: PYTHONPATH + value: /src/kubeflow/examples/py/ + - name: executing-notebooks + image: execute-image + command: ["python3", "-m", + "kubeflow.examples.notebook_tests.execute_notebook", + "test", "/src/kubeflow/examples/mnist/mnist_gcp.ipynb"] + workingDir: /src/kubeflow/examples/py/kubeflow/examples/notebook_tests + volumeMounts: + - mountPath: /src + name: src + serviceAccount: default-editor + volumes: + - name: src + emptyDir: {} diff --git a/py/kubeflow/examples/notebook_tests/mnist_gcp_test.py b/py/kubeflow/examples/notebook_tests/mnist_gcp_test.py new file mode 100644 index 00000000..65f86fe2 --- /dev/null +++ b/py/kubeflow/examples/notebook_tests/mnist_gcp_test.py @@ -0,0 +1,29 @@ +import logging +import os + +import pytest + +from kubeflow.examples.notebook_tests import nb_test_util +from kubeflow.testing import util + +# TODO(jlewi): This test is new; there's some work to be done to make it +# reliable. 
So for now we mark it as expected to fail in presubmits. +# We only mark it as expected to fail on presubmits because expected failures +# don't show up in testgrid and we still want signal in postsubmits and periodics. +@pytest.mark.xfail(os.getenv("JOB_TYPE") == "presubmit", reason="Flaky") +def test_mnist_gcp(record_xml_attribute, name, namespace, # pylint: disable=too-many-branches,too-many-statements + repos, image): + '''Generate the Job and submit it.''' + util.set_pytest_junit(record_xml_attribute, "test_mnist_gcp") + nb_test_util.run_papermill_job(name, namespace, repos, image) + + +if __name__ == "__main__": + logging.basicConfig(level=logging.INFO, + format=('%(levelname)s|%(asctime)s' + '|%(pathname)s|%(lineno)d| %(message)s'), + datefmt='%Y-%m-%dT%H:%M:%S', + ) + logging.getLogger().setLevel(logging.INFO) + pytest.main() diff --git a/py/kubeflow/examples/notebook_tests/nb_test_util.py b/py/kubeflow/examples/notebook_tests/nb_test_util.py new file mode 100644 index 00000000..e61f4755 --- /dev/null +++ b/py/kubeflow/examples/notebook_tests/nb_test_util.py @@ -0,0 +1,80 @@ +"""Some utilities for running notebook tests.""" + +import datetime +import logging +import uuid +import yaml + +from kubernetes import client as k8s_client +from kubeflow.testing import argo_build_util +from kubeflow.testing import util + +def run_papermill_job(name, namespace, # pylint: disable=too-many-branches,too-many-statements + repos, image): + """Generate a K8s Job to run a notebook using papermill. + + Args: + name: Name for the K8s Job. + namespace: The namespace where the Job should run. + repos: (Optional) Which repos to check out; if not specified, tries + to infer them from the Prow environment variables. + image: The docker image to run the notebook test in. + """ + + util.maybe_activate_service_account() + + with open("job.yaml") as hf: + job = yaml.load(hf) + + # We need to check out the correct version of the code + # in presubmits and postsubmits, so we check the prow environment variables + # to get the appropriate values. We should probably also only do that if + # the prow environment variables are actually set. + # See + # https://github.com/kubernetes/test-infra/blob/45246b09ed105698aa8fb928b7736d14480def29/prow/jobs.md#job-environment-variables + if not repos: + repos = argo_build_util.get_repo_from_prow_env() + + logging.info("Repos set to %s", repos) + job["spec"]["template"]["spec"]["initContainers"][0]["command"] = [ + "/usr/local/bin/checkout_repos.sh", + "--repos=" + repos, + "--src_dir=/src", + "--depth=all", + ] + job["spec"]["template"]["spec"]["containers"][0]["image"] = image + util.load_kube_config(persist_config=False) + + if name: + job["metadata"]["name"] = name + else: + job["metadata"]["name"] = ("xgboost-test-" + + datetime.datetime.now().strftime("%H%M%S") + + "-" + uuid.uuid4().hex[0:3]) + name = job["metadata"]["name"] + + job["metadata"]["namespace"] = namespace + + # Create an API client object to talk to the K8s master. 
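+ # The BatchV1Api client creates the Job; util.wait_for_job then waits up to + # 30 minutes for the Job's conditions to report Complete, and an error is + # raised if the Job did not complete successfully.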
+  api_client = k8s_client.ApiClient()
+  batch_api = k8s_client.BatchV1Api(api_client)
+
+  logging.info("Creating job:\n%s", yaml.dump(job))
+  actual_job = batch_api.create_namespaced_job(job["metadata"]["namespace"],
+                                               job)
+  logging.info("Created job %s.%s:\n%s", namespace, name,
+               yaml.safe_dump(actual_job.to_dict()))
+
+  final_job = util.wait_for_job(api_client, namespace, name,
+                                timeout=datetime.timedelta(minutes=30))
+
+  logging.info("Final job:\n%s", yaml.safe_dump(final_job.to_dict()))
+
+  if not final_job.status.conditions:
+    raise RuntimeError("Job {0}.{1} did not complete".format(namespace, name))
+
+  last_condition = final_job.status.conditions[-1]
+
+  if last_condition.type not in ["Complete"]:
+    logging.error("Job didn't complete successfully")
+    raise RuntimeError("Job {0}.{1} failed".format(namespace, name))
diff --git a/test/workflows/components/workflows.libsonnet b/test/workflows/components/workflows.libsonnet
new file mode 100644
index 00000000..39cc5e1e
--- /dev/null
+++ b/test/workflows/components/workflows.libsonnet
@@ -0,0 +1,252 @@
+{
+  // TODO(https://github.com/ksonnet/ksonnet/issues/222): Taking namespace as an argument is a workaround for the fact that ksonnet
+  // doesn't support automatically piping in the namespace from the environment to prototypes.
+
+  // Convert a list of two items into a map representing an environment variable.
+  // TODO(jlewi): Should we move this into kubeflow/core/util.libsonnet
+  listToMap:: function(v)
+    {
+      name: v[0],
+      value: v[1],
+    },
+
+  // Function to turn a comma separated list of prow environment variables into a list of name/value maps.
+  parseEnv:: function(v)
+    local pieces = std.split(v, ",");
+    if v != "" && std.length(pieces) > 0 then
+      std.map(
+        function(i) $.listToMap(std.split(i, "=")),
+        std.split(v, ",")
+      )
+    else [],
+
+  parts(namespace, name):: {
+    // Workflow to run the e2e test.
+    e2e(prow_env, bucket):
+      // mountPath is the directory where the volume to store the test data
+      // should be mounted.
+      local mountPath = "/mnt/" + "test-data-volume";
+      // testDir is the root directory for all data for a particular test run.
+      local testDir = mountPath + "/" + name;
+      // outputDir is the directory to sync to GCS to contain the output for this job.
+      local outputDir = testDir + "/output";
+      local artifactsDir = outputDir + "/artifacts";
+      local goDir = testDir + "/go";
+      // Source directory where all repos should be checked out
+      local srcRootDir = testDir + "/src";
+      // The directory containing the kubeflow/examples repo
+      local srcDir = srcRootDir + "/kubeflow/examples";
+      local image = "gcr.io/kubeflow-ci/test-worker";
+      // The name of the NFS volume claim to use for test files.
+      // local nfsVolumeClaim = "kubeflow-testing";
+      local nfsVolumeClaim = "nfs-external";
+      // The name to use for the volume to use to contain test data.
+      local dataVolume = "kubeflow-test-volume";
+      local versionTag = name;
+      // The directory within the kubeflow_testing submodule containing
+      // py scripts to use.
+      local kubeflowExamplesPy = srcDir;
+      local kubeflowTestingPy = srcRootDir + "/kubeflow/testing/py";
+
+      local project = "kubeflow-ci";
+      // GKE cluster to use
+      // We need to truncate the cluster to no more than 40 characters because
+      // cluster names can be a max of 40 characters.
+      // We expect the suffix of the cluster name to be a unique salt.
+      // We prepend a z because the cluster name must start with an alphanumeric character,
+      // and if we just cut the prefix we might end up starting with "-" or another
+      // character that is invalid as the first character.
+      local cluster =
+        if std.length(name) > 40 then
+          "z" + std.substr(name, std.length(name) - 39, 39)
+        else
+          name;
+      local zone = "us-east1-d";
+      local chart = srcDir + "/bin/examples-chart-0.2.1-" + versionTag + ".tgz";
+      {
+        // Build an Argo template to execute a particular command.
+        // step_name: Name for the template
+        // command: List to pass as the container command.
+        buildTemplate(step_name, command):: {
+          name: step_name,
+          container: {
+            command: command,
+            image: image,
+            workingDir: srcDir,
+            env: [
+              {
+                // Add the source directories to the python path.
+                name: "PYTHONPATH",
+                value: kubeflowExamplesPy + ":" + kubeflowTestingPy,
+              },
+              {
+                // Set the GOPATH
+                name: "GOPATH",
+                value: goDir,
+              },
+              {
+                name: "GOOGLE_APPLICATION_CREDENTIALS",
+                value: "/secret/gcp-credentials/key.json",
+              },
+              {
+                name: "GIT_TOKEN",
+                valueFrom: {
+                  secretKeyRef: {
+                    name: "github-token",
+                    key: "github_token",
+                  },
+                },
+              },
+              {
+                name: "EXTRA_REPOS",
+                value: "kubeflow/testing@HEAD",
+              },
+            ] + prow_env,
+            volumeMounts: [
+              {
+                name: dataVolume,
+                mountPath: mountPath,
+              },
+              {
+                name: "github-token",
+                mountPath: "/secret/github-token",
+              },
+              {
+                name: "gcp-credentials",
+                mountPath: "/secret/gcp-credentials",
+              },
+            ],
+          },
+        }, // buildTemplate
+
+        apiVersion: "argoproj.io/v1alpha1",
+        kind: "Workflow",
+        metadata: {
+          name: name,
+          namespace: namespace,
+        },
+        // TODO(jlewi): Use OnExit to run cleanup steps.
+        spec: {
+          entrypoint: "e2e",
+          volumes: [
+            {
+              name: "github-token",
+              secret: {
+                secretName: "github-token",
+              },
+            },
+            {
+              name: "gcp-credentials",
+              secret: {
+                secretName: "kubeflow-testing-credentials",
+              },
+            },
+            {
+              name: dataVolume,
+              persistentVolumeClaim: {
+                claimName: nfsVolumeClaim,
+              },
+            },
+          ], // volumes
+          // onExit specifies the template that should always run when the workflow completes.
+          onExit: "exit-handler",
+          templates: [
+            {
+              name: "e2e",
+              steps: [
+                [{
+                  name: "checkout",
+                  template: "checkout",
+                }],
+                [
+                  {
+                    name: "create-pr-symlink",
+                    template: "create-pr-symlink",
+                  },
+                  // test_py_checks runs all py files matching "_test.py"
+                  // This is currently commented out because the only matching tests
+                  // are manual tests for some of the examples and/or they require
+                  // dependencies (e.g. tensorflow) not in the generic test worker image.
+                  //
+                  //
+                  // test_py_checks doesn't have options to exclude specific directories.
+                  // Since there are no other tests we just comment it out.
+                  //
+                  // TODO(https://github.com/kubeflow/testing/issues/240): Modify py_test
+                  // so we can exclude specific files.
+                  //
+                  // {
+                  //   name: "py-test",
+                  //   template: "py-test",
+                  // },
+                  // {
+                  //   name: "py-lint",
+                  //   template: "py-lint",
+                  // },
+                ],
+              ],
+            },
+            {
+              name: "exit-handler",
+              steps: [
+                [{
+                  name: "copy-artifacts",
+                  template: "copy-artifacts",
+                }],
+              ],
+            },
+            {
+              name: "checkout",
+              container: {
+                command: [
+                  "/usr/local/bin/checkout.sh",
+                  srcRootDir,
+                ],
+                env: prow_env + [{
+                  name: "EXTRA_REPOS",
+                  value: "kubeflow/testing@HEAD",
+                }],
+                image: image,
+                volumeMounts: [
+                  {
+                    name: dataVolume,
+                    mountPath: mountPath,
+                  },
+                ],
+              },
+            }, // checkout
+            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-test", [
+              "python",
+              "-m",
+              "kubeflow.testing.test_py_checks",
+              "--artifacts_dir=" + artifactsDir,
+              "--src_dir=" + srcDir,
+            ]), // py test
+            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("py-lint", [
+              "python",
+              "-m",
+              "kubeflow.testing.test_py_lint",
+              "--artifacts_dir=" + artifactsDir,
+              "--src_dir=" + srcDir,
+            ]), // py lint
+            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("create-pr-symlink", [
+              "python",
+              "-m",
+              "kubeflow.testing.prow_artifacts",
+              "--artifacts_dir=" + outputDir,
+              "create_pr_symlink",
+              "--bucket=" + bucket,
+            ]), // create-pr-symlink
+            $.parts(namespace, name).e2e(prow_env, bucket).buildTemplate("copy-artifacts", [
+              "python",
+              "-m",
+              "kubeflow.testing.prow_artifacts",
+              "--artifacts_dir=" + outputDir,
+              "copy_artifacts",
+              "--bucket=" + bucket,
+            ]), // copy-artifacts
+          ], // templates
+        },
+      }, // e2e
+  }, // parts
+}
diff --git a/xgboost_synthetic/build-train-deploy.ipynb b/xgboost_synthetic/build-train-deploy.ipynb
index 7560a7a1..608d9c52 100644
--- a/xgboost_synthetic/build-train-deploy.ipynb
+++ b/xgboost_synthetic/build-train-deploy.ipynb
@@ -1257,7 +1257,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.5rc1"
+   "version": "3.6.9"
   }
  },
  "nbformat": 4,
diff --git a/xgboost_synthetic/testing/README.md b/xgboost_synthetic/testing/README.md
new file mode 100644
index 00000000..cc624718
--- /dev/null
+++ b/xgboost_synthetic/testing/README.md
@@ -0,0 +1 @@
+TODO: We should reuse/share logic in py/kubeflow/examples/notebook_tests
\ No newline at end of file
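
Taken together, the new notebook-test pieces work like this: `mnist_gcp_test.py` calls `nb_test_util.run_papermill_job`, which loads `job.yaml`, patches in the checkout command and the notebook image, submits the Job, and waits for it to complete; inside the Job, `execute_notebook.py` runs the notebook and checks its output. The snippet below is only an illustrative sketch of calling the new helper by hand; the namespace and image values are placeholders that are not defined anywhere in this change, and it has to be run from the directory containing `job.yaml` with working cluster credentials.

```python
# Hypothetical, hand-run invocation of the helper added in this change.
# Placeholder values are marked; nothing here is prescribed by the change itself.
from kubeflow.examples.notebook_tests import nb_test_util

nb_test_util.run_papermill_job(
    name="",                          # empty -> a unique name is generated
    namespace="kubeflow-test-infra",  # placeholder namespace
    repos="kubeflow/examples@HEAD",   # skip inferring repos from Prow env vars
    image="gcr.io/kubeflow-ci/notebook-test:latest",  # placeholder image
)
```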
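`execute_notebook.py` relies on helpers (`prepare_env`, `run_notebook_test`) whose bodies are outside this excerpt. Purely as an illustration of the pattern, and assuming a papermill-based implementation, a minimal checker along those lines could look like the sketch below; the function name `run_notebook_and_check` and all of its details are assumptions, not code from this change.

```python
# Illustrative sketch only: execute a notebook with papermill and assert that
# the expected log lines appear somewhere in the executed cells' output.
import nbformat
import papermill as pm


def run_notebook_and_check(notebook_path, expected_msgs,
                           output_path="executed.ipynb"):
  # Run the notebook top to bottom; papermill writes the executed copy,
  # including cell outputs, to output_path.
  pm.execute_notebook(notebook_path, output_path, log_output=True)

  # Gather all text emitted by code cells in the executed notebook.
  nb = nbformat.read(output_path, as_version=4)
  chunks = []
  for cell in nb.cells:
    if cell.cell_type != "code":
      continue
    for output in cell.get("outputs", []):
      if "text" in output:
        chunks.append("".join(output["text"]))
  text = "\n".join(chunks)

  # Fail loudly if any expected message is missing.
  missing = [m for m in expected_msgs if m not in text]
  if missing:
    raise RuntimeError("Expected messages not found: {0}".format(missing))
```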