mirror of https://github.com/kubeflow/examples.git
| {
 | |
|  "cells": [
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "# MNIST E2E on Kubeflow on IBM Cloud Kubernetes Service\n",
 | |
|     "\n",
 | |
|     "This example guides you through:\n",
 | |
|     "  \n",
 | |
|     "  1. Taking an example TensorFlow model and modifying it to support distributed training\n",
 | |
|     "  1. Serving the resulting model using TFServing\n",
 | |
|     "  1. Deploying and using a web-app that uses the model\n",
 | |
|     "  \n",
 | |
|     "## Requirements\n",
 | |
|     "\n",
 | |
|     "  * You must be [running Kubeflow 1.0 on IBM Cloud Kubernetes Service](https://www.kubeflow.org/docs/ibm/install-kubeflow/).\n",
 | |
|     " "
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Required Services and Credentials\n",
 | |
|     "\n",
 | |
|     "Before proceeding to the next steps, we first need to provision the necessary IBM Services and input the credentials below.\n",
 | |
|     "\n",
 | |
|     "IBM Cloud Object Storage (COS): https://cloud.ibm.com/catalog/services/cloud-object-storage"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "**Tip**: follow the steps below to access your COS instance dashboard. From the [IBM Cloud dashboard](https://cloud.ibm.com/resources):\n",
 | |
|     "\n",
 | |
|     "- Click the **Storage** tab\n",
 | |
|     "- Select and click your target object storage (COS)\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "**Create new credentials with HMAC**:\n",
 | |
|     "\n",
 | |
|     "  - Go to your COS dashboard (see the above **Tip**).\n",
 | |
|     "  - In the **Service credentials** tab, click **New Credential+**.\n",
 | |
|     "  - In the **Add Inline Configuration Parameters (Optional)** box, add `{\"HMAC\":true}`.\n",
 | |
|     "  - Click **Add**. (For more information, see HMAC.)\n",
 | |
|     "  \n",
 | |
|     "**Replace** the information in the following cell with your COS credentials.\n",
 | |
|     "\n",
 | |
|     "You can find these credentials in your COS instance dashboard under the **Service credentials** tab."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "cos_credentials = {\n",
 | |
|     "  \"apikey\": \"-------\",\n",
 | |
|     "  \"cos_hmac_keys\": {\n",
 | |
|     "    \"access_key_id\": \"------\",\n",
 | |
|     "    \"secret_access_key\": \"------\"\n",
 | |
|     "  },\n",
 | |
|     "  \"endpoints\": \"https://cos-service.bluemix.net/endpoints\",\n",
 | |
|     "  \"iam_apikey_description\": \"------\",\n",
 | |
|     "  \"iam_apikey_name\": \"------\",\n",
 | |
|     "  \"iam_role_crn\": \"------\",\n",
 | |
|     "  \"iam_serviceid_crn\": \"------\",\n",
 | |
|     "  \"resource_instance_id\": \"-------\"\n",
 | |
|     "}"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Define the endpoint.\n",
 | |
|     "\n",
 | |
|     "To do this, go to the **Endpoint** tab in the COS instance's dashboard to get the endpoint information, then enter it in the `service_endpoint` cell below."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "service_endpoint = 's3.us.cloud-object-storage.appdomain.cloud'\n",
 | |
|     "service_endpoint_with_https=\"https://\" + service_endpoint"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Prepare model\n",
 | |
|     "\n",
 | |
|     "The existing distributed MNIST examples need a few changes before they run well as a TFJob.\n",
 | |
|     "\n",
 | |
|     "Basically, we must:\n",
 | |
|     "\n",
 | |
|     "1. Add options in order to make the model configurable.\n",
 | |
|     "1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n",
 | |
|     "1. Define serving signatures for model serving.\n",
 | |
|     "\n",
 | |
|     "The resulting model is [model.py](model.py)."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Install Required Libraries\n",
 | |
|     "\n",
 | |
|     "Import the libraries required to train this model."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "import logging\n",
 | |
|     "import os\n",
 | |
|     "import uuid\n",
 | |
|     "from importlib import reload\n",
 | |
|     "import notebook_setup\n",
 | |
|     "reload(notebook_setup)\n",
 | |
|     "notebook_setup.notebook_setup(platform=None)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "import k8s_util\n",
 | |
|     "# Force a reload of kubeflow; since kubeflow is a multi namespace module\n",
 | |
|     "# it looks like doing this in notebook_setup may not be sufficient\n",
 | |
|     "import kubeflow\n",
 | |
|     "reload(kubeflow)\n",
 | |
|     "from kubernetes import client as k8s_client\n",
 | |
|     "from kubernetes import config as k8s_config\n",
 | |
|     "from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n",
 | |
|     "from IPython.core.display import display, HTML\n",
 | |
|     "import yaml"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Configure docker credentials\n",
 | |
|     "\n",
 | |
|     "Get your docker registry user and password encoded in base64 <br>\n",
 | |
|     "\n",
 | |
|     "`echo -n USER:PASSWORD | base64` <br>\n",
 | |
|     "\n",
 | |
|     "Update the config auth section below with your Docker registry URL and the previously generated base64 string <br>\n",
 | |
|     "\n",
 | |
|     "<br>"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "import json\n",
 | |
|     "config={\n",
 | |
|     "    \"auths\": {\n",
 | |
|     "        \"https://index.docker.io/v1/\": {\n",
 | |
|     "            \"auth\": \"xxxxxxxxxxxxxxx\"\n",
 | |
|     "        }\n",
 | |
|     "    }\n",
 | |
|     "}\n",
 | |
|     "with open('config.json', 'w') as outfile:\n",
 | |
|     "    json.dump(config, outfile)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Create a config-map in the namespace you're using with the docker config\n"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# !kubectl delete configmap docker-config\n",
 | |
|     "!kubectl create configmap docker-config --from-file=config.json\n",
 | |
|     "!rm config.json"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Update the `DOCKER_REGISTRY` and build the training image using Kaniko"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from kubernetes import client as k8s_client\n",
 | |
|     "from kubernetes.client import rest as k8s_rest\n",
 | |
|     "from kubeflow import fairing   \n",
 | |
|     "from kubeflow.fairing import utils as fairing_utils\n",
 | |
|     "from kubeflow.fairing.builders import append\n",
 | |
|     "from kubeflow.fairing.deployers import job\n",
 | |
|     "from kubeflow.fairing.preprocessors import base as base_preprocessor\n",
 | |
|     "\n",
 | |
|     "# Update the DOCKER_REGISTRY to your docker registry!!\n",
 | |
|     "DOCKER_REGISTRY = \"dockerregistry\"\n",
 | |
|     "namespace = fairing_utils.get_current_k8s_namespace()\n",
 | |
|     "\n",
 | |
|     "cos_username = cos_credentials['cos_hmac_keys']['access_key_id']\n",
 | |
|     "cos_key = cos_credentials['cos_hmac_keys']['secret_access_key']\n",
 | |
|     "cos_region = \"us-east-1\"\n",
 | |
|     "\n",
 | |
|     "logging.info(f\"Running in namespace {namespace}\")\n",
 | |
|     "logging.info(f\"Using docker registry {DOCKER_REGISTRY}\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n",
 | |
|     "# Kaniko image is updated to a newer image than 0.7.0.\n",
 | |
|     "from kubeflow.fairing import constants\n",
 | |
|     "constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\""
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from kubeflow.fairing.builders import cluster\n",
 | |
|     "\n",
 | |
|     "# output_map is a map of extra files to add to the build context.\n",
 | |
|     "# It is a map from source location to the location inside the context.\n",
 | |
|     "output_map =  {\n",
 | |
|     "    \"Dockerfile.model\": \"Dockerfile\",\n",
 | |
|     "    \"model.py\": \"model.py\"\n",
 | |
|     "}\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "preprocessor = base_preprocessor.BasePreProcessor(\n",
 | |
|     "    command=[\"python\"], # The base class will set this.\n",
 | |
|     "    input_files=[],\n",
 | |
|     "    path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n",
 | |
|     "    output_map=output_map)\n",
 | |
|     "\n",
 | |
|     "preprocessor.preprocess()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# Use a Tensorflow image as the base image\n",
 | |
|     "# We use a custom Dockerfile \n",
 | |
|     "from kubeflow.fairing.cloud.k8s import MinioUploader\n",
 | |
|     "from kubeflow.fairing.builders.cluster.minio_context import MinioContextSource\n",
 | |
|     "minio_uploader = MinioUploader(endpoint_url=service_endpoint_with_https, minio_secret=cos_username, minio_secret_key=cos_key, region_name=cos_region)\n",
 | |
|     "minio_context_source = MinioContextSource(endpoint_url=service_endpoint_with_https, minio_secret=cos_username, minio_secret_key=cos_key, region_name=cos_region)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# TODO: Add IBM Container registry as part of the fairing SDK.\n",
 | |
|     "cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n",
 | |
|     "                                                 base_image=\"\", # base_image is set in the Dockerfile\n",
 | |
|     "                                                 preprocessor=preprocessor,\n",
 | |
|     "                                                 image_name=\"mnist\",\n",
 | |
|     "                                                 dockerfile_path=\"Dockerfile\",\n",
 | |
|     "                                                 context_source=minio_context_source)\n",
 | |
|     "cluster_builder.build()\n",
 | |
|     "logging.info(f\"Built image {cluster_builder.image_tag}\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Create an Object Storage Bucket\n",
 | |
|     "\n",
 | |
|     "* Create an object storage bucket to store our models and other results."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "mnist_bucket = f\"{DOCKER_REGISTRY}-mnist\"\n",
 | |
|     "minio_uploader.create_bucket(mnist_bucket)\n",
 | |
|     "logging.info(f\"Bucket {mnist_bucket} created or already exists\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Distributed training\n",
 | |
|     "\n",
 | |
|     "* We will train the model by using TFJob to run a distributed training job"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Training job parameters"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n",
 | |
|     "num_ps = 1\n",
 | |
|     "num_workers = 2\n",
 | |
|     "model_dir = f\"s3://{mnist_bucket}/mnist\"\n",
 | |
|     "export_path = f\"s3://{mnist_bucket}/mnist/export\" \n",
 | |
|     "train_steps = 200\n",
 | |
|     "batch_size = 100\n",
 | |
|     "learning_rate = .01\n",
 | |
|     "image = cluster_builder.image_tag"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n",
 | |
|     "kind: TFJob\n",
 | |
|     "metadata:\n",
 | |
|     "  name: {train_name}  \n",
 | |
|     "spec:\n",
 | |
|     "  tfReplicaSpecs:\n",
 | |
|     "    Ps:\n",
 | |
|     "      replicas: {num_ps}\n",
 | |
|     "      template:\n",
 | |
|     "        metadata:\n",
 | |
|     "          annotations:\n",
 | |
|     "            sidecar.istio.io/inject: \"false\"\n",
 | |
|     "        spec:\n",
 | |
|     "          serviceAccount: default-editor\n",
 | |
|     "          containers:\n",
 | |
|     "          - name: tensorflow\n",
 | |
|     "            command:\n",
 | |
|     "            - python\n",
 | |
|     "            - /opt/model.py\n",
 | |
|     "            - --tf-model-dir={model_dir}\n",
 | |
|     "            - --tf-export-dir={export_path}\n",
 | |
|     "            - --tf-train-steps={train_steps}\n",
 | |
|     "            - --tf-batch-size={batch_size}\n",
 | |
|     "            - --tf-learning-rate={learning_rate}\n",
 | |
|     "            env:\n",
 | |
|     "            - name: S3_ENDPOINT\n",
 | |
|     "              value: {service_endpoint}\n",
 | |
|     "            - name: AWS_REGION\n",
 | |
|     "              value: {cos_region}\n",
 | |
|     "            - name: BUCKET_NAME\n",
 | |
|     "              value: {mnist_bucket}\n",
 | |
|     "            - name: S3_USE_HTTPS\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: S3_VERIFY_SSL\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: AWS_ACCESS_KEY_ID\n",
 | |
|     "              value: {cos_username}\n",
 | |
|     "            - name: AWS_SECRET_ACCESS_KEY\n",
 | |
|     "              value: {cos_key}\n",
 | |
|     "            image: {image}\n",
 | |
|     "            workingDir: /opt\n",
 | |
|     "          restartPolicy: OnFailure\n",
 | |
|     "    Chief:\n",
 | |
|     "      replicas: 1\n",
 | |
|     "      template:\n",
 | |
|     "        metadata:\n",
 | |
|     "          annotations:\n",
 | |
|     "            sidecar.istio.io/inject: \"false\"\n",
 | |
|     "        spec:\n",
 | |
|     "          serviceAccount: default-editor\n",
 | |
|     "          containers:\n",
 | |
|     "          - name: tensorflow\n",
 | |
|     "            command:\n",
 | |
|     "            - python\n",
 | |
|     "            - /opt/model.py\n",
 | |
|     "            - --tf-model-dir={model_dir}\n",
 | |
|     "            - --tf-export-dir={export_path}\n",
 | |
|     "            - --tf-train-steps={train_steps}\n",
 | |
|     "            - --tf-batch-size={batch_size}\n",
 | |
|     "            - --tf-learning-rate={learning_rate}\n",
 | |
|     "            env:\n",
 | |
|     "            - name: S3_ENDPOINT\n",
 | |
|     "              value: {service_endpoint}\n",
 | |
|     "            - name: AWS_REGION\n",
 | |
|     "              value: {cos_region}\n",
 | |
|     "            - name: BUCKET_NAME\n",
 | |
|     "              value: {mnist_bucket}\n",
 | |
|     "            - name: S3_USE_HTTPS\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: S3_VERIFY_SSL\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: AWS_ACCESS_KEY_ID\n",
 | |
|     "              value: {cos_username}\n",
 | |
|     "            - name: AWS_SECRET_ACCESS_KEY\n",
 | |
|     "              value: {cos_key}\n",
 | |
|     "            image: {image}\n",
 | |
|     "            workingDir: /opt\n",
 | |
|     "          restartPolicy: OnFailure\n",
 | |
|     "    Worker:\n",
 | |
|     "      replicas: {num_workers}\n",
 | |
|     "      template:\n",
 | |
|     "        metadata:\n",
 | |
|     "          annotations:\n",
 | |
|     "            sidecar.istio.io/inject: \"false\"\n",
 | |
|     "        spec:\n",
 | |
|     "          serviceAccount: default-editor\n",
 | |
|     "          containers:\n",
 | |
|     "          - name: tensorflow\n",
 | |
|     "            command:\n",
 | |
|     "            - python\n",
 | |
|     "            - /opt/model.py\n",
 | |
|     "            - --tf-model-dir={model_dir}\n",
 | |
|     "            - --tf-export-dir={export_path}\n",
 | |
|     "            - --tf-train-steps={train_steps}\n",
 | |
|     "            - --tf-batch-size={batch_size}\n",
 | |
|     "            - --tf-learning-rate={learning_rate}\n",
 | |
|     "            env:\n",
 | |
|     "            - name: S3_ENDPOINT\n",
 | |
|     "              value: {service_endpoint}\n",
 | |
|     "            - name: AWS_REGION\n",
 | |
|     "              value: {cos_region}\n",
 | |
|     "            - name: BUCKET_NAME\n",
 | |
|     "              value: {mnist_bucket}\n",
 | |
|     "            - name: S3_USE_HTTPS\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: S3_VERIFY_SSL\n",
 | |
|     "              value: \"1\"\n",
 | |
|     "            - name: AWS_ACCESS_KEY_ID\n",
 | |
|     "              value: {cos_username}\n",
 | |
|     "            - name: AWS_SECRET_ACCESS_KEY\n",
 | |
|     "              value: {cos_key}\n",
 | |
|     "            image: {image}\n",
 | |
|     "            workingDir: /opt\n",
 | |
|     "          restartPolicy: OnFailure\n",
 | |
|     "\"\"\" "
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Create the training job\n",
 | |
|     "\n",
 | |
|     "* You could write the spec to a YAML file and then do `kubectl apply -f {FILE}`\n",
 | |
|     "* Since you are running in Jupyter, you will use the TFJob client\n",
 | |
|     "* You will run the TFJob in a namespace created by a Kubeflow profile\n",
 | |
|     "  * The namespace will be the same namespace you are running the notebook in\n",
 | |
|     "  * Creating a profile ensures the namespace is provisioned with service accounts and other resources needed for Kubeflow"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "tf_job_client = tf_job_client_module.TFJobClient()"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "tf_job_body = yaml.safe_load(train_spec)\n",
 | |
|     "tf_job = tf_job_client.create(tf_job_body, namespace=namespace)  \n",
 | |
|     "\n",
 | |
|     "logging.info(f\"Created job {namespace}.{train_name}\")"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from kubeflow.tfjob import TFJobClient\n",
 | |
|     "tfjob_client = TFJobClient()\n",
 | |
|     "tfjob_client.wait_for_job(train_name, namespace=namespace, watch=True)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Get TF Job logs"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "tfjob_client.get_logs(train_name, namespace=namespace)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Deploy Tensorboard"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "tb_name = \"mnist-tensorboard\"\n",
 | |
|     "tb_deploy = f\"\"\"apiVersion: apps/v1\n",
 | |
|     "kind: Deployment\n",
 | |
|     "metadata:\n",
 | |
|     "  labels:\n",
 | |
|     "    app: mnist-tensorboard\n",
 | |
|     "  name: {tb_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  selector:\n",
 | |
|     "    matchLabels:\n",
 | |
|     "      app: mnist-tensorboard\n",
 | |
|     "  template:\n",
 | |
|     "    metadata:\n",
 | |
|     "      labels:\n",
 | |
|     "        app: mnist-tensorboard\n",
 | |
|     "        version: v1\n",
 | |
|     "    spec:\n",
 | |
|     "      serviceAccount: default-editor\n",
 | |
|     "      containers:\n",
 | |
|     "      - command:\n",
 | |
|     "        - /usr/local/bin/tensorboard\n",
 | |
|     "        - --logdir={model_dir}\n",
 | |
|     "        - --port=80\n",
 | |
|     "        image: tensorflow/tensorflow:1.15.2-py3\n",
 | |
|     "        env:\n",
 | |
|     "        - name: S3_ENDPOINT\n",
 | |
|     "          value: {service_endpoint}\n",
 | |
|     "        - name: AWS_REGION\n",
 | |
|     "          value: {cos_region}\n",
 | |
|     "        - name: BUCKET_NAME\n",
 | |
|     "          value: {mnist_bucket}\n",
 | |
|     "        - name: S3_USE_HTTPS\n",
 | |
|     "          value: \"1\"\n",
 | |
|     "        - name: S3_VERIFY_SSL\n",
 | |
|     "          value: \"1\"\n",
 | |
|     "        - name: AWS_ACCESS_KEY_ID\n",
 | |
|     "          value: {cos_username}\n",
 | |
|     "        - name: AWS_SECRET_ACCESS_KEY\n",
 | |
|     "          value: {cos_key}  \n",
 | |
|     "        name: tensorboard\n",
 | |
|     "        ports:\n",
 | |
|     "        - containerPort: 80\n",
 | |
|     "\"\"\"\n",
 | |
|     "tb_service = f\"\"\"apiVersion: v1\n",
 | |
|     "kind: Service\n",
 | |
|     "metadata:\n",
 | |
|     "  labels:\n",
 | |
|     "    app: mnist-tensorboard\n",
 | |
|     "  name: {tb_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  ports:\n",
 | |
|     "  - name: http-tb\n",
 | |
|     "    port: 80\n",
 | |
|     "    targetPort: 80\n",
 | |
|     "  selector:\n",
 | |
|     "    app: mnist-tensorboard\n",
 | |
|     "  type: ClusterIP\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
 | |
|     "kind: VirtualService\n",
 | |
|     "metadata:\n",
 | |
|     "  name: {tb_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  gateways:\n",
 | |
|     "  - kubeflow/kubeflow-gateway\n",
 | |
|     "  hosts:\n",
 | |
|     "  - '*'\n",
 | |
|     "  http:\n",
 | |
|     "  - match:\n",
 | |
|     "    - uri:\n",
 | |
|     "        prefix: /mnist/{namespace}/tensorboard/\n",
 | |
|     "    rewrite:\n",
 | |
|     "      uri: /\n",
 | |
|     "    route:\n",
 | |
|     "    - destination:\n",
 | |
|     "        host: {tb_name}.{namespace}.svc.cluster.local\n",
 | |
|     "        port:\n",
 | |
|     "          number: 80\n",
 | |
|     "    timeout: 300s\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "tb_specs = [tb_deploy, tb_service, tb_virtual_service]"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Get Tensorboard URL"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
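|     "Run the following commands with the appropriate RBAC permissions. <br>\n",
 | |
|     "**Note:** You can get the worker node IP from `kubectl get no -o wide`. The URL below assumes the notebook runs in the `anonymous` namespace; substitute your own namespace if it differs. <br>\n",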
 | |
|     "```bash\n",
 | |
|     "export INGRESS_HOST=<worker-node-ip>\n",
 | |
|     "export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name==\"http2\")].nodePort}')\n",
 | |
|     "printf \"Tensorboard URL: \\n${INGRESS_HOST}:${INGRESS_PORT}/mnist/anonymous/tensorboard/\\n\"\n",
 | |
|     "```"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Serve the model\n",
 | |
|     "\n",
 | |
|     "* Deploy the model using TensorFlow Serving\n",
 | |
|     "* We need to create\n",
 | |
|     "  1. A Kubernetes Deployment\n",
 | |
|     "  1. A Kubernetes Service\n",
 | |
|     "  1. (Optional) A ConfigMap containing the Prometheus monitoring config"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "deploy_name = \"mnist-model\"\n",
 | |
|     "model_base_path = export_path\n",
 | |
|     "\n",
 | |
|     "# The web ui defaults to mnist-service so if you change it you will\n",
 | |
|     "# need to change it in the UI as well to send predictions to the model\n",
 | |
|     "model_service = \"mnist-service\"\n",
 | |
|     "\n",
 | |
|     "deploy_spec = f\"\"\"apiVersion: apps/v1\n",
 | |
|     "kind: Deployment\n",
 | |
|     "metadata:\n",
 | |
|     "  labels:\n",
 | |
|     "    app: mnist\n",
 | |
|     "  name: {deploy_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  selector:\n",
 | |
|     "    matchLabels:\n",
 | |
|     "      app: mnist-model\n",
 | |
|     "  template:\n",
 | |
|     "    metadata:\n",
 | |
|     "      # TODO(jlewi): Right now we disable the istio side car because otherwise ISTIO rbac will prevent the\n",
 | |
|     "      # UI from sending RPCs to the server. We should create an appropriate ISTIO rbac authorization\n",
 | |
|     "      # policy to allow traffic from the UI to the model server.\n",
 | |
|     "      # https://istio.io/docs/concepts/security/#target-selectors\n",
 | |
|     "      annotations:        \n",
 | |
|     "        sidecar.istio.io/inject: \"false\"\n",
 | |
|     "      labels:\n",
 | |
|     "        app: mnist-model\n",
 | |
|     "        version: v1\n",
 | |
|     "    spec:\n",
 | |
|     "      serviceAccount: default-editor\n",
 | |
|     "      containers:\n",
 | |
|     "      - args:\n",
 | |
|     "        - --port=9000\n",
 | |
|     "        - --rest_api_port=8500\n",
 | |
|     "        - --model_name=mnist\n",
 | |
|     "        - --model_base_path={model_base_path}\n",
 | |
|     "        command:\n",
 | |
|     "        - /usr/bin/tensorflow_model_server\n",
 | |
|     "        env:\n",
 | |
|     "        - name: modelBasePath\n",
 | |
|     "          value: {model_base_path}\n",
 | |
|     "        - name: S3_ENDPOINT\n",
 | |
|     "          value: {service_endpoint}\n",
 | |
|     "        - name: AWS_REGION\n",
 | |
|     "          value: {cos_region}\n",
 | |
|     "        - name: BUCKET_NAME\n",
 | |
|     "          value: {mnist_bucket}\n",
 | |
|     "        - name: S3_USE_HTTPS\n",
 | |
|     "          value: \"1\"\n",
 | |
|     "        - name: S3_VERIFY_SSL\n",
 | |
|     "          value: \"1\"\n",
 | |
|     "        - name: AWS_ACCESS_KEY_ID\n",
 | |
|     "          value: {cos_username}\n",
 | |
|     "        - name: AWS_SECRET_ACCESS_KEY\n",
 | |
|     "          value: {cos_key}  \n",
 | |
|     "        image: tensorflow/serving:1.15.0\n",
 | |
|     "        imagePullPolicy: IfNotPresent\n",
 | |
|     "        livenessProbe:\n",
 | |
|     "          initialDelaySeconds: 30\n",
 | |
|     "          periodSeconds: 30\n",
 | |
|     "          tcpSocket:\n",
 | |
|     "            port: 9000\n",
 | |
|     "        name: mnist\n",
 | |
|     "        ports:\n",
 | |
|     "        - containerPort: 9000\n",
 | |
|     "        - containerPort: 8500\n",
 | |
|     "        resources:\n",
 | |
|     "          limits:\n",
 | |
|     "            cpu: \"4\"\n",
 | |
|     "            memory: 4Gi\n",
 | |
|     "          requests:\n",
 | |
|     "            cpu: \"1\"\n",
 | |
|     "            memory: 1Gi\n",
 | |
|     "        volumeMounts:\n",
 | |
|     "        - mountPath: /var/config/\n",
 | |
|     "          name: model-config\n",
 | |
|     "      volumes:\n",
 | |
|     "      - configMap:\n",
 | |
|     "          name: {deploy_name}\n",
 | |
|     "        name: model-config\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "service_spec = f\"\"\"apiVersion: v1\n",
 | |
|     "kind: Service\n",
 | |
|     "metadata:\n",
 | |
|     "  annotations:    \n",
 | |
|     "    prometheus.io/path: /monitoring/prometheus/metrics\n",
 | |
|     "    prometheus.io/port: \"8500\"\n",
 | |
|     "    prometheus.io/scrape: \"true\"\n",
 | |
|     "  labels:\n",
 | |
|     "    app: mnist-model\n",
 | |
|     "  name: {model_service}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  ports:\n",
 | |
|     "  - name: grpc-tf-serving\n",
 | |
|     "    port: 9000\n",
 | |
|     "    targetPort: 9000\n",
 | |
|     "  - name: http-tf-serving\n",
 | |
|     "    port: 8500\n",
 | |
|     "    targetPort: 8500\n",
 | |
|     "  selector:\n",
 | |
|     "    app: mnist-model\n",
 | |
|     "  type: ClusterIP\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "monitoring_config = f\"\"\"kind: ConfigMap\n",
 | |
|     "apiVersion: v1\n",
 | |
|     "metadata:\n",
 | |
|     "  name: {deploy_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "data:\n",
 | |
|     "  monitoring_config.txt: |-\n",
 | |
|     "    prometheus_config: {{\n",
 | |
|     "      enable: true,\n",
 | |
|     "      path: \"/monitoring/prometheus/metrics\"\n",
 | |
|     "    }}\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "model_specs = [deploy_spec, service_spec, monitoring_config]"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE)     "
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Deploy the MNIST UI\n",
 | |
|     "\n",
 | |
|     "* Next, deploy the web UI to visualize the MNIST predictions.\n",
 | |
|     "* Note: this uses a prebuilt, public Docker image for the UI."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "ui_name = \"mnist-ui\"\n",
 | |
|     "ui_deploy = f\"\"\"apiVersion: apps/v1\n",
 | |
|     "kind: Deployment\n",
 | |
|     "metadata:\n",
 | |
|     "  name: {ui_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  replicas: 1\n",
 | |
|     "  selector:\n",
 | |
|     "    matchLabels:\n",
 | |
|     "      app: mnist-web-ui\n",
 | |
|     "  template:\n",
 | |
|     "    metadata:\n",
 | |
|     "      labels:\n",
 | |
|     "        app: mnist-web-ui\n",
 | |
|     "    spec:\n",
 | |
|     "      containers:\n",
 | |
|     "      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n",
 | |
|     "        name: web-ui\n",
 | |
|     "        ports:\n",
 | |
|     "        - containerPort: 5000\n",
 | |
|     "      serviceAccount: default-editor\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "ui_service = f\"\"\"apiVersion: v1\n",
 | |
|     "kind: Service\n",
 | |
|     "metadata:\n",
 | |
|     "  annotations:\n",
 | |
|     "  name: {ui_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  ports:\n",
 | |
|     "  - name: http-mnist-ui\n",
 | |
|     "    port: 80\n",
 | |
|     "    targetPort: 5000\n",
 | |
|     "  selector:\n",
 | |
|     "    app: mnist-web-ui\n",
 | |
|     "  type: ClusterIP\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
 | |
|     "kind: VirtualService\n",
 | |
|     "metadata:\n",
 | |
|     "  name: {ui_name}\n",
 | |
|     "  namespace: {namespace}\n",
 | |
|     "spec:\n",
 | |
|     "  gateways:\n",
 | |
|     "  - kubeflow/kubeflow-gateway\n",
 | |
|     "  hosts:\n",
 | |
|     "  - '*'\n",
 | |
|     "  http:\n",
 | |
|     "  - match:\n",
 | |
|     "    - uri:\n",
 | |
|     "        prefix: /mnist/{namespace}/ui/\n",
 | |
|     "    rewrite:\n",
 | |
|     "      uri: /\n",
 | |
|     "    route:\n",
 | |
|     "    - destination:\n",
 | |
|     "        host: {ui_name}.{namespace}.svc.cluster.local\n",
 | |
|     "        port:\n",
 | |
|     "          number: 80\n",
 | |
|     "    timeout: 300s\n",
 | |
|     "\"\"\"\n",
 | |
|     "\n",
 | |
|     "ui_specs = [ui_deploy, ui_service, ui_virtual_service]"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
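|     "## Access the web UI\n",
 | |
|     "\n",
 | |
|     "* The UI is served through the Istio ingress gateway at `<worker-node-ip>:<ingress-node-port>/mnist/<namespace>/ui/`.\n",
 | |
|     "\n",
 | |
|     "Run the following commands with the appropriate RBAC permissions. The example URL assumes the `anonymous` namespace; substitute your own namespace if it differs. <br>\n",
 | |
|     "**Note:** You can get the worker node IP from `kubectl get no -o wide`. <br>\n",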
 | |
|     "```bash\n",
 | |
|     "export INGRESS_HOST=<worker-node-ip>\n",
 | |
|     "export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name==\"http2\")].nodePort}')\n",
 | |
|     "printf \"mnist-web-app URL: \\n${INGRESS_HOST}:${INGRESS_PORT}/mnist/anonymous/ui/\\n\"\n",
 | |
|     "```"
 | |
|    ]
 | |
|   }
 | |
|  ],
 | |
|  "metadata": {
 | |
|   "kernelspec": {
 | |
|    "display_name": "Python 3",
 | |
|    "language": "python",
 | |
|    "name": "python3"
 | |
|   },
 | |
|   "language_info": {
 | |
|    "codemirror_mode": {
 | |
|     "name": "ipython",
 | |
|     "version": 3
 | |
|    },
 | |
|    "file_extension": ".py",
 | |
|    "mimetype": "text/x-python",
 | |
|    "name": "python",
 | |
|    "nbconvert_exporter": "python",
 | |
|    "pygments_lexer": "ipython3",
 | |
|    "version": "3.7.4"
 | |
|   }
 | |
|  },
 | |
|  "nbformat": 4,
 | |
|  "nbformat_minor": 4
 | |
| }
 |