examples/mnist/mnist_gcp.ipynb


{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MNIST end to end on Kubeflow on GKE\n",
"\n",
"This example guides you through:\n",
" \n",
" 1. Taking an example TensorFlow model and modifying it to support distributed training.\n",
" 1. Serving the resulting model using TFServing.\n",
" 1. Deploying and using a web app that sends prediction requests to the model.\n",
" \n",
"## Requirements\n",
"\n",
" * You must be running Kubeflow 1.0 on Kubernetes Engine (GKE) with Cloud Identity-Aware Proxy (Cloud IAP). See the guide to [deploying Kubeflow on GCP](https://www.kubeflow.org/docs/gke/deploy/).\n",
" * Run this notebook within your Kubeflow cluster. See the guide to [setting up your Kubeflow notebooks](https://www.kubeflow.org/docs/components/notebooks/setup/).\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare model\n",
"\n",
"Existing distributed MNIST examples need a few changes before they run well as a TFJob.\n",
"\n",
"Specifically, you must:\n",
"\n",
"* Add command-line options to make the model configurable.\n",
"* Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n",
"* Define serving signatures for model serving.\n",
"\n",
"This tutorial provides a Python program that's already prepared for you: [model.py](model.py)."
]
},
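{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of the first requirement, the flags that the TFJob spec in this notebook passes to `model.py` could be parsed with `argparse`. This is a minimal sketch, not the contents of `model.py`: the flag names come from the training spec later in this notebook, while the function name `parse_training_args`, the types, and the default values are assumptions made for illustration.\n",
"\n",
"```python\n",
"import argparse\n",
"\n",
"\n",
"def parse_training_args(argv=None):\n",
"    # Hypothetical helper; flag names mirror the TFJob command in this notebook.\n",
"    # Types and defaults are assumptions, not taken from model.py.\n",
"    parser = argparse.ArgumentParser(description='MNIST training options (sketch)')\n",
"    parser.add_argument('--tf-model-dir', type=str, default='/tmp/mnist',\n",
"                        help='Directory for checkpoints and summaries.')\n",
"    parser.add_argument('--tf-export-dir', type=str, default='/tmp/mnist/export',\n",
"                        help='Directory for the exported model.')\n",
"    parser.add_argument('--tf-train-steps', type=int, default=200,\n",
"                        help='Number of training steps.')\n",
"    parser.add_argument('--tf-batch-size', type=int, default=100,\n",
"                        help='Training batch size.')\n",
"    parser.add_argument('--tf-learning-rate', type=float, default=0.01,\n",
"                        help='Learning rate.')\n",
"    return parser.parse_args(argv)\n",
"\n",
"\n",
"args = parse_training_args(['--tf-train-steps=50', '--tf-learning-rate=0.1'])\n",
"print(args.tf_train_steps, args.tf_batch_size, args.tf_learning_rate)\n",
"```\n",
"\n",
"Because `argparse` turns dashes into underscores, `--tf-train-steps` is read back as `args.tf_train_steps`."
]
},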
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Verify that you have a Google Cloud Platform (GCP) account\n",
"\n",
"The cell below checks that this notebook was spawned with credentials to access GCP.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import os\n",
"import uuid\n",
"from importlib import reload\n",
"from oauth2client.client import GoogleCredentials\n",
"credentials = GoogleCredentials.get_application_default()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install the required libraries\n",
"\n",
"Run the next cell to install and import the libraries required to train this model."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"pip installing requirements.txt\n",
"Cloning the tf-operator repo\n",
"Checkout kubeflow/tf-operator @9238906\n",
"Adding /home/jovyan/.local/lib/python3.6/site-packages to python path\n",
"Adding /home/jovyan/git_tf-operator/sdk/python to python path\n",
"Configure docker credentials\n"
]
}
],
"source": [
"import notebook_setup\n",
"reload(notebook_setup)\n",
"notebook_setup.notebook_setup()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait for the message `Configure docker credentials` before moving on to the next cell."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import k8s_util\n",
"# Force a reload of kubeflow. Because kubeflow is a multi-namespace module,\n",
"# the reload done in notebook_setup may not be sufficient.\n",
"import kubeflow\n",
"reload(kubeflow)\n",
"from kubernetes import client as k8s_client\n",
"from kubernetes import config as k8s_config\n",
"from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n",
"from IPython.display import display, HTML\n",
"import yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure a Docker registry for Kubeflow Fairing\n",
"\n",
"* In order to build Docker images from your notebook, you need a Docker registry to store the images.\n",
"* Below you set some variables specifying a [Container Registry](https://cloud.google.com/container-registry/docs/).\n",
"* Kubeflow Fairing provides a utility function to guess the name of your GCP project."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Running in project kubeflow-writers\n",
"Running in namespace kubeflow-sarahmaddox\n",
"Using Docker registry gcr.io/kubeflow-writers/fairing-job\n"
]
}
],
"source": [
"from kubernetes import client as k8s_client\n",
"from kubernetes.client import rest as k8s_rest\n",
"from kubeflow import fairing \n",
"from kubeflow.fairing import utils as fairing_utils\n",
"from kubeflow.fairing.builders import append\n",
"from kubeflow.fairing.deployers import job\n",
"from kubeflow.fairing.preprocessors import base as base_preprocessor\n",
"\n",
"# Setting up Google Container Registry (GCR) for storing output containers.\n",
"# You can use any Docker container registry instead of GCR.\n",
"GCP_PROJECT = fairing.cloud.gcp.guess_project_name()\n",
"DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)\n",
"namespace = fairing_utils.get_current_k8s_namespace()\n",
"\n",
"logging.info(f\"Running in project {GCP_PROJECT}\")\n",
"logging.info(f\"Running in namespace {namespace}\")\n",
"logging.info(f\"Using Docker registry {DOCKER_REGISTRY}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use Kubeflow Fairing to build the Docker image\n",
"\n",
"This notebook uses Kubeflow Fairing's kaniko builder to build a Docker image that includes all your dependencies.\n",
" * You use kaniko because you want to be able to run `pip` to install dependencies.\n",
" * Kaniko gives you the flexibility to build images from Dockerfiles."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n",
"# Kaniko image is updated to a newer image than 0.7.0.\n",
"from kubeflow.fairing import constants\n",
"constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"set()"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from kubeflow.fairing.builders import cluster\n",
"\n",
"# output_map lists extra files to add to the Docker build context.\n",
"# It maps each source location to its location inside the context.\n",
"output_map = {\n",
"    \"Dockerfile.model\": \"Dockerfile\",\n",
"    \"model.py\": \"model.py\"\n",
"}\n",
"\n",
"\n",
"preprocessor = base_preprocessor.BasePreProcessor(\n",
"    command=[\"python\"],  # The base class will set this.\n",
"    input_files=[],\n",
"    path_prefix=\"/app\",  # irrelevant since we aren't preprocessing any files\n",
"    output_map=output_map)\n",
"\n",
"preprocessor.preprocess()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the next cell and wait until you see a message like `Built image gcr.io/<your-project>/fairing-job/mnist:<1234567>`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building image using cluster builder.\n",
"Creating docker context: /tmp/fairing_context_ohm2nlbv\n",
"Dockerfile already exists in Fairing context, skipping...\n",
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
"Pod started running True\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"ERROR: logging before flag.Parse: E0226 02:34:42.750776 1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value\n",
"\u001b[36mINFO\u001b[0m[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n",
"\u001b[36mINFO\u001b[0m[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n",
"\u001b[36mINFO\u001b[0m[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
"ERROR: logging before flag.Parse: E0226 02:34:44.230593 1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg\n",
"ERROR: logging before flag.Parse: E0226 02:34:44.233477 1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url\n",
"\u001b[36mINFO\u001b[0m[0004] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n",
"\u001b[36mINFO\u001b[0m[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
"\u001b[36mINFO\u001b[0m[0005] Built cross stage deps: map[]\n",
"\u001b[36mINFO\u001b[0m[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
"\u001b[36mINFO\u001b[0m[0005] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n",
"\u001b[36mINFO\u001b[0m[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
"\u001b[36mINFO\u001b[0m[0005] Using files from context: [/kaniko/buildcontext/model.py]\n",
"\u001b[36mINFO\u001b[0m[0005] Checking for cached layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf...\n",
"\u001b[36mINFO\u001b[0m[0006] No cached layer found for cmd RUN chmod +x /opt/model.py\n",
"\u001b[36mINFO\u001b[0m[0006] Unpacking rootfs as cmd RUN chmod +x /opt/model.py requires it.\n",
"\u001b[36mINFO\u001b[0m[0029] Taking snapshot of full filesystem...\n",
"\u001b[36mINFO\u001b[0m[0042] Using files from context: [/kaniko/buildcontext/model.py]\n",
"\u001b[36mINFO\u001b[0m[0042] ADD model.py /opt/model.py\n",
"\u001b[36mINFO\u001b[0m[0042] Taking snapshot of files...\n",
"\u001b[36mINFO\u001b[0m[0042] RUN chmod +x /opt/model.py\n",
"\u001b[36mINFO\u001b[0m[0042] cmd: /bin/sh\n",
"\u001b[36mINFO\u001b[0m[0042] args: [-c chmod +x /opt/model.py]\n",
"\u001b[36mINFO\u001b[0m[0042] Taking snapshot of full filesystem...\n",
"\u001b[36mINFO\u001b[0m[0045] ENTRYPOINT [\"/usr/bin/python\"]\n",
"\u001b[36mINFO\u001b[0m[0045] Pushing layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf to cache now\n",
"\u001b[36mINFO\u001b[0m[0045] No files changed in this command, skipping snapshotting.\n",
"\u001b[36mINFO\u001b[0m[0045] CMD [\"/opt/model.py\"]\n",
"\u001b[36mINFO\u001b[0m[0045] No files changed in this command, skipping snapshotting.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Built image gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\n"
]
}
],
"source": [
"# Use a TensorFlow image as the base image.\n",
"# The base image is set in the custom Dockerfile, so base_image is left empty here.\n",
"cluster_builder = cluster.cluster.ClusterBuilder(\n",
"    registry=DOCKER_REGISTRY,\n",
"    base_image=\"\",  # base_image is set in the Dockerfile\n",
"    preprocessor=preprocessor,\n",
"    image_name=\"mnist\",\n",
"    dockerfile_path=\"Dockerfile\",\n",
"    pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],\n",
"    context_source=cluster.gcs_context.GCSContextSource())\n",
"cluster_builder.build()\n",
"logging.info(f\"Built image {cluster_builder.image_tag}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Cloud Storage bucket\n",
"\n",
"Run the next cell to create a Google Cloud Storage (GCS) bucket to store your models and other results.\n",
"\n",
"Since this notebook is running in Python, the cell uses the GCS Python client libraries, but you can use the `gsutil` command line instead."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Creating bucket kubeflow-writers-mnist\n"
]
}
],
"source": [
"from google.cloud import storage\n",
"bucket = f\"{GCP_PROJECT}-mnist\"\n",
"\n",
"client = storage.Client()\n",
"b = storage.Bucket(client=client, name=bucket)\n",
"\n",
"if not b.exists():\n",
"    logging.info(f\"Creating bucket {bucket}\")\n",
"    b.create()\n",
"else:\n",
"    logging.info(f\"Bucket {bucket} already exists\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distributed training\n",
"\n",
"To train the model, this example uses [TFJob](https://www.kubeflow.org/docs/components/training/tftraining/) to run a distributed training job. Run the next cell to set up the YAML specification for the job:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n",
"num_ps = 1\n",
"num_workers = 1\n",
"model_dir = f\"gs://{bucket}/mnist\"\n",
"export_path = f\"gs://{bucket}/mnist/export\"\n",
"train_steps = 200\n",
"batch_size = 100\n",
"learning_rate = 0.01\n",
"image = cluster_builder.image_tag\n",
"\n",
"train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n",
"kind: TFJob\n",
"metadata:\n",
"  name: {train_name}\n",
"spec:\n",
"  tfReplicaSpecs:\n",
"    Ps:\n",
"      replicas: {num_ps}\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"          restartPolicy: OnFailure\n",
"    Chief:\n",
"      replicas: 1\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"          restartPolicy: OnFailure\n",
"    Worker:\n",
"      replicas: {num_workers}\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"          restartPolicy: OnFailure\n",
"\"\"\""
]
},
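{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three replica specs above differ only in their replica counts. One way to avoid the repetition is to build the job body as a Python dictionary with a small helper. This is a sketch under assumptions: `make_replica_spec` and `make_tf_job_body` are hypothetical names, not part of the TFJob SDK, and the image and bucket names are placeholders. The field names are copied from the YAML spec above.\n",
"\n",
"```python\n",
"def make_replica_spec(replicas, image, args):\n",
"    # Pod template shared by the Ps, Chief, and Worker replicas.\n",
"    return {\n",
"        'replicas': replicas,\n",
"        'template': {\n",
"            'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'}},\n",
"            'spec': {\n",
"                'serviceAccount': 'default-editor',\n",
"                'containers': [{\n",
"                    'name': 'tensorflow',\n",
"                    'command': ['python', '/opt/model.py'] + list(args),\n",
"                    'image': image,\n",
"                    'workingDir': '/opt',\n",
"                }],\n",
"                'restartPolicy': 'OnFailure',\n",
"            },\n",
"        },\n",
"    }\n",
"\n",
"\n",
"def make_tf_job_body(name, image, args, num_ps=1, num_workers=1):\n",
"    # Same structure as the YAML train_spec; the Chief always has one replica.\n",
"    return {\n",
"        'apiVersion': 'kubeflow.org/v1',\n",
"        'kind': 'TFJob',\n",
"        'metadata': {'name': name},\n",
"        'spec': {\n",
"            'tfReplicaSpecs': {\n",
"                'Ps': make_replica_spec(num_ps, image, args),\n",
"                'Chief': make_replica_spec(1, image, args),\n",
"                'Worker': make_replica_spec(num_workers, image, args),\n",
"            },\n",
"        },\n",
"    }\n",
"\n",
"\n",
"# Placeholder values for illustration only.\n",
"example_args = ['--tf-model-dir=gs://example-bucket/mnist', '--tf-train-steps=200']\n",
"body = make_tf_job_body('mnist-train-demo', 'gcr.io/example-project/mnist:tag',\n",
"                        example_args, num_workers=2)\n",
"print(body['spec']['tfReplicaSpecs']['Worker']['replicas'])\n",
"```\n",
"\n",
"A dictionary like `body` has the same shape as the result of parsing the YAML spec, so it should be usable wherever the parsed spec is used."
]
},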
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the training job\n",
"\n",
"To submit the training job, you could write the spec to a YAML file and then run `kubectl apply -f {FILE}`.\n",
"\n",
"However, because you are running in a Jupyter notebook, you use the TFJob client instead.\n",
"* You run the TFJob in a namespace created by a Kubeflow profile.\n",
"* The namespace is the same as the namespace where you are running the notebook.\n",
"* Creating a profile ensures that the namespace is provisioned with service accounts and other resources needed for Kubeflow."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"tf_job_client = tf_job_client_module.TFJobClient()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created job kubeflow-sarahmaddox.mnist-train-289e\n"
]
}
],
"source": [
"tf_job_body = yaml.safe_load(train_spec)\n",
"tf_job = tf_job_client.create(tf_job_body, namespace=namespace) \n",
"\n",
"logging.info(f\"Created job {namespace}.{train_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the job using kubectl\n",
"\n",
"Above you used the Python SDK for TFJob to create the job. You can also use kubectl to get the status of your job.\n",
"The job conditions tell you whether the job is running, succeeded, or failed."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"apiVersion: kubeflow.org/v1\r\n",
"kind: TFJob\r\n",
"metadata:\r\n",
" creationTimestamp: \"2020-02-26T02:58:32Z\"\r\n",
" generation: 1\r\n",
" name: mnist-train-289e\r\n",
" namespace: kubeflow-sarahmaddox\r\n",
" resourceVersion: \"770252\"\r\n",
" selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-sarahmaddox/tfjobs/mnist-train-289e\r\n",
" uid: dfa23ecf-5843-11ea-9ddf-42010a80013f\r\n",
"spec:\r\n",
" tfReplicaSpecs:\r\n",
" Chief:\r\n",
" replicas: 1\r\n",
" template:\r\n",
" metadata:\r\n",
" annotations:\r\n",
" sidecar.istio.io/inject: \"false\"\r\n",
" spec:\r\n",
" containers:\r\n",
" - command:\r\n",
" - python\r\n",
" - /opt/model.py\r\n",
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
" - --tf-train-steps=200\r\n",
" - --tf-batch-size=100\r\n",
" - --tf-learning-rate=0.01\r\n",
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
" name: tensorflow\r\n",
" workingDir: /opt\r\n",
" restartPolicy: OnFailure\r\n",
" serviceAccount: default-editor\r\n",
" Ps:\r\n",
" replicas: 1\r\n",
" template:\r\n",
" metadata:\r\n",
" annotations:\r\n",
" sidecar.istio.io/inject: \"false\"\r\n",
" spec:\r\n",
" containers:\r\n",
" - command:\r\n",
" - python\r\n",
" - /opt/model.py\r\n",
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
" - --tf-train-steps=200\r\n",
" - --tf-batch-size=100\r\n",
" - --tf-learning-rate=0.01\r\n",
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
" name: tensorflow\r\n",
" workingDir: /opt\r\n",
" restartPolicy: OnFailure\r\n",
" serviceAccount: default-editor\r\n",
" Worker:\r\n",
" replicas: 1\r\n",
" template:\r\n",
" metadata:\r\n",
" annotations:\r\n",
" sidecar.istio.io/inject: \"false\"\r\n",
" spec:\r\n",
" containers:\r\n",
" - command:\r\n",
" - python\r\n",
" - /opt/model.py\r\n",
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
" - --tf-train-steps=200\r\n",
" - --tf-batch-size=100\r\n",
" - --tf-learning-rate=0.01\r\n",
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
" name: tensorflow\r\n",
" workingDir: /opt\r\n",
" restartPolicy: OnFailure\r\n",
" serviceAccount: default-editor\r\n",
"status:\r\n",
" completionTime: \"2020-02-26T02:59:58Z\"\r\n",
" conditions:\r\n",
" - lastTransitionTime: \"2020-02-26T02:58:32Z\"\r\n",
" lastUpdateTime: \"2020-02-26T02:58:32Z\"\r\n",
" message: TFJob mnist-train-289e is created.\r\n",
" reason: TFJobCreated\r\n",
" status: \"True\"\r\n",
" type: Created\r\n",
" - lastTransitionTime: \"2020-02-26T02:58:35Z\"\r\n",
" lastUpdateTime: \"2020-02-26T02:58:35Z\"\r\n",
" message: TFJob mnist-train-289e is running.\r\n",
" reason: TFJobRunning\r\n",
" status: \"False\"\r\n",
" type: Running\r\n",
" - lastTransitionTime: \"2020-02-26T02:59:58Z\"\r\n",
" lastUpdateTime: \"2020-02-26T02:59:58Z\"\r\n",
" message: TFJob mnist-train-289e successfully completed.\r\n",
" reason: TFJobSucceeded\r\n",
" status: \"True\"\r\n",
" type: Succeeded\r\n",
" replicaStatuses:\r\n",
" Chief:\r\n",
" succeeded: 1\r\n",
" PS:\r\n",
" succeeded: 1\r\n",
" Worker:\r\n",
" succeeded: 1\r\n",
" startTime: \"2020-02-26T02:58:32Z\"\r\n"
]
}
],
"source": [
"!kubectl get tfjobs -o yaml {train_name}"
]
},
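{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `status.conditions` list in the output above can also be interpreted programmatically. The sketch below assumes that the status has been loaded into a Python dictionary (for example from the YAML that kubectl prints) and that conditions are listed in chronological order, as they are in the recorded output. `current_condition` is a hypothetical helper, not part of the TFJob SDK.\n",
"\n",
"```python\n",
"def current_condition(status):\n",
"    # Walk the conditions from newest to oldest and return the first active one.\n",
"    for condition in reversed(status.get('conditions', [])):\n",
"        if condition.get('status') == 'True':\n",
"            return condition.get('type')\n",
"    return None\n",
"\n",
"\n",
"# Conditions copied from the kubectl output above (timestamps omitted).\n",
"status = {\n",
"    'conditions': [\n",
"        {'type': 'Created', 'status': 'True', 'reason': 'TFJobCreated'},\n",
"        {'type': 'Running', 'status': 'False', 'reason': 'TFJobRunning'},\n",
"        {'type': 'Succeeded', 'status': 'True', 'reason': 'TFJobSucceeded'},\n",
"    ]\n",
"}\n",
"print(current_condition(status))\n",
"```\n",
"\n",
"For the recorded run, the last condition whose status is `True` is `Succeeded`."
]
},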
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the training logs\n",
"\n",
"There are two ways to get the logs for the training job:\n",
"\n",
"* Using kubectl to fetch the pod logs. These logs are ephemeral; they become unavailable when the pod is garbage collected to free up resources.\n",
"* Using Stackdriver:\n",
"\n",
"    * Kubernetes logs are automatically available in Stackdriver.\n",
"    * You can use labels to locate the logs for a specific pod.\n",
"    * In the cell below, you use labels for the training job name and process type to locate the logs for a specific pod.\n",
"\n",
"Run the cell below to get a link to Stackdriver for your logs:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22chief%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>chief logs</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22worker%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>worker logs</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22ps%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>ps logs</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from urllib.parse import urlencode\n",
"\n",
"for replica in [\"chief\", \"worker\", \"ps\"]:\n",
"    logs_filter = f\"\"\"resource.type=\"k8s_container\"\n",
"    labels.\"k8s-pod/tf-job-name\" = \"{train_name}\"\n",
"    labels.\"k8s-pod/tf-replica-type\" = \"{replica}\"\n",
"    resource.labels.container_name=\"tensorflow\" \"\"\"\n",
"\n",
"    new_params = {'project': GCP_PROJECT,\n",
"                  # Logs for last 7 days\n",
"                  'interval': 'P7D',\n",
"                  'advancedFilter': logs_filter}\n",
"\n",
"    query = urlencode(new_params)\n",
"\n",
"    url = \"https://console.cloud.google.com/logs/viewer?\" + query\n",
"\n",
"    display(HTML(f\"Link to: <a href='{url}'>{replica} logs</a>\"))\n"
]
},
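{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the kubectl option, a label selector can narrow the logs to the pods of one replica type. The sketch below only builds the command string. It assumes that the pods carry the labels `tf-job-name` and `tf-replica-type`, which is inferred from the `k8s-pod/...` labels in the Stackdriver filter above and should be checked against your cluster. `kubectl_logs_command` is a hypothetical helper.\n",
"\n",
"```python\n",
"import shlex\n",
"\n",
"\n",
"def kubectl_logs_command(job_name, replica_type, namespace, container='tensorflow'):\n",
"    # Label names are an assumption inferred from the Stackdriver filter.\n",
"    selector = f'tf-job-name={job_name},tf-replica-type={replica_type}'\n",
"    parts = ['kubectl', 'logs', '-n', namespace, '-l', selector, '-c', container]\n",
"    # Quote each part so the command is safe to paste into a shell.\n",
"    return ' '.join(shlex.quote(part) for part in parts)\n",
"\n",
"\n",
"print(kubectl_logs_command('mnist-train-289e', 'chief', 'kubeflow-sarahmaddox'))\n",
"```\n",
"\n",
"The printed command can be pasted into a terminal, or run from a notebook cell with a leading `!`."
]
},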
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy TensorBoard\n",
"\n",
"The next step is to create a Kubernetes deployment to run TensorBoard.\n",
"\n",
"TensorBoard will be accessible behind the Kubeflow IAP endpoint."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"tb_name = \"mnist-tensorboard\"\n",
"tb_deploy = f\"\"\"apiVersion: apps/v1\n",
"kind: Deployment\n",
"metadata:\n",
"  labels:\n",
"    app: mnist-tensorboard\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-tensorboard\n",
"  template:\n",
"    metadata:\n",
"      labels:\n",
"        app: mnist-tensorboard\n",
"        version: v1\n",
"    spec:\n",
"      serviceAccount: default-editor\n",
"      containers:\n",
"      - command:\n",
"        - /usr/local/bin/tensorboard\n",
"        - --logdir={model_dir}\n",
"        - --port=80\n",
"        image: tensorflow/tensorflow:1.15.2-py3\n",
"        name: tensorboard\n",
"        ports:\n",
"        - containerPort: 80\n",
"\"\"\"\n",
"tb_service = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  labels:\n",
"    app: mnist-tensorboard\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: http-tb\n",
"    port: 80\n",
"    targetPort: 80\n",
"  selector:\n",
"    app: mnist-tensorboard\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
"kind: VirtualService\n",
"metadata:\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  gateways:\n",
"  - kubeflow/kubeflow-gateway\n",
"  hosts:\n",
"  - '*'\n",
"  http:\n",
"  - match:\n",
"    - uri:\n",
"        prefix: /mnist/{namespace}/tensorboard/\n",
"    rewrite:\n",
"      uri: /\n",
"    route:\n",
"    - destination:\n",
"        host: {tb_name}.{namespace}.svc.cluster.local\n",
"        port:\n",
"          number: 80\n",
"    timeout: 300s\n",
"\"\"\"\n",
"\n",
"tb_specs = [tb_deploy, tb_service, tb_virtual_service]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jovyan/examples/mnist/k8s_util.py:55: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n",
" spec = yaml.load(spec)\n",
"Created Deployment kubeflow-sarahmaddox.mnist-tensorboard\n",
"Created Service kubeflow-sarahmaddox.mnist-tensorboard\n",
"Created VirtualService mnist-tensorboard.mnist-tensorboard\n"
]
},
{
"data": {
"text/plain": [
"[{'api_version': 'apps/v1',\n",
" 'kind': 'Deployment',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': 1,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-tensorboard'},\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-tensorboard',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '782392',\n",
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-tensorboard',\n",
" 'uid': 'e1d50153-5846-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'min_ready_seconds': None,\n",
" 'paused': None,\n",
" 'progress_deadline_seconds': 600,\n",
" 'replicas': 1,\n",
" 'revision_history_limit': 10,\n",
" 'selector': {'match_expressions': None,\n",
" 'match_labels': {'app': 'mnist-tensorboard'}},\n",
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
" 'max_unavailable': '25%'},\n",
" 'type': 'RollingUpdate'},\n",
" 'template': {'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': None,\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-tensorboard',\n",
" 'version': 'v1'},\n",
" 'managed_fields': None,\n",
" 'name': None,\n",
" 'namespace': None,\n",
" 'owner_references': None,\n",
" 'resource_version': None,\n",
" 'self_link': None,\n",
" 'uid': None},\n",
" 'spec': {'active_deadline_seconds': None,\n",
" 'affinity': None,\n",
" 'automount_service_account_token': None,\n",
" 'containers': [{'args': None,\n",
" 'command': ['/usr/local/bin/tensorboard',\n",
" '--logdir=gs://kubeflow-writers-mnist/mnist',\n",
" '--port=80'],\n",
" 'env': None,\n",
" 'env_from': None,\n",
" 'image': 'tensorflow/tensorflow:1.15.2-py3',\n",
" 'image_pull_policy': 'IfNotPresent',\n",
" 'lifecycle': None,\n",
" 'liveness_probe': None,\n",
" 'name': 'tensorboard',\n",
" 'ports': [{'container_port': 80,\n",
" 'host_ip': None,\n",
" 'host_port': None,\n",
" 'name': None,\n",
" 'protocol': 'TCP'}],\n",
" 'readiness_probe': None,\n",
" 'resources': {'limits': None,\n",
" 'requests': None},\n",
" 'security_context': None,\n",
" 'stdin': None,\n",
" 'stdin_once': None,\n",
" 'termination_message_path': '/dev/termination-log',\n",
" 'termination_message_policy': 'File',\n",
" 'tty': None,\n",
" 'volume_devices': None,\n",
" 'volume_mounts': None,\n",
" 'working_dir': None}],\n",
" 'dns_config': None,\n",
" 'dns_policy': 'ClusterFirst',\n",
" 'enable_service_links': None,\n",
" 'host_aliases': None,\n",
" 'host_ipc': None,\n",
" 'host_network': None,\n",
" 'host_pid': None,\n",
" 'hostname': None,\n",
" 'image_pull_secrets': None,\n",
" 'init_containers': None,\n",
" 'node_name': None,\n",
" 'node_selector': None,\n",
" 'priority': None,\n",
" 'priority_class_name': None,\n",
" 'readiness_gates': None,\n",
" 'restart_policy': 'Always',\n",
" 'runtime_class_name': None,\n",
" 'scheduler_name': 'default-scheduler',\n",
" 'security_context': {'fs_group': None,\n",
" 'run_as_group': None,\n",
" 'run_as_non_root': None,\n",
" 'run_as_user': None,\n",
" 'se_linux_options': None,\n",
" 'supplemental_groups': None,\n",
" 'sysctls': None},\n",
" 'service_account': 'default-editor',\n",
" 'service_account_name': 'default-editor',\n",
" 'share_process_namespace': None,\n",
" 'subdomain': None,\n",
" 'termination_grace_period_seconds': 30,\n",
" 'tolerations': None,\n",
" 'volumes': None}}},\n",
" 'status': {'available_replicas': None,\n",
" 'collision_count': None,\n",
" 'conditions': None,\n",
" 'observed_generation': None,\n",
" 'ready_replicas': None,\n",
" 'replicas': None,\n",
" 'unavailable_replicas': None,\n",
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
" 'kind': 'Service',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-tensorboard'},\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-tensorboard',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '782395',\n",
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-tensorboard',\n",
" 'uid': 'e1d7b041-5846-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'cluster_ip': '10.35.253.170',\n",
" 'external_i_ps': None,\n",
" 'external_name': None,\n",
" 'external_traffic_policy': None,\n",
" 'health_check_node_port': None,\n",
" 'load_balancer_ip': None,\n",
" 'load_balancer_source_ranges': None,\n",
" 'ports': [{'name': 'http-tb',\n",
" 'node_port': None,\n",
" 'port': 80,\n",
" 'protocol': 'TCP',\n",
" 'target_port': 80}],\n",
" 'publish_not_ready_addresses': None,\n",
" 'selector': {'app': 'mnist-tensorboard'},\n",
" 'session_affinity': 'None',\n",
" 'session_affinity_config': None,\n",
" 'type': 'ClusterIP'},\n",
" 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n",
" 'kind': 'VirtualService',\n",
" 'metadata': {'creationTimestamp': '2020-02-26T03:20:04Z',\n",
" 'generation': 1,\n",
" 'name': 'mnist-tensorboard',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'resourceVersion': '782396',\n",
" 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-tensorboard',\n",
" 'uid': 'e1daadfe-5846-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n",
" 'hosts': ['*'],\n",
" 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/tensorboard/'}}],\n",
" 'rewrite': {'uri': '/'},\n",
" 'route': [{'destination': {'host': 'mnist-tensorboard.kubeflow-sarahmaddox.svc.cluster.local',\n",
" 'port': {'number': 80}}}],\n",
" 'timeout': '300s'}]}}]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set a variable defining your endpoint\n",
"\n",
"Set `endpoint` to `https://your-domain` (with no slash at the end). Your domain typically has the following pattern: `<your-kubeflow-deployment-name>.endpoints.<your-gcp-project>.cloud.goog`. You can see your domain in the URL that you're using to access this notebook."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"endpoint set to https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog\n"
]
}
],
"source": [
"# Set this to your Kubeflow endpoint, for example \"https://<your-domain>\" (no trailing slash).\n",
"endpoint = None\n",
"\n",
"if endpoint:\n",
"    logging.info(f\"endpoint set to {endpoint}\")\n",
"else:\n",
"    logging.warning(\"You must set the endpoint variable in order to print out the URLs where you can access your web apps.\")"
]
},
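{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later cells build URLs as `endpoint + path`, where `path` already starts with a slash. That is why `endpoint` must not end in one. As a minimal sketch (the helper name `join_endpoint` and the placeholder domain are hypothetical, not part of this notebook's utilities), you could normalize both parts before joining so that a stray slash never produces `//` in the URL:\n",
"\n",
"```python\n",
"def join_endpoint(endpoint, path):\n",
"    # Hypothetical helper: join a base URL and a path with exactly one slash.\n",
"    if not endpoint:\n",
"        raise ValueError('endpoint is not set')\n",
"    return endpoint.rstrip('/') + '/' + path.lstrip('/')\n",
"\n",
"\n",
"# Example with a placeholder domain and namespace.\n",
"url = join_endpoint('https://example.endpoints.my-project.cloud.goog/', '/mnist/my-namespace/ui/')\n",
"print(url)\n",
"```"
]
},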
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Access the TensorBoard UI\n",
"\n",
"Run the cell below to find the endpoint for the TensorBoard UI."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"TensorBoard UI is at <a href='https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/tensorboard/'>https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/tensorboard/</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"if endpoint:\n",
"    vs = yaml.safe_load(tb_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    tb_endpoint = endpoint + path\n",
"    display(HTML(f\"TensorBoard UI is at <a href='{tb_endpoint}'>{tb_endpoint}</a>\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wait for the training job to finish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use the TFJob client to wait for the job to finish:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"TFJob kubeflow-sarahmaddox.mnist-train-289e succeeded\n"
]
}
],
"source": [
"tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=[\"Succeeded\", \"Failed\"], namespace=namespace)\n",
"\n",
"if tf_job_client.is_job_succeeded(train_name, namespace):\n",
"    logging.info(f\"TFJob {namespace}.{train_name} succeeded\")\n",
"else:\n",
"    raise ValueError(f\"TFJob {namespace}.{train_name} failed\")"
]
},
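{
"cell_type": "markdown",
"metadata": {},
"source": [
"The call above blocks until the job reaches one of the expected conditions. The general pattern behind this kind of wait is a polling loop with a timeout. The sketch below illustrates that pattern only. It is not the TFJob client's implementation, and `wait_until` and the stand-in status source are hypothetical:\n",
"\n",
"```python\n",
"import time\n",
"\n",
"\n",
"def wait_until(check, timeout_seconds=600, poll_seconds=5, sleep=time.sleep):\n",
"    # Hypothetical helper: call check() until it returns a truthy value\n",
"    # or the timeout elapses. Returns the last value from check().\n",
"    deadline = time.monotonic() + timeout_seconds\n",
"    while True:\n",
"        result = check()\n",
"        if result or time.monotonic() >= deadline:\n",
"            return result\n",
"        sleep(poll_seconds)\n",
"\n",
"\n",
"# Example with a stand-in status source instead of a real TFJob.\n",
"states = iter(['Running', 'Running', 'Succeeded'])\n",
"final = wait_until(lambda: next(states) in ('Succeeded', 'Failed'), poll_seconds=0)\n",
"print(final)\n",
"```\n",
"\n",
"A timeout matters because a job that never reaches a terminal condition would otherwise block the notebook indefinitely."
]
},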
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Serve the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can deploy the model using [TensorFlow Serving](https://www.kubeflow.org/docs/components/serving/tfserving_new/).\n",
"\n",
"You need to create the following:\n",
"* A Kubernetes deployment.\n",
"* A Kubernetes service.\n",
"* (Optional) A configmap containing the Prometheus monitoring configuration."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"deploy_name = \"mnist-model\"\n",
"model_base_path = export_path\n",
"\n",
"# The web UI defaults to mnist-service so if you change the name, you must\n",
"# change it in the UI as well.\n",
"model_service = \"mnist-service\"\n",
"\n",
"deploy_spec = f\"\"\"apiVersion: apps/v1\n",
"kind: Deployment\n",
"metadata:\n",
"  labels:\n",
"    app: mnist\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-model\n",
"  template:\n",
"    metadata:\n",
"      # TODO(jlewi): Right now we disable the Istio sidecar because otherwise Istio RBAC will prevent the\n",
"      # UI from sending RPCs to the server. We should create an appropriate Istio RBAC authorization\n",
"      # policy to allow traffic from the UI to the model server.\n",
"      # https://istio.io/docs/concepts/security/#target-selectors\n",
"      annotations:\n",
"        sidecar.istio.io/inject: \"false\"\n",
"      labels:\n",
"        app: mnist-model\n",
"        version: v1\n",
"    spec:\n",
"      serviceAccount: default-editor\n",
"      containers:\n",
"      - args:\n",
"        - --port=9000\n",
"        - --rest_api_port=8500\n",
"        - --model_name=mnist\n",
"        - --model_base_path={model_base_path}\n",
"        - --monitoring_config_file=/var/config/monitoring_config.txt\n",
"        command:\n",
"        - /usr/bin/tensorflow_model_server\n",
"        env:\n",
"        - name: modelBasePath\n",
"          value: {model_base_path}\n",
"        image: tensorflow/serving:1.15.0\n",
"        imagePullPolicy: IfNotPresent\n",
"        livenessProbe:\n",
"          initialDelaySeconds: 30\n",
"          periodSeconds: 30\n",
"          tcpSocket:\n",
"            port: 9000\n",
"        name: mnist\n",
"        ports:\n",
"        - containerPort: 9000\n",
"        - containerPort: 8500\n",
"        resources:\n",
"          limits:\n",
"            cpu: \"4\"\n",
"            memory: 4Gi\n",
"          requests:\n",
"            cpu: \"1\"\n",
"            memory: 1Gi\n",
"        volumeMounts:\n",
"        - mountPath: /var/config/\n",
"          name: model-config\n",
"      volumes:\n",
"      - configMap:\n",
"          name: {deploy_name}\n",
"        name: model-config\n",
"\"\"\"\n",
"\n",
"service_spec = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"    prometheus.io/path: /monitoring/prometheus/metrics\n",
"    prometheus.io/port: \"8500\"\n",
"    prometheus.io/scrape: \"true\"\n",
"  labels:\n",
"    app: mnist-model\n",
"  name: {model_service}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: grpc-tf-serving\n",
"    port: 9000\n",
"    targetPort: 9000\n",
"  - name: http-tf-serving\n",
"    port: 8500\n",
"    targetPort: 8500\n",
"  selector:\n",
"    app: mnist-model\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"monitoring_config = f\"\"\"kind: ConfigMap\n",
"apiVersion: v1\n",
"metadata:\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"data:\n",
"  monitoring_config.txt: |-\n",
"    prometheus_config: {{\n",
"      enable: true,\n",
"      path: \"/monitoring/prometheus/metrics\"\n",
"    }}\n",
"\"\"\"\n",
"\n",
"model_specs = [deploy_spec, service_spec, monitoring_config]"
]
},
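{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Kubernetes Service routes traffic to pods whose labels contain every key/value pair in its `selector`. In the specs above, the selector `app: mnist-model` matches the labels on the pod template, not the `app: mnist` label on the Deployment object itself. The sketch below illustrates that equality-based matching rule with plain dictionaries. `selector_matches` is a hypothetical helper, not part of the Kubernetes client:\n",
"\n",
"```python\n",
"def selector_matches(selector, labels):\n",
"    # A selector matches when every selector key/value pair\n",
"    # is present in the given labels.\n",
"    return all(labels.get(key) == value for key, value in selector.items())\n",
"\n",
"\n",
"# Values copied from the specs above.\n",
"service_selector = {'app': 'mnist-model'}\n",
"pod_labels = {'app': 'mnist-model', 'version': 'v1'}\n",
"deployment_labels = {'app': 'mnist'}\n",
"\n",
"print(selector_matches(service_selector, pod_labels))\n",
"print(selector_matches(service_selector, deployment_labels))\n",
"```"
]
},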
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created Deployment kubeflow-sarahmaddox.mnist-model\n",
"Created Service kubeflow-sarahmaddox.mnist-service\n",
"Created ConfigMap kubeflow-sarahmaddox.mnist-model\n"
]
},
{
"data": {
"text/plain": [
"[{'api_version': 'apps/v1',\n",
" 'kind': 'Deployment',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': 1,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist'},\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-model',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '788910',\n",
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-model',\n",
" 'uid': '5555d458-5848-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'min_ready_seconds': None,\n",
" 'paused': None,\n",
" 'progress_deadline_seconds': 600,\n",
" 'replicas': 1,\n",
" 'revision_history_limit': 10,\n",
" 'selector': {'match_expressions': None,\n",
" 'match_labels': {'app': 'mnist-model'}},\n",
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
" 'max_unavailable': '25%'},\n",
" 'type': 'RollingUpdate'},\n",
" 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': None,\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-model',\n",
" 'version': 'v1'},\n",
" 'managed_fields': None,\n",
" 'name': None,\n",
" 'namespace': None,\n",
" 'owner_references': None,\n",
" 'resource_version': None,\n",
" 'self_link': None,\n",
" 'uid': None},\n",
" 'spec': {'active_deadline_seconds': None,\n",
" 'affinity': None,\n",
" 'automount_service_account_token': None,\n",
" 'containers': [{'args': ['--port=9000',\n",
" '--rest_api_port=8500',\n",
" '--model_name=mnist',\n",
" '--model_base_path=gs://kubeflow-writers-mnist/mnist/export',\n",
" '--monitoring_config_file=/var/config/monitoring_config.txt'],\n",
" 'command': ['/usr/bin/tensorflow_model_server'],\n",
" 'env': [{'name': 'modelBasePath',\n",
" 'value': 'gs://kubeflow-writers-mnist/mnist/export',\n",
" 'value_from': None}],\n",
" 'env_from': None,\n",
" 'image': 'tensorflow/serving:1.15.0',\n",
" 'image_pull_policy': 'IfNotPresent',\n",
" 'lifecycle': None,\n",
" 'liveness_probe': {'_exec': None,\n",
" 'failure_threshold': 3,\n",
" 'http_get': None,\n",
" 'initial_delay_seconds': 30,\n",
" 'period_seconds': 30,\n",
" 'success_threshold': 1,\n",
" 'tcp_socket': {'host': None,\n",
" 'port': 9000},\n",
" 'timeout_seconds': 1},\n",
" 'name': 'mnist',\n",
" 'ports': [{'container_port': 9000,\n",
" 'host_ip': None,\n",
" 'host_port': None,\n",
" 'name': None,\n",
" 'protocol': 'TCP'},\n",
" {'container_port': 8500,\n",
" 'host_ip': None,\n",
" 'host_port': None,\n",
" 'name': None,\n",
" 'protocol': 'TCP'}],\n",
" 'readiness_probe': None,\n",
" 'resources': {'limits': {'cpu': '4',\n",
" 'memory': '4Gi'},\n",
" 'requests': {'cpu': '1',\n",
" 'memory': '1Gi'}},\n",
" 'security_context': None,\n",
" 'stdin': None,\n",
" 'stdin_once': None,\n",
" 'termination_message_path': '/dev/termination-log',\n",
" 'termination_message_policy': 'File',\n",
" 'tty': None,\n",
" 'volume_devices': None,\n",
" 'volume_mounts': [{'mount_path': '/var/config/',\n",
" 'mount_propagation': None,\n",
" 'name': 'model-config',\n",
" 'read_only': None,\n",
" 'sub_path': None,\n",
" 'sub_path_expr': None}],\n",
" 'working_dir': None}],\n",
" 'dns_config': None,\n",
" 'dns_policy': 'ClusterFirst',\n",
" 'enable_service_links': None,\n",
" 'host_aliases': None,\n",
" 'host_ipc': None,\n",
" 'host_network': None,\n",
" 'host_pid': None,\n",
" 'hostname': None,\n",
" 'image_pull_secrets': None,\n",
" 'init_containers': None,\n",
" 'node_name': None,\n",
" 'node_selector': None,\n",
" 'priority': None,\n",
" 'priority_class_name': None,\n",
" 'readiness_gates': None,\n",
" 'restart_policy': 'Always',\n",
" 'runtime_class_name': None,\n",
" 'scheduler_name': 'default-scheduler',\n",
" 'security_context': {'fs_group': None,\n",
" 'run_as_group': None,\n",
" 'run_as_non_root': None,\n",
" 'run_as_user': None,\n",
" 'se_linux_options': None,\n",
" 'supplemental_groups': None,\n",
" 'sysctls': None},\n",
" 'service_account': 'default-editor',\n",
" 'service_account_name': 'default-editor',\n",
" 'share_process_namespace': None,\n",
" 'subdomain': None,\n",
" 'termination_grace_period_seconds': 30,\n",
" 'tolerations': None,\n",
" 'volumes': [{'aws_elastic_block_store': None,\n",
" 'azure_disk': None,\n",
" 'azure_file': None,\n",
" 'cephfs': None,\n",
" 'cinder': None,\n",
" 'config_map': {'default_mode': 420,\n",
" 'items': None,\n",
" 'name': 'mnist-model',\n",
" 'optional': None},\n",
" 'csi': None,\n",
" 'downward_api': None,\n",
" 'empty_dir': None,\n",
" 'fc': None,\n",
" 'flex_volume': None,\n",
" 'flocker': None,\n",
" 'gce_persistent_disk': None,\n",
" 'git_repo': None,\n",
" 'glusterfs': None,\n",
" 'host_path': None,\n",
" 'iscsi': None,\n",
" 'name': 'model-config',\n",
" 'nfs': None,\n",
" 'persistent_volume_claim': None,\n",
" 'photon_persistent_disk': None,\n",
" 'portworx_volume': None,\n",
" 'projected': None,\n",
" 'quobyte': None,\n",
" 'rbd': None,\n",
" 'scale_io': None,\n",
" 'secret': None,\n",
" 'storageos': None,\n",
" 'vsphere_volume': None}]}}},\n",
" 'status': {'available_replicas': None,\n",
" 'collision_count': None,\n",
" 'conditions': None,\n",
" 'observed_generation': None,\n",
" 'ready_replicas': None,\n",
" 'replicas': None,\n",
" 'unavailable_replicas': None,\n",
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
" 'kind': 'Service',\n",
" 'metadata': {'annotations': {'prometheus.io/path': '/monitoring/prometheus/metrics',\n",
" 'prometheus.io/port': '8500',\n",
" 'prometheus.io/scrape': 'true'},\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-model'},\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-service',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '788913',\n",
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-service',\n",
" 'uid': '555d8fc0-5848-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'cluster_ip': '10.35.254.103',\n",
" 'external_i_ps': None,\n",
" 'external_name': None,\n",
" 'external_traffic_policy': None,\n",
" 'health_check_node_port': None,\n",
" 'load_balancer_ip': None,\n",
" 'load_balancer_source_ranges': None,\n",
" 'ports': [{'name': 'grpc-tf-serving',\n",
" 'node_port': None,\n",
" 'port': 9000,\n",
" 'protocol': 'TCP',\n",
" 'target_port': 9000},\n",
" {'name': 'http-tf-serving',\n",
" 'node_port': None,\n",
" 'port': 8500,\n",
" 'protocol': 'TCP',\n",
" 'target_port': 8500}],\n",
" 'publish_not_ready_addresses': None,\n",
" 'selector': {'app': 'mnist-model'},\n",
" 'session_affinity': 'None',\n",
" 'session_affinity_config': None,\n",
" 'type': 'ClusterIP'},\n",
" 'status': {'load_balancer': {'ingress': None}}}, {'api_version': 'v1',\n",
" 'binary_data': None,\n",
" 'data': {'monitoring_config.txt': 'prometheus_config: {\\n'\n",
" ' enable: true,\\n'\n",
" ' path: \"/monitoring/prometheus/metrics\"\\n'\n",
" '}'},\n",
" 'kind': 'ConfigMap',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': None,\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-model',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '788914',\n",
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/configmaps/mnist-model',\n",
" 'uid': '5560bb37-5848-11ea-9ddf-42010a80013f'}}]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
]
},
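{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the flags in the Deployment spec, the model server listens for gRPC on port 9000 and for REST on port 8500. As a minimal sketch, the following builds (without sending) a request for the TensorFlow Serving REST predict endpoint, `/v1/models/<model_name>:predict`, using only the standard library. The helper name `build_predict_request`, the placeholder namespace, and the input shape are assumptions for illustration. Check the serving signature defined in `model.py` for the input format that the exported model expects:\n",
"\n",
"```python\n",
"import json\n",
"import urllib.request\n",
"\n",
"\n",
"def build_predict_request(service, namespace, instances, model_name='mnist', port=8500):\n",
"    # Build (but do not send) a POST request for the REST predict API.\n",
"    url = f'http://{service}.{namespace}.svc.cluster.local:{port}/v1/models/{model_name}:predict'\n",
"    body = json.dumps({'instances': instances}).encode('utf-8')\n",
"    headers = {'Content-Type': 'application/json'}\n",
"    return urllib.request.Request(url, data=body, headers=headers, method='POST')\n",
"\n",
"\n",
"# Placeholder input: one flattened 28x28 image of zeros (assumed shape).\n",
"request = build_predict_request('mnist-service', 'my-namespace', [[0.0] * 784])\n",
"print(request.full_url)\n",
"# To send it from inside the cluster: urllib.request.urlopen(request)\n",
"```\n",
"\n",
"The cluster-local host name only resolves from inside the cluster, for example from this notebook's pod."
]
},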
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploy the UI for the MNIST web app\n",
"\n",
"Deploy the UI to visualize the MNIST prediction results.\n",
"\n",
"This example uses a prebuilt and public Docker image for the UI."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"ui_name = \"mnist-ui\"\n",
"ui_deploy = f\"\"\"apiVersion: apps/v1\n",
"kind: Deployment\n",
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  replicas: 1\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-web-ui\n",
"  template:\n",
"    metadata:\n",
"      labels:\n",
"        app: mnist-web-ui\n",
"    spec:\n",
"      containers:\n",
"      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n",
"        name: web-ui\n",
"        ports:\n",
"        - containerPort: 5000\n",
"      serviceAccount: default-editor\n",
"\"\"\"\n",
"\n",
"ui_service = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: http-mnist-ui\n",
"    port: 80\n",
"    targetPort: 5000\n",
"  selector:\n",
"    app: mnist-web-ui\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
"kind: VirtualService\n",
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  gateways:\n",
"  - kubeflow/kubeflow-gateway\n",
"  hosts:\n",
"  - '*'\n",
"  http:\n",
"  - match:\n",
"    - uri:\n",
"        prefix: /mnist/{namespace}/ui/\n",
"    rewrite:\n",
"      uri: /\n",
"    route:\n",
"    - destination:\n",
"        host: {ui_name}.{namespace}.svc.cluster.local\n",
"        port:\n",
"          number: 80\n",
"    timeout: 300s\n",
"\"\"\"\n",
"\n",
"ui_specs = [ui_deploy, ui_service, ui_virtual_service]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Created Deployment kubeflow-sarahmaddox.mnist-ui\n",
"Created Service kubeflow-sarahmaddox.mnist-ui\n",
"Created VirtualService mnist-ui.mnist-ui\n"
]
},
{
"data": {
"text/plain": [
"[{'api_version': 'apps/v1',\n",
" 'kind': 'Deployment',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': 1,\n",
" 'initializers': None,\n",
" 'labels': None,\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-ui',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '790203',\n",
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-ui',\n",
" 'uid': '9d846bf6-5848-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'min_ready_seconds': None,\n",
" 'paused': None,\n",
" 'progress_deadline_seconds': 600,\n",
" 'replicas': 1,\n",
" 'revision_history_limit': 10,\n",
" 'selector': {'match_expressions': None,\n",
" 'match_labels': {'app': 'mnist-web-ui'}},\n",
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
" 'max_unavailable': '25%'},\n",
" 'type': 'RollingUpdate'},\n",
" 'template': {'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': None,\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': {'app': 'mnist-web-ui'},\n",
" 'managed_fields': None,\n",
" 'name': None,\n",
" 'namespace': None,\n",
" 'owner_references': None,\n",
" 'resource_version': None,\n",
" 'self_link': None,\n",
" 'uid': None},\n",
" 'spec': {'active_deadline_seconds': None,\n",
" 'affinity': None,\n",
" 'automount_service_account_token': None,\n",
" 'containers': [{'args': None,\n",
" 'command': None,\n",
" 'env': None,\n",
" 'env_from': None,\n",
" 'image': 'gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225',\n",
" 'image_pull_policy': 'IfNotPresent',\n",
" 'lifecycle': None,\n",
" 'liveness_probe': None,\n",
" 'name': 'web-ui',\n",
" 'ports': [{'container_port': 5000,\n",
" 'host_ip': None,\n",
" 'host_port': None,\n",
" 'name': None,\n",
" 'protocol': 'TCP'}],\n",
" 'readiness_probe': None,\n",
" 'resources': {'limits': None,\n",
" 'requests': None},\n",
" 'security_context': None,\n",
" 'stdin': None,\n",
" 'stdin_once': None,\n",
" 'termination_message_path': '/dev/termination-log',\n",
" 'termination_message_policy': 'File',\n",
" 'tty': None,\n",
" 'volume_devices': None,\n",
" 'volume_mounts': None,\n",
" 'working_dir': None}],\n",
" 'dns_config': None,\n",
" 'dns_policy': 'ClusterFirst',\n",
" 'enable_service_links': None,\n",
" 'host_aliases': None,\n",
" 'host_ipc': None,\n",
" 'host_network': None,\n",
" 'host_pid': None,\n",
" 'hostname': None,\n",
" 'image_pull_secrets': None,\n",
" 'init_containers': None,\n",
" 'node_name': None,\n",
" 'node_selector': None,\n",
" 'priority': None,\n",
" 'priority_class_name': None,\n",
" 'readiness_gates': None,\n",
" 'restart_policy': 'Always',\n",
" 'runtime_class_name': None,\n",
" 'scheduler_name': 'default-scheduler',\n",
" 'security_context': {'fs_group': None,\n",
" 'run_as_group': None,\n",
" 'run_as_non_root': None,\n",
" 'run_as_user': None,\n",
" 'se_linux_options': None,\n",
" 'supplemental_groups': None,\n",
" 'sysctls': None},\n",
" 'service_account': 'default-editor',\n",
" 'service_account_name': 'default-editor',\n",
" 'share_process_namespace': None,\n",
" 'subdomain': None,\n",
" 'termination_grace_period_seconds': 30,\n",
" 'tolerations': None,\n",
" 'volumes': None}}},\n",
" 'status': {'available_replicas': None,\n",
" 'collision_count': None,\n",
" 'conditions': None,\n",
" 'observed_generation': None,\n",
" 'ready_replicas': None,\n",
" 'replicas': None,\n",
" 'unavailable_replicas': None,\n",
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
" 'kind': 'Service',\n",
" 'metadata': {'annotations': None,\n",
" 'cluster_name': None,\n",
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),\n",
" 'deletion_grace_period_seconds': None,\n",
" 'deletion_timestamp': None,\n",
" 'finalizers': None,\n",
" 'generate_name': None,\n",
" 'generation': None,\n",
" 'initializers': None,\n",
" 'labels': None,\n",
" 'managed_fields': None,\n",
" 'name': 'mnist-ui',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'owner_references': None,\n",
" 'resource_version': '790209',\n",
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-ui',\n",
" 'uid': '9d8a67e4-5848-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'cluster_ip': '10.35.244.4',\n",
" 'external_i_ps': None,\n",
" 'external_name': None,\n",
" 'external_traffic_policy': None,\n",
" 'health_check_node_port': None,\n",
" 'load_balancer_ip': None,\n",
" 'load_balancer_source_ranges': None,\n",
" 'ports': [{'name': 'http-mnist-ui',\n",
" 'node_port': None,\n",
" 'port': 80,\n",
" 'protocol': 'TCP',\n",
" 'target_port': 5000}],\n",
" 'publish_not_ready_addresses': None,\n",
" 'selector': {'app': 'mnist-web-ui'},\n",
" 'session_affinity': 'None',\n",
" 'session_affinity_config': None,\n",
" 'type': 'ClusterIP'},\n",
" 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n",
" 'kind': 'VirtualService',\n",
" 'metadata': {'creationTimestamp': '2020-02-26T03:32:29Z',\n",
" 'generation': 1,\n",
" 'name': 'mnist-ui',\n",
" 'namespace': 'kubeflow-sarahmaddox',\n",
" 'resourceVersion': '790211',\n",
" 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-ui',\n",
" 'uid': '9d921512-5848-11ea-9ddf-42010a80013f'},\n",
" 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n",
" 'hosts': ['*'],\n",
" 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/ui/'}}],\n",
" 'rewrite': {'uri': '/'},\n",
" 'route': [{'destination': {'host': 'mnist-ui.kubeflow-sarahmaddox.svc.cluster.local',\n",
" 'port': {'number': 80}}}],\n",
" 'timeout': '300s'}]}}]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Access the MNIST web UI\n",
"\n",
"A reverse proxy route is automatically added to the Kubeflow IAP endpoint. The MNIST endpoint is:\n",
"\n",
" ```\n",
" https://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/\n",
" ```\n",
" \n",
"where `KUBEFLOW_ENDPOINT` is the domain of your Kubeflow deployment and `NAMESPACE` is the namespace where you're running the Jupyter notebook."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"mnist UI is at <a href='https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/ui/'>https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/ui/</a>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"if endpoint:\n",
"    vs = yaml.safe_load(ui_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    ui_endpoint = endpoint + path\n",
"    display(HTML(f\"mnist UI is at <a href='{ui_endpoint}'>{ui_endpoint}</a>\"))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the MNIST UI in your browser. You should see an image of a hand-written digit from 0 to 9. This is a random image sent to the model for classification. Below the image is a set of bar graphs, one for each classification label from 0 to 9, as output by the model. Each bar represents the probability that the image matches the respective label. \n",
"\n",
"Click the **test random image** button to send the model a new image."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Visit the [Kubeflow docs](https://www.kubeflow.org/docs/gke/) for more information about running Kubeflow on GCP."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}