mirror of https://github.com/kubeflow/examples.git
1857 lines
82 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# MNIST end to end on Kubeflow on GKE\n",
|
|
"\n",
|
|
"This example guides you through:\n",
|
|
" \n",
|
|
" 1. Taking an example TensorFlow model and modifying it to support distributed training.\n",
|
|
" 1. Serving the resulting model using TFServing.\n",
|
|
" 1. Deploying and using a web app that sends prediction requests to the model.\n",
|
|
" \n",
|
|
"## Requirements\n",
|
|
"\n",
" * You must be running Kubeflow 1.0 on Google Kubernetes Engine (GKE) with Cloud Identity-Aware Proxy (Cloud IAP). See the guide to [deploying Kubeflow on GCP](https://www.kubeflow.org/docs/gke/deploy/).\n",
|
|
" * Run this notebook within your Kubeflow cluster. See the guide to [setting up your Kubeflow notebooks](https://www.kubeflow.org/docs/components/notebooks/setup/).\n",
|
|
" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Prepare model\n",
|
|
"\n",
"Existing distributed MNIST examples need a few changes before they run well as a TFJob.\n",
|
|
"\n",
|
|
"Basically, you must:\n",
|
|
"\n",
|
|
"* Add options in order to make the model configurable.\n",
|
|
"* Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n",
|
|
"* Define serving signatures for model serving.\n",
|
|
"\n",
|
|
"This tutorial provides a Python program that's already prepared for you: [model.py](model.py)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Verify that you have a Google Cloud Platform (GCP) account\n",
|
|
"\n",
|
|
"The cell below checks that this notebook was spawned with credentials to access GCP.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import uuid\n",
|
|
"from importlib import reload\n",
|
|
"from oauth2client.client import GoogleCredentials\n",
|
|
"credentials = GoogleCredentials.get_application_default()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Install the required libraries\n",
|
|
"\n",
"Run the next cell to install and import the libraries required to train this model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"pip installing requirements.txt\n",
|
|
"Cloning the tf-operator repo\n",
|
|
"Checkout kubeflow/tf-operator @9238906\n",
|
|
"Adding /home/jovyan/.local/lib/python3.6/site-packages to python path\n",
|
|
"Adding /home/jovyan/git_tf-operator/sdk/python to python path\n",
|
|
"Configure docker credentials\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import notebook_setup\n",
|
|
"reload(notebook_setup)\n",
|
|
"notebook_setup.notebook_setup()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Wait for the message `Configure docker credentials` before moving on to the next cell."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import k8s_util\n",
|
|
"# Force a reload of Kubeflow. Since Kubeflow is a multi namespace module,\n",
|
|
"# doing the reload in notebook_setup may not be sufficient.\n",
|
|
"import kubeflow\n",
|
|
"reload(kubeflow)\n",
|
|
"from kubernetes import client as k8s_client\n",
|
|
"from kubernetes import config as k8s_config\n",
|
|
"from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n",
|
|
"from IPython.core.display import display, HTML\n",
|
|
"import yaml"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Configure a Docker registry for Kubeflow Fairing\n",
|
|
"\n",
|
|
"* In order to build Docker images from your notebook, you need a Docker registry to store the images.\n",
|
|
"* Below you set some variables specifying a [Container Registry](https://cloud.google.com/container-registry/docs/).\n",
|
|
"* Kubeflow Fairing provides a utility function to guess the name of your GCP project."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Running in project kubeflow-writers\n",
|
|
"Running in namespace kubeflow-sarahmaddox\n",
|
|
"Using Docker registry gcr.io/kubeflow-writers/fairing-job\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from kubernetes import client as k8s_client\n",
|
|
"from kubernetes.client import rest as k8s_rest\n",
|
|
"from kubeflow import fairing \n",
|
|
"from kubeflow.fairing import utils as fairing_utils\n",
|
|
"from kubeflow.fairing.builders import append\n",
|
|
"from kubeflow.fairing.deployers import job\n",
|
|
"from kubeflow.fairing.preprocessors import base as base_preprocessor\n",
|
|
"\n",
|
|
"# Setting up Google Container Registry (GCR) for storing output containers.\n",
|
|
"# You can use any Docker container registry instead of GCR.\n",
|
|
"GCP_PROJECT = fairing.cloud.gcp.guess_project_name()\n",
|
|
"DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)\n",
|
|
"namespace = fairing_utils.get_current_k8s_namespace()\n",
|
|
"\n",
|
|
"logging.info(f\"Running in project {GCP_PROJECT}\")\n",
|
|
"logging.info(f\"Running in namespace {namespace}\")\n",
|
|
"logging.info(f\"Using Docker registry {DOCKER_REGISTRY}\")\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Use Kubeflow Fairing to build the Docker image\n",
|
|
"\n",
|
|
"This notebook uses Kubeflow Fairing's kaniko builder to build a Docker image that includes all your dependencies.\n",
|
|
" * You use kaniko because you want to be able to run `pip` to install dependencies.\n",
|
|
" * Kaniko gives you the flexibility to build images from Dockerfiles."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n",
|
|
"# Kaniko image is updated to a newer image than 0.7.0.\n",
|
|
"from kubeflow.fairing import constants\n",
|
|
"constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"set()"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from kubeflow.fairing.builders import cluster\n",
|
|
"\n",
"# output_map is a map of extra files to add to the Docker build context.\n",
"# It maps each source location to its location inside the context.\n",
|
|
"output_map = {\n",
|
|
" \"Dockerfile.model\": \"Dockerfile\",\n",
|
|
" \"model.py\": \"model.py\"\n",
|
|
"}\n",
|
|
"\n",
|
|
"\n",
|
|
"preprocessor = base_preprocessor.BasePreProcessor(\n",
|
|
" command=[\"python\"], # The base class will set this.\n",
|
|
" input_files=[],\n",
|
|
" path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n",
|
|
" output_map=output_map)\n",
|
|
"\n",
|
|
"preprocessor.preprocess()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Run the next cell and wait until you see a message like `Built image gcr.io/<your-project>/fairing-job/mnist:<1234567>`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Building image using cluster builder.\n",
|
|
"Creating docker context: /tmp/fairing_context_ohm2nlbv\n",
|
|
"Dockerfile already exists in Fairing context, skipping...\n",
|
|
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
|
|
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
|
|
"Waiting for fairing-builder-9vw9w-ndbhd to start...\n",
|
|
"Pod started running True\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"ERROR: logging before flag.Parse: E0226 02:34:42.750776 1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value\n",
|
|
"\u001b[36mINFO\u001b[0m[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n",
|
|
"\u001b[36mINFO\u001b[0m[0004] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n",
|
|
"\u001b[36mINFO\u001b[0m[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
|
|
"ERROR: logging before flag.Parse: E0226 02:34:44.230593 1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg\n",
|
|
"ERROR: logging before flag.Parse: E0226 02:34:44.233477 1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url\n",
|
|
"\u001b[36mINFO\u001b[0m[0004] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n",
|
|
"\u001b[36mINFO\u001b[0m[0004] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Built cross stage deps: map[]\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Downloading base image tensorflow/tensorflow:1.15.2-py3\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Using files from context: [/kaniko/buildcontext/model.py]\n",
|
|
"\u001b[36mINFO\u001b[0m[0005] Checking for cached layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf...\n",
|
|
"\u001b[36mINFO\u001b[0m[0006] No cached layer found for cmd RUN chmod +x /opt/model.py\n",
|
|
"\u001b[36mINFO\u001b[0m[0006] Unpacking rootfs as cmd RUN chmod +x /opt/model.py requires it.\n",
|
|
"\u001b[36mINFO\u001b[0m[0029] Taking snapshot of full filesystem...\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] Using files from context: [/kaniko/buildcontext/model.py]\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] ADD model.py /opt/model.py\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] Taking snapshot of files...\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] RUN chmod +x /opt/model.py\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] cmd: /bin/sh\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] args: [-c chmod +x /opt/model.py]\n",
|
|
"\u001b[36mINFO\u001b[0m[0042] Taking snapshot of full filesystem...\n",
|
|
"\u001b[36mINFO\u001b[0m[0045] ENTRYPOINT [\"/usr/bin/python\"]\n",
|
|
"\u001b[36mINFO\u001b[0m[0045] Pushing layer gcr.io/kubeflow-writers/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf to cache now\n",
|
|
"\u001b[36mINFO\u001b[0m[0045] No files changed in this command, skipping snapshotting.\n",
|
|
"\u001b[36mINFO\u001b[0m[0045] CMD [\"/opt/model.py\"]\n",
|
|
"\u001b[36mINFO\u001b[0m[0045] No files changed in this command, skipping snapshotting.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Built image gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"# Use a TensorFlow image as the base image.\n",
"# The base image is set in a custom Dockerfile.\n",
|
|
"cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n",
|
|
" base_image=\"\", # base_image is set in the Dockerfile\n",
|
|
" preprocessor=preprocessor,\n",
|
|
" image_name=\"mnist\",\n",
|
|
" dockerfile_path=\"Dockerfile\",\n",
|
|
" pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],\n",
|
|
" context_source=cluster.gcs_context.GCSContextSource())\n",
|
|
"cluster_builder.build()\n",
|
|
"logging.info(f\"Built image {cluster_builder.image_tag}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create a Cloud Storage bucket\n",
|
|
"\n",
|
|
"Run the next cell to create a Google Cloud Storage (GCS) bucket to store your models and other results.\n",
|
|
"\n",
|
|
"Since this notebook is running in Python, the cell uses the GCS Python client libraries, but you can use the `gsutil` command line instead."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Creating bucket kubeflow-writers-mnist\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from google.cloud import storage\n",
|
|
"bucket = f\"{GCP_PROJECT}-mnist\"\n",
|
|
"\n",
|
|
"client = storage.Client()\n",
|
|
"b = storage.Bucket(client=client, name=bucket)\n",
|
|
"\n",
|
|
"if not b.exists():\n",
|
|
" logging.info(f\"Creating bucket {bucket}\")\n",
|
|
" b.create()\n",
|
|
"else:\n",
|
|
" logging.info(f\"Bucket {bucket} already exists\") "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Distributed training\n",
|
|
"\n",
|
|
"To train the model, this example uses [TFJob](https://www.kubeflow.org/docs/components/training/tftraining/) to run a distributed training job. Run the next cell to set up the YAML specification for the job:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n",
|
|
"num_ps = 1\n",
|
|
"num_workers = 2\n",
|
|
"model_dir = f\"gs://{bucket}/mnist\"\n",
|
|
"export_path = f\"gs://{bucket}/mnist/export\" \n",
|
|
"train_steps = 200\n",
|
|
"batch_size = 100\n",
|
|
"learning_rate = .01\n",
|
|
"image = cluster_builder.image_tag\n",
|
|
"\n",
|
|
"train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n",
|
|
"kind: TFJob\n",
|
|
"metadata:\n",
|
|
" name: {train_name} \n",
|
|
"spec:\n",
|
|
" tfReplicaSpecs:\n",
|
|
" Ps:\n",
|
|
" replicas: {num_ps}\n",
|
|
" template:\n",
|
|
" metadata:\n",
|
|
" annotations:\n",
|
|
" sidecar.istio.io/inject: \"false\"\n",
|
|
" spec:\n",
|
|
" serviceAccount: default-editor\n",
|
|
" containers:\n",
|
|
" - name: tensorflow\n",
|
|
" command:\n",
|
|
" - python\n",
|
|
" - /opt/model.py\n",
|
|
" - --tf-model-dir={model_dir}\n",
|
|
" - --tf-export-dir={export_path}\n",
|
|
" - --tf-train-steps={train_steps}\n",
|
|
" - --tf-batch-size={batch_size}\n",
|
|
" - --tf-learning-rate={learning_rate}\n",
|
|
" image: {image}\n",
|
|
" workingDir: /opt\n",
|
|
" restartPolicy: OnFailure\n",
|
|
" Chief:\n",
|
|
" replicas: 1\n",
|
|
" template:\n",
|
|
" metadata:\n",
|
|
" annotations:\n",
|
|
" sidecar.istio.io/inject: \"false\"\n",
|
|
" spec:\n",
|
|
" serviceAccount: default-editor\n",
|
|
" containers:\n",
|
|
" - name: tensorflow\n",
|
|
" command:\n",
|
|
" - python\n",
|
|
" - /opt/model.py\n",
|
|
" - --tf-model-dir={model_dir}\n",
|
|
" - --tf-export-dir={export_path}\n",
|
|
" - --tf-train-steps={train_steps}\n",
|
|
" - --tf-batch-size={batch_size}\n",
|
|
" - --tf-learning-rate={learning_rate}\n",
|
|
" image: {image}\n",
|
|
" workingDir: /opt\n",
|
|
" restartPolicy: OnFailure\n",
|
|
" Worker:\n",
|
|
" replicas: 1\n",
|
|
" template:\n",
|
|
" metadata:\n",
|
|
" annotations:\n",
|
|
" sidecar.istio.io/inject: \"false\"\n",
|
|
" spec:\n",
|
|
" serviceAccount: default-editor\n",
|
|
" containers:\n",
|
|
" - name: tensorflow\n",
|
|
" command:\n",
|
|
" - python\n",
|
|
" - /opt/model.py\n",
|
|
" - --tf-model-dir={model_dir}\n",
|
|
" - --tf-export-dir={export_path}\n",
|
|
" - --tf-train-steps={train_steps}\n",
|
|
" - --tf-batch-size={batch_size}\n",
|
|
" - --tf-learning-rate={learning_rate}\n",
|
|
" image: {image}\n",
|
|
" workingDir: /opt\n",
|
|
" restartPolicy: OnFailure\n",
|
|
"\"\"\" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create the training job\n",
|
|
"\n",
"To submit the training job, you could write the spec to a YAML file and then run `kubectl apply -f {FILE}`.\n",
|
|
"\n",
|
|
"However, because you are running in a Jupyter notebook, you use the TFJob client. \n",
|
|
"* You run the TFJob in a namespace created by a Kubeflow profile.\n",
|
|
"* The namespace is the same as the namespace where you are running the notebook.\n",
|
|
"* Creating a profile ensures that the namespace is provisioned with service accounts and other resources needed for Kubeflow."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tf_job_client = tf_job_client_module.TFJobClient()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Created job kubeflow-sarahmaddox.mnist-train-289e\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"tf_job_body = yaml.safe_load(train_spec)\n",
|
|
"tf_job = tf_job_client.create(tf_job_body, namespace=namespace) \n",
|
|
"\n",
|
|
"logging.info(f\"Created job {namespace}.{train_name}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Check the job using kubectl\n",
|
|
"\n",
"Above you used the Python SDK for TFJob to create the job. You can also use kubectl to get the status of your job.\n",
"The job conditions tell you whether the job is running, succeeded, or failed."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"apiVersion: kubeflow.org/v1\r\n",
|
|
"kind: TFJob\r\n",
|
|
"metadata:\r\n",
|
|
" creationTimestamp: \"2020-02-26T02:58:32Z\"\r\n",
|
|
" generation: 1\r\n",
|
|
" name: mnist-train-289e\r\n",
|
|
" namespace: kubeflow-sarahmaddox\r\n",
|
|
" resourceVersion: \"770252\"\r\n",
|
|
" selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-sarahmaddox/tfjobs/mnist-train-289e\r\n",
|
|
" uid: dfa23ecf-5843-11ea-9ddf-42010a80013f\r\n",
|
|
"spec:\r\n",
|
|
" tfReplicaSpecs:\r\n",
|
|
" Chief:\r\n",
|
|
" replicas: 1\r\n",
|
|
" template:\r\n",
|
|
" metadata:\r\n",
|
|
" annotations:\r\n",
|
|
" sidecar.istio.io/inject: \"false\"\r\n",
|
|
" spec:\r\n",
|
|
" containers:\r\n",
|
|
" - command:\r\n",
|
|
" - python\r\n",
|
|
" - /opt/model.py\r\n",
|
|
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
|
|
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
|
|
" - --tf-train-steps=200\r\n",
|
|
" - --tf-batch-size=100\r\n",
|
|
" - --tf-learning-rate=0.01\r\n",
|
|
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
|
|
" name: tensorflow\r\n",
|
|
" workingDir: /opt\r\n",
|
|
" restartPolicy: OnFailure\r\n",
|
|
" serviceAccount: default-editor\r\n",
|
|
" Ps:\r\n",
|
|
" replicas: 1\r\n",
|
|
" template:\r\n",
|
|
" metadata:\r\n",
|
|
" annotations:\r\n",
|
|
" sidecar.istio.io/inject: \"false\"\r\n",
|
|
" spec:\r\n",
|
|
" containers:\r\n",
|
|
" - command:\r\n",
|
|
" - python\r\n",
|
|
" - /opt/model.py\r\n",
|
|
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
|
|
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
|
|
" - --tf-train-steps=200\r\n",
|
|
" - --tf-batch-size=100\r\n",
|
|
" - --tf-learning-rate=0.01\r\n",
|
|
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
|
|
" name: tensorflow\r\n",
|
|
" workingDir: /opt\r\n",
|
|
" restartPolicy: OnFailure\r\n",
|
|
" serviceAccount: default-editor\r\n",
|
|
" Worker:\r\n",
|
|
" replicas: 1\r\n",
|
|
" template:\r\n",
|
|
" metadata:\r\n",
|
|
" annotations:\r\n",
|
|
" sidecar.istio.io/inject: \"false\"\r\n",
|
|
" spec:\r\n",
|
|
" containers:\r\n",
|
|
" - command:\r\n",
|
|
" - python\r\n",
|
|
" - /opt/model.py\r\n",
|
|
" - --tf-model-dir=gs://kubeflow-writers-mnist/mnist\r\n",
|
|
" - --tf-export-dir=gs://kubeflow-writers-mnist/mnist/export\r\n",
|
|
" - --tf-train-steps=200\r\n",
|
|
" - --tf-batch-size=100\r\n",
|
|
" - --tf-learning-rate=0.01\r\n",
|
|
" image: gcr.io/kubeflow-writers/fairing-job/mnist:8310D75B\r\n",
|
|
" name: tensorflow\r\n",
|
|
" workingDir: /opt\r\n",
|
|
" restartPolicy: OnFailure\r\n",
|
|
" serviceAccount: default-editor\r\n",
|
|
"status:\r\n",
|
|
" completionTime: \"2020-02-26T02:59:58Z\"\r\n",
|
|
" conditions:\r\n",
|
|
" - lastTransitionTime: \"2020-02-26T02:58:32Z\"\r\n",
|
|
" lastUpdateTime: \"2020-02-26T02:58:32Z\"\r\n",
|
|
" message: TFJob mnist-train-289e is created.\r\n",
|
|
" reason: TFJobCreated\r\n",
|
|
" status: \"True\"\r\n",
|
|
" type: Created\r\n",
|
|
" - lastTransitionTime: \"2020-02-26T02:58:35Z\"\r\n",
|
|
" lastUpdateTime: \"2020-02-26T02:58:35Z\"\r\n",
|
|
" message: TFJob mnist-train-289e is running.\r\n",
|
|
" reason: TFJobRunning\r\n",
|
|
" status: \"False\"\r\n",
|
|
" type: Running\r\n",
|
|
" - lastTransitionTime: \"2020-02-26T02:59:58Z\"\r\n",
|
|
" lastUpdateTime: \"2020-02-26T02:59:58Z\"\r\n",
|
|
" message: TFJob mnist-train-289e successfully completed.\r\n",
|
|
" reason: TFJobSucceeded\r\n",
|
|
" status: \"True\"\r\n",
|
|
" type: Succeeded\r\n",
|
|
" replicaStatuses:\r\n",
|
|
" Chief:\r\n",
|
|
" succeeded: 1\r\n",
|
|
" PS:\r\n",
|
|
" succeeded: 1\r\n",
|
|
" Worker:\r\n",
|
|
" succeeded: 1\r\n",
|
|
" startTime: \"2020-02-26T02:58:32Z\"\r\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"!kubectl get tfjobs -o yaml {train_name}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Get the training logs\n",
|
|
"\n",
|
|
"* There are two ways to get the logs for the training job:\n",
|
|
"\n",
|
|
" * Using kubectl to fetch the pod logs. These logs are ephemeral; they will be unavailable when the pod is garbage collected to free up resources.\n",
|
|
" * Using Stackdriver.\n",
|
|
" \n",
|
|
" * Kubernetes logs are automatically available in Stackdriver.\n",
|
|
" * You can use labels to locate the logs for a specific pod.\n",
|
|
" * In the cell below, you use labels for the training job name and process type to locate the logs for a specific pod.\n",
|
|
" \n",
|
|
"* Run the cell below to get a link to Stackdriver for your logs:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22chief%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>chief logs</a>"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.HTML object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22worker%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>worker logs</a>"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.HTML object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"Link to: <a href='https://console.cloud.google.com/logs/viewer?project=kubeflow-writers&interval=P7D&advancedFilter=resource.type%3D%22k8s_container%22++++%0A++++labels.%22k8s-pod%2Ftf-job-name%22+%3D+%22mnist-train-289e%22%0A++++labels.%22k8s-pod%2Ftf-replica-type%22+%3D+%22ps%22++++%0A++++resource.labels.container_name%3D%22tensorflow%22+'>ps logs</a>"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.HTML object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"from urllib.parse import urlencode\n",
|
|
"\n",
|
|
"for replica in [\"chief\", \"worker\", \"ps\"]: \n",
|
|
" logs_filter = f\"\"\"resource.type=\"k8s_container\" \n",
|
|
" labels.\"k8s-pod/tf-job-name\" = \"{train_name}\"\n",
|
|
" labels.\"k8s-pod/tf-replica-type\" = \"{replica}\" \n",
|
|
" resource.labels.container_name=\"tensorflow\" \"\"\"\n",
|
|
"\n",
|
|
" new_params = {'project': GCP_PROJECT,\n",
|
|
" # Logs for last 7 days\n",
|
|
" 'interval': 'P7D',\n",
|
|
" 'advancedFilter': logs_filter}\n",
|
|
"\n",
|
|
" query = urlencode(new_params)\n",
|
|
"\n",
|
|
" url = \"https://console.cloud.google.com/logs/viewer?\" + query\n",
|
|
"\n",
|
|
" display(HTML(f\"Link to: <a href='{url}'>{replica} logs</a>\"))\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy TensorBoard\n",
|
|
"\n",
|
|
"The next step is to create a Kubernetes deployment to run TensorBoard.\n",
|
|
"\n",
|
|
"TensorBoard will be accessible behind the Kubeflow IAP endpoint."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb_name = \"mnist-tensorboard\"\n",
|
|
"tb_deploy = f\"\"\"apiVersion: apps/v1\n",
|
|
"kind: Deployment\n",
|
|
"metadata:\n",
|
|
" labels:\n",
|
|
" app: mnist-tensorboard\n",
|
|
" name: {tb_name}\n",
|
|
" namespace: {namespace}\n",
|
|
"spec:\n",
|
|
" selector:\n",
|
|
" matchLabels:\n",
|
|
" app: mnist-tensorboard\n",
|
|
" template:\n",
|
|
" metadata:\n",
|
|
" labels:\n",
|
|
" app: mnist-tensorboard\n",
|
|
" version: v1\n",
|
|
" spec:\n",
|
|
" serviceAccount: default-editor\n",
|
|
" containers:\n",
|
|
" - command:\n",
|
|
" - /usr/local/bin/tensorboard\n",
|
|
" - --logdir={model_dir}\n",
|
|
" - --port=80\n",
|
|
" image: tensorflow/tensorflow:1.15.2-py3\n",
|
|
" name: tensorboard\n",
|
|
" ports:\n",
|
|
" - containerPort: 80\n",
|
|
"\"\"\"\n",
|
|
"tb_service = f\"\"\"apiVersion: v1\n",
|
|
"kind: Service\n",
|
|
"metadata:\n",
|
|
" labels:\n",
|
|
" app: mnist-tensorboard\n",
|
|
" name: {tb_name}\n",
|
|
" namespace: {namespace}\n",
|
|
"spec:\n",
|
|
" ports:\n",
|
|
" - name: http-tb\n",
|
|
" port: 80\n",
|
|
" targetPort: 80\n",
|
|
" selector:\n",
|
|
" app: mnist-tensorboard\n",
|
|
" type: ClusterIP\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
|
|
"kind: VirtualService\n",
|
|
"metadata:\n",
|
|
" name: {tb_name}\n",
|
|
" namespace: {namespace}\n",
|
|
"spec:\n",
|
|
" gateways:\n",
|
|
" - kubeflow/kubeflow-gateway\n",
|
|
" hosts:\n",
|
|
" - '*'\n",
|
|
" http:\n",
|
|
" - match:\n",
|
|
" - uri:\n",
|
|
" prefix: /mnist/{namespace}/tensorboard/\n",
|
|
" rewrite:\n",
|
|
" uri: /\n",
|
|
" route:\n",
|
|
" - destination:\n",
|
|
" host: {tb_name}.{namespace}.svc.cluster.local\n",
|
|
" port:\n",
|
|
" number: 80\n",
|
|
" timeout: 300s\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"tb_specs = [tb_deploy, tb_service, tb_virtual_service]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/home/jovyan/examples/mnist/k8s_util.py:55: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n",
|
|
" spec = yaml.load(spec)\n",
|
|
"Created Deployment kubeflow-sarahmaddox.mnist-tensorboard\n",
|
|
"Created Service kubeflow-sarahmaddox.mnist-tensorboard\n",
|
|
"Created VirtualService mnist-tensorboard.mnist-tensorboard\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[{'api_version': 'apps/v1',\n",
|
|
" 'kind': 'Deployment',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': 1,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-tensorboard'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-tensorboard',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '782392',\n",
|
|
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-tensorboard',\n",
|
|
" 'uid': 'e1d50153-5846-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'min_ready_seconds': None,\n",
|
|
" 'paused': None,\n",
|
|
" 'progress_deadline_seconds': 600,\n",
|
|
" 'replicas': 1,\n",
|
|
" 'revision_history_limit': 10,\n",
|
|
" 'selector': {'match_expressions': None,\n",
|
|
" 'match_labels': {'app': 'mnist-tensorboard'}},\n",
|
|
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
|
|
" 'max_unavailable': '25%'},\n",
|
|
" 'type': 'RollingUpdate'},\n",
|
|
" 'template': {'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': None,\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-tensorboard',\n",
|
|
" 'version': 'v1'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': None,\n",
|
|
" 'namespace': None,\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': None,\n",
|
|
" 'self_link': None,\n",
|
|
" 'uid': None},\n",
|
|
" 'spec': {'active_deadline_seconds': None,\n",
|
|
" 'affinity': None,\n",
|
|
" 'automount_service_account_token': None,\n",
|
|
" 'containers': [{'args': None,\n",
|
|
" 'command': ['/usr/local/bin/tensorboard',\n",
|
|
" '--logdir=gs://kubeflow-writers-mnist/mnist',\n",
|
|
" '--port=80'],\n",
|
|
" 'env': None,\n",
|
|
" 'env_from': None,\n",
|
|
" 'image': 'tensorflow/tensorflow:1.15.2-py3',\n",
|
|
" 'image_pull_policy': 'IfNotPresent',\n",
|
|
" 'lifecycle': None,\n",
|
|
" 'liveness_probe': None,\n",
|
|
" 'name': 'tensorboard',\n",
|
|
" 'ports': [{'container_port': 80,\n",
|
|
" 'host_ip': None,\n",
|
|
" 'host_port': None,\n",
|
|
" 'name': None,\n",
|
|
" 'protocol': 'TCP'}],\n",
|
|
" 'readiness_probe': None,\n",
|
|
" 'resources': {'limits': None,\n",
|
|
" 'requests': None},\n",
|
|
" 'security_context': None,\n",
|
|
" 'stdin': None,\n",
|
|
" 'stdin_once': None,\n",
|
|
" 'termination_message_path': '/dev/termination-log',\n",
|
|
" 'termination_message_policy': 'File',\n",
|
|
" 'tty': None,\n",
|
|
" 'volume_devices': None,\n",
|
|
" 'volume_mounts': None,\n",
|
|
" 'working_dir': None}],\n",
|
|
" 'dns_config': None,\n",
|
|
" 'dns_policy': 'ClusterFirst',\n",
|
|
" 'enable_service_links': None,\n",
|
|
" 'host_aliases': None,\n",
|
|
" 'host_ipc': None,\n",
|
|
" 'host_network': None,\n",
|
|
" 'host_pid': None,\n",
|
|
" 'hostname': None,\n",
|
|
" 'image_pull_secrets': None,\n",
|
|
" 'init_containers': None,\n",
|
|
" 'node_name': None,\n",
|
|
" 'node_selector': None,\n",
|
|
" 'priority': None,\n",
|
|
" 'priority_class_name': None,\n",
|
|
" 'readiness_gates': None,\n",
|
|
" 'restart_policy': 'Always',\n",
|
|
" 'runtime_class_name': None,\n",
|
|
" 'scheduler_name': 'default-scheduler',\n",
|
|
" 'security_context': {'fs_group': None,\n",
|
|
" 'run_as_group': None,\n",
|
|
" 'run_as_non_root': None,\n",
|
|
" 'run_as_user': None,\n",
|
|
" 'se_linux_options': None,\n",
|
|
" 'supplemental_groups': None,\n",
|
|
" 'sysctls': None},\n",
|
|
" 'service_account': 'default-editor',\n",
|
|
" 'service_account_name': 'default-editor',\n",
|
|
" 'share_process_namespace': None,\n",
|
|
" 'subdomain': None,\n",
|
|
" 'termination_grace_period_seconds': 30,\n",
|
|
" 'tolerations': None,\n",
|
|
" 'volumes': None}}},\n",
|
|
" 'status': {'available_replicas': None,\n",
|
|
" 'collision_count': None,\n",
|
|
" 'conditions': None,\n",
|
|
" 'observed_generation': None,\n",
|
|
" 'ready_replicas': None,\n",
|
|
" 'replicas': None,\n",
|
|
" 'unavailable_replicas': None,\n",
|
|
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
|
|
" 'kind': 'Service',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 20, 4, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-tensorboard'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-tensorboard',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '782395',\n",
|
|
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-tensorboard',\n",
|
|
" 'uid': 'e1d7b041-5846-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'cluster_ip': '10.35.253.170',\n",
|
|
" 'external_i_ps': None,\n",
|
|
" 'external_name': None,\n",
|
|
" 'external_traffic_policy': None,\n",
|
|
" 'health_check_node_port': None,\n",
|
|
" 'load_balancer_ip': None,\n",
|
|
" 'load_balancer_source_ranges': None,\n",
|
|
" 'ports': [{'name': 'http-tb',\n",
|
|
" 'node_port': None,\n",
|
|
" 'port': 80,\n",
|
|
" 'protocol': 'TCP',\n",
|
|
" 'target_port': 80}],\n",
|
|
" 'publish_not_ready_addresses': None,\n",
|
|
" 'selector': {'app': 'mnist-tensorboard'},\n",
|
|
" 'session_affinity': 'None',\n",
|
|
" 'session_affinity_config': None,\n",
|
|
" 'type': 'ClusterIP'},\n",
|
|
" 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n",
|
|
" 'kind': 'VirtualService',\n",
|
|
" 'metadata': {'creationTimestamp': '2020-02-26T03:20:04Z',\n",
|
|
" 'generation': 1,\n",
|
|
" 'name': 'mnist-tensorboard',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'resourceVersion': '782396',\n",
|
|
" 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-tensorboard',\n",
|
|
" 'uid': 'e1daadfe-5846-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n",
|
|
" 'hosts': ['*'],\n",
|
|
" 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/tensorboard/'}}],\n",
|
|
" 'rewrite': {'uri': '/'},\n",
|
|
" 'route': [{'destination': {'host': 'mnist-tensorboard.kubeflow-sarahmaddox.svc.cluster.local',\n",
|
|
" 'port': {'number': 80}}}],\n",
|
|
" 'timeout': '300s'}]}}]"
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Set a variable defining your endpoint\n",
|
|
"\n",
|
|
"Set `endpoint` to `https://your-domain` (with no slash at the end). Your domain typically has the following pattern: `<your-kubeflow-deployment-name>.endpoints.<your-gcp-project>.cloud.goog`. You can see your domain in the URL that you're using to access this notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"endpoint set to https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"endpoint = None\n",
"\n",
"if endpoint:\n",
"    logging.info(f\"endpoint set to {endpoint}\")\n",
"else:\n",
"    logging.warning(\"You must set endpoint in order to print out the URLs where you can access your web apps.\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Access the TensorBoard UI\n",
|
|
"\n",
|
|
"Run the cell below to find the endpoint for the TensorBoard UI."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"TensorBoard UI is at <a href='https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/tensorboard/'>https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/tensorboard/</a>"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.HTML object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
"if endpoint:\n",
"    vs = yaml.safe_load(tb_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    tb_endpoint = endpoint + path\n",
"    display(HTML(f\"TensorBoard UI is at <a href='{tb_endpoint}'>{tb_endpoint}</a>\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Wait for the training job to finish"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can use the TFJob client to wait for the job to finish:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"TFJob kubeflow-sarahmaddox.mnist-train-289e succeeded\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=[\"Succeeded\", \"Failed\"], namespace=namespace)\n",
"\n",
"if tf_job_client.is_job_succeeded(train_name, namespace):\n",
"    logging.info(f\"TFJob {namespace}.{train_name} succeeded\")\n",
"else:\n",
"    raise ValueError(f\"TFJob {namespace}.{train_name} failed\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Serve the model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now you can deploy the model using [TensorFlow Serving](https://www.kubeflow.org/docs/components/serving/tfserving_new/).\n",
|
|
"\n",
|
|
"You need to create the following:\n",
|
|
"* A Kubernetes deployment.\n",
|
|
"* A Kubernetes service.\n",
|
|
"* (Optional) A configmap containing the Prometheus monitoring configuration."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"deploy_name = \"mnist-model\"\n",
|
|
"model_base_path = export_path\n",
|
|
"\n",
|
|
"# The web UI defaults to mnist-service so if you change the name, you must\n",
|
|
"# change it in the UI as well.\n",
|
|
"model_service = \"mnist-service\"\n",
|
|
"\n",
|
"deploy_spec = f\"\"\"apiVersion: apps/v1\n",
"kind: Deployment\n",
"metadata:\n",
"  labels:\n",
"    app: mnist\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-model\n",
"  template:\n",
"    metadata:\n",
"      # TODO(jlewi): Right now we disable the Istio sidecar because otherwise Istio RBAC will prevent the\n",
"      # UI from sending RPCs to the server. We should create an appropriate Istio RBAC authorization\n",
"      # policy to allow traffic from the UI to the model server.\n",
"      # https://istio.io/docs/concepts/security/#target-selectors\n",
"      annotations:\n",
"        sidecar.istio.io/inject: \"false\"\n",
"      labels:\n",
"        app: mnist-model\n",
"        version: v1\n",
"    spec:\n",
"      serviceAccount: default-editor\n",
"      containers:\n",
"      - args:\n",
"        - --port=9000\n",
"        - --rest_api_port=8500\n",
"        - --model_name=mnist\n",
"        - --model_base_path={model_base_path}\n",
"        - --monitoring_config_file=/var/config/monitoring_config.txt\n",
"        command:\n",
"        - /usr/bin/tensorflow_model_server\n",
"        env:\n",
"        - name: modelBasePath\n",
"          value: {model_base_path}\n",
"        image: tensorflow/serving:1.15.0\n",
"        imagePullPolicy: IfNotPresent\n",
"        livenessProbe:\n",
"          initialDelaySeconds: 30\n",
"          periodSeconds: 30\n",
"          tcpSocket:\n",
"            port: 9000\n",
"        name: mnist\n",
"        ports:\n",
"        - containerPort: 9000\n",
"        - containerPort: 8500\n",
"        resources:\n",
"          limits:\n",
"            cpu: \"4\"\n",
"            memory: 4Gi\n",
"          requests:\n",
"            cpu: \"1\"\n",
"            memory: 1Gi\n",
"        volumeMounts:\n",
"        - mountPath: /var/config/\n",
"          name: model-config\n",
"      volumes:\n",
"      - configMap:\n",
"          name: {deploy_name}\n",
"        name: model-config\n",
"\"\"\"\n",
|
|
"\n",
|
"service_spec = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"    prometheus.io/path: /monitoring/prometheus/metrics\n",
"    prometheus.io/port: \"8500\"\n",
"    prometheus.io/scrape: \"true\"\n",
"  labels:\n",
"    app: mnist-model\n",
"  name: {model_service}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: grpc-tf-serving\n",
"    port: 9000\n",
"    targetPort: 9000\n",
"  - name: http-tf-serving\n",
"    port: 8500\n",
"    targetPort: 8500\n",
"  selector:\n",
"    app: mnist-model\n",
"  type: ClusterIP\n",
"\"\"\"\n",
|
|
"\n",
|
"monitoring_config = f\"\"\"kind: ConfigMap\n",
"apiVersion: v1\n",
"metadata:\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"data:\n",
"  monitoring_config.txt: |-\n",
"    prometheus_config: {{\n",
"      enable: true,\n",
"      path: \"/monitoring/prometheus/metrics\"\n",
"    }}\n",
"\"\"\"\n",
|
|
"\n",
|
|
"model_specs = [deploy_spec, service_spec, monitoring_config]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Created Deployment kubeflow-sarahmaddox.mnist-model\n",
|
|
"Created Service kubeflow-sarahmaddox.mnist-service\n",
|
|
"Created ConfigMap kubeflow-sarahmaddox.mnist-model\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[{'api_version': 'apps/v1',\n",
|
|
" 'kind': 'Deployment',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': 1,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-model',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '788910',\n",
|
|
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-model',\n",
|
|
" 'uid': '5555d458-5848-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'min_ready_seconds': None,\n",
|
|
" 'paused': None,\n",
|
|
" 'progress_deadline_seconds': 600,\n",
|
|
" 'replicas': 1,\n",
|
|
" 'revision_history_limit': 10,\n",
|
|
" 'selector': {'match_expressions': None,\n",
|
|
" 'match_labels': {'app': 'mnist-model'}},\n",
|
|
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
|
|
" 'max_unavailable': '25%'},\n",
|
|
" 'type': 'RollingUpdate'},\n",
|
|
" 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': None,\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-model',\n",
|
|
" 'version': 'v1'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': None,\n",
|
|
" 'namespace': None,\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': None,\n",
|
|
" 'self_link': None,\n",
|
|
" 'uid': None},\n",
|
|
" 'spec': {'active_deadline_seconds': None,\n",
|
|
" 'affinity': None,\n",
|
|
" 'automount_service_account_token': None,\n",
|
|
" 'containers': [{'args': ['--port=9000',\n",
|
|
" '--rest_api_port=8500',\n",
|
|
" '--model_name=mnist',\n",
|
|
" '--model_base_path=gs://kubeflow-writers-mnist/mnist/export',\n",
|
|
" '--monitoring_config_file=/var/config/monitoring_config.txt'],\n",
|
|
" 'command': ['/usr/bin/tensorflow_model_server'],\n",
|
|
" 'env': [{'name': 'modelBasePath',\n",
|
|
" 'value': 'gs://kubeflow-writers-mnist/mnist/export',\n",
|
|
" 'value_from': None}],\n",
|
|
" 'env_from': None,\n",
|
|
" 'image': 'tensorflow/serving:1.15.0',\n",
|
|
" 'image_pull_policy': 'IfNotPresent',\n",
|
|
" 'lifecycle': None,\n",
|
|
" 'liveness_probe': {'_exec': None,\n",
|
|
" 'failure_threshold': 3,\n",
|
|
" 'http_get': None,\n",
|
|
" 'initial_delay_seconds': 30,\n",
|
|
" 'period_seconds': 30,\n",
|
|
" 'success_threshold': 1,\n",
|
|
" 'tcp_socket': {'host': None,\n",
|
|
" 'port': 9000},\n",
|
|
" 'timeout_seconds': 1},\n",
|
|
" 'name': 'mnist',\n",
|
|
" 'ports': [{'container_port': 9000,\n",
|
|
" 'host_ip': None,\n",
|
|
" 'host_port': None,\n",
|
|
" 'name': None,\n",
|
|
" 'protocol': 'TCP'},\n",
|
|
" {'container_port': 8500,\n",
|
|
" 'host_ip': None,\n",
|
|
" 'host_port': None,\n",
|
|
" 'name': None,\n",
|
|
" 'protocol': 'TCP'}],\n",
|
|
" 'readiness_probe': None,\n",
|
|
" 'resources': {'limits': {'cpu': '4',\n",
|
|
" 'memory': '4Gi'},\n",
|
|
" 'requests': {'cpu': '1',\n",
|
|
" 'memory': '1Gi'}},\n",
|
|
" 'security_context': None,\n",
|
|
" 'stdin': None,\n",
|
|
" 'stdin_once': None,\n",
|
|
" 'termination_message_path': '/dev/termination-log',\n",
|
|
" 'termination_message_policy': 'File',\n",
|
|
" 'tty': None,\n",
|
|
" 'volume_devices': None,\n",
|
|
" 'volume_mounts': [{'mount_path': '/var/config/',\n",
|
|
" 'mount_propagation': None,\n",
|
|
" 'name': 'model-config',\n",
|
|
" 'read_only': None,\n",
|
|
" 'sub_path': None,\n",
|
|
" 'sub_path_expr': None}],\n",
|
|
" 'working_dir': None}],\n",
|
|
" 'dns_config': None,\n",
|
|
" 'dns_policy': 'ClusterFirst',\n",
|
|
" 'enable_service_links': None,\n",
|
|
" 'host_aliases': None,\n",
|
|
" 'host_ipc': None,\n",
|
|
" 'host_network': None,\n",
|
|
" 'host_pid': None,\n",
|
|
" 'hostname': None,\n",
|
|
" 'image_pull_secrets': None,\n",
|
|
" 'init_containers': None,\n",
|
|
" 'node_name': None,\n",
|
|
" 'node_selector': None,\n",
|
|
" 'priority': None,\n",
|
|
" 'priority_class_name': None,\n",
|
|
" 'readiness_gates': None,\n",
|
|
" 'restart_policy': 'Always',\n",
|
|
" 'runtime_class_name': None,\n",
|
|
" 'scheduler_name': 'default-scheduler',\n",
|
|
" 'security_context': {'fs_group': None,\n",
|
|
" 'run_as_group': None,\n",
|
|
" 'run_as_non_root': None,\n",
|
|
" 'run_as_user': None,\n",
|
|
" 'se_linux_options': None,\n",
|
|
" 'supplemental_groups': None,\n",
|
|
" 'sysctls': None},\n",
|
|
" 'service_account': 'default-editor',\n",
|
|
" 'service_account_name': 'default-editor',\n",
|
|
" 'share_process_namespace': None,\n",
|
|
" 'subdomain': None,\n",
|
|
" 'termination_grace_period_seconds': 30,\n",
|
|
" 'tolerations': None,\n",
|
|
" 'volumes': [{'aws_elastic_block_store': None,\n",
|
|
" 'azure_disk': None,\n",
|
|
" 'azure_file': None,\n",
|
|
" 'cephfs': None,\n",
|
|
" 'cinder': None,\n",
|
|
" 'config_map': {'default_mode': 420,\n",
|
|
" 'items': None,\n",
|
|
" 'name': 'mnist-model',\n",
|
|
" 'optional': None},\n",
|
|
" 'csi': None,\n",
|
|
" 'downward_api': None,\n",
|
|
" 'empty_dir': None,\n",
|
|
" 'fc': None,\n",
|
|
" 'flex_volume': None,\n",
|
|
" 'flocker': None,\n",
|
|
" 'gce_persistent_disk': None,\n",
|
|
" 'git_repo': None,\n",
|
|
" 'glusterfs': None,\n",
|
|
" 'host_path': None,\n",
|
|
" 'iscsi': None,\n",
|
|
" 'name': 'model-config',\n",
|
|
" 'nfs': None,\n",
|
|
" 'persistent_volume_claim': None,\n",
|
|
" 'photon_persistent_disk': None,\n",
|
|
" 'portworx_volume': None,\n",
|
|
" 'projected': None,\n",
|
|
" 'quobyte': None,\n",
|
|
" 'rbd': None,\n",
|
|
" 'scale_io': None,\n",
|
|
" 'secret': None,\n",
|
|
" 'storageos': None,\n",
|
|
" 'vsphere_volume': None}]}}},\n",
|
|
" 'status': {'available_replicas': None,\n",
|
|
" 'collision_count': None,\n",
|
|
" 'conditions': None,\n",
|
|
" 'observed_generation': None,\n",
|
|
" 'ready_replicas': None,\n",
|
|
" 'replicas': None,\n",
|
|
" 'unavailable_replicas': None,\n",
|
|
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
|
|
" 'kind': 'Service',\n",
|
|
" 'metadata': {'annotations': {'prometheus.io/path': '/monitoring/prometheus/metrics',\n",
|
|
" 'prometheus.io/port': '8500',\n",
|
|
" 'prometheus.io/scrape': 'true'},\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-model'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-service',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '788913',\n",
|
|
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-service',\n",
|
|
" 'uid': '555d8fc0-5848-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'cluster_ip': '10.35.254.103',\n",
|
|
" 'external_i_ps': None,\n",
|
|
" 'external_name': None,\n",
|
|
" 'external_traffic_policy': None,\n",
|
|
" 'health_check_node_port': None,\n",
|
|
" 'load_balancer_ip': None,\n",
|
|
" 'load_balancer_source_ranges': None,\n",
|
|
" 'ports': [{'name': 'grpc-tf-serving',\n",
|
|
" 'node_port': None,\n",
|
|
" 'port': 9000,\n",
|
|
" 'protocol': 'TCP',\n",
|
|
" 'target_port': 9000},\n",
|
|
" {'name': 'http-tf-serving',\n",
|
|
" 'node_port': None,\n",
|
|
" 'port': 8500,\n",
|
|
" 'protocol': 'TCP',\n",
|
|
" 'target_port': 8500}],\n",
|
|
" 'publish_not_ready_addresses': None,\n",
|
|
" 'selector': {'app': 'mnist-model'},\n",
|
|
" 'session_affinity': 'None',\n",
|
|
" 'session_affinity_config': None,\n",
|
|
" 'type': 'ClusterIP'},\n",
|
|
" 'status': {'load_balancer': {'ingress': None}}}, {'api_version': 'v1',\n",
|
|
" 'binary_data': None,\n",
|
|
" 'data': {'monitoring_config.txt': 'prometheus_config: {\\n'\n",
|
|
" ' enable: true,\\n'\n",
|
|
" ' path: \"/monitoring/prometheus/metrics\"\\n'\n",
|
|
" '}'},\n",
|
|
" 'kind': 'ConfigMap',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 30, 28, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': None,\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-model',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '788914',\n",
|
|
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/configmaps/mnist-model',\n",
|
|
" 'uid': '5560bb37-5848-11ea-9ddf-42010a80013f'}}]"
|
|
]
|
|
},
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy the UI for the MNIST web app\n",
|
|
"\n",
|
|
"Deploy the UI to visualize the MNIST prediction results.\n",
|
|
"\n",
|
|
"This example uses a prebuilt and public Docker image for the UI."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ui_name = \"mnist-ui\"\n",
|
"ui_deploy = f\"\"\"apiVersion: apps/v1\n",
"kind: Deployment\n",
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  replicas: 1\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-web-ui\n",
"  template:\n",
"    metadata:\n",
"      labels:\n",
"        app: mnist-web-ui\n",
"    spec:\n",
"      containers:\n",
"      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n",
"        name: web-ui\n",
"        ports:\n",
"        - containerPort: 5000\n",
"      serviceAccount: default-editor\n",
"\"\"\"\n",
|
|
"\n",
|
"ui_service = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: http-mnist-ui\n",
"    port: 80\n",
"    targetPort: 5000\n",
"  selector:\n",
"    app: mnist-web-ui\n",
"  type: ClusterIP\n",
"\"\"\"\n",
|
|
"\n",
|
"ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
"kind: VirtualService\n",
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  gateways:\n",
"  - kubeflow/kubeflow-gateway\n",
"  hosts:\n",
"  - '*'\n",
"  http:\n",
"  - match:\n",
"    - uri:\n",
"        prefix: /mnist/{namespace}/ui/\n",
"    rewrite:\n",
"      uri: /\n",
"    route:\n",
"    - destination:\n",
"        host: {ui_name}.{namespace}.svc.cluster.local\n",
"        port:\n",
"          number: 80\n",
"    timeout: 300s\n",
"\"\"\"\n",
|
|
"\n",
|
|
"ui_specs = [ui_deploy, ui_service, ui_virtual_service]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Created Deployment kubeflow-sarahmaddox.mnist-ui\n",
|
|
"Created Service kubeflow-sarahmaddox.mnist-ui\n",
|
|
"Created VirtualService mnist-ui.mnist-ui\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[{'api_version': 'apps/v1',\n",
|
|
" 'kind': 'Deployment',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': 1,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': None,\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-ui',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '790203',\n",
|
|
" 'self_link': '/apis/apps/v1/namespaces/kubeflow-sarahmaddox/deployments/mnist-ui',\n",
|
|
" 'uid': '9d846bf6-5848-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'min_ready_seconds': None,\n",
|
|
" 'paused': None,\n",
|
|
" 'progress_deadline_seconds': 600,\n",
|
|
" 'replicas': 1,\n",
|
|
" 'revision_history_limit': 10,\n",
|
|
" 'selector': {'match_expressions': None,\n",
|
|
" 'match_labels': {'app': 'mnist-web-ui'}},\n",
|
|
" 'strategy': {'rolling_update': {'max_surge': '25%',\n",
|
|
" 'max_unavailable': '25%'},\n",
|
|
" 'type': 'RollingUpdate'},\n",
|
|
" 'template': {'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': None,\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': {'app': 'mnist-web-ui'},\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': None,\n",
|
|
" 'namespace': None,\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': None,\n",
|
|
" 'self_link': None,\n",
|
|
" 'uid': None},\n",
|
|
" 'spec': {'active_deadline_seconds': None,\n",
|
|
" 'affinity': None,\n",
|
|
" 'automount_service_account_token': None,\n",
|
|
" 'containers': [{'args': None,\n",
|
|
" 'command': None,\n",
|
|
" 'env': None,\n",
|
|
" 'env_from': None,\n",
|
|
" 'image': 'gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225',\n",
|
|
" 'image_pull_policy': 'IfNotPresent',\n",
|
|
" 'lifecycle': None,\n",
|
|
" 'liveness_probe': None,\n",
|
|
" 'name': 'web-ui',\n",
|
|
" 'ports': [{'container_port': 5000,\n",
|
|
" 'host_ip': None,\n",
|
|
" 'host_port': None,\n",
|
|
" 'name': None,\n",
|
|
" 'protocol': 'TCP'}],\n",
|
|
" 'readiness_probe': None,\n",
|
|
" 'resources': {'limits': None,\n",
|
|
" 'requests': None},\n",
|
|
" 'security_context': None,\n",
|
|
" 'stdin': None,\n",
|
|
" 'stdin_once': None,\n",
|
|
" 'termination_message_path': '/dev/termination-log',\n",
|
|
" 'termination_message_policy': 'File',\n",
|
|
" 'tty': None,\n",
|
|
" 'volume_devices': None,\n",
|
|
" 'volume_mounts': None,\n",
|
|
" 'working_dir': None}],\n",
|
|
" 'dns_config': None,\n",
|
|
" 'dns_policy': 'ClusterFirst',\n",
|
|
" 'enable_service_links': None,\n",
|
|
" 'host_aliases': None,\n",
|
|
" 'host_ipc': None,\n",
|
|
" 'host_network': None,\n",
|
|
" 'host_pid': None,\n",
|
|
" 'hostname': None,\n",
|
|
" 'image_pull_secrets': None,\n",
|
|
" 'init_containers': None,\n",
|
|
" 'node_name': None,\n",
|
|
" 'node_selector': None,\n",
|
|
" 'priority': None,\n",
|
|
" 'priority_class_name': None,\n",
|
|
" 'readiness_gates': None,\n",
|
|
" 'restart_policy': 'Always',\n",
|
|
" 'runtime_class_name': None,\n",
|
|
" 'scheduler_name': 'default-scheduler',\n",
|
|
" 'security_context': {'fs_group': None,\n",
|
|
" 'run_as_group': None,\n",
|
|
" 'run_as_non_root': None,\n",
|
|
" 'run_as_user': None,\n",
|
|
" 'se_linux_options': None,\n",
|
|
" 'supplemental_groups': None,\n",
|
|
" 'sysctls': None},\n",
|
|
" 'service_account': 'default-editor',\n",
|
|
" 'service_account_name': 'default-editor',\n",
|
|
" 'share_process_namespace': None,\n",
|
|
" 'subdomain': None,\n",
|
|
" 'termination_grace_period_seconds': 30,\n",
|
|
" 'tolerations': None,\n",
|
|
" 'volumes': None}}},\n",
|
|
" 'status': {'available_replicas': None,\n",
|
|
" 'collision_count': None,\n",
|
|
" 'conditions': None,\n",
|
|
" 'observed_generation': None,\n",
|
|
" 'ready_replicas': None,\n",
|
|
" 'replicas': None,\n",
|
|
" 'unavailable_replicas': None,\n",
|
|
" 'updated_replicas': None}}, {'api_version': 'v1',\n",
|
|
" 'kind': 'Service',\n",
|
|
" 'metadata': {'annotations': None,\n",
|
|
" 'cluster_name': None,\n",
|
|
" 'creation_timestamp': datetime.datetime(2020, 2, 26, 3, 32, 29, tzinfo=tzlocal()),\n",
|
|
" 'deletion_grace_period_seconds': None,\n",
|
|
" 'deletion_timestamp': None,\n",
|
|
" 'finalizers': None,\n",
|
|
" 'generate_name': None,\n",
|
|
" 'generation': None,\n",
|
|
" 'initializers': None,\n",
|
|
" 'labels': None,\n",
|
|
" 'managed_fields': None,\n",
|
|
" 'name': 'mnist-ui',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'owner_references': None,\n",
|
|
" 'resource_version': '790209',\n",
|
|
" 'self_link': '/api/v1/namespaces/kubeflow-sarahmaddox/services/mnist-ui',\n",
|
|
" 'uid': '9d8a67e4-5848-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'cluster_ip': '10.35.244.4',\n",
|
|
" 'external_i_ps': None,\n",
|
|
" 'external_name': None,\n",
|
|
" 'external_traffic_policy': None,\n",
|
|
" 'health_check_node_port': None,\n",
|
|
" 'load_balancer_ip': None,\n",
|
|
" 'load_balancer_source_ranges': None,\n",
|
|
" 'ports': [{'name': 'http-mnist-ui',\n",
|
|
" 'node_port': None,\n",
|
|
" 'port': 80,\n",
|
|
" 'protocol': 'TCP',\n",
|
|
" 'target_port': 5000}],\n",
|
|
" 'publish_not_ready_addresses': None,\n",
|
|
" 'selector': {'app': 'mnist-web-ui'},\n",
|
|
" 'session_affinity': 'None',\n",
|
|
" 'session_affinity_config': None,\n",
|
|
" 'type': 'ClusterIP'},\n",
|
|
" 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n",
|
|
" 'kind': 'VirtualService',\n",
|
|
" 'metadata': {'creationTimestamp': '2020-02-26T03:32:29Z',\n",
|
|
" 'generation': 1,\n",
|
|
" 'name': 'mnist-ui',\n",
|
|
" 'namespace': 'kubeflow-sarahmaddox',\n",
|
|
" 'resourceVersion': '790211',\n",
|
|
" 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-sarahmaddox/virtualservices/mnist-ui',\n",
|
|
" 'uid': '9d921512-5848-11ea-9ddf-42010a80013f'},\n",
|
|
" 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n",
|
|
" 'hosts': ['*'],\n",
|
|
" 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-sarahmaddox/ui/'}}],\n",
|
|
" 'rewrite': {'uri': '/'},\n",
|
|
" 'route': [{'destination': {'host': 'mnist-ui.kubeflow-sarahmaddox.svc.cluster.local',\n",
|
|
" 'port': {'number': 80}}}],\n",
|
|
" 'timeout': '300s'}]}}]"
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
"k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Access the MNIST web UI\n",
|
|
"\n",
|
|
"A reverse proxy route is automatically added to the Kubeflow IAP endpoint. The MNIST endpoint is:\n",
|
|
"\n",
|
|
" ```\n",
|
" https://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/\n",
|
|
" ```\n",
|
|
" \n",
|
|
"where `NAMESPACE` is the namespace where you're running the Jupyter notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 38,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"mnist UI is at <a href='https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/ui/'>https://sarahmaddox-kfw-v100rc4.endpoints.kubeflow-writers.cloud.goog/mnist/kubeflow-sarahmaddox/ui/</a>"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.HTML object>"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
"if endpoint:\n",
"    vs = yaml.safe_load(ui_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    ui_endpoint = endpoint + path\n",
"    display(HTML(f\"MNIST UI is at <a href='{ui_endpoint}'>{ui_endpoint}</a>\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
"Open the MNIST UI in your browser. You should see an image of a handwritten digit from 0 to 9. This is a random image sent to the model for classification. Below the image is a bar graph with one bar for each classification label from 0 to 9, as output by the model. Each bar represents the probability that the image matches the respective label.\n",
|
|
"\n",
|
|
"Click the **test random image** button to send the model a new image."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Next steps\n",
|
|
"\n",
|
|
"Visit the [Kubeflow docs](https://www.kubeflow.org/docs/gke/) for more information about running Kubeflow on GCP."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|