{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST E2E on Kubeflow on GKE\n", "\n", "This example guides you through:\n", " \n", " 1. Taking an example TensorFlow model and modifying it to support distributed training\n", " 1. Serving the resulting model using TFServing\n", " 1. Deploying and using a web-app that uses the model\n", " \n", "## Requirements\n", "\n", " * You must be running Kubeflow 1.0 on GKE with IAP\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raise ValueError(\"Fake exception\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare model\n", "\n", "There is a delta between existing distributed mnist examples and what's needed to run well as a TFJob.\n", "\n", "Basically, we must:\n", "\n", "1. Add options in order to make the model configurable.\n", "1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n", "1. Define serving signatures for model serving.\n", "\n", "The resulting model is [model.py](model.py)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Verify we have a GCP account\n", "\n", "* The cell below checks that this notebook was spawned with credentials to access GCP\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import os\n", "import uuid\n", "from importlib import reload\n", "from oauth2client.client import GoogleCredentials\n", "credentials = GoogleCredentials.get_application_default()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Required Libraries\n", "\n", "Import the libraries required to train this model." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "pip installing requirements.txt\n", "Checkout kubeflow/tf-operator @9238906\n", "Configure docker credentials\n" ] } ], "source": [ "import notebook_setup\n", "reload(notebook_setup)\n", "notebook_setup.notebook_setup()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import k8s_util\n", "# Force a reload of kubeflow; since kubeflow is a multi namespace module\n", "# it looks like doing this in notebook_setup may not be sufficient\n", "import kubeflow\n", "reload(kubeflow)\n", "from kubernetes import client as k8s_client\n", "from kubernetes import config as k8s_config\n", "from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n", "from IPython.core.display import display, HTML\n", "import yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure The Docker Registry For Kubeflow Fairing\n", "\n", "* In order to build docker images from your notebook we need a docker registry where the images will be stored\n", "* Below you set some variables specifying a [GCR container registry](https://cloud.google.com/container-registry/docs/)\n", "* Kubeflow Fairing provides a utility function to guess the name of your GCP project" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Running in project jlewi-dev\n", "Running in namespace kubeflow-jlewi\n", "Using docker registry gcr.io/jlewi-dev/fairing-job\n" ] } ], "source": [ "from kubernetes import client as k8s_client\n", "from kubernetes.client import rest as k8s_rest\n", "from kubeflow import fairing \n", "from kubeflow.fairing import utils as fairing_utils\n", "from kubeflow.fairing.builders import append\n", "from kubeflow.fairing.deployers import job\n", "from kubeflow.fairing.preprocessors import base as 
base_preprocessor\n", "\n", "# Setting up google container repositories (GCR) for storing output containers\n", "# You can use any docker container registry instead of GCR\n", "GCP_PROJECT = fairing.cloud.gcp.guess_project_name()\n", "DOCKER_REGISTRY = 'gcr.io/{}/fairing-job'.format(GCP_PROJECT)\n", "namespace = fairing_utils.get_current_k8s_namespace()\n", "\n", "logging.info(f\"Running in project {GCP_PROJECT}\")\n", "logging.info(f\"Running in namespace {namespace}\")\n", "logging.info(f\"Using docker registry {DOCKER_REGISTRY}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Kubeflow fairing to build the docker image\n", "\n", "* You will use kubeflow fairing's kaniko builder to build a docker image that includes all your dependencies\n", " * You use kaniko because you want to be able to run `pip` to install dependencies\n", " * Kaniko gives you the flexibility to build images from Dockerfiles" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n", "# Kaniko image is updated to a newer image than 0.7.0.\n", "from kubeflow.fairing import constants\n", "constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from kubeflow.fairing.builders import cluster\n", "\n", "# output_map is a map of extra files to add to the notebook.\n", "# It is a map from source location to the location inside the context.\n", "output_map = {\n", " \"Dockerfile.model\": \"Dockerfile\",\n", " \"model.py\": \"model.py\"\n", "}\n", "\n", "\n", "preprocessor = base_preprocessor.BasePreProcessor(\n", " command=[\"python\"], # The base class will set this.\n", " input_files=[],\n", " 
path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n", " output_map=output_map)\n", "\n", "preprocessor.preprocess()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building image using cluster builder.\n", "Creating docker context: /tmp/fairing_context_n8ikop1c\n", "Dockerfile already exists in Fairing context, skipping...\n", "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", "Waiting for fairing-builder-nv9dh-2kwz9 to start...\n", "Pod started running True\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ERROR: logging before flag.Parse: E0212 21:28:24.488770 1 metadata.go:241] Failed to unmarshal scopes: invalid character 'h' looking for beginning of value\n", "\u001b[36mINFO\u001b[0m[0002] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n", "\u001b[36mINFO\u001b[0m[0002] Resolved base name tensorflow/tensorflow:1.15.2-py3 to tensorflow/tensorflow:1.15.2-py3\n", "\u001b[36mINFO\u001b[0m[0002] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", "ERROR: logging before flag.Parse: E0212 21:28:24.983416 1 metadata.go:142] while reading 'google-dockercfg' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg\n", "ERROR: logging before flag.Parse: E0212 21:28:24.989996 1 metadata.go:159] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url\n", "\u001b[36mINFO\u001b[0m[0002] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n", "\u001b[36mINFO\u001b[0m[0002] Downloading base image 
tensorflow/tensorflow:1.15.2-py3\n", "\u001b[36mINFO\u001b[0m[0003] Built cross stage deps: map[]\n", "\u001b[36mINFO\u001b[0m[0003] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", "\u001b[36mINFO\u001b[0m[0003] Error while retrieving image from cache: getting file info: stat /cache/sha256:28b5f547969d70f825909c8fe06675ffc2959afe6079aeae754afa312f6417b9: no such file or directory\n", "\u001b[36mINFO\u001b[0m[0003] Downloading base image tensorflow/tensorflow:1.15.2-py3\n", "\u001b[36mINFO\u001b[0m[0003] Using files from context: [/kaniko/buildcontext/model.py]\n", "\u001b[36mINFO\u001b[0m[0003] Checking for cached layer gcr.io/jlewi-dev/fairing-job/mnist/cache:6802122184979734f01a549e1224c5f46a277db894d4b3e749e41ad1ca522bdf...\n", "\u001b[36mINFO\u001b[0m[0004] Using caching version of cmd: RUN chmod +x /opt/model.py\n", "\u001b[36mINFO\u001b[0m[0004] Skipping unpacking as no commands require it.\n", "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of full filesystem...\n", "\u001b[36mINFO\u001b[0m[0004] Using files from context: [/kaniko/buildcontext/model.py]\n", "\u001b[36mINFO\u001b[0m[0004] ADD model.py /opt/model.py\n", "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of files...\n", "\u001b[36mINFO\u001b[0m[0004] RUN chmod +x /opt/model.py\n", "\u001b[36mINFO\u001b[0m[0004] Found cached layer, extracting to filesystem\n", "\u001b[36mINFO\u001b[0m[0004] Taking snapshot of files...\n", "\u001b[36mINFO\u001b[0m[0004] ENTRYPOINT [\"/usr/bin/python\"]\n", "\u001b[36mINFO\u001b[0m[0004] No files changed in this command, skipping snapshotting.\n", "\u001b[36mINFO\u001b[0m[0004] CMD [\"/opt/model.py\"]\n", "\u001b[36mINFO\u001b[0m[0004] No files changed in this command, skipping snapshotting.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Built image gcr.io/jlewi-dev/fairing-job/mnist:24327351\n" ] } ], "source": [ "# Use a Tensorflow image as the base image\n", "# We use a custom Dockerfile \n", "cluster_builder = 
cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n", " base_image=\"\", # base_image is set in the Dockerfile\n", " preprocessor=preprocessor,\n", " image_name=\"mnist\",\n", " dockerfile_path=\"Dockerfile\",\n", " pod_spec_mutators=[fairing.cloud.gcp.add_gcp_credentials_if_exists],\n", " context_source=cluster.gcs_context.GCSContextSource())\n", "cluster_builder.build()\n", "logging.info(f\"Built image {cluster_builder.image_tag}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a GCS Bucket\n", "\n", "* Create a GCS bucket to store our models and other results.\n", "* Since we are running in python we use the python client libraries but you could also use the `gsutil` command line" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Bucket jlewi-dev-mnist already exists\n" ] } ], "source": [ "from google.cloud import storage\n", "bucket = f\"{GCP_PROJECT}-mnist\"\n", "\n", "client = storage.Client()\n", "b = storage.Bucket(client=client, name=bucket)\n", "\n", "if not b.exists():\n", " logging.info(f\"Creating bucket {bucket}\")\n", " b.create()\n", "else:\n", " logging.info(f\"Bucket {bucket} already exists\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distributed training\n", "\n", "* We will train the model by using TFJob to run a distributed training job" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n", "num_ps = 1\n", "num_workers = 2\n", "model_dir = f\"gs://{bucket}/mnist\"\n", "export_path = f\"gs://{bucket}/mnist/export\" \n", "train_steps = 200\n", "batch_size = 100\n", "learning_rate = .01\n", "image = cluster_builder.image_tag\n", "\n", "train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n", "kind: TFJob\n", "metadata:\n", " name: {train_name} \n", "spec:\n", " tfReplicaSpecs:\n", " Ps:\n", " replicas: {num_ps}\n", 
" template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " serviceAccount: default-editor\n", " containers:\n", " - name: tensorflow\n", " command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir={model_dir}\n", " - --tf-export-dir={export_path}\n", " - --tf-train-steps={train_steps}\n", " - --tf-batch-size={batch_size}\n", " - --tf-learning-rate={learning_rate}\n", " image: {image}\n", " workingDir: /opt\n", " restartPolicy: OnFailure\n", " Chief:\n", " replicas: 1\n", " template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " serviceAccount: default-editor\n", " containers:\n", " - name: tensorflow\n", " command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir={model_dir}\n", " - --tf-export-dir={export_path}\n", " - --tf-train-steps={train_steps}\n", " - --tf-batch-size={batch_size}\n", " - --tf-learning-rate={learning_rate}\n", " image: {image}\n", " workingDir: /opt\n", " restartPolicy: OnFailure\n", " Worker:\n", " replicas: 1\n", " template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " serviceAccount: default-editor\n", " containers:\n", " - name: tensorflow\n", " command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir={model_dir}\n", " - --tf-export-dir={export_path}\n", " - --tf-train-steps={train_steps}\n", " - --tf-batch-size={batch_size}\n", " - --tf-learning-rate={learning_rate}\n", " image: {image}\n", " workingDir: /opt\n", " restartPolicy: OnFailure\n", "\"\"\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the training job\n", "\n", "* You could write the spec to a YAML file and then do `kubectl apply -f {FILE}`\n", "* Since you are running in jupyter you will use the TFJob client\n", "* You will run the TFJob in a namespace created by a Kubeflow profile\n", " * The namespace will be the same namespace you are running the notebook in\n", " * 
Creating a profile ensures the namespace is provisioned with service accounts and other resources needed for Kubeflow" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "tf_job_client = tf_job_client_module.TFJobClient()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "TFJob kubeflow-jlewi.mnist-train-2c73 succeeded\n" ] } ], "source": [ "tf_job_body = yaml.safe_load(train_spec)\n", "tf_job = tf_job_client.create(tf_job_body, namespace=namespace) \n", "\n", "logging.info(f\"Created job {namespace}.{train_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the job\n", "\n", "* Above you used the python SDK for TFJob to check the status\n", "* You can also use kubectl to get the status of your job\n", "* The job conditions will tell you whether the job is running, succeeded or failed" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "apiVersion: kubeflow.org/v1\n", "kind: TFJob\n", "metadata:\n", " creationTimestamp: \"2020-02-12T21:28:31Z\"\n", " generation: 1\n", " name: mnist-train-2c73\n", " namespace: kubeflow-jlewi\n", " resourceVersion: \"1730369\"\n", " selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow-jlewi/tfjobs/mnist-train-2c73\n", " uid: 9e27854c-4dde-11ea-9830-42010a8e016f\n", "spec:\n", " tfReplicaSpecs:\n", " Chief:\n", " replicas: 1\n", " template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " containers:\n", " - command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", " - --tf-train-steps=200\n", " - --tf-batch-size=100\n", " - --tf-learning-rate=0.01\n", " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", " name: tensorflow\n", " workingDir: /opt\n", " 
restartPolicy: OnFailure\n", " serviceAccount: default-editor\n", " Ps:\n", " replicas: 1\n", " template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " containers:\n", " - command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", " - --tf-train-steps=200\n", " - --tf-batch-size=100\n", " - --tf-learning-rate=0.01\n", " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", " name: tensorflow\n", " workingDir: /opt\n", " restartPolicy: OnFailure\n", " serviceAccount: default-editor\n", " Worker:\n", " replicas: 1\n", " template:\n", " metadata:\n", " annotations:\n", " sidecar.istio.io/inject: \"false\"\n", " spec:\n", " containers:\n", " - command:\n", " - python\n", " - /opt/model.py\n", " - --tf-model-dir=gs://jlewi-dev-mnist/mnist\n", " - --tf-export-dir=gs://jlewi-dev-mnist/mnist/export\n", " - --tf-train-steps=200\n", " - --tf-batch-size=100\n", " - --tf-learning-rate=0.01\n", " image: gcr.io/jlewi-dev/fairing-job/mnist:24327351\n", " name: tensorflow\n", " workingDir: /opt\n", " restartPolicy: OnFailure\n", " serviceAccount: default-editor\n", "status:\n", " completionTime: \"2020-02-12T21:28:53Z\"\n", " conditions:\n", " - lastTransitionTime: \"2020-02-12T21:28:31Z\"\n", " lastUpdateTime: \"2020-02-12T21:28:31Z\"\n", " message: TFJob mnist-train-2c73 is created.\n", " reason: TFJobCreated\n", " status: \"True\"\n", " type: Created\n", " - lastTransitionTime: \"2020-02-12T21:28:34Z\"\n", " lastUpdateTime: \"2020-02-12T21:28:34Z\"\n", " message: TFJob mnist-train-2c73 is running.\n", " reason: TFJobRunning\n", " status: \"False\"\n", " type: Running\n", " - lastTransitionTime: \"2020-02-12T21:28:53Z\"\n", " lastUpdateTime: \"2020-02-12T21:28:53Z\"\n", " message: TFJob mnist-train-2c73 successfully completed.\n", " reason: TFJobSucceeded\n", " status: \"True\"\n", " type: Succeeded\n", " replicaStatuses:\n", " 
Chief:\n", " succeeded: 1\n", " PS:\n", " succeeded: 1\n", " Worker:\n", " succeeded: 1\n", " startTime: \"2020-02-12T21:28:32Z\"\n" ] } ], "source": [ "!kubectl get tfjobs -o yaml {train_name}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get The Logs\n", "\n", "* There are two ways to get the logs for the training job\n", "\n", " 1. Using kubectl to fetch the pod logs\n", " * These logs are ephemeral; they will be unavailable when the pod is garbage collected to free up resources\n", " 1. Using stackdriver\n", " \n", " * Kubernetes logs are automatically available in stackdriver\n", " * You can use labels to locate logs for a specific pod\n", " * In the cell below you use labels for the training job name and process type to locate the logs for a specific pod\n", " \n", "* Run the cell below to get a link to stackdriver for your logs" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Link to: chief logs" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Link to: worker logs" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Link to: ps logs" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from urllib.parse import urlencode\n", "\n", "for replica in [\"chief\", \"worker\", \"ps\"]: \n", " logs_filter = f\"\"\"resource.type=\"k8s_container\" \n", " labels.\"k8s-pod/tf-job-name\" = \"{train_name}\"\n", " labels.\"k8s-pod/tf-replica-type\" = \"{replica}\" \n", " resource.labels.container_name=\"tensorflow\" \"\"\"\n", "\n", " new_params = {'project': GCP_PROJECT,\n", " # Logs for last 7 days\n", " 'interval': 'P7D',\n", " 'advancedFilter': logs_filter}\n", "\n", " query = urlencode(new_params)\n", "\n", " url = \"https://console.cloud.google.com/logs/viewer?\" + query\n", "\n", " display(HTML(f\"Link to: {replica} logs\"))\n" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## Deploy TensorBoard\n", "\n", "* You will create a Kubernetes Deployment to run TensorBoard\n", "* TensorBoard will be accessible behind the Kubeflow IAP endpoint" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "tb_name = \"mnist-tensorboard\"\n", "tb_deploy = f\"\"\"apiVersion: apps/v1\n", "kind: Deployment\n", "metadata:\n", " labels:\n", " app: mnist-tensorboard\n", " name: {tb_name}\n", " namespace: {namespace}\n", "spec:\n", " selector:\n", " matchLabels:\n", " app: mnist-tensorboard\n", " template:\n", " metadata:\n", " labels:\n", " app: mnist-tensorboard\n", " version: v1\n", " spec:\n", " serviceAccount: default-editor\n", " containers:\n", " - command:\n", " - /usr/local/bin/tensorboard\n", " - --logdir={model_dir}\n", " - --port=80\n", " image: tensorflow/tensorflow:1.15.2-py3\n", " name: tensorboard\n", " ports:\n", " - containerPort: 80\n", "\"\"\"\n", "tb_service = f\"\"\"apiVersion: v1\n", "kind: Service\n", "metadata:\n", " labels:\n", " app: mnist-tensorboard\n", " name: {tb_name}\n", " namespace: {namespace}\n", "spec:\n", " ports:\n", " - name: http-tb\n", " port: 80\n", " targetPort: 80\n", " selector:\n", " app: mnist-tensorboard\n", " type: ClusterIP\n", "\"\"\"\n", "\n", "tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n", "kind: VirtualService\n", "metadata:\n", " name: {tb_name}\n", " namespace: {namespace}\n", "spec:\n", " gateways:\n", " - kubeflow/kubeflow-gateway\n", " hosts:\n", " - '*'\n", " http:\n", " - match:\n", " - uri:\n", " prefix: /mnist/{namespace}/tensorboard/\n", " rewrite:\n", " uri: /\n", " route:\n", " - destination:\n", " host: {tb_name}.{namespace}.svc.cluster.local\n", " port:\n", " number: 80\n", " timeout: 300s\n", "\"\"\"\n", "\n", "tb_specs = [tb_deploy, tb_service, tb_virtual_service]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "/home/jovyan/git_kubeflow-examples/mnist/k8s_util.py:55: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.\n", " spec = yaml.load(spec)\n", "Deleted Deployment kubeflow-jlewi.mnist-tensorboard\n", "Created Deployment kubeflow-jlewi.mnist-tensorboard\n", "Deleted Service kubeflow-jlewi.mnist-tensorboard\n", "Created Service kubeflow-jlewi.mnist-tensorboard\n", "Deleted VirtualService kubeflow-jlewi.mnist-tensorboard\n", "Created VirtualService mnist-tensorboard.mnist-tensorboard\n" ] }, { "data": { "text/plain": [ "[{'api_version': 'apps/v1',\n", " 'kind': 'Deployment',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': 1,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-tensorboard'},\n", " 'managed_fields': None,\n", " 'name': 'mnist-tensorboard',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731593',\n", " 'self_link': '/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-tensorboard',\n", " 'uid': 'e9750d8b-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'min_ready_seconds': None,\n", " 'paused': None,\n", " 'progress_deadline_seconds': 600,\n", " 'replicas': 1,\n", " 'revision_history_limit': 10,\n", " 'selector': {'match_expressions': None,\n", " 'match_labels': {'app': 'mnist-tensorboard'}},\n", " 'strategy': {'rolling_update': {'max_surge': '25%',\n", " 'max_unavailable': '25%'},\n", " 'type': 'RollingUpdate'},\n", " 'template': {'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': None,\n", " 'deletion_grace_period_seconds': None,\n", " 
'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-tensorboard',\n", " 'version': 'v1'},\n", " 'managed_fields': None,\n", " 'name': None,\n", " 'namespace': None,\n", " 'owner_references': None,\n", " 'resource_version': None,\n", " 'self_link': None,\n", " 'uid': None},\n", " 'spec': {'active_deadline_seconds': None,\n", " 'affinity': None,\n", " 'automount_service_account_token': None,\n", " 'containers': [{'args': None,\n", " 'command': ['/usr/local/bin/tensorboard',\n", " '--logdir=gs://jlewi-dev-mnist/mnist',\n", " '--port=80'],\n", " 'env': None,\n", " 'env_from': None,\n", " 'image': 'tensorflow/tensorflow:1.15.2-py3',\n", " 'image_pull_policy': 'IfNotPresent',\n", " 'lifecycle': None,\n", " 'liveness_probe': None,\n", " 'name': 'tensorboard',\n", " 'ports': [{'container_port': 80,\n", " 'host_ip': None,\n", " 'host_port': None,\n", " 'name': None,\n", " 'protocol': 'TCP'}],\n", " 'readiness_probe': None,\n", " 'resources': {'limits': None,\n", " 'requests': None},\n", " 'security_context': None,\n", " 'stdin': None,\n", " 'stdin_once': None,\n", " 'termination_message_path': '/dev/termination-log',\n", " 'termination_message_policy': 'File',\n", " 'tty': None,\n", " 'volume_devices': None,\n", " 'volume_mounts': None,\n", " 'working_dir': None}],\n", " 'dns_config': None,\n", " 'dns_policy': 'ClusterFirst',\n", " 'enable_service_links': None,\n", " 'host_aliases': None,\n", " 'host_ipc': None,\n", " 'host_network': None,\n", " 'host_pid': None,\n", " 'hostname': None,\n", " 'image_pull_secrets': None,\n", " 'init_containers': None,\n", " 'node_name': None,\n", " 'node_selector': None,\n", " 'priority': None,\n", " 'priority_class_name': None,\n", " 'readiness_gates': None,\n", " 'restart_policy': 'Always',\n", " 'runtime_class_name': None,\n", " 'scheduler_name': 'default-scheduler',\n", " 'security_context': {'fs_group': None,\n", " 
'run_as_group': None,\n", " 'run_as_non_root': None,\n", " 'run_as_user': None,\n", " 'se_linux_options': None,\n", " 'supplemental_groups': None,\n", " 'sysctls': None},\n", " 'service_account': 'default-editor',\n", " 'service_account_name': 'default-editor',\n", " 'share_process_namespace': None,\n", " 'subdomain': None,\n", " 'termination_grace_period_seconds': 30,\n", " 'tolerations': None,\n", " 'volumes': None}}},\n", " 'status': {'available_replicas': None,\n", " 'collision_count': None,\n", " 'conditions': None,\n", " 'observed_generation': None,\n", " 'ready_replicas': None,\n", " 'replicas': None,\n", " 'unavailable_replicas': None,\n", " 'updated_replicas': None}}, {'api_version': 'v1',\n", " 'kind': 'Service',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-tensorboard'},\n", " 'managed_fields': None,\n", " 'name': 'mnist-tensorboard',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731608',\n", " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-tensorboard',\n", " 'uid': 'e98fa09f-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'cluster_ip': '10.55.245.113',\n", " 'external_i_ps': None,\n", " 'external_name': None,\n", " 'external_traffic_policy': None,\n", " 'health_check_node_port': None,\n", " 'load_balancer_ip': None,\n", " 'load_balancer_source_ranges': None,\n", " 'ports': [{'name': 'http-tb',\n", " 'node_port': None,\n", " 'port': 80,\n", " 'protocol': 'TCP',\n", " 'target_port': 80}],\n", " 'publish_not_ready_addresses': None,\n", " 'selector': {'app': 'mnist-tensorboard'},\n", " 'session_affinity': 'None',\n", " 'session_affinity_config': None,\n", " 'type': 
'ClusterIP'},\n", " 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n", " 'kind': 'VirtualService',\n", " 'metadata': {'creationTimestamp': '2020-02-12T21:30:38Z',\n", " 'generation': 1,\n", " 'name': 'mnist-tensorboard',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'resourceVersion': '1731612',\n", " 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-jlewi/virtualservices/mnist-tensorboard',\n", " 'uid': 'e99c4909-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n", " 'hosts': ['*'],\n", " 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-jlewi/tensorboard/'}}],\n", " 'rewrite': {'uri': '/'},\n", " 'route': [{'destination': {'host': 'mnist-tensorboard.kubeflow-jlewi.svc.cluster.local',\n", " 'port': {'number': 80}}}],\n", " 'timeout': '300s'}]}}]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access The TensorBoard UI" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "TensorBoard UI is at https://kf-v1-0210.endpoints.jlewi-dev.cloud.goog/mnist/kubeflow-jlewi/tensorboard/" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "endpoint = k8s_util.get_iap_endpoint() \n", "if endpoint: \n", " vs = yaml.safe_load(tb_virtual_service)\n", " path= vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n", " tb_endpoint = endpoint + path\n", " display(HTML(f\"TensorBoard UI is at {tb_endpoint}\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wait For the Training Job to finish" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* You can use the TFJob client to wait for it to finish." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=[\"Succeeded\", \"Failed\"], namespace=namespace)\n", "\n", "if tf_job_client.is_job_succeeded(train_name, namespace):\n", " logging.info(f\"TFJob {namespace}.{train_name} succeeded\")\n", "else:\n", " raise ValueError(f\"TFJob {namespace}.{train_name} failed\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Serve the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Deploy the model using TensorFlow Serving\n", "* We need to create\n", " 1. A Kubernetes Deployment\n", " 1. A Kubernetes Service\n", " 1. (Optional) A ConfigMap containing the Prometheus monitoring config" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "deploy_name = \"mnist-model\"\n", "model_base_path = export_path\n", "\n", "# The web ui defaults to mnist-service so if you change it you will\n", "# need to change it in the UI as well to send predictions to the model\n", "model_service = \"mnist-service\"\n", "\n", "deploy_spec = f\"\"\"apiVersion: apps/v1\n", "kind: Deployment\n", "metadata:\n", " labels:\n", " app: mnist\n", " name: {deploy_name}\n", " namespace: {namespace}\n", "spec:\n", " selector:\n", " matchLabels:\n", " app: mnist-model\n", " template:\n", " metadata:\n", " # TODO(jlewi): Right now we disable the istio side car because otherwise ISTIO rbac will prevent the\n", " # UI from sending RPCs to the server. 
We should create an appropriate ISTIO rbac authorization\n", " # policy to allow traffic from the UI to the model server.\n", " # https://istio.io/docs/concepts/security/#target-selectors\n", " annotations: \n", " sidecar.istio.io/inject: \"false\"\n", " labels:\n", " app: mnist-model\n", " version: v1\n", " spec:\n", " serviceAccount: default-editor\n", " containers:\n", " - args:\n", " - --port=9000\n", " - --rest_api_port=8500\n", " - --model_name=mnist\n", " - --model_base_path={model_base_path}\n", " - --monitoring_config_file=/var/config/monitoring_config.txt\n", " command:\n", " - /usr/bin/tensorflow_model_server\n", " env:\n", " - name: modelBasePath\n", " value: {model_base_path}\n", " image: tensorflow/serving:1.15.0\n", " imagePullPolicy: IfNotPresent\n", " livenessProbe:\n", " initialDelaySeconds: 30\n", " periodSeconds: 30\n", " tcpSocket:\n", " port: 9000\n", " name: mnist\n", " ports:\n", " - containerPort: 9000\n", " - containerPort: 8500\n", " resources:\n", " limits:\n", " cpu: \"4\"\n", " memory: 4Gi\n", " requests:\n", " cpu: \"1\"\n", " memory: 1Gi\n", " volumeMounts:\n", " - mountPath: /var/config/\n", " name: model-config\n", " volumes:\n", " - configMap:\n", " name: {deploy_name}\n", " name: model-config\n", "\"\"\"\n", "\n", "service_spec = f\"\"\"apiVersion: v1\n", "kind: Service\n", "metadata:\n", " annotations: \n", " prometheus.io/path: /monitoring/prometheus/metrics\n", " prometheus.io/port: \"8500\"\n", " prometheus.io/scrape: \"true\"\n", " labels:\n", " app: mnist-model\n", " name: {model_service}\n", " namespace: {namespace}\n", "spec:\n", " ports:\n", " - name: grpc-tf-serving\n", " port: 9000\n", " targetPort: 9000\n", " - name: http-tf-serving\n", " port: 8500\n", " targetPort: 8500\n", " selector:\n", " app: mnist-model\n", " type: ClusterIP\n", "\"\"\"\n", "\n", "monitoring_config = f\"\"\"kind: ConfigMap\n", "apiVersion: v1\n", "metadata:\n", " name: {deploy_name}\n", " namespace: {namespace}\n", "data:\n", " 
monitoring_config.txt: |-\n", " prometheus_config: {{\n", " enable: true,\n", " path: \"/monitoring/prometheus/metrics\"\n", " }}\n", "\"\"\"\n", "\n", "model_specs = [deploy_spec, service_spec, monitoring_config]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Deleted Deployment kubeflow-jlewi.mnist-model\n", "Created Deployment kubeflow-jlewi.mnist-model\n", "Deleted Service kubeflow-jlewi.mnist-service\n", "Created Service kubeflow-jlewi.mnist-service\n", "Deleted ConfigMap kubeflow-jlewi.mnist-model\n", "Created ConfigMap kubeflow-jlewi.mnist-model\n" ] }, { "data": { "text/plain": [ "[{'api_version': 'apps/v1',\n", " 'kind': 'Deployment',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': 1,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist'},\n", " 'managed_fields': None,\n", " 'name': 'mnist-model',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731617',\n", " 'self_link': '/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-model',\n", " 'uid': 'e9add65c-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'min_ready_seconds': None,\n", " 'paused': None,\n", " 'progress_deadline_seconds': 600,\n", " 'replicas': 1,\n", " 'revision_history_limit': 10,\n", " 'selector': {'match_expressions': None,\n", " 'match_labels': {'app': 'mnist-model'}},\n", " 'strategy': {'rolling_update': {'max_surge': '25%',\n", " 'max_unavailable': '25%'},\n", " 'type': 'RollingUpdate'},\n", " 'template': {'metadata': {'annotations': {'sidecar.istio.io/inject': 'false'},\n", " 'cluster_name': None,\n", " 'creation_timestamp': None,\n", " 'deletion_grace_period_seconds': None,\n", " 
'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-model',\n", " 'version': 'v1'},\n", " 'managed_fields': None,\n", " 'name': None,\n", " 'namespace': None,\n", " 'owner_references': None,\n", " 'resource_version': None,\n", " 'self_link': None,\n", " 'uid': None},\n", " 'spec': {'active_deadline_seconds': None,\n", " 'affinity': None,\n", " 'automount_service_account_token': None,\n", " 'containers': [{'args': ['--port=9000',\n", " '--rest_api_port=8500',\n", " '--model_name=mnist',\n", " '--model_base_path=gs://jlewi-dev-mnist/mnist/export',\n", " '--monitoring_config_file=/var/config/monitoring_config.txt'],\n", " 'command': ['/usr/bin/tensorflow_model_server'],\n", " 'env': [{'name': 'modelBasePath',\n", " 'value': 'gs://jlewi-dev-mnist/mnist/export',\n", " 'value_from': None}],\n", " 'env_from': None,\n", " 'image': 'tensorflow/serving:1.15.0',\n", " 'image_pull_policy': 'IfNotPresent',\n", " 'lifecycle': None,\n", " 'liveness_probe': {'_exec': None,\n", " 'failure_threshold': 3,\n", " 'http_get': None,\n", " 'initial_delay_seconds': 30,\n", " 'period_seconds': 30,\n", " 'success_threshold': 1,\n", " 'tcp_socket': {'host': None,\n", " 'port': 9000},\n", " 'timeout_seconds': 1},\n", " 'name': 'mnist',\n", " 'ports': [{'container_port': 9000,\n", " 'host_ip': None,\n", " 'host_port': None,\n", " 'name': None,\n", " 'protocol': 'TCP'},\n", " {'container_port': 8500,\n", " 'host_ip': None,\n", " 'host_port': None,\n", " 'name': None,\n", " 'protocol': 'TCP'}],\n", " 'readiness_probe': None,\n", " 'resources': {'limits': {'cpu': '4',\n", " 'memory': '4Gi'},\n", " 'requests': {'cpu': '1',\n", " 'memory': '1Gi'}},\n", " 'security_context': None,\n", " 'stdin': None,\n", " 'stdin_once': None,\n", " 'termination_message_path': '/dev/termination-log',\n", " 'termination_message_policy': 'File',\n", " 'tty': None,\n", " 'volume_devices': None,\n", " 
'volume_mounts': [{'mount_path': '/var/config/',\n", " 'mount_propagation': None,\n", " 'name': 'model-config',\n", " 'read_only': None,\n", " 'sub_path': None,\n", " 'sub_path_expr': None}],\n", " 'working_dir': None}],\n", " 'dns_config': None,\n", " 'dns_policy': 'ClusterFirst',\n", " 'enable_service_links': None,\n", " 'host_aliases': None,\n", " 'host_ipc': None,\n", " 'host_network': None,\n", " 'host_pid': None,\n", " 'hostname': None,\n", " 'image_pull_secrets': None,\n", " 'init_containers': None,\n", " 'node_name': None,\n", " 'node_selector': None,\n", " 'priority': None,\n", " 'priority_class_name': None,\n", " 'readiness_gates': None,\n", " 'restart_policy': 'Always',\n", " 'runtime_class_name': None,\n", " 'scheduler_name': 'default-scheduler',\n", " 'security_context': {'fs_group': None,\n", " 'run_as_group': None,\n", " 'run_as_non_root': None,\n", " 'run_as_user': None,\n", " 'se_linux_options': None,\n", " 'supplemental_groups': None,\n", " 'sysctls': None},\n", " 'service_account': 'default-editor',\n", " 'service_account_name': 'default-editor',\n", " 'share_process_namespace': None,\n", " 'subdomain': None,\n", " 'termination_grace_period_seconds': 30,\n", " 'tolerations': None,\n", " 'volumes': [{'aws_elastic_block_store': None,\n", " 'azure_disk': None,\n", " 'azure_file': None,\n", " 'cephfs': None,\n", " 'cinder': None,\n", " 'config_map': {'default_mode': 420,\n", " 'items': None,\n", " 'name': 'mnist-model',\n", " 'optional': None},\n", " 'csi': None,\n", " 'downward_api': None,\n", " 'empty_dir': None,\n", " 'fc': None,\n", " 'flex_volume': None,\n", " 'flocker': None,\n", " 'gce_persistent_disk': None,\n", " 'git_repo': None,\n", " 'glusterfs': None,\n", " 'host_path': None,\n", " 'iscsi': None,\n", " 'name': 'model-config',\n", " 'nfs': None,\n", " 'persistent_volume_claim': None,\n", " 'photon_persistent_disk': None,\n", " 'portworx_volume': None,\n", " 'projected': None,\n", " 'quobyte': None,\n", " 'rbd': None,\n", " 'scale_io': 
None,\n", " 'secret': None,\n", " 'storageos': None,\n", " 'vsphere_volume': None}]}}},\n", " 'status': {'available_replicas': None,\n", " 'collision_count': None,\n", " 'conditions': None,\n", " 'observed_generation': None,\n", " 'ready_replicas': None,\n", " 'replicas': None,\n", " 'unavailable_replicas': None,\n", " 'updated_replicas': None}}, {'api_version': 'v1',\n", " 'kind': 'Service',\n", " 'metadata': {'annotations': {'prometheus.io/path': '/monitoring/prometheus/metrics',\n", " 'prometheus.io/port': '8500',\n", " 'prometheus.io/scrape': 'true'},\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 38, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-model'},\n", " 'managed_fields': None,\n", " 'name': 'mnist-service',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731639',\n", " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-service',\n", " 'uid': 'e9dcfd8c-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'cluster_ip': '10.55.250.62',\n", " 'external_i_ps': None,\n", " 'external_name': None,\n", " 'external_traffic_policy': None,\n", " 'health_check_node_port': None,\n", " 'load_balancer_ip': None,\n", " 'load_balancer_source_ranges': None,\n", " 'ports': [{'name': 'grpc-tf-serving',\n", " 'node_port': None,\n", " 'port': 9000,\n", " 'protocol': 'TCP',\n", " 'target_port': 9000},\n", " {'name': 'http-tf-serving',\n", " 'node_port': None,\n", " 'port': 8500,\n", " 'protocol': 'TCP',\n", " 'target_port': 8500}],\n", " 'publish_not_ready_addresses': None,\n", " 'selector': {'app': 'mnist-model'},\n", " 'session_affinity': 'None',\n", " 'session_affinity_config': None,\n", " 'type': 'ClusterIP'},\n", " 'status': {'load_balancer': {'ingress': None}}}, {'api_version': 
'v1',\n", " 'binary_data': None,\n", " 'data': {'monitoring_config.txt': 'prometheus_config: {\\n'\n", " ' enable: true,\\n'\n", " ' path: \"/monitoring/prometheus/metrics\"\\n'\n", " '}'},\n", " 'kind': 'ConfigMap',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': None,\n", " 'managed_fields': None,\n", " 'name': 'mnist-model',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731646',\n", " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/configmaps/mnist-model',\n", " 'uid': 'e9eeb2f4-4dde-11ea-9830-42010a8e016f'}}]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy the mnist UI\n", "\n", "* We will now deploy the UI to visualize the mnist results\n", "* Note: This uses a prebuilt, public Docker image for the UI" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "ui_name = \"mnist-ui\"\n",
"ui_deploy = f\"\"\"apiVersion: apps/v1\n", "kind: Deployment\n", "metadata:\n", "  name: {ui_name}\n", "  namespace: {namespace}\n", "spec:\n", "  replicas: 1\n", "  selector:\n", "    matchLabels:\n", "      app: mnist-web-ui\n", "  template:\n", "    metadata:\n", "      labels:\n", "        app: mnist-web-ui\n", "    spec:\n", "      containers:\n", "      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n", "        name: web-ui\n", "        ports:\n", "        - containerPort: 5000\n", "      serviceAccount: default-editor\n", "\"\"\"\n", "\n",
"ui_service = f\"\"\"apiVersion: v1\n", "kind: Service\n", "metadata:\n", "  annotations:\n", "  name: {ui_name}\n", "  namespace: {namespace}\n", "spec:\n", "  ports:\n", "  - name: http-mnist-ui\n", "    port: 80\n", "    targetPort: 5000\n", "  selector:\n", "    app: mnist-web-ui\n", "  type: ClusterIP\n", "\"\"\"\n", "\n",
"ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n", "kind: VirtualService\n", "metadata:\n", "  name: {ui_name}\n", "  namespace: {namespace}\n", "spec:\n", "  gateways:\n", "  - kubeflow/kubeflow-gateway\n", "  hosts:\n", "  - '*'\n", "  http:\n", "  - match:\n", "    - uri:\n", "        prefix: /mnist/{namespace}/ui/\n", "    rewrite:\n", "      uri: /\n", "    route:\n", "    - destination:\n", "        host: {ui_name}.{namespace}.svc.cluster.local\n", "        port:\n", "          number: 80\n", "    timeout: 300s\n", "\"\"\"\n", "\n", "ui_specs = [ui_deploy, ui_service, ui_virtual_service]" ] },
{ "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Deleted Deployment kubeflow-jlewi.mnist-ui\n", "Created Deployment kubeflow-jlewi.mnist-ui\n", "Deleted Service kubeflow-jlewi.mnist-ui\n", "Created Service kubeflow-jlewi.mnist-ui\n", "Deleted VirtualService kubeflow-jlewi.mnist-ui\n", "Created VirtualService mnist-ui.mnist-ui\n" ] }, { "data": { "text/plain": [ "[{'api_version': 'apps/v1',\n", " 'kind': 'Deployment',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': 1,\n", " 'initializers': None,\n", " 'labels': None,\n", " 'managed_fields': None,\n", " 'name': 'mnist-ui',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731648',\n", " 'self_link': '/apis/apps/v1/namespaces/kubeflow-jlewi/deployments/mnist-ui',\n", " 'uid': 'e9f77ba8-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': 
{'min_ready_seconds': None,\n", " 'paused': None,\n", " 'progress_deadline_seconds': 600,\n", " 'replicas': 1,\n", " 'revision_history_limit': 10,\n", " 'selector': {'match_expressions': None,\n", " 'match_labels': {'app': 'mnist-web-ui'}},\n", " 'strategy': {'rolling_update': {'max_surge': '25%',\n", " 'max_unavailable': '25%'},\n", " 'type': 'RollingUpdate'},\n", " 'template': {'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': None,\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': {'app': 'mnist-web-ui'},\n", " 'managed_fields': None,\n", " 'name': None,\n", " 'namespace': None,\n", " 'owner_references': None,\n", " 'resource_version': None,\n", " 'self_link': None,\n", " 'uid': None},\n", " 'spec': {'active_deadline_seconds': None,\n", " 'affinity': None,\n", " 'automount_service_account_token': None,\n", " 'containers': [{'args': None,\n", " 'command': None,\n", " 'env': None,\n", " 'env_from': None,\n", " 'image': 'gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225',\n", " 'image_pull_policy': 'IfNotPresent',\n", " 'lifecycle': None,\n", " 'liveness_probe': None,\n", " 'name': 'web-ui',\n", " 'ports': [{'container_port': 5000,\n", " 'host_ip': None,\n", " 'host_port': None,\n", " 'name': None,\n", " 'protocol': 'TCP'}],\n", " 'readiness_probe': None,\n", " 'resources': {'limits': None,\n", " 'requests': None},\n", " 'security_context': None,\n", " 'stdin': None,\n", " 'stdin_once': None,\n", " 'termination_message_path': '/dev/termination-log',\n", " 'termination_message_policy': 'File',\n", " 'tty': None,\n", " 'volume_devices': None,\n", " 'volume_mounts': None,\n", " 'working_dir': None}],\n", " 'dns_config': None,\n", " 'dns_policy': 'ClusterFirst',\n", " 'enable_service_links': None,\n", " 'host_aliases': None,\n", " 'host_ipc': None,\n", " 
'host_network': None,\n", " 'host_pid': None,\n", " 'hostname': None,\n", " 'image_pull_secrets': None,\n", " 'init_containers': None,\n", " 'node_name': None,\n", " 'node_selector': None,\n", " 'priority': None,\n", " 'priority_class_name': None,\n", " 'readiness_gates': None,\n", " 'restart_policy': 'Always',\n", " 'runtime_class_name': None,\n", " 'scheduler_name': 'default-scheduler',\n", " 'security_context': {'fs_group': None,\n", " 'run_as_group': None,\n", " 'run_as_non_root': None,\n", " 'run_as_user': None,\n", " 'se_linux_options': None,\n", " 'supplemental_groups': None,\n", " 'sysctls': None},\n", " 'service_account': 'default-editor',\n", " 'service_account_name': 'default-editor',\n", " 'share_process_namespace': None,\n", " 'subdomain': None,\n", " 'termination_grace_period_seconds': 30,\n", " 'tolerations': None,\n", " 'volumes': None}}},\n", " 'status': {'available_replicas': None,\n", " 'collision_count': None,\n", " 'conditions': None,\n", " 'observed_generation': None,\n", " 'ready_replicas': None,\n", " 'replicas': None,\n", " 'unavailable_replicas': None,\n", " 'updated_replicas': None}}, {'api_version': 'v1',\n", " 'kind': 'Service',\n", " 'metadata': {'annotations': None,\n", " 'cluster_name': None,\n", " 'creation_timestamp': datetime.datetime(2020, 2, 12, 21, 30, 39, tzinfo=tzlocal()),\n", " 'deletion_grace_period_seconds': None,\n", " 'deletion_timestamp': None,\n", " 'finalizers': None,\n", " 'generate_name': None,\n", " 'generation': None,\n", " 'initializers': None,\n", " 'labels': None,\n", " 'managed_fields': None,\n", " 'name': 'mnist-ui',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'owner_references': None,\n", " 'resource_version': '1731664',\n", " 'self_link': '/api/v1/namespaces/kubeflow-jlewi/services/mnist-ui',\n", " 'uid': 'ea12ef25-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'cluster_ip': '10.55.250.134',\n", " 'external_i_ps': None,\n", " 'external_name': None,\n", " 'external_traffic_policy': None,\n", " 
'health_check_node_port': None,\n", " 'load_balancer_ip': None,\n", " 'load_balancer_source_ranges': None,\n", " 'ports': [{'name': 'http-mnist-ui',\n", " 'node_port': None,\n", " 'port': 80,\n", " 'protocol': 'TCP',\n", " 'target_port': 5000}],\n", " 'publish_not_ready_addresses': None,\n", " 'selector': {'app': 'mnist-web-ui'},\n", " 'session_affinity': 'None',\n", " 'session_affinity_config': None,\n", " 'type': 'ClusterIP'},\n", " 'status': {'load_balancer': {'ingress': None}}}, {'apiVersion': 'networking.istio.io/v1alpha3',\n", " 'kind': 'VirtualService',\n", " 'metadata': {'creationTimestamp': '2020-02-12T21:30:39Z',\n", " 'generation': 1,\n", " 'name': 'mnist-ui',\n", " 'namespace': 'kubeflow-jlewi',\n", " 'resourceVersion': '1731676',\n", " 'selfLink': '/apis/networking.istio.io/v1alpha3/namespaces/kubeflow-jlewi/virtualservices/mnist-ui',\n", " 'uid': 'ea2ac046-4dde-11ea-9830-42010a8e016f'},\n", " 'spec': {'gateways': ['kubeflow/kubeflow-gateway'],\n", " 'hosts': ['*'],\n", " 'http': [{'match': [{'uri': {'prefix': '/mnist/kubeflow-jlewi/ui/'}}],\n", " 'rewrite': {'uri': '/'},\n", " 'route': [{'destination': {'host': 'mnist-ui.kubeflow-jlewi.svc.cluster.local',\n", " 'port': {'number': 80}}}],\n", " 'timeout': '300s'}]}}]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Access the web UI\n", "\n", "* A reverse proxy route is automatically added to the Kubeflow IAP endpoint\n", "* The endpoint will be\n", "\n", "  ```\n", "  https://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/\n", "  ```\n", "* You can get the KUBEFLOW_ENDPOINT with\n", "\n", "  ```\n", "  KUBEFLOW_ENDPOINT=`kubectl -n istio-system get ingress envoy-ingress -o jsonpath=\"{.spec.rules[0].host}\"`\n", "  ```\n", "\n", "  * You must run this command with sufficient RBAC permissions to get the ingress.\n", "\n", "* If you have sufficient privileges, you can run the cell below to get the endpoint. If you don't have sufficient privileges, you can\n", "  grant appropriate permissions by running the command\n", "\n", "  ```\n", "  kubectl create --namespace=istio-system rolebinding --clusterrole=kubeflow-view --serviceaccount=${NAMESPACE}:default-editor ${NAMESPACE}-istio-view\n", "  ```" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "mnist UI is at https://kf-v1-0210.endpoints.jlewi-dev.cloud.goog/mnist/kubeflow-jlewi/ui/" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "endpoint = k8s_util.get_iap_endpoint()\n", "if endpoint:\n", "    vs = yaml.safe_load(ui_virtual_service)\n", "    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n", "    ui_endpoint = endpoint + path\n", "    display(HTML(f\"mnist UI is at {ui_endpoint}\"))\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5rc1" } }, "nbformat": 4, "nbformat_minor": 4 }