mirror of https://github.com/kubeflow/examples.git
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# MNIST E2E on Kubeflow on AWS\n",
|
|
"\n",
|
|
"This example guides you through:\n",
|
|
" \n",
|
|
" 1. Taking an example TensorFlow model and modifying it to support distributed training\n",
|
|
" 1. Serving the resulting model using TFServing\n",
|
|
" 1. Deploying and using a web-app that uses the model\n",
|
|
" \n",
|
|
"## Requirements\n",
|
|
"\n",
|
|
" * You must be running Kubeflow 1.0 on EKS\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Install AWS CLI\n",
|
|
"\n",
|
|
"\n",
"Click `Kernel` -> `Restart` after you install new packages."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install boto3"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"### Create an AWS secret in Kubernetes and grant AWS access to your notebook\n",
"\n",
"> Note: Once IAM Roles for Service Accounts support is merged in 1.0.1, these credentials will no longer be needed.\n",
"\n",
"1. Create an AWS secret in the current namespace. \n",
"\n",
"> Note: To get a base64 string, try `echo -n $AWS_ACCESS_KEY_ID | base64`. The `kubectl create secret generic ... --from-literal` command below encodes the values for you, so this is only needed if you write the Secret manifest by hand. \n",
"> Make sure you have `AmazonEC2ContainerRegistryFullAccess` and `AmazonS3FullAccess` for this experiment. Pods will use these credentials to talk to AWS services."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%bash\n",
|
|
"\n",
|
|
"# Replace placeholder with your own AWS credentials\n",
|
|
"AWS_ACCESS_KEY_ID='<your_aws_access_key_id>'\n",
|
|
"AWS_SECRET_ACCESS_KEY='<your_aws_secret_access_key>'\n",
|
|
"\n",
|
|
"kubectl create secret generic aws-secret --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"2. Attach `AmazonEC2ContainerRegistryFullAccess` and `AmazonS3FullAccess` to the EKS node group role and grant AWS access to the notebook."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Verify you have access to AWS services\n",
|
|
"\n",
|
|
"* The cell below checks that this notebook was spawned with credentials to access AWS S3 and ECR"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import logging\n",
|
|
"import os\n",
|
|
"import uuid\n",
|
|
"from importlib import reload\n",
|
|
"import boto3\n",
|
|
"\n",
"# Set the region for the S3 bucket and Elastic Container Registry\n",
|
|
"AWS_REGION='us-west-2'\n",
|
|
"boto3.client('s3', region_name=AWS_REGION).list_buckets()\n",
|
|
"boto3.client('ecr', region_name=AWS_REGION).describe_repositories()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Prepare model\n",
|
|
"\n",
|
|
"There is a delta between existing distributed mnist examples and what's needed to run well as a TFJob.\n",
|
|
"\n",
|
|
"Basically, we must:\n",
|
|
"\n",
|
|
"1. Add options in order to make the model configurable.\n",
|
|
"1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.\n",
|
|
"1. Define serving signatures for model serving.\n",
|
|
"\n",
|
|
"The resulting model is [model.py](model.py)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Install Required Libraries\n",
|
|
"\n",
|
|
"Import the libraries required to train this model."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import notebook_setup\n",
|
|
"reload(notebook_setup)\n",
|
|
"notebook_setup.notebook_setup(platform='aws')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import k8s_util\n",
|
|
"# Force a reload of kubeflow; since kubeflow is a multi namespace module\n",
|
|
"# it looks like doing this in notebook_setup may not be sufficient\n",
|
|
"import kubeflow\n",
|
|
"reload(kubeflow)\n",
|
|
"from kubernetes import client as k8s_client\n",
|
|
"from kubernetes import config as k8s_config\n",
|
|
"from kubeflow.tfjob.api import tf_job_client as tf_job_client_module\n",
"from IPython.display import display, HTML\n",
|
|
"import yaml"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Configure The Docker Registry For Kubeflow Fairing\n",
|
|
"\n",
|
|
"* In order to build docker images from your notebook we need a docker registry where the images will be stored\n",
"* Below you set some variables specifying an [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/)\n",
"* Kubeflow Fairing provides a utility function to guess your AWS account ID"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from kubernetes import client as k8s_client\n",
|
|
"from kubernetes.client import rest as k8s_rest\n",
|
|
"from kubeflow import fairing \n",
|
|
"from kubeflow.fairing import utils as fairing_utils\n",
|
|
"from kubeflow.fairing.builders import append\n",
|
|
"from kubeflow.fairing.deployers import job\n",
|
|
"from kubeflow.fairing.preprocessors import base as base_preprocessor\n",
|
|
"\n",
|
|
"# Setting up AWS Elastic Container Registry (ECR) for storing output containers\n",
"# You can use any docker container registry instead of ECR\n",
|
|
"AWS_ACCOUNT_ID=fairing.cloud.aws.guess_account_id()\n",
|
|
"AWS_ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')\n",
|
|
"DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)\n",
|
|
"\n",
|
|
"namespace = fairing_utils.get_current_k8s_namespace()\n",
|
|
"\n",
|
|
"logging.info(f\"Running in aws region {AWS_REGION}, account {AWS_ACCOUNT_ID}\")\n",
|
|
"logging.info(f\"Running in namespace {namespace}\")\n",
|
|
"logging.info(f\"Using docker registry {DOCKER_REGISTRY}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Use Kubeflow fairing to build the docker image\n",
|
|
"\n",
"* You will use Kubeflow Fairing's Kaniko builder to build a docker image that includes all your dependencies\n",
"  * You use Kaniko because you want to be able to run `pip` to install dependencies\n",
"  * Kaniko gives you the flexibility to build images from Dockerfiles"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# TODO(https://github.com/kubeflow/fairing/issues/426): We should get rid of this once the default \n",
|
|
"# Kaniko image is updated to a newer image than 0.7.0.\n",
|
|
"from kubeflow.fairing import constants\n",
|
|
"constants.constants.KANIKO_IMAGE = \"gcr.io/kaniko-project/executor:v0.14.0\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from kubeflow.fairing.builders import cluster\n",
|
|
"\n",
"# output_map is a map of extra files to add to the build context.\n",
|
|
"# It is a map from source location to the location inside the context.\n",
|
|
"output_map = {\n",
|
|
" \"Dockerfile.model\": \"Dockerfile\",\n",
|
|
" \"model.py\": \"model.py\"\n",
|
|
"}\n",
|
|
"\n",
|
|
"preprocessor = base_preprocessor.BasePreProcessor(\n",
|
|
" command=[\"python\"], # The base class will set this.\n",
|
|
" input_files=[],\n",
|
|
" path_prefix=\"/app\", # irrelevant since we aren't preprocessing any files\n",
|
|
" output_map=output_map)\n",
|
|
"\n",
|
|
"preprocessor.preprocess()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Create a new ECR repository to host model image\n",
|
|
"!aws ecr create-repository --repository-name mnist --region=$AWS_REGION"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Use a Tensorflow image as the base image\n",
|
|
"# We use a custom Dockerfile \n",
|
|
"cluster_builder = cluster.cluster.ClusterBuilder(registry=DOCKER_REGISTRY,\n",
|
|
" base_image=\"\", # base_image is set in the Dockerfile\n",
|
|
" preprocessor=preprocessor,\n",
|
|
" image_name=\"mnist\",\n",
|
|
" dockerfile_path=\"Dockerfile\",\n",
|
|
" pod_spec_mutators=[fairing.cloud.aws.add_aws_credentials_if_exists, fairing.cloud.aws.add_ecr_config],\n",
|
|
" context_source=cluster.s3_context.S3ContextSource(region=AWS_REGION))\n",
|
|
"cluster_builder.build()\n",
|
|
"logging.info(f\"Built image {cluster_builder.image_tag}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"## Create an S3 Bucket\n",
"\n",
"* Create an S3 bucket to store our models and other results.\n",
"* Since we are running in Python we use the Python client library (boto3), but you could also use the `aws s3` command line"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import boto3\n",
|
|
"from botocore.exceptions import ClientError\n",
|
|
"\n",
|
|
"bucket = f\"{AWS_ACCOUNT_ID}-mnist\"\n",
|
|
"\n",
"def create_bucket(bucket_name, region=None):\n",
"    \"\"\"Create an S3 bucket in a specified region\n",
"\n",
"    If a region is not specified, the bucket is created in the S3 default\n",
"    region (us-east-1).\n",
"\n",
"    :param bucket_name: Bucket to create\n",
"    :param region: String region to create bucket in, e.g., 'us-west-2'\n",
"    :return: True if bucket created, else False\n",
"    \"\"\"\n",
"\n",
"    # Create bucket\n",
"    try:\n",
"        if region is None:\n",
"            s3_client = boto3.client('s3')\n",
"            s3_client.create_bucket(Bucket=bucket_name)\n",
"        else:\n",
"            s3_client = boto3.client('s3', region_name=region)\n",
"            location = {'LocationConstraint': region}\n",
"            s3_client.create_bucket(Bucket=bucket_name,\n",
"                                    CreateBucketConfiguration=location)\n",
"    except ClientError as e:\n",
"        logging.error(e)\n",
"        return False\n",
"    return True\n",
|
|
"\n",
|
|
"create_bucket(bucket, AWS_REGION)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Distributed training\n",
|
|
"\n",
|
|
"* We will train the model by using TFJob to run a distributed training job"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"train_name = f\"mnist-train-{uuid.uuid4().hex[:4]}\"\n",
|
|
"num_ps = 1\n",
|
|
"num_workers = 2\n",
|
|
"model_dir = f\"s3://{bucket}/mnist\"\n",
|
|
"export_path = f\"s3://{bucket}/mnist/export\"\n",
|
|
"train_steps = 200\n",
|
|
"batch_size = 100\n",
|
|
"learning_rate = .01\n",
|
|
"image = cluster_builder.image_tag\n",
|
|
"\n",
|
|
"train_spec = f\"\"\"apiVersion: kubeflow.org/v1\n",
|
|
"kind: TFJob\n",
|
|
"metadata:\n",
"  name: {train_name}\n",
"spec:\n",
"  tfReplicaSpecs:\n",
"    Ps:\n",
"      replicas: {num_ps}\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"            env:\n",
"            - name: AWS_REGION\n",
"              value: {AWS_REGION}\n",
"            - name: AWS_ACCESS_KEY_ID\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_ACCESS_KEY_ID\n",
"            - name: AWS_SECRET_ACCESS_KEY\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_SECRET_ACCESS_KEY\n",
"\n",
"          restartPolicy: OnFailure\n",
"    Chief:\n",
"      replicas: 1\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"            env:\n",
"            - name: AWS_REGION\n",
"              value: {AWS_REGION}\n",
"            - name: AWS_ACCESS_KEY_ID\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_ACCESS_KEY_ID\n",
"            - name: AWS_SECRET_ACCESS_KEY\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_SECRET_ACCESS_KEY\n",
"\n",
"          restartPolicy: OnFailure\n",
"    Worker:\n",
"      replicas: 1\n",
"      template:\n",
"        metadata:\n",
"          annotations:\n",
"            sidecar.istio.io/inject: \"false\"\n",
"        spec:\n",
"          serviceAccount: default-editor\n",
"          containers:\n",
"          - name: tensorflow\n",
"            command:\n",
"            - python\n",
"            - /opt/model.py\n",
"            - --tf-model-dir={model_dir}\n",
"            - --tf-export-dir={export_path}\n",
"            - --tf-train-steps={train_steps}\n",
"            - --tf-batch-size={batch_size}\n",
"            - --tf-learning-rate={learning_rate}\n",
"            image: {image}\n",
"            workingDir: /opt\n",
"            env:\n",
"            - name: AWS_REGION\n",
"              value: {AWS_REGION}\n",
"            - name: AWS_ACCESS_KEY_ID\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_ACCESS_KEY_ID\n",
"            - name: AWS_SECRET_ACCESS_KEY\n",
"              valueFrom:\n",
"                secretKeyRef:\n",
"                  name: aws-secret\n",
"                  key: AWS_SECRET_ACCESS_KEY\n",
"          restartPolicy: OnFailure\n",
|
|
"\"\"\" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create the training job\n",
|
|
"\n",
|
|
"* You could write the spec to a YAML file and then do `kubectl apply -f {FILE}`\n",
|
|
"* Since you are running in jupyter you will use the TFJob client\n",
|
|
"* You will run the TFJob in a namespace created by a Kubeflow profile\n",
"  * The namespace will be the same namespace you are running the notebook in\n",
"  * Creating a profile ensures the namespace is provisioned with service accounts and other resources needed for Kubeflow"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tf_job_client = tf_job_client_module.TFJobClient()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tf_job_body = yaml.safe_load(train_spec)\n",
|
|
"tf_job = tf_job_client.create(tf_job_body, namespace=namespace) \n",
|
|
"\n",
|
|
"logging.info(f\"Created job {namespace}.{train_name}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Check the job\n",
|
|
"\n",
"* Above you used the Python SDK for TFJob to create the job\n",
"* You can also use kubectl to get the status of your job\n",
|
|
"* The job conditions will tell you whether the job is running, succeeded or failed"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!kubectl get tfjobs -o yaml {train_name}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Get The Logs\n",
|
|
"\n",
|
|
"* There are two ways to get the logs for the training job\n",
|
|
"\n",
"  1. Using kubectl to fetch the pod logs\n",
"     * These logs are ephemeral; they will be unavailable when the pod is garbage collected to free up resources\n",
"  1. Using Fluentd-Cloud-Watch\n",
"     * Kubernetes data plane logs are not automatically available in AWS\n",
"     * You need to install the fluentd-cloud-watch plugin to ship container logs to CloudWatch\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy TensorBoard\n",
|
|
"\n",
|
|
"* You will create a Kubernetes Deployment to run TensorBoard\n",
|
|
"* TensorBoard will be accessible behind the Kubeflow endpoint"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tb_name = \"mnist-tensorboard\"\n",
|
|
"tb_deploy = f\"\"\"apiVersion: apps/v1\n",
|
|
"kind: Deployment\n",
|
|
"metadata:\n",
"  labels:\n",
"    app: mnist-tensorboard\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-tensorboard\n",
"  template:\n",
"    metadata:\n",
"      labels:\n",
"        app: mnist-tensorboard\n",
"        version: v1\n",
"    spec:\n",
"      serviceAccount: default-editor\n",
"      containers:\n",
"      - command:\n",
"        - /usr/local/bin/tensorboard\n",
"        - --logdir={model_dir}\n",
"        - --port=80\n",
"        image: tensorflow/tensorflow:1.15.2-py3\n",
"        name: tensorboard\n",
"        env:\n",
"        - name: AWS_REGION\n",
"          value: {AWS_REGION}\n",
"        - name: AWS_ACCESS_KEY_ID\n",
"          valueFrom:\n",
"            secretKeyRef:\n",
"              name: aws-secret\n",
"              key: AWS_ACCESS_KEY_ID\n",
"        - name: AWS_SECRET_ACCESS_KEY\n",
"          valueFrom:\n",
"            secretKeyRef:\n",
"              name: aws-secret\n",
"              key: AWS_SECRET_ACCESS_KEY\n",
"        ports:\n",
"        - containerPort: 80\n",
"\"\"\"\n",
"tb_service = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  labels:\n",
"    app: mnist-tensorboard\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: http-tb\n",
"    port: 80\n",
"    targetPort: 80\n",
"  selector:\n",
"    app: mnist-tensorboard\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"tb_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
"kind: VirtualService\n",
"metadata:\n",
"  name: {tb_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  gateways:\n",
"  - kubeflow/kubeflow-gateway\n",
"  hosts:\n",
"  - '*'\n",
"  http:\n",
"  - match:\n",
"    - uri:\n",
"        prefix: /mnist/{namespace}/tensorboard/\n",
"    rewrite:\n",
"      uri: /\n",
"    route:\n",
"    - destination:\n",
"        host: {tb_name}.{namespace}.svc.cluster.local\n",
"        port:\n",
"          number: 80\n",
"    timeout: 300s\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"tb_specs = [tb_deploy, tb_service, tb_virtual_service]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"k8s_util.apply_k8s_specs(tb_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Access The TensorBoard UI\n",
|
|
"\n",
"> Note: By default, your namespace may not have access to the `istio-system` namespace to get the ingress endpoint."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"endpoint = k8s_util.get_ingress_endpoint() \n",
|
|
"if endpoint: \n",
"    vs = yaml.safe_load(tb_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    tb_endpoint = endpoint + path\n",
"    display(HTML(f\"TensorBoard UI is at <a href='{tb_endpoint}'>{tb_endpoint}</a>\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Wait For the Training Job to finish"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* You can use the TFJob client to wait for it to finish."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tf_job = tf_job_client.wait_for_condition(train_name, expected_condition=[\"Succeeded\", \"Failed\"], namespace=namespace)\n",
|
|
"\n",
|
|
"if tf_job_client.is_job_succeeded(train_name, namespace):\n",
"    logging.info(f\"TFJob {namespace}.{train_name} succeeded\")\n",
"else:\n",
"    raise ValueError(f\"TFJob {namespace}.{train_name} failed\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Serve the model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
"* Deploy the model using TensorFlow Serving\n",
"* We need to create\n",
"  1. A Kubernetes Deployment\n",
"  1. A Kubernetes Service\n",
"  1. (Optional) A ConfigMap containing the Prometheus monitoring config"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import os\n",
|
|
"deploy_name = \"mnist-model\"\n",
|
|
"model_base_path = export_path\n",
|
|
"\n",
|
|
"# The web ui defaults to mnist-service so if you change it you will\n",
"# need to change it in the UI as well to send predictions to the model\n",
|
|
"model_service = \"mnist-service\"\n",
|
|
"\n",
|
|
"deploy_spec = f\"\"\"apiVersion: apps/v1\n",
|
|
"kind: Deployment\n",
|
|
"metadata:\n",
"  labels:\n",
"    app: mnist\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-model\n",
"  template:\n",
"    metadata:\n",
"      # TODO(jlewi): Right now we disable the Istio sidecar because otherwise Istio RBAC will prevent the\n",
"      # UI from sending RPCs to the server. We should create an appropriate Istio RBAC authorization\n",
"      # policy to allow traffic from the UI to the model server.\n",
"      # https://istio.io/docs/concepts/security/#target-selectors\n",
"      annotations:\n",
"        sidecar.istio.io/inject: \"false\"\n",
"      labels:\n",
"        app: mnist-model\n",
"        version: v1\n",
"    spec:\n",
"      serviceAccount: default-editor\n",
"      containers:\n",
"      - args:\n",
"        - --port=9000\n",
"        - --rest_api_port=8500\n",
"        - --model_name=mnist\n",
"        - --model_base_path={model_base_path}\n",
"        - --monitoring_config_file=/var/config/monitoring_config.txt\n",
"        command:\n",
"        - /usr/bin/tensorflow_model_server\n",
"        env:\n",
"        - name: modelBasePath\n",
"          value: {model_base_path}\n",
"        - name: AWS_REGION\n",
"          value: {AWS_REGION}\n",
"        - name: AWS_ACCESS_KEY_ID\n",
"          valueFrom:\n",
"            secretKeyRef:\n",
"              name: aws-secret\n",
"              key: AWS_ACCESS_KEY_ID\n",
"        - name: AWS_SECRET_ACCESS_KEY\n",
"          valueFrom:\n",
"            secretKeyRef:\n",
"              name: aws-secret\n",
"              key: AWS_SECRET_ACCESS_KEY\n",
"        image: tensorflow/serving:1.15.0\n",
"        imagePullPolicy: IfNotPresent\n",
"        livenessProbe:\n",
"          initialDelaySeconds: 30\n",
"          periodSeconds: 30\n",
"          tcpSocket:\n",
"            port: 9000\n",
"        name: mnist\n",
"        ports:\n",
"        - containerPort: 9000\n",
"        - containerPort: 8500\n",
"        resources:\n",
"          limits:\n",
"            cpu: \"1\"\n",
"            memory: 1Gi\n",
"          requests:\n",
"            cpu: \"1\"\n",
"            memory: 1Gi\n",
"        volumeMounts:\n",
"        - mountPath: /var/config/\n",
"          name: model-config\n",
"      volumes:\n",
"      - configMap:\n",
"          name: {deploy_name}\n",
"        name: model-config\n",
"\"\"\"\n",
"\n",
"service_spec = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"    prometheus.io/path: /monitoring/prometheus/metrics\n",
"    prometheus.io/port: \"8500\"\n",
"    prometheus.io/scrape: \"true\"\n",
"  labels:\n",
"    app: mnist-model\n",
"  name: {model_service}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: grpc-tf-serving\n",
"    port: 9000\n",
"    targetPort: 9000\n",
"  - name: http-tf-serving\n",
"    port: 8500\n",
"    targetPort: 8500\n",
"  selector:\n",
"    app: mnist-model\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"monitoring_config = f\"\"\"kind: ConfigMap\n",
"apiVersion: v1\n",
"metadata:\n",
"  name: {deploy_name}\n",
"  namespace: {namespace}\n",
"data:\n",
"  monitoring_config.txt: |-\n",
"    prometheus_config: {{\n",
"      enable: true,\n",
"      path: \"/monitoring/prometheus/metrics\"\n",
"    }}\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"model_specs = [deploy_spec, service_spec, monitoring_config]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"k8s_util.apply_k8s_specs(model_specs, k8s_util.K8S_CREATE_OR_REPLACE)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Deploy the mnist UI\n",
|
|
"\n",
"* We will now deploy the UI to visualize the MNIST results\n",
|
|
"* Note: This is using a prebuilt and public docker image for the UI"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"ui_name = \"mnist-ui\"\n",
|
|
"ui_deploy = f\"\"\"apiVersion: apps/v1\n",
|
|
"kind: Deployment\n",
|
|
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  replicas: 1\n",
"  selector:\n",
"    matchLabels:\n",
"      app: mnist-web-ui\n",
"  template:\n",
"    metadata:\n",
"      labels:\n",
"        app: mnist-web-ui\n",
"    spec:\n",
"      containers:\n",
"      - image: gcr.io/kubeflow-examples/mnist/web-ui:v20190112-v0.2-142-g3b38225\n",
"        name: web-ui\n",
"        ports:\n",
"        - containerPort: 5000\n",
"      serviceAccount: default-editor\n",
"\"\"\"\n",
"\n",
"ui_service = f\"\"\"apiVersion: v1\n",
"kind: Service\n",
"metadata:\n",
"  annotations:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  ports:\n",
"  - name: http-mnist-ui\n",
"    port: 80\n",
"    targetPort: 5000\n",
"  selector:\n",
"    app: mnist-web-ui\n",
"  type: ClusterIP\n",
"\"\"\"\n",
"\n",
"ui_virtual_service = f\"\"\"apiVersion: networking.istio.io/v1alpha3\n",
"kind: VirtualService\n",
"metadata:\n",
"  name: {ui_name}\n",
"  namespace: {namespace}\n",
"spec:\n",
"  gateways:\n",
"  - kubeflow/kubeflow-gateway\n",
"  hosts:\n",
"  - '*'\n",
"  http:\n",
"  - match:\n",
"    - uri:\n",
"        prefix: /mnist/{namespace}/ui/\n",
"    rewrite:\n",
"      uri: /\n",
"    route:\n",
"    - destination:\n",
"        host: {ui_name}.{namespace}.svc.cluster.local\n",
"        port:\n",
"          number: 80\n",
"    timeout: 300s\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"ui_specs = [ui_deploy, ui_service, ui_virtual_service]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"k8s_util.apply_k8s_specs(ui_specs, k8s_util.K8S_CREATE_OR_REPLACE) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Access the web UI\n",
|
|
"\n",
|
|
"* A reverse proxy route is automatically added to the Kubeflow endpoint\n",
|
|
"* The endpoint will be\n",
|
|
"\n",
"  ```\n",
"  http://${KUBEFLOW_ENDPOINT}/mnist/${NAMESPACE}/ui/\n",
"  ```\n",
"* You can get the KUBEFLOW_ENDPOINT\n",
"\n",
"  ```\n",
"  KUBEFLOW_ENDPOINT=`kubectl -n istio-system get ingress istio-ingress -o jsonpath=\"{.status.loadBalancer.ingress[0].hostname}\"`\n",
"  ```\n",
"\n",
"  * You must run this command with sufficient RBAC permissions to get the ingress.\n",
"\n",
"* If you have sufficient privileges, you can run the cell below to get the endpoint. If you don't, you can\n",
"  grant appropriate permissions by running the command\n",
"\n",
"  ```\n",
"  kubectl create --namespace=istio-system rolebinding --clusterrole=kubeflow-view --serviceaccount=${NAMESPACE}:default-editor ${NAMESPACE}-istio-view\n",
"  ```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"endpoint = k8s_util.get_ingress_endpoint() \n",
|
|
"if endpoint: \n",
"    vs = yaml.safe_load(ui_virtual_service)\n",
"    path = vs[\"spec\"][\"http\"][0][\"match\"][0][\"uri\"][\"prefix\"]\n",
"    ui_endpoint = endpoint + path\n",
"    display(HTML(f\"mnist UI is at <a href='{ui_endpoint}'>{ui_endpoint}</a>\"))"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|