Added tutorial for object detection distributed training (#74)

* Added tutorial for object detection distributed training

Added steps on how to leverage kubeflow tooling to
submit a distributed object detection training job
in a small kubernetes cluster (minikube, 2-4 node cluster)

* Added Jobs to prepare the training data and model

* Updated instructions

* Fixed typos and added export tf graph job

* Fixed paths in jobs and instructions

* Enhanced instructions and re-arranged folder structure

* Updated links to kubeflow user guide documentation
Daniel Castellanos 2018-07-03 14:10:20 -07:00 committed by k8s-ci-robot
parent 5a9748bf8f
commit b6a3c4c0ea
21 changed files with 861 additions and 7 deletions

@@ -146,7 +146,7 @@ your github ID in the appropriate column, creating a corresponding GitHub issue,
a PR. It is not an exhaustive list, only the result of brainstorming for
inspiration. Feel free to add to this list and/or reprioritize.
Priority guidance:
* **P0**: Very important, try to self-assign if there is a P0 available
* **P1**: Important, try to self-assign if there is no P0 available
@@ -160,8 +160,8 @@ Priority guidance:
| [Zillow housing prediction](https://www.kaggle.com/c/zillow-prize-1/kernels) | Zillow's home value prediction on Kaggle | **P0** | High prize Kaggle competition w/ opportunity to show XGBoost | XGBoost | [puneith](https://github.com/puneith) | Google | [issue #16](https://github.com/kubeflow/examples/issues/16) |
| [Mercari price suggestion challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge) | Automatically suggest product prices to online sellers | **P0** | | | | | |
| [Airbnb new user bookings](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings) | Where will a new guest book their first travel experience | | | | | | |
| [TensorFlow object detection](https://github.com/tensorflow/models/tree/master/research/object_detection) | Object detection using TensorFlow API | | | TensorFlow | [ldcastell](https://github.com/ldcastell) | Intel | [issue #73](https://github.com/kubeflow/examples/issues/73) |
| [TFGAN](https://github.com/tensorflow/models/blob/master/research/gan/tutorial.ipynb) | Define, train, and evaluate a GAN | | GANs are of great interest currently | | | | |
| [Nested LSTM](https://github.com/hannw/nlstm) | TensorFlow implementation of nested LSTM cell | | LSTMs are the canonical RNN implementation for mitigating the vanishing-gradient problem and are widely used for time series | | | | |
| [How to solve 90% of NLP problems: A step by step guide on Medium](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e) | Medium post on how to solve 90% of NLP problems, from Emmanuel Ameisen | | Solves a really common problem in a generic way; a great example for people who want to do NLP but don't know how to do the 80% of work (tokenization, basic transforms, stop-word removal, etc.) that is boilerplate across every NLP task | | | | |
| [Tour of top-10 algorithms for ML newbies](https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11) | Top 10 algorithms for ML newbies | | Medium post with 8K claps and a guide for ML newbies getting started with ML | | | | |

@@ -35,6 +35,15 @@ This example covers the following concepts:
1. Monitoring with Argo UI and Tensorboard
1. Serving with Tensorflow
### [Distributed Object Detection](./object_detection)
Author: [Daniel Castellanos](https://github.com/ldcastell)
This example covers the following concepts:
1. Gathering and preparing the data for model training using K8s jobs
1. Using Kubeflow tf-job and tf-operator to launch a distributed object detection training job
1. Serving the model through Kubeflow's tf-serving
## Component-focused
1.
@@ -45,7 +54,7 @@ This example covers the following concepts:
## Third-party hosted
| Source | Example | Description |
| ------ | ------- | ----------- |
| | | | |

object_detection/README.md Normal file

@@ -0,0 +1,11 @@
# Distributed TensorFlow Object Detection Training on K8s with [Kubeflow](https://github.com/kubeflow/kubeflow)
This example demonstrates how to use `kubeflow` to train an object detection model on an existing K8s cluster using
the [TensorFlow object detection API](https://github.com/tensorflow/models/tree/master/research/object_detection).
It is based on the TensorFlow [Pets tutorial](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md).
## Steps:
1. [Setup a Kubeflow cluster](setup.md)
2. [Submit a distributed object detection training job](submit_job.md)
3. [Monitor your training job](monitor_job.md)
4. [Serve your model with TensorFlow serving](export_tf_graph.md)

object_detection/conf/faster_rcnn_resnet101_pets.config Normal file

@@ -0,0 +1,144 @@
# Faster R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/pets_data/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "/pets_data/pet_train_with_masks.record"
  }
  label_map_path: "/models/research/object_detection/data/pet_label_map.pbtxt"
}
eval_config: {
  num_examples: 2000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/pets_data/pet_val_with_masks.record"
  }
  label_map_path: "/models/research/object_detection/data/pet_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

object_detection/docker/Dockerfile Normal file

@@ -0,0 +1,66 @@
FROM ubuntu:16.04
LABEL maintainer="Soila Kavulya <soila.p.kavulya@intel.com>"
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
protobuf-compiler \
python \
python-dev \
python-pil \
python-tk \
python-lxml \
rsync \
git \
software-properties-common \
unzip \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN pip --no-cache-dir install \
Cython \
glob2 \
h5py \
ipykernel \
jupyter \
matplotlib \
numpy \
pandas \
scipy \
sklearn \
six \
tensorflow \
tensorflow-serving-api \
&& \
python -m ipykernel.kernelspec
# Setup Universal Object Detection
ENV MODELS_HOME "/models"
RUN git clone https://github.com/tensorflow/models.git $MODELS_HOME && cd $MODELS_HOME && git checkout r1.5
#COPY models $MODELS_HOME
RUN cd $MODELS_HOME/research \
&& protoc object_detection/protos/*.proto --python_out=.
ENV PYTHONPATH "$MODELS_HOME/research:$MODELS_HOME/research/slim:$PYTHONPATH"
COPY scripts /scripts
# TensorBoard
EXPOSE 6006
WORKDIR $MODELS_HOME
# Run training job
ARG pipeline_config_path
ARG train_dir
# ARG values are build-time only and exec-form CMD does not expand variables,
# so persist them as ENV and launch training with a shell-form CMD.
ENV pipeline_config_path=$pipeline_config_path train_dir=$train_dir
CMD python $MODELS_HOME/research/object_detection/train.py --logtostderr --pipeline_config_path=$pipeline_config_path --train_dir=$train_dir

object_detection/export_tf_graph.md Normal file

@@ -0,0 +1,62 @@
## Export the TensorFlow Graph
In the [jobs](./jobs) directory you will find the manifest file [export-tf-graph.yaml](./jobs/export-tf-graph.yaml).
Before executing the job we first need to identify a checkpoint candidate in the `pets-data-claim` PVC under
`/pets_data/train`.
To see what's being saved in `/pets_data/train` while the training job is running, open an interactive shell in your master pod (substitute your own pod name):
```
kubectl -n kubeflow exec -it pets-training-master-r1hv-0-i6k7c sh
```
This opens an interactive shell in your container; run `ls /pets_data/train` and look for a
checkpoint candidate.
Once you have identified the checkpoint, open the `export-tf-graph.yaml` file under the `./jobs` directory
and edit the container command args on line 20, `--trained_checkpoint_prefix=/pets_data/train/model.ckpt-<number>`,
to match the chosen checkpoint.
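For instance, if the checkpoint you picked were `model.ckpt-186618` (a made-up step number; yours will differ), the job's container args would read:
```
args:
- "--input_type=image_tensor"
- "--pipeline_config_path=/pets_data/faster_rcnn_resnet101_pets.config"
- "--trained_checkpoint_prefix=/pets_data/train/model.ckpt-186618"
- "--output_directory=/pets_data/exported_graphs"
```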
Now you can execute the job with:
```
kubectl -n kubeflow apply -f ./jobs/export-tf-graph.yaml
```
Once the job completes, a new directory called `exported_graphs` will be created under `/pets_data` in the
`pets-data-claim` PVC, containing the model and the frozen graph.
Before serving the model we need to perform a quick hack, since the object detection export Python API does not
generate a "version" folder for the saved model. The hack consists of creating a directory and moving some files into it.
One way of doing this is to open an interactive shell in one of your running containers and move the data yourself:
```
kubectl -n kubeflow exec -it pets-training-master-r1hv-0-i6k7c sh
mkdir /pets_data/exported_graphs/saved_model/1
cp /pets_data/exported_graphs/saved_model/* /pets_data/exported_graphs/saved_model/1
```
## Serve the model using TF-Serving
Apply the manifest file under [tf-serving](./tf-serving) directory:
```
kubectl -n kubeflow apply -f ./tf-serving/tf-serving.yaml
```
After that you should see the pets-model pod. Run:
```
kubectl -n kubeflow get pods | grep pets-model
```
That will output something like this:
```
pets-model-v1-57674c8f76-4qrqp 1/1 Running 0 4h
```
Take a look at the logs:
```
kubectl -n kubeflow logs pets-model-v1-57674c8f76-4qrqp
```
And you should see:
```
2018-06-21 19:20:32.325406: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: pets-model version: 1}
E0621 19:20:34.134165172 7 ev_epoll1_linux.c:1051] grpc epoll fd: 3
2018-06-21 19:20:34.135354: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:9000 ...
```
Now you can use a gRPC client to run inference using your trained model!
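For example, here is a minimal Python client sketch (not part of this change) that assumes a recent `tensorflow`/`tensorflow-serving-api` install, a local `pet.jpg`, and that the service's port 9000 has been forwarded to `localhost:9000` (e.g. with `kubectl -n kubeflow port-forward svc/pets-model 9000:9000`):
```
# pets_client.py -- hypothetical example client, not shipped with this example.
import grpc
import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# The graph was exported with --input_type=image_tensor, so the model expects
# a batch of uint8 images under the input name "inputs".
image = np.array(Image.open("pet.jpg"))[np.newaxis, ...].astype(np.uint8)

request = predict_pb2.PredictRequest()
request.model_spec.name = "pets-model"  # must match --model_name in tf-serving.yaml
request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(image, shape=image.shape))

result = stub.Predict(request, 60.0)  # 60-second timeout
print(result.outputs["detection_classes"])
print(result.outputs["detection_scores"])
```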

object_detection/jobs/00create-pvc.yaml Normal file

@@ -0,0 +1,12 @@
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pets-data-claim
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem  # the jobs mount this volume at a filesystem path
  resources:
    requests:
      storage: 20Gi

object_detection/jobs/01get-dataset.yaml Normal file

@@ -0,0 +1,23 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: get-pets-dataset
spec:
  template:
    spec:
      containers:
      - name: get-data
        image: inutano/wget
        imagePullPolicy: IfNotPresent
        command: ["wget", "http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz", "-P", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/02get-annotations.yaml Normal file

@@ -0,0 +1,23 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: get-pets-annotations
spec:
  template:
    spec:
      containers:
      - name: get-annotations
        image: inutano/wget
        imagePullPolicy: IfNotPresent
        command: ["wget", "http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz", "-P", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/03get-model-job.yaml Normal file

@@ -0,0 +1,22 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: get-fasterrcnn-model
spec:
  template:
    spec:
      containers:
      - name: get-model
        image: inutano/wget
        imagePullPolicy: IfNotPresent
        command: ["wget", "http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_2018_01_28.tar.gz", "-P", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/04decompress-images.yaml Normal file

@@ -0,0 +1,23 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: decompress-images
spec:
  template:
    spec:
      containers:
      - name: decompress-images
        image: ubuntu:16.04
        imagePullPolicy: IfNotPresent
        command: ["tar", "-xzvf", "/pets_data/images.tar.gz", "-C", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/05decompress-annotations.yaml Normal file

@@ -0,0 +1,23 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: decompress-annotations
spec:
  template:
    spec:
      containers:
      - name: decompress-annotations
        image: ubuntu:16.04
        imagePullPolicy: IfNotPresent
        command: ["tar", "-xzvf", "/pets_data/annotations.tar.gz", "-C", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/06decompress-model.yaml Normal file

@@ -0,0 +1,23 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: decompress-fasterrcnn-model
spec:
  template:
    spec:
      containers:
      - name: decompress-model
        image: ubuntu:16.04
        imagePullPolicy: IfNotPresent
        command: ["tar", "-xzvf", "/pets_data/faster_rcnn_resnet101_coco_2018_01_28.tar.gz", "-C", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/07get-fasterrcnn-config.yaml Normal file

@@ -0,0 +1,22 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: get-fasterrcnn-config
spec:
  template:
    spec:
      containers:
      - name: get-model
        image: inutano/wget
        imagePullPolicy: IfNotPresent
        command: ["wget", "--no-check-certificate", "https://raw.githubusercontent.com/ldcastell/examples/distributedTrainingExample/object_detection/conf/faster_rcnn_resnet101_pets.config", "-P", "/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/08create-pet-record.yaml Normal file

@@ -0,0 +1,24 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: create-tf-record
spec:
  template:
    spec:
      containers:
      - name: create-tf-record
        image: lcastell/pets_object_detection
        imagePullPolicy: IfNotPresent
        workingDir: "/"
        command: ["python", "models/research/object_detection/dataset_tools/create_pet_tf_record.py", "--label_map_path=models/research/object_detection/data/pet_label_map.pbtxt", "--data_dir=/pets_data", "--output_dir=/pets_data"]
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/export-tf-graph.yaml Normal file

@@ -0,0 +1,30 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: export-tf-graph
spec:
  template:
    spec:
      containers:
      - name: export-tf-graph
        image: lcastell/pets_object_detection
        imagePullPolicy: IfNotPresent
        workingDir: "/"
        command:
        - "python"
        - "models/research/object_detection/export_inference_graph.py"
        args:
        - "--input_type=image_tensor"
        - "--pipeline_config_path=/pets_data/faster_rcnn_resnet101_pets.config"
        - "--trained_checkpoint_prefix=/pets_data/train/model.ckpt-<number>"
        - "--output_directory=/pets_data/exported_graphs"
        volumeMounts:
        - mountPath: "/pets_data"
          name: pets-data
      volumes:
      - name: pets-data
        persistentVolumeClaim:
          claimName: pets-data-claim
      restartPolicy: Never
  backoffLimit: 4

object_detection/jobs/pets-training.yaml Normal file

@@ -0,0 +1,77 @@
---
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: pets-training
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - image: lcastell/pets_object_detection
          imagePullPolicy: IfNotPresent
          name: tensorflow
          workingDir: /models
          command:
          - "python"
          - "research/object_detection/train.py"
          args:
          - "--logtostderr"
          - "--pipeline_config_path=/pets_data/faster_rcnn_resnet101_pets.config"
          - "--train_dir=/pets_data/train"
          volumeMounts:
          - mountPath: "/pets_data"
            name: pets-data
        volumes:
        - name: pets-data
          persistentVolumeClaim:
            claimName: pets-data-claim
        restartPolicy: OnFailure
    tfReplicaType: MASTER
  - replicas: 1
    template:
      spec:
        containers:
        - image: lcastell/pets_object_detection
          imagePullPolicy: IfNotPresent
          name: tensorflow
          command:
          - "python"
          - "research/object_detection/train.py"
          args:
          - "--logtostderr"
          - "--pipeline_config_path=/pets_data/faster_rcnn_resnet101_pets.config"
          - "--train_dir=/pets_data/train"
          volumeMounts:
          - mountPath: "/pets_data"
            name: pets-data
        volumes:
        - name: pets-data
          persistentVolumeClaim:
            claimName: pets-data-claim
        restartPolicy: OnFailure
    tfReplicaType: WORKER
  - replicas: 1
    template:
      spec:
        containers:
        - image: lcastell/pets_object_detection
          imagePullPolicy: IfNotPresent
          name: tensorflow
          command:
          - "python"
          - "research/object_detection/train.py"
          args:
          - "--logtostderr"
          - "--pipeline_config_path=/pets_data/faster_rcnn_resnet101_pets.config"
          - "--train_dir=/pets_data/train"
          volumeMounts:
          - mountPath: "/pets_data"
            name: pets-data
        volumes:
        - name: pets-data
          persistentVolumeClaim:
            claimName: pets-data-claim
        restartPolicy: OnFailure
    tfReplicaType: PS

object_detection/monitor_job.md Normal file

@@ -0,0 +1,51 @@
## Monitor your job
### View status
```
kubectl -n kubeflow describe tfjobs pets-training
```
### View logs of individual pods
```
kubectl -n kubeflow get pods
kubectl -n kubeflow logs <name_of_master_pod>
```
**NOTE:** When the job finishes, the pods are automatically terminated. To also see completed pods, run the `get pods` command with the `-a` flag:
```
kubectl -n kubeflow get pods -a
```
While the job is running, you should see something like this in your master pod logs:
```
INFO:tensorflow:Saving checkpoint to path /pets_data/train/model.ckpt
INFO:tensorflow:Recording summary at step 819.
INFO:tensorflow:global step 819: loss = 0.8603 (19.898 sec/step)
INFO:tensorflow:global step 822: loss = 1.9421 (18.507 sec/step)
INFO:tensorflow:global step 825: loss = 0.7147 (17.088 sec/step)
INFO:tensorflow:global step 828: loss = 1.7722 (18.033 sec/step)
INFO:tensorflow:global step 831: loss = 1.3933 (17.739 sec/step)
INFO:tensorflow:global step 834: loss = 0.2307 (16.493 sec/step)
INFO:tensorflow:Recording summary at step 839
```
When the job finishes, you should see something like this in your completed/terminated master pod logs:
```
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /pets_data/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 200006.
INFO:tensorflow:global step 200006: loss = 0.0091 (9.854 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
```
Now you have a trained model! Find it at `/pets_data/train` inside the `pets-data-claim` PVC.
### Delete job
```
kubectl -n kubeflow delete -f ./jobs/pets-training.yaml
```
## Next
[Export the TensorFlow Graph and Serve the model with TF Serving](./export_tf_graph.md)

object_detection/setup.md Normal file

@@ -0,0 +1,59 @@
## Setup Kubeflow
### Requirements
- Kubernetes cluster
- Access to a working `kubectl` (Kubernetes CLI)
- Ksonnet CLI: [ks](https://ksonnet.io/)
### Setup
Refer to the [user guide](https://www.kubeflow.org/docs/about/user_guide) for instructions on how to set up Kubeflow on your Kubernetes cluster. Specifically, look at the section on [deploying Kubeflow](https://www.kubeflow.org/docs/about/user_guide#deploy-kubeflow).
For this example we will be using the ks `nocloud` environment. If you plan to use the `cloud` ks environment, please make sure you follow the proper instructions in the Kubeflow user guide.
After completing the steps in the kubeflow user guide you will have the following:
- A ksonnet app directory called `my-kubeflow`
- A new namespace in your K8s cluster called `kubeflow`
- The following pods in your kubernetes cluster in the `kubeflow` namespace:
```
kubectl -n kubeflow get pods
NAME READY STATUS RESTARTS AGE
ambassador-7987df44b9-4pht8 2/2 Running 0 1m
ambassador-7987df44b9-dh5h6 2/2 Running 0 1m
ambassador-7987df44b9-qrgsm 2/2 Running 0 1m
tf-hub-0 1/1 Running 0 1m
tf-job-operator-78757955b-qkg7s 1/1 Running 0 1m
```
## Preparing the training data
We have prepared a set of K8s batch jobs to create a persistent volume and copy the data to it.
The `yaml` manifest files for these jobs can be found in the [jobs](./jobs) directory. They are numbered and must be executed in order.
```
# First create the PVC where the training data will be stored
kubectl -n kubeflow apply -f ./jobs/00create-pvc.yaml
```
The 00create-pvc.yaml manifest creates a PVC with the `ReadWriteMany` access mode. If your Kubernetes cluster
does not support this feature, you can modify the manifest to create the PVC with `ReadWriteOnce`,
and before you execute the tf-job to train the model, add a `nodeSelector:` configuration so the pods
execute on the same node, as sketched below. You can find more about assigning pods to specific nodes [here](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/).
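A minimal sketch of that `nodeSelector:` addition (the hostname value `node-1` is a placeholder; use a node name from your own cluster):
```
# Fragment of the pod template spec: pins all replicas to a single node so a
# ReadWriteOnce PVC can still be shared by every pod.
spec:
  nodeSelector:
    kubernetes.io/hostname: node-1
```
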
```
# Get the dataset, annotations and faster-rcnn-model tars
kubectl -n kubeflow apply -f ./jobs/01get-dataset.yaml
kubectl -n kubeflow apply -f ./jobs/02get-annotations.yaml
kubectl -n kubeflow apply -f ./jobs/03get-model-job.yaml
```
```
# Decompress tar files
kubectl -n kubeflow apply -f ./jobs/04decompress-images.yaml
kubectl -n kubeflow apply -f ./jobs/05decompress-annotations.yaml
kubectl -n kubeflow apply -f ./jobs/06decompress-model.yaml
```
```
# Configuring the training pipeline
kubectl -n kubeflow apply -f ./jobs/07get-fasterrcnn-config.yaml
kubectl -n kubeflow apply -f ./jobs/08create-pet-record.yaml
```
## Next
[Submit the TF Job](submit_job.md)

object_detection/submit_job.md Normal file

@@ -0,0 +1,73 @@
# Launch a distributed object detection training job
## Requirements
- Docker
- Docker Registry
- Object Detection Training Docker Image
Build the TensorFlow object detection training image, or use the pre-built image `lcastell/pets_object_detection` on Docker Hub.
## To build the image:
First, copy the Dockerfile from the `./docker` directory into your `$HOME` path:
```
# from your $HOME directory
docker build --pull -t $USER/pets_object_detection -f ./Dockerfile .
```
### Push the image to your docker registry
```
# from your $HOME directory
docker tag $USER/pets_object_detection <your_server:your_port>/pets_object_detection
docker push <your_server:your_port>/pets_object_detection
```
## Create the training TF-Job deployment and launch it
**NOTE:** You can skip this step by copying [pets-training.yaml](./jobs/pets-training.yaml) from the `jobs` directory and modifying it to your needs.
Or simply run:
```
kubectl -n kubeflow apply -f ./jobs/pets-training.yaml
```
### Follow these steps to generate the tf-job manifest file:
Generate the ksonnet component using the tf-job prototype
```
# from the my-kubeflow directory
ks generate tf-job pets-training --name=pets-training \
--namespace=kubeflow \
--image=<your_server:your_port>/pets_object_detection \
--num_masters=1 \
--num_workers=1 \
--num_ps=1
```
Dump the generated component into a K8s deployment manifest file.
```
ks show nocloud -c pets-training > pets-training.yaml
```
Add the volume mount information at the end of the manifest file. We will mount the `/pets_data` path into all the containers so they can pull the data for the training job:
```
vim pets-training.yaml
```
Add the following to the template.spec:
```
volumes:
- name: pets-data
persistentVolumeClaim:
claimName: pets-data-claim
```
Add the following to the container properties:
```
volumeMounts:
- mountPath: "/pets_data"
name: pets-data
```
At the end you should have something similar to [this](./jobs/pets-training.yaml).
Now you can submit the TF-Job to K8s:
```
kubectl -n kubeflow apply -f pets-training.yaml
```
## Next
[Monitor your job](monitor_job.md)

object_detection/tf-serving/tf-serving.yaml Normal file

@@ -0,0 +1,77 @@
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: tfserving-mapping-pets-model-get
      prefix: /pets_data/exported_graphs/saved_model/
      rewrite: /
      method: GET
      service: pets-model.kubeflow:8000
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: tfserving-mapping-pets-model-post
      prefix: /pets_data/exported_graphs/saved_model/
      rewrite: /pets_data/exported_graphs/saved_model/:predict
      method: POST
      service: pets-model.kubeflow:8000
  labels:
    app: pets-model
  name: pets-model
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving-proxy
    port: 8000
    targetPort: 8000
  selector:
    app: pets-model
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: pets-model
  name: pets-model-v1
spec:
  template:
    metadata:
      labels:
        app: pets-model
        version: v1
    spec:
      containers:
      - args:
        - /usr/bin/tensorflow_model_server
        - --port=9000
        - --model_name=pets-model
        - --model_base_path=/pets_data/exported_graphs/saved_model
        image: gcr.io/kubeflow-images-public/tf-model-server-cpu:v20180327-995786ec
        imagePullPolicy: IfNotPresent
        name: pets-model
        ports:
        - containerPort: 9000
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        securityContext:
          runAsUser: 1000
        volumeMounts:
        - mountPath: /pets_data
          name: nfs
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: pets-data-claim