mirror of https://github.com/kubeflow/examples.git
Launch a distributed object detection training job
Requirements
- Docker
- Docker Registry
- Object Detection Training Docker Image
Build the TensorFlow object detection training image, or use the pre-built image lcastell/pets_object_detection from Docker Hub.
To build the image:
First, copy the Dockerfile.training file from the ./docker directory into your $HOME directory.
# from your $HOME directory
docker build --pull -t $USER/pets_object_detection -f ./Dockerfile.training .
Push the image to your Docker registry
# from your $HOME directory
docker tag $USER/pets_object_detection <your_server:your_port>/pets_object_detection
docker push <your_server:your_port>/pets_object_detection
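The tag and push commands above repeat the same registry-qualified image name. As a minimal sketch, the helper below builds that name once so it can be reused. The function name registry_image and the registry address localhost:5000 are hypothetical, and storing the result in OBJ_DETECTION_IMAGE assumes that variable is meant to hold the pushed image reference.

```shell
# Hypothetical helper: build "<registry>/<name>", dropping one trailing
# slash from the registry address if present.
registry_image() {
  printf '%s/%s\n' "${1%/}" "$2"
}

# Example usage (assumed registry address; adjust to your own):
# OBJ_DETECTION_IMAGE="$(registry_image localhost:5000 pets_object_detection)"
# docker tag "$USER/pets_object_detection" "$OBJ_DETECTION_IMAGE"
# docker push "$OBJ_DETECTION_IMAGE"
```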
Create the training TFJob deployment and launch it
# from the ks-app directory
PIPELINE_CONFIG_PATH="${MOUNT_PATH}/faster_rcnn_resnet101_pets.config"
TRAINING_DIR="${MOUNT_PATH}/train"
ks param set tf-training-job image ${OBJ_DETECTION_IMAGE}
ks param set tf-training-job mountPath ${MOUNT_PATH}
ks param set tf-training-job pvc ${PVC}
ks param set tf-training-job numPs 1
ks param set tf-training-job numWorkers 1
ks param set tf-training-job pipelineConfigPath ${PIPELINE_CONFIG_PATH}
ks param set tf-training-job trainDir ${TRAINING_DIR}
ks apply ${ENV} -c tf-training-job
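The commands above rely on several environment variables (OBJ_DETECTION_IMAGE, MOUNT_PATH, PVC, ENV) that must already be set in your shell. The following is a minimal sketch of a pre-flight check that could be run before the ks commands. The function name require_vars is hypothetical and not part of ksonnet or this example.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
# Note: uses the global names "name", "value" and "missing".
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage, from the ks-app directory:
# require_vars OBJ_DETECTION_IMAGE MOUNT_PATH PVC ENV || exit 1
```

Running the check first avoids setting a component parameter to an empty value when one of the variables was forgotten.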
NOTE: The default TFJob API version in the component is kubeflow.org/v1beta1. You can override the default version by setting the tfjobApiVersion param in the ksonnet app:
ks param set tf-training-job tfjobApiVersion ${NEW_VERSION}
For GPU support, set the numGpu param, for example:
# from the ks-app directory
ks param set tf-training-job numGpu 1
Here is a quick description of the tf-training-job component parameters:
- image: string, Docker image to use
- mountPath: string, volume mount path
- numGpu: number, optional param, defaults to 0
- numPs: number, number of parameter servers to use
- numWorkers: number, number of workers to use
- pipelineConfigPath: string, path to the pipeline config file in the volume mount
- pvc: string, Persistent Volume Claim name to use
- trainDir: string, directory where the training outputs will be saved
To see the default values for the tf-training-job component params, please take a look at the params.libsonnet file.
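To review a set of parameter values before changing the ksonnet app, the commands can be printed rather than executed. This is a minimal dry-run sketch: the function name print_param_commands is hypothetical, and it only echoes ks param set lines in the form used earlier in this document.

```shell
# Hypothetical helper: print one "ks param set" command per name=value
# argument. Nothing is executed; the output is for review only.
print_param_commands() {
  component="$1"
  shift
  for pair in "$@"; do
    name="${pair%%=*}"    # text before the first "="
    value="${pair#*=}"    # text after the first "="
    printf 'ks param set %s %s %s\n' "$component" "$name" "$value"
  done
}

# Example usage:
# print_param_commands tf-training-job numPs=1 numWorkers=1 numGpu=1
```

Once the printed commands look right, they can be run from the ks-app directory.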