mirror of https://github.com/kubeflow/website.git
+++
title = "Chainer Training"
description = "Instructions for using Chainer for training"
weight = 4
toc = true
+++

This guide walks you through using Chainer for training your model.

## What is Chainer?

[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework.

- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
- Forward computation can include any Python control flow statements without losing the ability to backpropagate. This makes code intuitive and easy to debug.

[ChainerMN](https://github.com/chainer/chainermn) is an add-on package for Chainer that enables multi-node distributed deep learning with the following features:

- Scalable: it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI.
- Flexible: even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility.
- Easy: minimal changes to existing user code are required.

[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides benchmark results using up to 128 GPUs.

## Installing Chainer Operator

If you haven't already done so, follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.

An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0.

## Verify that Chainer support is included in your Kubeflow deployment

Check that the Chainer Job custom resource is installed:

```shell
kubectl get crd
```

The output should include `chainerjobs.kubeflow.org`:

```
NAME                                           AGE
...
chainerjobs.kubeflow.org                       4d
...
```
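
The same check can be scripted. The following is a minimal sketch, assuming the text printed by `kubectl get crd` has already been captured into a string. The helper name `has_chainerjob_crd` is hypothetical and not part of Kubeflow, and the extra row in the sample text is made up for illustration.

```python
CHAINERJOB_CRD = "chainerjobs.kubeflow.org"


def has_chainerjob_crd(kubectl_output: str) -> bool:
    """Return True if the ChainerJob CRD is listed in `kubectl get crd` output."""
    for line in kubectl_output.splitlines():
        columns = line.split()
        # The first whitespace-separated column holds the CRD name.
        if columns and columns[0] == CHAINERJOB_CRD:
            return True
    return False


# Sample text modeled on the output shown above (illustrative only).
sample_output = """\
NAME                         AGE
chainerjobs.kubeflow.org     4d
tfjobs.kubeflow.org          4d
"""

print(has_chainerjob_crd(sample_output))  # prints True
```

Matching on the first column, rather than searching the whole text, avoids a false positive if the name happens to appear elsewhere in the output.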

If it is not included, you can add it as follows:

```shell
cd ${KSONNET_APP}
ks pkg install kubeflow/chainer-job
ks generate chainer-operator chainer-operator
ks apply ${ENVIRONMENT} -c chainer-operator
```

## Creating a Chainer Job

You can create a Chainer job by defining a ChainerJob config file. First, create a file named `example-job-mn.yaml` with the following contents:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    mpiConfig:
      slots: 1
    activeDeadlineSeconds: 6000
    backoffLimit: 60
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      mpiConfig:
        slots: 1
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done
```

See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for the definition of each attribute. You may change the config file based on your requirements. By default, the example job runs distributed training on 3 nodes (1 master and 2 workers).
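
In the example, the `-n 3` passed to `mpiexec` lines up with one slot on the master plus one slot on each of the two workers. If you change `replicas` or `slots`, a quick way to recompute that number is sketched below. It assumes you want one MPI process per slot, that the master is a single pod, and that a missing `slots` or `replicas` value means 1. These defaults and the helper name `total_mpi_slots` are illustrative assumptions, not documented operator behavior.

```python
def total_mpi_slots(spec: dict) -> int:
    """Sum the MPI slots of the master and of every worker set replica."""
    # Assumption: a single master pod; slots defaults to 1 when omitted.
    total = spec["master"].get("mpiConfig", {}).get("slots", 1)
    for worker_set in spec.get("workerSets", {}).values():
        slots = worker_set.get("mpiConfig", {}).get("slots", 1)
        # Assumption: replicas defaults to 1 when omitted.
        total += worker_set.get("replicas", 1) * slots
    return total


# The relevant fields of example-job-mn.yaml, written out as a plain dict.
example_spec = {
    "backend": "mpi",
    "master": {"mpiConfig": {"slots": 1}},
    "workerSets": {"ws0": {"replicas": 2, "mpiConfig": {"slots": 1}}},
}

print(total_mpi_slots(example_spec))  # prints 3, the value used for `mpiexec -n`
```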

Deploy the ChainerJob resource to start training:

```shell
kubectl create -f example-job-mn.yaml
```

You should now be able to see the pods that make up the Chainer job:

```shell
kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn
```

The training runs for only 2 epochs and should finish within a few minutes, even on a CPU-only cluster. Inspect the logs of the master pod to follow the training progress:

```shell
PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
kubectl logs -f ${PODNAME}
```
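
The label selectors used above follow a simple pattern: a job-name label, optionally combined with a role label. A small sketch of building that selector string for other jobs is shown here. The helper `pod_selector` is a hypothetical name; only the two label keys come from the commands above.

```python
from typing import Optional


def pod_selector(job_name: str, role: Optional[str] = None) -> str:
    """Build the value for `kubectl get pods -l ...` for a ChainerJob's pods."""
    labels = [f"chainerjob.kubeflow.org/name={job_name}"]
    if role is not None:
        # Narrow the selection to one role, for example "master".
        labels.append(f"chainerjob.kubeflow.org/role={role}")
    # kubectl treats comma-separated label requirements as a logical AND.
    return ",".join(labels)


print(pod_selector("example-job-mn"))
print(pod_selector("example-job-mn", role="master"))
```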

## Monitoring a Chainer Job

Retrieve the ChainerJob resource to check on the job:

```shell
kubectl get -o yaml chainerjobs example-job-mn
```

See the `status` section to monitor the job status. Here is sample output for a job that has completed successfully:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
...
status:
  completionTime: 2018-09-01T16:42:35Z
  conditions:
  - lastProbeTime: 2018-09-01T16:42:35Z
    lastTransitionTime: 2018-09-01T16:42:35Z
    status: "True"
    type: Complete
  startTime: 2018-09-01T16:34:04Z
  succeeded: 1
```
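
The status fields can also be read programmatically. The sketch below assumes the `status` section has been loaded into a Python dict with timestamps kept as strings, as they would be after fetching the resource with `-o json` and parsing it with `json.loads`. The helper names `is_complete` and `duration_seconds` are hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the sample status above.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_complete(status: dict) -> bool:
    """Return True if any condition has type Complete with status "True"."""
    return any(
        cond.get("type") == "Complete" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )


def duration_seconds(status: dict) -> float:
    """Elapsed seconds between startTime and completionTime."""
    start = datetime.strptime(status["startTime"], TIME_FORMAT)
    end = datetime.strptime(status["completionTime"], TIME_FORMAT)
    return (end - start).total_seconds()


# The status section from the sample output, as a plain dict.
sample_status = {
    "completionTime": "2018-09-01T16:42:35Z",
    "conditions": [
        {
            "lastProbeTime": "2018-09-01T16:42:35Z",
            "lastTransitionTime": "2018-09-01T16:42:35Z",
            "status": "True",
            "type": "Complete",
        }
    ],
    "startTime": "2018-09-01T16:34:04Z",
    "succeeded": 1,
}

print(is_complete(sample_status))       # prints True
print(duration_seconds(sample_status))  # prints 511.0
```

For the sample job this gives 511 seconds, about eight and a half minutes between `startTime` and `completionTime`.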