website/content/docs/components/chainer.md

+++
title = "Chainer Training"
description = "Instructions for using Chainer for training"
weight = 4
toc = true
+++

This guide walks you through using Chainer for training your model.

## What is Chainer?

[Chainer](https://chainer.org/) is a powerful, flexible and intuitive deep learning framework.

- Chainer supports CUDA computation. It only requires a few lines of code to leverage a GPU. It also runs on multiple GPUs with little effort.
- Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures.
- Forward computation can include any control flow statements of Python without lacking the ability of backpropagation. It makes code intuitive and easy to debug.

[ChainerMN](https://github.com/chainer/chainermn) is an additional package for Chainer, a flexible deep learning framework. ChainerMN enables multi-node distributed deep learning with the following features:

- Scalable --- it makes full use of the latest technologies such as NVIDIA NCCL and CUDA-Aware MPI,
- Flexible --- even dynamic neural networks can be trained in parallel thanks to Chainer's flexibility, and
- Easy --- minimal changes to existing user code are required.

[This blog post](https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html) provides a benchmark results using up to 128 GPUs.

## Installing Chainer Operator

If you haven't already done so please follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.

An **alpha** version of [Chainer](https://chainer.org/) support was introduced with Kubeflow 0.3.0. You must be using a version of Kubeflow newer than 0.3.0.

## Verify that Chainer support is included in your Kubeflow deployment

Check that the Chainer Job custom resource is installed

```shell
kubectl get crd
```

The output should include `chainerjobs.kubeflow.org`

```
NAME                                       AGE
...
chainerjobs.kubeflow.org                   4d
...
```

If it is not included you can add it as follows

```shells
cd ${KSONNET_APP}
ks pkg install kubeflow/chainer-job
ks generate chainer-operator chainer-operator
ks apply ${ENVIRONMENT} -c chainer-operator
```

## Creating a Chainer Job

You can create an Chainer Job by defining an ChainerJob config file.  First, please create a file `example-job-mn.yaml` like below:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
spec:
  backend: mpi
  master:
    mpiConfig:
      slots: 1
    activeDeadlineSeconds: 6000
    backoffLimit: 60
    template:
      spec:
        containers:
        - name: chainer
          image: everpeace/chainermn:1.3.0
          command:
          - sh
          - -c
          - |
            mpiexec -n 3 -N 1 --allow-run-as-root --display-map  --mca mpi_cuda_support 0 \
            python3 /train_mnist.py -e 2 -b 1000 -u 100
  workerSets:
    ws0:
      replicas: 2
      mpiConfig:
        slots: 1
      template:
        spec:
          containers:
          - name: chainer
            image: everpeace/chainermn:1.3.0
            command:
            - sh
            - -c
            - |
              while true; do sleep 1 & wait; done
```

See [examples/chainerjob-reference.yaml](https://github.com/kubeflow/chainer-operator/blob/master/examples/chainerjob-reference.yaml) for definitions of each attributes. You may change the config file based on your requirements. By default, the example job is distributed learning with 3 nodes (1 master, 2 workers).

Deploy the ChainerJob resource to start training:

```shell
kubectl create -f example-job-mn.yaml
```

You should now be able to see the created pods which consist of the chainer job.

```
kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn
```

The training should run only for 2 epochs and takes within a few minutes even on cpu only cluster. Logs can be inspected to see its training progress.

```
PODNAME=$(kubectl get pods -l chainerjob.kubeflow.org/name=example-job-mn,chainerjob.kubeflow.org/role=master -o name)
kubectl logs -f ${PODNAME}
```

## Monitoring an Chainer Job

```shell
kubectl get -o yaml chainerjobs example-job-mn
```

See the status section to monitor the job status. Here is sample output when the job is successfully completed.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: ChainerJob
metadata:
  name: example-job-mn
...
status:
  completionTime: 2018-09-01T16:42:35Z
  conditions:
  - lastProbeTime: 2018-09-01T16:42:35Z
    lastTransitionTime: 2018-09-01T16:42:35Z
    status: "True"
    type: Complete
  startTime: 2018-09-01T16:34:04Z
  succeeded: 1
```