+++
title = "PyTorch Training"
description = "Instructions for using PyTorch"
weight = 35
+++
This guide walks you through using PyTorch with Kubeflow.
## Installing PyTorch Operator
If you haven't already done so, follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.
An **alpha** version of PyTorch support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow between 0.2.0 and 0.3.5 to use this version.
More recently, a **beta** version of PyTorch support was introduced with Kubeflow 0.4.0. You must be using Kubeflow 0.4.0 or newer to use this version.
## Verify that PyTorch support is included in your Kubeflow deployment
Check that the PyTorch custom resource definition (CRD) is installed:
```
kubectl get crd
```
The output should include `pytorchjobs.kubeflow.org`:
```
NAME                       AGE
...
pytorchjobs.kubeflow.org   4d
...
```
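To run the same check from a script, one option is to ask for the CRD by name and rely on the exit status of `kubectl`. This is a minimal sketch; the function name `has_pytorch_crd` is an illustrative assumption, not part of Kubeflow.

```shell
# Hypothetical helper: succeeds (exit status 0) if the PyTorchJob CRD is installed.
has_pytorch_crd() {
  # Asking for the CRD by name makes kubectl exit non-zero when it is missing.
  kubectl get crd pytorchjobs.kubeflow.org >/dev/null 2>&1
}

# Example usage:
#   has_pytorch_crd && echo "PyTorch support is installed"
```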
If the CRD is not listed, install the PyTorch operator as follows:
```
cd ${KSONNET_APP}
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${ENVIRONMENT} -c pytorch-operator
```
## Creating a PyTorch Job
You can create a PyTorch job by defining a PyTorchJob config file. See the [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/tcp-dist/mnist/v1beta1/pytorch_job_mnist.yaml) config file for a starting point, and adapt it to your requirements.
```
cat pytorch_job_mnist.yaml
```
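For orientation, a manifest along these lines could also be written from the shell with a heredoc. This is a sketch assembled from the fields that appear in the sample job output in this guide (API version, replica counts, image, port, restart policy); it is not a copy of the linked example file, and the function name `write_pytorch_job_manifest` is an illustrative assumption.

```shell
# Hypothetical helper: write a minimal PyTorchJob manifest to the path given as $1.
# Field values mirror the sample job output shown in this guide; adjust as needed.
write_pytorch_job_manifest() {
  cat > "$1" <<'EOF'
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
EOF
}

# Example usage:
#   write_pytorch_job_manifest pytorch_job_mnist.yaml
```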
Deploy the PyTorchJob resource to start training:
```
kubectl create -f pytorch_job_mnist.yaml
```
You should now see one pod for each replica specified in the config file:
```
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
```
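To see at a glance whether every replica has started, the same label selector can be combined with a JSONPath output template. A sketch under the assumption that the pods carry the `pytorch_job_name` label used above; `list_job_pod_phases` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<pod name><TAB><phase>" for each pod of the job named $1.
list_job_pod_phases() {
  kubectl get pods -l "pytorch_job_name=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Example usage:
#   list_job_pod_phases pytorch-tcp-dist-mnist
```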
Training runs for about 10 epochs and takes 5-10 minutes on a CPU cluster. Inspect the logs of the master pod to follow training progress:
```
PODNAME=$(kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 -o name)
kubectl logs -f ${PODNAME}
```
## Monitoring a PyTorch Job
Fetch the PyTorchJob resource as YAML to inspect its current state:
```
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
```
The `status` section reports the job's progress. Here is sample output for a job that completed successfully:
```
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
```
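Rather than reading the full YAML, a script can extract just the `Succeeded` condition with a JSONPath filter. The following is a sketch that assumes the condition layout shown in the sample output above; `pytorch_job_succeeded` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the PyTorchJob named $1 reports a
# condition of type Succeeded whose status is "True".
pytorch_job_succeeded() {
  status=$(kubectl get pytorchjobs "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}')
  [ "$status" = "True" ]
}

# Example usage: poll every 10 seconds until the job succeeds.
#   until pytorch_job_succeeded pytorch-tcp-dist-mnist; do sleep 10; done
```

Note that the polling loop in the usage comment never exits if the job fails, so a real script would likely also check for a failure condition or enforce a timeout.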