+++
title = "PyTorch Training"
description = "Instructions for using PyTorch"
weight = 35
+++

This guide walks you through using PyTorch with Kubeflow.

## Installing PyTorch Operator

If you haven't already done so, follow the [Getting Started Guide](/docs/started/getting-started/) to deploy Kubeflow.

An **alpha** version of PyTorch support was introduced with Kubeflow 0.2.0. To use the alpha version, you must run a version of Kubeflow between 0.2.0 and 0.3.5.

More recently, a **beta** version of PyTorch support was introduced with Kubeflow 0.4.0. To use the beta version, you must run Kubeflow 0.4.0 or newer.

## Verify that PyTorch support is included in your Kubeflow deployment

Check that the PyTorch custom resource is installed:

```
kubectl get crd
```

The output should include `pytorchjobs.kubeflow.org`:

```
NAME                       AGE
...
pytorchjobs.kubeflow.org   4d
...
```

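The check above can also be scripted. The following is a minimal sketch, not part of Kubeflow: `has_pytorch_crd` is a hypothetical helper name, and it assumes the CRD name appears at the start of a line in the `kubectl get crd` output, as in the sample above.

```shell
# Hypothetical helper (not a Kubeflow command): succeeds when the CRD list
# read from stdin contains a line starting with pytorchjobs.kubeflow.org.
has_pytorch_crd() {
  grep -qE '^pytorchjobs\.kubeflow\.org([[:space:]]|$)'
}

# Usage (requires access to the cluster):
#   kubectl get crd | has_pytorch_crd && echo "PyTorch support is installed"
```

Reading the list from stdin keeps the helper independent of how the list was produced, so it can be tried against saved output as well as a live cluster.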
If it is not included, you can add it as follows:

```
cd ${KSONNET_APP}
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply ${ENVIRONMENT} -c pytorch-operator
```

## Creating a PyTorch Job

You can create a PyTorch job by defining a PyTorchJob config file. See the [distributed MNIST example](https://github.com/kubeflow/pytorch-operator/blob/master/examples/tcp-dist/mnist/v1beta1/pytorch_job_mnist.yaml) config file. You may change the config file based on your requirements.

```
cat pytorch_job_mnist.yaml
```

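If you do not have the example file locally, a config of roughly this shape can be written from the shell. This is a sketch reconstructed from the `spec` section of the sample job output in the monitoring section of this guide, not a copy of the linked file; treat the linked example as authoritative and adjust the image and replica counts to your needs.

```shell
# Write a PyTorchJob config to the current directory.
# Field values are taken from the sample job output in this guide
# (assumption: the linked example file may differ in detail).
cat > pytorch_job_mnist.yaml <<'EOF'
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
EOF
```

The quoted `'EOF'` delimiter prevents the shell from expanding anything inside the config, so the file is written exactly as shown.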
Deploy the PyTorchJob resource to start training:

```
kubectl create -f pytorch_job_mnist.yaml
```

You should now be able to see the created pods matching the specified number of replicas:

```
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
```

Training should run for about 10 epochs and takes 5-10 minutes on a CPU cluster. Inspect the logs of the master pod to follow training progress:

```
PODNAME=$(kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 -o name)
kubectl logs -f ${PODNAME}
```

## Monitoring a PyTorch Job

```
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
```

See the `status` section of the output to monitor the job status. Here is sample output for a job that has completed successfully:

```
apiVersion: kubeflow.org/v1beta1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
    Worker: {}
  startTime: 2018-12-16T21:40:45Z
```
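
Reading the conditions by eye works for a single job; for scripting, the same check can be sketched in the shell. `job_succeeded` below is a hypothetical helper, not part of Kubeflow. It assumes the YAML layout of the sample output above, where each condition lists `status` before `type`, and it would need adjusting if the output is formatted differently.

```shell
# Hypothetical helper: succeeds when the job YAML read from stdin contains
# a condition of type Succeeded whose status is "True".
# Assumption: within each condition, the status line precedes the type line.
job_succeeded() {
  awk '
    # Remember the most recent condition status (the top-level "status:" key has no value).
    $1 == "status:" && NF == 2 { last = $2 }
    # A Succeeded condition counts only if its status was "True".
    $1 == "type:" && $2 == "Succeeded" && last == "\"True\"" { ok = 1 }
    END { exit !ok }
  '
}

# Usage (requires access to the cluster):
#   kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist | job_succeeded \
#     && echo "job completed successfully"
```

The helper only inspects text, so it can be exercised against saved output before it is pointed at a live cluster.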