Update README and out-of-date docs (#2252)

* Update README and out-of-date docs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move KEPs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Revert Jax KEP table

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix readme text

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Andrey Velichkevich 2024-09-10 11:18:20 +01:00 committed by GitHub
parent 7c8d4df1d4
commit 2cc5dfed46
12 changed files with 85 additions and 410 deletions


@@ -1,7 +1,7 @@
<!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines: https://www.kubeflow.org/docs/about/contributing
2. To know more about Training Operator, check the developer guide:
https://github.com/kubeflow/training-operator/blob/master/docs/development/developer_guide.md
https://github.com/kubeflow/training-operator/blob/master/CONTRIBUTING.md
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-->

README.md (125 changed lines)

@@ -8,93 +8,78 @@
Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and
scalable distributed training of machine learning (ML) models created with various ML frameworks
such as PyTorch, Tensorflow, XGBoost, MPI, Paddle and others.
such as PyTorch, TensorFlow, HuggingFace, Jax, DeepSpeed, XGBoost, PaddlePaddle and others.
Training Operator allows you to use Kubernetes workloads to effectively train your large models
You can run high-performance computing (HPC) tasks with the Training Operator and `MPIJob` since it
supports running Message Passing Interface (MPI) on Kubernetes, which is heavily used for HPC.
The Training Operator implements the V1 API version of MPI Operator. For the MPI Operator V2 version,
please follow [this guide](https://www.kubeflow.org/docs/components/training/user-guides/mpi/) to
install MPI Operator V2.
The Training Operator allows you to use Kubernetes workloads to effectively train your large models
via [Kubernetes Custom Resources APIs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
or using Training Operator Python SDK.
> Note: Before the v1.2 release, the Kubeflow Training Operator only supported TFJob on Kubernetes.
- For a complete reference of the custom resource definitions, please refer to the API Definition.
- [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
- [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
- [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
- [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
- [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
- For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
- For details on its observability, please refer to the [monitoring design doc](docs/monitoring/README.md).
or using the Training Operator Python SDK.
## Prerequisites
- Version >= 1.25 of Kubernetes cluster and `kubectl`
Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/training/installation/#prerequisites)
for prerequisites to install the Training Operator.
## Installation
### Master Branch
Please follow [the Kubeflow Training Operator guide](https://www.kubeflow.org/docs/components/training/installation/#installing-the-training-operator)
for the detailed instructions on how to install Training Operator.
### Installing the Control Plane
Run the following command to install the latest stable release of the Training Operator control plane: `v1.8.0`.
```bash
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
```
Run the following command to install the latest changes of the Training Operator control plane:
```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
```
### Stable Release
### Installing the Python SDK
```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
```
The Training Operator [implements a Python SDK](https://pypi.org/project/kubeflow-training/)
to simplify creation of distributed training and fine-tuning jobs for Data Scientists.
### TensorFlow Release Only
For users who prefer to use the original TensorFlow controllers, please check out the `v1.2-branch`; patches for bug fixes will still be accepted on this branch.
```bash
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.2.0"
```
### Python SDK for Kubeflow Training Operator
Training Operator provides a Python SDK for the custom resources. To learn more about the available
SDK APIs, check [the `TrainingClient`](sdk/python/kubeflow/training/api/training_client.py).
Use `pip install` command to install the latest release of the SDK:
Run the following command to install the latest stable release of the Training SDK:
```
pip install kubeflow-training
pip install -U kubeflow-training
```
Training Operator controller and Python SDK have the same release versions.
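As a rough illustration, a distributed PyTorchJob can be created from a Python function with the `TrainingClient`; this is a minimal sketch based on the kubeflow-training SDK, and argument names may differ slightly between SDK releases:

```python
from kubeflow.training import TrainingClient


def train_func():
    # Training code that runs inside each worker Pod.
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")


# Uses the local kubeconfig (or in-cluster config) to talk to the cluster.
client = TrainingClient()

# Package the function above into a 2-worker PyTorchJob.
client.create_job(
    name="pytorch-example",
    train_func=train_func,
    num_workers=2,
)

# Stream logs from the job's Pods while it runs.
client.get_job_logs(name="pytorch-example", follow=True)
```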
## Getting Started
## Quickstart
Please refer to the [getting started guide](https://www.kubeflow.org/docs/components/training/overview/#getting-started)
to quickly create your first Training Operator Job using Python SDK.
Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob)
to quickly create your first distributed training job using the Python SDK.
If you want to work directly with Kubernetes Custom Resources provided by Training Operator,
follow [the PyTorchJob MNIST guide](https://www.kubeflow.org/docs/components/training/pytorch/#creating-a-pytorch-training-job).
## API Documentation
Please refer to following API Documentation:
- [Kubeflow.org v1 API Documentation](docs/api/kubeflow.org_v1_generated.asciidoc)
## Community
The following links provide information about getting involved in the community:
The following links provide information on how to get involved in the community:
- Attend [the AutoML and Training Working Group](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit) community meeting.
- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.
- Join our [`#kubeflow-training` Slack channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack).
- Check out [who is using the Training Operator](./docs/adopters.md).
- Check out [who is using the Training Operator](ADOPTERS.md).
This is a part of Kubeflow, so please see [readme in kubeflow/kubeflow](https://github.com/kubeflow/kubeflow#get-involved) to get in touch with the community.
## Contributing
Please refer to the [DEVELOPMENT](docs/development/developer_guide.md)
Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
## Change Log
Please refer to [CHANGELOG](CHANGELOG.md)
Please refer to the [CHANGELOG](CHANGELOG.md).
## Version Matrix
@@ -102,21 +87,39 @@ The following table lists the most recent few versions of the operator.
| Operator Version | API Version | Kubernetes Version |
| ---------------------- | ----------- | ------------------ |
| `v1.0.x` | `v1` | 1.16+ |
| `v1.1.x` | `v1` | 1.16+ |
| `v1.2.x` | `v1` | 1.16+ |
| `v1.3.x` | `v1` | 1.18+ |
| `v1.4.x` | `v1` | 1.23+ |
| `v1.5.x` | `v1` | 1.23+ |
| `v1.6.x` | `v1` | 1.23+ |
| `v1.7.x` | `v1` | 1.25+ |
| `latest` (master HEAD) | `v1` | 1.25+ |
| `v1.8.x` | `v1` | 1.27+ |
| `latest` (master HEAD) | `v1` | 1.27+ |
## Reference
For a complete reference of the custom resource definitions, please refer to the API Definition.
- [TensorFlow API Definition](pkg/apis/kubeflow.org/v1/tensorflow_types.go)
- [PyTorch API Definition](pkg/apis/kubeflow.org/v1/pytorch_types.go)
- [XGBoost API Definition](pkg/apis/kubeflow.org/v1/xgboost_types.go)
- [MPI API Definition](pkg/apis/kubeflow.org/v1/mpi_types.go)
- [PaddlePaddle API Definition](pkg/apis/kubeflow.org/v1/paddlepaddle_types.go)
For details on the Training Operator custom resources APIs, refer to
[the following API documentation](docs/api/kubeflow.org_v1_generated.asciidoc)
## Acknowledgement
This project was originally started as a distributed training operator for TensorFlow and later we merged efforts from other Kubeflow training operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who's contributed to and maintained the original operators.
This project was originally started as a distributed training operator for TensorFlow and later we
merged efforts from other Kubeflow Training Operators to provide a unified and simplified experience
for both users and developers. We are very grateful to all who filed issues or helped resolve them,
asked and answered questions, and were part of inspiring discussions.
We'd also like to thank everyone who's contributed to and maintained the original operators.
- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).
- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors)
and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors)
and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors)
and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and
[maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).

docs/README.md (new file, 5 changed lines)

@@ -0,0 +1,5 @@
# Training Operator Documentation
Welcome to Kubeflow Training Operator!
The Training Operator documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/training/).


@@ -1,117 +0,0 @@
# Design Doc TFJob K8s CRD
# Objective
The goal is to make it easy to run TensorFlow training (and distributed training in particular) on Kubernetes (K8s). I propose doing this by creating a K8s custom resource definition (CRD) and an associated controller. The CRD takes care of managing the K8s resources needed to run a training job.
# Background
Kubernetes makes it easy to manage processes by providing a process (as opposed to VM centric) view of the world. Kubernetes also provides essential building blocks for complex distributed applications. For example, K8s provides built in support for DNS, health checking, logs collections, metrics collection, storage, etc....
In K8s, [Controllers](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) are responsible for ensuring a set of [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/) are running. A Pod is the basic building block in K8s and describes one or more processes that should be colocated (same ip). K8s comes with a number of built in controllers. For example, a [ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) can ensure N Pods are running with a particular specification. A [Job controller](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) can be used to run a binary to completion.
The built in [Controllers](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) are insufficient for running a distributed TensorFlow job. TensorFlow is a stateful application; each parameter server and worker needs to be uniquely addressable to support all the different patterns of distributed training. K8s has a [stateful sets controller](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/). However, stateful sets are intended for stateful services that run forever (e.g. a sharded in memory cache service like Redis) as opposed to jobs intended to run to completion.
Consequently, running a distributed TF job on K8s today means cobbling together a solution out of the built in primitives. Typically, this means managing multiple resources manually. For example, a user could create 1 stateful set for parameter servers, 1 stateful set for the workers, and 1 job for the master.
To address the limitations of the built-in resources, K8s supports [Custom Resources (CRD) and Controllers.](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) Using a CRD, it is easy to create a controller with the desired semantics for a particular workload while hiding the implementation from users. The K8s community has quickly adopted this pattern, contributing [numerous CRDs](https://github.com/coreos/awesome-kubernetes-extensions) for various workloads.
# Requirements and Scale
I think O(100) jobs is a reasonable upper bound for the number of TF training jobs the average K8s customer will be running simultaneously in a single cluster.
The input from the K8s team that developed CRDs and various controllers is that most controllers use a non-distributed, multi-threaded design and that scaling is not a problem.
# Design
## TFJob Resource
The TFJob CRD defines a TFJob resource for K8s.
The [TFJob](https://github.com/kubeflow/training-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L29)
resource is a collection of TfReplicas. Each TfReplica corresponds to a
set of TensorFlow processes performing a role in the job;
e.g. master, parameter server or worker. The set of replica types can be expanded (it is just an enum) to support new TF patterns such as eval workers. Figure 1. shows an example yaml spec for a distributed job.
```
apiVersion: "kubeflow.org/v1alpha1"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              args:
                - --log_dir=gs://my-job/log-dir
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
```
**Fig 1.** An example job spec for a distributed Training job with 1 master, 2 workers and 1 PS.
As illustrated by Fig 1, I made an explicit decision not to try to hide or replace K8s abstractions. For example, each TfReplica contains a standard K8s [PodTemplate](https://kubernetes.io/docs/api-reference/v1.7/#podtemplate-v1-core) to specify the processes (including TF) to run in each replica. I did this because K8s already provides a widely adopted and understood API. So introducing new concepts in place of K8s concepts is just confusing. Furthermore, exposing the [PodTemplate](https://kubernetes.io/docs/api-reference/v1.7/#podtemplate-v1-core) makes it easy for TFJob users to leverage K8s features. For example, TFJob users can use K8s to attach volumes to their TF processes. This makes it easy to use TF in conjunction with any storage system supported by K8s (e.g. PDs, NFS, etc...)
**Defaults**
The controller can be used to configure defaults for TFJob to create a simpler user experience. The most common use for this right now is supporting GPUs. To use GPUs, the NVIDIA drivers and libraries need to be mounted from the host into the container. This step should become unnecessary with Kubernetes 1.8. The TFJob controller will automatically add these volume mounts based on configuration specified when the controller is started. This prevents users from having to specify them for each job. Instead, only the cluster administrator who deploys the TFJob controller needs to know how the volumes should be configured.
Another use case is minimizing the boilerplate users have to write to run standard processes (e.g. [Parameter Servers](https://github.com/kubeflow/training-operator/pull/36#discussion_r141135711)) using official TF Docker images.
## Controller
The controller manages a distributed TFJob by creating a series of Job controllers (Fig 2). The TFJob controller sets the environment variable TF_CONFIG to make the TensorFlow cluster spec, the replica type (PS, WORKER, MASTER), and the replica index available to TensorFlow code. The Job controller takes care of restarting TensorFlow processes that terminate due to an error. Additional logic in the TFJob controller looks at exit codes and fails the job if a TF process exits with an exit code indicating a permanent error. The TFJob controller treats exit codes of 1-127 as permanent errors; this is an arbitrary convention.
When the master exits successfully or with a permanent error the job is considered finished. There is an open issue([issues/61](https://github.com/kubeflow/training-operator/issues/61)) to make the changes necessary to support evaluation with the Estimator API in 1.4. The pods aren't deleted until the TFJob is deleted. This allows the logs to be fetched via kubectl logs.
![Resources for TFJob](./../diagrams/tfjob_k8s_resources.svg)
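For illustration, a replica's training code might consume the `TF_CONFIG` variable described above as sketched below; the JSON layout (a `cluster` spec plus a `task` entry) follows TensorFlow's distributed runtime convention, and the handling here is only an example:

```python
import json
import os

# The TFJob controller injects TF_CONFIG into every replica, e.g.:
# {"cluster": {"master": ["example-job-master-0:2222"],
#              "worker": ["example-job-worker-0:2222", "example-job-worker-1:2222"],
#              "ps": ["example-job-ps-0:2222"]},
#  "task": {"type": "worker", "index": 1}}
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster_spec = tf_config.get("cluster", {})
task_type = tf_config.get("task", {}).get("type")    # master, worker or ps
task_index = tf_config.get("task", {}).get("index")  # replica index within that role

print(f"Running as {task_type} #{task_index} in cluster {cluster_spec}")
```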
## Non-distributed training
A TFJob can handle non-distributed training; the TFJob spec would consist of a single replica of type master.
## In-graph replication
The current design can handle in-graph replication. In-graph vs between-graph replication is determined by the code the user runs in the workers and master.
## Testing
TFJob uses [Prow](https://github.com/kubernetes/test-infra), the K8s test infrastructure, to run E2E tests continuously (e.g. presubmits and postsubmits). The K8s test-infra team has allowed us to use the Prow instance they maintain, so we don't need to support our own instance.
One advantage of Prow over Jenkins is that its API is Kubernetes-centric, meaning it uses concepts (e.g. Pods, Secrets, etc.) that are very familiar to K8s developers. So Prow is much more intuitive to TFJob developers than Jenkins.
# Alternatives Considered
## Helm and/or Ksonnet
Rather than use a CRD, we could use a tool like Helm or Ksonnet to create templates to simplify creating the different K8s resources needed to manage a TensorFlow job. This is in line with the current approach in [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem/tree/master/kubernetes).
One disadvantage of templates is that they do not provide a mechanism to add custom control logic. None of the K8s builtin controllers provide a mechanism for distinguishing between retryable and permanent errors. Furthermore, the built in controllers don't propagate errors; if worker i fails with a permanent error this error won't cause the parameter servers and master controllers to be terminated.
Another major disadvantage is that the templating approach forces users to manually manage multiple K8s resources.

File diff suppressed because one or more lines are too long



@@ -1,4 +1,4 @@
**<h1>Train/Fine-tune API Proposal for LLMs</h1>**
**<h1>KEP-2003: Train/Fine-tune API Proposal for LLMs</h1>**
**<h3>Authors:</h3>**


@@ -1,7 +1,9 @@
# Kubeflow Enhancement Proposal: Integrate JAX with Kubeflow Training Operator for Distributed Training on Kubernetes
# KEP-2145: Integrate JAX with Kubeflow Training Operator for Distributed Training on Kubernetes
<!-- toc -->
## Table of Contents
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
@@ -71,7 +73,6 @@ As a DevOps engineer, I want to manage JAX distributed training jobs using the K
| `JAX_INITIALIZATION_TIMEOUT`| `initialization_timeout (int)` | Time period (in seconds) for which connection will be retried. If the initialization takes more than the timeout specified, the initialization will error. Defaults to 300 secs i.e. 5 mins. | Optional. Can be set in the pod spec if a different timeout is needed. |
| `JAX_COORDINATOR_BIND_ADDRESS` | `coordinator_bind_address (str)` | The IP address and port to which the JAX service on process 0 in your cluster will bind. By default, it will bind to all available interfaces using the same port as `coordinator_address`. | Optional. Can be set in the coordinator pod spec. Default binds to all available addresses. |
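As an illustrative sketch of how these variables map onto JAX, a worker entrypoint could read them and call `jax.distributed.initialize` explicitly; the parameter names mirror the table above, and the env-var handling shown here is an assumption for illustration rather than the controller's exact behavior:

```python
import os

import jax

# Environment variables expected to be injected into each JaxJob replica.
coordinator_address = os.environ["JAX_COORDINATOR_ADDRESS"]        # e.g. "jaxjob-worker-0:6666"
num_processes = int(os.environ["JAX_NUM_PROCESSES"])
process_id = int(os.environ["JAX_PROCESS_ID"])
timeout = int(os.environ.get("JAX_INITIALIZATION_TIMEOUT", "300"))

# Join the distributed runtime; the process with ID 0 acts as the coordinator.
jax.distributed.initialize(
    coordinator_address=coordinator_address,
    num_processes=num_processes,
    process_id=process_id,
    initialization_timeout=timeout,
)

print(f"Process {jax.process_index()}/{jax.process_count()} "
      f"sees {jax.local_device_count()} local device(s)")
```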
#### Validations for JaxJob
##### Key Validations
@@ -230,19 +231,18 @@ metadata:
  name: jaxjob-worker-${job_id}
spec:
  containers:
    - image: ghcr.io/kubeflow/jax:latest
      imagePullPolicy: IfNotPresent
      name: worker
      env:
        - name: JAX_COORDINATOR_ADDRESS
          value: '127.0.0.1:6666'
        - name: JAX_NUM_PROCESSES
          value: 1
        - name: JAX_PROCESS_ID
          value: 0
          # process 0 is coordinator
    - image: ghcr.io/kubeflow/jax:latest
      imagePullPolicy: IfNotPresent
      name: worker
      env:
        - name: JAX_COORDINATOR_ADDRESS
          value: "127.0.0.1:6666"
        - name: JAX_NUM_PROCESSES
          value: 1
        - name: JAX_PROCESS_ID
          value: 0
          # process 0 is coordinator
  restartPolicy: OnFailure
```
## Alternatives


@@ -1,122 +0,0 @@
# How to debug an E2E test for Kubeflow Training Operator
TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)
[E2E Testing](./e2e_testing.md) gives an overview of writing e2e tests. This guidance concentrates more on the e2e failure debugging.
## Prerequisites
1. Install Python 3.7
2. Clone `kubeflow/testing` repo under `$GOPATH/src/kubeflow/`
3. Install [ksonnet](https://ksonnet.io/)
```
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
tar -xvzf ks_0.13.1_linux_amd64.tar.gz
sudo cp ks_0.13.1_linux_amd64/ks /usr/local/bin/ks-13
```
> We would like to deprecate `ksonnet`, but it may take some time. Feel free to pick up [the issue](https://github.com/kubeflow/training-operator/issues/1468) if you are interested in it.
> If your platform is darwin or windows, feel free to download binaries in [ksonnet v0.13.1](https://github.com/ksonnet/ksonnet/releases/tag/v0.13.1)
4. Deploy HEAD training operator version in your environment
```
IMG=kubeflow/training-operator:e2e-debug-prid make docker-build
# Optional - load image into kind cluster if you are using kind
kind load docker-image kubeflow/training-operator:e2e-debug-1462
kubectl set image deployment.v1.apps/training-operator training-operator=kubeflow/training-operator:e2e-debug-1462
```
## Run E2E Tests locally
1. Set environment variables
```
export KUBEFLOW_PATH=$GOPATH/src/github.com/kubeflow
export KUBEFLOW_TRAINING_REPO=$KUBEFLOW_PATH/training-operator
export KUBEFLOW_TESTING_REPO=$KUBEFLOW_PATH/testing
export PYTHONPATH=$KUBEFLOW_TRAINING_REPO:$KUBEFLOW_TRAINING_REPO/py:$KUBEFLOW_TESTING_REPO/py:$KUBEFLOW_TRAINING_REPO/sdk/python
```
2. Install python dependencies
```
pip3 install -r $KUBEFLOW_TESTING_REPO/py/kubeflow/testing/requirements.txt
```
> Note: if you run into problems installing the requirements, you may need to `sudo apt-get install libffi-dev`. Feel free to share the error logs if you are not sure how to handle them.
3. Run Tests
```
# enter the ksonnet app to run tests
cd $KUBEFLOW_TRAINING_REPO/test/workflows
# run individual test that failed in the presubmit job.
python3 -m kubeflow.tf_operator.pod_names_validation_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=pod-names-validation-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifacts
python3 -m kubeflow.tf_operator.cleanpod_policy_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=cleanpod-policy-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=1 --artifacts_path=/tmp/output/artifacts
python3 -m kubeflow.tf_operator.simple_tfjob_tests --app_dir=$KUBEFLOW_TRAINING_REPO/test/workflows --params=name=simple-tfjob-tests-v1,namespace=kubeflow --tfjob_version=v1 --num_trials=2 --artifacts_path=/tmp/output/artifact
```
## Check results
You can either check logs or check results in `/tmp/output/artifact`.
```
$ ls -al /tmp/output/artifact
junit_test_simple_tfjob_cpu.xml
$ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml
<testsuite failures="0" tests="1" time="659.5505294799805"><testcase classname="SimpleTfJobTests" name="simple-tfjob-tests-v1" time="659.5505294799805" /></testsuite>
```
## Common issues
1. ksonnet is not installed
```
ERROR|2021-11-16T03:06:06|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception [Errno 2] No such file or directory: 'ks-13': 'ks-13'
Traceback (most recent call last):
File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
test_func()
File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 53, in test_pod_names
self.params)
File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/util.py", line 579, in setup_ks_app
cwd=app_dir)
File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/util.py", line 59, in run
command, cwd=cwd, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
File "/usr/local/lib/python3.7/subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ks-13': 'ks-13'
```
Please check the `Prerequisites` section to install ksonnet.
2. TypeError: load() missing 1 required positional argument: 'Loader'
```
ERROR|2021-11-16T03:04:12|/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py|57| There was a problem running the job; Exception load() missing 1 required positional argument: 'Loader'
Traceback (most recent call last):
File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/test_runner.py", line 38, in run_test
test_func()
File "/home/jiaxin.shan/go/src/github.com/kubeflow/training-operator/py/kubeflow/tf_operator/pod_names_validation_tests.py", line 51, in test_pod_names
ks_cmd = ks_util.get_ksonnet_cmd(self.app_dir)
File "/home/jiaxin.shan/go/src/github.com/kubeflow/testing/py/kubeflow/testing/ks_util.py", line 47, in get_ksonnet_cmd
results = yaml.load(app_yaml)
TypeError: load() missing 1 required positional argument: 'Loader'
```
This is a pyyaml compatibility issue. Please check whether you are using pyyaml==6.0.0. If so, downgrade to `5.4.1` instead.
```
pip3 uninstall pyyaml
pip3 install pyyaml==5.4.1 --user
```


@@ -1,91 +0,0 @@
# How to Write an E2E Test for Kubeflow Training Operator
TODO (andreyvelich): This doc is outdated. Currently, E2Es are located here:
[`sdk/python/test/e2e`](../../sdk/python/test/e2e)
The E2E tests for the Kubeflow Training Operator are implemented as Argo workflows. For more background and details
about Argo (not required for understanding the rest of this document), please take a look at
[this link](https://github.com/kubeflow/testing/blob/master/README.md).
Test results can be monitored at the [Prow dashboard](http://prow.kubeflow-testing.com/?repo=kubeflow%2Ftraining-operator).
At a high level, the E2E test suites are structured as Python test classes. Each test class contains
one or more tests. A test typically does the following:
- Creates a ksonnet component using a TFJob spec;
- Creates the specified TFJob;
- Verifies some expected results (e.g. number of pods started, job status);
- Deletes the TFJob.
## Adding a Test Method
An example can be found [here](https://github.com/kubeflow/training-operator/blob/master/py/kubeflow/tf_operator/simple_tfjob_tests.py).
A test class can have several test methods. Each method executes a series of user actions (e.g.
starting or deleting a TFJob), and performs verifications of expected results (e.g. TFJob exits with
correct status, pods are deleted, etc).
Test classes should follow this pattern:
```python
class MyTest(test_util.TestCase):
    def __init__(self, args):
        # Initialize environment
        pass

    def test_case_1(self):
        # Test code
        pass

    def test_case_2(self):
        # Test code
        pass


if __name__ == "__main__":
    test_runner.main(module=__name__)
```
The code here ideally should only contain API calls. Any common functionalities used by the test code should
be added to one of the helper modules:
- `k8s_util` - for K8s operations like querying/deleting a pod
- `ks_util` - for ksonnet operations
- `tf_job_client` - for TFJob-specific operations, such as waiting for the job to be in a certain phase (see the sketch after this list)
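A hypothetical test method wired to these helpers could look like the sketch below; the specific helper signatures (`ks_util.setup_ks_app`, `tf_job_client.wait_for_condition`, `tf_job_client.delete_tf_job`, `k8s_util.list_pods`) are illustrative and may not match the real module APIs:

```python
def test_simple_tfjob(self):
    # Deploy the ksonnet component that defines the TFJob under test
    # (helper names in this sketch are illustrative, not the exact APIs).
    ks_util.setup_ks_app(self.app_dir, self.env, self.namespace,
                         self.component, self.params)

    # Wait for the job to reach a terminal condition and verify it succeeded.
    results = tf_job_client.wait_for_condition(
        self.api_client, self.namespace, self.name, ["Succeeded", "Failed"])
    assert results["status"]["conditions"][-1]["type"] == "Succeeded"

    # Verify the pods are still present (they stay until the TFJob is deleted),
    # then clean up the TFJob.
    pods = k8s_util.list_pods(self.api_client, self.namespace, self.name)
    assert len(pods) > 0
    tf_job_client.delete_tf_job(self.api_client, self.namespace, self.name)
```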
## Adding a TFJob Spec
This is needed if you want to use your own TFJob spec instead of an existing one. An example can be found
[here](https://github.com/kubeflow/training-operator/tree/master/test/workflows/components/simple_tfjob_v1.jsonnet).
All TFJob specs should be placed in the same directory.
These are similar to actual TFJob specs. Note that many of these are using the
[training-operator-test-server](https://github.com/kubeflow/training-operator/tree/master/test/test-server) as the test image.
This gives us more control over when each replica exits, and allows us to send specific requests like fetching the
runtime TensorFlow config.
## Adding a New Test Class
This is needed if you are creating a new test class. Creating a new test class is recommended if you are implementing
a new feature, and want to group all relevant E2E tests together.
New test classes should be added as Argo workflow steps to the
[workflows.libsonnet](https://github.com/kubeflow/training-operator/blob/master/test/workflows/components/workflows.libsonnet) file.
Under the templates section, add the following to the dag:
```
{
name: "my-test",
template: "my-test",
dependencies: ["setup-kubeflow"],
},
```
This will configure Argo to run `my-test` after setting up the Kubeflow cluster.
Next, add the following lines toward the end of the file:
```
$.parts(namespace, name, overrides).e2e(prow_env, bucket).buildTestTemplate(
"my-test"),
```
This assumes that there is a corresponding Python file named `my_test.py` (note the difference between dashes and
underscores).