Compare commits

189 Commits

| Author | SHA1 | Date |
|---|---|---|
| | f8ee31410c | |
| | ec5255280c | |
| | d1f7be63ab | |
| | a190ca253b | |
| | 695c2c67f0 | |
| | 75ec421d62 | |
| | 25d7b1109e | |
| | d2d5f77a97 | |
| | c4ccb4ca7e | |
| | aa33dc51b7 | |
| | 9e84dad37a | |
| | c9d5653de3 | |
| | 4618e321ab | |
| | ca7bf97da4 | |
| | 1c633d76ff | |
| | 3693f59663 | |
| | fa2fad7d6e | |
| | 8f4a602ce6 | |
| | ad85546c23 | |
| | babcb76f91 | |
| | ba7a09ace6 | |
| | 545f86bfe9 | |
| | 568e3845f5 | |
| | 8b84559944 | |
| | ee2384b911 | |
| | 2fbb3d7ed4 | |
| | 19b5133e6e | |
| | 8d413b5861 | |
| | 2f6e202bbf | |
| | f3d52fa73a | |
| | ece85b8ce3 | |
| | d497232013 | |
| | 9407f9b1a0 | |
| | d9bf195879 | |
| | 19abf194bb | |
| | 1f9350d78c | |
| | 23e9731b52 | |
| | d6b177b93d | |
| | 0ca2670770 | |
| | 7d7f75ad2d | |
| | 4b21f7299b | |
| | 36a59bba67 | |
| | ccdbf44815 | |
| | 36b17b4175 | |
| | 1058d48063 | |
| | ce9c5f3bff | |
| | 970afbd209 | |
| | f1bb3bcdbb | |
| | b814410627 | |
| | 38218aa3a0 | |
| | 13fa5c8dc8 | |
| | f098f1af85 | |
| | b0e411cab5 | |
| | 5e18210479 | |
| | 13df29407c | |
| | 0a701eb03d | |
| | 0482946a0c | |
| | 0d4b513d65 | |
| | e8b9fcd10d | |
| | 190c18e840 | |
| | dc0929f32f | |
| | 74ade74d3e | |
| | 316e33c999 | |
| | fc47e460e1 | |
| | 1cba9b99dc | |
| | 866ec44648 | |
| | ac164b85bf | |
| | d61a784a13 | |
| | 74fd3f2ad3 | |
| | a765b1c5a0 | |
| | 0838d54757 | |
| | ca735b6152 | |
| | 969ad681a3 | |
| | 29b2d6d2c5 | |
| | 22a3df5023 | |
| | 68b71f9006 | |
| | 70278ce8f7 | |
| | 8e008a4916 | |
| | 46a795e3db | |
| | 76ca05975e | |
| | dce03cc700 | |
| | 7885f46081 | |
| | 8d6c23d14c | |
| | bd1b0da049 | |
| | e15cb18aeb | |
| | 82fd0ba7e5 | |
| | a1b7285e1d | |
| | 522a0c610f | |
| | b8af066a2f | |
| | 42b8fcae2e | |
| | 45c8e1b150 | |
| | fdcfd18a98 | |
| | 41fb18b640 | |
| | bf49baae30 | |
| | bd159b2d0f | |
| | 7c10b6756c | |
| | 0d95df6f1e | |
| | 11b771b417 | |
| | 223e534b91 | |
| | 7197b5cb40 | |
| | b2c5686543 | |
| | dfd3268cc6 | |
| | 513894a1f0 | |
| | 064927ef5c | |
| | a9ed5f6eaf | |
| | b2380e60dc | |
| | bf53ba33ea | |
| | 305005ebdf | |
| | b70297a03a | |
| | ded5780b29 | |
| | c1f39aba1f | |
| | 94fc66024f | |
| | c3e73610b0 | |
| | e279bad1cf | |
| | 3409e5b1e4 | |
| | 3afe470d8d | |
| | f11dae2a6f | |
| | a80b33508f | |
| | 6c2373d32e | |
| | b500f9eda2 | |
| | 98a43dc6d9 | |
| | 881780fb08 | |
| | 9064896a91 | |
| | c9dbc8f968 | |
| | 5748fe4136 | |
| | 33181529ab | |
| | 5e8b6ddbff | |
| | a3a348c00a | |
| | 7acbb8c408 | |
| | 19c9090bd7 | |
| | 48eed0fe82 | |
| | 3926187d64 | |
| | 95d4bbeb94 | |
| | dbf740f8cb | |
| | 64808b67e6 | |
| | 37d8ab4d50 | |
| | a031bae968 | |
| | f31e1b0be0 | |
| | 5034f390d2 | |
| | 43b60eddb7 | |
| | 1398c8f307 | |
| | acac0fbb25 | |
| | 451030cfcb | |
| | adb43b8d74 | |
| | fed8afc602 | |
| | dd69d9c1af | |
| | 768218e8f5 | |
| | d1e62ffa3a | |
| | c114755222 | |
| | 12f205ef89 | |
| | 5ac396c7ab | |
| | 8b05634bea | |
| | b7f0ecf50e | |
| | 57093a20fb | |
| | 70f4a13547 | |
| | d648a2a8cf | |
| | 0a7501c542 | |
| | e4631c492d | |
| | 6fd3d0e022 | |
| | f27a6780ce | |
| | ed2aea2f86 | |
| | a707f81ef6 | |
| | 23b4fe9090 | |
| | 3e7e915c16 | |
| | 8739eb536c | |
| | 875d0022b5 | |
| | 1449e75f92 | |
| | 10e1e629af | |
| | ff24a10944 | |
| | 8db2d49353 | |
| | cdf1bb3102 | |
| | 67a9150c56 | |
| | 0df51d7492 | |
| | 7f31c6b209 | |
| | ce87d1095d | |
| | c4d37efa2b | |
| | a577b6d6ce | |
| | 261cf3a362 | |
| | 4dc39d6b52 | |
| | a7e6a0fc19 | |
| | 4afe00e05a | |
| | 46093aec39 | |
| | bf33adad6d | |
| | 650d2ef0f8 | |
| | 14fa45c995 | |
| | 2029700bd8 | |
| | de8cb950de | |
| | 3fe9ae4026 | |
| | 81a8bf85c9 | |
@@ -1,25 +0,0 @@

```diff
-# Golang CircleCI 2.0 configuration file
-#
-# Check https://circleci.com/docs/2.0/language-go/ for more details
-version: 2
-jobs:
-  build:
-    docker:
-      - image: circleci/golang:1.14.10
-    working_directory: /go/src/github.com/kubeflow/arena
-    steps:
-      - checkout
-      - setup_remote_docker:
-          docker_layer_caching: false
-      - run:
-          name: run tests
-          command: |
-            test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
-            go vet ./...
-            go test -race -v ./...
-      - run: docker build -t acs/arena:$CIRCLE_BUILD_NUM -f Dockerfile.install .
-      - run:
-          name: codecov
-          command: |
-            go test -race -coverprofile=coverage.txt -covermode=atomic ./...
-            bash <(curl -s https://codecov.io/bash)
```
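One quirk in the removed "run tests" step above: the `|| (echo "please format Go code with 'gofmt'")` branch prints a warning but exits 0, so unformatted code never actually failed the build. A minimal sketch of a variant that does fail; the `check_fmt` helper name is made up for illustration and is not from the repository:

```shell
# check_fmt takes a newline-separated list of unformatted files
# (e.g. the output of `gofmt -l .`) and fails when the list is non-empty.
check_fmt() {
  if [ -n "$1" ]; then
    echo "please format Go code with 'gofmt': $1"
    return 1
  fi
  echo "all Go files formatted"
}
```

Used as `check_fmt "$(gofmt -l .)"`, the non-zero return propagates to CI, unlike the original one-liner.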
@@ -0,0 +1,18 @@

```diff
+bin/
+docs/
+jupyter/
+samples/
+sdk/
+.gitignore
+.readthedocs.yaml
+Dockerfile*
+LICENSE
+OWNERS
+README.md
+README_cn.md
+ROADMAP.md
+ROADMAP_cn.md
+cover.out
+demo.jpg
+mkdocs.yml
+prow_config.yaml
```
@@ -0,0 +1,48 @@

```diff
+name: Bug Report
+description: Tell us about a problem you are experiencing with Arena
+labels: ["kind/bug", "lifecycle/needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this Arena bug report!
+  - type: textarea
+    id: problem
+    attributes:
+      label: What happened?
+      description: |
+        Please provide as much info as possible.
+        Not doing so may result in your bug not being addressed in a timely manner.
+    validations:
+      required: true
+  - type: textarea
+    id: expected
+    attributes:
+      label: What did you expect to happen?
+    validations:
+      required: true
+  - type: textarea
+    id: environment
+    attributes:
+      label: Environment
+      value: |
+        Kubernetes version:
+
+        ```bash
+        $ kubectl version
+
+        ```
+
+        Arena version:
+
+        ```bash
+        $ arena version
+
+        ```
+    validations:
+      required: true
+  - type: input
+    id: votes
+    attributes:
+      label: Impacted by this bug?
+      value: Give it a 👍 We prioritize the issues with most 👍
```
@@ -0,0 +1,6 @@

```diff
+blank_issues_enabled: true
+
+contact_links:
+  - name: Arena Documentation
+    url: https://arena-docs.readthedocs.io/en/stable
+    about: Much help can be found in the docs
```
@@ -0,0 +1,28 @@

```diff
+name: Feature Request
+description: Suggest an idea for Arena
+labels: ["kind/feature", "lifecycle/needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this Arena feature request!
+  - type: textarea
+    id: feature
+    attributes:
+      label: What you would like to be added?
+      description: |
+        A clear and concise description of what you want to add to Arena.
+        Please consider to write Arena enhancement proposal if it is a large feature request.
+    validations:
+      required: true
+  - type: textarea
+    id: rationale
+    attributes:
+      label: Why is this needed?
+    validations:
+      required: true
+  - type: input
+    id: votes
+    attributes:
+      label: Love this feature?
+      value: Give it a 👍 We prioritize the features with most 👍
```
@@ -0,0 +1,27 @@

```diff
+name: Question
+description: Ask question about Arena
+labels: ["kind/question", "lifecycle/needs-triage"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for taking the time to fill out this question!
+  - type: textarea
+    id: feature
+    attributes:
+      label: What question do you want to ask?
+      description: |
+        A clear and concise description of what you want to ask about Arena.
+    validations:
+      required: true
+  - type: textarea
+    id: rationale
+    attributes:
+      label: Any additional context?
+    validations:
+      required: false
+  - type: input
+    id: votes
+    attributes:
+      label: Have the same question?
+      value: Give it a 👍 We prioritize the question with most 👍
```
@@ -0,0 +1,29 @@

```diff
+<!-- Thanks for sending a pull request! Here are some tips for you:
+1. If this is your first time, check our contributor guidelines: https://www.kubeflow.org/docs/about/contributing
+2. To know more about Arena, check the developer guide:
+   https://arena-docs.readthedocs.io/en/latest/
+3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
+-->
+
+## Purpose of this PR
+
+<!-- Provide a clear and concise description of the changes. Explain the motivation behind these changes and link to relevant issues or discussions. -->
+
+**Proposed changes:**
+
+- <Change 1>
+- <Change 2>
+- <Change 3>
+
+## Change Category
+
+<!-- Indicate the type of change by marking the applicable boxes. -->
+
+- [ ] Bugfix (non-breaking change which fixes an issue)
+- [ ] Feature (non-breaking change which adds functionality)
+- [ ] Breaking change (fix or feature that could affect existing functionality)
+- [ ] Documentation update
+
+### Rationale
+
+<!-- Provide reasoning for the changes if not already covered in the description above. -->
```
@@ -1,337 +1,26 @@

```diff
-updates:
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: .
-    open-pull-requests-limit: 10
-    package-ecosystem: docker
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: samples/docker/serve-custom-sample
-    open-pull-requests-limit: 10
-    package-ecosystem: docker
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/golang.org/x/net/http2
-    open-pull-requests-limit: 10
-    package-ecosystem: docker
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: kubernetes-artifacts/tf-operator
-    open-pull-requests-limit: 10
-    package-ecosystem: docker
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: kubernetes-artifacts/jobmon
-    open-pull-requests-limit: 10
-    package-ecosystem: docker
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: .
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - dims
-      - thockin
-      - justinsb
-      - tallclair
-      - piosz
-      - brancz
-      - DirectXMan12
-      - lavalamp
-    directory: vendor/k8s.io/klog
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - jayunit100
-      - hoegaarden
-      - andyxning
-      - neolit123
-      - pohly
-      - yagonobre
-      - vincepri
-      - detiber
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/google.golang.org/appengine
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/json-iterator/go
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/hashicorp/golang-lru
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/hashicorp/hcl
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/mitchellh/go-homedir
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/mitchellh/mapstructure
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/fsnotify/fsnotify
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/spf13/pflag
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/spf13/viper
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/spf13/afero
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/spf13/cast
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/spf13/jwalterweatherman
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/magiconair/properties
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/google/gofuzz
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/konsorten/go-windows-terminal-sequences
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/github.com/sirupsen/logrus
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/golang.org/x/oauth2
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
-  - assignees:
-      - cheyang
-      - wsxiaozhang
-      - denverdino
-    directory: vendor/gopkg.in/yaml.v2
-    open-pull-requests-limit: 10
-    package-ecosystem: gomod
-    reviewers:
-      - GarnettWang
-      - xiaozhouX
-      - osswangxining
-    schedule:
-      interval: daily
+version: 2
+updates:
+  - package-ecosystem: gomod
+    directory: /
+    schedule:
+      interval: daily
+
+  - package-ecosystem: maven
+    directory: /
+    schedule:
+      interval: daily
+
+  - package-ecosystem: pip
+    directory: /
+    schedule:
+      interval: daily
+
+  - package-ecosystem: docker
+    directory: /
+    schedule:
+      interval: daily
+
+  - package-ecosystem: github-actions
+    directory: /
+    schedule:
+      interval: daily
```
@@ -0,0 +1,5 @@

```diff
+# For https://mlbot.net a Github bot that labels issues using KubeFlow
+label-alias:
+  bug: kind/bug
+  feature_request: kind/feature
+  question: kind/question
```
@@ -0,0 +1,69 @@

```diff
+name: Check Release
+
+on:
+  pull_request:
+    branches:
+      - master
+    paths:
+      - VERSION
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+env:
+  SEMVER_PATTERN: '^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'
+  ARENA_ARTIFACTS_CHART: arena-artifacts
+
+jobs:
+  check:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+
+      - name: Configure Git
+        run: |
+          git config user.name "$GITHUB_ACTOR"
+          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
+
+      - name: Check whether version matches semver pattern
+        run: |
+          VERSION=$(cat VERSION)
+          if [[ ${VERSION} =~ ${{ env.SEMVER_PATTERN }} ]]; then
+            echo "Version '${VERSION}' matches semver pattern."
+          else
+            echo "Version '${VERSION}' does not match semver pattern."
+            exit 1
+          fi
+          echo "VERSION=${VERSION}" >> $GITHUB_ENV
+
+      - name: Check arena artifacts chart version and appVersion
+        run: |
+          CHART_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^version:' | awk '{print $2}')
+          CHART_APP_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^appVersion:' | awk '{print $2}')
+          if [[ ${CHART_VERSION} == ${VERSION} ]]; then
+            echo "Chart version '${CHART_VERSION}' matches version '${VERSION}'."
+          else
+            echo "Chart version '${CHART_VERSION}' does not match version '${VERSION}'."
+            exit 1
+          fi
+          if [[ ${CHART_APP_VERSION} == ${VERSION} ]]; then
+            echo "Chart appVersion '${CHART_APP_VERSION}' matches version '${VERSION}'."
+          else
+            echo "Chart appVersion '${CHART_APP_VERSION}' does not match version '${VERSION}'."
+            exit 1
+          fi
+
+      - name: Check if tag exists
+        run: |
+          git fetch --tags
+          if git tag -l | grep -q "^v${VERSION}$"; then
+            echo "Tag 'v${VERSION}' already exists."
+            exit 1
+          else
+            echo "Tag 'v${VERSION}' does not exist."
+          fi
```
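The semver gate in the workflow above can be exercised locally. This sketch reuses the workflow's `SEMVER_PATTERN` verbatim; the `check_version` helper and the sample version strings are illustrative:

```shell
# Same regex as the workflow's SEMVER_PATTERN env var:
# MAJOR.MINOR.PATCH with an optional -rc.N suffix, anchored at both ends.
SEMVER_PATTERN='^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'

# Mirrors the workflow step: report a match, fail otherwise.
check_version() {
  if [[ $1 =~ $SEMVER_PATTERN ]]; then
    echo "Version '$1' matches semver pattern."
  else
    echo "Version '$1' does not match semver pattern."
    return 1
  fi
}

check_version "0.13.0"          # plain release: matches
check_version "0.13.0-rc.1"     # release candidate: matches
check_version "v0.13.0" || true # leading 'v' is rejected; the tag step adds it later
```

Note that the VERSION file holds the bare number; the `v` prefix only appears when the `push_tag`-style step creates the git tag `v${VERSION}`.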
@@ -0,0 +1,137 @@

```diff
+name: Integration Test
+
+on:
+  pull_request:
+    branches:
+      - master
+      - release-*
+
+  push:
+    branches:
+      - master
+      - release-*
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.actor }}
+  cancel-in-progress: true
+
+jobs:
+  build-arena:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+
+      - name: Run go mod tidy
+        run: |
+          go mod tidy
+          if ! git diff --quiet; then
+            echo "Please run 'go mod tidy' to add missing and remove unused dependencies"
+            exit 1
+          fi
+
+      - name: Run go mod vendor
+        run: |
+          go mod vendor
+          if ! git diff --quiet; then
+            echo "Please run 'go mod vendor' to make vendored copy of dependencies"
+            exit 1
+          fi
+
+      - name: Run go fmt check
+        run: |
+          make go-fmt
+          if ! git diff --quiet; then
+            echo "Please run 'make go-fmt' to run go fmt against code"
+            exit 1
+          fi
+
+      - name: Run go vet check
+        run: |
+          make go-vet
+          if ! git diff --quiet; then
+            echo "Please run 'make go-vet' to run go vet against code"
+            exit 1
+          fi
+
+      - name: Run golangci-lint
+        run: |
+          make go-lint
+
+      - name: Run Go unit tests
+        run: |
+          make unit-test
+
+      - name: Run Helm unit tests
+        run: |
+          make helm-unittest
+
+      - name: Build arena binary
+        run: |
+          make arena
+
+  build-java-sdk:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+
+      - uses: actions/setup-java@v5
+        with:
+          distribution: zulu
+          java-version: 8
+
+      - name: Build Java SDK
+        run: |
+          make java-sdk
+
+  build-docs:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.11
+
+      - name: Build docs
+        run: |
+          pip install -r docs/requirements.txt
+          mkdocs build --strict
+
+  e2e-test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+
+      - name: Set up Kind cluster
+        uses: helm/kind-action@v1
+        with:
+          node_image: kindest/node:v1.29.10
+          config: arena-artifacts/ci/kind-config.yaml
+
+      - name: Install arena client
+        run: |
+          make arena-installer
+          tar -zxf arena-installer-*.tar.gz
+          arena-installer-*/install.sh --only-binary
+
+      - name: Run e2e tests
+        run: |
+          make e2e-test
```
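The tidy/vendor/fmt/vet steps above all follow one pattern: run a generator, then fail if the working tree changed. A generic helper capturing that pattern; the `ensure_clean` name is made up for illustration and is not part of the repository's Makefile:

```shell
# Run a command, then fail if it left uncommitted changes behind,
# mirroring the `git diff --quiet` guard used in the workflow steps.
ensure_clean() {
  "$@"
  if ! git diff --quiet; then
    echo "working tree modified by: $*" >&2
    return 1
  fi
}
```

Used as `ensure_clean go mod tidy` or `ensure_clean make go-fmt`, so a dirty diff turns a silently-rewritten file into a CI failure.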
@@ -0,0 +1,242 @@

```diff
+name: Release
+
+on:
+  push:
+    branches:
+      - master
+    paths:
+      - VERSION
+
+env:
+  IMAGE_REGISTRY: ghcr.io
+  IMAGE_REPOSITORY: ${{ github.repository }}
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  package-arena-installer:
+    runs-on: ubuntu-latest
+
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - linux
+          - darwin
+        arch:
+          - amd64
+          - arm64
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v5
+
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
+
+      - name: Get git commit id
+        run: |
+          COMMIT=$(git rev-parse --short HEAD)
+          echo "COMMIT=${COMMIT}" >>${GITHUB_ENV}
+
+      - name: Build arena installer tarball
+        run: |
+          make arena-installer OS=${{ matrix.os }} ARCH=${{ matrix.arch }}
+
+      - uses: actions/upload-artifact@v4
+        with:
+          name: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}
+          path: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}.tar.gz
+          if-no-files-found: error
+          overwrite: true
+
+  build-arena-image:
+    name: Build Arena container image
+
+    runs-on: ubuntu-latest
+
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - linux/amd64
+          - linux/arm64
+
+    steps:
+      - name: Prepare
+        run: |
+          platform=${{ matrix.platform }}
+          echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
+
+      - name: Checkout source code
+        uses: actions/checkout@v5
+
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> $GITHUB_ENV
+
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
+          tags: |
+            type=semver,pattern={{version}},value=${{ env.VERSION }}
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+
+      - name: Set up Docker buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to container registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.IMAGE_REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Build and push by digest
+        id: build
+        uses: docker/build-push-action@v6
+        with:
+          platforms: ${{ matrix.platform }}
+          labels: ${{ steps.meta.outputs.labels }}
+          outputs: type=image,name=${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }},push-by-digest=true,name-canonical=true,push=true
+
+      - name: Export digest
+        run: |
+          mkdir -p /tmp/digests
+          digest="${{ steps.build.outputs.digest }}"
+          touch "/tmp/digests/${digest#sha256:}"
+
+      - name: Upload digest
+        uses: actions/upload-artifact@v4
+        with:
+          name: digests-${{ env.PLATFORM_PAIR }}
+          path: /tmp/digests/*
+          if-no-files-found: error
+          retention-days: 1
+
+  release-image:
+    needs:
+      - build-arena-image
+
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> $GITHUB_ENV
+
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
+          tags: |
+            type=semver,pattern={{version}},value=${{ env.VERSION }}
+
+      - name: Download digests
+        uses: actions/download-artifact@v5
+        with:
+          path: /tmp/digests
+          pattern: digests-*
+          merge-multiple: true
+
+      - name: Set up Docker buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to container registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.IMAGE_REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Create manifest list and push
+        working-directory: /tmp/digests
+        run: |
+          docker buildx imagetools create $(jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON") \
+            $(printf '${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}@sha256:%s ' *)
+
+      - name: Inspect image
+        run: |
+          docker buildx imagetools inspect ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}:${{ steps.meta.outputs.version }}
+
+  push_tag:
+    needs:
+      - package-arena-installer
+      - release-image
+
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+
+      - name: Configure Git
+        run: |
+          git config user.name "$GITHUB_ACTOR"
+          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
+
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
+
+      - name: Create and push tag
+        run: |
+          TAG="v${VERSION}"
+          git tag -a ${TAG} -m "Release v${VERSION}"
+          git push origin ${TAG}
+
+  draft_release:
+    needs:
+      - push_tag
+
+    permissions:
+      contents: write
+
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v5
+
+      - name: Configure Git
+        run: |
+          git config user.name "$GITHUB_ACTOR"
+          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
+
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
+
+      - name: Download arena installer tarballs
+        uses: actions/download-artifact@v5
+        with:
+          pattern: arena-installer-${{ env.VERSION }}-{linux,darwin}-{amd64,arm64}
+
+      - name: Release
+        uses: softprops/action-gh-release@v2
+        with:
+          token: ${{ secrets.GITHUB_TOKEN }}
+          tag_name: v${{ env.VERSION }}
+          prerelease: ${{ contains(env.VERSION, 'rc') }}
+          target_commitish: ${{ github.sha }}
+          draft: true
+          files: |
+            arena-installer-*/arena-installer-*.tar.gz
+          fail_on_unmatched_files: true
```
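A small detail from the "Prepare" step above: the per-platform artifact name (`digests-linux-amd64`, etc.) is derived with bash parameter expansion, replacing every `/` in the platform string with `-`. The snippet below is a standalone sketch of just that expansion:

```shell
# ${platform//\//-} replaces all "/" in $platform with "-",
# turning "linux/amd64" into "linux-amd64" (the escaped \/ is the pattern).
platform="linux/amd64"
echo "PLATFORM_PAIR=${platform//\//-}"
```

This matters because artifact names cannot contain `/`, while buildx platform identifiers always do.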
@@ -0,0 +1,43 @@

```diff
+# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
+#
+# You can adjust the behavior by modifying this file.
+# For more information, see:
+# https://github.com/actions/stale
+
+name: Mark stale issues and pull requests
+
+on:
+  schedule:
+    - cron: "0 0 * * 0"
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+
+    steps:
+      - uses: actions/stale@v9
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          days-before-stale: 360
+          days-before-close: 180
+          stale-issue-message: >
+            This issue has been automatically marked as stale because it has not had
+            recent activity. It will be closed if no further activity occurs. Thank you
+            for your contributions.
+          close-issue-message: >
+            This issue has been automatically closed because it has not had recent
+            activity. Please comment "/reopen" to reopen it.
+          stale-issue-label: lifecycle/stale
+          exempt-issue-labels: lifecycle/frozen
+          stale-pr-message: >
+            This pull request has been automatically marked as stale because it has not had
+            recent activity. It will be closed if no further activity occurs. Thank you
+            for your contributions.
+          close-pr-message: >
+            This pull request has been automatically closed because it has not had recent
+            activity. Please comment "/reopen" to reopen it.
+          stale-pr-label: lifecycle/stale
+          exempt-pr-labels: lifecycle/frozen
```
@@ -1,12 +1,25 @@

```
bin/
**/*.tgz
**/.DS_Store
.idea
.kube
.vscode
Library

public/
site/
tmp/
sdk/arena-python-sdk/dist/
sdk/arena-python-sdk/build/
sdk/arena-python-sdk/arenasdk.egg-info/
__pycache__
.hugo_build.lock
.kube
*.tgz
*.tar.gz

# Python
__pycache__/

# Go
cover.out

# IDE files
.idea/
.vscode/

# MacOS
.DS_Store
```
@ -0,0 +1,76 @@
version: "2"

run:
  # Timeout for total work, e.g. 30s, 5m, 5m30s.
  # If the value is lower or equal to 0, the timeout is disabled.
  # Default: 0 (disabled)
  timeout: 2m

linters:
  # Enable specific linters.
  # https://golangci-lint.run/usage/linters/#enabled-by-default
  enable:
    # Detects places where loop variables are copied.
    - copyloopvar
    # Checks for duplicate words in the source code.
    - dupword
    # Tool for detection of FIXME, TODO and other comment keywords.
    # - godox
    # Enforces consistent import aliases.
    - importas
    # Find code that shadows one of Go's predeclared identifiers.
    - predeclared
    # Check that struct tags are well aligned.
    - tagalign
    # Remove unnecessary type conversions.
    - unconvert
    # Checks Go code for unused constants, variables, functions and types.
    - unused
  # Disable specific linters.
  disable:
    # Errcheck is a program for checking for unchecked errors in Go code.
    - errcheck

  settings:
    importas:
      # List of aliases
      alias:
        - pkg: k8s.io/api/admissionregistration/v1
          alias: admissionregistrationv1
        - pkg: k8s.io/api/apps/v1
          alias: appsv1
        - pkg: k8s.io/api/batch/v1
          alias: batchv1
        - pkg: k8s.io/api/core/v1
          alias: corev1
        - pkg: k8s.io/api/extensions/v1beta1
          alias: extensionsv1beta1
        - pkg: k8s.io/api/networking/v1
          alias: networkingv1
        - pkg: k8s.io/apimachinery/pkg/apis/meta/v1
          alias: metav1
        - pkg: sigs.k8s.io/controller-runtime
          alias: ctrl

  exclusions:
    # Which file paths to exclude: they will be analyzed, but issues from them won't be reported.
    # "/" will be replaced by the current OS file path separator to properly work on Windows.
    # Default: []
    paths:
      - pkg/operators

issues:
  # Maximum issues count per one linter.
  # Set to 0 to disable.
  # Default: 50
  max-issues-per-linter: 50
  # Maximum count of issues with the same text.
  # Set to 0 to disable.
  # Default: 3
  max-same-issues: 10

formatters:
  enable:
    # Check import statements are formatted according to the 'goimports' command.
    - goimports
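As a rough illustration of what two of the linters enabled above catch, consider this hypothetical snippet (not from the repository): `predeclared` would flag the local variable shadowing Go's built-in `len`, and `unconvert` would flag the redundant type conversion.

```go
package main

import "fmt"

// count uses "len" as a local variable name, which shadows the predeclared
// identifier; the predeclared linter reports exactly this pattern.
func count(items []string) int {
	len := 0 // shadows Go's built-in len
	for range items {
		len++
	}
	return len
}

func main() {
	n := 3
	// n is already an int, so int(n) is an unnecessary conversion that the
	// unconvert linter would report.
	fmt.Println(int(n) + count([]string{"a", "b"}))
}
```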
@ -4,6 +4,12 @@
 # Required
 version: 2

+# Set the version of Python and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.12"
+
 mkdocs:
   configuration: mkdocs.yml
@ -13,6 +19,5 @@ formats:
 # Optionally set the version of Python and requirements required to build your docs
 python:
-  version: 3.7
   install:
     - requirements: docs/requirements.txt
16
.travis.yml

@ -1,16 +0,0 @@
language: go

go:
  - "1.14.10"

go_import_path: github.com/kubeflow/arena

# let us have speedy Docker-based Travis workers
sudo: false

script:
  - go build -o bin/arena cmd/arena/*.go
  - go vet ./...
  - go test -v ./...
  - test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
  - go test -race -v ./...
@ -0,0 +1,236 @@
# Changelog

## [v0.15.1](https://github.com/kubeflow/arena/tree/v0.15.1) (2025-06-25)

### Features

- Add support for configuring tolerations ([#1337](https://github.com/kubeflow/arena/pull/1337) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Remove kubernetes artifacts ([#1329](https://github.com/kubeflow/arena/pull/1329) by [@ChenYi015](https://github.com/ChenYi015))
- [CI] Add CI workflow for releasing Arena images ([#1340](https://github.com/kubeflow/arena/pull/1340) by [@ChenYi015](https://github.com/ChenYi015))
- Update uninstall bash script ([#1335](https://github.com/kubeflow/arena/pull/1335) by [@ChenYi015](https://github.com/ChenYi015))
- Fix golangci-lint issues ([#1341](https://github.com/kubeflow/arena/pull/1341) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang version from 1.22.7 to 1.23.10 ([#1345](https://github.com/kubeflow/arena/pull/1345) by [@ChenYi015](https://github.com/ChenYi015))
- chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 ([#1343](https://github.com/kubeflow/arena/pull/1343) by [@dependabot[bot]](https://github.com/apps/dependabot))
- chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 ([#1334](https://github.com/kubeflow/arena/pull/1334) by [@dependabot[bot]](https://github.com/apps/dependabot))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.15.0...v0.15.1)

## [v0.15.0](https://github.com/kubeflow/arena/tree/v0.15.0) (2025-06-04)

### Features

- refactor: use helm lib instead of helm binary ([#1207](https://github.com/kubeflow/arena/pull/1207) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add new value for using localtime in cron-operator ([#1296](https://github.com/kubeflow/arena/pull/1296) by [@ChenYi015](https://github.com/ChenYi015))
- Delete all services when the TFJob is terminated ([#1316](https://github.com/kubeflow/arena/pull/1316) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of cron-operator deployment configurable ([#1325](https://github.com/kubeflow/arena/pull/1325) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of tf-operator deployment configurable ([#1323](https://github.com/kubeflow/arena/pull/1323) by [@ChenYi015](https://github.com/ChenYi015))
- Add custom device support for kserve and kserving. ([#1315](https://github.com/kubeflow/arena/pull/1315) by [@Leoyzen](https://github.com/Leoyzen))
- Feat: support affinity policy for kserve and tfjob ([#1319](https://github.com/kubeflow/arena/pull/1319) by [@Syspretor](https://github.com/Syspretor))
- Feat: support separate affinity policy configuration for PS and worke… ([#1331](https://github.com/kubeflow/arena/pull/1331) by [@Syspretor](https://github.com/Syspretor))

### Bug Fixes

- fix: job status displays incorrectly ([#1289](https://github.com/kubeflow/arena/pull/1289) by [@ChenYi015](https://github.com/ChenYi015))
- fix: service account should use release namespace ([#1308](https://github.com/kubeflow/arena/pull/1308) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Add basic e2e tests ([#1225](https://github.com/kubeflow/arena/pull/1225) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 ([#1290](https://github.com/kubeflow/arena/pull/1290) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Add stale bot to mark stale issues and PRs ([#1141](https://github.com/kubeflow/arena/pull/1141) by [@ChenYi015](https://github.com/ChenYi015))
- Fix typos in multiple files ([#1304](https://github.com/kubeflow/arena/pull/1304) by [@co63oc](https://github.com/co63oc))
- Fix typos in multiple files ([#1310](https://github.com/kubeflow/arena/pull/1310) by [@co63oc](https://github.com/co63oc))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.2...v0.15.0)

## [v0.14.2](https://github.com/kubeflow/arena/tree/v0.14.2) (2025-03-10)

### Misc

- Fix typos ([#1276](https://github.com/kubeflow/arena/pull/1276) by [@co63oc](https://github.com/co63oc))
- Update pytorch operator image ([#1281](https://github.com/kubeflow/arena/pull/1281) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.1...v0.14.2)

## [v0.14.1](https://github.com/kubeflow/arena/tree/v0.14.1) (2025-02-24)

### Bug Fixes

- fix: device value does not support k8s resource quantity ([#1267](https://github.com/kubeflow/arena/pull/1267) by [@ChenYi015](https://github.com/ChenYi015))
- fix: pytorchjob does not support backoff limit ([#1272](https://github.com/kubeflow/arena/pull/1272) by [@ChenYi015](https://github.com/ChenYi015))
- unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled ([#1273](https://github.com/kubeflow/arena/pull/1273) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- docs: fixed typo ([#1257](https://github.com/kubeflow/arena/pull/1257) by [@DBMxrco](https://github.com/DBMxrco))
- Bump github.com/golang/glog from 1.2.3 to 1.2.4 ([#1263](https://github.com/kubeflow/arena/pull/1263) by [@dependabot[bot]](https://github.com/apps/dependabot))
- fix: format of tensorflow standalone training docs is messed up ([#1265](https://github.com/kubeflow/arena/pull/1265) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.0...v0.14.1)

## [v0.14.0](https://github.com/kubeflow/arena/tree/v0.14.0) (2025-02-12)

### Features

- rename parameter ([#1262](https://github.com/kubeflow/arena/pull/1262) by [@gujingit](https://github.com/gujingit))

### Misc

- Add changelog for v0.13.1 ([#1248](https://github.com/kubeflow/arena/pull/1248) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 ([#1254](https://github.com/kubeflow/arena/pull/1254) by [@dependabot[bot]](https://github.com/apps/dependabot))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.1...v0.14.0)

## [v0.13.1](https://github.com/kubeflow/arena/tree/v0.13.1) (2025-01-13)

### Misc

- feat: add linux/arm64 support for tf-operator image ([#1238](https://github.com/kubeflow/arena/pull/1238) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for mpi-operator image ([#1239](https://github.com/kubeflow/arena/pull/1239) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for cron-operator image ([#1240](https://github.com/kubeflow/arena/pull/1240) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for et-operator image ([#1241](https://github.com/kubeflow/arena/pull/1241) by [@ChenYi015](https://github.com/ChenYi015))
- Add PyTorch mnist example ([#1237](https://github.com/kubeflow/arena/pull/1237) by [@ChenYi015](https://github.com/ChenYi015))
- Update the version of elastic-job-supervisor in arena-artifacts ([#1247](https://github.com/kubeflow/arena/pull/1247) by [@AlanFokCo](https://github.com/AlanFokCo))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.0...v0.13.1)

## [v0.13.0](https://github.com/kubeflow/arena/tree/v0.13.0) (2024-12-23)

### New Features

- feat: add support for torchrun ([#1228](https://github.com/kubeflow/arena/pull/1228) by [@ChenYi015](https://github.com/ChenYi015))
- Update pytorch-operator image ([#1234](https://github.com/kubeflow/arena/pull/1234) by [@ChenYi015](https://github.com/ChenYi015))

### Bug Fix

- Avoid listing jobs and statefulsets when get pytorchjob ([#1229](https://github.com/kubeflow/arena/pull/1229) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Update tfjob standalone training job doc ([#1222](https://github.com/kubeflow/arena/pull/1222) by [@ChenYi015](https://github.com/ChenYi015))
- Remove archived docs ([#1208](https://github.com/kubeflow/arena/pull/1208) by [@ChenYi015](https://github.com/ChenYi015))
- Add changelog for v0.12.1 ([#1224](https://github.com/kubeflow/arena/pull/1224) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.29.0 to 0.31.0 ([#1231](https://github.com/kubeflow/arena/pull/1231) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 ([#1227](https://github.com/kubeflow/arena/pull/1227) by [@dependabot[bot]](https://github.com/apps/dependabot))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.12.1...v0.13.0)

## [v0.12.1](https://github.com/kubeflow/arena/tree/v0.12.1) (2024-11-25)

### New Features

- Support MPI Job with generic devices ([#1209](https://github.com/kubeflow/arena/pull/1209) by [@cheyang](https://github.com/cheyang))

### Bug Fix

- Update tf-operator image to fix clean pod policy issues ([#1200](https://github.com/kubeflow/arena/pull/1200) by [@ChenYi015](https://github.com/ChenYi015))
- Fix etjob rendering error when using local logging dir ([#1203](https://github.com/kubeflow/arena/pull/1203) by [@TrafalgarZZZ](https://github.com/TrafalgarZZZ))
- Fix the functionality of generating kubeconfig (#1204) ([#1205](https://github.com/kubeflow/arena/pull/1205) by [@wqlparallel](https://github.com/wqlparallel))
- Update cron operator image ([#1214](https://github.com/kubeflow/arena/pull/1214) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Add changelog for v0.12.0 ([#1199](https://github.com/kubeflow/arena/pull/1199) by [@ChenYi015](https://github.com/ChenYi015))
- Add go mod vendor check to integration test ([#1198](https://github.com/kubeflow/arena/pull/1198) by [@ChenYi015](https://github.com/ChenYi015))
- bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 ([#1202](https://github.com/kubeflow/arena/pull/1202) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Publish releases only on master branch ([#1210](https://github.com/kubeflow/arena/pull/1210) by [@ChenYi015](https://github.com/ChenYi015))
- Add docs for releasing arena ([#1201](https://github.com/kubeflow/arena/pull/1201) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.28.0 to 0.29.0 ([#1206](https://github.com/kubeflow/arena/pull/1206) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.1 ([#1215](https://github.com/kubeflow/arena/pull/1215) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/29b2d6d2...v0.12.1)

## [v0.12.0](https://github.com/kubeflow/arena/tree/v0.12.0) (2024-11-11)

### New Features

- Feat: add support for distributed serving type ([#1187](https://github.com/kubeflow/arena/pull/1187) by [@linnlh](https://github.com/linnlh))
- Support distributed serving with vendor update ([#1194](https://github.com/kubeflow/arena/pull/1194) by [@cheyang](https://github.com/cheyang))

### Misc

- Bump github.com/golang/glog from 1.2.2 to 1.2.3 ([#1189](https://github.com/kubeflow/arena/pull/1189) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.60.0 to 0.60.1 ([#1182](https://github.com/kubeflow/arena/pull/1182) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.42 to 9.5.44 ([#1190](https://github.com/kubeflow/arena/pull/1190) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.0 ([#1197](https://github.com/kubeflow/arena/pull/1197) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/46a795e3...v0.12.0)

## [v0.11.0](https://github.com/kubeflow/arena/tree/v0.11.0) (2024-10-24)

### New Features

- Support ray job ([#1123](https://github.com/kubeflow/arena/pull/1123) by [@qile123](https://github.com/qile123))

### Misc

- Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 ([#1176](https://github.com/kubeflow/arena/pull/1176) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.40 to 9.5.42 ([#1179](https://github.com/kubeflow/arena/pull/1179) by [@dependabot[bot]](https://github.com/apps/dependabot))

[Full Changelog](https://github.com/kubeflow/arena/compare/e15cb18...v0.11.0)

## [v0.10.1](https://github.com/kubeflow/arena/tree/v0.10.1) (2024-10-14)

### Bug Fixes

- fix: keep arena installer after installing the binary ([#1164](https://github.com/kubeflow/arena/pull/1164) by [@ChenYi015](https://github.com/ChenYi015))
- fix: unsupported success policy when success policy is not specified ([#1170](https://github.com/kubeflow/arena/pull/1170) by [@ChenYi015](https://github.com/ChenYi015))
- fix: failed to sync cache due to status subresouce missed in tfjob CRD ([#1173](https://github.com/kubeflow/arena/pull/1173) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Bump github.com/prometheus/common from 0.59.1 to 0.60.0 ([#1160](https://github.com/kubeflow/arena/pull/1160) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.27.0 to 0.28.0 ([#1162](https://github.com/kubeflow/arena/pull/1162) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Migrate docker image to ACREE ([#1171](https://github.com/kubeflow/arena/pull/1171) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.38 to 9.5.40 ([#1166](https://github.com/kubeflow/arena/pull/1166) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.34.2 to 1.35.1 ([#1163](https://github.com/kubeflow/arena/pull/1163) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Remove redundant run_arena.sh file ([#1172](https://github.com/kubeflow/arena/pull/1172) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.10.0...v0.10.1)

## [v0.10.0](https://github.com/kubeflow/arena/tree/v0.10.0) (2024-09-29)

### New Features

- Support multiple type devices ([#1122](https://github.com/kubeflow/arena/pull/1122) by [@lizhiboo](https://github.com/lizhiboo))
- Increase RSA key bit size from 1024 to 2048 ([#1130](https://github.com/kubeflow/arena/pull/1130) by [@ChenYi015](https://github.com/ChenYi015))
- Add success policy to TF training job ([#1148](https://github.com/kubeflow/arena/pull/1148) by [@ChenYi015](https://github.com/ChenYi015))

### Bug Fixes

- Fix submitting spark training jobs and update docs ([#1112](https://github.com/kubeflow/arena/pull/1112) by [@ChenYi015](https://github.com/ChenYi015))
- docs: fix broken links and add CI for checking document build status ([#1131](https://github.com/kubeflow/arena/pull/1131) by [@ChenYi015](https://github.com/ChenYi015))
- [Bugfix] Make PytorchJob devices format to key=value ([#1155](https://github.com/kubeflow/arena/pull/1155) by [@AlanFokCo](https://github.com/AlanFokCo))

### SDK

- Bump arena Java SDK version to 1.0.8 ([#1124](https://github.com/kubeflow/arena/pull/1124) by [@ChenYi015](https://github.com/ChenYi015))

### Misc

- Remove docker dependency ([#1113](https://github.com/kubeflow/arena/pull/1113) by [@Syulin7](https://github.com/Syulin7))
- Update Makefile and release workflow ([#1128](https://github.com/kubeflow/arena/pull/1128) by [@ChenYi015](https://github.com/ChenYi015))
- chore: remove travis and circle CI ([#1129](https://github.com/kubeflow/arena/pull/1129) by [@ChenYi015](https://github.com/ChenYi015))
- chore: add issue templates and update depenabot bot ([#1140](https://github.com/kubeflow/arena/pull/1140) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/golang/glog from 1.1.2 to 1.2.2 ([#1139](https://github.com/kubeflow/arena/pull/1139) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.21.0 to 0.27.0 ([#1126](https://github.com/kubeflow/arena/pull/1126) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/spf13/cobra from 1.8.0 to 1.8.1 ([#1137](https://github.com/kubeflow/arena/pull/1137) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.12.0 to 2.14.0 ([#1134](https://github.com/kubeflow/arena/pull/1134) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/kserve/kserve from 0.13.0 to 0.13.1 ([#1135](https://github.com/kubeflow/arena/pull/1135) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.45.0 to 0.59.1 ([#1138](https://github.com/kubeflow/arena/pull/1138) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump client-java from 10.0.1 to 11.0.1 ([#1132](https://github.com/kubeflow/arena/pull/1132) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.4 ([#1144](https://github.com/kubeflow/arena/pull/1144) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.14.0 to 2.15.0 ([#1143](https://github.com/kubeflow/arena/pull/1143) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.34 to 9.5.35 ([#1145](https://github.com/kubeflow/arena/pull/1145) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.0 to 2.15.1 ([#1147](https://github.com/kubeflow/arena/pull/1147) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.1 to 2.15.2 ([#1150](https://github.com/kubeflow/arena/pull/1150) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.35 to 9.5.36 ([#1151](https://github.com/kubeflow/arena/pull/1151) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang from 1.21 to 1.22.7 ([#1142](https://github.com/kubeflow/arena/pull/1142) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.36 to 9.5.38 ([#1153](https://github.com/kubeflow/arena/pull/1153) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.2 to 2.15.3 ([#1156](https://github.com/kubeflow/arena/pull/1156) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.10.0 ([#1157](https://github.com/kubeflow/arena/pull/1157) by [@ChenYi015](https://github.com/ChenYi015))

[Full Changelog](https://github.com/kubeflow/arena/compare/v0.9.16...v0.10.0)
@ -0,0 +1,41 @@
ARG BASE_IMAGE=debian:12-slim

FROM golang:1.24.0 AS builder

ARG TARGETOS
ARG TARGETARCH

WORKDIR /workspace

COPY . .

RUN set -eux && \
    VERSION=$(cat VERSION) && \
    make arena-installer OS=${TARGETOS} ARCH=${TARGETARCH} && \
    mv arena-installer-${VERSION}-${TARGETOS}-${TARGETARCH}.tar.gz arena-installer.tar.gz

FROM ${BASE_IMAGE}

ARG TARGETOS
ARG TARGETARCH

WORKDIR /root

RUN apt-get update \
    && apt-get install -y tini \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder /workspace/arena-installer.tar.gz .

RUN set -eux && \
    tar -zxvf arena-installer.tar.gz && \
    mv arena-installer-*-${TARGETOS}-${TARGETARCH} arena-installer && \
    arena-installer/install.sh --only-binary && \
    rm -rf arena-installer.tar.gz

COPY entrypoint.sh /usr/local/bin/

ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
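The builder stage renames the tarball produced by `make arena-installer` to a fixed name before handing it to the runtime stage. The naming convention it relies on can be sketched as follows (a hypothetical helper for illustration, not code from the repository):

```go
package main

import "fmt"

// installerName reproduces the tarball naming the builder stage expects
// from `make arena-installer`: arena-installer-<version>-<os>-<arch>.tar.gz,
// where <os> and <arch> come from the buildx TARGETOS/TARGETARCH args.
func installerName(version, goos, goarch string) string {
	return fmt.Sprintf("arena-installer-%s-%s-%s.tar.gz", version, goos, goarch)
}

func main() {
	// For a linux/amd64 build of version 0.15.1:
	fmt.Println(installerName("0.15.1", "linux", "amd64"))
}
```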
@ -1,94 +0,0 @@
#**********************************************************************
# Builder
#
# Create a go runtime for building arena

ARG GOLANG_VERSION=1.16
ARG KUBE_VERSION=v1.23.0
ARG HELM_VERSION=v3.7.2
ARG VERSION=v0.3.0-rc
ARG OS_ARCH=linux-amd64
ARG COMMIT=stable
ARG TARGET=cli-$OS_ARCH


FROM golang:$GOLANG_VERSION-stretch as build

ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET

ENV KUBE_VERSION $KUBE_VERSION
ENV HELM_VERSION $HELM_VERSION
ENV VERSION $VERSION
ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
ENV GO111MODULE off

RUN mkdir -p /go/src/github.com/kubeflow/arena

WORKDIR /go/src/github.com/kubeflow/arena
COPY . .

RUN make $TARGET

RUN wget https://get.helm.sh/helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
    tar -xvf helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
    mv $OS_ARCH/helm /usr/local/bin/helm && \
    chmod u+x /usr/local/bin/helm && \
    chmod u+x /go/src/github.com/kubeflow/arena/install.sh

RUN OS=$(echo $OS_ARCH | cut -f1 -d-) && \
    ARCH=$(echo $OS_ARCH | cut -f2 -d-) && \
    cd /usr/local/bin && \
    curl -LO https://dl.k8s.io/release/${KUBE_VERSION}/bin/${OS}/${ARCH}/kubectl && \
    chmod +x /usr/local/bin/kubectl


#**********************************************************************
#
# Create arena package
#

FROM centos:7

ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET
ARG COMMIT
ARG VERSION

ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
ENV VERSION $VERSION

ENV ARENA_HOME /arena-installer
ENV ARENA_TARFILE /arena-installer-$VERSION-$COMMIT-$OS_ARCH.tar.gz

RUN mkdir -p $ARENA_HOME/bin

COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena $ARENA_HOME/bin/arena
COPY --from=build /go/src/github.com/kubeflow/arena/uninstall.sh $ARENA_HOME/bin/arena-uninstall
COPY --from=build /go/src/github.com/kubeflow/arena/install.sh $ARENA_HOME/install.sh
COPY --from=build /go/src/github.com/kubeflow/arena/arena-gen-kubeconfig.sh $ARENA_HOME/bin/arena-gen-kubeconfig.sh
COPY --from=build /usr/local/bin/helm $ARENA_HOME/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts $ARENA_HOME/kubernetes-artifacts
COPY --from=build /go/src/github.com/kubeflow/arena/arena-artifacts $ARENA_HOME/arena-artifacts
COPY --from=build /usr/local/bin/kubectl $ARENA_HOME/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts $ARENA_HOME/charts

RUN sed -i "s@^version: \(.*\)@version: $VERSION-$COMMIT@g" $ARENA_HOME/arena-artifacts/Chart.yaml && \
    sed -i "s@^appVersion: \(.*\)@appVersion: $VERSION-$COMMIT@g" $ARENA_HOME/arena-artifacts/Chart.yaml && \
    tar -zcvf $ARENA_TARFILE $ARENA_HOME
@ -1,40 +0,0 @@
#FROM golang:1.10-stretch as build
FROM golang:1.14-stretch as build

RUN mkdir -p /go/src/github.com/kubeflow/arena

WORKDIR /go/src/github.com/kubeflow/arena
COPY . .

RUN make

RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
    tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
    mv linux-amd64/helm /usr/local/bin/helm && \
    chmod u+x /usr/local/bin/helm

ENV K8S_VERSION v1.13.6
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl


FROM centos:7

COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena /usr/local/bin/arena
COPY --from=build /usr/local/bin/helm /usr/local/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts /root/kubernetes-artifacts
COPY --from=build /usr/local/bin/kubectl /usr/local/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts /charts

ADD run_arena.sh /usr/local/bin

RUN chmod u+x /usr/local/bin/run_arena.sh

RUN yum install bash-completion -y && \
    echo "source <(arena completion bash)" >> ~/.bashrc

ENTRYPOINT ["/usr/local/bin/run_arena.sh"]
@ -3,7 +3,7 @@ ARG BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3
 ARG USER=root

-FROM golang:1.14-stretch as build
+FROM golang:1.23.10 AS build

 RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -12,12 +12,12 @@ COPY . .
 RUN make

-RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
-    tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
+RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
+    tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
     mv linux-amd64/helm /usr/local/bin/helm && \
     chmod u+x /usr/local/bin/helm

-ENV K8S_VERSION v1.13.6
+ENV K8S_VERSION v1.28.4
 RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl

 FROM $BASE_IMAGE
@ -2,7 +2,7 @@ ARG BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-no
 ARG USER=jovyan

-FROM golang:1.14-stretch as build
+FROM golang:1.23.10 AS build

 RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -11,12 +11,12 @@ COPY . .
 RUN make

-RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
-    tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
+RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
+    tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
     mv linux-amd64/helm /usr/local/bin/helm && \
     chmod u+x /usr/local/bin/helm

-ENV K8S_VERSION v1.13.6
+ENV K8S_VERSION v1.28.4
 RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl

 FROM $BASE_IMAGE
317
Makefile
|
@@ -1,19 +1,64 @@
-PACKAGE=github.com/kubeflow/arena
-CURRENT_DIR=$(shell pwd)
-DIST_DIR=${CURRENT_DIR}/bin
-ARENA_CLI_NAME=arena
-JOB_MONITOR=jobmon
-ARENA_UNINSTALL=arena-uninstall
-OS_ARCH?=linux-amd64
-.SILENT:
-
-VERSION=$(shell cat ${CURRENT_DIR}/VERSION)
-BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
-GIT_COMMIT=$(shell git rev-parse HEAD)
-GIT_SHORT_COMMIT=$(shell git rev-parse --short HEAD)
-DOCKER_BUILD_DATE=$(shell date -u +'%Y%m%d%H%M%S')
-GIT_TAG=$(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
-GIT_TREE_STATE=$(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
-PACKR_CMD=$(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
-BUILDER_IMAGE=arena-builder
-BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-notebook-gpu:v0.4.0
+# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
+ifeq (,$(shell go env GOBIN))
+GOBIN=$(shell go env GOPATH)/bin
+else
+GOBIN=$(shell go env GOBIN)
+endif
+
+# Setting SHELL to bash allows bash commands to be executed by recipes.
+# Options are set to exit when a recipe line exits non-zero or a piped command fails.
+SHELL = /usr/bin/env bash -o pipefail
+.SHELLFLAGS = -ec
+
+PACKAGE ?= github.com/kubeflow/arena
+CURRENT_DIR ?= $(shell pwd)
+DIST_DIR ?= $(CURRENT_DIR)/bin
+ARENA_CLI_NAME ?= arena
+JOB_MONITOR ?= jobmon
+ARENA_UNINSTALL ?= arena-uninstall
+OS ?= $(shell go env GOOS)
+ARCH ?= $(shell go env GOARCH)
+
+VERSION ?= $(shell cat VERSION)
+BUILD_DATE := $(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
+GIT_COMMIT := $(shell git rev-parse HEAD)
+GIT_SHORT_COMMIT := $(shell git rev-parse --short HEAD)
+DOCKER_BUILD_DATE := $(shell date -u +'%Y%m%d%H%M%S')
+GIT_TAG := $(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
+GIT_TREE_STATE := $(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
+PACKR_CMD := $(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
+
+# Location to install binaries
+LOCALBIN ?= $(CURRENT_DIR)/bin
+# Location to put temp files
+TEMPDIR ?= $(CURRENT_DIR)/tmp
+# ARENA_ARTIFACTS
+ARENA_ARTIFACTS_CHART_PATH ?= $(CURRENT_DIR)/arena-artifacts
+
+# Versions
+GOLANG_VERSION=$(shell grep -e '^go ' go.mod | cut -d ' ' -f 2)
+KUBECTL_VERSION ?= v1.28.4
+HELM_VERSION ?= $(shell grep -e 'helm.sh/helm/v3 ' go.mod | cut -d ' ' -f 2)
+HELM_UNITTEST_VERSION ?= 0.5.1
+KIND_VERSION ?= v0.23.0
+KIND_K8S_VERSION ?= v1.29.3
+ENVTEST_VERSION ?= release-0.18
+ENVTEST_K8S_VERSION ?= 1.29.3
+GOLANGCI_LINT_VERSION ?= v2.1.6
+
+# Binaries
+ARENA ?= arena-v$(VERSION)-$(OS)-$(ARCH)
+KUBECTL ?= kubectl-$(KUBECTL_VERSION)-$(OS)-$(ARCH)
+HELM ?= helm-$(HELM_VERSION)-$(OS)-$(ARCH)
+KIND ?= $(LOCALBIN)/kind-$(KIND_VERSION)
+ENVTEST ?= $(LOCALBIN)/setup-envtest-$(ENVTEST_VERSION)
+GOLANGCI_LINT ?= golangci-lint-$(GOLANGCI_LINT_VERSION)
+
+# Tarballs
+ARENA_INSTALLER ?= arena-installer-$(VERSION)-$(OS)-$(ARCH)
+ARENA_INSTALLER_TARBALL ?= $(ARENA_INSTALLER).tar.gz
@@ -32,8 +77,12 @@ override LDFLAGS += \
 	-extldflags "-static"
 
 # docker image publishing options
-DOCKER_PUSH=false
-IMAGE_TAG=latest
+IMAGE_REGISTRY ?= docker.io
+IMAGE_REPOSITORY ?= kubeflow/arena
+IMAGE_TAG ?= $(VERSION)
+IMAGE ?= $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY):$(IMAGE_TAG)
+BASE_IMAGE ?= debian:12-slim
 
 ifneq (${GIT_TAG},)
 IMAGE_TAG=${GIT_TAG}
@@ -56,49 +105,117 @@ ifdef IMAGE_NAMESPACE
 IMAGE_PREFIX=${IMAGE_NAMESPACE}/
 endif
 
+##@ General
+
+# The help target prints out all targets with their descriptions organized
+# beneath their categories. The categories are represented by '##@' and the
+# target descriptions by '##'. The awk command is responsible for reading the
+# entire set of makefiles included in this invocation, looking for lines of the
+# file as xyz: ## something, and then pretty-format the target and help. Then,
+# if there's a line with ##@ something, that gets pretty-printed as a category.
+# More info on the usage of ANSI control characters for terminal formatting:
+# https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_parameters
+# More info on the awk command:
+# http://linuxcommand.org/lc3_adv_awk.php
+
+.PHONY: help
+help: ## Display this help.
+	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf "  \033[36m%-30s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
+
+.PHONY: all
+all: go-fmt go-vet go-lint unit-test e2e-test
+
+##@ Development
+
+go-fmt: ## Run go fmt against code.
+	@echo "Running go fmt..."
+	go fmt ./...
+
+go-vet: ## Run go vet against code.
+	@echo "Running go vet..."
+	go vet ./...
+
+.PHONY: go-lint
+go-lint: golangci-lint ## Run golangci-lint linter.
+	@echo "Running golangci-lint run..."
+	$(LOCALBIN)/$(GOLANGCI_LINT) run --timeout 5m ./...
+
+.PHONY: go-lint-fix
+go-lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes.
+	@echo "Running golangci-lint run --fix..."
+	$(LOCALBIN)/$(GOLANGCI_LINT) run --fix --timeout 5m ./...
+
+.PHONY: unit-test
+unit-test: ## Run go unit tests.
+	@echo "Running go test..."
+	go test $(shell go list ./... | grep -v /e2e) -coverprofile cover.out
+
+.PHONY: e2e-test
+e2e-test: envtest ## Run the e2e tests against a Kind k8s instance that is spun up.
+	@echo "Running e2e tests..."
+	go test ./test/e2e/ -v -ginkgo.v -timeout 30m
+
 # Build the project
 .PHONY: default
 default:
 ifeq ($(OS),Windows_NT)
-default: cli-windows
+default: arena-windows
 else
 UNAME_S := $(shell uname -s)
 ifeq ($(UNAME_S),Linux)
 $(info "Building on Linux")
-default: cli-linux-amd64
+default: arena-linux-amd64
 else ifeq ($(UNAME_S),Darwin)
 $(info "Building on Darwin")
-default: cli-darwin-amd64
+default: arena-darwin-amd64
 else
 $(error "The OS is not supported")
 endif
 endif
 
-.PHONY: cli-linux-amd64
-cli-linux-amd64:
-	mkdir -p bin
-	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} cmd/arena/*.go
-	CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=off go build -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${JOB_MONITOR} cmd/job-monitor/*.go
-
-.PHONY: cli-darwin-amd64
-cli-darwin-amd64:
-	mkdir -p bin
-	CGO_ENABLED=0 GOOS=darwin GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
-
-.PHONY: cli-darwin-arm64
-cli-darwin-arm64:
-	mkdir -p bin
-	CGO_ENABLED=0 GOOS=darwin GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
-
-.PHONY: cli-windows
-cli-windows:
-	mkdir -p bin
-	CGO_ENABLED=0 GOARCH=amd64 GOOS=windows GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
-
-.PHONY: install-image
-install-image:
-	docker build -t cheyang/arena:${VERSION}-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT} -f Dockerfile.install .
+##@ Build
+
+$(LOCALBIN):
+	mkdir -p $(LOCALBIN)
+
+$(TEMPDIR):
+	mkdir -p $(TEMPDIR)
+
+clean: ## Clean up all downloaded and generated files.
+	rm -rf $(LOCALBIN) $(TEMPDIR)
+
+.PHONY: arena
+arena: $(LOCALBIN) ## Build arena CLI for current platform.
+	@echo "Building arena CLI..."
+	CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build -tags netgo -ldflags '${LDFLAGS}' -o $(LOCALBIN)/$(ARENA) cmd/arena/main.go
+
+.PHONY: java-sdk
+java-sdk: ## Build Java SDK.
+	echo "Building arena Java SDK..."
+	mvn package -Dmaven.test.skip=true -Dgpg.skip -f sdk/arena-java-sdk
+
+.PHONY: docker-build
+docker-build: ## Build docker image.
+	docker build \
+		--build-arg BASE_IMAGE=$(BASE_IMAGE) \
+		--tag $(IMAGE) \
+		-f Dockerfile \
+		.
+
+.PHONY: docker-push
+docker-push: ## Push docker image.
+	docker push $(IMAGE)
+
+.PHONY: docker-buildx
+PLATFORMS ?= linux/amd64,linux/arm64
+docker-buildx: ## Build and push docker images for multiple platforms.
+	- $(CONTAINER_TOOL) buildx create --name arena-builder
+	$(CONTAINER_TOOL) buildx use arena-builder
+	- $(CONTAINER_TOOL) buildx build --push \
+		--platform=$(PLATFORMS) \
+		--build-arg BASE_IMAGE=$(BASE_IMAGE) \
+		--tag $(IMAGE) \
+		-f Dockerfile \
+		.
+	- $(CONTAINER_TOOL) buildx rm arena-builder
 
 .PHONY: notebook-image-kubeflow
 notebook-image-kubeflow:
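The awk-driven `make help` convention added above turns `target: ## description` lines into help entries and `##@ Section` lines into category headings. A sketch of that awk program in plain shell, with Make's `$$` unescaped to `$`, the ANSI color codes dropped, and the non-greedy `.*?` simplified to `.*`; the sample targets below are illustrative, not from the repository:

```shell
# Build a tiny annotated makefile and run the (simplified) help awk over it.
cat > demo.mk <<'EOF'
##@ Development

go-fmt: ## Run go fmt against code.
	go fmt ./...

unit-test: ## Run go unit tests.
	go test ./...
EOF

# `:.*##` as the field separator splits each annotated line into the target
# name ($1) and its description ($2); `##@` lines become section headings.
awk 'BEGIN {FS = ":.*##"} /^[a-zA-Z_0-9-]+:.*##/ { printf "  %-20s %s\n", $1, $2 } /^##@/ { printf "\n%s\n", substr($0, 5) }' demo.mk
```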
@@ -110,22 +227,106 @@ notebook-image:
 	docker build --build-arg "BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3" -t cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu -f Dockerfile.notebook.cpu .
 	docker tag cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu cheyang/arena-notebook:cpu
 
-# make OS_ARCH=darwin-amd64 build-pkg for mac
-.PHONY: build-pkg
-build-pkg:
-	docker rm -f arena-pkg || true
-	docker build --build-arg "KUBE_VERSION=v1.23.0" \
-		--build-arg "HELM_VERSION=v3.7.2" \
-		--build-arg "COMMIT=${GIT_SHORT_COMMIT}" \
-		--build-arg "VERSION=${VERSION}" \
-		--build-arg "OS_ARCH=${OS_ARCH}" \
-		--build-arg "GOLANG_VERSION=1.16" \
-		--build-arg "TARGET=cli-${OS_ARCH}" \
-		-t arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} -f Dockerfile.build .
-	docker run -itd --name=arena-pkg arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} /bin/bash
-	docker cp arena-pkg:/arena-installer-${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH}.tar.gz .
-	docker rm -f arena-pkg
-
 .PHONY: build-dependabot
 build-dependabot:
 	python3 hack/create_dependabot.py
 
+.PHONY: arena-installer
+arena-installer: $(ARENA_INSTALLER_TARBALL) ## Build arena installer tarball
+$(ARENA_INSTALLER_TARBALL): arena kubectl helm
+	echo "Building arena installer tarball..." && \
+	rm -rf $(TEMPDIR)/$(ARENA_INSTALLER) && \
+	mkdir -p $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
+	cp $(LOCALBIN)/$(ARENA) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena && \
+	cp $(LOCALBIN)/$(KUBECTL) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/kubectl && \
+	cp $(LOCALBIN)/$(HELM) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/helm && \
+	cp -R charts $(TEMPDIR)/$(ARENA_INSTALLER) && \
+	cp -R arena-artifacts $(TEMPDIR)/$(ARENA_INSTALLER) && \
+	cp arena-gen-kubeconfig.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
+	cp install.sh $(TEMPDIR)/$(ARENA_INSTALLER) && \
+	cp uninstall.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena-uninstall && \
+	tar -zcf $(ARENA_INSTALLER).tar.gz -C $(TEMPDIR) $(ARENA_INSTALLER) && \
+	echo "Successfully saved arena installer to $(ARENA_INSTALLER).tar.gz."
+
+##@ Helm
+
+.PHONY: helm-unittest
+helm-unittest: helm-unittest-plugin ## Run Helm chart unittests.
+	set -x && $(LOCALBIN)/$(HELM) unittest $(ARENA_ARTIFACTS_CHART_PATH) --strict --file "tests/**/*_test.yaml" --chart-tests-path $(CURRENT_DIR)
+
+##@ Dependencies
+
+.PHONY: golangci-lint
+golangci-lint: $(LOCALBIN)/$(GOLANGCI_LINT) ## Download golangci-lint locally if necessary.
+$(LOCALBIN)/$(GOLANGCI_LINT): $(LOCALBIN)
+	$(call go-install-tool,$(LOCALBIN)/$(GOLANGCI_LINT),github.com/golangci/golangci-lint/v2/cmd/golangci-lint,${GOLANGCI_LINT_VERSION})
+
+.PHONY: envtest
+envtest: $(ENVTEST) ## Download setup-envtest locally if necessary.
+$(ENVTEST): $(LOCALBIN)
+	$(call go-install-tool,$(ENVTEST),sigs.k8s.io/controller-runtime/tools/setup-envtest,$(ENVTEST_VERSION))
+
+.PHONY: kubectl
+kubectl: $(LOCALBIN)/$(KUBECTL)
+$(LOCALBIN)/$(KUBECTL): $(LOCALBIN) $(TEMPDIR)
+	$(eval KUBECTL_URL=https://dl.k8s.io/release/$(KUBECTL_VERSION)/bin/$(OS)/$(ARCH)/kubectl)
+	$(eval KUBECTL_SHA_URL=$(KUBECTL_URL).sha256)
+
+	cd $(TEMPDIR) && \
+	echo "Download $(KUBECTL) if not present..." && \
+	if [ ! -f $(KUBECTL) ]; then \
+		curl -sSLo $(KUBECTL) $(KUBECTL_URL); \
+	fi && \
+	echo "Download $(KUBECTL).sha256 if not present..." && \
+	if [ ! -f kubectl.sha256 ]; then \
+		curl -sSLo $(KUBECTL).sha256 $(KUBECTL_SHA_URL); \
+	fi && \
+	echo "Verifying checksum..." && \
+	echo -n "$$(cat $(KUBECTL).sha256) $(KUBECTL)" | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
+	echo "Make kubectl executable and move it to bin directory..." && \
+	chmod +x $(KUBECTL) && \
+	cp $(KUBECTL) $(LOCALBIN) && \
+	echo "Successfully installed kubectl to $(LOCALBIN)/$(KUBECTL)."
+
+.PHONY: helm
+helm: $(LOCALBIN)/$(HELM)
+$(LOCALBIN)/$(HELM): $(LOCALBIN) $(TEMPDIR)
+	$(eval HELM_URL=https://get.helm.sh/$(HELM).tar.gz)
+	$(eval HELM_SHA_URL=https://get.helm.sh/$(HELM).tar.gz.sha256sum)
+
+	cd $(TEMPDIR) && \
+	echo "Download $(HELM).tar.gz if not present..." && \
+	if [ ! -f $(HELM).tar.gz ]; then \
+		wget -qO $(HELM).tar.gz $(HELM_URL); \
+	fi && \
+	echo "Download $(HELM).tar.gz.sha256sum if not present..." && \
+	if [ ! -f $(HELM).tar.gz.sha256sum ]; then \
+		wget -qO $(HELM).tar.gz.sha256sum $(HELM_SHA_URL); \
+	fi && \
+	echo "Verifying checksum..." && \
+	cat $(HELM).tar.gz.sha256sum | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
+	echo "Extract helm tarball and move it to bin directory..." && \
+	tar -zxf $(HELM).tar.gz && \
+	cp ${OS}-${ARCH}/helm $(LOCALBIN)/$(HELM) && \
+	echo "Successfully installed helm to $(LOCALBIN)/$(HELM)."
+
+.PHONY: helm-unittest-plugin
+helm-unittest-plugin: helm ## Download helm unittest plugin locally if necessary.
+	if [ -z "$(shell $(LOCALBIN)/$(HELM) plugin list | grep unittest)" ]; then \
+		echo "Installing helm unittest plugin"; \
+		$(LOCALBIN)/$(HELM) plugin install https://github.com/helm-unittest/helm-unittest.git --version $(HELM_UNITTEST_VERSION); \
+	fi
+
+# go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
+# $1 - target path with name of binary (ideally with version)
+# $2 - package url which can be installed
+# $3 - specific version of package
+define go-install-tool
+@[ -f $(1) ] || { \
+set -e; \
+package=$(2)@$(3) ;\
+echo "Downloading $${package}" ;\
+GOBIN=$(LOCALBIN) go install $${package} ;\
+mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\
+}
+endef
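The `go-install-tool` helper added at the end of the Makefile installs `$(2)@$(3)` with `GOBIN` pointed at `LOCALBIN`, which leaves an unversioned binary behind; the final `mv` then renames it to the versioned target path by stripping the trailing `-$(3)` with sed. That stripping step in isolation (the paths and version here are just examples):

```shell
# Derive the unversioned path that `go install` wrote (bin/golangci-lint)
# from the versioned target path (bin/golangci-lint-v2.1.6) by stripping
# the trailing "-<version>" suffix, exactly as go-install-tool's mv does.
target=bin/golangci-lint-v2.1.6
version=v2.1.6
unversioned=$(echo "$target" | sed "s/-${version}\$//")
echo "$unversioned"   # prints bin/golangci-lint
```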
OWNERS (10 changed lines)
@@ -1,9 +1,11 @@
approvers:
  - cheyang
  - wsxiaozhang
  - denverdino
  - happy2048
  - Syulin7
  - xieydd
  - denkensk
  - gujingit
  - ChenYi015
reviewers:
  - GarnettWang
  - wsxiaozhang
  - xiaozhouX
  - osswangxining
README.md (24 changed lines)
@@ -1,8 +1,6 @@
 # Arena
 
-[](https://circleci.com/gh/kubeflow/arena)
-[](https://travis-ci.org/kubeflow/arena)
-[](https://goreportcard.com/report/github.com/kubeflow/arena)
+[](https://github.com/kubeflow/arena/releases) [](https://github.com/kubeflow/arena/actions/workflows/integration.yaml) [](https://goreportcard.com/report/github.com/kubeflow/arena)
 
 View the [Arena documentation](https://arena-docs.readthedocs.io/en/latest).
 
@@ -22,13 +20,11 @@ You can follow up the [Installation guide](https://arena-docs.readthedocs.io/en/
 
 ## User Guide
 
 Arena is a command-line interface to run and monitor the machine learning training jobs and check their results in an easy way. Please refer the [User Guide](https://arena-docs.readthedocs.io/en/latest/training) to manage your training jobs.
 
 ## Demo
 
 [](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
 
 ## Developing
 
@@ -36,7 +32,7 @@ Prerequisites:
 
 - Go >= 1.8
 
-```
+```shell
 mkdir -p $(go env GOPATH)/src/github.com/kubeflow
 cd $(go env GOPATH)/src/github.com/kubeflow
 git clone https://github.com/kubeflow/arena.git
 
@@ -50,7 +46,7 @@ Then you can follow [Installation guide for developer](https://arena-docs.readth
 
 ## CPU Profiling
 
-```
+```shell
 # set profile rate (HZ)
 export PROFILE_RATE=1000
 
@@ -61,20 +57,18 @@ INFO[0000] Dump cpu profile file into /tmp/cpu_profile
 
 Then you can analyze the profile by following [Go CPU profiling: pprof and speedscope](https://coder.today/go-profiling-pprof-and-speedscope-b05b812cc429)
 
 ## Adopters
 
-If you are intrested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuousely discuss new requirements and feature design with you in advance.
+If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuously discuss new requirements and feature design with you in advance.
 
 ## FAQ
 
-Please refer to [FAQ](https://arena-docs.readthedocs.io/en/latest/faq)
+Please refer to [FAQ](https://arena-docs.readthedocs.io/en/latest/faq).
 
 ## CLI Document
 
-Please refer to [arena.md](docs/cli/arena.md)
+Please refer to [arena.md](docs/cli/arena.md).
 
 ## RoadMap
 
-See [RoadMap](ROADMAP.md)
+See [RoadMap](ROADMAP.md).
README_cn.md (12 changed lines)
@@ -1,9 +1,6 @@
 # Arena
 
-[](https://circleci.com/gh/kubeflow/arena)
-[](https://travis-ci.org/kubeflow/arena)
-[](https://goreportcard.com/report/github.com/kubeflow/arena)
+[](https://github.com/kubeflow/arena/actions/workflows/integration.yaml)[](https://goreportcard.com/report/github.com/kubeflow/arena)
 
 ## 概述
 
@@ -13,7 +10,6 @@ Arena 是一个命令行工具,可供数据科学家轻而易举地运行和
 
 简而言之,Arena 的目标是让数据科学家感觉自己就像是在一台机器上工作,而实际上还可以享受到 GPU 集群的强大力量。
 
 ## 设置
 
 您可以按照 [安装指南](https://arena-docs.readthedocs.io/en/latest/installation) 执行操作
 
@@ -32,8 +28,7 @@ Arena 是一种命令行界面,支持轻而易举地运行和监控机器学
 
 ## 演示
 
 [](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
 
 ## 开发
 
@@ -41,7 +36,7 @@ Arena 是一种命令行界面,支持轻而易举地运行和监控机器学
 
 - Go >= 1.8
 
-```
+```shell
 mkdir -p $(go env GOPATH)/src/github.com/kubeflow
 cd $(go env GOPATH)/src/github.com/kubeflow
 git clone https://github.com/kubeflow/arena.git
 
@@ -58,4 +53,3 @@ make
 ## 路线图
 
 请参阅[路线图](ROADMAP.md)
ROADMAP.md (40 changed lines)
@@ -1,10 +1,40 @@
-# Arena 2019 Roadmap
+# Kubeflow Arena Roadmap
+
+## Kubeflow Arena 2024 Roadmap
 
 This document defines a high level roadmap for Arena development.
 
-### 2019
+* Objective: Simplify the user experience by deeply integrating with the Kubeflow Ecosystem.
+  * Kubeflow Integration
+    * Prepare Arena for release v1.0.0 alongside kubeflow v1.10.
+    * Develop a seamless integration with the Training Operator to help simplify model training using command line.
+    * Integrate with Kubeflow Pipelines to facilitate model training from a Pipeline.
+    * Enable model serving with KServe.
+  * Add documentation to Kubeflow website:
+    * Installation, uninstallation, and upgrade processes.
+    * Guide for tfjob, mpijob, pytorchJob.
 
-#### Core CUJs
+* Objective: Amplify the Extensibility of the Arena for Different ML frameworks, AIGC models and platforms.
+  * Support DeepSpeed Training Job.
+  * Support for submitting and managing llm fine-tuning jobs, like DeepSpeed.
+  * Enable model serving for an expanded set of models like Baichuan, LLaMA, ChatGLM, Falcon, and more.
+  * Extend platform support to include ARM.
+  * Integrate [Fluid project](https://github.com/fluid-cloudnative/fluid) for efficient data management.
+  * Add support for Ray Job management with [Kuberay](https://github.com/ray-project/kuberay).
+
+* Objective: Boost Performance and Stability.
+  * Regularly publish recommended practices documentation.
+  * Enhancements on Arena SDK.
+  * Enhance code quality by:
+    * Reduce repetitive code.
+    * Enhance unit test.
+  * Implement automated End-to-End Test:
+    * Add integration tests using GitHub Actions.
+    * Strive for more than 60% Test Coverage of Supported Features.
+
+## Kubeflow Arena 2019 Roadmap
+
+### Core CUJs
 
 Objectives: "Make Arena easily to be integrated with External System."
 
@@ -19,13 +49,13 @@ Objectives: "Simplify the user experience of the data scientists and provide a l
 * Submit and manage Model Serving with [KF Serving](https://github.com/kubeflow/kfserving)
 
-Objectives: "Make Arena support the same Operator compatiable with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
+Objectives: "Make Arena support the same Operator compatible with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
 
 * Compatibility:
   * v1aphla2 and v1 TFJob
   * v1alpha1 and v1aphla2 MPIJob
 
-Objectives: "Enchance the software quality of Arena so it can be in the quick iteration"
+Objectives: "Enhance the software quality of Arena so it can be in the quick iteration"
 
 * Refactor the source code
   * Move Training implementation from `cmd` into `pkg`
@@ -1,16 +0,0 @@
-# Adopters Of Arena
-
-Below are the adopters of project Arena. If you are using Arena to improve efficiency and productivity in Machine Learning with Kubernetes, please feel free to add yourself into the following list by a pull request. There're several phases as follow:
-
-* **Evaluation:** Known Arena, that's interesting; evaluating the features/scopes of Arena
-* **Testing:** Take Arena as one of candidates, testing Kubernetes cluster with Arena
-* **Staging:** Decide to use Arena, testing it in pre-product environment
-* **Production:** Already put Arena into product environment
-
-| Organization | Contact | Phases | Description of Use |
-| ------------ | ------- | ----------- | ------------------ |
-| [Weibo](https://www.weibo.com) | [@phoenixwu0229](https://github.com/phoenixwu0229) | **Production** | Weibo ML Platform |
-| [HUYA](https://www.huya.com) | [@BobLiu20](https://github.com/bobliu20) | **Production** | HUYA AI Platform |
-| [Microsoft](https://www.microsoft.com) | [@chaowangnk1](https://github.com/chaowangnk1) | **Testing** | AzureML DataCache internal benchmark system |
-| [Unisound](https://www.unisound.com) | [@xieydd](https://github.com/xieydd) | **Production** | Unisound ATLAS AI Platform |
-| [DOUYU](https://www.douyu.com) | [@gongcan1219](https://github.com/gongcan1219) | **Production** | DOUYU AI Platform |
@@ -1,40 +0,0 @@
-## arena
-
-arena is the command line interface to Arena
-
-### Synopsis
-
-arena is the command line interface to Arena
-
-```
-arena [flags]
-```
-
-### Options
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-  -h, --help                     help for arena
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena completion](arena_completion.md)	 - output shell completion code for the specified shell (bash or zsh)
-* [arena data](arena_data.md)	 - manage data.
-* [arena delete](arena_delete.md)	 - delete a training job and its associated pods
-* [arena get](arena_get.md)	 - display details of a training job
-* [arena list](arena_list.md)	 - list all the training jobs
-* [arena logs](arena_logs.md)	 - print the logs for a task of the training job
-* [arena logviewer](arena_logviewer.md)	 - display Log Viewer URL of a training job
-* [arena prune](arena_prune.md)	 - prune history job
-* [arena serve](arena_serve.md)	 - Serve a job.
-* [arena submit](arena_submit.md)	 - Submit a job.
-* [arena top](arena_top.md)	 - Display Resource (GPU) usage.
-* [arena version](arena_version.md)	 - Print version information
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@@ -1,43 +0,0 @@
-## arena completion
-
-output shell completion code for the specified shell (bash or zsh)
-
-### Synopsis
-
-Write bash or zsh shell completion code to standard output.
-
-For bash, ensure you have bash completions installed and enabled.
-To access completions in your current shell, run
-$ source <(arena completion bash)
-Alternatively, write it to a file and source in .bash_profile
-
-For zsh, output to a file in a directory referenced by the $fpath shell
-variable.
-
-
-```
-arena completion SHELL [flags]
-```
-
-### Options
-
-```
-  -h, --help   help for completion
-```
-
-### Options inherited from parent commands
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena](arena.md)	 - arena is the command line interface to Arena
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@@ -1,39 +0,0 @@
-## arena data
-
-manage data.
-
-### Synopsis
-
-manage data volumes.
-
-Available Commands:
-  list,ls        List the data volumes.
-
-```
-arena data [flags]
-```
-
-### Options
-
-```
-  -h, --help   help for data
-```
-
-### Options inherited from parent commands
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena](arena.md)	 - arena is the command line interface to Arena
-* [arena data list](arena_data_list.md)	 - list all the data volume.
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@@ -1,35 +0,0 @@
-## arena data list
-
-list all the data volume.
-
-### Synopsis
-
-list all the data volume.
-
-```
-arena data list [flags]
-```
-
-### Options
-
-```
-      --allNamespaces   show all the namespaces
-  -h, --help            help for list
-```
-
-### Options inherited from parent commands
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena data](arena_data.md)	 - manage data.
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@@ -1,35 +0,0 @@
-## arena delete
-
-delete a training job and its associated pods
-
-### Synopsis
-
-delete a training job and its associated pods
-
-```
-arena delete a training job [flags]
-```
-
-### Options
-
-```
-  -h, --help          help for delete
-      --type string   The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
-```
-
-### Options inherited from parent commands
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena](arena.md)	 - arena is the command line interface to Arena
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@@ -1,37 +0,0 @@
-## arena get
-
-display details of a training job
-
-### Synopsis
-
-display details of a training job
-
-```
-arena get training job [flags]
-```
-
-### Options
-
-```
-  -e, --events          Specify if show pending pod's events.
-  -h, --help            help for get
-  -o, --output string   Output format. One of: json|yaml|wide
-      --type string     The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
-```
-
-### Options inherited from parent commands
-
-```
-      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
-      --config string            Path to a kube config. Only required if out-of-cluster
-      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
-  -n, --namespace string         the namespace of the job (default "default")
-      --pprof                    enable cpu profile
-      --trace                    enable trace
-```
-
-### SEE ALSO
-
-* [arena](arena.md)	 - arena is the command line interface to Arena
-
-###### Auto generated by spf13/cobra on 24-Apr-2019
@ -1,35 +0,0 @@
|
|||
## arena list

List all the training jobs.

### Synopsis

List all the training jobs.

```
arena list [flags]
```

### Options

```
      --allNamespaces   show all the namespaces
  -h, --help            help for list
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena logs

Print the logs for a task of the training job.

### Synopsis

Print the logs for a task of the training job.

```
arena logs training job [flags]
```

### Options

```
  -f, --follow              Specify if the logs should be streamed.
  -h, --help                help for logs
  -i, --instance string     Specify the task instance to get logs from
      --since string        Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used.
      --since-time string   Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used.
      --tail int            Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines; otherwise 10, if a selector is provided. (default -1)
      --timestamps          Include timestamps on each line in the log output
      --type string         The training type to show logs for, possible options are tfjob, mpijob, horovodjob or standalonejob. (optional)
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena logviewer

Display the Log Viewer URL of a training job.

### Synopsis

Display the Log Viewer URL of a training job.

```
arena logviewer job [flags]
```

### Options

```
  -h, --help   help for logviewer
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena prune

Prune history jobs.

### Synopsis

Prune history jobs.

```
arena prune history job [flags]
```

### Options

```
  -h, --help             help for prune
  -s, --since duration   Clean jobs that have lived longer than a relative duration like 5s, 2m, or 3h. (default -1ns)
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve

Serve a job.

### Synopsis

Serve a job.

Available Commands:

```
  tensorflow,tf   Submit a TensorFlow Serving Job.
  tensorrt,trt    Submit a TensorRT Job.
```

```
arena serve [flags]
```

### Options

```
  -h, --help   help for serve
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena
* [arena serve delete](arena_serve_delete.md) - delete a serving job and its associated pods
* [arena serve list](arena_serve_list.md) - list all the serving jobs
* [arena serve tensorflow](arena_serve_tensorflow.md) - Submit tensorflow serving job to deploy and serve machine learning models.
* [arena serve tensorrt](arena_serve_tensorrt.md) - Submit tensorRT inference serving job to deploy and serve machine learning models.
* [arena serve traffic-split](arena_serve_traffic-split.md) - Adjust traffic routing dynamically for tfserving jobs

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve delete

Delete a serving job and its associated pods.

### Synopsis

Delete a serving job and its associated pods.

```
arena serve delete a serving job [flags]
```

### Options

```
  -h, --help   help for delete
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve list

List all the serving jobs.

### Synopsis

List all the serving jobs.

```
arena serve list [flags]
```

### Options

```
  -h, --help   help for list
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve tensorflow

Submit a tensorflow serving job to deploy and serve machine learning models.

### Synopsis

Submit a tensorflow serving job to deploy and serve machine learning models.

```
arena serve tensorflow [flags]
```

### Options

```
      --command string           the command to inject into the container's command.
      --cpu string               the requested cpu of each replica of the serving job.
  -d, --data stringArray         specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
      --enableIstio              enable Istio for serving or not (Istio disabled by default)
  -e, --envs stringArray         the environment variables
      --exposeService            expose the service through an Istio gateway for external access or not (not exposed by default)
      --gpumemory int            the GPU memory limit of each replica of the serving job.
      --gpus int                 the GPU count limit of each replica of the serving job.
  -h, --help                     help for tensorflow
      --image string             the docker image name of the serving job (default "tensorflow/serving:latest")
      --imagePullPolicy string   the policy to pull the image (default "IfNotPresent")
      --memory string            the requested memory of each replica of the serving job.
      --modelConfigFile string   corresponds to --model_config_file in tensorflow serving
      --modelName string         the model name for serving
      --modelPath string         the model path for serving in the container
      --port int                 the tensorflow gRPC listening port (default 8500)
      --replicas int             the replica count of the serving job. (default 1)
      --restfulPort int          the tensorflow RESTful listening port (default 8501)
      --servingName string       the serving name
      --servingVersion string    the serving version
      --versionPolicy string     supports latest, latest:N, specific:N, all
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve tensorrt

Submit a tensorRT inference serving job to deploy and serve machine learning models.

### Synopsis

Submit a tensorRT inference serving job to deploy and serve machine learning models.

```
arena serve tensorrt [flags]
```

### Options

```
      --allowMetrics             enable metrics
      --command string           the command to inject into the container's command.
      --cpu string               the requested cpu of each replica of the serving job.
  -d, --data stringArray         specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
      --enableIstio              enable Istio for serving or not (Istio disabled by default)
  -e, --envs stringArray         the environment variables
      --exposeService            expose the service through an Istio gateway for external access or not (not exposed by default)
      --gpumemory int            the GPU memory limit of each replica of the serving job.
      --gpus int                 the GPU count limit of each replica of the serving job.
      --grpcPort int             the port of the grpc serving server (default 8001)
  -h, --help                     help for tensorrt
      --httpPort int             the port of the http serving server (default 8000)
      --image string             the docker image name of the serving job (default "registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3")
      --imagePullPolicy string   the policy to pull the image (default "IfNotPresent")
      --memory string            the requested memory of each replica of the serving job.
      --metricPort int           the port of the metrics server (default 8002)
      --modelName string         the model name for serving
      --modelPath string         the model path for serving in the container
      --modelStore string        the path of the tensorRT model store
      --replicas int             the replica count of the serving job. (default 1)
      --servingName string       the serving name
      --servingVersion string    the serving version
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena serve traffic-router-split

Adjust traffic routing dynamically for tfserving jobs.

### Synopsis

Adjust traffic routing dynamically for tfserving jobs.

```
arena serve traffic-router-split [flags]
```

### Options

```
  -h, --help                 help for traffic-router-split
      --servingName string   the serving name
      --versions string      Model versions which the traffic will be routed to, e.g. [1,2,3] (default "[]")
      --weights string       Weight percentage values for each model version which the traffic will be routed to, e.g. [70,20,10] (default "[]")
```

### Options inherited from parent commands

```
      --arenaNamespace string   The namespace of arena system service, like TFJob (default "arena-system")
      --config string           Path to a kube config. Only required if out-of-cluster
      --loglevel string         Set the logging level. One of: debug|info|warn|error (default "info")
      --namespace string        the namespace of the job (default "default")
      --pprof                   enable cpu profile
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 7-Sep-2018
## arena serve traffic-split

Adjust traffic routing dynamically for tfserving jobs.

### Synopsis

Adjust traffic routing dynamically for tfserving jobs.

```
arena serve traffic-split [flags]
```

### Options

```
  -h, --help                     help for traffic-split
      --servingName string       the serving name
      --servingVersions string   Model versions which the traffic will be routed to, e.g. 1,2,3
      --weights string           Weight percentage values for each model version which the traffic will be routed to, e.g. 70,20,10
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena serve](arena_serve.md) - Serve a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena submit

Submit a job.

### Synopsis

Submit a job.

Available Commands:

```
  tfjob,tf             Submit a TFJob.
  horovod,hj           Submit a Horovod Job.
  mpijob,mpi           Submit a MPIJob.
  standalonejob,sj     Submit a standalone Job.
  tfserving,tfserving  Submit a Serving Job.
  sparkjob,spark       Submit a Spark Job.
```

```
arena submit [flags]
```

### Options

```
  -h, --help   help for submit
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena
* [arena submit horovodjob](arena_submit_horovodjob.md) - Submit horovodjob as training job.
* [arena submit mpijob](arena_submit_mpijob.md) - Submit MPIjob as training job.
* [arena submit standalonejob](arena_submit_standalonejob.md) - Submit StandaloneJob as training job. It will be deprecated soon, please use tfjob instead.
* [arena submit tfjob](arena_submit_tfjob.md) - Submit TFJob as training job.
* [arena submit sparkjob](arena_submit_sparkjob.md) - Submit SparkJob as training job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena submit horovodjob

Submit horovodjob as training job.

### Synopsis

Submit horovodjob as training job.

```
arena submit horovodjob [flags]
```

### Options

```
  -a, --annotation stringArray   the annotations
      --cpu string               the cpu resource to use for the training, like 1 for 1 core.
  -d, --data stringArray         specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
      --data-dir stringArray     the data dir. If you specify /data, it means mounting hostpath /data into container path /data
  -e, --env stringArray          the environment variables
      --gpus int                 the GPU count of each worker to run the training.
  -h, --help                     help for horovodjob
      --image string             the docker image name of training job
      --memory string            the memory resource to use for the training, like 1Gi.
      --name string              override name
      --rdma                     enable RDMA
      --retry int                retry times.
      --sshPort int              ssh port.
      --sync-image string        the docker image of syncImage
      --sync-mode string         syncMode: support rsync, hdfs, git
      --sync-source string       sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
      --workers int              the worker number to run the distributed training. (default 1)
      --working-dir string       working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena submit](arena_submit.md) - Submit a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena submit mpijob

Submit MPIjob as training job.

### Synopsis

Submit MPIjob as training job.

```
arena submit mpijob [flags]
```

### Options

```
  -a, --annotation stringArray     the annotations
      --cpu string                 the cpu resource to use for the training, like 1 for 1 core.
  -d, --data stringArray           specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
      --data-dir stringArray       the data dir. If you specify /data, it means mounting hostpath /data into container path /data
  -e, --env stringArray            the environment variables
      --gpus int                   the GPU count of each worker to run the training.
  -h, --help                       help for mpijob
      --image string               the docker image name of training job
      --logdir string              the training logs dir, default is /training_logs (default "/training_logs")
      --memory string              the memory resource to use for the training, like 1Gi.
      --name string                override name
      --rdma                       enable RDMA
      --retry int                  retry times.
      --sync-image string          the docker image of syncImage
      --sync-mode string           syncMode: support rsync, hdfs, git
      --sync-source string         sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
      --tensorboard                enable tensorboard
      --tensorboard-image string   the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
      --workers int                the worker number to run the distributed training. (default 1)
      --working-dir string         working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena submit](arena_submit.md) - Submit a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena submit sparkjob

Submit SparkJob as training job.

### Synopsis

Submit SparkJob as training job.

```
arena submit sparkjob [flags]
```

### Options

```
      --image string        the docker image name of training job
      --jar string          jar path in image
      --main-class string   main class of your jar
      --name string         override name
      --workers int         the worker number to run the distributed training. (default 1)
```

### Options inherited from parent commands

```
      --arenaNamespace string   The namespace of arena system service, like TFJob (default "arena-system")
      --config string           Path to a kube config. Only required if out-of-cluster
      --loglevel string         Set the logging level. One of: debug|info|warn|error (default "info")
      --namespace string        the namespace of the job (default "default")
      --pprof                   enable cpu profile
      --trace                   enable trace
```

### SEE ALSO

* [arena submit](arena_submit.md) - Submit a job.
## arena submit standalonejob (deprecated)

**Warning: standalonejob has been deprecated, please use [tfjob](../userguide/1-tfjob-standalone.md) instead.**

Submit StandaloneJob as training job. It will be deprecated soon, please use tfjob instead.

### Synopsis

Submit StandaloneJob as training job. It will be deprecated soon, please use tfjob instead.

```
arena submit standalonejob [flags]
```

### Options

```
  -a, --annotation stringArray   the annotations
      --cpu string               the cpu resource to use for the training, like 1 for 1 core.
  -d, --data stringArray         specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
      --data-dir stringArray     the data dir. If you specify /data, it means mounting hostpath /data into container path /data
  -e, --env stringArray          the environment variables
      --gpus int                 the GPU count of each worker to run the training.
  -h, --help                     help for standalonejob
      --image string             the docker image name of training job
      --memory string            the memory resource to use for the training, like 1Gi.
      --name string              override name
      --rdma                     enable RDMA
      --retry int                retry times.
      --sync-image string        the docker image of syncImage
      --sync-mode string         syncMode: support rsync, hdfs, git
      --sync-source string       sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
      --workers int              the worker number to run the distributed training. (default 1)
      --working-dir string       working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena submit](arena_submit.md) - Submit a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena submit tfjob

Submit TFJob as training job.

### Synopsis

Submit TFJob as training job.

```
arena submit tfjob [flags]
```

### Options

```
  -a, --annotation stringArray     the annotations
      --chief                      enable chief, which is required for estimator.
      --chief-cpu string           the cpu resource to use for the Chief, like 1 for 1 core.
      --chief-memory string        the memory resource to use for the Chief, like 1Gi.
      --chief-port int             the port of the chief.
      --clean-task-policy string   How to clean tasks after Training is done, only supports Running, None. (default "Running")
  -d, --data stringArray           specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
      --data-dir stringArray       the data dir. If you specify /data, it means mounting hostpath /data into container path /data
  -e, --env stringArray            the environment variables
      --evaluator                  enable evaluator, which is optional for estimator.
      --evaluator-cpu string       the cpu resource to use for the evaluator, like 1 for 1 core.
      --evaluator-memory string    the memory resource to use for the evaluator, like 1Gi.
      --gpus int                   the GPU count of each worker to run the training.
  -h, --help                       help for tfjob
      --image string               the docker image name of training job
      --logdir string              the training logs dir, default is /training_logs (default "/training_logs")
      --name string                override name
      --ps int                     the number of the parameter servers.
      --ps-cpu string              the cpu resource to use for the parameter servers, like 1 for 1 core.
      --ps-image string            the docker image for the parameter servers
      --ps-memory string           the memory resource to use for the parameter servers, like 1Gi.
      --ps-port int                the port of the parameter server.
      --rdma                       enable RDMA
      --retry int                  retry times.
      --sync-image string          the docker image of syncImage
      --sync-mode string           syncMode: support rsync, hdfs, git
      --sync-source string         sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
      --tensorboard                enable tensorboard
      --tensorboard-image string   the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
      --worker-cpu string          the cpu resource to use for the worker, like 1 for 1 core.
      --worker-image string        the docker image for tensorflow workers
      --worker-memory string       the memory resource to use for the worker, like 1Gi.
      --worker-port int            the port of the worker.
      --workers int                the worker number to run the distributed training. (default 1)
      --working-dir string         working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena submit](arena_submit.md) - Submit a job.

###### Auto generated by spf13/cobra on 24-Apr-2019
## arena top
|
||||
|
||||
Display Resource (GPU) usage.
|
||||
|
||||
### Synopsis
|
||||
|
||||
Display Resource (GPU) usage.
|
||||
|
||||
Available Commands:
|
||||
node Display Resource (GPU) usage of nodes
|
||||
job Display Resource (GPU) usage of pods
|
||||
|
||||
|
||||
```
|
||||
arena top [flags]
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
```
|
||||
-h, --help help for top
|
||||
```
|
||||
|
||||
### Options inherited from parent commands
|
||||
|
||||
```
|
||||
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
|
||||
--config string Path to a kube config. Only required if out-of-cluster
|
||||
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
|
||||
-n, --namespace string the namespace of the job (default "default")
|
||||
--pprof enable cpu profile
|
||||
--trace enable trace
|
||||
```
|
||||
|
||||
### SEE ALSO
|
||||
|
||||
* [arena](arena.md) - arena is the command line interface to Arena
|
||||
* [arena top job](arena_top_job.md) - Display Resource (GPU) usage of jobs.
|
||||
* [arena top node](arena_top_node.md) - Display Resource (GPU) usage of nodes.
|
||||
|
||||
###### Auto generated by spf13/cobra on 24-Apr-2019
|
|
@ -1,37 +0,0 @@
## arena top job

Display Resource (GPU) usage of jobs.

### Synopsis

Display Resource (GPU) usage of jobs.

```
arena top job [flags]
```

### Options

```
      --allNamespaces     show all the namespaces
  -h, --help              help for job
  -i, --instance string   Display instance top info
  -r, --refresh           Display continuously
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena top](arena_top.md) - Display Resource (GPU) usage.

###### Auto generated by spf13/cobra on 24-Apr-2019
@ -1,35 +0,0 @@
## arena top node

Display Resource (GPU) usage of nodes.

### Synopsis

Display Resource (GPU) usage of nodes.

```
arena top node [flags]
```

### Options

```
  -d, --details   Display details
  -h, --help      help for node
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena top](arena_top.md) - Display Resource (GPU) usage.

###### Auto generated by spf13/cobra on 24-Apr-2019
@ -1,35 +0,0 @@
## arena version

Print version information

### Synopsis

Print version information

```
arena version [flags]
```

### Options

```
  -h, --help    help for version
      --short   print just the version number
```

### Options inherited from parent commands

```
      --arena-namespace string   The namespace of arena system service, like tf-operator (default "arena-system")
      --config string            Path to a kube config. Only required if out-of-cluster
      --loglevel string          Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable cpu profile
      --trace                    enable trace
```

### SEE ALSO

* [arena](arena.md) - arena is the command line interface to Arena

###### Auto generated by spf13/cobra on 24-Apr-2019
@ -1,50 +0,0 @@
## The TFJob plugin framework

If you'd like to customize or enhance the TFJob with your own chart or code, this plugin framework lets you do so.

## Developer Workflow

### Step 1: Implement the following interface (optional)

```
// Customized runtime for tf training
type tfRuntime interface {
	// check the tfjob args
	check(tf *submitTFJobArgs) (err error)
	// transform the tfjob
	transform(tf *submitTFJobArgs) (err error)

	getChartName() string
}
```

You can refer to the implementation of the default tf runtime in [training_plugin_interface.go](../../cmd/arena/commands/training_plugin_interface.go).
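As a minimal sketch, a custom runtime backing a hypothetical `mock` chart could implement the interface like this. The `submitTFJobArgs` stub and its `WorkerCount` field are illustrative stand-ins here, not arena's real types:

```go
package main

import "fmt"

// Stub of arena's submitTFJobArgs, for illustration only.
type submitTFJobArgs struct {
	WorkerCount int
}

// mockRuntime is a hypothetical tfRuntime implementation
// backing the "mock" chart.
type mockRuntime struct{}

// check validates the job args before submission.
func (m *mockRuntime) check(tf *submitTFJobArgs) (err error) {
	if tf.WorkerCount < 1 {
		return fmt.Errorf("mock runtime requires at least one worker")
	}
	return nil
}

// transform mutates the job args before the chart is rendered
// (a no-op in this sketch).
func (m *mockRuntime) transform(tf *submitTFJobArgs) (err error) {
	return nil
}

// getChartName names the chart directory under /charts to render.
func (m *mockRuntime) getChartName() string {
	return "mock"
}

func main() {
	r := &mockRuntime{}
	args := &submitTFJobArgs{WorkerCount: 1}
	if err := r.check(args); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println(r.getChartName())
}
```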
### Step 2. Create your own chart

If you don't need custom code for `check` or `transform`, you can simply create your chart in the same directory as tfjob and mpijob. For example, to create a chart named `mock`:

```
cd /charts
cp -r tfjob mock
```

## User Workflow

Just run the submit command, specifying the annotation `runtime={your runtime}`:

```
arena submit tf \
    --name=test \
    --annotation="runtime=mock" \
    --workers=1 \
    --chief \
    --chief-cpu=4 \
    --evaluator \
    --evaluator-cpu=4 \
    --worker-cpu=2 \
    "python test.py"
```

@ -1,118 +0,0 @@
## Setup

This documentation assumes you have a Kubernetes cluster already available.

If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).

If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).

Arena doesn't have to run inside the Kubernetes cluster. It can also run on your laptop: if you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.

### Requirements

* Linux OS
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)

### Steps

1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`

2\. Download the latest installer from the [Release Page](https://github.com/kubeflow/arena/releases), and rename it to `arena-installer.tar.gz`

3\. Untar the installer package

```
# tar -xvf arena-installer.tar.gz
```

4\. Set up environment variables for customization

4.1\. If you'd like to train and serve in hostNetwork

```
export USE_HOSTNETWORK=true
```

4.2\. If you'd like to customize the Kubernetes namespace of the arena infrastructure

```
export NAMESPACE={your namespace}
```

4.3\. If you'd like to use your private docker registry instead of `ACR (Alibaba Cloud Container Registry)`:

```
export DOCKER_REGISTRY={your docker registry}
```

4.4\. If you'd like to deploy prometheus in `ACK (Alibaba Container Service for Kubernetes)`

```
export USE_PROMETHEUS=true
export PLATFORM=ack
```

4.5\. If you'd like to use a cloud loadbalancer

```
export USE_LOADBALANCER=true
```

5\. Install arena

```
# cd arena-installer
# sudo ./install.sh
```

6\. Enable shell autocompletion

On Linux, please use bash.

On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.

```
yum install bash-completion -y
```

On Debian or Ubuntu Linux you may need to install it with

```
apt-get install bash-completion
```

To add arena autocompletion to your current shell, run `source <(arena completion bash)`.

On MacOS, please use bash.

You can install bash-completion with Homebrew:

```
brew install bash-completion@2
```

To add arena autocompletion to your profile, so it is automatically loaded in future shells, run:

```
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
```

For MacOS, add the following to your `~/.bashrc` file:

```
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
```

Then you can use [tab] to auto complete the command

```
# arena list
NAME            STATUS   TRAINER  AGE  NODE
tf1             PENDING  TFJOB    0s   N/A
caffe-1080ti-1  RUNNING  HOROVOD  45s  192.168.1.120
# arena get [tab]
caffe-1080ti-1  tf1
```
@ -1,157 +0,0 @@
## Setup

This documentation assumes you have a Kubernetes cluster already available.

If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).

If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).

Arena doesn't have to run inside the Kubernetes cluster. It can also run on your laptop: if you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.

### Requirements

* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)

### Steps

1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`

2\. Install the kubectl client

Please follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

3\. Install the Helm client

- Download the Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack it (tar -zxvf helm-v2.14.1-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)

Then run `arena-helm list` to check whether helm can manage the kubernetes cluster successfully.

```
# arena-helm list
# echo $?
0
```

4\. Download the charts

```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```

5\. Install the TFJob Controller

```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```

6\. Install the Dashboard

```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```

7\. Install the MPIJob Controller

```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```

8\. Build arena

Prerequisites:

- Go >= 1.8

```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```

The `arena` binary is located in the directory `arena/bin`. You may want to add the directory to `$PATH`.

9\. Install and configure kube-arbitrator for gang scheduling (optional)

```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```

10\. Enable shell autocompletion

On Linux, please use bash.

On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.

```
yum install bash-completion -y
```

To add arena autocompletion to your current shell, run `source <(arena completion bash)`.

To add arena autocompletion to your profile, so it is automatically loaded in future shells, run:

```
echo "source <(arena completion bash)" >> ~/.bashrc
```

Then you can use [tab] to auto complete the command

```
# arena list
NAME            STATUS   TRAINER  AGE  NODE
tf1             PENDING  TFJOB    0s   N/A
caffe-1080ti-1  RUNNING  HOROVOD  45s  192.168.1.120
# arena get [tab]
caffe-1080ti-1  tf1
```

11\. Enable host network for training (optional)

Training does not use `useHostNetwork` by default. If you'd like to run training in the host network, you can run the command below:

```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```
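To see what this toggle does, here is a quick sandbox run of the same `sed` expression against a throwaway `values.yaml` (the scratch path and file contents are illustrative, not arena's actual chart values):

```shell
# Set up a scratch chart directory with a sample values.yaml
mkdir -p /tmp/charts-demo/tfjob
cat > /tmp/charts-demo/tfjob/values.yaml <<'EOF'
image: tensorflow/tensorflow:1.5.0-devel-gpu
useHostNetwork: false
EOF

# Same toggle as above, scoped to the scratch directory:
# on lines matching /useHostNetwork/, replace false with true
find /tmp/charts-demo/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"

grep useHostNetwork /tmp/charts-demo/tfjob/values.yaml
# useHostNetwork: true
```

The address `/useHostNetwork/` restricts the substitution to that one key, so other `false` values elsewhere in the chart are left untouched.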
12\. Enable Loadbalancer in the public cloud (optional)

Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, and `LoadBalancer` is supported by their cloud providers. If you want to access tensorboard on the internet directly, you can run the command below:

```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```

> Warning: it's not encouraged to expose the service to the internet, because the service can easily be attacked by hackers.

13\. Enable Ingress in the public cloud (optional)

If you have an ingress controller configured, you are able to access tensorboard through ingress. You can run the command below:

```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```

> Warning: it's not encouraged to expose the service to the internet, because the service can easily be attacked by hackers.

14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)

```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```

> Warning: this may cause the docker images to be out of date if they are already downloaded on the node.
@ -1,154 +0,0 @@
## Setup

This documentation assumes you have a Kubernetes cluster already available.

If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).

If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).

Arena doesn't have to run inside the Kubernetes cluster. It can also run on your laptop: if you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.

### Requirements

* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)

### Steps

1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`

2\. Install the kubectl client

Please follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

3\. Install the Helm client

- Download the Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack it (tar -zxvf helm-v2.8.2-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)

Then run `helm list` to check whether helm can manage the kubernetes cluster successfully.

```
# helm list
# echo $?
0
```

4\. Download the charts

```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```

5\. Install the TFJob Controller

```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```

6\. Install the Dashboard (optional)

```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```

7\. Install the MPIJob Controller

```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```

8\. Install arena

Prerequisites:

- Go >= 1.8

```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```

The `arena` binary is located in the directory `arena/bin`. You may want to add the directory to `$PATH`.

9\. Install and configure kube-arbitrator for gang scheduling (optional)

```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```

10\. Enable shell autocompletion

On Linux, please use bash.

On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.

```
yum install bash-completion -y
```

To add arena autocompletion to your current shell, run `source <(arena completion bash)`.

To add arena autocompletion to your profile, so it is automatically loaded in future shells, run:

```
echo "source <(arena completion bash)" >> ~/.bashrc
```

Then you can use [tab] to auto complete the command

```
# arena list
NAME            STATUS   TRAINER  AGE  NODE
tf1             PENDING  TFJOB    0s   N/A
caffe-1080ti-1  RUNNING  HOROVOD  45s  192.168.1.120
# arena get [tab]
caffe-1080ti-1  tf1
```

11\. Enable host network for training (optional)

Training does not use `useHostNetwork` by default. If you'd like to run training in the host network, you can run the command below:

```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```

12\. Enable Loadbalancer in the public cloud

Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, and `LoadBalancer` is supported by their cloud providers. If you want to access tensorboard on the internet directly, you can run the command below:

```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```

> Warning: it's not encouraged to expose the service to the internet, because the service can easily be attacked by hackers.

13\. Enable Ingress in the public cloud

Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, and `Ingress` is supported by their cloud providers. If you want to access tensorboard directly on the internet through a unified entry point, you can run the command below:

```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```

> Warning: it's not encouraged to expose the service to the internet, because the service can easily be attacked by hackers.

14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)

```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```

> Warning: this may cause the container images to be out of date if they are already downloaded on the node.
@ -1,138 +0,0 @@
Here is an example of how you can use `Arena` for machine learning training. It will download the source code from a git url.

1\. The first step is to check the available resources

```
arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```

There are 3 available nodes with GPUs for running training jobs.

2\. Now we can submit a training job with `arena`; it will download the source code from github

```
# arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
             --sync-mode=git \
             --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory of the working directory. The default working directory is `/root`; you can also specify it by using `--workingDir`. You may also specify the branch you are pulling code from by adding `--env GIT_SYNC_BRANCH=main` to the parameters while submitting the job.

> If you are using a private git repo, you can use the following command:

```
# arena submit tf \
             --name=tf-git \
             --gpus=1 \
             --image=tensorflow/tensorflow:1.5.0-devel-gpu \
             --syncMode=git \
             --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
             --env=GIT_SYNC_USERNAME=yourname \
             --env=GIT_SYNC_PASSWORD=yourpwd \
             "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```

Notice: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to sync up source code. You can set the environment variables defined in the git-sync project.

3\. List all the jobs

```
# arena list
NAME    STATUS   TRAINER  AGE  NODE
tf-git  RUNNING  tfjob    0s   192.168.1.120
```

4\. Check the resource usage of the job

```
# arena top job
NAME    STATUS   TRAINER  AGE  NODE           GPU(Requests)  GPU(Allocated)
tf-git  RUNNING  TFJOB    17s  192.168.1.120  1              1


Total Allocated GPUs of Training Job:
1

Total Requested GPUs of Training Job:
1
```

5\. Check the resource usage of the cluster

```
# arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           1
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```

6\. Get the details of the specific job

```
# arena get tf-git
NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
tf-git  RUNNING  TFJOB    5s   tf-git-tfjob-worker-0  192.168.1.120
```

7\. Check logs

```
# arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
```

8\. More information about the training job in the logviewer

```
# arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
```

![](1-tfjob-logviewer.jpg)

Congratulations! You've run the first training job with `arena` successfully.
@ -1,45 +0,0 @@
Arena supports RDMA for distributed training. We can allocate RDMA devices to worker jobs by adding the parameter `--rdma`.

1\. Deploy the rdma device plugin

```
# Deploy RDMA device plugin
kubectl create -f kubernetes-artifacts/rdma/rdma-config.yaml
kubectl create -f kubernetes-artifacts/rdma/device-plugin.yaml
```

2\. Label your node with the infiniband device

```
# Label RDMA NODE
kubectl label node <your node> accelerator/rdma=true
```

```
# Check Device plugin status
kubectl -n arena-system get ds
NAME               DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR           AGE
rdma-sriov-dp-ds   1         1         1         1            1           accelerator/rdma=true   46d
```

3\. Enable the arena RDMA config

```
find /charts/ -name values.yaml | xargs sed -i "/enableRDMA/s/false/true/g"
```

4\. Submit a Tensorflow training job using RDMA

```
# arena submit mpi --name=mpi-dist \
              --rdma \
              --gpus=1 \
              --workers=2 \
              --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
              --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
              --syncMode=git \
              --syncSource=https://github.com/tensorflow/benchmarks.git \
              --tensorboard \
              "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
@ -1,201 +0,0 @@
Arena supports and simplifies distributed Spark jobs.

### 1. To run a distributed spark job, you need to specify:

- The spark job image which contains the main class jar. (required)
- The main class of your jar. (required)
- The jar path in the container. (required)
- The number of executors. (default: 1)
- The cpu resource request of the driver pod. (default: 1)
- The memory resource request of the driver pod. (default: 500m)
- The cpu resource request of the executor pod. (default: 1)
- The memory resource request of the executor pod. (default: 500m)

### 2. How to create the spark job image

The arena spark job is based on [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). You can create a spark job image with the [docker-image-tool](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images).

### 3. How to use the Arena spark job

##### Install the spark operator

```
# arena-system is the default namespace; if it does not exist, please create it.
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-operator.yaml
```

##### Create the rbac of the spark job

The spark job needs the service account `spark` to create executors.

```
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-rbac.yaml
```

The default namespace is `default`. If you want to run spark jobs in other namespaces, you can change the namespace in spark-rbac.yaml and create a new service account.

##### Submit a spark job

```
arena submit sparkjob --name=demo --image=registry.aliyuncs.com/acs/spark:v2.4.0 --main-class=org.apache.spark.examples.SparkPi --jar=local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```

The result is like below.

```
configmap/demo-sparkjob created
configmap/demo-sparkjob labeled
sparkapplication.sparkoperator.k8s.io/demo created
INFO[0005] The Job demo has been submitted successfully
INFO[0005] You can run `arena get demo --type sparkjob` to check the job status
```

##### Get the spark job status

```
arena get --type=sparkjob demo
```

When the job succeeds, you will see the result below.

```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 15s

NAME   STATUS     TRAINER   AGE  INSTANCE      NODE
demo1  SUCCEEDED  SPARKJOB  1h   demo1-driver  N/A
```

##### Watch the log of the spark job

```
arena logs -f demo
```

You will get the log of the spark driver pod.
```
2019-05-08T08:25:21.904409561Z ++ id -u
2019-05-08T08:25:21.904639867Z + myuid=0
2019-05-08T08:25:21.904649704Z ++ id -g
2019-05-08T08:25:21.904901542Z + mygid=0
2019-05-08T08:25:21.904909072Z + set +e
2019-05-08T08:25:21.905241846Z ++ getent passwd 0
2019-05-08T08:25:21.905608733Z + uidentry=root:x:0:0:root:/root:/bin/ash
2019-05-08T08:25:21.905623028Z + set -e
2019-05-08T08:25:21.905629226Z + '[' -z root:x:0:0:root:/root:/bin/ash ']'
2019-05-08T08:25:21.905633894Z + SPARK_K8S_CMD=driver
2019-05-08T08:25:21.905757494Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.90622059Z + shift 1
2019-05-08T08:25:21.906232126Z + SPARK_CLASSPATH=':/opt/spark/jars/*'
2019-05-08T08:25:21.906236316Z + env
2019-05-08T08:25:21.906239651Z + grep SPARK_JAVA_OPT_
2019-05-08T08:25:21.90624307Z + sort -t_ -k4 -n
2019-05-08T08:25:21.906585896Z + sed 's/[^=]*=\(.*\)/\1/g'
2019-05-08T08:25:21.906908601Z + readarray -t SPARK_EXECUTOR_JAVA_OPTS
2019-05-08T08:25:21.906917535Z + '[' -n '' ']'
2019-05-08T08:25:21.906999069Z + '[' -n '' ']'
2019-05-08T08:25:21.907003871Z + PYSPARK_ARGS=
2019-05-08T08:25:21.907006605Z + '[' -n '' ']'
2019-05-08T08:25:21.907008951Z + R_ARGS=
2019-05-08T08:25:21.907012105Z + '[' -n '' ']'
2019-05-08T08:25:21.907148385Z + '[' '' == 2 ']'
2019-05-08T08:25:21.907994286Z + '[' '' == 3 ']'
2019-05-08T08:25:21.908014459Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.908018653Z + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
2019-05-08T08:25:21.908023924Z + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.20.90.160 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
2019-05-08T08:25:23.326681135Z 2019-05-08 08:25:23 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08T08:25:23.829843117Z 2019-05-08 08:25:23 INFO SparkContext:54 - Running Spark version 2.4.0
2019-05-08T08:25:23.8529898Z 2019-05-08 08:25:23 INFO SparkContext:54 - Submitted application: Spark Pi
2019-05-08T08:25:23.94670344Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls to: root
2019-05-08T08:25:23.946735076Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls to: root
2019-05-08T08:25:23.946740267Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls groups to:
2019-05-08T08:25:23.946744543Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls groups to:
2019-05-08T08:25:23.946748767Z 2019-05-08 08:25:23 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-05-08T08:25:24.273960575Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2019-05-08T08:25:24.307632934Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering MapOutputTracker
2019-05-08T08:25:24.339548141Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-05-08T08:25:24.339577986Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-05-08T08:25:24.340887925Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-05-08T08:25:24.359682519Z 2019-05-08 08:25:24 INFO DiskBlockManager:54 - Created local directory at /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/blockmgr-5532fd8b-64b9-492c-b94d-308b55d60a71
2019-05-08T08:25:24.388529744Z 2019-05-08 08:25:24 INFO MemoryStore:54 - MemoryStore started with capacity 110.0 MB
2019-05-08T08:25:24.413347888Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-05-08T08:25:24.560654618Z 2019-05-08 08:25:24 INFO log:192 - Logging initialized @2462ms
2019-05-08T08:25:24.654721075Z 2019-05-08 08:25:24 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-05-08T08:25:24.680943254Z 2019-05-08 08:25:24 INFO Server:419 - Started @2586ms
2019-05-08T08:25:24.715867156Z 2019-05-08 08:25:24 INFO AbstractConnector:278 - Started ServerConnector@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:24.715897312Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2019-05-08T08:25:24.76123501Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1450078a{/jobs,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.762173789Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@534ca02b{/jobs/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.763361524Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29a23c3d{/jobs/job,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.764374535Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fe46b62{/jobs/job/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.764919809Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@591fd34d{/stages,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.765687152Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61e45f87{/stages/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.766434602Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c9b78e3{/stages/stage,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.769934319Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5491f68b{/stages/stage/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.769949155Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@736ac09a{/stages/pool,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.769966711Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ecd665{/stages/pool/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.77037559Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@45394b31{/storage,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.772696599Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1ec7d8b3{/storage/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.772709487Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b0ca5e1{/storage/rdd,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.773014833Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb3131b{/storage/rdd/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.77546416Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54dcbb9f{/environment,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.775478151Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74fef3f7{/environment/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.775882882Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a037324{/executors,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.780702953Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@69eb86b4{/executors/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.780717178Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@585ac855{/executors/threadDump,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.78072195Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb8f9e2{/executors/threadDump/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.793805533Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a933be2{/static,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.808511998Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@378bd86d{/,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.808532751Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2189e7a7{/api,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.808537695Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@644abb8f{/jobs/job/kill,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.80854206Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a411233{/stages/stage/kill,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:24.808546336Z 2019-05-08 08:25:24 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://demo1-1557303918993-driver-svc.default.svc:4040
|
||||
2019-05-08T08:25:24.834767942Z 2019-05-08 08:25:24 INFO SparkContext:54 - Added JAR file:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar at spark://demo1-1557303918993-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.4.0.jar with timestamp 1557303924832
|
||||
2019-05-08T08:25:26.274526541Z 2019-05-08 08:25:26 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes.
|
||||
2019-05-08T08:25:26.455658752Z 2019-05-08 08:25:26 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
|
||||
2019-05-08T08:25:26.47651031Z 2019-05-08 08:25:26 INFO NettyBlockTransferService:54 - Server created on demo1-1557303918993-driver-svc.default.svc:7079
|
||||
2019-05-08T08:25:26.476533172Z 2019-05-08 08:25:26 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
|
||||
2019-05-08T08:25:26.503099521Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
|
||||
2019-05-08T08:25:26.506168762Z 2019-05-08 08:25:26 INFO BlockManagerMasterEndpoint:54 - Registering block manager demo1-1557303918993-driver-svc.default.svc:7079 with 110.0 MB RAM, BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
|
||||
2019-05-08T08:25:26.529524775Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
|
||||
2019-05-08T08:25:26.529543725Z 2019-05-08 08:25:26 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
|
||||
2019-05-08T08:25:26.661414752Z 2019-05-08 08:25:26 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c777e7b{/metrics/json,null,AVAILABLE,@Spark}
|
||||
2019-05-08T08:25:30.459756195Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.20.90.161:52168) with ID 1
|
||||
2019-05-08T08:25:30.534179215Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
|
||||
2019-05-08T08:25:30.679510273Z 2019-05-08 08:25:30 INFO BlockManagerMasterEndpoint:54 - Registering block manager 172.20.90.161:36718 with 110.0 MB RAM, BlockManagerId(1, 172.20.90.161, 36718, None)
|
||||
2019-05-08T08:25:30.906713226Z 2019-05-08 08:25:30 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38
|
||||
2019-05-08T08:25:30.93537711Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
|
||||
2019-05-08T08:25:30.936000643Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
|
||||
2019-05-08T08:25:30.936506781Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Parents of final stage: List()
|
||||
2019-05-08T08:25:30.938152322Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Missing parents: List()
|
||||
2019-05-08T08:25:30.958509715Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
|
||||
2019-05-08T08:25:31.128459296Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 110.0 MB)
|
||||
2019-05-08T08:25:31.172704042Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 110.0 MB)
|
||||
2019-05-08T08:25:31.178025215Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on demo1-1557303918993-driver-svc.default.svc:7079 (size: 1256.0 B, free: 110.0 MB)
|
||||
2019-05-08T08:25:31.182000364Z 2019-05-08 08:25:31 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
|
||||
2019-05-08T08:25:31.202640906Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
|
||||
2019-05-08T08:25:31.203502967Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
|
||||
2019-05-08T08:25:31.245126257Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.90.161, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
|
||||
2019-05-08T08:25:31.805815672Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.90.161:36718 (size: 1256.0 B, free: 110.0 MB)
|
||||
2019-05-08T08:25:31.946492966Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.90.161, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
|
||||
2019-05-08T08:25:31.957903365Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 727 ms on 172.20.90.161 (executor 1) (1/2)
|
||||
2019-05-08T08:25:31.99308236Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 47 ms on 172.20.90.161 (executor 1) (2/2)
|
||||
2019-05-08T08:25:31.994764897Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
|
||||
2019-05-08T08:25:31.995390219Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.998 s
|
||||
2019-05-08T08:25:32.003622135Z 2019-05-08 08:25:32 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.094511 s
|
||||
2019-05-08T08:25:32.005407995Z Pi is roughly 3.1436157180785904
|
||||
2019-05-08T08:25:32.011499948Z 2019-05-08 08:25:32 INFO AbstractConnector:318 - Stopped Spark@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
|
||||
2019-05-08T08:25:32.014105609Z 2019-05-08 08:25:32 INFO SparkUI:54 - Stopped Spark web UI at http://demo1-1557303918993-driver-svc.default.svc:4040
|
||||
2019-05-08T08:25:32.01861939Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
|
||||
2019-05-08T08:25:32.019973046Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
|
||||
2019-05-08T08:25:32.025136562Z 2019-05-08 08:25:32 WARN ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
|
||||
2019-05-08T08:25:32.087137746Z 2019-05-08 08:25:32 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
|
||||
2019-05-08T08:25:32.097659039Z 2019-05-08 08:25:32 INFO MemoryStore:54 - MemoryStore cleared
|
||||
2019-05-08T08:25:32.098360561Z 2019-05-08 08:25:32 INFO BlockManager:54 - BlockManager stopped
|
||||
2019-05-08T08:25:32.104432515Z 2019-05-08 08:25:32 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
|
||||
2019-05-08T08:25:32.10761075Z 2019-05-08 08:25:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
|
||||
2019-05-08T08:25:32.114734944Z 2019-05-08 08:25:32 INFO SparkContext:54 - Successfully stopped SparkContext
|
||||
2019-05-08T08:25:32.117170277Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Shutdown hook called
|
||||
2019-05-08T08:25:32.118273045Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bdb4e416-5ab7-420c-905e-ef43c30fb187
|
||||
2019-05-08T08:25:32.120019227Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/spark-06dbab1f-13aa-474c-a1db-8845e14627bf
|
||||
```
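The `Pi is roughly 3.14...` line in the log above is the output of the SparkPi example, which estimates pi by Monte Carlo sampling inside each partition. A minimal local sketch of the same method (plain Python, no Spark; the function name and sample count are illustrative):

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points uniformly in the square [-1, 1]^2
    and counting the fraction that land inside the unit circle, as the
    SparkPi example does per partition."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x * x + y * y <= 1.0:
            inside += 1
    # area ratio circle/square = pi/4, so scale the hit fraction by 4
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

SparkPi simply distributes these samples across tasks (two partitions in the log above) and sums the hit counts with `reduce`.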

##### delete spark job

```bash
arena delete --type=sparkjob demo1
```

You will find that the Spark job has been deleted.

```
sparkapplication.sparkoperator.k8s.io "demo1" deleted
time="2019-05-08T17:27:06+08:00" level=info msg="The Job demo1 has been deleted successfully"
configmap "demo1-sparkjob" deleted
```

Congratulations! You've run the distributed Spark job with `arena` successfully.

# Arena supports and simplifies Volcano jobs

Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms, currently missing from Kubernetes, that are commonly required by many classes of batch and elastic workloads, including:

1. machine learning/deep learning,
2. bioinformatics/genomics, and
3. other "big data" applications.

## Prerequisites

- a Kubernetes deployment
- Volcano deployed following the steps in kubernetes-artifacts/volcano-operator/README.md

### 1. To run a batch/distributed volcano job, you may need to specify:

```
--minAvailable int       The minimum number of available pods to run for this job (default 1)
--name string            override name
--queue string           Specifies the queue that will be used by the scheduler; the default queue is used when this is left empty (default "default")
--schedulerName string   Specifies the scheduler name; volcano is used when not specified (default "volcano")
--taskCPU string         CPU request for each task replica/pod (default "250m")
--taskImages strings     the docker images of the different tasks of the volcano job; by default three tasks are created with the ubuntu, nginx and busybox images (default [ubuntu,nginx,busybox])
--taskMemory string      memory request for each task replica/pod (default "128Mi")
--taskName string        the task name of the volcano job (default "task")
--taskPort int           the task port number (default 2222)
--taskReplicas int       the number of task replicas to run for the distributed tasks (default 1)
```
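As an illustration of how the flags above combine in one submit command (the flag values below are arbitrary and the job name `demo-flags` is hypothetical):

```shell
arena submit volcanojob \
    --name=demo-flags \
    --minAvailable=2 \
    --taskImages=busybox,busybox \
    --taskReplicas=2 \
    --taskCPU=500m \
    --taskMemory=256Mi
```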

### 2. More information related to volcano job

Arena volcano jobs are based on the Volcano project (https://github.com/volcano-sh/volcano).
You can get more information about Volcano at https://volcano.sh/

### 3. How to use Arena volcano job

##### install volcano

Deploy Volcano following the steps in kubernetes-artifacts/volcano-operator/README.md.

To install the chart with the release name `volcano-release`:

```bash
$ helm install --name volcano-release kubernetes-artifacts/volcano-operator
```

To verify that all deployments are running, use the command below:

```bash
kubectl get deployment --all-namespaces | grep {release_name}
```

You should get output similar to the following, where the three deployments for the controller, admission and scheduler are running:

```bash
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
{release_name}-admission     1/1     1            1           4s
{release_name}-controllers   1/1     1            1           4s
{release_name}-scheduler     1/1     1            1           4s
```

To verify that all pods are running, use the command below:

```bash
kubectl get pods --all-namespaces | grep {release_name}
```

You should get output similar to the following, where the pods for the controller, admission, admission-init and scheduler are listed:

```bash
NAMESPACE   NAME                                           READY   STATUS      RESTARTS   AGE
default     volcano-release-admission-cbfdb8549-dz5hg      1/1     Running     0          33s
default     volcano-release-admission-init-7xmzd           0/1     Completed   0          33s
default     volcano-release-controllers-7967fffb8d-7vnn9   1/1     Running     0          33s
default     volcano-release-scheduler-746f6557d8-9pfg6     1/1     Running     0          33s
```

##### submit a volcano job

```bash
arena submit volcanojob --name=demo
```

The result looks like below.

```
configmap/demo-volcanojob created
configmap/demo-volcanojob labeled
job.batch.volcano.sh/demo created
INFO[0003] The Job demo has been submitted successfully
INFO[0003] You can run `arena get demo --type volcanojob` to check the job status
```

If you want to provide more command-line parameters:

```bash
./bin/arena submit volcanojob --name demo12 --taskImages busybox,busybox --taskReplicas 2
```

In the above case, two tasks are created, each with two replicas, as shown below.

```bash
arena get --type volcanojob demo12
```

The result is as below.

```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 2m

NAME     STATUS     TRAINER      AGE  INSTANCE         NODE
demo12   SUCCEEDED  VOLCANOJOB   2m   demo12-task-0-0  11.245.101.184
demo12   SUCCEEDED  VOLCANOJOB   2m   demo12-task-0-1  11.245.101.184
demo12   SUCCEEDED  VOLCANOJOB   2m   demo12-task-1-0  11.245.101.184
demo12   SUCCEEDED  VOLCANOJOB   2m   demo12-task-1-1  11.245.101.184
```

##### get volcano job status

```bash
arena get --type=volcanojob demo
```

When the job is running or has succeeded, you will see the result below.

```
STATUS: RUNNING/SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 45s

NAME   STATUS     TRAINER      AGE  INSTANCE       NODE
demo   SUCCEEDED  VOLCANOJOB   59s  demo-task-0-0  11.245.101.184
demo   RUNNING    VOLCANOJOB   59s  demo-task-1-0  11.245.101.184
demo   SUCCEEDED  VOLCANOJOB   59s  demo-task-2-0  11.245.101.184
```

##### list arena jobs

```bash
arena list
```

You can observe output like the following:

```
NAME   STATUS    TRAINER      AGE  NODE
demo   RUNNING   VOLCANOJOB   2m   11.245.101.184
```

##### delete volcano job

```bash
arena delete --type=volcanojob demo
```

You will find that the volcano job has been deleted.

```
job.batch.volcano.sh "demo" deleted
configmap "demo-volcanojob" deleted
INFO[0000] The Job demo has been deleted successfully
```

Congratulations! You've run the batch/distributed volcano job with `arena` successfully.

# Arena supports Priority and Preemption for MPIJob

## Prerequisites

- Kubernetes > 1.11

1. Create `PriorityClass` resources with the yaml below:

```yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000

---

apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
```

Save the template above in a file named `pc.yaml`, and create the `PriorityClass` resources:

```
kubectl create -f pc.yaml
```

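Optionally, you can confirm that both classes were created (`kubectl get priorityclass` is a standard kubectl command; the class names are the ones defined above):

```shell
kubectl get priorityclass critical medium
```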
2. There is only one GPU available in the Kubernetes cluster:

```
# arena top node
NAME          IPADDRESS     ROLE    GPU(Total)  GPU(Allocated)
192.168.0.20  192.168.0.20  master  0           0
192.168.0.21  192.168.0.21  master  0           0
192.168.0.22  192.168.0.22  master  0           0
192.168.0.23  192.168.0.23  <none>  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
```

3. Run the MPI training job with `medium` priority:

The following command is an example.

```
# arena submit mpi \
    --name=medium \
    --priority=medium \
    --gpus=1 \
    --workers=1 \
    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status
```

4. Get the details of the specific job:

```
# arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s

NAME    STATUS   TRAINER  AGE  INSTANCE               NODE
medium  RUNNING  MPIJOB   58s  medium-launcher-sz5xj  192.168.0.23
medium  RUNNING  MPIJOB   58s  medium-worker-0        192.168.0.23
```

5. The only GPU is used by the MPI training job `medium`:

```
# arena top node -d

NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME             GPU REQUESTS  GPU LIMITS
default    medium-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
-----------------------------------------------------------------------------------------

Allocated/Total GPUs In Cluster: 1/1 (100%)
```

6. Run the MPI training job with `critical` priority:

```
# arena submit mpi \
    --name=critical \
    --priority=critical \
    --gpus=1 \
    --workers=1 \
    --image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
```

7. Check the MPI training job `medium`, and find that it has been preempted by critical-worker-0:

```
# kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN   TYPE     REASON      OBJECT                MESSAGE
15m         Normal   Scheduled   pod/medium-worker-0   Successfully assigned default/medium-worker-0 to 192.168.0.23
14m         Normal   Pulled      pod/medium-worker-0   Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m         Normal   Created     pod/medium-worker-0   Created container mpi
14m         Normal   Started     pod/medium-worker-0   Started container mpi
2m32s       Normal   Preempted   pod/medium-worker-0   by default/critical-worker-0 on node 192.168.0.23
2m32s       Normal   Killing     pod/medium-worker-0   Stopping container mpi
```

8. Check the details of the MPI training job `medium`; it has turned to FAILED:

```
# arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m

NAME    STATUS  TRAINER  AGE  INSTANCE               NODE
medium  FAILED  MPIJOB   20m  medium-launcher-sz5xj  192.168.0.23
```

9. Check the details of the MPI training job `critical`; it is running:

```
# arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
critical  RUNNING  MPIJOB   10m  critical-launcher-mfffs  192.168.0.23
critical  RUNNING  MPIJOB   10m  critical-worker-0        192.168.0.23
```

10. We can find that the only GPU is now used by the MPI training job `critical`:

```
# arena top node -d
NAME:       cn-hangzhou.192.168.0.23
IPADDRESS:  192.168.0.23
ROLE:       <none>

NAMESPACE  NAME               GPU REQUESTS  GPU LIMITS
default    critical-worker-0  1             1

Total GPUs In Node cn-hangzhou.192.168.0.23:      1
Allocated GPUs In Node cn-hangzhou.192.168.0.23:  1 (100%)
-----------------------------------------------------------------------------------------
```

Congratulations! You've run jobs with priorities and preemption with `arena` successfully.

# Arena supports assigning jobs to particular Kubernetes nodes

Currently only MPI jobs and TensorFlow jobs are supported. Here are some usage examples.

1. Query the Kubernetes cluster information:

```
# kubectl get nodes
NAME                       STATUS   ROLES    AGE     VERSION
cn-beijing.192.168.3.225   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.226   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.227   Ready    master   2d23h   v1.12.6-aliyun.1
cn-beijing.192.168.3.228   Ready    <none>   2d22h   v1.12.6-aliyun.1
cn-beijing.192.168.3.229   Ready    <none>   2d22h   v1.12.6-aliyun.1
cn-beijing.192.168.3.230   Ready    <none>   2d22h   v1.12.6-aliyun.1
```

2. Give labels to some nodes. For example, give the label "gpu_node=ok" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the label "ssd_node=ok" to node "cn-beijing.192.168.3.230":

```
# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
node/cn-beijing.192.168.3.228 labeled
# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
node/cn-beijing.192.168.3.229 labeled
# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
node/cn-beijing.192.168.3.230 labeled
```
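To double-check which nodes carry a given label, you can use a standard kubectl label selector (`-l` filters nodes by label; the labels are the ones set above):

```shell
kubectl get nodes -l gpu_node=ok
kubectl get nodes -l ssd_node=ok
```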

## for MPI job

1. When submitting a job, you can assign the nodes that run it with the "--selector" option:

```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --selector gpu_node=ok \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --tensorboard \
    --loglevel debug \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```

2. Query the job information:

```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 21s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-launcher-7jn4q  192.168.3.229
mpi-dist  RUNNING  MPIJOB   21s  mpi-dist-worker-0        192.168.3.229

Your tensorboard will be available on:
http://192.168.3.225:31611
```

The jobs are running on node cn-beijing.192.168.3.229 (ip 192.168.3.229).

3. You can use "--selector" multiple times. For example, you can use "--selector gpu_node=ok --selector ssd_node=ok" in the arena submit command, which means the job should run on nodes that have both the label "gpu_node=ok" and the label "ssd_node=ok".
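A sketch of that syntax (the job name `mpi-dist2` and the command are hypothetical; note that with the sample labels above no single node carries both labels, so this exact command would stay pending):

```shell
arena submit mpi --name=mpi-dist2 \
    --gpus=1 \
    --workers=1 \
    --selector gpu_node=ok \
    --selector ssd_node=ok \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun tail -f /dev/null"
```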

## for tf job

1. Because there are four roles ("PS", "Worker", "Evaluator", "Chief") in a tf job, you can use "--selector" to assign nodes, and it takes effect for all roles. For example:

```
arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --selector ssd_node=ok \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"
```

Use the following command to check the job status:

```
# arena get tf
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    24s  tf-ps-0      192.168.3.230
tf    PENDING  TFJOB    24s  tf-worker-0  192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:31867
```

The jobs (including "PS" and "Worker") are running on cn-beijing.192.168.3.230 (ip 192.168.3.230, label "ssd_node=ok").

2. You can also assign nodes per role. For example, if you want the "PS" role to run on nodes with the label "ssd_node=ok" and the "Worker" role to run on nodes with the label "gpu_node=ok", use the "--ps-selector" and "--worker-selector" options:

```
arena submit tfjob \
    --name=tf \
    --gpus=1 \
    --workers=1 \
    --ps-selector ssd_node=ok \
    --worker-selector gpu_node=ok \
    --workerImage=cheyang/tf-mnist-distributed:gpu \
    --psImage=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --loglevel debug \
    "python /app/main.py"
```

Then check the job's status:

```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    23s  tf-ps-0      192.168.3.230
tf    RUNNING  TFJOB    23s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:30162
```

The "PS" job is running on cn-beijing.192.168.3.230 (ip 192.168.3.230, label "ssd_node=ok") and the "Worker" job is running on cn-beijing.192.168.3.228 (ip 192.168.3.228, label "gpu_node=ok").

3.if you use "--selector" in "arena submit tf" command and also use "--ps-selector"(or "--worker-selector","--evaluator-selector","chief-selector"),the value of "--ps-selector" would cover value of "--selector",for example:
|
||||
|
||||
```
|
||||
arena submit tfjob \
|
||||
--name=tf \
|
||||
--gpus=1 \
|
||||
--workers=1 \
|
||||
--ps-selector ssd_node=ok \
|
||||
--selector gpu_node=ok \
|
||||
--workerImage=cheyang/tf-mnist-distributed:gpu \
|
||||
--psImage=cheyang/tf-mnist-distributed:cpu \
|
||||
--ps=1 \
|
||||
--tensorboard \
|
||||
--loglevel debug \
|
||||
"python /app/main.py"
|
||||
```
|
||||
|
||||
"PS" job will be running on nodes whose label is "ssd_node=ok",other jobs will be running on nodes whose label is "gpu_node=ok",now verify our conclusions,use follow command to check job status.
|
||||
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    39s  tf-ps-0      192.168.3.230
tf    RUNNING  TFJOB    39s  tf-worker-0  192.168.3.228

Your tensorboard will be available on:
http://192.168.3.225:32105
```

As you can see, the "PS" instance is running on a node labeled "ssd_node=ok" and the other instances are running on nodes labeled "gpu_node=ok".
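
The precedence rule can be pictured as a small lookup. This is a sketch of the rule described above, not Arena's actual source code:

```python
# Sketch of the selector precedence rule (an illustration, not Arena's code):
# a role-specific selector, when given, takes precedence over --selector.
def effective_selector(global_selector, role_selector):
    return role_selector if role_selector else global_selector

# "PS" pods get the ps-selector; workers fall back to the global selector.
ps_sel = effective_selector({"gpu_node": "ok"}, {"ssd_node": "ok"})
worker_sel = effective_selector({"gpu_node": "ok"}, {})
print(ps_sel, worker_sel)
```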
Arena supports submitting a job that tolerates Kubernetes node taints (currently only MPI jobs and TensorFlow jobs are supported).

Here are some usage examples.

1. Query the Kubernetes cluster information:
```
# kubectl get nodes
NAME                       STATUS   ROLES    AGE    VERSION
cn-beijing.192.168.3.225   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.226   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.227   Ready    master   2d23h  v1.12.6-aliyun.1
cn-beijing.192.168.3.228   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.229   Ready    <none>   2d22h  v1.12.6-aliyun.1
cn-beijing.192.168.3.230   Ready    <none>   2d22h  v1.12.6-aliyun.1
```

2. Add taints to some nodes. For example, apply the taint "gpu_node=invalid:NoSchedule" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the taint "ssd_node=invalid:NoSchedule" to node "cn-beijing.192.168.3.230". After that, no pod can be scheduled onto these nodes.

```
# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.228 tainted
# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.229 tainted
# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
node/cn-beijing.192.168.3.230 tainted
```
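
The NoSchedule behaviour above can be modeled in a few lines. This is a toy simplification of the real Kubernetes scheduler, only for intuition:

```python
# Toy model of the NoSchedule rule (a simplification, not the real scheduler):
# a pod may land on a node only if it tolerates every NoSchedule taint there.
def schedulable(node_taints, tolerated_keys):
    return all(t["key"] in tolerated_keys
               for t in node_taints
               if t["effect"] == "NoSchedule")

gpu_taint = {"key": "gpu_node", "value": "invalid", "effect": "NoSchedule"}
print(schedulable([gpu_taint], []))            # plain pods are kept off
print(schedulable([gpu_taint], ["gpu_node"]))  # tolerating pods may land
```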

3. When submitting a job, you can tolerate tainted nodes with the "--toleration" option:

```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```

Query the job information:

```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.230
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:30052
```

The job's instances are running on node cn-beijing.192.168.3.230 (IP 192.168.3.230, taint "ssd_node=invalid").

4. You can use "--toleration" multiple times. For example, passing "--toleration gpu_node --toleration ssd_node" to the arena submit command means the job tolerates nodes with the taint "gpu_node=invalid" as well as nodes with the taint "ssd_node=invalid".

```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--toleration gpu_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```

Query the job status:

```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s

NAME      STATUS   TRAINER  AGE  INSTANCE                 NODE
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-launcher-jgms7  192.168.3.229
mpi-dist  RUNNING  MPIJOB   29s  mpi-dist-worker-0        192.168.3.230

Your tensorboard will be available on:
http://192.168.3.225:30052
```

5. You can use "--toleration all" to tolerate all node taints.
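
One way to picture what these flags translate to is the following mapping. This is an assumption about the generated pod spec, not taken from Arena's source; in Kubernetes, a toleration with only `operator: Exists` matches every taint:

```python
# Hypothetical mapping of repeated --toleration flags to Kubernetes
# toleration entries (an illustration, not Arena's actual implementation).
def build_tolerations(keys):
    if "all" in keys:
        # an entry with only operator "Exists" tolerates every taint
        return [{"operator": "Exists"}]
    return [{"key": k, "operator": "Exists"} for k in keys]

print(build_tolerations(["gpu_node", "ssd_node"]))
print(build_tolerations(["all"]))
```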
# Serving Trained Model with arena

You can use arena to deploy your trained model as a RESTful API. To illustrate the usage, we use the sample project [fast-style-transfer](https://github.com/floydhub/fast-style-transfer). To save time, we use its pre-trained model and bake the model into the docker image.

### 1. Serve Mode

We use the project's app.py script to start a RESTful server. You can deploy the trained model with arena:

```
# arena serve custom \
--name=fast-style-transfer \
--gpus=1 \
--version=alpha \
--replicas=1 \
--restful-port=5000 \
--image=happy365/fast-style-transfer:latest \
"python app.py"
```
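
A custom serving entrypoint such as app.py boils down to an HTTP server listening on the RESTful port. The following is a rough stdlib stand-in (hypothetical, not the project's actual code, which serves styled images):

```python
# Minimal stand-in for a custom serving entrypoint like app.py
# (hypothetical sketch; a real model server would transform the payload).
from http.server import BaseHTTPRequestHandler, HTTPServer

class ServeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)  # the uploaded image bytes
        self.send_response(200)
        self.end_headers()
        self.wfile.write(payload)          # echo back; a model would style it

def run(port=5000):
    # --restful-port=5000 in the command above is the port to listen on
    HTTPServer(("0.0.0.0", port), ServeHandler).serve_forever()
```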

Check the status of the serving job:

```
# arena serve list
NAME                 TYPE    VERSION  DESIRED  AVAILABLE  ENDPOINT_ADDRESS  PORTS
fast-style-transfer  CUSTOM  alpha    1        0          172.21.8.94       grpc:8001,restful:5000
```

Because the docker image is very large, pulling it takes some time. We can use kubectl to check the pod status:

```
# kubectl get po
NAME                                                        READY   STATUS              RESTARTS   AGE
fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj   0/1     ContainerCreating   0          6m44s
```

### 2. Access the service

We can use a client to access the service. Run the following command to create a client pod:

```
# kubectl run sample-client \
--generator=run-pod/v1 \
--image=happy365/arena-serve-custem-sample-client:latest \
--command -- \
/bin/sleep infinity
```

Then we can query the status of sample-client:

```
# kubectl get po sample-client
NAME            READY   STATUS    RESTARTS   AGE
sample-client   1/1     Running   0          87s
```

Next, query the service name; it is a combination of the job name and the version (here the job name is fast-style-transfer and the version is alpha):

```
# kubectl get svc fast-style-transfer-alpha
NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
fast-style-transfer-alpha   ClusterIP   172.21.1.114   <none>        5000/TCP   31m
```

Now we can use "kubectl exec" to log in to the container:

```
# kubectl exec -ti sample-client /bin/sh
#
```

Then we use "curl" to access the custom serving job:

```
# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
```

The input is an image named "beijing.jpg" stored in "/root/input"; the output is written to "/root/output". You can use "kubectl cp" to copy the output image from the container to the host:

```
# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
```
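
What the `curl -F` upload does can be sketched in a few lines: it builds a multipart/form-data body with a "file" field and POSTs it to the service. This is an illustration of the request shape, not code from the sample client:

```python
# Build the same kind of multipart/form-data body that
# `curl -F "file=@beijing.jpg" ...` sends (stdlib-only sketch).
import uuid

def multipart_form(field_name, filename, data):
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return body, headers

body, headers = multipart_form("file", "beijing.jpg", b"<jpeg bytes>")
# body and headers could then be POSTed with urllib.request.Request(...)
# to http://fast-style-transfer-alpha:5000 from inside the cluster.
```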

Now you can view the image at ~/beijing_out.jpg. Here is "beijing_out.jpg":



# Assign configuration files for jobs

You can pass configuration files to containers when submitting jobs.

This feature only supports the following job types:

* tfjob
* mpijob


## 1. Usage

You can use `--config-file <host_path_file>:<container_path_file>` to assign a configuration file to a container. There are some rules:

* if <host_path_file> is given and <container_path_file> is not, <container_path_file> is treated as the same as <host_path_file>
* <container_path_file> must be a file with an absolute path
* you can use `--config-file` more than once in a command, e.g. "--config-file /tmp/test1.conf:/etc/config/test1.conf --config-file /tmp/test2.conf:/etc/config/test2.conf"
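
The rules above can be sketched as a small parser. This is an assumption about how the flag value is interpreted, not Arena's actual parsing code:

```python
# Sketch of the --config-file rules (an illustration, not Arena's parser).
def parse_config_file(spec):
    host, sep, container = spec.partition(":")
    if not sep:           # no container path given:
        container = host  # it defaults to the host path
    if not container.startswith("/"):
        raise ValueError("container path must be absolute")
    return host, container

print(parse_config_file("/tmp/test1.conf:/etc/config/test1.conf"))
print(parse_config_file("/tmp/test2.conf"))
```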

## 2. Sample

First, we create a test file named "test-config.json" at "/tmp/test-config.json". We want to push this file to the containers of a tfjob (or mpijob), with the in-container path "/etc/config/config.json".

```
# cat /tmp/test-config.json
{
    "key": "job-config"
}
```

Second, use the following command to create the tfjob:

```
# arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--config-file /tmp/test-config.json:/etc/config/config.json \
"python /app/main.py"
```

Wait a minute, then get the job status:

```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s

NAME  STATUS   TRAINER  AGE  INSTANCE     NODE
tf    RUNNING  TFJOB    16s  tf-ps-0      192.168.7.18
tf    RUNNING  TFJOB    16s  tf-worker-0  192.168.7.16

Your tensorboard will be available on:
http://192.168.7.10:31825
```

Use kubectl to check whether the file is present in the containers:

```
# kubectl exec -ti tf-ps-0 -- cat /etc/config/config.json
{
    "key": "job-config"
}
# kubectl exec -ti tf-worker-0 -- cat /etc/config/config.json
{
    "key": "job-config"
}
```

As you can see, the file is present in the containers.

This example shows how to use `Arena` to submit a standalone PyTorch job. The example downloads the source code from a git URL.

1. The first step is to check the available resources.

```
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```

There are 3 available nodes with GPUs for running training jobs.

2. Submit a PyTorch training job; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).

```
# Single gpu card
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-local-git-pytorchjob created
configmap/pytorch-local-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-local-git created
INFO[0000] The Job pytorch-local-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-local-git --type pytorchjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`.

> If you are using a private git repo, you can use the following command:

```
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
```
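
Conceptually, the two environment variables let the sync sidecar authenticate against the private repo, roughly as if they were folded into the clone URL. The helper below is hypothetical (git-sync handles this internally):

```python
# Illustration of how username/password can be folded into an authenticated
# clone URL (a hypothetical helper, not git-sync's actual code).
from urllib.parse import quote, urlsplit, urlunsplit

def with_credentials(url, user, password):
    parts = urlsplit(url)
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{parts.netloc}"
    return urlunsplit(parts._replace(netloc=netloc))

print(with_credentials(
    "https://code.aliyun.com/370272561/mnist-pytorch.git",
    "yourname", "yourpwd"))
```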

3. List all the jobs.

```
➜ arena list
NAME               STATUS     TRAINER     AGE  NODE
pytorch-local-git  SUCCEEDED  PYTORCHJOB  21h  N/A
```

4. Get the details of this job.

```
➜ arena get pytorch-local-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 35s

NAME               STATUS     TRAINER     AGE  INSTANCE                    NODE
pytorch-local-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-local-git-master-0  172.16.0.210
```

5. Check logs.

```
➜ arena logs pytorch-local-git
WORLD_SIZE: 1, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)]     loss=2.3000
Train Epoch: 1 [640/60000 (1%)]   loss=2.2135
Train Epoch: 1 [1280/60000 (2%)]  loss=2.1705
Train Epoch: 1 [1920/60000 (3%)]  loss=2.0767
Train Epoch: 1 [2560/60000 (4%)]  loss=1.8681
...
```

This example shows how to use `Arena` to submit a distributed PyTorch job. The example downloads the source code from a git URL.

1. The first step is to check the available resources.

```
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```

There are 3 available nodes with GPUs for running training jobs.

2. Submit a distributed PyTorch training job with 2 nodes and one GPU card each; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).

```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-git \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-dist-git-pytorchjob created
configmap/pytorch-dist-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-dist-git created
INFO[0000] The Job pytorch-dist-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`.

> `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in pytorch-operator). The parameter defaults to 1, in which case the job runs as a standalone job.


3. List all the jobs.

```
➜ arena list
NAME              STATUS     TRAINER     AGE  NODE
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  N/A
```

4. Get the details of this job. There are 2 instances of this job, and instance `pytorch-dist-git-master-0` is rank0. Arena simplifies the process of submitting distributed jobs with `PyTorch-Operator`: a `Service` is created for the `master` instance so that other nodes can reach it by the `Service` name, and the environment variables `MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, and `RANK` are injected into each instance; these are what PyTorch's distributed process-group initialization (`dist.init_process_group`) needs. `MASTER_PORT` is assigned automatically; `MASTER_ADDR` is "localhost" in the `master` instance and the `master`'s `Service` name in the other instances; `WORLD_SIZE` is the total number of instances; and `RANK` is the serial number of the current node: 0 for the `master`, and for a `Worker` instance the index in its name suffix plus one. For example, below, the `RANK` of instance `pytorch-dist-git-worker-0` is `0 + 1 = 1`.

In Arena, the value of `--workers` includes the `master` instance, because the `master` also participates in the training.

```
➜ arena get pytorch-dist-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME              STATUS     TRAINER     AGE  INSTANCE                   NODE
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-master-0  172.16.0.210
pytorch-dist-git  SUCCEEDED  PYTORCHJOB  23h  pytorch-dist-git-worker-0  172.16.0.210
```
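
The `RANK` assignment described above can be sketched directly from the instance names (assuming names of the form `<job>-master-0` / `<job>-worker-<i>`): the master gets rank 0 and worker-i gets i + 1.

```python
# Sketch of the RANK assignment rule described above.
def rank_for(instance_name):
    role, index = instance_name.rsplit("-", 2)[-2:]
    return 0 if role == "master" else int(index) + 1

print(rank_for("pytorch-dist-git-master-0"))  # 0
print(rank_for("pytorch-dist-git-worker-0"))  # 1
```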

5. Check logs.

```
➜ arena logs pytorch-dist-git
WORLD_SIZE: 2, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)]     loss=2.3000
Train Epoch: 1 [640/60000 (1%)]   loss=2.2135
Train Epoch: 1 [1280/60000 (2%)]  loss=2.1705
Train Epoch: 1 [1920/60000 (3%)]  loss=2.0767
Train Epoch: 1 [2560/60000 (4%)]  loss=1.8681
Train Epoch: 1 [3200/60000 (5%)]  loss=1.4142
Train Epoch: 1 [3840/60000 (6%)]  loss=1.0009
...
```

> For a distributed job with multiple instances, the default output is the log of rank0 (the `master` instance). To view the log of a specific instance, pass `-i <instance name>`, for example:

```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0
WORLD_SIZE: 2, CURRENT_RANK: 1
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)]     loss=2.3000
Train Epoch: 1 [640/60000 (1%)]   loss=2.2135
Train Epoch: 1 [1280/60000 (2%)]  loss=2.1705
Train Epoch: 1 [1920/60000 (3%)]  loss=2.0767
Train Epoch: 1 [2560/60000 (4%)]  loss=1.8681
Train Epoch: 1 [3200/60000 (5%)]  loss=1.4142
```

> In addition, you can view only the last few lines of the logs with `-t <lines>`, for example:

```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5
Train Epoch: 1 [58880/60000 (98%)]  loss=0.2048
Train Epoch: 1 [59520/60000 (99%)]  loss=0.0646

accuracy=0.9661

```

> For more parameters, see `arena logs --help`.


This example shows how to use `Arena` to submit a distributed PyTorch job and visualize it with `Tensorboard`. The sample downloads the source code from a git URL.

1. The first step is to check the available resources.

```
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```

There are 3 available nodes with GPUs for running training jobs.

2. Submit a distributed PyTorch training job with 2 nodes and one GPU card each; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).

```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-tensorboard \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--tensorboard \
--logdir=/root/logs \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-dist-tensorboard-pytorchjob created
configmap/pytorch-dist-tensorboard-pytorchjob labeled
service/pytorch-dist-tensorboard-tensorboard created
deployment.apps/pytorch-dist-tensorboard-tensorboard created
pytorchjob.kubeflow.org/pytorch-dist-tensorboard created
INFO[0000] The Job pytorch-dist-tensorboard has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-tensorboard --type pytorchjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`.

> `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in pytorch-operator). The parameter defaults to 1, in which case the job runs as a standalone job.

> `logdir` indicates the directory from which Tensorboard reads PyTorch's event logs.

3. List all the jobs.
```
➜ arena list
NAME                      STATUS     TRAINER     AGE  NODE
pytorch-dist-tensorboard  SUCCEEDED  PYTORCHJOB  22h  N/A
```

4. Get the details of this job.

```
➜ arena get pytorch-dist-tensorboard
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 15m

NAME                      STATUS     TRAINER     AGE  INSTANCE                           NODE
pytorch-dist-tensorboard  SUCCEEDED  PYTORCHJOB  22h  pytorch-dist-tensorboard-master-0  172.16.0.210
pytorch-dist-tensorboard  SUCCEEDED  PYTORCHJOB  22h  pytorch-dist-tensorboard-worker-0  172.16.0.210

Your tensorboard will be available on:
http://172.16.0.205:30583
```

> Notice: you can access the Tensorboard at `172.16.0.205:30583`. Consider `sshuttle` if you can't reach the Tensorboard directly from your laptop. For example:

```
# you can install sshuttle==0.74 on your mac with python2.7
➜ pip install sshuttle==0.74
# 0/0 -> 0.0.0.0/0
➜ sshuttle -r root@39.104.17.205 0/0
```

Here is an example of how you can use `Arena` for machine learning training. It downloads the source code from a git URL, and uses Tensorboard to visualize the TensorFlow computation graph and plot quantitative metrics.

1. The first step is to check the available resources:

```
arena top node
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```

There are 3 available nodes with GPUs for running training jobs.

2\. Now we can submit a training job with `arena`; it will download the source code from the git repository:

```
# arena submit tf \
--name=tf-tensorboard \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--env=TEST_TMPDIR=code/tensorflow-sample-code/ \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--tensorboard \
--logdir=/training_logs \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000"
configmap/tf-tensorboard-tfjob created
configmap/tf-tensorboard-tfjob labeled
service/tf-tensorboard-tensorboard created
deployment.extensions/tf-tensorboard-tensorboard created
tfjob.kubeflow.org/tf-tensorboard created
INFO[0001] The Job tf-tensorboard has been submitted successfully
INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status
```

> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`.

> `logdir` indicates the directory from which Tensorboard reads TensorFlow's event logs.

3\. List all the jobs

```
# arena list
NAME            STATUS   TRAINER  AGE  NODE
tf-tensorboard  RUNNING  TFJOB    0s   192.168.1.119
```
|
||||
|
||||
4\. Check the resource usage of the job
|
||||
|
||||
```
|
||||
# arena top job
|
||||
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
|
||||
tf-tensorboard RUNNING TFJOB 26s 192.168.1.119 1 1
|
||||
|
||||
|
||||
Total Allocated GPUs of Training Job:
|
||||
0
|
||||
|
||||
Total Requested GPUs of Training Job:
|
||||
1
|
||||
```
|
||||
|
||||
|
||||
|
||||
5\. Check the resource usage of the cluster
|
||||
|
||||
|
||||
```
|
||||
# arena top node
|
||||
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
|
||||
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
|
||||
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1
|
||||
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
|
||||
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
|
||||
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
|
||||
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
|
||||
-----------------------------------------------------------------------------------------
|
||||
Allocated/Total GPUs In Cluster:
|
||||
1/3 (33%)
|
||||
```
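As a quick sanity check, the summary line can be recomputed from the node table itself with `awk` (a hypothetical helper, not part of arena; the table rows are embedded as a string so the snippet is self-contained):

```shell
# Recompute "Allocated/Total GPUs In Cluster" from `arena top node`-style rows:
# column 4 is GPU(Total), column 5 is GPU(Allocated).
table='i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0'

# Sum the two GPU columns and format the summary the same way arena does.
summary=$(printf '%s\n' "$table" | awk '{total += $4; alloc += $5} END {printf "%d/%d (%d%%)", alloc, total, alloc * 100 / total}')
echo "$summary"
```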

6. Get the details of the specific job:

```
# arena get tf-tensorboard
NAME            STATUS   TRAINER  AGE  INSTANCE                               NODE
tf-tensorboard  RUNNING  tfjob    15s  tf-tensorboard-tfjob-586fcf4d6f-vtlxv  192.168.1.119
tf-tensorboard  RUNNING  tfjob    15s  tf-tensorboard-tfjob-worker-0          192.168.1.119

Your tensorboard will be available on:
192.168.1.117:30670
```

> Notice: you can access the TensorBoard at `192.168.1.117:30670`. If you can't reach it directly from your laptop, consider `sshuttle`, for example: `sshuttle -r root@47.89.59.51 192.168.0.0/16`



Congratulations! You've run the training job with `arena` successfully, and you can also check the TensorBoard easily.
This example shows how to use `Arena` to submit a distributed PyTorch job that mounts an NFS data volume. The sample downloads the source code from a git URL.

1. Set up an NFS server (refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html ).
```shell
# Install the NFS server
➜ yum install nfs-utils -y
# Create the local directory for the NFS server
➜ mkdir -p /root/nfs/data
# Configure the NFS server
➜ cat /etc/exports
/root/nfs/data *(rw,no_root_squash)
# Start the NFS server
➜ systemctl start nfs; systemctl start rpcbind
➜ systemctl enable nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
```
2. Download the training data to the shared directory of the NFS server.
```shell
# Get information about the NFS server with showmount; 172.16.0.200 is the host IP of the NFS server
➜ showmount -e 172.16.0.200
Export list for 172.16.0.200:
/root/nfs/data *
# Enter the shared directory
➜ cd /root/nfs/data
# Prepare the training data in the shared directory
➜ pwd
/root/nfs/data
# MNIST -> that's the training data we need
➜ ll
total 8.0K
drwxr-xr-x 4  502 games 4.0K Jun 17 16:05 data
drwxr-xr-x 4 root root  4.0K Jun 23 15:17 MNIST
```
3. Create the PV.
```shell
# Note: typesetting may cause YAML indentation problems
➜ cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pytorchdata
  labels:
    pytorchdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 172.16.0.200
    path: "/root/nfs/data"

➜ kubectl create -f nfs-pv.yaml
persistentvolume/pytorchdata created
➜ kubectl get pv | grep pytorchdata
pytorchdata   10Gi   RWX   Retain   Bound   default/pytorchdata   7m38s
```
4. Create the PVC.
```shell
➜ cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      pytorchdata: nas-mnist

➜ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim/pytorchdata created
➜ kubectl get pvc | grep pytorchdata
pytorchdata   Bound   pytorchdata   10Gi   RWX   2m3s
```
5. Check the data volume.
```shell
➜ arena data list
NAME         ACCESSMODE     DESCRIPTION             OWNER  AGE
pytorchdata  ReadWriteMany  this is the mnist demo  Tom    2m
```
6. Submit the PyTorch job, mounting the distributed storage volume with `--data pvc_name:container_path`.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-data \
        --gpus=1 \
        --workers=2 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --data=pytorchdata:/mnist_data \
        "python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data"
configmap/pytorch-data-pytorchjob created
configmap/pytorch-data-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-data created
INFO[0000] The Job pytorch-data has been submitted successfully
INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status
```
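The `--data pvc_name:container_path` flag wires the PVC into every instance of the job. Conceptually it corresponds to a volume plus a volume mount in the generated pod spec, roughly like the sketch below (field names other than the PVC name and mount path are illustrative, not arena's literal output):

```yaml
# Sketch of what --data=pytorchdata:/mnist_data implies for each pod
spec:
  volumes:
    - name: pytorchdata                 # volume name is illustrative
      persistentVolumeClaim:
        claimName: pytorchdata          # the PVC created above
  containers:
    - name: pytorch
      volumeMounts:
        - name: pytorchdata
          mountPath: /mnist_data        # the container_path after the colon
```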
7. Get the status of the volume `pytorchdata` in one of the instances with `kubectl describe`.
```shell
# Get the details of this job
➜ arena get pytorch-data
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 56s

NAME          STATUS     TRAINER     AGE  INSTANCE               NODE
pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-master-0  172.16.0.210
pytorch-data  SUCCEEDED  PYTORCHJOB  1m   pytorch-data-worker-0  172.16.0.210

# Get the status of volume `pytorchdata` from `pytorch-data-master-0`
➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3
```

## Arena supports assigning pytorch jobs to particular k8s nodes

1. Get the k8s cluster information:
```shell
➜ kubectl get nodes
NAME                        STATUS   ROLES    AGE     VERSION
cn-huhehaote.172.16.0.205   Ready    master   4h19m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206   Ready    master   4h18m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207   Ready    master   4h17m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208   Ready    <none>   4h13m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209   Ready    <none>   4h13m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210   Ready    <none>   4h13m   v1.16.9-aliyun.1
```
2. Label the nodes, for example:
```shell
# label 172.16.0.208 with gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=ok
node/cn-huhehaote.172.16.0.208 labeled
# label 172.16.0.209 with gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=ok
node/cn-huhehaote.172.16.0.209 labeled
# label 172.16.0.210 with ssd_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=ok
node/cn-huhehaote.172.16.0.210 labeled
```
3. When submitting a pytorch job, you can use `--selector` to decide which nodes the job runs on:
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-selector \
        --gpus=1 \
        --workers=2 \
        --selector gpu_node=ok \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        "python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-selector-pytorchjob created
configmap/pytorch-selector-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-selector created
INFO[0000] The Job pytorch-selector has been submitted successfully
INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status
```
4. Get the job details; you can see that the job runs only on the node with IP 172.16.0.209 and label `gpu_node=ok`.
```shell
➜ arena get pytorch-selector
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 14s

NAME              STATUS   TRAINER     AGE  INSTANCE                   NODE
pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-master-0  172.16.0.209
pytorch-selector  PENDING  PYTORCHJOB  14s  pytorch-selector-worker-0  172.16.0.209
```
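Under the hood, `--selector key=value` constrains scheduling the same way a Kubernetes `nodeSelector` does; a rough sketch of the effect on each generated pod (illustrative, not arena's literal output):

```yaml
# Sketch: the job's pods can only land on nodes carrying this label
spec:
  nodeSelector:
    gpu_node: "ok"
```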
## Arena supports submitting a pytorch job that tolerates k8s nodes with taints

1. Get the k8s cluster information:
```shell
➜ kubectl get node
NAME                        STATUS   ROLES    AGE    VERSION
cn-huhehaote.172.16.0.205   Ready    master   5h13m  v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206   Ready    master   5h12m  v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207   Ready    master   5h11m  v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208   Ready    <none>   5h7m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209   Ready    <none>   5h7m   v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210   Ready    <none>   5h7m   v1.16.9-aliyun.1
```
2. Add taints to some k8s nodes, for example:
```shell
# taint --> gpu_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.208 tainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.209 tainted
# taint --> ssd_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.210 tainted
```
3. If we tainted the wrong nodes, or want to restore a node's schedulability, we can remove the taints with the following commands:
```shell
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node-
node/cn-huhehaote.172.16.0.208 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node-
node/cn-huhehaote.172.16.0.209 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node-
node/cn-huhehaote.172.16.0.210 untainted
```
4. When submitting a job, you can tolerate tainted nodes with the `--toleration` option, for example `--toleration=gpu_node`. This parameter can be used multiple times with different taint keys.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-toleration \
        --gpus=1 \
        --workers=2 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --tensorboard \
        --logdir=/root/logs \
        --toleration gpu_node \
        "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-toleration-pytorchjob created
configmap/pytorch-toleration-pytorchjob labeled
service/pytorch-toleration-tensorboard created
deployment.apps/pytorch-toleration-tensorboard created
pytorchjob.kubeflow.org/pytorch-toleration created
INFO[0000] The Job pytorch-toleration has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status
```
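`--toleration gpu_node` lets the job's pods schedule onto nodes tainted with that key; the effect is roughly a Kubernetes toleration like the sketch below (the exact operator and effect that arena emits are an assumption here):

```yaml
# Sketch: tolerate any taint with key gpu_node
spec:
  tolerations:
    - key: gpu_node
      operator: Exists
```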
5. Get the details of this job.
```shell
➜ arena get pytorch-toleration
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME                STATUS   TRAINER     AGE  INSTANCE                     NODE
pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-master-0  172.16.0.209
pytorch-toleration  RUNNING  PYTORCHJOB  2m   pytorch-toleration-worker-0  172.16.0.209

Your tensorboard will be available on:
http://172.16.0.205:32091
```
6. You can use `--toleration all` to tolerate all node taints.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-toleration-all \
        --gpus=1 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --toleration all \
        "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo"
configmap/pytorch-toleration-all-pytorchjob created
configmap/pytorch-toleration-all-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-toleration-all created
INFO[0000] The Job pytorch-toleration-all has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-toleration-all
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 33s

NAME                    STATUS   TRAINER     AGE  INSTANCE                         NODE
pytorch-toleration-all  RUNNING  PYTORCHJOB  33s  pytorch-toleration-all-master-0  172.16.0.210
```
## Assign configuration files for pytorch jobs

You can pass configuration files to the containers when submitting jobs.

1. Prepare the configuration file to be mounted on the machine you submit from.
```shell
# prepare your config file
➜ cat /tmp/test-config.json
{
    "key": "job-config"
}
```
2. Submit the job, specifying the configuration file to mount with `--config-file`.
```shell
# arena submits the job with --config-file ${host-config-file}:${container-config-file}
# This parameter can be used multiple times to mount multiple configuration files
➜ arena --loglevel info submit pytorch \
        --name=pytorch-config-file \
        --gpus=1 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --config-file /tmp/test-config.json:/etc/config/config.json \
        "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo"
configmap/pytorch-config-file-pytorchjob created
configmap/pytorch-config-file-pytorchjob labeled
configmap/pytorch-config-file-a9cbad1b8719778 created
pytorchjob.kubeflow.org/pytorch-config-file created
INFO[0000] The Job pytorch-config-file has been submitted successfully
INFO[0000] You can run `arena get pytorch-config-file --type pytorchjob` to check the job status
```
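As the output above shows, the file is copied into a generated ConfigMap (`pytorch-config-file-a9cbad1b8719778`) and mounted into the container. A rough sketch of the wiring in the pod spec (the volume name and exact structure are illustrative assumptions, not arena's literal output):

```yaml
# Sketch: the config file travels as a ConfigMap mounted at the container path
spec:
  volumes:
    - name: config-file                             # illustrative name
      configMap:
        name: pytorch-config-file-a9cbad1b8719778   # generated name from the output above
  containers:
    - name: pytorch
      volumeMounts:
        - name: config-file
          mountPath: /etc/config                    # parent dir of the target path
```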
3. Get the details of this job.
```shell
➜ arena get pytorch-config-file --type pytorchjob
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 51s

NAME                 STATUS   TRAINER     AGE  INSTANCE                      NODE
pytorch-config-file  RUNNING  PYTORCHJOB  51s  pytorch-config-file-master-0  172.16.0.210
```
4. Use kubectl to check that the file is present in the container:
```
➜ kubectl exec -ti pytorch-config-file-master-0 -- cat /etc/config/config.json
{
    "key": "job-config"
}
```
## Arena supports Priority and Preemption for pytorch jobs

1. Create `PriorityClass` objects with the yaml below. Two priorities are defined here: `critical` and `medium`.
```shell
# Declarations of critical and medium
➜ cat priorityClass.yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000

---

apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000

# Create the two priority objects: critical and medium
➜ kubectl create -f priorityClass.yaml
priorityclass.scheduling.k8s.io/critical created
priorityclass.scheduling.k8s.io/medium created
```
2. Check the available resources. There are 3 worker nodes in total, each with 4 GPU cards.
```shell
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           0
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           0
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
3. Submit a `medium`-priority GPU job with 3 workers of 4 GPUs each, occupying all of the cluster's GPU resources. To make the effect easy to observe, we increase the number of training epochs so the training runs longer.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-priority-medium \
        --gpus=4 \
        --workers=3 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --priority=medium \
        "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200"
configmap/pytorch-priority-medium-pytorchjob created
configmap/pytorch-priority-medium-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-medium created
INFO[0000] The Job pytorch-priority-medium has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status
```
4. Get the details of this job. You can see that it is running.
```shell
➜ arena get pytorch-priority-medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 3m

NAME                     STATUS   TRAINER     AGE  INSTANCE                          NODE
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-master-0  172.16.0.208
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-0  172.16.0.210
pytorch-priority-medium  RUNNING  PYTORCHJOB  3m   pytorch-priority-medium-worker-1  172.16.0.209
```
5. Check the GPU card usage; all cards are occupied.
```shell
➜ arena top node
NAME                       IPADDRESS     ROLE    STATUS  GPU(Total)  GPU(Allocated)
cn-huhehaote.172.16.0.205  172.16.0.205  master  ready   0           0
cn-huhehaote.172.16.0.206  172.16.0.206  master  ready   0           0
cn-huhehaote.172.16.0.207  172.16.0.207  master  ready   0           0
cn-huhehaote.172.16.0.208  172.16.0.208  <none>  ready   4           4
cn-huhehaote.172.16.0.209  172.16.0.209  <none>  ready   4           4
cn-huhehaote.172.16.0.210  172.16.0.210  <none>  ready   4           4
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
12/12 (100%)
```
6. Submit a job with `critical` priority to trigger preemption.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-priority-critical \
        --gpus=1 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --priority=critical \
        "python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50"
configmap/pytorch-priority-critical-pytorchjob created
configmap/pytorch-priority-critical-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-critical created
INFO[0000] The Job pytorch-priority-critical has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-priority-critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 22s

NAME                       STATUS   TRAINER     AGE  INSTANCE                            NODE
pytorch-priority-critical  RUNNING  PYTORCHJOB  22s  pytorch-priority-critical-master-0  172.16.0.208
```
8. Check the status of the `medium`-priority job: it has become `FAILED`, and one of its instances was deleted due to preemption.
```shell
➜ arena get pytorch-priority-medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 1m

NAME                     STATUS  TRAINER     AGE  INSTANCE                          NODE
pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-master-0  172.16.0.210
pytorch-priority-medium  FAILED  PYTORCHJOB  2m   pytorch-priority-medium-worker-0  172.16.0.209
```
9. Check the events of `pytorch-priority-medium`: you can see that its instance `pytorch-priority-medium-worker-1` was evicted. The eviction happened because `pytorch-priority-critical-master-0` also requested resources on that node, and the node had no spare GPUs, so the lower-priority job was preempted by the higher-priority one.
```shell
➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1
```

## Specify the clean-up policy of pods after a pytorch job finishes

1. Submit a job with `--clean-task-policy` set to `All`. After the job finishes (`SUCCEEDED` or `FAILED`), all of its instances (pods) will be deleted. The default is `None`, which retains all pods.
```shell
➜ arena --loglevel info submit pytorch \
        --name=pytorch-clean-policy \
        --gpus=1 \
        --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
        --sync-mode=git \
        --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
        --clean-task-policy=All \
        "python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-clean-policy-pytorchjob created
configmap/pytorch-clean-policy-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-clean-policy created
INFO[0000] The Job pytorch-clean-policy has been submitted successfully
INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status
```

2. Get the job details. After the job finished, the instance `pytorch-clean-policy-master-0` was deleted.
```shell
# RUNNING
➜ arena get pytorch-clean-policy
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 18s

NAME                  STATUS   TRAINER     AGE  INSTANCE                       NODE
pytorch-clean-policy  RUNNING  PYTORCHJOB  18s  pytorch-clean-policy-master-0  172.16.0.209

# FINISHED
➜ arena get pytorch-clean-policy
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 37s

NAME  STATUS  TRAINER  AGE  INSTANCE  NODE
```
# Submit the training jobs with ImagePullSecrets

You can use a private registry when submitting jobs (including the TensorBoard image).
Assume the following images are in your private registry.
```shell
# pytorch
registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime
# tf
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu
# mpi
registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
# tensorboard (--tensorboard-image)
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel
```

## Contents
* <a href="#create_secret">Create ImagePullSecrets</a>
* <a href="#tfjob">TFJob With Secret</a>
* <a href="#mpijob">MPIJob With Secret</a>
* <a href="#pytorchjob">PyTorchJob With Secret</a>
* <a href="#arenaConfig">Load imagePullSecrets from the configuration of Arena</a>


## <a name="create_secret">Create ImagePullSecrets</a>
* Create a [Secret](https://kubernetes.io/docs/concepts/configuration/secret/) with kubectl. In this case, it's an [imagePullSecret](https://kubernetes.io/docs/concepts/containers/images/).
```shell script
kubectl create secret docker-registry [$Reg_Secret] --docker-server=[$Registry] --docker-username=[$Username] --docker-password=[$Password] --docker-email=[$Email]
```
> Note:
> [$Reg_Secret] is the name of the secret; you can choose it yourself.
> [$Registry] is the address of your private registry.
> [$Username] is the username for your private registry.
> [$Password] is the password for your private registry.
> [$Email] is your email address (optional).

For example:
```shell
kubectl create secret docker-registry \
    lumo-secret \
    --docker-server=registry.cn-huhehaote.aliyuncs.com \
    --docker-username=******@test.aliyunid.com \
    --docker-password=******
secret/lumo-secret created
```
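The secret takes effect by being referenced from the pod spec of every pod the job creates; in standard Kubernetes terms this is simply:

```yaml
# Pods pull private images using the named secret
spec:
  imagePullSecrets:
    - name: lumo-secret
```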
You can check that the secret was created:
```shell
# kubectl get secrets | grep lumo-secret
lumo-secret   kubernetes.io/dockerconfigjson   1   52s
```

## <a name="tfjob">TFJob With Secret</a>
Submit the job using `--image-pull-secrets` to specify the imagePullSecret.
1. Submit a tf job.
```shell
arena submit tf \
    --name=tf-git-with-secret \
    --working-dir=/root \
    --gpus=1 \
    --image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
    --data=training-data:/mnist_data \
    --tensorboard \
    --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
    --logdir=/mnist_data/tf_data/logs \
    --image-pull-secrets=lumo-secret \
    "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --log_dir /mnist_data/tf_data/logs --data_dir /mnist_data/tf_data/"
```
> Note:
> If you have several `imagePullSecrets` to use, you can pass `--image-pull-secrets` multiple times.
```shell
arena submit tf \
    --name=tf-git-with-secret \
    ... \
    --image-pull-secrets=lumo-secret \
    --image-pull-secrets=king-secret \
    --image-pull-secrets=test-secret
    ...
```
2. Get the details of the job.
```shell
# arena get tf-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 17s

NAME                STATUS   TRAINER  AGE  INSTANCE                    NODE
tf-git-with-secret  RUNNING  TFJOB    17s  tf-git-with-secret-chief-0  172.16.0.202

Your tensorboard will be available on:
http://172.16.0.198:30080
```

## <a name="mpijob">MPIJob With Secret</a>
Submit the job using `--image-pull-secrets` to specify the imagePullSecret.
1. Submit an mpi job.
```shell
arena submit mpi \
    --name=mpi-dist-with-secret \
    --gpus=1 \
    --workers=2 \
    --image=registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    --env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
    --sync-mode=git \
    --sync-source=https://github.com/tensorflow/benchmarks.git \
    --tensorboard \
    --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
    --image-pull-secrets=lumo-secret \
    "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Get the details of the job.
```shell
# arena get mpi-dist-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 9m

NAME                  STATUS   TRAINER  AGE  INSTANCE                             NODE
mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-launcher-v8sgt  172.16.0.201
mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-worker-0        172.16.0.201
mpi-dist-with-secret  RUNNING  MPIJOB   9m   mpi-dist-with-secret-worker-1        172.16.0.202

Your tensorboard will be available on:
http://172.16.0.198:30450
```

## <a name="pytorchjob">PyTorchJob With Secret</a>
Submit the job using `--image-pull-secrets` to specify the imagePullSecret.
1. Submit a pytorch job.
```shell
arena submit pytorch \
    --name=pytorch-git-with-secret \
    --gpus=1 \
    --working-dir=/root \
    --image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
    --data=training-data:/mnist_data \
    --tensorboard \
    --tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
    --logdir=/mnist_data/pytorch_data/logs \
    --image-pull-secrets=lumo-secret \
    "python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend nccl --dir /mnist_data/pytorch_data/logs --data /mnist_data/pytorch_data/"
```
2. Get the details of the job.
```shell
# arena get pytorch-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME                     STATUS   TRAINER     AGE  INSTANCE                          NODE
pytorch-git-with-secret  RUNNING  PYTORCHJOB  2m   pytorch-git-with-secret-master-0  172.16.0.202

Your tensorboard will be available on:
http://172.16.0.198:31155
```
## <a name="arenaConfig">Load imagePullSecrets from the configuration of Arena</a>
If you don't want to pass `--image-pull-secrets` on every submission, you can configure it in Arena's configuration file instead.
Open the file `~/.arena/config` (create it if it doesn't exist) and add the following configuration.
```shell
imagePullSecrets=lumo-secret,king-secret
```
> Note:
> `--image-pull-secrets` overrides the value in `~/.arena/config`.
This guide walks through the steps to deploy and serve a custom model with KFServing.

1. Setup

Follow the KFServing [guide](https://github.com/kubeflow/kfserving#install-kfserving) to install KFServing. As prerequisites, make sure 8 GB of memory and 4 CPU cores are available in your environment.

2. Submit your serving job to KFServing
```shell script
arena serve kfserving --name=max-object-detector --port=5000 --image=codait/max-object-detector --model-type=custom
configmap/max-object-detector-202008221942-kfserving created
configmap/max-object-detector-202008221942-kfserving labeled
inferenceservice.serving.kubeflow.org/max-object-detector-202008221942 created
```
3. List the serving job you just submitted
```shell script
arena serve list
NAME                 TYPE       VERSION       DESIRED  AVAILABLE  ENDPOINT_ADDRESS  PORTS
max-object-detector  KFSERVING  202008221942  1        1          10.97.52.65       http:80
```
4. Test the model service

##### Determine the ingress IP and ports
The first step is to [determine the ingress IP](https://github.com/kubeflow/kfserving/blob/master/README.md#determine-the-ingress-ip-and-ports) and ports, then set `INGRESS_HOST` and `INGRESS_PORT`.

This example uses the [codait/max-object-detector](https://github.com/IBM/MAX-Object-Detector) image. The Max Object Detector API server expects a POST request to the `/model/predict` endpoint that includes an image as `multipart/form-data` and an optional `threshold` query string.

```shell script
MODEL_NAME=max-object-detector-202008221942
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
INGRESS_HOST=localhost
INGRESS_PORT=80
curl -v -F "image=@27-kfserving-custom.jpg" http://${INGRESS_HOST}:${INGRESS_PORT}/model/predict -H "Host: ${SERVICE_HOSTNAME}"
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 80 (#0)
> POST /model/predict HTTP/1.1
> Host: max-object-detector-202008221942.default.example.com
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 125769
> Content-Type: multipart/form-data; boundary=------------------------56b67bc60fc7bdc7
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 380
< content-type: application/json
< date: Sun, 23 Aug 2020 03:27:14 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 3566
<
{"status": "ok", "predictions": [{"label_id": "1", "label": "person", "probability": 0.9440352320671082, "detection_box": [0.12420991063117981, 0.12507185339927673, 0.8423266410827637, 0.5974075794219971]}, {"label_id": "18", "label": "dog", "probability": 0.8645510673522949, "detection_box": [0.10447663068771362, 0.17799144983291626, 0.8422801494598389, 0.7320016026496887]}]}
* Connection #0 to host localhost left intact
* Closing connection 0
```
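The JSON body returned by the model can be post-processed on the client. A rough sketch that extracts the detected labels with `grep` (not a real JSON parser; the `/tmp/predict.json` path and the truncated payload are illustrative):

```shell
# Save a truncated copy of the response body shown above (illustrative).
cat > /tmp/predict.json <<'EOF'
{"status": "ok", "predictions": [{"label_id": "1", "label": "person", "probability": 0.944}, {"label_id": "18", "label": "dog", "probability": 0.865}]}
EOF
# Pull out each "label" value; prints one label per line.
grep -o '"label": "[^"]*"' /tmp/predict.json | cut -d'"' -f4
# → person
# → dog
```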
5. Delete the serving job
```shell script
arena serve delete max-object-detector --version=202008221942
inferenceservice.serving.kubeflow.org "max-object-detector-202008221942" deleted
configmap "max-object-detector-202008221942-kfserving" deleted
INFO[0001] The Serving job max-object-detector with version 202008221942 has been deleted successfully
```

---

This guide walks through the steps to submit an elastic training job with Horovod.

1. Build an image for the training environment

You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
In addition, you can also build your own image with the help of this document: [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).

2. Submit an elastic training job. Example code from [tensorflow2_mnist_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/tensorflow2_mnist_elastic.py)
```shell script
arena submit etjob \
    --name=elastic-training \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun
    -np \$((\${workers}*\${gpus}))
    --min-np \$((\${minWorkers}*\${gpus}))
    --max-np \$((\${maxWorkers}*\${gpus}))
    --host-discovery-script /usr/local/bin/discover_hosts.sh
    python /examples/elastic/tensorflow2_mnist_elastic.py
    "
```
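The `-np`, `--min-np` and `--max-np` values in the command above come from shell arithmetic over the submit flags. With this example's values (`--workers=3 --gpus=1 --min-workers=1 --max-workers=9`) the expansion works out as:

```shell
# Flag values taken from this example's submit command.
workers=3 gpus=1 minWorkers=1 maxWorkers=9

# Same arithmetic expansion as in the quoted horovodrun command.
echo "-np $((workers*gpus)) --min-np $((minWorkers*gpus)) --max-np $((maxWorkers*gpus))"
# → -np 3 --min-np 1 --max-np 9
```

So the job starts with 3 processes, can shrink to 1, and can grow to 9.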
Output:
```
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
```

3. List your job.
```shell script
arena list
```
Output:
```
NAME              STATUS   TRAINER  AGE  NODE
elastic-training  RUNNING  ETJOB    52s  192.168.0.116
```

4. Get your job details.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    1m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    1m   elastic-training-worker-2  192.168.0.116
```

5. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2170 Loss: 0.021992
Tue Sep 8 08:32:50 2020[0]<stdout>:Step #2180 Loss: 0.000902
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2180 Loss: 0.023190
Tue Sep 8 08:32:50 2020[2]<stdout>:Step #2180 Loss: 0.013149
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2190 Loss: 0.029536
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2190 Loss: 0.017537
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2190 Loss: 0.018273
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2200 Loss: 0.038399
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2200 Loss: 0.007017
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2200 Loss: 0.017495
```

6. Scale out your job. This will add one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599548177-scaleout created
configmap/elastic-training-1599548177-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully
```

7. Get your job details. We can see that the new worker (`elastic-training-worker-3`) is now `RUNNING`.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    2m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-2  192.168.0.116
elastic-training  RUNNING  ETJOB    2m   elastic-training-worker-3  192.168.0.117
```

8. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3140 Loss: 0.014412
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3140 Loss: 0.004425
Tue Sep 8 08:33:33 2020[3]<stdout>:Step #3150 Loss: 0.000513
Tue Sep 8 08:33:33 2020[2]<stdout>:Step #3150 Loss: 0.062282
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3150 Loss: 0.020650
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3150 Loss: 0.008056
Tue Sep 8 08:33:34 2020[3]<stdout>:Step #3160 Loss: 0.002170
Tue Sep 8 08:33:34 2020[2]<stdout>:Step #3160 Loss: 0.009676
Tue Sep 8 08:33:34 2020[1]<stdout>:Step #3160 Loss: 0.051425
Tue Sep 8 08:33:34 2020[0]<stdout>:Step #3160 Loss: 0.023769
```

9. Scale in your job. This will remove one worker from the job.
```shell script
arena scalein etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599554041-scalein created
configmap/elastic-training-1599554041-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1599554041 created
INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully
```

10. Get your job details. We can see that `elastic-training-worker-3` has been removed.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m

NAME              STATUS   TRAINER  AGE  INSTANCE                   NODE
elastic-training  RUNNING  ETJOB    3m   elastic-training-launcher  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-0  192.168.0.114
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-1  192.168.0.116
elastic-training  RUNNING  ETJOB    3m   elastic-training-worker-2  192.168.0.116
```

11. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5210 Loss: 0.005627
Tue Sep 8 08:34:43 2020[2]<stdout>:Step #5220 Loss: 0.002142
Tue Sep 8 08:34:43 2020[1]<stdout>:Step #5220 Loss: 0.002978
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5220 Loss: 0.011404
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5230 Loss: 0.000689
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5230 Loss: 0.024597
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5230 Loss: 0.040936
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5240 Loss: 0.000125
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5240 Loss: 0.026498
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5240 Loss: 0.000308
```

---

This guide walks through the steps to submit an elastic training job with Horovod.

1. Build an image for the training environment

You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
In addition, you can also build your own image with the help of this document: [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).

2. Submit an elastic training job. Example code from [pytorch_synthetic_benchmark_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch_synthetic_benchmark_elastic.py)
```shell script
arena submit etjob \
    --name=elastic-training-synthetic \
    --gpus=1 \
    --workers=3 \
    --max-workers=9 \
    --min-workers=1 \
    --image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
    --working-dir=/examples \
    "horovodrun
    --verbose
    --log-level=DEBUG
    -np \$((\${workers}*\${gpus}))
    --min-np \$((\${minWorkers}*\${gpus}))
    --max-np \$((\${maxWorkers}*\${gpus}))
    --start-timeout 100
    --elastic-timeout 1000
    --host-discovery-script /usr/local/bin/discover_hosts.sh
    python /examples/elastic/pytorch_synthetic_benchmark_elastic.py
    --num-iters=10000
    --num-warmup-batches=0"
```
Output:
```
configmap/elastic-training-synthetic-etjob created
configmap/elastic-training-synthetic-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training-synthetic created
INFO[0000] The Job elastic-training-synthetic has been submitted successfully
INFO[0000] You can run `arena get elastic-training-synthetic --type etjob` to check the job status
```

3. List your job.
```shell script
arena list
```
Output:
```
NAME                        STATUS   TRAINER  AGE  NODE
elastic-training-synthetic  RUNNING  ETJOB    2m   192.168.0.112
```

4. Get your job details.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m

NAME                        STATUS   TRAINER  AGE  INSTANCE                             NODE
elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-launcher  192.168.0.112
elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-0  192.168.0.116
elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-1  192.168.0.117
elastic-training-synthetic  RUNNING  ETJOB    3m   elastic-training-synthetic-worker-2  192.168.0.116
```

5. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:24:20 2020[0]<stdout>:Iter #54: 95.3 img/sec per GPU
Tue Sep 8 09:24:23 2020[0]<stdout>:Iter #55: 95.3 img/sec per GPU
Tue Sep 8 09:24:27 2020[0]<stdout>:Iter #56: 94.6 img/sec per GPU
Tue Sep 8 09:24:30 2020[0]<stdout>:Iter #57: 97.1 img/sec per GPU
Tue Sep 8 09:24:33 2020[0]<stdout>:Iter #58: 99.7 img/sec per GPU
Tue Sep 8 09:24:36 2020[0]<stdout>:Iter #59: 99.8 img/sec per GPU
Tue Sep 8 09:24:40 2020[0]<stdout>:Iter #60: 98.0 img/sec per GPU
Tue Sep 8 09:24:43 2020[0]<stdout>:Iter #61: 97.1 img/sec per GPU
Tue Sep 8 09:24:46 2020[0]<stdout>:Iter #62: 96.1 img/sec per GPU
Tue Sep 8 09:24:50 2020[0]<stdout>:Iter #63: 100.4 img/sec per GPU
```

6. Scale out your job. This will add one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557124-scaleout created
configmap/elastic-training-synthetic-1599557124-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-synthetic-1599557124 created
INFO[0000] The scaleout job elastic-training-synthetic-1599557124 has been submitted successfully
```

7. Get your job details. We can see that the new worker (`elastic-training-synthetic-worker-3`) is now `RUNNING`.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 5m

NAME                        STATUS   TRAINER  AGE  INSTANCE                             NODE
elastic-training-synthetic  RUNNING  ETJOB    5m   elastic-training-synthetic-launcher  192.168.0.112
elastic-training-synthetic  RUNNING  ETJOB    5m   elastic-training-synthetic-worker-0  192.168.0.116
elastic-training-synthetic  RUNNING  ETJOB    5m   elastic-training-synthetic-worker-1  192.168.0.117
elastic-training-synthetic  RUNNING  ETJOB    5m   elastic-training-synthetic-worker-2  192.168.0.116
elastic-training-synthetic  RUNNING  ETJOB    5m   elastic-training-synthetic-worker-3  192.168.0.112
```

8. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:26:03 2020[0]<stdout>:Iter #76: 65.0 img/sec per GPU
Tue Sep 8 09:26:08 2020[0]<stdout>:Iter #77: 64.0 img/sec per GPU
Tue Sep 8 09:26:13 2020[0]<stdout>:Iter #78: 65.4 img/sec per GPU
Tue Sep 8 09:26:18 2020[0]<stdout>:Iter #79: 64.4 img/sec per GPU
Tue Sep 8 09:26:23 2020[0]<stdout>:Iter #80: 62.9 img/sec per GPU
Tue Sep 8 09:26:28 2020[0]<stdout>:Iter #81: 64.0 img/sec per GPU
Tue Sep 8 09:26:33 2020[0]<stdout>:Iter #82: 64.4 img/sec per GPU
Tue Sep 8 09:26:38 2020[0]<stdout>:Iter #83: 64.9 img/sec per GPU
Tue Sep 8 09:26:43 2020[0]<stdout>:Iter #84: 62.7 img/sec per GPU
Tue Sep 8 09:26:48 2020[0]<stdout>:Iter #85: 64.2 img/sec per GPU
```

9. Scale in your job. This will remove one worker from the job.
```shell script
arena scalein etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557271-scalein created
configmap/elastic-training-synthetic-1599557271-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-synthetic-1599557271 created
INFO[0000] The scalein job elastic-training-synthetic-1599557271 has been submitted successfully
```

10. Get your job details. We can see that `elastic-training-synthetic-worker-3` has been removed.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 7m

NAME                        STATUS   TRAINER  AGE  INSTANCE                             NODE
elastic-training-synthetic  RUNNING  ETJOB    7m   elastic-training-synthetic-launcher  192.168.0.112
elastic-training-synthetic  RUNNING  ETJOB    7m   elastic-training-synthetic-worker-0  192.168.0.116
elastic-training-synthetic  RUNNING  ETJOB    7m   elastic-training-synthetic-worker-1  192.168.0.117
elastic-training-synthetic  RUNNING  ETJOB    7m   elastic-training-synthetic-worker-2  192.168.0.116
```

11. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
DEBUG:root:host elastic-training-synthetic-worker-3 has been blacklisted, ignoring exit from local_rank=0
Process 3 exit with status code 134.
Tue Sep 8 09:27:56 2020[0]<stdout>:Iter #97: 96.0 img/sec per GPU
Tue Sep 8 09:28:00 2020[0]<stdout>:Iter #98: 95.4 img/sec per GPU
Tue Sep 8 09:28:03 2020[0]<stdout>:Iter #99: 96.9 img/sec per GPU
Tue Sep 8 09:28:06 2020[0]<stdout>:Iter #100: 97.2 img/sec per GPU
Tue Sep 8 09:28:10 2020[0]<stdout>:Iter #101: 98.5 img/sec per GPU
Tue Sep 8 09:28:13 2020[0]<stdout>:Iter #102: 95.8 img/sec per GPU
Tue Sep 8 09:28:16 2020[0]<stdout>:Iter #103: 97.3 img/sec per GPU
Tue Sep 8 09:28:20 2020[0]<stdout>:Iter #104: 97.3 img/sec per GPU
Tue Sep 8 09:28:23 2020[0]<stdout>:Iter #105: 98.9 img/sec per GPU
```

---

Arena supports and simplifies distributed TensorFlow training (PS/worker mode).

1. To run a distributed TensorFlow training job, you need to specify:

- GPUs of each worker (only for GPU workloads)
- The number of workers (required)
- The number of PS (required)
- The docker image of the worker (required)
- The docker image of the PS (required)
- The port of the worker (default is 22222)
- The port of the PS (default is 22223)

The following command is an example. It defines 2 workers and 1 PS, and each worker has 1 GPU. The source code for the workers and the PS is pulled from git, and TensorBoard is enabled.

```
# arena submit tf \
    --name=tf-dist-git \
    --gpus=1 \
    --workers=2 \
    --worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
    --ps=1 \
    --ps-image=tensorflow/tensorflow:1.5.0-devel \
    --tensorboard \
    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"

configmap/tf-dist-git-tfjob created
configmap/tf-dist-git-tfjob labeled
service/tf-dist-git-tensorboard created
deployment.extensions/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0001] The Job tf-dist-git has been submitted successfully
INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
```

**Note**: If the job or pod fails and the logs show that the git code could not be cloned, the cause is usually cross-border network connectivity (for example, when running containers in some countries such as China), not arena itself.

2\. Get the details of the specific job

```
# arena get tf-dist-git
NAME         STATUS   TRAINER  AGE  INSTANCE                            NODE
tf-dist-git  RUNNING  tfjob    55s  tf-dist-git-tfjob-594d59789c-lrfsk  192.168.1.119
tf-dist-git  RUNNING  tfjob    55s  tf-dist-git-tfjob-ps-0              192.168.1.118
tf-dist-git  RUNNING  tfjob    55s  tf-dist-git-tfjob-worker-0          192.168.1.119
tf-dist-git  RUNNING  tfjob    55s  tf-dist-git-tfjob-worker-1          192.168.1.120

Your tensorboard will be available on:
192.168.1.117:32298
```

3\. Check the TensorBoard

![](4-tfjob-tensorboard.jpg)

4\. Get the TFJob dashboard

```
# arena logviewer tf-dist-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-dist-git-tfjob
```

![](4-tfjob-logviewer-distributed.jpg)

Congratulations! You've run the distributed training job with `arena` successfully.

---

A distributed TensorFlow job can have several roles: Worker, PS, Chief and Evaluator. Sometimes you need to decide the order in which they are created, for example creating the "Worker" role first and the "PS" role second. This guide will help you.

1. Assume you want to submit a distributed TensorFlow job with four roles (Worker, PS, Chief, Evaluator), and you need the roles to start in the sequence "Worker,Chief,PS,Evaluator". Simply add the option `--role-sequence` when submitting the job. The following command is an example:

```
$ arena submit tfjob \
    --name=tf-distributed-test \
    --role-sequence "Worker,Chief,PS,Evaluator" \
    --chief \
    --evaluator \
    --gpus=1 \
    --workers=1 \
    --worker-image=cheyang/tf-mnist-distributed:gpu \
    --ps-image=cheyang/tf-mnist-distributed:cpu \
    --ps=1 \
    --tensorboard \
    --tensorboard-image="registry.cn-hongkong.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel" \
    "python /app/main.py"
```

`--role-sequence Worker,Chief,PS,Evaluator` is the same as `--role-sequence w,c,p,e`, where "w" represents "Worker", "c" represents "Chief", "p" represents "PS" and "e" represents "Evaluator".

2. Make sure at least one pod belonging to the tfjob "tf-distributed-test" has the annotation "job-role-sequence=Worker,Chief,PS,Evaluator":

```
$ kubectl get po -l tf-job-name=tf-distributed-test
NAME                              READY   STATUS              RESTARTS   AGE
tf-distributed-test-chief-0       0/1     ContainerCreating   0          5m47s
tf-distributed-test-evaluator-0   0/1     ContainerCreating   0          5m47s
tf-distributed-test-ps-0          1/1     Running             0          5m47s
tf-distributed-test-worker-0      0/1     ContainerCreating   0          5m47s

$ kubectl get po tf-distributed-test-worker-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    job-role-sequence: Worker,Chief,PS,Evaluator
    kubernetes.io/psp: ack.privileged
    requestGPUsOfJobOwner: "3"
  creationTimestamp: 2021-02-22T03:07:49Z
....
```

3. You can validate it by querying the tf-operator logs.

```
$ kubectl get po -n arena-system
NAME                                READY   STATUS    RESTARTS   AGE
et-operator-576887864c-lvmrs        1/1     Running   1          19d
mpi-operator-66b4cf9b76-kl2fm       1/1     Running   0          26d
pytorch-operator-8545c46f98-cffgw   1/1     Running   4          26d
tf-job-dashboard-78478bfc45-msbzn   1/1     Running   0          19d
tf-job-operator-554d594cff-5vxfg    1/1     Running   0          101m
```

Query the logs of tf-job-operator-554d594cff-5vxfg.

```
$ kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system | grep "the Role Sequence" | tail -n 1
{"filename":"tensorflow/controller.go:453","job":"default.tf-distributed-test","level":"info","msg":"the Role Sequence of job tf-distributed-test is: [Worker Chief PS Evaluator]","time":"2021-02-01T13:22:23Z","uid":"7db02629-4591-4e0c-a938-c6e4a1cfc074"}
```

As you can see, the sequence in which tf-operator handles the tfjob roles matches the sequence you specified.

If you don't want to specify the role sequence every time you submit a tfjob, you can save it to the arena configuration file "~/.arena/config", like:

```
tfjob_role_sequence = Worker,PS,Chief,Evaluator
```

or

```
tfjob_role_sequence = w,p,c,e
```
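The abbreviation mapping described above can be sketched as a small shell helper (the `expand_role` function is hypothetical, not part of arena):

```shell
# Hypothetical helper mirroring the documented abbreviations:
# w=Worker, c=Chief, p=PS, e=Evaluator; full names pass through unchanged.
expand_role() {
  case "$1" in
    w) echo Worker ;;
    c) echo Chief ;;
    p) echo PS ;;
    e) echo Evaluator ;;
    *) echo "$1" ;;
  esac
}

# Expand a "w,p,c,e" style sequence, one role per line.
for tok in $(echo "w,p,c,e" | tr ',' ' '); do expand_role "$tok"; done
# → Worker
# → PS
# → Chief
# → Evaluator
```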

---

## Support Multiple Users

In some scenarios, you may want multiple users to use arena, each with different permissions on the Kubernetes cluster. This guide shows how to set that up.

Assume there are 3 arena users whose privileges are described in the following table:

| User Name | User Namespace | Quota | Additional Privileges |
| --------- | -------------- | ----- | --------------------- |
| alex | workplace1 | - | - |
| bob | workplace2 | limits.cpu: "10", limits.memory: "20Gi", requests.cpu: "5", requests.memory: "10Gi" | list the jobs in the cluster scope |
| tom | workplace3 | requests.nvidia.com/gpu: 20 | list the jobs in the namespace scope |

The following steps describe how to generate the kubeconfig files for these users.

1. Prepare the user configuration file. You can refer to `~/charts/user/values.yaml` or `/charts/user/values.yaml` when writing your own.

The user alex doesn't need a user configuration file, because he uses the default configuration.

The user bob's configuration file is defined as:

```
quota:
  limits.cpu: "10"
  requests.cpu: "5"
  requests.memory: "10Gi"
  limits.memory: "20Gi"

clusterRoles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
```

and stored at /tmp/bob.yaml.

The user tom's configuration file is defined as:

```
quota:
  requests.nvidia.com/gpu: 5

roles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
```

and stored at /tmp/tom.yaml.

2. Generate the user kubeconfig; the script `arena-gen-kubeconfig.sh` can help you:

```
$ arena-gen-kubeconfig.sh -h

Usage:

    arena-gen-kubeconfig.sh [OPTION1] [OPTION2] ...

Options:
    --user-name <USER_NAME>                   Specify the user name
    --user-namespace <USER_NAMESPACE>         Specify the user namespace
    --user-config <USER_CONFIG>               Specify the user config,refer the ~/charts/user/values.yaml or /charts/user/values.yaml
    --force                                   If the user has been existed,force to update the user
    --delete                                  Delete the user
    --output <KUBECONFIG|USER_MANIFEST_YAML>  Specify the output kubeconfig file or the user manifest yaml
    --admin-kubeconfig <ADMIN_KUBECONFIG>     Specify the Admin kubeconfig file
    --cluster-url <CLUSTER_URL>               Specify the Cluster URL,if not specified,the script will detect the cluster url
    --create-user-yaml                        Only generate the user manifest yaml,don't apply it and create kubeconfig file
```

Firstly, create the kubeconfig file for alex:

```
$ arena-gen-kubeconfig.sh --user-name alex --user-namespace workplace1 --output /tmp/alex.kubeconfig --force

2021-02-08/11:38:44 DEBUG found arena charts in /Users/yangjunfeng/charts
2021-02-08/11:38:44 DEBUG the user configuration not set,use the default configuration file
resourcequota/arena-quota-alex created
serviceaccount/alex created
clusterrole.rbac.authorization.k8s.io/arena:workplace1:alex configured
clusterrolebinding.rbac.authorization.k8s.io/arena:workplace1:alex configured
role.rbac.authorization.k8s.io/arena:alex created
rolebinding.rbac.authorization.k8s.io/arena:alex created
configmap/arena-user-alex created
Cluster "https://192.168.1.42:6443" set.
User "alex" set.
Context "registry" created.
Switched to context "registry".
2021-02-08/11:38:48 DEBUG kubeconfig written to file /tmp/alex.kubeconfig
```

As you can see, the kubeconfig file has been created at /tmp/alex.kubeconfig.

Secondly, create the kubeconfig file for bob:

```
$ arena-gen-kubeconfig.sh --user-name bob --user-namespace workplace2 --user-config /tmp/bob.yaml --output /tmp/bob.kubeconfig --force
```

The kubeconfig file will be stored at /tmp/bob.kubeconfig.

Thirdly, create the kubeconfig file for tom:

```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --user-config /tmp/tom.yaml --output /tmp/tom.kubeconfig --force
```

The kubeconfig file will be stored at /tmp/tom.kubeconfig.

3. To make a kubeconfig file take effect, set the KUBECONFIG environment variable:

```
$ export KUBECONFIG=/tmp/alex.kubeconfig
```

4. Now you can use arena to submit your training jobs.

5. If you want to delete a user, execute a command like:

```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --delete
```

---

`arena` allows to mount multiple data volumes into the training jobs. There is an example that mounts `data volume` into the training job.
|
||||
|
||||
|
||||
1. You need to create `/data` in the NFS Server, and prepare `mnist data`
|
||||
|
||||
```
|
||||
# mkdir -p /nfs
|
||||
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
|
||||
# mkdir -p /data
|
||||
# cd /data
|
||||
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz
|
||||
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz
|
||||
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz
|
||||
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz
|
||||
# cd /
|
||||
# umount /nfs
|
||||
```
|
||||
|
||||
2\. Create Persistent Volume. Moidfy `NFS_SERVER_IP` to yours.
|
||||
|
||||
```
|
||||
# cat nfs-pv.yaml
|
||||
apiVersion: v1
|
||||
kind: PersistentVolume
|
||||
metadata:
|
||||
name: tfdata
|
||||
labels:
|
||||
tfdata: nas-mnist
|
||||
spec:
|
||||
persistentVolumeReclaimPolicy: Retain
|
||||
capacity:
|
||||
storage: 10Gi
|
||||
accessModes:
|
||||
- ReadWriteMany
|
||||
nfs:
|
||||
server: NFS_SERVER_IP
|
||||
path: "/data"
|
||||
|
||||
# kubectl create -f nfs-pv.yaml
|
||||
```
|
||||
|
||||
3\. Create Persistent Volume Claim.
|
||||
|
||||
```
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfdata: nas-mnist

# kubectl create -f nfs-pvc.yaml
```

> Notice: it is suggested to add the `description` and `owner` annotations.
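The PVC above binds to the PV from step 2 because its `selector.matchLabels` matches the PV's labels. A minimal sketch of that matching rule, in Python for illustration only (the function name and the simplified logic are assumptions; Kubernetes also checks capacity, access modes, and storage class when binding):

```python
def selector_matches(selector_labels, pv_labels):
    """Return True if every label required by the PVC's matchLabels
    is present on the PV with the same value. This is a simplified
    view of one part of the Kubernetes PV/PVC binding check."""
    return all(pv_labels.get(key) == value
               for key, value in selector_labels.items())

pv_labels = {"tfdata": "nas-mnist"}      # labels on the PV above
pvc_selector = {"tfdata": "nas-mnist"}   # matchLabels in the PVC above

print(selector_matches(pvc_selector, pv_labels))  # True: the PVC can bind to this PV
```

This is why the `tfdata: nas-mnist` label appears in both manifests: removing it from either side would leave the PVC pending.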
4\. Check the data volume
|
||||
|
||||
```
# arena data list
NAME    ACCESSMODE     DESCRIPTION             OWNER   AGE
tfdata  ReadWriteMany  this is for mnist demo  myteam  43d
```

5\. Now we can submit a distributed training job with `arena`. It will download the source code from GitHub and mount the data volume `tfdata` to `/mnist_data`.
```
# arena submit tf --name=tf-dist-data \
    --gpus=1 \
    --workers=2 \
    --workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
    --syncMode=git \
    --syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
    --ps=1 \
    --psImage=tensorflow/tensorflow:1.5.0-devel \
    --tensorboard \
    --data=tfdata:/mnist_data \
    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
```

> `--data` specifies the data volume to mount into all the tasks of the job, in the format `<name_of_datasource>:<mount_point_on_job>`. In this example, the data volume is `tfdata`, and the target directory is `/mnist_data`.
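The `<name_of_datasource>:<mount_point_on_job>` format can be sketched as a small parser. This is an illustrative helper, not arena's actual code; the function name is made up, and arena performs this split internally:

```python
def parse_data_flag(value):
    """Split an arena --data argument of the form
    <name_of_datasource>:<mount_point_on_job> into its two parts.
    Splits on the first ':' only, since the mount point is a path."""
    name, sep, mount = value.partition(":")
    if not sep or not name or not mount:
        raise ValueError(f"expected <name>:<mount>, got {value!r}")
    return name, mount

print(parse_data_flag("tfdata:/mnist_data"))  # ('tfdata', '/mnist_data')
```

Note that the data-source name must match a PVC visible to the job's namespace; a typo here surfaces as a mount failure when the pods start.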

6\. From the logs, we can see that the training data is extracted from `/mnist_data` instead of being downloaded from the internet directly.
```
# arena logs tf-dist-data
...
Extracting /mnist_data/train-images-idx3-ubyte.gz
Extracting /mnist_data/train-labels-idx1-ubyte.gz
Extracting /mnist_data/t10k-images-idx3-ubyte.gz
Extracting /mnist_data/t10k-labels-idx1-ubyte.gz
...
Accuracy at step 960: 0.9753
Accuracy at step 970: 0.9739
Accuracy at step 980: 0.9756
Accuracy at step 990: 0.9777
Adding run metadata for 999
```