Compare commits

189 Commits

Author SHA1 Message Date
dependabot[bot] f8ee31410c
chore(deps): bump actions/setup-java from 4 to 5 (#1366)
Bumps [actions/setup-java](https://github.com/actions/setup-java) from 4 to 5.
- [Release notes](https://github.com/actions/setup-java/releases)
- [Commits](https://github.com/actions/setup-java/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-java
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-26 02:37:19 +00:00
dependabot[bot] ec5255280c
chore(deps): bump actions/checkout from 4 to 5 (#1359)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:36:12 +00:00
dependabot[bot] d1f7be63ab
chore(deps): bump actions/download-artifact from 4 to 5 (#1356)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:35:12 +00:00
dependabot[bot] a190ca253b
chore(deps): bump github.com/spf13/pflag from 1.0.6 to 1.0.7 (#1352)
---
updated-dependencies:
- dependency-name: github.com/spf13/pflag
  dependency-version: 1.0.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-23 05:34:59 +00:00
dependabot[bot] 695c2c67f0
chore(deps): bump golang.org/x/crypto from 0.39.0 to 0.40.0 (#1351)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.39.0 to 0.40.0.
- [Commits](https://github.com/golang/crypto/compare/v0.39.0...v0.40.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-22 02:37:58 +00:00
Yi Chen 75ec421d62
Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4 (#1350)
* Bump golang version from 1.23.10 to 1.24.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix go vet check

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

* Retrieve Helm version from go.mod file

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-07-11 14:56:52 +00:00
Yi Chen 25d7b1109e
Release v0.15.1 (#1344)
* Release v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-26 06:52:17 +00:00
dependabot[bot] d2d5f77a97
chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 (#1334)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.31.0 to 0.39.0.
- [Commits](https://github.com/golang/crypto/compare/v0.31.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:30:16 +00:00
dependabot[bot] c4ccb4ca7e
chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 (#1343)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.1 to 0.65.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.1...v0.65.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-version: 0.65.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:20:15 +00:00
Yi Chen aa33dc51b7
Bump golang version from 1.22.7 to 1.23.10 (#1345)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-25 06:06:16 +00:00
Yi Chen 9e84dad37a
Fix golangci-lint issues (#1341)
* Bump golangci-lint version from v1.57.2 to v2.1.6

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add golangci-lint.yaml

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix golangci-lint issues

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 17:04:14 +00:00
Yi Chen c9d5653de3
Add support for configuring tolerations (#1337)
* Add support for configuring tolerations

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add basic Helm chart unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add Helm chart unit tests to GitHub CI workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 13:01:13 +00:00
Yi Chen 4618e321ab
Update uninstall bash script (#1335)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:58:14 +00:00
Yi Chen ca7bf97da4
[CI] Add CI workflow for releasing Arena images (#1340)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:57:14 +00:00
Yi Chen 1c633d76ff
Remove kubernetes artifacts (#1329)
* Remove Kubernetes artifacts

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:53:14 +00:00
Yi Chen 3693f59663
Release v0.15.0 (#1332)
* Release v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 15:12:14 +00:00
Syspretor fa2fad7d6e
Feat: support separate affinity policy configuration for PS and worke… (#1331)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 12:03:14 +00:00
Syspretor 8f4a602ce6
Feat: support affinity policy for kserve and tfjob (#1319)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 11:33:15 +00:00
Leoyzen ad85546c23
Add custom device support for kserve and kserving. (#1315)
* add custom device support for kserving.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

* add custom device support for kserve.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

---------

Signed-off-by: Leoyzen <leoyzen@gmail.com>
2025-06-04 02:45:14 +00:00
Yi Chen babcb76f91
Make number of replicas of tf-operator deployment configurable (#1323)
* Make tf-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of tf-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 02:39:14 +00:00
Yi Chen ba7a09ace6
Make number of replicas of cron-operator deployment configurable (#1325)
* Make cron-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of cron-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove '--enable-leader-election=true' from args

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-03 13:16:14 +00:00
Yi Chen 545f86bfe9
Delete all services when the TFJob is terminated (#1316)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-29 12:57:19 +00:00
co63oc 568e3845f5
Fix typos in multiple files (#1310)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-13 08:56:21 +00:00
co63oc 8b84559944
Fix typos in multiple files (#1304)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-12 12:45:38 +00:00
Yi Chen ee2384b911
fix: service account should use release namespace (#1308)
* Use release namespace

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove namespace from cluster scoped resource

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-12 12:23:38 +00:00
Yi Chen 2fbb3d7ed4
feat: add new value for using localtime in cron-operator (#1296)
* feat: add new value for using localtime in cron-operator

Signed-off-by: Yi Chen <github@chenyicn.net>

* Rename localTime to useHostTimezone

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-04-03 07:31:33 +00:00
Yi Chen 19b5133e6e
refactor: use helm lib instead of helm binary (#1207)
* Delete func ListAllReleasesWithDetail

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleaseMap

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleases

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func DeleteRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add some helm util functions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func InstallRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func CheckRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GetChartVersion

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GenerateHelmTemplate

Signed-off-by: Yi Chen <github@chenyicn.net>

* Move all helm related functions into util.go

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add missed import statements and run go mod tidy

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update copyright header

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add flag --helm-binary for forward compatibility

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 09:19:27 +00:00
Yi Chen 8d413b5861
Add stale bot to mark stale issues and PRs (#1141)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 05:14:26 +00:00
dependabot[bot] 2f6e202bbf
Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 (#1290)
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.23 to 1.7.27.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.23...v1.7.27)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-21 04:58:26 +00:00
Yi Chen f3d52fa73a
Add basic e2e tests (#1225)
* Add basic e2e tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 04:02:27 +00:00
Yi Chen ece85b8ce3
fix: job status displays incorrectly (#1289)
* fix: job status displays incorrectly

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add go unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* logging job status

Signed-off-by: Yi Chen <github@chenyicn.net>

* Adjust the order of running and queuing conditions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Use constants instead of hard-coded status

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-20 09:51:27 +00:00
Yi Chen d497232013
Release v0.14.2 (#1282)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 02:26:01 +00:00
Yi Chen 9407f9b1a0
Update pytorch operator image (#1281)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 01:56:01 +00:00
co63oc d9bf195879
Fix typos (#1276)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-03-06 03:11:39 +00:00
Yi Chen 19abf194bb
Release v0.14.1 (#1275)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 03:06:45 +00:00
Yi Chen 1f9350d78c
unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled (#1273)
* unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled

Signed-off-by: Yi Chen <github@chenyicn.net>

* Group constants into one const block

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 02:34:45 +00:00
Yi Chen 23e9731b52
fix: pytorchjob does not support backoff limit (#1272)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-19 06:57:41 +00:00
Yi Chen d6b177b93d
fix: format of tensorflow standalone training docs is messed up (#1265)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:18:29 +00:00
Yi Chen 0ca2670770
fix: device value does not support k8s resource quantity (#1267)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:17:29 +00:00
dependabot[bot] 7d7f75ad2d
Bump github.com/golang/glog from 1.2.3 to 1.2.4 (#1263)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.3...v1.2.4)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 10:25:29 +00:00
DBMxrco 4b21f7299b
docs: fixed typo (#1257)
Signed-off-by: DBMxrco <marcoflet@yahoo.com>
2025-02-12 08:34:29 +00:00
Yi Chen 36a59bba67
Release v0.14.0 (#1264)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:43:28 +00:00
Yi Chen ccdbf44815
Add changelog for v0.13.1 (#1248)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:34:28 +00:00
dependabot[bot] 36b17b4175
Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 (#1254)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.16.0 to 2.16.5.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.16.0...v2.16.5)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 06:26:29 +00:00
gujing 1058d48063
rename parameter (#1262)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2025-02-12 06:02:30 +00:00
AlanFokCo ce9c5f3bff
Update the version of elastic-job-supervisor in arena-artifacts (#1247)
Signed-off-by: AlanFokCo <892249240@qq.com>
2025-01-13 09:32:08 +00:00
Yi Chen 970afbd209
Add PyTorch mnist example (#1237)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:31:16 +00:00
Yi Chen f1bb3bcdbb
feat: add linux/arm64 support for et-operator image (#1241)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:00:16 +00:00
Yi Chen b814410627
feat: add linux/arm64 support for cron-operator image (#1240)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:59:16 +00:00
Yi Chen 38218aa3a0
feat: add linux/arm64 support for mpi-operator image (#1239)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:26:16 +00:00
Yi Chen 13fa5c8dc8
feat: add linux/arm64 support for tf-operator image (#1238)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 09:03:16 +00:00
Yi Chen f098f1af85
Release v0.13.0 (#1232)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 08:33:15 +00:00
Yi Chen b0e411cab5
Update pytorch-operator image (#1234)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 07:55:15 +00:00
dependabot[bot] 5e18210479
Bump github.com/stretchr/testify from 1.9.0 to 1.10.0 (#1233)
Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.9.0 to 1.10.0.
- [Release notes](https://github.com/stretchr/testify/releases)
- [Commits](https://github.com/stretchr/testify/compare/v1.9.0...v1.10.0)

---
updated-dependencies:
- dependency-name: github.com/stretchr/testify
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 13:55:12 +00:00
Yi Chen 13df29407c
Update tfjob standalone training job doc (#1222)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:29:11 +00:00
Yi Chen 0a701eb03d
Remove archived docs (#1208)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:26:12 +00:00
Yi Chen 0482946a0c
Add changelog for v0.12.1 (#1224)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:25:12 +00:00
dependabot[bot] 0d4b513d65
Bump golang.org/x/crypto from 0.29.0 to 0.31.0 (#1231)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.29.0 to 0.31.0.
- [Commits](https://github.com/golang/crypto/compare/v0.29.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:09:13 +00:00
dependabot[bot] e8b9fcd10d
Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 (#1227)
Bumps google.golang.org/protobuf from 1.35.1 to 1.36.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:02:12 +00:00
Yi Chen 190c18e840
feat: add support for torchrun (#1228)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:32:11 +00:00
Yi Chen dc0929f32f
Avoid listing jobs and statefulsets when get pytorchjob (#1229)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:29:11 +00:00
Yi Chen 74ade74d3e
Release v0.12.1 (#1215)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:37:29 +00:00
Yi Chen 316e33c999
Update cron operator image (#1214)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:35:29 +00:00
dependabot[bot] fc47e460e1
Bump golang.org/x/crypto from 0.28.0 to 0.29.0 (#1206)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.28.0 to 0.29.0.
- [Commits](https://github.com/golang/crypto/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-18 15:06:23 +00:00
Yi Chen 1cba9b99dc
Add docs for releasing arena (#1201)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:29:23 +00:00
Yi Chen 866ec44648
Publish releases only on master branch (#1210)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:28:23 +00:00
cheyang ac164b85bf
Support MPI Job with generic devices (#1209)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-18 03:03:22 +00:00
Qianlong d61a784a13
Fix the functionality of generating kubeconfig (#1204) (#1205)
Signed-off-by: 向先 <wangqianlong.wql@alibaba-inc.com>
Co-authored-by: 向先 <wangqianlong.wql@alibaba-inc.com>
2024-11-16 15:45:21 +00:00
dependabot[bot] 74fd3f2ad3
bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 (#1202)
---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-15 09:38:20 +00:00
TzZtzt a765b1c5a0
Fix etjob rendering error when using local logging dir (#1203)
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
2024-11-13 06:17:17 +00:00
Yi Chen 0838d54757
Add go mod vendor check to integration test (#1198)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:23:16 +00:00
Yi Chen ca735b6152
Add changelog for v0.12.0 (#1199)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:11:17 +00:00
Yi Chen 969ad681a3
Update tf-operator image to fix clean pod policy issues (#1200)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 01:55:16 +00:00
dependabot[bot] 29b2d6d2c5
Bump mkdocs-material from 9.5.42 to 9.5.44 (#1190)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.42 to 9.5.44.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.42...9.5.44)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 06:07:15 +00:00
cheyang 22a3df5023
Support distributed serving with vendor update (#1194)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-11 06:06:15 +00:00
lianhui lin 68b71f9006
Feat: add support for distributed serving type (#1187)
* Feat: support distributed serving type

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix command check

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix lint problem

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

---------

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
Co-authored-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
2024-11-07 10:20:12 +00:00
dependabot[bot] 70278ce8f7
Bump github.com/prometheus/common from 0.60.0 to 0.60.1 (#1182)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.0 to 0.60.1.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.0...v0.60.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 06:43:12 +00:00
dependabot[bot] 8e008a4916
Bump github.com/golang/glog from 1.2.2 to 1.2.3 (#1189)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.2 to 1.2.3.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.2...v1.2.3)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 03:12:12 +00:00
Yi Chen 46a795e3db
Fix: unable to set cleanPodPolicy to All when submitting TFJob (#1191)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:53:12 +00:00
Yi Chen 76ca05975e
Add changelog for v0.11.0 (#1181)
* Add changelog for v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump version to v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:05:12 +00:00
dependabot[bot] dce03cc700
Bump mkdocs-material from 9.5.40 to 9.5.42 (#1179)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.40 to 9.5.42.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.40...9.5.42)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-24 11:52:31 +00:00
qile123 7885f46081
Support ray job (#1123)
Signed-off-by: taiku <ljh404177@alibaba-inc.com>
Co-authored-by: 泰酷 <ljh404177@alibaba-inc.com>
2024-10-24 10:34:31 +00:00
dependabot[bot] 8d6c23d14c
Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 (#1176)
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.20.4 to 1.20.5.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/v1.20.5/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.20.4...v1.20.5)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-24 05:19:30 +00:00
Yi Chen bd1b0da049
Add changelog for v0.10.1 (#1175)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-23 15:36:30 +00:00
Yi Chen e15cb18aeb
Remove redundant run_arena.sh file (#1172)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 12:51:17 +00:00
Yi Chen 82fd0ba7e5
fix: failed to sync cache due to status subresource missed in tfjob CRD (#1173)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 12:48:17 +00:00
dependabot[bot] a1b7285e1d
Bump google.golang.org/protobuf from 1.34.2 to 1.35.1 (#1163)
Bumps google.golang.org/protobuf from 1.34.2 to 1.35.1.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 12:33:17 +00:00
dependabot[bot] 522a0c610f
Bump mkdocs-material from 9.5.38 to 9.5.40 (#1166)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.38 to 9.5.40.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.38...9.5.40)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 12:32:17 +00:00
Yi Chen b8af066a2f
Migrate docker image to ACREE (#1171)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:59:16 +00:00
Yi Chen 42b8fcae2e
Add changelog for v0.10.0 (#1158)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:58:17 +00:00
Yi Chen 45c8e1b150
fix: unsupported success policy when success policy is not specified (#1170)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:57:16 +00:00
Yi Chen fdcfd18a98
fix: keep arena installer after installing the binary (#1164)
* Release v0.10.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* fix: keep arena installer after installing the binary

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update tf-operator image

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:56:17 +00:00
dependabot[bot] 41fb18b640
Bump golang.org/x/crypto from 0.27.0 to 0.28.0 (#1162)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.27.0 to 0.28.0.
- [Commits](https://github.com/golang/crypto/compare/v0.27.0...v0.28.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-08 02:46:07 +00:00
dependabot[bot] bf49baae30
Bump github.com/prometheus/common from 0.59.1 to 0.60.0 (#1160)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.59.1 to 0.60.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.59.1...v0.60.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-08 02:43:07 +00:00
dependabot[bot] bd159b2d0f
Bump github.com/go-resty/resty/v2 from 2.15.2 to 2.15.3 (#1156)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.2 to 2.15.3.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.2...v2.15.3)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-29 10:07:37 +00:00
dependabot[bot] 7c10b6756c
Bump mkdocs-material from 9.5.36 to 9.5.38 (#1153)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.36 to 9.5.38.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.36...9.5.38)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-29 09:54:37 +00:00
Yi Chen 0d95df6f1e
Bump golang from 1.21 to 1.22.7 (#1142)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-29 09:36:37 +00:00
Yi Chen 11b771b417
Add success policy to TF training job (#1148)
* Add successPolicy field to tfjob CRD

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add successPolicy to TFJob charts

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add success-policy flags

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-29 09:30:37 +00:00
AlanFokCo 223e534b91
[Bugfix] Make PytorchJob devices format to key=value (#1155)
Signed-off-by: huozhixin.hzx <huozhixin.hzx@alibaba-inc.com>
Co-authored-by: huozhixin.hzx <huozhixin.hzx@alibaba-inc.com>
2024-09-27 08:45:36 +00:00
dependabot[bot] 7197b5cb40
Bump mkdocs-material from 9.5.35 to 9.5.36 (#1151)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.35 to 9.5.36.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.35...9.5.36)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-23 13:37:32 +00:00
dependabot[bot] b2c5686543
Bump github.com/go-resty/resty/v2 from 2.15.1 to 2.15.2 (#1150)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.1 to 2.15.2.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.1...v2.15.2)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-23 12:11:32 +00:00
dependabot[bot] dfd3268cc6
Bump github.com/go-resty/resty/v2 from 2.15.0 to 2.15.1 (#1147)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.0 to 2.15.1.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.0...v2.15.1)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-19 12:36:28 +00:00
dependabot[bot] 513894a1f0
Bump mkdocs-material from 9.5.34 to 9.5.35 (#1145)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.34 to 9.5.35.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.34...9.5.35)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 16:41:28 +00:00
dependabot[bot] 064927ef5c
Bump github.com/go-resty/resty/v2 from 2.14.0 to 2.15.0 (#1143)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.14.0 to 2.15.0.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.14.0...v2.15.0)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 01:52:27 +00:00
dependabot[bot] a9ed5f6eaf
Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.4 (#1144)
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.20.0 to 1.20.4.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.20.0...v1.20.4)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 01:42:28 +00:00
Yi Chen b2380e60dc
Bump client-java from 10.0.1 to 11.0.1 (#1132)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-13 11:41:22 +00:00
Yi Chen bf53ba33ea
docs: fix broken links and add CI for checking document build status (#1131)
* Fix broken links in docs

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add CI for building docs

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-13 11:40:22 +00:00
dependabot[bot] 305005ebdf
Bump github.com/prometheus/common from 0.45.0 to 0.59.1 (#1138)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.45.0 to 0.59.1.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.45.0...v0.59.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 03:30:22 +00:00
dependabot[bot] b70297a03a
Bump github.com/kserve/kserve from 0.13.0 to 0.13.1 (#1135)
Bumps [github.com/kserve/kserve](https://github.com/kserve/kserve) from 0.13.0 to 0.13.1.
- [Release notes](https://github.com/kserve/kserve/releases)
- [Commits](https://github.com/kserve/kserve/compare/v0.13.0...v0.13.1)

---
updated-dependencies:
- dependency-name: github.com/kserve/kserve
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:46:22 +00:00
dependabot[bot] ded5780b29
Bump github.com/go-resty/resty/v2 from 2.12.0 to 2.14.0 (#1134)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.12.0 to 2.14.0.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.12.0...v2.14.0)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:36:22 +00:00
dependabot[bot] c1f39aba1f
Bump github.com/spf13/cobra from 1.8.0 to 1.8.1 (#1137)
Bumps [github.com/spf13/cobra](https://github.com/spf13/cobra) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/spf13/cobra/releases)
- [Commits](https://github.com/spf13/cobra/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: github.com/spf13/cobra
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:25:22 +00:00
dependabot[bot] 94fc66024f
Bump golang.org/x/crypto from 0.21.0 to 0.27.0 (#1126)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.21.0 to 0.27.0.
- [Commits](https://github.com/golang/crypto/compare/v0.21.0...v0.27.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:24:22 +00:00
dependabot[bot] c3e73610b0
Bump github.com/golang/glog from 1.1.2 to 1.2.2 (#1139)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.1.2 to 1.2.2.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.1.2...v1.2.2)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:17:22 +00:00
Yi Chen e279bad1cf
chore: add issue templates and update dependabot bot (#1140)
* Update issue and pull request templates

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update dependabot config

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add issue label bot

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 13:40:22 +00:00
Yi Chen 3409e5b1e4
Increase RSA key bit size from 1024 to 2048 (#1130)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 06:33:21 +00:00
Yi Chen 3afe470d8d
chore: remove travis and circle CI (#1129)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 06:08:21 +00:00
Yi Chen f11dae2a6f
Update Makefile and release workflow (#1128)
* Update .gitignore

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add .dockerignore

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Dockerfile for packaging arena installer

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update integration test workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add check release workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add release workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Dockerfile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make run_arena.sh executable

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 04:07:21 +00:00
Yi Chen a80b33508f
Bump arena Java SDK version to 1.0.8 (#1124)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-04 10:07:15 +00:00
lizhiboo 6c2373d32e
#1121 Support multiple type devices (#1122)
Signed-off-by: lizhiboo <lizhiboo@yeah.net>
2024-09-03 05:50:14 +00:00
yu lin b500f9eda2
Remove docker dependency (#1113)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-31 08:46:04 +00:00
Yi Chen 98a43dc6d9
Fix submitting spark training jobs and update docs (#1112)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-07-30 03:34:56 +00:00
yu lin 881780fb08
Release arena v0.9.16 (#1110)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-25 13:07:55 +00:00
yu lin 9064896a91
Fix incorrect TensorBoard images. (#1109)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-22 07:12:00 +00:00
yu lin c9dbc8f968
Support config security context for KServe (#1108)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-17 09:08:56 +00:00
yu lin 5748fe4136
Add env-from-secret to read environment variables from secret (#1107)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-17 02:09:56 +00:00
yu lin 33181529ab
Add a demo for using arena CLI in container. (#1105)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-16 03:17:55 +00:00
yu lin 5e8b6ddbff
Support setting shared memory for training job. (#1104)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-11 08:51:21 +00:00
yu lin a3a348c00a
Upgrade the kubernetes dependencies to v1.28 and go version to 1.21 (#1102)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-11 03:05:21 +00:00
Yi Chen 7acbb8c408
Add @ChenYi015 as Arena approvers (#1103)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-07-11 02:29:20 +00:00
yu lin 19c9090bd7
Support setting the init-container-image for pytorch-operator (#1097)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-06-18 01:36:58 +00:00
gujing 48eed0fe82
change kserve prom svc to ClusterIP (#1096)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-06-17 12:32:58 +00:00
yu lin 3926187d64
fix arena makefile and dockerfile (#1091)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 18:06:13 +08:00
yu lin 95d4bbeb94
Add license (#1090)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 08:59:34 +00:00
yu lin dbf740f8cb
Remove vendor (#1089)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 07:17:33 +00:00
yu lin 64808b67e6
Fix gpu-exporter and prometheus demo (#1087)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-29 06:58:15 +00:00
yu lin 37d8ab4d50
Update Arena Java SDK fastjson version (#1088)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-15 08:53:58 +00:00
yu lin a031bae968
Fix get kserve job panic (#1086)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-13 06:09:18 +00:00
yu lin f31e1b0be0
Release arena v0.9.15 (#1078)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-28 05:11:48 +00:00
yu lin 5034f390d2
Fix command includes quotes cause Helm template failure. (#1075)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-19 03:12:47 +00:00
gujing 43b60eddb7
Feature/kserve custom metrics prometheus (#1073)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-04-17 08:35:27 +00:00
yu lin 1398c8f307
Upgrade helm version to v3.13.3 (#1072)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-17 08:32:27 +00:00
gujing acac0fbb25
fix --command parameter (#1074)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-04-17 07:51:27 +00:00
yu lin 451030cfcb
Fix port cannot be allocated when submitting a tfjob using the go sdk. (#1071)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-16 07:48:51 +00:00
yu lin adb43b8d74
Release arena v0.9.14 (#1070)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-10 20:47:00 +08:00
Yi Chen fed8afc602
Update model manage documentation (#1066)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-10 09:05:29 +00:00
Yi Chen dd69d9c1af
Fix: model information does not display correctly when getting a training job (#1068)

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-10 07:31:29 +00:00
yu lin 768218e8f5
Fix readthedocs build failed. (#1069)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-10 07:24:28 +00:00
Yi Chen d1e62ffa3a
Update model manage (#1062)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-09 11:36:28 +00:00
Yi Chen c114755222
Add support for MLflow model manage (#1058)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-02 02:26:22 +00:00
Yi Chen 12f205ef89
⚠️ Breaking Changes: Migrate model subcommand to model analyze (#1060)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-03-27 06:14:20 +00:00
yu lin 5ac396c7ab
Release arena v0.9.13 (#1057)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-03-18 09:06:34 +00:00
gujing 8b05634bea
support update --data in kserve serving job (#1049)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-18 08:20:34 +00:00
gujing b7f0ecf50e
support config request resources in kserve runtime (#1050)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-07 09:14:14 +00:00
gujing 57093a20fb
delete cm if job failed (#1051)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-07 09:10:15 +00:00
yu lin 70f4a13547
Support for updating the nodeSelector and toleration in GO SDK. (#1043)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-28 03:26:00 +00:00
yu lin d648a2a8cf
Upgrade Kubernetes version 1.26.4 and go version 1.20.12 (#1042)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-22 02:08:22 +00:00
yu lin 0a7501c542
Support Kubernetes 1.26 and KServe 0.11.2 (#1041)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-21 02:47:53 +00:00
gujing e4631c492d
Add @gujingit as Arena approvers (#1040)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-02-21 02:24:53 +00:00
yu lin 6fd3d0e022
Upgrade Go version to v1.20 (#1032)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-20 10:25:40 +00:00
gujing f27a6780ce
feat: add backend param for triton serving (#1039)
2024-02-18 03:58:48 +00:00
Alex Wang ed2aea2f86
add denkensk as approver (#1038)
2024-02-05 07:18:17 +00:00
yu lin a707f81ef6
Release arena v0.9.12 (#1037)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-04 15:27:32 +08:00
yu lin 23b4fe9090
Add CI to run Go unit test. (#1035)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-04 14:57:26 +08:00
gujing 3e7e915c16
update tritonserver base image to nvcr.io/nvidia/tritonserver:24.01-py3 (#1036)
2024-02-04 06:55:16 +00:00
gujing 8739eb536c
Feature/inferenceservice (#1034)
* chore: update inferenceservice yaml

* chore: update copyright
2024-02-01 03:27:14 +00:00
yu lin 875d0022b5
Add CI to run the tests for Go. (#1031)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-22 02:23:02 +00:00
yu lin 1449e75f92
chore: fix go lint. (#1030)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-19 01:55:59 +00:00
yu lin 10e1e629af
chore: go fmt (#1028)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-17 14:38:54 +00:00
yu lin ff24a10944
chore: Update OWNERS (#1027)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 12:34:18 +00:00
xieydd 8db2d49353
Add xieydd as approver (#1026)
Signed-off-by: xieydd <xieydd@gmail.com>
2024-01-16 09:55:18 +00:00
yu lin cdf1bb3102
Compatible with training-operator CRD. (#1024)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 09:16:18 +00:00
yu lin 67a9150c56
Update Arena 2024 Roadmap. (#1025)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 17:15:25 +08:00
yu lin 0df51d7492
Add @Syulin7 to Approvers. (#1022)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-10 12:39:06 +00:00
yu lin 7f31c6b209
[Discussion] Arena 2024 Roadmap. (#1020)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-10 12:06:07 +00:00
yu lin ce87d1095d
Fix release doc and job status. (#1011)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-12-04 14:54:33 +08:00
yu lin c4d37efa2b
Fix patch ownerReference (#1004)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-10-13 16:17:18 +08:00
yu lin a577b6d6ce
Fix incorrect job status display when kube-queue is enabled. (#1003)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-10-13 14:41:26 +08:00
yu lin 261cf3a362
Update kserve document. (#994)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-18 20:34:37 +08:00
yu lin 4dc39d6b52
Update kserve document. (#993)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-18 16:05:32 +08:00
yu lin a7e6a0fc19
Fix update triton server replicas. (#991)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-07 05:57:32 +00:00
yu lin 4afe00e05a
Fix install.sh to support control-plane label. (#989)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-05 17:39:41 +08:00
yu lin 46093aec39
Fix circleci. (#986)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-30 11:53:09 +08:00
yu lin bf33adad6d
Fix circleci. (#985)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-30 11:36:07 +08:00
yu lin 650d2ef0f8
Support maxSurge, livenessProbe, readinessProbe. (#983)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-29 08:04:34 +00:00
yu lin 14fa45c995
Add KServe document. (#984)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-29 06:02:34 +00:00
yu lin 2029700bd8
Update install.sh to support new label. (#982)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-28 09:41:23 +00:00
yu lin de8cb950de
Set KServe inference service version by default. (#981)
* Support KServe inference service

Signed-off-by: Syulin7 <735122171@qq.com>

* Set KServe inference service version by default.

Signed-off-by: Syulin7 <735122171@qq.com>

---------

Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-28 02:27:22 +00:00
yu lin 3fe9ae4026
Support KServe inference service (#976)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-16 09:38:04 +00:00
yu lin 81a8bf85c9
Update dependent component version. (#971)
* Update dependent component version.

Signed-off-by: Syulin7 <735122171@qq.com>

* Update dependent component version.

Signed-off-by: Syulin7 <735122171@qq.com>

* Update vendor.

Signed-off-by: Syulin7 <735122171@qq.com>

---------

Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-09 09:45:33 +00:00
7546 changed files with 1536768 additions and 248495 deletions

@ -1,25 +0,0 @@
# Golang CircleCI 2.0 configuration file
#
# Check https://circleci.com/docs/2.0/language-go/ for more details
version: 2
jobs:
  build:
    docker:
      - image: circleci/golang:1.14.10
    working_directory: /go/src/github.com/kubeflow/arena
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: false
      - run:
          name: run tests
          command: |
            test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
            go vet ./...
            go test -race -v ./...
      - run: docker build -t acs/arena:$CIRCLE_BUILD_NUM -f Dockerfile.install .
      - run:
          name: codecov
          command: |
            go test -race -coverprofile=coverage.txt -covermode=atomic ./...
            bash <(curl -s https://codecov.io/bash)

18
.dockerignore Normal file

@ -0,0 +1,18 @@
bin/
docs/
jupyter/
samples/
sdk/
.gitignore
.readthedocs.yaml
Dockerfile*
LICENSE
OWNERS
README.md
README_cn.md
ROADMAP.md
ROADMAP_cn.md
cover.out
demo.jpg
mkdocs.yml
prow_config.yaml

48
.github/ISSUE_TEMPLATE/bug_report.yaml vendored Normal file

@ -0,0 +1,48 @@
name: Bug Report
description: Tell us about a problem you are experiencing with Arena
labels: ["kind/bug", "lifecycle/needs-triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this Arena bug report!
  - type: textarea
    id: problem
    attributes:
      label: What happened?
      description: |
        Please provide as much info as possible.
        Not doing so may result in your bug not being addressed in a timely manner.
    validations:
      required: true
  - type: textarea
    id: expected
    attributes:
      label: What did you expect to happen?
    validations:
      required: true
  - type: textarea
    id: environment
    attributes:
      label: Environment
      value: |
        Kubernetes version:
        ```bash
        $ kubectl version
        ```
        Arena version:
        ```bash
        $ arena version
        ```
    validations:
      required: true
  - type: input
    id: votes
    attributes:
      label: Impacted by this bug?
      value: Give it a 👍 We prioritize the issues with most 👍

6
.github/ISSUE_TEMPLATE/config.yaml vendored Normal file

@ -0,0 +1,6 @@
blank_issues_enabled: true
contact_links:
  - name: Arena Documentation
    url: https://arena-docs.readthedocs.io/en/stable
    about: Much help can be found in the docs

@ -0,0 +1,28 @@
name: Feature Request
description: Suggest an idea for Arena
labels: ["kind/feature", "lifecycle/needs-triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this Arena feature request!
  - type: textarea
    id: feature
    attributes:
      label: What you would like to be added?
      description: |
        A clear and concise description of what you want to add to Arena.
        Please consider to write Arena enhancement proposal if it is a large feature request.
    validations:
      required: true
  - type: textarea
    id: rationale
    attributes:
      label: Why is this needed?
    validations:
      required: true
  - type: input
    id: votes
    attributes:
      label: Love this feature?
      value: Give it a 👍 We prioritize the features with most 👍

27
.github/ISSUE_TEMPLATE/question.yaml vendored Normal file

@ -0,0 +1,27 @@
name: Question
description: Ask question about Arena
labels: ["kind/question", "lifecycle/needs-triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this question!
  - type: textarea
    id: feature
    attributes:
      label: What question do you want to ask?
      description: |
        A clear and concise description of what you want to ask about Arena.
    validations:
      required: true
  - type: textarea
    id: rationale
    attributes:
      label: Any additional context?
    validations:
      required: false
  - type: input
    id: votes
    attributes:
      label: Have the same question?
      value: Give it a 👍 We prioritize the question with most 👍

29
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@ -0,0 +1,29 @@
<!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines: https://www.kubeflow.org/docs/about/contributing
2. To know more about Arena, check the developer guide:
https://arena-docs.readthedocs.io/en/latest/
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-->
## Purpose of this PR
<!-- Provide a clear and concise description of the changes. Explain the motivation behind these changes and link to relevant issues or discussions. -->
**Proposed changes:**
- <Change 1>
- <Change 2>
- <Change 3>
## Change Category
<!-- Indicate the type of change by marking the applicable boxes. -->
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] Feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that could affect existing functionality)
- [ ] Documentation update
### Rationale
<!-- Provide reasoning for the changes if not already covered in the description above. -->

361
.github/dependabot.yml vendored

@ -1,337 +1,26 @@
updates:
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: .
open-pull-requests-limit: 10
package-ecosystem: docker
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: samples/docker/serve-custom-sample
open-pull-requests-limit: 10
package-ecosystem: docker
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/golang.org/x/net/http2
open-pull-requests-limit: 10
package-ecosystem: docker
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: kubernetes-artifacts/tf-operator
open-pull-requests-limit: 10
package-ecosystem: docker
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: kubernetes-artifacts/jobmon
open-pull-requests-limit: 10
package-ecosystem: docker
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: .
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- dims
- thockin
- justinsb
- tallclair
- piosz
- brancz
- DirectXMan12
- lavalamp
directory: vendor/k8s.io/klog
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- jayunit100
- hoegaarden
- andyxning
- neolit123
- pohly
- yagonobre
- vincepri
- detiber
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/google.golang.org/appengine
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/json-iterator/go
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/hashicorp/golang-lru
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/hashicorp/hcl
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/mitchellh/go-homedir
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/mitchellh/mapstructure
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/fsnotify/fsnotify
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/spf13/pflag
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/spf13/viper
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/spf13/afero
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/spf13/cast
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/spf13/jwalterweatherman
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/magiconair/properties
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/google/gofuzz
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/konsorten/go-windows-terminal-sequences
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/github.com/sirupsen/logrus
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/golang.org/x/oauth2
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
- assignees:
- cheyang
- wsxiaozhang
- denverdino
directory: vendor/gopkg.in/yaml.v2
open-pull-requests-limit: 10
package-ecosystem: gomod
reviewers:
- GarnettWang
- xiaozhouX
- osswangxining
schedule:
interval: daily
version: 2
updates:
  - package-ecosystem: gomod
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: maven
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: docker
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: daily

5
.github/issue_label_bot.yaml vendored Normal file

@ -0,0 +1,5 @@
# For https://mlbot.net a Github bot that labels issues using KubeFlow
label-alias:
  bug: kind/bug
  feature_request: kind/feature
  question: kind/question

69
.github/workflows/check-release.yaml vendored Normal file

@ -0,0 +1,69 @@
name: Check Release
on:
pull_request:
branches:
- master
paths:
- VERSION
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
SEMVER_PATTERN: '^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'
ARENA_ARTIFACTS_CHART: arena-artifacts
jobs:
check:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Check whether version matches semver pattern
run: |
VERSION=$(cat VERSION)
if [[ ${VERSION} =~ ${{ env.SEMVER_PATTERN }} ]]; then
echo "Version '${VERSION}' matches semver pattern."
else
echo "Version '${VERSION}' does not match semver pattern."
exit 1
fi
echo "VERSION=${VERSION}" >> $GITHUB_ENV
- name: Check arena artifacts chart version and appVersion
run: |
CHART_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^version:' | awk '{print $2}')
CHART_APP_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^appVersion:' | awk '{print $2}')
if [[ ${CHART_VERSION} == ${VERSION} ]]; then
echo "Chart version '${CHART_VERSION}' matches version '${VERSION}'."
else
echo "Chart version '${CHART_VERSION}' does not match version '${VERSION}'."
exit 1
fi
if [[ ${CHART_APP_VERSION} == ${VERSION} ]]; then
echo "Chart appVersion '${CHART_APP_VERSION}' matches version '${VERSION}'."
else
echo "Chart appVersion '${CHART_APP_VERSION}' does not match version '${VERSION}'."
exit 1
fi
- name: Check if tag exists
run: |
git fetch --tags
if git tag -l | grep -qxF "v${VERSION}"; then
echo "Tag 'v${VERSION}' already exists."
exit 1
else
echo "Tag 'v${VERSION}' does not exist."
fi
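
The version checks in this workflow can also be run locally before opening a release PR. The following is a minimal sketch, assuming a POSIX shell with `grep -E`, `awk` and `mktemp`. The helper names `is_semver` and `chart_field` are hypothetical and not part of the repository; the pattern is copied from `SEMVER_PATTERN` above, and the chart file contents are made up for illustration.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the arena repository.

# Same pattern as SEMVER_PATTERN in the workflow.
SEMVER_PATTERN='^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'

# is_semver VERSION: succeeds when VERSION matches the pattern.
is_semver() {
    printf '%s\n' "$1" | grep -Eq "$SEMVER_PATTERN"
}

# chart_field FILE KEY: print the value of a line that starts with "KEY:".
chart_field() {
    awk -v key="$2" 'index($0, key ":") == 1 { print $2 }' "$1"
}

# Usage with a throwaway chart file (the values are made up for illustration).
tmp=$(mktemp)
printf 'name: arena-artifacts\nversion: 0.15.1\nappVersion: 0.15.1\n' > "$tmp"

version="0.15.1"
if is_semver "$version" \
    && [ "$(chart_field "$tmp" version)" = "$version" ] \
    && [ "$(chart_field "$tmp" appVersion)" = "$version" ]; then
    echo "VERSION, chart version and appVersion agree: $version"
else
    echo "version mismatch"
fi
rm -f "$tmp"
```

Matching on the start of the line keeps `version:` and `appVersion:` apart, which is what the `^version:` and `^appVersion:` anchors do in the workflow.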

137
.github/workflows/integration.yaml vendored Normal file

@ -0,0 +1,137 @@
name: Integration Test
on:
pull_request:
branches:
- master
- release-*
push:
branches:
- master
- release-*
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.actor }}
cancel-in-progress: true
jobs:
build-arena:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Run go mod tidy
run: |
go mod tidy
if ! git diff --quiet; then
echo "Please run 'go mod tidy' to add missing and remove unused dependencies"
exit 1
fi
- name: Run go mod vendor
run: |
go mod vendor
if ! git diff --quiet; then
echo "Please run 'go mod vendor' to make vendored copy of dependencies"
exit 1
fi
- name: Run go fmt check
run: |
make go-fmt
if ! git diff --quiet; then
echo "Please run 'make go-fmt' to run go fmt against code"
exit 1
fi
- name: Run go vet check
run: |
make go-vet
if ! git diff --quiet; then
echo "Please run 'make go-vet' to run go vet against code"
exit 1
fi
- name: Run golangci-lint
run: |
make go-lint
- name: Run Go unit tests
run: |
make unit-test
- name: Run Helm unit tests
run: |
make helm-unittest
- name: Build arena binary
run: |
make arena
build-java-sdk:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- uses: actions/setup-java@v5
with:
distribution: zulu
java-version: 8
- name: Build Java SDK
run: |
make java-sdk
build-docs:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Build docs
run: |
pip install -r docs/requirements.txt
mkdocs build --strict
e2e-test:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Set up Kind cluster
uses: helm/kind-action@v1
with:
node_image: kindest/node:v1.29.10
config: arena-artifacts/ci/kind-config.yaml
- name: Install arena client
run: |
make arena-installer
tar -zxf arena-installer-*.tar.gz
arena-installer-*/install.sh --only-binary
- name: Run e2e tests
run: |
make e2e-test
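
The `go mod tidy`, `go mod vendor`, `make go-fmt` and `make go-vet` steps above all follow one pattern: run a command that may rewrite files, then fail if the working tree changed. A minimal sketch of that pattern as a single helper, assuming a POSIX shell; `check_clean` is a hypothetical name and not part of the repository, and the demo calls use `true` and `false` as stand-ins so the sketch runs without Go or a Git checkout.

```sh
#!/bin/sh
# Sketch only: check_clean is a hypothetical helper, not part of the repository.

# check_clean LABEL GENERATE [IS_CLEAN]
#   GENERATE is the command that may rewrite files (e.g. "go mod tidy").
#   IS_CLEAN reports whether the tree is unchanged; defaults to "git diff --quiet".
check_clean() {
    label="$1"
    generate="$2"
    is_clean="${3:-git diff --quiet}"

    # Intentionally unquoted so "go mod tidy" splits into command and arguments.
    $generate || return 2
    if ! $is_clean; then
        echo "Please run '$label' and commit the result"
        return 1
    fi
    echo "$label: no changes"
}

# In a checkout, the four workflow steps would reduce to:
#   check_clean "go mod tidy"   "go mod tidy"
#   check_clean "go mod vendor" "go mod vendor"
#   check_clean "make go-fmt"   "make go-fmt"
#   check_clean "make go-vet"   "make go-vet"

# Stand-in commands: "true" plays a generator that changes nothing, and
# "true"/"false" play a clean/dirty working tree.
check_clean "demo (clean)" true true
check_clean "demo (dirty)" true false || echo "exit status: $?"
```

The helper returns 2 when the generating command itself fails and 1 when it leaves changes behind, so a caller can tell the two cases apart.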

242
.github/workflows/release.yaml vendored Normal file

@ -0,0 +1,242 @@
name: Release
on:
push:
branches:
- master
paths:
- VERSION
env:
IMAGE_REGISTRY: ghcr.io
IMAGE_REPOSITORY: ${{ github.repository }}
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
package-arena-installer:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
os:
- linux
- darwin
arch:
- amd64
- arm64
steps:
- name: Checkout
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Get git commit id
run: |
COMMIT=$(git rev-parse --short HEAD)
echo "COMMIT=${COMMIT}" >>${GITHUB_ENV}
- name: Build arena installer tarball
run: |
make arena-installer OS=${{ matrix.os }} ARCH=${{ matrix.arch }}
- uses: actions/upload-artifact@v4
with:
name: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}
path: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}.tar.gz
if-no-files-found: error
overwrite: true
build-arena-image:
name: Build Arena container image
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
platform:
- linux/amd64
- linux/arm64
steps:
- name: Prepare
run: |
platform=${{ matrix.platform }}
echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
- name: Checkout source code
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> $GITHUB_ENV
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
tags: |
type=semver,pattern={{version}},value=${{ env.VERSION }}
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker buildx
uses: docker/setup-buildx-action@v3
- name: Login to container registry
uses: docker/login-action@v3
with:
registry: ${{ env.IMAGE_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push by digest
id: build
uses: docker/build-push-action@v6
with:
platforms: ${{ matrix.platform }}
labels: ${{ steps.meta.outputs.labels }}
outputs: type=image,name=${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }},push-by-digest=true,name-canonical=true,push=true
- name: Export digest
run: |
mkdir -p /tmp/digests
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digests-${{ env.PLATFORM_PAIR }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
release-image:
needs:
- build-arena-image
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> $GITHUB_ENV
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
tags: |
type=semver,pattern={{version}},value=${{ env.VERSION }}
- name: Download digests
uses: actions/download-artifact@v5
with:
path: /tmp/digests
pattern: digests-*
merge-multiple: true
- name: Set up Docker buildx
uses: docker/setup-buildx-action@v3
- name: Login to container registry
uses: docker/login-action@v3
with:
registry: ${{ env.IMAGE_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Create manifest list and push
working-directory: /tmp/digests
run: |
docker buildx imagetools create $(jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON") \
$(printf '${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}@sha256:%s ' *)
- name: Inspect image
run: |
docker buildx imagetools inspect ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}:${{ steps.meta.outputs.version }}
push_tag:
needs:
- package-arena-installer
- release-image
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Create and push tag
run: |
TAG="v${VERSION}"
git tag -a ${TAG} -m "Release v${VERSION}"
git push origin ${TAG}
draft_release:
needs:
- push_tag
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v5
- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Download arena installer tarballs
uses: actions/download-artifact@v5
with:
pattern: arena-installer-${{ env.VERSION }}-{linux,darwin}-{amd64,arm64}
- name: Release
uses: softprops/action-gh-release@v2
with:
token: ${{ secrets.GITHUB_TOKEN }}
tag_name: v${{ env.VERSION }}
prerelease: ${{ contains(env.VERSION, 'rc') }}
target_commitish: ${{ github.sha }}
draft: true
files: |
arena-installer-*/arena-installer-*.tar.gz
fail_on_unmatched_files: true
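
The image jobs above pass digests between runners as empty files: each per-platform build creates a file named after its digest, and `release-image` turns the file names back into `image@sha256:...` references. A minimal sketch of that string handling, assuming a POSIX shell with `mktemp`; the function names, the image name and the digests are hypothetical, and the sketch does not call `docker buildx imagetools create`.

```sh
#!/bin/sh
# Sketch only: function names, image name and digests are made up for illustration.

# platform_pair PLATFORM: "linux/amd64" -> "linux-amd64".
# The workflow uses the bash expansion ${platform//\//-}; tr is a portable equivalent.
platform_pair() {
    printf '%s\n' "$1" | tr '/' '-'
}

# digest_file_name DIGEST: strip the "sha256:" prefix, as the "Export digest" step does.
digest_file_name() {
    printf '%s\n' "${1#sha256:}"
}

# digest_refs IMAGE DIR: print one IMAGE@sha256:<name> reference per file in DIR.
# An empty DIR is not handled; the workflow guards against it with if-no-files-found.
digest_refs() {
    (
        cd "$2" || exit 1
        for name in *; do
            printf '%s@sha256:%s ' "$1" "$name"
        done
    )
}

image="ghcr.io/example/arena"   # hypothetical repository
dir=$(mktemp -d)

# Each per-platform build leaves an empty file named after its digest.
touch "$dir/$(digest_file_name sha256:aaa111)"
touch "$dir/$(digest_file_name sha256:bbb222)"

echo "artifact name: digests-$(platform_pair linux/arm64)"
echo "manifest sources: $(digest_refs "$image" "$dir")"
rm -rf "$dir"
```

Using file names as the payload means the artifacts stay tiny, and `merge-multiple: true` on the download step places all of them in one directory for the glob to pick up.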

43
.github/workflows/stale.yaml vendored Normal file

@ -0,0 +1,43 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests
on:
schedule:
- cron: "0 0 * * 0"
jobs:
stale:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v9
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
days-before-stale: 360
days-before-close: 180
stale-issue-message: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-issue-message: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-issue-label: lifecycle/stale
exempt-issue-labels: lifecycle/frozen
stale-pr-message: >
This pull request has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-pr-message: >
This pull request has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-pr-label: lifecycle/stale
exempt-pr-labels: lifecycle/frozen

27
.gitignore vendored

@ -1,12 +1,25 @@
bin/
**/*.tgz
**/.DS_Store
.idea
.kube
.vscode
Library
public/
site/
tmp/
sdk/arena-python-sdk/dist/
sdk/arena-python-sdk/build/
sdk/arena-python-sdk/arenasdk.egg-info/
__pycache__
.hugo_build.lock
.kube
*.tgz
*.tar.gz
# Python
__pycache__/
# Go
cover.out
# IDE files
.idea/
.vscode/
# MacOS
.DS_Store

76
.golangci.yaml Normal file

@ -0,0 +1,76 @@
version: "2"
run:
# Timeout for total work, e.g. 30s, 5m, 5m30s.
# If the value is lower or equal to 0, the timeout is disabled.
# Default: 0 (disabled)
timeout: 2m
linters:
# Enable specific linters.
# https://golangci-lint.run/usage/linters/#enabled-by-default
enable:
# Detects places where loop variables are copied.
- copyloopvar
# Checks for duplicate words in the source code.
- dupword
# Tool for detection of FIXME, TODO and other comment keywords.
# - godox
# Enforces consistent import aliases.
- importas
# Find code that shadows one of Go's predeclared identifiers.
- predeclared
# Check that struct tags are well aligned.
- tagalign
# Remove unnecessary type conversions.
- unconvert
# Checks Go code for unused constants, variables, functions and types.
- unused
# Disable specific linters.
disable:
# Errcheck is a program for checking for unchecked errors in Go code.
- errcheck
settings:
importas:
# List of aliases
alias:
- pkg: k8s.io/api/admissionregistration/v1
alias: admissionregistrationv1
- pkg: k8s.io/api/apps/v1
alias: appsv1
- pkg: k8s.io/api/batch/v1
alias: batchv1
- pkg: k8s.io/api/core/v1
alias: corev1
- pkg: k8s.io/api/extensions/v1beta1
alias: extensionsv1beta1
- pkg: k8s.io/api/networking/v1
alias: networkingv1
- pkg: k8s.io/apimachinery/pkg/apis/meta/v1
alias: metav1
- pkg: sigs.k8s.io/controller-runtime
alias: ctrl
exclusions:
# Which file paths to exclude: they will be analyzed, but issues from them won't be reported.
# "/" will be replaced by the current OS file path separator to properly work on Windows.
# Default: []
paths:
- pkg/operators
issues:
# Maximum issues count per one linter.
# Set to 0 to disable.
# Default: 50
max-issues-per-linter: 50
# Maximum count of issues with the same text.
# Set to 0 to disable.
# Default: 3
max-same-issues: 10
formatters:
enable:
# Check import statements are formatted according to the 'goimport' command.
- goimports


@ -4,6 +4,12 @@
# Required
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
mkdocs:
configuration: mkdocs.yml
@ -13,6 +19,5 @@ formats:
# Optionally set the version of Python and requirements required to build your docs
python:
version: 3.7
install:
- requirements: docs/requirements.txt


@ -1,16 +0,0 @@
language: go
go:
- "1.14.10"
go_import_path: github.com/kubeflow/arena
# let us have speedy Docker-based Travis workers
sudo: false
script:
- go build -o bin/arena cmd/arena/*.go
- go vet ./...
- go test -v ./...
- test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
- go test -race -v ./...

236
CHANGELOG.md Normal file

@ -0,0 +1,236 @@
# Changelog
## [v0.15.1](https://github.com/kubeflow/arena/tree/v0.15.1) (2025-06-25)
### Features
- Add support for configuring tolerations ([#1337](https://github.com/kubeflow/arena/pull/1337) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Remove kubernetes artifacts ([#1329](https://github.com/kubeflow/arena/pull/1329) by [@ChenYi015](https://github.com/ChenYi015))
- [CI] Add CI workflow for releasing Arena images ([#1340](https://github.com/kubeflow/arena/pull/1340) by [@ChenYi015](https://github.com/ChenYi015))
- Update uninstall bash script ([#1335](https://github.com/kubeflow/arena/pull/1335) by [@ChenYi015](https://github.com/ChenYi015))
- Fix golangci-lint issues ([#1341](https://github.com/kubeflow/arena/pull/1341) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang version from 1.22.7 to 1.23.10 ([#1345](https://github.com/kubeflow/arena/pull/1345) by [@ChenYi015](https://github.com/ChenYi015))
- chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 ([#1343](https://github.com/kubeflow/arena/pull/1343) by [@dependabot[bot]](https://github.com/apps/dependabot))
- chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 ([#1334](https://github.com/kubeflow/arena/pull/1334) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.15.0...v0.15.1)
## [v0.15.0](https://github.com/kubeflow/arena/tree/v0.15.0) (2025-06-04)
### Features
- refactor: use helm lib instead of helm binary ([#1207](https://github.com/kubeflow/arena/pull/1207) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add new value for using localtime in cron-operator ([#1296](https://github.com/kubeflow/arena/pull/1296) by [@ChenYi015](https://github.com/ChenYi015))
- Delete all services when the TFJob is terminated ([#1316](https://github.com/kubeflow/arena/pull/1316) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of cron-operator deployment configurable ([#1325](https://github.com/kubeflow/arena/pull/1325) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of tf-operator deployment configurable ([#1323](https://github.com/kubeflow/arena/pull/1323) by [@ChenYi015](https://github.com/ChenYi015))
- Add custom device support for kserve and kserving. ([#1315](https://github.com/kubeflow/arena/pull/1315) by [@Leoyzen](https://github.com/Leoyzen))
- Feat: support affinity policy for kserve and tfjob ([#1319](https://github.com/kubeflow/arena/pull/1319) by [@Syspretor](https://github.com/Syspretor))
- Feat: support separate affinity policy configuration for PS and worke… ([#1331](https://github.com/kubeflow/arena/pull/1331) by [@Syspretor](https://github.com/Syspretor))
### Bug Fixes
- fix: job status displays incorrectly ([#1289](https://github.com/kubeflow/arena/pull/1289) by [@ChenYi015](https://github.com/ChenYi015))
- fix: service account should use release namespace ([#1308](https://github.com/kubeflow/arena/pull/1308) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Add basic e2e tests ([#1225](https://github.com/kubeflow/arena/pull/1225) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 ([#1290](https://github.com/kubeflow/arena/pull/1290) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Add stale bot to mark stale issues and PRs ([#1141](https://github.com/kubeflow/arena/pull/1141) by [@ChenYi015](https://github.com/ChenYi015))
- Fix typos in multiple files ([#1304](https://github.com/kubeflow/arena/pull/1304) by [@co63oc](https://github.com/co63oc))
- Fix typos in multiple files ([#1310](https://github.com/kubeflow/arena/pull/1310) by [@co63oc](https://github.com/co63oc))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.2...v0.15.0)
## [v0.14.2](https://github.com/kubeflow/arena/tree/v0.14.2) (2025-03-10)
### Misc
- Fix typos ([#1276](https://github.com/kubeflow/arena/pull/1276) by [@co63oc](https://github.com/co63oc))
- Update pytorch operator image ([#1281](https://github.com/kubeflow/arena/pull/1281) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.1...v0.14.2)
## [v0.14.1](https://github.com/kubeflow/arena/tree/v0.14.1) (2025-02-24)
### Bug Fixes
- fix: device value does not support k8s resource quantity ([#1267](https://github.com/kubeflow/arena/pull/1267) by [@ChenYi015](https://github.com/ChenYi015))
- fix: pytorchjob does not support backoff limit ([#1272](https://github.com/kubeflow/arena/pull/1272) by [@ChenYi015](https://github.com/ChenYi015))
- unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled ([#1273](https://github.com/kubeflow/arena/pull/1273) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- docs: fixed typo ([#1257](https://github.com/kubeflow/arena/pull/1257) by [@DBMxrco](https://github.com/DBMxrco))
- Bump github.com/golang/glog from 1.2.3 to 1.2.4 ([#1263](https://github.com/kubeflow/arena/pull/1263) by [@dependabot[bot]](https://github.com/apps/dependabot))
- fix: format of tensorflow standalone training docs is messed up ([#1265](https://github.com/kubeflow/arena/pull/1265) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.0...v0.14.1)
## [v0.14.0](https://github.com/kubeflow/arena/tree/v0.14.0) (2025-02-12)
### Features
- rename parameter ([#1262](https://github.com/kubeflow/arena/pull/1262) by [@gujingit](https://github.com/gujingit))
### Misc
- Add changelog for v0.13.1 ([#1248](https://github.com/kubeflow/arena/pull/1248) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 ([#1254](https://github.com/kubeflow/arena/pull/1254) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.1...v0.14.0)
## [v0.13.1](https://github.com/kubeflow/arena/tree/v0.13.1) (2025-01-13)
### Misc
- feat: add linux/arm64 support for tf-operator image ([#1238](https://github.com/kubeflow/arena/pull/1238) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for mpi-operator image ([#1239](https://github.com/kubeflow/arena/pull/1239) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for cron-operator image ([#1240](https://github.com/kubeflow/arena/pull/1240) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for et-operator image ([#1241](https://github.com/kubeflow/arena/pull/1241) by [@ChenYi015](https://github.com/ChenYi015))
- Add PyTorch mnist example ([#1237](https://github.com/kubeflow/arena/pull/1237) by [@ChenYi015](https://github.com/ChenYi015))
- Update the version of elastic-job-supervisor in arena-artifacts ([#1247](https://github.com/kubeflow/arena/pull/1247) by [@AlanFokCo](https://github.com/AlanFokCo))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.0...v0.13.1)
## [v0.13.0](https://github.com/kubeflow/arena/tree/v0.13.0) (2024-12-23)
### New Features
- feat: add support for torchrun ([#1228](https://github.com/kubeflow/arena/pull/1228) by [@ChenYi015](https://github.com/ChenYi015))
- Update pytorch-operator image ([#1234](https://github.com/kubeflow/arena/pull/1234) by [@ChenYi015](https://github.com/ChenYi015))
### Bug Fix
- Avoid listing jobs and statefulsets when get pytorchjob ([#1229](https://github.com/kubeflow/arena/pull/1229) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Update tfjob standalone training job doc ([#1222](https://github.com/kubeflow/arena/pull/1222) by [@ChenYi015](https://github.com/ChenYi015))
- Remove archived docs ([#1208](https://github.com/kubeflow/arena/pull/1208) by [@ChenYi015](https://github.com/ChenYi015))
- Add changelog for v0.12.1 ([#1224](https://github.com/kubeflow/arena/pull/1224) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.29.0 to 0.31.0 ([#1231](https://github.com/kubeflow/arena/pull/1231) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 ([#1227](https://github.com/kubeflow/arena/pull/1227) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.12.1...v0.13.0)
## [v0.12.1](https://github.com/kubeflow/arena/tree/v0.12.1) (2024-11-25)
### New Features
- Support MPI Job with generic devices ([#1209](https://github.com/kubeflow/arena/pull/1209) by [@cheyang](https://github.com/cheyang))
### Bug Fix
- Update tf-operator image to fix clean pod policy issues ([#1200](https://github.com/kubeflow/arena/pull/1200) by [@ChenYi015](https://github.com/ChenYi015))
- Fix etjob rendering error when using local logging dir ([#1203](https://github.com/kubeflow/arena/pull/1203) by [@TrafalgarZZZ](https://github.com/TrafalgarZZZ))
- Fix the functionality of generating kubeconfig (#1204) ([#1205](https://github.com/kubeflow/arena/pull/1205) by [@wqlparallel](https://github.com/wqlparallel))
- Update cron operator image ([#1214](https://github.com/kubeflow/arena/pull/1214) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Add changelog for v0.12.0 ([#1199](https://github.com/kubeflow/arena/pull/1199) by [@ChenYi015](https://github.com/ChenYi015))
- Add go mod vendor check to integration test ([#1198](https://github.com/kubeflow/arena/pull/1198) by [@ChenYi015](https://github.com/ChenYi015))
- bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 ([#1202](https://github.com/kubeflow/arena/pull/1202) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Publish releases only on master branch ([#1210](https://github.com/kubeflow/arena/pull/1210) by [@ChenYi015](https://github.com/ChenYi015))
- Add docs for releasing arena ([#1201](https://github.com/kubeflow/arena/pull/1201) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.28.0 to 0.29.0 ([#1206](https://github.com/kubeflow/arena/pull/1206) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.1 ([#1215](https://github.com/kubeflow/arena/pull/1215) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/29b2d6d2...v0.12.1)
## [v0.12.0](https://github.com/kubeflow/arena/tree/v0.12.0) (2024-11-11)
### New Features
- Feat: add support for distributed serving type ([#1187](https://github.com/kubeflow/arena/pull/1187) by [@linnlh](https://github.com/linnlh))
- Support distributed serving with vendor update ([#1194](https://github.com/kubeflow/arena/pull/1194) by [@cheyang](https://github.com/cheyang))
### Misc
- Bump github.com/golang/glog from 1.2.2 to 1.2.3 ([#1189](https://github.com/kubeflow/arena/pull/1189) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.60.0 to 0.60.1 ([#1182](https://github.com/kubeflow/arena/pull/1182) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.42 to 9.5.44 ([#1190](https://github.com/kubeflow/arena/pull/1190) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.0 ([#1197](https://github.com/kubeflow/arena/pull/1197) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/46a795e3...v0.12.0)
## [v0.11.0](https://github.com/kubeflow/arena/tree/v0.11.0) (2024-10-24)
### New Features
- Support ray job ([#1123](https://github.com/kubeflow/arena/pull/1123) by [@qile123](https://github.com/qile123))
### Misc
- Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 ([#1176](https://github.com/kubeflow/arena/pull/1176) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.40 to 9.5.42 ([#1179](https://github.com/kubeflow/arena/pull/1179) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/e15cb18...v0.11.0)
## [v0.10.1](https://github.com/kubeflow/arena/tree/v0.10.1) (2024-10-14)
### Bug Fixes
- fix: keep arena installer after installing the binary ([#1164](https://github.com/kubeflow/arena/pull/1164) by [@ChenYi015](https://github.com/ChenYi015))
- fix: unsupported success policy when success policy is not specified ([#1170](https://github.com/kubeflow/arena/pull/1170) by [@ChenYi015](https://github.com/ChenYi015))
- fix: failed to sync cache due to status subresouce missed in tfjob CRD ([#1173](https://github.com/kubeflow/arena/pull/1173) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Bump github.com/prometheus/common from 0.59.1 to 0.60.0 ([#1160](https://github.com/kubeflow/arena/pull/1160) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.27.0 to 0.28.0 ([#1162](https://github.com/kubeflow/arena/pull/1162) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Migrate docker image to ACREE ([#1171](https://github.com/kubeflow/arena/pull/1171) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.38 to 9.5.40 ([#1166](https://github.com/kubeflow/arena/pull/1166) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.34.2 to 1.35.1 ([#1163](https://github.com/kubeflow/arena/pull/1163) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Remove redundant run_arena.sh file ([#1172](https://github.com/kubeflow/arena/pull/1172) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.10.0...v0.10.1)
## [v0.10.0](https://github.com/kubeflow/arena/tree/v0.10.0) (2024-09-29)
### New Features
- Support multiple type devices ([#1122](https://github.com/kubeflow/arena/pull/1122) by [@lizhiboo](https://github.com/lizhiboo))
- Increase RSA key bit size from 1024 to 2048 ([#1130](https://github.com/kubeflow/arena/pull/1130) by [@ChenYi015](https://github.com/ChenYi015))
- Add success policy to TF training job ([#1148](https://github.com/kubeflow/arena/pull/1148) by [@ChenYi015](https://github.com/ChenYi015))
### Bug Fixes
- Fix submitting spark training jobs and update docs ([#1112](https://github.com/kubeflow/arena/pull/1112) by [@ChenYi015](https://github.com/ChenYi015))
- docs: fix broken links and add CI for checking document build status ([#1131](https://github.com/kubeflow/arena/pull/1131) by [@ChenYi015](https://github.com/ChenYi015))
- [Bugfix] Make PytorchJob devices format to key=value ([#1155](https://github.com/kubeflow/arena/pull/1155) by [@AlanFokCo](https://github.com/AlanFokCo))
### SDK
- Bump arena Java SDK version to 1.0.8 ([#1124](https://github.com/kubeflow/arena/pull/1124) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Remove docker dependency ([#1113](https://github.com/kubeflow/arena/pull/1113) by [@Syulin7](https://github.com/Syulin7))
- Update Makefile and release workflow ([#1128](https://github.com/kubeflow/arena/pull/1128) by [@ChenYi015](https://github.com/ChenYi015))
- chore: remove travis and circle CI ([#1129](https://github.com/kubeflow/arena/pull/1129) by [@ChenYi015](https://github.com/ChenYi015))
- chore: add issue templates and update depenabot bot ([#1140](https://github.com/kubeflow/arena/pull/1140) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/golang/glog from 1.1.2 to 1.2.2 ([#1139](https://github.com/kubeflow/arena/pull/1139) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.21.0 to 0.27.0 ([#1126](https://github.com/kubeflow/arena/pull/1126) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/spf13/cobra from 1.8.0 to 1.8.1 ([#1137](https://github.com/kubeflow/arena/pull/1137) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.12.0 to 2.14.0 ([#1134](https://github.com/kubeflow/arena/pull/1134) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/kserve/kserve from 0.13.0 to 0.13.1 ([#1135](https://github.com/kubeflow/arena/pull/1135) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.45.0 to 0.59.1 ([#1138](https://github.com/kubeflow/arena/pull/1138) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump client-java from 10.0.1 to 11.0.1 ([#1132](https://github.com/kubeflow/arena/pull/1132) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.4 ([#1144](https://github.com/kubeflow/arena/pull/1144) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.14.0 to 2.15.0 ([#1143](https://github.com/kubeflow/arena/pull/1143) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.34 to 9.5.35 ([#1145](https://github.com/kubeflow/arena/pull/1145) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.0 to 2.15.1 ([#1147](https://github.com/kubeflow/arena/pull/1147) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.1 to 2.15.2 ([#1150](https://github.com/kubeflow/arena/pull/1150) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.35 to 9.5.36 ([#1151](https://github.com/kubeflow/arena/pull/1151) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang from 1.21 to 1.22.7 ([#1142](https://github.com/kubeflow/arena/pull/1142) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.36 to 9.5.38 ([#1153](https://github.com/kubeflow/arena/pull/1153) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.2 to 2.15.3 ([#1156](https://github.com/kubeflow/arena/pull/1156) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.10.0 ([#1157](https://github.com/kubeflow/arena/pull/1157) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.9.16...v0.10.0)

41
Dockerfile Normal file

@ -0,0 +1,41 @@
ARG BASE_IMAGE=debian:12-slim
FROM golang:1.24.0 AS builder
ARG TARGETOS
ARG TARGETARCH
WORKDIR /workspace
COPY . .
RUN set -eux && \
VERSION=$(cat VERSION) && \
make arena-installer OS=${TARGETOS} ARCH=${TARGETARCH} && \
mv arena-installer-${VERSION}-${TARGETOS}-${TARGETARCH}.tar.gz arena-installer.tar.gz
FROM ${BASE_IMAGE}
ARG TARGETOS
ARG TARGETARCH
WORKDIR /root
RUN apt-get update \
&& apt-get install -y tini \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /workspace/arena-installer.tar.gz .
RUN set -eux && \
tar -zxvf arena-installer.tar.gz && \
mv arena-installer-*-${TARGETOS}-${TARGETARCH} arena-installer && \
arena-installer/install.sh --only-binary && \
rm -rf arena-installer.tar.gz
COPY entrypoint.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]


@ -1,94 +0,0 @@
#**********************************************************************
# Builder
#
# Create a go runtime for building arena
ARG GOLANG_VERSION=1.16
ARG KUBE_VERSION=v1.23.0
ARG HELM_VERSION=v3.7.2
ARG VERSION=v0.3.0-rc
ARG OS_ARCH=linux-amd64
ARG COMMIT=stable
ARG TARGET=cli-$OS_ARCH
FROM golang:$GOLANG_VERSION-stretch as build
ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET
ENV KUBE_VERSION $KUBE_VERSION
ENV HELM_VERSION $HELM_VERSION
ENV VERSION $VERSION
ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
ENV GO111MODULE off
RUN mkdir -p /go/src/github.com/kubeflow/arena
WORKDIR /go/src/github.com/kubeflow/arena
COPY . .
RUN make $TARGET
RUN wget https://get.helm.sh/helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
tar -xvf helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
mv $OS_ARCH/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm && \
chmod u+x /go/src/github.com/kubeflow/arena/install.sh
RUN OS=$(echo $OS_ARCH | cut -f1 -d-) && \
ARCH=$(echo $OS_ARCH | cut -f2 -d-) && \
cd /usr/local/bin && \
curl -LO https://dl.k8s.io/release/${KUBE_VERSION}/bin/${OS}/${ARCH}/kubectl && \
chmod +x /usr/local/bin/kubectl
#**********************************************************************
#
# Create arena package
#
FROM centos:7
ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET
ARG COMMIT
ARG VERSION
ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
ENV VERSION $VERSION
ENV ARENA_HOME /arena-installer
ENV ARENA_TARFILE /arena-installer-$VERSION-$COMMIT-$OS_ARCH.tar.gz
RUN mkdir -p $ARENA_HOME/bin
COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena $ARENA_HOME/bin/arena
COPY --from=build /go/src/github.com/kubeflow/arena/uninstall.sh $ARENA_HOME/bin/arena-uninstall
COPY --from=build /go/src/github.com/kubeflow/arena/install.sh $ARENA_HOME/install.sh
COPY --from=build /go/src/github.com/kubeflow/arena/arena-gen-kubeconfig.sh $ARENA_HOME/bin/arena-gen-kubeconfig.sh
COPY --from=build /usr/local/bin/helm $ARENA_HOME/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts $ARENA_HOME/kubernetes-artifacts
COPY --from=build /go/src/github.com/kubeflow/arena/arena-artifacts $ARENA_HOME/arena-artifacts
COPY --from=build /usr/local/bin/kubectl $ARENA_HOME/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts $ARENA_HOME/charts
RUN sed -i "s@^version: \(.*\)@version: $VERSION-$COMMIT@g" $ARENA_HOME/arena-artifacts/Chart.yaml && \
sed -i "s@^appVersion: \(.*\)@appVersion: $VERSION-$COMMIT@g" $ARENA_HOME/arena-artifacts/Chart.yaml && \
tar -zcvf $ARENA_TARFILE $ARENA_HOME


@ -1,40 +0,0 @@
#FROM golang:1.10-stretch as build
FROM golang:1.14-stretch as build
RUN mkdir -p /go/src/github.com/kubeflow/arena
WORKDIR /go/src/github.com/kubeflow/arena
COPY . .
RUN make
RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.13.6
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM centos:7
COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena /usr/local/bin/arena
COPY --from=build /usr/local/bin/helm /usr/local/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts /root/kubernetes-artifacts
COPY --from=build /usr/local/bin/kubectl /usr/local/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts /charts
ADD run_arena.sh /usr/local/bin
RUN chmod u+x /usr/local/bin/run_arena.sh
RUN yum install bash-completion -y && \
echo "source <(arena completion bash)" >> ~/.bashrc
ENTRYPOINT ["/usr/local/bin/run_arena.sh"]


@ -3,7 +3,7 @@ ARG BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3
ARG USER=root
FROM golang:1.14-stretch as build
FROM golang:1.23.10 AS build
RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -12,12 +12,12 @@ COPY . .
RUN make
RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.13.6
ENV K8S_VERSION v1.28.4
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM $BASE_IMAGE


@ -2,7 +2,7 @@ ARG BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-no
ARG USER=jovyan
FROM golang:1.14-stretch as build
FROM golang:1.23.10 AS build
RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -11,12 +11,12 @@ COPY . .
RUN make
RUN wget https://get.helm.sh/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.13.6
ENV K8S_VERSION v1.28.4
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM $BASE_IMAGE

Makefile

@ -1,19 +1,64 @@
PACKAGE=github.com/kubeflow/arena
CURRENT_DIR=$(shell pwd)
DIST_DIR=${CURRENT_DIR}/bin
ARENA_CLI_NAME=arena
JOB_MONITOR=jobmon
ARENA_UNINSTALL=arena-uninstall
OS_ARCH?=linux-amd64
.SILENT:
VERSION=$(shell cat ${CURRENT_DIR}/VERSION)
BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
GIT_COMMIT=$(shell git rev-parse HEAD)
GIT_SHORT_COMMIT=$(shell git rev-parse --short HEAD)
DOCKER_BUILD_DATE=$(shell date -u +'%Y%m%d%H%M%S')
GIT_TAG=$(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
GIT_TREE_STATE=$(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
PACKR_CMD=$(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
GOBIN=$(shell go env GOPATH)/bin
else
GOBIN=$(shell go env GOBIN)
endif
# Setting SHELL to bash allows bash commands to be executed by recipes.
# Options are set to exit when a recipe line exits non-zero or a piped command fails.
SHELL = /usr/bin/env bash -o pipefail
.SHELLFLAGS = -ec
PACKAGE ?= github.com/kubeflow/arena
CURRENT_DIR ?= $(shell pwd)
DIST_DIR ?= $(CURRENT_DIR)/bin
ARENA_CLI_NAME ?= arena
JOB_MONITOR ?= jobmon
ARENA_UNINSTALL ?= arena-uninstall
OS ?= $(shell go env GOOS)
ARCH ?= $(shell go env GOARCH)
VERSION ?= $(shell cat VERSION)
BUILD_DATE := $(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
GIT_COMMIT := $(shell git rev-parse HEAD)
GIT_SHORT_COMMIT := $(shell git rev-parse --short HEAD)
DOCKER_BUILD_DATE := $(shell date -u +'%Y%m%d%H%M%S')
GIT_TAG := $(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
GIT_TREE_STATE := $(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
PACKR_CMD := $(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
# Location to install binaries
LOCALBIN ?= $(CURRENT_DIR)/bin
# Location to put temp files
TEMPDIR ?= $(CURRENT_DIR)/tmp
# ARENA_ARTIFACTS
ARENA_ARTIFACTS_CHART_PATH ?= $(CURRENT_DIR)/arena-artifacts
# Versions
GOLANG_VERSION=$(shell grep -e '^go ' go.mod | cut -d ' ' -f 2)
KUBECTL_VERSION ?= v1.28.4
HELM_VERSION ?= $(shell grep -e 'helm.sh/helm/v3 ' go.mod | cut -d ' ' -f 2)
HELM_UNITTEST_VERSION ?= 0.5.1
KIND_VERSION ?= v0.23.0
KIND_K8S_VERSION ?= v1.29.3
ENVTEST_VERSION ?= release-0.18
ENVTEST_K8S_VERSION ?= 1.29.3
GOLANGCI_LINT_VERSION ?= v2.1.6
# Binaries
ARENA ?= arena-v$(VERSION)-$(OS)-$(ARCH)
KUBECTL ?= kubectl-$(KUBECTL_VERSION)-$(OS)-$(ARCH)
HELM ?= helm-$(HELM_VERSION)-$(OS)-$(ARCH)
KIND ?= $(LOCALBIN)/kind-$(KIND_VERSION)
ENVTEST ?= $(LOCALBIN)/setup-envtest-$(ENVTEST_VERSION)
GOLANGCI_LINT ?= golangci-lint-$(GOLANGCI_LINT_VERSION)
# Tarballs
ARENA_INSTALLER ?= arena-installer-$(VERSION)-$(OS)-$(ARCH)
ARENA_INSTALLER_TARBALL ?= $(ARENA_INSTALLER).tar.gz
BUILDER_IMAGE=arena-builder
BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-notebook-gpu:v0.4.0
@ -32,8 +77,12 @@ override LDFLAGS += \
-extldflags "-static"
# docker image publishing options
IMAGE_REGISTRY ?= docker.io
IMAGE_REPOSITORY ?= kubeflow/arena
IMAGE_TAG ?= $(VERSION)
IMAGE ?= $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY):$(IMAGE_TAG)
DOCKER_PUSH=false
IMAGE_TAG=latest
BASE_IMAGE ?= debian:12-slim
ifneq (${GIT_TAG},)
IMAGE_TAG=${GIT_TAG}
@ -56,49 +105,117 @@ ifdef IMAGE_NAMESPACE
IMAGE_PREFIX=${IMAGE_NAMESPACE}/
endif
##@ General
# The help target prints out all targets with their descriptions organized
# beneath their categories. The categories are represented by '##@' and the
# target descriptions by '##'. The awk command is responsible for reading the
# entire set of makefiles included in this invocation, looking for lines of the
# file as xyz: ## something, and then pretty-format the target and help. Then,
# if there's a line with ##@ something, that gets pretty-printed as a category.
# More info on the usage of ANSI control characters for terminal formatting:
# https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_parameters
# More info on the awk command:
# http://linuxcommand.org/lc3_adv_awk.php
.PHONY: help
help: ## Display this help.
@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-30s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
.PHONY: all
all: go-fmt go-vet go-lint unit-test e2e-test
##@ Development
go-fmt: ## Run go fmt against code.
@echo "Running go fmt..."
go fmt ./...
go-vet: ## Run go vet against code.
@echo "Running go vet..."
go vet ./...
.PHONY: go-lint
go-lint: golangci-lint ## Run golangci-lint linter.
@echo "Running golangci-lint run..."
$(LOCALBIN)/$(GOLANGCI_LINT) run --timeout 5m ./...
.PHONY: go-lint-fix
go-lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes.
@echo "Running golangci-lint run --fix..."
$(LOCALBIN)/$(GOLANGCI_LINT) run --fix --timeout 5m ./...
.PHONY: unit-test
unit-test: ## Run go unit tests.
@echo "Running go test..."
go test $(shell go list ./... | grep -v /e2e) -coverprofile cover.out
.PHONY: e2e-test
e2e-test: envtest ## Run the e2e tests against a Kind k8s instance that is spun up.
@echo "Running e2e tests..."
go test ./test/e2e/ -v -ginkgo.v -timeout 30m
# Build the project
.PHONY: default
default:
ifeq ($(OS),Windows_NT)
default: cli-windows
default: arena-windows
else
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Linux)
$(info "Building on Linux")
default: cli-linux-amd64
default: arena-linux-amd64
else ifeq ($(UNAME_S),Darwin)
$(info "Building on Darwin")
default: cli-darwin-amd64
default: arena-darwin-amd64
else
$(error "The OS is not supported")
endif
endif
.PHONY: cli-linux-amd64
cli-linux-amd64:
mkdir -p bin
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} cmd/arena/*.go
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 GO111MODULE=off go build -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${JOB_MONITOR} cmd/job-monitor/*.go
##@ Build
.PHONY: cli-darwin-amd64
cli-darwin-amd64:
mkdir -p bin
CGO_ENABLED=0 GOOS=darwin GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
$(LOCALBIN):
mkdir -p $(LOCALBIN)
.PHONY: cli-darwin-arm64
cli-darwin-arm64:
mkdir -p bin
CGO_ENABLED=0 GOOS=darwin GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
$(TEMPDIR):
mkdir -p $(TEMPDIR)
.PHONY: cli-windows
cli-windows:
mkdir -p bin
CGO_ENABLED=0 GOARCH=amd64 GOOS=windows GO111MODULE=off go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
clean: ## Clean up all downloaded and generated files.
rm -rf $(LOCALBIN) $(TEMPDIR)
.PHONY: arena
arena: $(LOCALBIN) ## Build arena CLI for current platform.
@echo "Building arena CLI..."
CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build -tags netgo -ldflags '${LDFLAGS}' -o $(LOCALBIN)/$(ARENA) cmd/arena/main.go
.PHONY: install-image
install-image:
docker build -t cheyang/arena:${VERSION}-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT} -f Dockerfile.install .
.PHONY: java-sdk
java-sdk: ## Build Java SDK.
echo "Building arena Java SDK..."
mvn package -Dmaven.test.skip=true -Dgpg.skip -f sdk/arena-java-sdk
.PHONY: docker-build
docker-build: ## Build docker image.
docker build \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--tag $(IMAGE) \
-f Dockerfile \
.
.PHONY: docker-push
docker-push: ## Push docker image.
docker push $(IMAGE)
.PHONY: docker-buildx
PLATFORMS ?= linux/amd64,linux/arm64
docker-buildx: ## Build and push docker images for multiple platforms.
- $(CONTAINER_TOOL) buildx create --name arena-builder
$(CONTAINER_TOOL) buildx use arena-builder
- $(CONTAINER_TOOL) buildx build --push \
--platform=$(PLATFORMS) \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--tag $(IMAGE) \
-f Dockerfile \
.
- $(CONTAINER_TOOL) buildx rm arena-builder
.PHONY: notebook-image-kubeflow
notebook-image-kubeflow:
@ -110,22 +227,106 @@ notebook-image:
docker build --build-arg "BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3" -t cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu -f Dockerfile.notebook.cpu .
docker tag cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu cheyang/arena-notebook:cpu
# make OS_ARCH=darwin-amd64 build-pkg for mac
.PHONY: build-pkg
build-pkg:
docker rm -f arena-pkg || true
docker build --build-arg "KUBE_VERSION=v1.23.0" \
--build-arg "HELM_VERSION=v3.7.2" \
--build-arg "COMMIT=${GIT_SHORT_COMMIT}" \
--build-arg "VERSION=${VERSION}" \
--build-arg "OS_ARCH=${OS_ARCH}" \
--build-arg "GOLANG_VERSION=1.16" \
--build-arg "TARGET=cli-${OS_ARCH}" \
-t arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} -f Dockerfile.build .
docker run -itd --name=arena-pkg arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} /bin/bash
docker cp arena-pkg:/arena-installer-${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH}.tar.gz .
docker rm -f arena-pkg
.PHONY: build-dependabot
build-dependabot:
python3 hack/create_dependabot.py
.PHONY: arena-installer
arena-installer: $(ARENA_INSTALLER_TARBALL) ## Build arena installer tarball
$(ARENA_INSTALLER_TARBALL): arena kubectl helm
echo "Building arena installer tarball..." && \
rm -rf $(TEMPDIR)/$(ARENA_INSTALLER) && \
mkdir -p $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
cp $(LOCALBIN)/$(ARENA) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena && \
cp $(LOCALBIN)/$(KUBECTL) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/kubectl && \
cp $(LOCALBIN)/$(HELM) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/helm && \
cp -R charts $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp -R arena-artifacts $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp arena-gen-kubeconfig.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
cp install.sh $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp uninstall.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena-uninstall && \
tar -zcf $(ARENA_INSTALLER).tar.gz -C $(TEMPDIR) $(ARENA_INSTALLER) && \
echo "Successfully saved arena installer to $(ARENA_INSTALLER).tar.gz."
##@ Helm
.PHONY: helm-unittest
helm-unittest: helm-unittest-plugin ## Run Helm chart unittests.
set -x && $(LOCALBIN)/$(HELM) unittest $(ARENA_ARTIFACTS_CHART_PATH) --strict --file "tests/**/*_test.yaml" --chart-tests-path $(CURRENT_DIR)
##@ Dependencies
.PHONY: golangci-lint
golangci-lint: $(LOCALBIN)/$(GOLANGCI_LINT) ## Download golangci-lint locally if necessary.
$(LOCALBIN)/$(GOLANGCI_LINT): $(LOCALBIN)
$(call go-install-tool,$(LOCALBIN)/$(GOLANGCI_LINT),github.com/golangci/golangci-lint/v2/cmd/golangci-lint,${GOLANGCI_LINT_VERSION})
.PHONY: envtest
envtest: $(ENVTEST) ## Download setup-envtest locally if necessary.
$(ENVTEST): $(LOCALBIN)
$(call go-install-tool,$(ENVTEST),sigs.k8s.io/controller-runtime/tools/setup-envtest,$(ENVTEST_VERSION))
.PHONY: kubectl
kubectl: $(LOCALBIN)/$(KUBECTL)
$(LOCALBIN)/$(KUBECTL): $(LOCALBIN) $(TEMPDIR)
$(eval KUBECTL_URL=https://dl.k8s.io/release/$(KUBECTL_VERSION)/bin/$(OS)/$(ARCH)/kubectl)
$(eval KUBECTL_SHA_URL=$(KUBECTL_URL).sha256)
cd $(TEMPDIR) && \
echo "Download $(KUBECTL) if not present..." && \
if [ ! -f $(KUBECTL) ]; then \
curl -sSLo $(KUBECTL) $(KUBECTL_URL); \
fi && \
echo "Download $(KUBECTL).sha256 if not present..." && \
if [ ! -f $(KUBECTL).sha256 ]; then \
curl -sSLo $(KUBECTL).sha256 $(KUBECTL_SHA_URL); \
fi && \
echo "Verifying checksum..." && \
echo -n "$$(cat $(KUBECTL).sha256) $(KUBECTL)" | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
echo "Make kubectl executable and move it to bin directory..." && \
chmod +x $(KUBECTL) && \
cp $(KUBECTL) $(LOCALBIN) && \
echo "Successfully installed kubectl to $(LOCALBIN)/$(KUBECTL)."
.PHONY: helm
helm: $(LOCALBIN)/$(HELM)
$(LOCALBIN)/$(HELM): $(LOCALBIN) $(TEMPDIR)
$(eval HELM_URL=https://get.helm.sh/$(HELM).tar.gz)
$(eval HELM_SHA_URL=https://get.helm.sh/$(HELM).tar.gz.sha256sum)
cd $(TEMPDIR) && \
echo "Download $(HELM).tar.gz if not present..." && \
if [ ! -f $(HELM).tar.gz ]; then \
wget -qO $(HELM).tar.gz $(HELM_URL); \
fi && \
echo "Download $(HELM).tar.gz.sha256sum if not present..." && \
if [ ! -f $(HELM).tar.gz.sha256sum ]; then \
wget -qO $(HELM).tar.gz.sha256sum $(HELM_SHA_URL); \
fi && \
echo "Verifying checksum..." && \
cat $(HELM).tar.gz.sha256sum | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
echo "Extract helm tarball and move it to bin directory..." && \
tar -zxf $(HELM).tar.gz && \
cp ${OS}-${ARCH}/helm $(LOCALBIN)/$(HELM) && \
echo "Successfully installed helm to $(LOCALBIN)/$(HELM)."
.PHONY: helm-unittest-plugin
helm-unittest-plugin: helm ## Download helm unittest plugin locally if necessary.
if [ -z "$(shell $(LOCALBIN)/$(HELM) plugin list | grep unittest)" ]; then \
echo "Installing helm unittest plugin"; \
$(LOCALBIN)/$(HELM) plugin install https://github.com/helm-unittest/helm-unittest.git --version $(HELM_UNITTEST_VERSION); \
fi
# go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
# $1 - target path with name of binary (ideally with version)
# $2 - package url which can be installed
# $3 - specific version of package
define go-install-tool
@[ -f $(1) ] || { \
set -e; \
package=$(2)@$(3) ;\
echo "Downloading $${package}" ;\
GOBIN=$(LOCALBIN) go install $${package} ;\
mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\
}
endef
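
The rename step in `go-install-tool` can be sketched in plain shell, assuming the macro behaves as its comments describe: `go install` writes an unversioned binary into `$(LOCALBIN)`, and the macro recovers that path by stripping the trailing `-<version>` from the versioned target before moving the file. The function name `unversioned_path` and the sample paths are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Strip a trailing "-<version>" from a versioned target path to recover the
# path that `go install` writes to. This mirrors the sed expression in the
# macro; note that the version is interpreted as a regular expression.
unversioned_path() {
  local target="$1" version="$2"
  echo "$target" | sed "s/-${version}\$//"
}

unversioned_path "bin/golangci-lint-v2.1.6" "v2.1.6"
# prints bin/golangci-lint
```

Keeping the version in the file name lets several tool versions coexist in the same directory, and the `[ -f $(1) ]` guard in the macro skips the download when the versioned binary is already present.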

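The `kubectl` and `helm` recipes above download a binary plus a published SHA-256 digest and refuse to install on a mismatch. The sketch below shows the same idea under simplifying assumptions: it compares digests directly instead of piping a check line into `shasum --check`, and it expects the digest file to hold only the hex digest. The function names `sha256_of` and `verify_sha256` are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Print the SHA-256 hex digest of a file, using whichever tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Succeed only if the digest stored in DIGEST_FILE matches FILE.
verify_sha256() {
  local file="$1" digest_file="$2" expected
  expected="$(tr -d '[:space:]' < "$digest_file")"
  [ "$expected" = "$(sha256_of "$file")" ]
}

# Demonstration with a throwaway file instead of a real download.
workdir="$(mktemp -d)"
printf 'payload\n' > "$workdir/tool"
sha256_of "$workdir/tool" > "$workdir/tool.sha256"
verify_sha256 "$workdir/tool" "$workdir/tool.sha256" && echo "checksum ok"
printf 'tampered\n' >> "$workdir/tool"
verify_sha256 "$workdir/tool" "$workdir/tool.sha256" || echo "checksum mismatch"
rm -rf "$workdir"
```

Failing hard on a mismatch, as the recipes do with `|| (echo ... && false)`, keeps a corrupted or substituted download from being copied into the bin directory.
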
OWNERS

@ -1,9 +1,11 @@
approvers:
- cheyang
- wsxiaozhang
- denverdino
- happy2048
- Syulin7
- xieydd
- denkensk
- gujingit
- ChenYi015
reviewers:
- GarnettWang
- wsxiaozhang
- xiaozhouX
- osswangxining


@ -1,8 +1,6 @@
# Arena
[![CircleCI](https://circleci.com/gh/kubeflow/arena.svg?style=svg)](https://circleci.com/gh/kubeflow/arena)
[![Build Status](https://travis-ci.org/kubeflow/arena.svg?branch=master)](https://travis-ci.org/kubeflow/arena)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
[![GitHub release](https://img.shields.io/github/v/release/kubeflow/arena)](https://github.com/kubeflow/arena/releases) [![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml) [![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
View the [Arena documentation](https://arena-docs.readthedocs.io/en/latest).
@ -24,11 +22,9 @@ You can follow up the [Installation guide](https://arena-docs.readthedocs.io/en/
Arena is a command-line interface to run and monitor machine learning training jobs and check their results in an easy way. Please refer to the [User Guide](https://arena-docs.readthedocs.io/en/latest/training) to manage your training jobs.
## Demo
[![](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
[![arena demo](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
## Developing
@ -36,7 +32,7 @@ Prerequisites:
- Go >= 1.8
```
```shell
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
@ -50,7 +46,7 @@ Then you can follow [Installation guide for developer](https://arena-docs.readth
## CPU Profiling
```
```shell
# set profile rate (HZ)
export PROFILE_RATE=1000
@ -61,20 +57,18 @@ INFO[0000] Dump cpu profile file into /tmp/cpu_profile
Then you can analyze the profile by following [Go CPU profiling: pprof and speedscope](https://coder.today/go-profiling-pprof-and-speedscope-b05b812cc429)
## Adopters
If you are intrested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuousely discuss new requirements and feature design with you in advance.
If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuously discuss new requirements and feature design with you in advance.
## FAQ
Please refer to [FAQ](https://arena-docs.readthedocs.io/en/latest/faq)
Please refer to [FAQ](https://arena-docs.readthedocs.io/en/latest/faq).
## CLI Document
Please refer to [arena.md](docs/cli/arena.md)
Please refer to [arena.md](docs/cli/arena.md).
## RoadMap
See [RoadMap](ROADMAP.md)
See [RoadMap](ROADMAP.md).


@ -1,9 +1,6 @@
# Arena
[![CircleCI](https://circleci.com/gh/kubeflow/arena.svg?style=svg)](https://circleci.com/gh/kubeflow/arena)
[![Build Status](https://travis-ci.org/kubeflow/arena.svg?branch=master)](https://travis-ci.org/kubeflow/arena)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
[![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml)[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
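## Overview
@ -13,7 +10,6 @@ Arena is a command-line tool that lets data scientists effortlessly run and
In short, Arena's goal is to make data scientists feel as if they are working on a single machine while still enjoying the power of a GPU cluster.
## Setup
You can follow the [Installation Guide](https://arena-docs.readthedocs.io/en/latest/installation).
@ -32,8 +28,7 @@ Arena is a command-line interface for effortlessly running and monitoring machine learning
## Demo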
[![](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
[![arena demo](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
## Developing
@ -41,7 +36,7 @@ Arena is a command-line interface for effortlessly running and monitoring machine learning
- Go >= 1.8
```
```shell
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
@ -58,4 +53,3 @@ make
## Roadmap
See the [Roadmap](ROADMAP.md)


@ -1,10 +1,40 @@
# Arena 2019 Roadmap
# Kubeflow Arena Roadmap
## Kubeflow Arena 2024 Roadmap
This document defines a high level roadmap for Arena development.
### 2019
* Objective: Simplify the user experience by deeply integrating with the Kubeflow Ecosystem.
* Kubeflow Integration
* Prepare Arena for release v1.0.0 alongside Kubeflow v1.10.
* Develop a seamless integration with the Training Operator to help simplify model training from the command line.
* Integrate with Kubeflow Pipelines to facilitate model training from a Pipeline.
* Enable model serving with KServe.
* Add documentation to Kubeflow website:
* Installation, uninstallation, and upgrade processes.
* Guide for tfjob, mpijob, pytorchJob.
#### Core CUJs
* Objective: Amplify the extensibility of Arena for different ML frameworks, AIGC models and platforms.
* Support DeepSpeed Training Job.
* Support for submitting and managing LLM fine-tuning jobs, like DeepSpeed.
* Enable model serving for an expanded set of models like Baichuan, LLaMA, ChatGLM, Falcon, and more.
* Extend platform support to include ARM.
* Integrate [Fluid project](https://github.com/fluid-cloudnative/fluid) for efficient data management.
* Add support for Ray Job management with [Kuberay](https://github.com/ray-project/kuberay).
* Objective: Boost Performance and Stability.
* Regularly publish recommended practices documentation.
* Enhancements on Arena SDK.
* Enhance code quality by:
* Reducing repetitive code.
* Improving unit tests.
* Implement automated End-to-End Test:
* Add integration tests using GitHub Actions.
* Strive for more than 60% Test Coverage of Supported Features.
## Kubeflow Arena 2019 Roadmap
### Core CUJs
Objectives: "Make Arena easy to integrate with external systems."
@ -19,13 +49,13 @@ Objectives: "Simplify the user experience of the data scientists and provide a l
* Submit and manage Model Serving with [KF Serving](https://github.com/kubeflow/kfserving)
Objectives: "Make Arena support the same Operator compatiable with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
Objectives: "Make Arena support the same Operator compatible with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
* Compatibility:
* v1alpha2 and v1 TFJob
* v1alpha1 and v1alpha2 MPIJob
Objectives: "Enchance the software quality of Arena so it can be in the quick iteration"
Objectives: "Enhance the software quality of Arena so it can be in the quick iteration"
* Refactor the source code
* Move Training implementation from `cmd` into `pkg`


@ -1 +1 @@
0.9.10
0.15.1


@ -1,16 +0,0 @@
# Adopters Of Arena
Below are the adopters of project Arena. If you are using Arena to improve efficiency and productivity in Machine Learning with Kubernetes, please feel free to add yourself to the following list via a pull request. There are several phases, as follows:
* **Evaluation:** Aware of Arena and interested in it; evaluating its features and scope
* **Testing:** Considering Arena as one of the candidates; testing a Kubernetes cluster with Arena
* **Staging:** Decided to use Arena; testing it in a pre-production environment
* **Production:** Already running Arena in a production environment
| Organization | Contact | Phases | Description of Use |
| ------------ | ------- | ----------- | ------------------ |
| [Weibo](https://www.weibo.com) | [@phoenixwu0229](https://github.com/phoenixwu0229) | **Production** | Weibo ML Platform |
| [HUYA](https://www.huya.com) | [@BobLiu20](https://github.com/bobliu20) | **Production** | HUYA AI Platform |
| [Microsoft](https://www.microsoft.com) | [@chaowangnk1](https://github.com/chaowangnk1) | **Testing** | AzureML DataCache internal benchmark system |
| [Unisound](https://www.unisound.com) | [@xieydd](https://github.com/xieydd) | **Production** | Unisound ATLAS AI Platform |
| [DOUYU](https://www.douyu.com) | [@gongcan1219](https://github.com/gongcan1219) | **Production** | DOUYU AI Platform |


@ -1,40 +0,0 @@
## arena
arena is the command line interface to Arena
### Synopsis
arena is the command line interface to Arena
```
arena [flags]
```
### Options
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
-h, --help help for arena
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena completion](arena_completion.md) - output shell completion code for the specified shell (bash or zsh)
* [arena data](arena_data.md) - manage data.
* [arena delete](arena_delete.md) - delete a training job and its associated pods
* [arena get](arena_get.md) - display details of a training job
* [arena list](arena_list.md) - list all the training jobs
* [arena logs](arena_logs.md) - print the logs for a task of the training job
* [arena logviewer](arena_logviewer.md) - display Log Viewer URL of a training job
* [arena prune](arena_prune.md) - prune history job
* [arena serve](arena_serve.md) - Serve a job.
* [arena submit](arena_submit.md) - Submit a job.
* [arena top](arena_top.md) - Display Resource (GPU) usage.
* [arena version](arena_version.md) - Print version information
###### Auto generated by spf13/cobra on 24-Apr-2019


@ -1,43 +0,0 @@
## arena completion
output shell completion code for the specified shell (bash or zsh)
### Synopsis
Write bash or zsh shell completion code to standard output.
For bash, ensure you have bash completions installed and enabled.
To access completions in your current shell, run
$ source <(arena completion bash)
Alternatively, write it to a file and source in .bash_profile
For zsh, output to a file in a directory referenced by the $fpath shell
variable.
```
arena completion SHELL [flags]
```
### Options
```
-h, --help help for completion
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019


@ -1,39 +0,0 @@
## arena data
manage data.
### Synopsis
manage data volumes.
Available Commands:
list,ls List the data volumes.
```
arena data [flags]
```
### Options
```
-h, --help help for data
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena data list](arena_data_list.md) - list all the data volume.
###### Auto generated by spf13/cobra on 24-Apr-2019


@ -1,35 +0,0 @@
## arena data list
list all the data volume.
### Synopsis
list all the data volume.
```
arena data list [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena data](arena_data.md) - manage data.
###### Auto generated by spf13/cobra on 24-Apr-2019


@ -1,35 +0,0 @@
## arena delete
delete a training job and its associated pods
### Synopsis
delete a training job and its associated pods
```
arena delete a training job [flags]
```
### Options
```
-h, --help help for delete
--type string The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena get
display details of a training job
### Synopsis
display details of a training job
```
arena get training job [flags]
```
### Options
```
-e, --events Specify if show pending pod's events.
-h, --help help for get
-o, --output string Output format. One of: json|yaml|wide
--type string The training type to get, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena list
list all the training jobs
### Synopsis
list all the training jobs
```
arena list [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,41 +0,0 @@
## arena logs
print the logs for a task of the training job
### Synopsis
print the logs for a task of the training job
```
arena logs training job [flags]
```
### Options
```
-f, --follow Specify if the logs should be streamed.
-h, --help help for logs
-i, --instance string Specify the task instance to get log
--since string Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used.
--since-time string Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used.
--tail int Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines otherwise 10, if a selector is provided. (default -1)
--timestamps Include timestamps on each line in the log output
--type string The training type to show logging, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena logviewer
display Log Viewer URL of a training job
### Synopsis
display Log Viewer URL of a training job
```
arena logviewer job [flags]
```
### Options
```
-h, --help help for logviewer
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena prune
prune history job
### Synopsis
prune history job
```
arena prune history job [flags]
```
### Options
```
-h, --help help for prune
-s, --since duration Clean job that live longer than relative duration like 5s, 2m, or 3h. (default -1ns)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,43 +0,0 @@
## arena serve
Serve a job.
### Synopsis
serve a job.
Available Commands:
tensorflow,tf Submit a TensorFlow Serving Job.
tensorrt,trt Submit a TensorRT Job
```
arena serve [flags]
```
### Options
```
-h, --help help for serve
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena serve delete](arena_serve_delete.md) - delete a serving job and its associated pods
* [arena serve list](arena_serve_list.md) - list all the serving jobs
* [arena serve tensorflow](arena_serve_tensorflow.md) - Submit tensorflow serving job to deploy and serve machine learning models.
* [arena serve tensorrt](arena_serve_tensorrt.md) - Submit tensorRT inference serving job to deploy and serve machine learning models.
* [arena serve traffic-split](arena_serve_traffic-split.md) - Adjust traffic routing dynamically for tfserving jobs
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena serve delete
delete a serving job and its associated pods
### Synopsis
delete a serving job and its associated pods
```
arena serve delete a serving job [flags]
```
### Options
```
-h, --help help for delete
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena serve list
list all the serving jobs
### Synopsis
list all the serving jobs
```
arena serve list [flags]
```
### Options
```
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,54 +0,0 @@
## arena serve tensorflow
Submit tensorflow serving job to deploy and serve machine learning models.
### Synopsis
Submit tensorflow serving job to deploy and serve machine learning models.
```
arena serve tensorflow [flags]
```
### Options
```
--command string the command will inject to container's command.
--cpu string the request cpu of each replica to run the serve.
-d, --data stringArray specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
--enableIstio enable Istio for serving or not (disable Istio by default)
-e, --envs stringArray the environment variables
--exposeService expose service using Istio gateway for external access or not (not expose by default)
--gpumemory int the limit GPU memory of each replica to run the serve.
--gpus int the limit GPU count of each replica to run the serve.
-h, --help help for tensorflow
--image string the docker image name of serve job, and the default image is tensorflow/serving:latest (default "tensorflow/serving:latest")
--imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent")
--memory string the request memory of each replica to run the serve.
--modelConfigFile string Corresponding with --model_config_file in tensorflow serving
--modelName string the model name for serving
--modelPath string the model path for serving in the container
--port int the port of tensorflow gRPC listening port (default 8500)
--replicas int the replicas number of the serve job. (default 1)
--restfulPort int the port of tensorflow RESTful listening port (default 8501)
--servingName string the serving name
--servingVersion string the serving version
--versionPolicy string support latest, latest:N, specific:N, all
```
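The `--data` option expects values of the form `<name_of_datasource>:<mount_point_on_job>`. As an illustration of that format only, the Go sketch below splits such a value at the first colon. The function `parseDataFlag` and the sample value are hypothetical and are not taken from the arena source.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDataFlag splits "<name_of_datasource>:<mount_point_on_job>" at the
// first colon and rejects values where either side is empty.
func parseDataFlag(value string) (name, mountPoint string, err error) {
	parts := strings.SplitN(value, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid data value %q, want <name>:<mount_point>", value)
	}
	return parts[0], parts[1], nil
}

func main() {
	// "models" and "/data/models" are made-up example values.
	name, mount, err := parseDataFlag("models:/data/models")
	fmt.Println(name, mount, err)
}
```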
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,55 +0,0 @@
## arena serve tensorrt
Submit tensorRT inference serving job to deploy and serve machine learning models.
### Synopsis
Submit tensorRT inference serving job to deploy and serve machine learning models.
```
arena serve tensorrt [flags]
```
### Options
```
--allowMetrics Open Metric
--command string the command will inject to container's command.
--cpu string the request cpu of each replica to run the serve.
-d, --data stringArray specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
--enableIstio enable Istio for serving or not (disable Istio by default)
-e, --envs stringArray the environment variables
--exposeService expose service using Istio gateway for external access or not (not expose by default)
--gpumemory int the limit GPU memory of each replica to run the serve.
--gpus int the limit GPU count of each replica to run the serve.
--grpcPort int the port of grpc serving server (default 8001)
-h, --help help for tensorrt
--httpPort int the port of http serving server (default 8000)
--image string the docker image name of serve job, and the default image is registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3 (default "registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3")
--imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent")
--memory string the request memory of each replica to run the serve.
--metricPort int the port of metrics server (default 8002)
--modelName string the model name for serving
--modelPath string the model path for serving in the container
--modelStore string the path of tensorRT model path
--replicas int the replicas number of the serve job. (default 1)
--servingName string the serving name
--servingVersion string the serving version
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,36 +0,0 @@
## arena serve traffic-router-split
Adjust traffic routing dynamically for tfserving jobs
### Synopsis
Adjust traffic routing dynamically for tfserving jobs
```
arena serve traffic-router-split [flags]
```
### Options
```
-h, --help help for traffic-router-split
--servingName string the serving name
--versions string Model versions which the traffic will be routed to, e.g. [1,2,3] (default "[]")
--weights string Weight percentage values for each model version which the traffic will be routed to,e.g. [70,20,10] (default "[]")
```
### Options inherited from parent commands
```
--arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
--namespace string the namespace of the job (default "default")
--pprof enable cpu profile
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 7-Sep-2018

View File

@ -1,37 +0,0 @@
## arena serve traffic-split
Adjust traffic routing dynamically for tfserving jobs
### Synopsis
Adjust traffic routing dynamically for tfserving jobs
```
arena serve traffic-split [flags]
```
### Options
```
-h, --help help for traffic-split
--servingName string the serving name
--servingVersions string Model versions which the traffic will be routed to, e.g. 1,2,3
--weights string Weight percentage values for each model version which the traffic will be routed to,e.g. 70,20,10
```
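The `--servingVersions` and `--weights` options are parallel comma-separated lists, such as `1,2,3` and `70,20,10`. A minimal Go sketch of pairing them is shown below. The helper names are hypothetical, and requiring the weights to add up to 100 is an assumption drawn from the phrase "weight percentage values", not a rule confirmed by the help text.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIntList turns "70,20,10" into []int{70, 20, 10}.
func parseIntList(s string) ([]int, error) {
	var out []int
	for _, field := range strings.Split(s, ",") {
		n, err := strconv.Atoi(strings.TrimSpace(field))
		if err != nil {
			return nil, fmt.Errorf("invalid entry %q: %w", field, err)
		}
		out = append(out, n)
	}
	return out, nil
}

// pairWeights maps each model version to its traffic weight.
func pairWeights(versions, weights string) (map[int]int, error) {
	v, err := parseIntList(versions)
	if err != nil {
		return nil, err
	}
	w, err := parseIntList(weights)
	if err != nil {
		return nil, err
	}
	if len(v) != len(w) {
		return nil, fmt.Errorf("%d versions but %d weights", len(v), len(w))
	}
	total := 0
	routes := make(map[int]int, len(v))
	for i := range v {
		routes[v[i]] = w[i]
		total += w[i]
	}
	// Assumption: percentages are expected to add up to 100.
	if total != 100 {
		return nil, fmt.Errorf("weights add up to %d, want 100", total)
	}
	return routes, nil
}

func main() {
	fmt.Println(pairWeights("1,2,3", "70,20,10"))
}
```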
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,47 +0,0 @@
## arena submit
Submit a job.
### Synopsis
Submit a job.
Available Commands:
tfjob,tf Submit a TFJob.
horovod,hj Submit a Horovod Job.
mpijob,mpi Submit a MPIJob.
standalonejob,sj Submit a standalone Job.
tfserving,tfserving Submit a Serving Job.
sparkjob,spark Submit a Spark Job.
```
arena submit [flags]
```
### Options
```
-h, --help help for submit
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena submit horovodjob](arena_submit_horovodjob.md) - Submit horovodjob as training job.
* [arena submit mpijob](arena_submit_mpijob.md) - Submit MPIjob as training job.
* [arena submit standalonejob](arena_submit_standalonejob.md) - Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
* [arena submit tfjob](arena_submit_tfjob.md) - Submit TFJob as training job.
* [arena submit sparkjob](arena_submit_sparkjob.md) - Submit SparkJob as training job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,51 +0,0 @@
## arena submit horovodjob
Submit horovodjob as training job.
### Synopsis
Submit horovodjob as training job.
```
arena submit horovodjob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for horovodjob
--image string the docker image name of training job
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sshPort int ssh port.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,53 +0,0 @@
## arena submit mpijob
Submit MPIjob as training job.
### Synopsis
Submit MPIjob as training job.
```
arena submit mpijob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for mpijob
--image string the docker image name of training job
--logdir string the training logs dir, default is /training_logs (default "/training_logs")
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--tensorboard enable tensorboard
--tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena submit sparkjob
Submit SparkJob as training job.
### Synopsis
Submit SparkJob as training job.
```
arena submit sparkjob [flags]
```
### Options
```
--image string the docker image name of training job
--jar string jar path in image
--main-class string main class of your jar
--name string override name
--workers int the worker number to run the distributed training. (default 1)
```
### Options inherited from parent commands
```
--arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
--namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.

View File

@ -1,52 +0,0 @@
## arena submit standalonejob(deprecated)
**Warning: standalonejob has been deprecated; please use [tfjob](../userguide/1-tfjob-standalone.md) instead.**
Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
### Synopsis
Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
```
arena submit standalonejob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for standalonejob
--image string the docker image name of training job
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,68 +0,0 @@
## arena submit tfjob
Submit TFJob as training job.
### Synopsis
Submit TFJob as training job.
```
arena submit tfjob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--chief enable chief, which is required for estimator.
--chief-cpu string the cpu resource to use for the Chief, like 1 for 1 core.
--chief-memory string the memory resource to use for the Chief, like 1Gi.
--chief-port int the port of the chief.
--clean-task-policy string How to clean tasks after Training is done, only support Running, None. (default "Running")
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--evaluator enable evaluator, which is optional for estimator.
--evaluator-cpu string the cpu resource to use for the evaluator, like 1 for 1 core.
--evaluator-memory string the memory resource to use for the evaluator, like 1Gi.
--gpus int the GPU count of each worker to run the training.
-h, --help help for tfjob
--image string the docker image name of training job
--logdir string the training logs dir, default is /training_logs (default "/training_logs")
--name string override name
--ps int the number of the parameter servers.
--ps-cpu string the cpu resource to use for the parameter servers, like 1 for 1 core.
--ps-image string the docker image for tensorflow workers
--ps-memory string the memory resource to use for the parameter servers, like 1Gi.
--ps-port int the port of the parameter server.
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--tensorboard enable tensorboard
--tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
--worker-cpu string the cpu resource to use for the worker, like 1 for 1 core.
--worker-image string the docker image for tensorflow workers
--worker-memory string the memory resource to use for the worker, like 1Gi.
--worker-port int the port of the worker.
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,41 +0,0 @@
## arena top
Display Resource (GPU) usage.
### Synopsis
Display Resource (GPU) usage.
Available Commands:
node Display Resource (GPU) usage of nodes
job Display Resource (GPU) usage of pods
```
arena top [flags]
```
### Options
```
-h, --help help for top
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena top job](arena_top_job.md) - Display Resource (GPU) usage of jobs.
* [arena top node](arena_top_node.md) - Display Resource (GPU) usage of nodes.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena top job
Display Resource (GPU) usage of jobs.
### Synopsis
Display Resource (GPU) usage of jobs.
```
arena top job [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for job
-i, --instance string Display instance top info
-r, --refresh Display continuously
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena top](arena_top.md) - Display Resource (GPU) usage.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena top node
Display Resource (GPU) usage of nodes.
### Synopsis
Display Resource (GPU) usage of nodes.
```
arena top node [flags]
```
### Options
```
-d, --details Display details
-h, --help help for node
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena top](arena_top.md) - Display Resource (GPU) usage.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena version
Print version information
### Synopsis
Print version information
```
arena version [flags]
```
### Options
```
-h, --help help for version
--short print just the version number
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,50 +0,0 @@
## The TFJob plugin framework
Use the plugin framework if you'd like to customize or enhance the TFJob with your own chart or code.
## Developer Workflow
### Step 1: Implement the following function (optional)
```
// Customized runtime for tf training
type tfRuntime interface {
// check the tfjob args
check(tf *submitTFJobArgs) (err error)
// transform the tfjob
transform(tf *submitTFJobArgs) (err error)
getChartName() string
}
```
You can refer to the implementation of the default tf runtime in [../../cmd/arena/commands/training_plugin_interface.go](training_plugin_interface.go).
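As a rough illustration of Step 1, the Go sketch below implements the three methods for a runtime backed by a chart named `mock`. The `submitTFJobArgs` struct here is a trimmed-down, hypothetical stand-in (its fields are assumptions, not arena's real definition), and the `prepare` helper with its check-then-transform order is likewise an assumption about how a caller might use the interface.

```go
package main

import (
	"errors"
	"fmt"
)

// submitTFJobArgs is a hypothetical stand-in for arena's real argument
// struct; the field names are assumptions made for this sketch.
type submitTFJobArgs struct {
	Name        string
	WorkerCount int
	Annotations map[string]string
}

// tfRuntime mirrors the interface from Step 1.
type tfRuntime interface {
	check(tf *submitTFJobArgs) (err error)
	transform(tf *submitTFJobArgs) (err error)
	getChartName() string
}

// mockRuntime is a hypothetical runtime that renders the "mock" chart.
type mockRuntime struct{}

// check rejects arguments the mock chart cannot handle.
func (mockRuntime) check(tf *submitTFJobArgs) error {
	if tf.WorkerCount < 1 {
		return errors.New("mock runtime needs at least one worker")
	}
	return nil
}

// transform adjusts the arguments before the chart is rendered.
func (mockRuntime) transform(tf *submitTFJobArgs) error {
	if tf.Annotations == nil {
		tf.Annotations = map[string]string{}
	}
	tf.Annotations["runtime"] = "mock"
	return nil
}

func (mockRuntime) getChartName() string { return "mock" }

// prepare runs check, then transform, and returns the chart name.
// This call order is an assumption of the sketch.
func prepare(rt tfRuntime, tf *submitTFJobArgs) (string, error) {
	if err := rt.check(tf); err != nil {
		return "", err
	}
	if err := rt.transform(tf); err != nil {
		return "", err
	}
	return rt.getChartName(), nil
}

func main() {
	args := &submitTFJobArgs{Name: "test", WorkerCount: 1}
	chart, err := prepare(mockRuntime{}, args)
	fmt.Println(chart, err, args.Annotations["runtime"])
}
```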
### Step 2: Create your own chart
If you don't need custom code for `check` or `transform`, you can create the chart in the same directory as the tfjob and mpijob charts. For example, to create a chart named `mock`:
```
cd /charts
cp -r tfjob mock
```
## User Workflow
Run the submit command with the annotation `runtime={your runtime}`:
```
arena submit tf \
--name=test \
--annotation="runtime=mock" \
--workers=1 \
--chief \
--chief-cpu=4 \
--evaluator \
--evaluator-cpu=4 \
--worker-cpu=2 \
"python test.py"
```

View File

@ -1,118 +0,0 @@
## Setup
This documentation assumes you have a Kubernetes cluster already available.
If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena doesn't have to run within the Kubernetes cluster. It can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.
### Requirements
* Linux OS
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`
2\. Download the latest installer from [Release Page](https://github.com/kubeflow/arena/releases), and rename it to `arena-installer.tar.gz`
3\. Untar the installer package
```
# tar -xvf arena-installer.tar.gz
```
4\. Set up environment variables for customization
4.1\. If you'd like to run training and serving in hostNetwork
```
export USE_HOSTNETWORK=true
```
4.2\. If you'd like to customize Kubernetes namespace of arena infrastructure
```
export NAMESPACE={your namespace}
```
4.3\. If you'd like to use your private docker registry instead of `ACR(Alibaba Cloud Container Registry)`:
```
export DOCKER_REGISTRY={your docker registry}
```
4.4\. If you'd like to deploy prometheus in `ACK(Alibaba Container Service for Kubernetes)`
```
export USE_PROMETHEUS=true
export PLATFORM=ack
```
4.5\. If you'd like to use a cloud load balancer
```
export USE_LOADBALANCER=true
```
5\. Install arena
```
# cd arena-installer
# sudo ./install.sh
```
6\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package which is not installed by default.
```
yum install bash-completion -y
```
On Debian or Ubuntu Linux you may need to install with
```
apt-get install bash-completion
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
On macOS, please use bash
You can install bash-completion with Homebrew:
```
brew install bash-completion@2
```
To add arena autocompletion to your profile, so it is automatically loaded in future shells run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
```
On macOS, run the following to add the bash-completion script to your `~/.bashrc` file:
```
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
```
Then you can use [tab] to auto complete the command
```
# arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
# arena get [tab]
caffe-1080ti-1 tf1
```

View File

@ -1,157 +0,0 @@
## Setup
This documentation assumes you have a Kubernetes cluster already available.
If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena doesn't have to run within the Kubernetes cluster. It can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.
### Requirements
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`
2\. Install kubectl client
Please follow [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
3\. Install Helm client
- Download Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack it (tar -zxvf helm-v2.14.1-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)
Then run `arena-helm list` to check whether Kubernetes can be managed successfully by helm.
```
# arena-helm list
# echo $?
0
```
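The exit-status check above can also be wrapped in a small function so that a failure is reported explicitly. This is a minimal sketch; `check_cmd` is a hypothetical helper name and is not part of arena or helm.

```shell
# Hypothetical helper: run a command quietly and report whether it exited 0.
check_cmd() {
  if "$@" >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "failed: $*"
    return 1
  fi
}

# In practice: check_cmd arena-helm list
# Demonstrated here with a command that always succeeds.
check_cmd true
```

The function returns a non-zero status on failure, so it can also be used in a script that should stop when helm cannot reach the cluster.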
4\. Download the charts
```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```
5\. Install TFJob Controller
```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```
6\. Install Dashboard
```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```
7\. Install MPIJob Controller
```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```
8\. Build arena
Prerequisites:
- Go >= 1.8
```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```
The `arena` binary is located in the `arena/bin` directory. You may want to add this directory to `$PATH`.
9\. Install and configure kube-arbitrator for gang scheduling (optional)
```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```
10\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package which is not installed by default.
```
yum install bash-completion -y
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
To add arena autocompletion to your profile, so it is automatically loaded in future shells run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
```
Then you can use [tab] to auto complete the command
```
# arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
# arena get [tab]
caffe-1080ti-1 tf1
```
11\. Enable Host network for training (optional)
Training does not use `useHostNetwork` by default. If you'd like to run the training in HostNetwork, you can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```
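Before editing the charts in place, it can help to preview what the substitution does. The following is a minimal sketch on a scratch file. The file contents and the `preview_hostnetwork` helper name are assumptions for illustration and are not taken from the arena charts.

```shell
# Hypothetical helper: print a values.yaml with the same sed expression
# applied, but without -i, so the file itself is left untouched.
preview_hostnetwork() {
  sed "/useHostNetwork/s/false/true/g" "$1"
}

# Scratch file with illustrative contents (not copied from the arena charts).
scratch="$(mktemp)"
printf 'useHostNetwork: false\ntensorboard: false\n' > "$scratch"

# Only the line that mentions useHostNetwork is changed in the output.
preview_hostnetwork "$scratch"
rm -f "$scratch"
```

The address `/useHostNetwork/` limits the substitution to matching lines, which is why other `false` values in the same file are left alone.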
12\. Enable Loadbalancer in the public cloud (optional)
Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, and `LoadBalancer` is supported by their cloud providers. If you want to access tensorboard directly from the internet, you can run the command below:
```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can then be attacked easily.
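Because the substitution in this step touches every `*.yaml` file under `/charts/`, it can be useful to see first which files mention `NodePort`. This is a minimal sketch; `list_nodeport_files` is a hypothetical helper name, and the demo directory contents are made up for illustration.

```shell
# Hypothetical helper: list the YAML files under a directory that mention
# NodePort, i.e. the files the sed command in this step would modify.
list_nodeport_files() {
  find "$1" -name "*.yaml" -exec grep -l "NodePort" {} +
}

# Demo on a throwaway directory; in practice: list_nodeport_files /charts
demo="$(mktemp -d)"
printf 'type: NodePort\n' > "$demo/service.yaml"
printf 'type: ClusterIP\n' > "$demo/other.yaml"
list_nodeport_files "$demo"
rm -rf "$demo"
```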
13\. Enable Ingress in the public cloud (optional)
If you have an ingress controller configured, you can access tensorboard through ingress. You can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can then be attacked easily.
14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)
```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```
> Warning: with this setting, a docker image that has already been downloaded to a node may not be up to date.
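The expression in this step replaces every occurrence of `Always` in each `values.yaml`. If a file also held another key with that value (for example `restartPolicy: Always`, which is an assumption here and may not occur in the arena charts), it would be rewritten too. A narrower variant, sketched below with a hypothetical `set_pull_policy` helper, only rewrites lines that mention `imagePullPolicy`.

```shell
# Hypothetical narrower variant: rewrite "Always" only on lines that
# mention imagePullPolicy. The result is printed to stdout; the file
# itself is not modified.
set_pull_policy() {
  sed "/imagePullPolicy/s/Always/IfNotPresent/" "$1"
}

# Scratch file with illustrative contents (not copied from the arena charts).
scratch="$(mktemp)"
printf 'imagePullPolicy: Always\nrestartPolicy: Always\n' > "$scratch"
set_pull_policy "$scratch"
rm -f "$scratch"
```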

View File

@ -1,154 +0,0 @@
## Setup
This document assumes you already have a Kubernetes cluster available.
If you need help setting up a Kubernetes cluster, please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena does not have to run inside the Kubernetes cluster. It can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster, you can also use `arena` to manage training jobs.
### Requirements
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm must also be deployed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare the kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or by creating a `~/.kube/config`
2\. Install the kubectl client
Please follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
3\. Install the Helm client
- Download the Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack the downloaded file (tar -zxvf helm-v2.8.2-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory and move it to the desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)
Then run `helm list` to check whether helm can manage Kubernetes successfully.
```
#helm list
#echo $?
0
```
4\. Download the charts
```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```
5\. Install TFJob Controller
```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```
6\. Install Dashboard (optional)
```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```
7\. Install MPIJob Controller
```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```
8\. Install arena
Prerequisites:
- Go >= 1.8
```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```
The `arena` binary is located in the `arena/bin` directory. You may want to add this directory to `$PATH`.
9\. Install and configure kube-arbitrator for gang scheduling (optional)
```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```
10\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.
```
yum install bash-completion -y
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
To add arena autocompletion to your profile so that it is loaded automatically in future shells, run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
```
Then you can use [TAB] to autocomplete commands
```
#arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
#arena get [tab]
caffe-1080ti-1 tf1
```
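11\. Enable host network for training (optional)
By default, training does not use `useHostNetwork`. If you'd like to run training in HostNetwork, you can run the following command: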
```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```
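12\. Enable LoadBalancer in the public cloud
Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, whose cloud providers support `LoadBalancer`. If you want to access tensorboard directly from the internet, you can run the following command: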
```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```
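> Warning: we discourage exposing the service to the internet, because doing so leaves the service open to attack.
13\. Enable Ingress in the public cloud
Kubernetes can run on AWS, GCE, Azure and Alibaba Cloud, whose cloud providers support `Ingress`. If you want to access tensorboard from the internet through a unified entry point, you can run the following command: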
```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```
> Warning: we discourage exposing the service to the internet, because doing so leaves the service open to attack.
14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)
```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```
> Warning: this may cause container images to not be the latest version.

Binary file not shown.


View File

@ -1,138 +0,0 @@
Here is an example of how you can use `Arena` for machine learning training. It downloads the source code from a git URL.
1\. The first step is to check the available resources
```
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```
There are 3 nodes with an available GPU for running training jobs.
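The GPU totals can also be computed from the table itself. The sketch below assumes the five-column layout shown in the sample output above; `gpu_summary` is a hypothetical helper name and is not an arena command.

```shell
# Hypothetical helper: sum the GPU(Total) and GPU(Allocated) columns.
# Header and footer lines are skipped because their fourth and fifth
# fields are not plain integers (or they do not have five fields).
gpu_summary() {
  awk 'NF == 5 && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { total += $4; used += $5 }
       END { printf "total=%d allocated=%d free=%d\n", total, used, total - used }'
}

# Fed with the sample output from above; in practice: arena top node | gpu_summary
gpu_summary <<'EOF'
NAME                    IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
i-j6c68vrtpvj708d9x6j0  192.168.1.116  master  0           0
i-j6c8ef8d9sqhsy950x7x  192.168.1.119  worker  1           0
i-j6c8ef8d9sqhsy950x7y  192.168.1.120  worker  1           0
i-j6c8ef8d9sqhsy950x7z  192.168.1.118  worker  1           0
i-j6ccue91mx9n2qav7qsm  192.168.1.115  master  0           0
i-j6ce09gzdig6cfcy1lwr  192.168.1.117  master  0           0
Allocated/Total GPUs In Cluster:
0/3 (0%)
EOF
# prints: total=3 allocated=0 free=3
```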
2\. Now we can submit a training job with `arena`. It will download the source code from the git repository given by `--sync-source`
```
# arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can specify another one with `--workingDir`. You can also choose the branch to pull code from by adding `--env GIT_SYNC_BRANCH=main` to the parameters when submitting the job.
> If you are using a private git repo, you can use the following command:
```
# arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```
Note: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to sync the source code. You can set the environment variables defined in the git-sync project.
3\. List all the jobs
```
# arena list
NAME STATUS TRAINER AGE NODE
tf-git RUNNING tfjob 0s 192.168.1.120
```
4\. Check the resource usage of the job
```
# arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-git RUNNING TFJOB 17s 192.168.1.120 1 1
Total Allocated GPUs of Training Job:
1
Total Requested GPUs of Training Job:
1
```
5\. Check the resource usage of the cluster
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 1
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```
6\. Get the details of the specific job
```
# arena get tf-git
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git RUNNING TFJOB 5s tf-git-tfjob-worker-0 192.168.1.120
```
7\. Check logs
```
# arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
```
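To pull just the most recent accuracy figure out of logs like these, the output can be filtered with standard tools. This is a minimal sketch that assumes log lines of the form shown above; `latest_accuracy` is a hypothetical helper name.

```shell
# Hypothetical helper: print the most recent accuracy value from job logs.
# Assumes lines of the form "<timestamp> Accuracy at step N: VALUE".
latest_accuracy() {
  grep "Accuracy at step" | tail -n 1 | awk '{ print $NF }'
}

# Fed with three lines from the sample log; in practice:
#   arena logs tf-git | latest_accuracy
latest_accuracy <<'EOF'
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
EOF
# prints: 0.9649
```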
8\. More information about the training job in the logviewer
```
# arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
```
![](1-tfjob-logviewer.jpg)
Congratulations! You've run the first training job with `arena` successfully.

View File

@ -1,45 +0,0 @@
Arena supports RDMA for distributed training. We can allocate RDMA devices to worker jobs by adding the `--rdma` parameter.
1\. Deploy the RDMA device plugin
```
# Deploy RDMA device plugin
kubectl create -f kubernetes-artifacts/rdma/rdma-config.yaml
kubectl create -f kubernetes-artifacts/rdma/device-plugin.yaml
```
2\. Label the nodes that have InfiniBand devices
```
# Label RDMA NODE
kubectl label node <your node> accelerator/rdma=true
```
```
# Check Device plugin status
kubectl -n arena-system get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
rdma-sriov-dp-ds 1 1 1 1 1 accelerator/rdma=true 46d
```
3\. Enable arena RDMA config
```
find /charts/ -name values.yaml | xargs sed -i "/enableRDMA/s/false/true/g"
```
4\. Submit a Tensorflow training job using RDMA
```
# arena submit mpi --name=mpi-dist \
--rdma \
--gpus=1 \
--workers=2 \
--image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--syncMode=git \
--syncSource=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```

View File

@ -1,201 +0,0 @@
Arena supports and simplifies running distributed spark jobs.
### 1. To run a distributed spark job, you need to specify:
- The spark job image which contains the main class jar. (required)
- Main class of your jar. (required)
- Jar path in the container. (required)
- The number of executors. (default: 1)
- The resource cpu request of driver pod (default: 1)
- The resource memory request of driver pod (default: 500m)
- The resource cpu request of executor pod (default: 1)
- The resource memory request of executor pod (default: 500m)
### 2. How to create spark job image.
Arena spark jobs are based on [spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). You can create the spark job image with the [`docker-image-tool`](https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images).
### 3. How to use Arena spark job
##### Install the spark operator
```
# arena-system is the default namespace; create it if it does not exist.
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-operator.yaml
```
##### Create RBAC for the spark job
The spark job needs the service account `spark` to create executors.
```
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-rbac.yaml
```
The default namespace is `default`. If you want to run spark jobs in another namespace, you can change the namespace in spark-rbac.yaml and create a new service account.
##### Submit a spark job
```
arena submit sparkjob --name=demo --image=registry.aliyuncs.com/acs/spark:v2.4.0 --main-class=org.apache.spark.examples.SparkPi --jar=local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```
The result looks like this:
```
configmap/demo-sparkjob created
configmap/demo-sparkjob labeled
sparkapplication.sparkoperator.k8s.io/demo created
INFO[0005] The Job demo has been submitted successfully
INFO[0005] You can run `arena get demo --type sparkjob` to check the job status
```
##### Get spark job status
```
arena get --type=sparkjob demo
```
When the job succeeds, you will see the result below.
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 15s
NAME STATUS TRAINER AGE INSTANCE NODE
demo1 SUCCEEDED SPARKJOB 1h demo1-driver N/A
```
##### Watch the log of the spark job
```
arena logs -f demo
```
You will get the log of the spark driver pod.
```
2019-05-08T08:25:21.904409561Z ++ id -u
2019-05-08T08:25:21.904639867Z + myuid=0
2019-05-08T08:25:21.904649704Z ++ id -g
2019-05-08T08:25:21.904901542Z + mygid=0
2019-05-08T08:25:21.904909072Z + set +e
2019-05-08T08:25:21.905241846Z ++ getent passwd 0
2019-05-08T08:25:21.905608733Z + uidentry=root:x:0:0:root:/root:/bin/ash
2019-05-08T08:25:21.905623028Z + set -e
2019-05-08T08:25:21.905629226Z + '[' -z root:x:0:0:root:/root:/bin/ash ']'
2019-05-08T08:25:21.905633894Z + SPARK_K8S_CMD=driver
2019-05-08T08:25:21.905757494Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.90622059Z + shift 1
2019-05-08T08:25:21.906232126Z + SPARK_CLASSPATH=':/opt/spark/jars/*'
2019-05-08T08:25:21.906236316Z + env
2019-05-08T08:25:21.906239651Z + grep SPARK_JAVA_OPT_
2019-05-08T08:25:21.90624307Z + sort -t_ -k4 -n
2019-05-08T08:25:21.906585896Z + sed 's/[^=]*=\(.*\)/\1/g'
2019-05-08T08:25:21.906908601Z + readarray -t SPARK_EXECUTOR_JAVA_OPTS
2019-05-08T08:25:21.906917535Z + '[' -n '' ']'
2019-05-08T08:25:21.906999069Z + '[' -n '' ']'
2019-05-08T08:25:21.907003871Z + PYSPARK_ARGS=
2019-05-08T08:25:21.907006605Z + '[' -n '' ']'
2019-05-08T08:25:21.907008951Z + R_ARGS=
2019-05-08T08:25:21.907012105Z + '[' -n '' ']'
2019-05-08T08:25:21.907148385Z + '[' '' == 2 ']'
2019-05-08T08:25:21.907994286Z + '[' '' == 3 ']'
2019-05-08T08:25:21.908014459Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.908018653Z + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
2019-05-08T08:25:21.908023924Z + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.20.90.160 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
2019-05-08T08:25:23.326681135Z 2019-05-08 08:25:23 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08T08:25:23.829843117Z 2019-05-08 08:25:23 INFO SparkContext:54 - Running Spark version 2.4.0
2019-05-08T08:25:23.8529898Z 2019-05-08 08:25:23 INFO SparkContext:54 - Submitted application: Spark Pi
2019-05-08T08:25:23.94670344Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls to: root
2019-05-08T08:25:23.946735076Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls to: root
2019-05-08T08:25:23.946740267Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls groups to:
2019-05-08T08:25:23.946744543Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls groups to:
2019-05-08T08:25:23.946748767Z 2019-05-08 08:25:23 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-05-08T08:25:24.273960575Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2019-05-08T08:25:24.307632934Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering MapOutputTracker
2019-05-08T08:25:24.339548141Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-05-08T08:25:24.339577986Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-05-08T08:25:24.340887925Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-05-08T08:25:24.359682519Z 2019-05-08 08:25:24 INFO DiskBlockManager:54 - Created local directory at /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/blockmgr-5532fd8b-64b9-492c-b94d-308b55d60a71
2019-05-08T08:25:24.388529744Z 2019-05-08 08:25:24 INFO MemoryStore:54 - MemoryStore started with capacity 110.0 MB
2019-05-08T08:25:24.413347888Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-05-08T08:25:24.560654618Z 2019-05-08 08:25:24 INFO log:192 - Logging initialized @2462ms
2019-05-08T08:25:24.654721075Z 2019-05-08 08:25:24 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-05-08T08:25:24.680943254Z 2019-05-08 08:25:24 INFO Server:419 - Started @2586ms
2019-05-08T08:25:24.715867156Z 2019-05-08 08:25:24 INFO AbstractConnector:278 - Started ServerConnector@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:24.715897312Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2019-05-08T08:25:24.76123501Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1450078a{/jobs,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.762173789Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@534ca02b{/jobs/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.763361524Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29a23c3d{/jobs/job,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.764374535Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fe46b62{/jobs/job/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.764919809Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@591fd34d{/stages,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.765687152Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61e45f87{/stages/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.766434602Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c9b78e3{/stages/stage,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769934319Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5491f68b{/stages/stage/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769949155Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@736ac09a{/stages/pool,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769966711Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ecd665{/stages/pool/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.77037559Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@45394b31{/storage,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.772696599Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1ec7d8b3{/storage/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.772709487Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b0ca5e1{/storage/rdd,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.773014833Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb3131b{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.77546416Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54dcbb9f{/environment,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.775478151Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74fef3f7{/environment/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.775882882Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a037324{/executors,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.780702953Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@69eb86b4{/executors/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.780717178Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@585ac855{/executors/threadDump,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.78072195Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb8f9e2{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.793805533Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a933be2{/static,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808511998Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@378bd86d{/,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808532751Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2189e7a7{/api,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808537695Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@644abb8f{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.80854206Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a411233{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808546336Z 2019-05-08 08:25:24 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://demo1-1557303918993-driver-svc.default.svc:4040
2019-05-08T08:25:24.834767942Z 2019-05-08 08:25:24 INFO SparkContext:54 - Added JAR file:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar at spark://demo1-1557303918993-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.4.0.jar with timestamp 1557303924832
2019-05-08T08:25:26.274526541Z 2019-05-08 08:25:26 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes.
2019-05-08T08:25:26.455658752Z 2019-05-08 08:25:26 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2019-05-08T08:25:26.47651031Z 2019-05-08 08:25:26 INFO NettyBlockTransferService:54 - Server created on demo1-1557303918993-driver-svc.default.svc:7079
2019-05-08T08:25:26.476533172Z 2019-05-08 08:25:26 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-05-08T08:25:26.503099521Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.506168762Z 2019-05-08 08:25:26 INFO BlockManagerMasterEndpoint:54 - Registering block manager demo1-1557303918993-driver-svc.default.svc:7079 with 110.0 MB RAM, BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.529524775Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.529543725Z 2019-05-08 08:25:26 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.661414752Z 2019-05-08 08:25:26 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c777e7b{/metrics/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:30.459756195Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.20.90.161:52168) with ID 1
2019-05-08T08:25:30.534179215Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
2019-05-08T08:25:30.679510273Z 2019-05-08 08:25:30 INFO BlockManagerMasterEndpoint:54 - Registering block manager 172.20.90.161:36718 with 110.0 MB RAM, BlockManagerId(1, 172.20.90.161, 36718, None)
2019-05-08T08:25:30.906713226Z 2019-05-08 08:25:30 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2019-05-08T08:25:30.93537711Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
2019-05-08T08:25:30.936000643Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2019-05-08T08:25:30.936506781Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Parents of final stage: List()
2019-05-08T08:25:30.938152322Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Missing parents: List()
2019-05-08T08:25:30.958509715Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2019-05-08T08:25:31.128459296Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 110.0 MB)
2019-05-08T08:25:31.172704042Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 110.0 MB)
2019-05-08T08:25:31.178025215Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on demo1-1557303918993-driver-svc.default.svc:7079 (size: 1256.0 B, free: 110.0 MB)
2019-05-08T08:25:31.182000364Z 2019-05-08 08:25:31 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-05-08T08:25:31.202640906Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
2019-05-08T08:25:31.203502967Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2019-05-08T08:25:31.245126257Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.90.161, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.805815672Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.90.161:36718 (size: 1256.0 B, free: 110.0 MB)
2019-05-08T08:25:31.946492966Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.90.161, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.957903365Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 727 ms on 172.20.90.161 (executor 1) (1/2)
2019-05-08T08:25:31.99308236Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 47 ms on 172.20.90.161 (executor 1) (2/2)
2019-05-08T08:25:31.994764897Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-05-08T08:25:31.995390219Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.998 s
2019-05-08T08:25:32.003622135Z 2019-05-08 08:25:32 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.094511 s
2019-05-08T08:25:32.005407995Z Pi is roughly 3.1436157180785904
2019-05-08T08:25:32.011499948Z 2019-05-08 08:25:32 INFO AbstractConnector:318 - Stopped Spark@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:32.014105609Z 2019-05-08 08:25:32 INFO SparkUI:54 - Stopped Spark web UI at http://demo1-1557303918993-driver-svc.default.svc:4040
2019-05-08T08:25:32.01861939Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2019-05-08T08:25:32.019973046Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2019-05-08T08:25:32.025136562Z 2019-05-08 08:25:32 WARN ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
2019-05-08T08:25:32.087137746Z 2019-05-08 08:25:32 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-05-08T08:25:32.097659039Z 2019-05-08 08:25:32 INFO MemoryStore:54 - MemoryStore cleared
2019-05-08T08:25:32.098360561Z 2019-05-08 08:25:32 INFO BlockManager:54 - BlockManager stopped
2019-05-08T08:25:32.104432515Z 2019-05-08 08:25:32 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-05-08T08:25:32.10761075Z 2019-05-08 08:25:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-05-08T08:25:32.114734944Z 2019-05-08 08:25:32 INFO SparkContext:54 - Successfully stopped SparkContext
2019-05-08T08:25:32.117170277Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Shutdown hook called
2019-05-08T08:25:32.118273045Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bdb4e416-5ab7-420c-905e-ef43c30fb187
2019-05-08T08:25:32.120019227Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/spark-06dbab1f-13aa-474c-a1db-8845e14627bf
```
##### delete spark job
```
arena delete --type=sparkjob demo1
```
You will find that the Spark job has been deleted.
```
sparkapplication.sparkoperator.k8s.io "demo1" deleted
time="2019-05-08T17:27:06+08:00" level=info msg="The Job demo1 has been deleted successfully"
configmap "demo1-sparkjob" deleted
```
Congratulations! You've run the distributed spark job with `arena` successfully.

View File

@ -1,156 +0,0 @@
# Arena supports and simplifies Volcano jobs
Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms currently missing from
Kubernetes that are commonly required by many classes of batch & elastic workloads, including:
1. machine learning/deep learning,
2. bioinformatics/genomics, and
3. other "big data" applications.
## prerequisites
- a k8s deployment
- Volcano, deployed by following the steps in kubernetes-artifacts/volcano-operator/README.md
### 1. To run a batch/distributed volcano job, you may need to specify:
```
--minAvailable int The minimal available pods to run for this Job. default value is 1 (default 1)
--name string override name
--queue string Specifies the queue that will be used in the scheduler, default queue is used this leaves empty (default "default")
--schedulerName string Specifies the scheduler Name, default is volcano when not specified (default "volcano")
--taskCPU string cpu request for each task replica / pod. default value is 250m (default "250m")
--taskImages strings the docker images of different tasks of volcano job. default used 3 tasks with ubuntu,nginx and busybox images (default [ubuntu,nginx,busybox])
--taskMemory string memory request for each task replica/pod.default value is 128Mi) (default "128Mi")
--taskName string the task name of volcano job, default value is task (default "task")
--taskPort int the task port number. default value is 2222 (default 2222)
--taskReplicas int the task replica's number to run the distributed tasks. default value is 1 (default 1)
```
### 2. More information related to volcano job.
The Arena volcano job is based on [Volcano](https://github.com/volcano-sh/volcano).
You can get more information related to volcano from https://volcano.sh/
### 3. How to use Arena volcano job
##### install volcano
Deploy Volcano by following the steps in kubernetes-artifacts/volcano-operator/README.md.
To install the chart with the release name `volcano-release`:
```bash
$ helm install --name volcano-release kubernetes-artifacts/volcano-operator
```
To verify that all deployments are running, use the command below:
```bash
kubectl get deployment --all-namespaces | grep {release_name}
```
The output should look similar to the listing below, with three deployments (controllers, admission, scheduler) running.
```bash
NAME READY UP-TO-DATE AVAILABLE AGE
{release_name}-admission 1/1 1 1 4s
{release_name}-controllers 1/1 1 1 4s
{release_name}-scheduler 1/1 1 1 4s
```
To verify that all pods are running, use the command below:
```bash
kubectl get pods --all-namespaces | grep {release_name}
```
The output should look similar to the listing below: the controllers, admission, and scheduler pods should be running, and the admission-init pod should have completed.
```bash
NAMESPACE NAME READY STATUS RESTARTS AGE
default volcano-release-admission-cbfdb8549-dz5hg 1/1 Running 0 33s
default volcano-release-admission-init-7xmzd 0/1 Completed 0 33s
default volcano-release-controllers-7967fffb8d-7vnn9 1/1 Running 0 33s
default volcano-release-scheduler-746f6557d8-9pfg6 1/1 Running 0 33s
```
##### submit a volcano job
```
arena submit volcanojob --name=demo
```
The result looks like this:
```
configmap/demo-volcanojob created
configmap/demo-volcanojob labeled
job.batch.volcano.sh/demo created
INFO[0003] The Job demo has been submitted successfully
INFO[0003] You can run `arena get demo --type volcanojob` to check the job status
```
To pass more command-line parameters, run for example:
```
./bin/arena submit volcanojob --name demo12 --taskImages busybox,busybox --taskReplicas 2
```
In the case above, Arena creates two tasks, each with 2 replicas, as shown below:
```
arena get --type volcanojob demo12
```
The result is as follows:
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-0-0 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-0-1 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-1-0 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-1-1 11.245.101.184
```
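The instance names in the output above suggest a naming pattern of job name, task name, task index, and replica index. The sketch below enumerates the names that pattern would produce. The pattern is inferred from the sample output only, and `expected_instances` is a hypothetical helper, not part of Arena or Volcano.

```python
from typing import List


def expected_instances(job_name: str, task_images: List[str],
                       task_replicas: int = 1, task_name: str = "task") -> List[str]:
    """List instance names as <job>-<taskName>-<task index>-<replica index>.

    Assumption: one task is created per entry in --taskImages, and each task
    gets --taskReplicas pods. This is inferred from the sample `arena get`
    output, not taken from the Arena source code.
    """
    return [
        f"{job_name}-{task_name}-{task_idx}-{replica_idx}"
        for task_idx in range(len(task_images))
        for replica_idx in range(task_replicas)
    ]


if __name__ == "__main__":
    # Mirrors: arena submit volcanojob --name demo12 --taskImages busybox,busybox --taskReplicas 2
    for pod in expected_instances("demo12", ["busybox", "busybox"], task_replicas=2):
        print(pod)
```

With the default three images and one replica, the same helper yields three names, which matches the `demo` job shown in the next section.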
##### get volcano job status
```
arena get --type=volcanojob demo
```
When the job is running or has succeeded, you will see a result like the one below.
```
STATUS: RUNNING/SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 45s
NAME STATUS TRAINER AGE INSTANCE NODE
demo SUCCEEDED VOLCANOJOB 59s demo-task-0-0 11.245.101.184
demo RUNNING VOLCANOJOB 59s demo-task-1-0 11.245.101.184
demo SUCCEEDED VOLCANOJOB 59s demo-task-2-0 11.245.101.184
```
##### list arena jobs
```
arena list
```
We can observe the following output:
```
NAME STATUS TRAINER AGE NODE
demo RUNNING VOLCANOJOB 2m 11.245.101.184
```
##### delete volcano job
```
arena delete --type=volcanojob demo
```
You will find that the volcano job has been deleted.
```
job.batch.volcano.sh "demo" deleted
configmap "demo-volcanojob" deleted
INFO[0000] The Job demo has been deleted successfully
```
Congratulations! You've run the batch/distributed volcano job with `arena` successfully.

View File

@ -1,169 +0,0 @@
# Arena supports Priority and Preemption for MPIJob
## prerequisites
- k8s > 1.11
1. Create the `PriorityClass` objects with the YAML below:
```yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
```
Save the YAML above in a file named `pc.yaml`, then create the `PriorityClass` objects:
```
kubectl create -f pc.yaml
```
2. There is only 1 GPU available in the Kubernetes cluster:
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
192.168.0.20 192.168.0.20 master 0 0
192.168.0.21 192.168.0.21 master 0 0
192.168.0.22 192.168.0.22 master 0 0
192.168.0.23 192.168.0.23 <none> 1 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
```
3. Run an MPI training job with `medium` priority, for example:
```
# arena submit mpi \
--name=medium \
--priority=medium \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status
```
4. Get the details of the job:
```
# arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s
NAME STATUS TRAINER AGE INSTANCE NODE
medium RUNNING MPIJOB 58s medium-launcher-sz5xj 192.168.0.23
medium RUNNING MPIJOB 58s medium-worker-0 192.168.0.23
```
5. The only GPU is now used by the MPI training job `medium`:
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default medium-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 1/1 (100%)
```
6. Run an MPI training job with `critical` priority:
```
# arena submit mpi \
--name=critical \
--priority=critical \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
```
7. Check the events of the MPI training job `medium`; its worker has been preempted by `critical-worker-0`:
```
# kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN TYPE REASON OBJECT MESSAGE
15m Normal Scheduled pod/medium-worker-0 Successfully assigned default/medium-worker-0 to 192.168.0.23
14m Normal Pulled pod/medium-worker-0 Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m Normal Created pod/medium-worker-0 Created container mpi
14m Normal Started pod/medium-worker-0 Started container mpi
2m32s Normal Preempted pod/medium-worker-0 by default/critical-worker-0 on node 192.168.0.23
2m32s Normal Killing pod/medium-worker-0 Stopping container mpi
```
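The events above show the lower-priority worker being evicted to make room for the higher-priority one. The toy model below sketches that decision for a single node. It is a deliberately simplified illustration, not the actual Kubernetes scheduler algorithm; `Pod`, `PRIORITY_CLASSES`, and `schedule` are hypothetical names, and the priority values are copied from `pc.yaml`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Values taken from the PriorityClass definitions in pc.yaml.
PRIORITY_CLASSES: Dict[str, int] = {"medium": 1_000_000, "critical": 1_100_000}


@dataclass
class Pod:
    name: str
    priority_class: str
    gpus: int

    @property
    def priority(self) -> int:
        return PRIORITY_CLASSES[self.priority_class]


def schedule(node_gpus: int, running: List[Pod], incoming: Pod) -> Optional[List[Pod]]:
    """Return the pods to evict so that `incoming` fits on the node.

    Returns [] if it fits without eviction and None if it cannot be placed
    even after evicting every lower-priority pod.
    """
    free = node_gpus - sum(p.gpus for p in running)
    if incoming.gpus <= free:
        return []
    victims: List[Pod] = []
    # Consider the lowest-priority pods first; never evict equal or higher priority.
    for pod in sorted(running, key=lambda p: p.priority):
        if pod.priority >= incoming.priority:
            break
        victims.append(pod)
        free += pod.gpus
        if incoming.gpus <= free:
            return victims
    return None


if __name__ == "__main__":
    medium = Pod("medium-worker-0", "medium", gpus=1)
    critical = Pod("critical-worker-0", "critical", gpus=1)
    print(schedule(1, [medium], critical))  # the medium pod is the victim
    print(schedule(1, [critical], medium))  # None: medium cannot preempt critical
```

In this model, a `medium` job submitted while `critical` holds the only GPU simply stays pending, because it has no lower-priority pod to evict.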
8. Check the details of the MPI training job `medium`; it has failed:
```
# arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m
NAME STATUS TRAINER AGE INSTANCE NODE
medium FAILED MPIJOB 20m medium-launcher-sz5xj 192.168.0.23
```
9. Check the details of the MPI training job `critical`; it is running:
```
# arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m
NAME STATUS TRAINER AGE INSTANCE NODE
critical RUNNING MPIJOB 10m critical-launcher-mfffs 192.168.0.23
critical RUNNING MPIJOB 10m critical-worker-0 192.168.0.23
```
10. The only GPU is now used by the MPI training job `critical`:
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default critical-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
```
Congratulations! You've run jobs with priorities and preemption using `arena` successfully.

View File

@ -1,160 +0,0 @@
Arena supports assigning jobs to particular k8s nodes (currently only MPI jobs and TF jobs are supported).
Some usage examples follow.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Label the nodes. For example, give the label "gpu_node=ok" to node "cn-beijing.192.168.3.228" and node "cn-beijing.192.168.3.229", and the label "ssd_node=ok" to node "cn-beijing.192.168.3.230":
```
# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
node/cn-beijing.192.168.3.228 labeled
# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
node/cn-beijing.192.168.3.229 labeled
# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
node/cn-beijing.192.168.3.230 labeled
```
## for MPI job
1. When submitting a job, you can choose the nodes it runs on with the "--selector" option:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--selector gpu_node=ok \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 21s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 21s mpi-dist-launcher-7jn4q 192.168.3.229
mpi-dist RUNNING MPIJOB 21s mpi-dist-worker-0 192.168.3.229
Your tensorboard will be available on:
http://192.168.3.225:31611
```
The job is running on node cn-beijing.192.168.3.229 (IP 192.168.3.229).
3. You can use "--selector" multiple times. For example, "--selector gpu_node=ok --selector ssd_node=ok" in the arena submit command means that the job should run only on nodes that have both the label "gpu_node=ok" and the label "ssd_node=ok".
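The "all labels must match" rule for repeated "--selector" options can be sketched as follows. This is a minimal illustration of label matching, not Arena's implementation; `parse_selectors` and `eligible_nodes` are hypothetical helpers, and `NODE_LABELS` only mirrors the labels applied in step 2.

```python
from typing import Dict, List

# Labels applied with `kubectl label nodes` in the example above.
NODE_LABELS: Dict[str, Dict[str, str]] = {
    "cn-beijing.192.168.3.228": {"gpu_node": "ok"},
    "cn-beijing.192.168.3.229": {"gpu_node": "ok"},
    "cn-beijing.192.168.3.230": {"ssd_node": "ok"},
}


def parse_selectors(values: List[str]) -> Dict[str, str]:
    """Turn repeated --selector values such as 'gpu_node=ok' into a label map."""
    selector: Dict[str, str] = {}
    for value in values:
        key, sep, label_value = value.partition("=")
        if not sep or not key:
            raise ValueError(f"selector must look like key=value, got {value!r}")
        selector[key] = label_value
    return selector


def eligible_nodes(selector: Dict[str, str],
                   node_labels: Dict[str, Dict[str, str]] = NODE_LABELS) -> List[str]:
    """Return the nodes that carry every label in the selector."""
    return sorted(
        name for name, labels in node_labels.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


if __name__ == "__main__":
    print(eligible_nodes(parse_selectors(["gpu_node=ok"])))
    print(eligible_nodes(parse_selectors(["gpu_node=ok", "ssd_node=ok"])))
```

With the labels from step 2, no node carries both labels, so combining both selectors leaves no eligible node in this sample cluster.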
## for tf job
1. A TF job has four roles ("PS", "Worker", "Evaluator", "Chief"). You can use "--selector" to assign nodes; it applies to all roles. For example:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--selector ssd_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Use the following command to check the job status:
```
# arena get tf
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 24s tf-ps-0 192.168.3.230
tf PENDING TFJOB 24s tf-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:31867
```
Both instances ("PS" and "Worker") have been scheduled to cn-beijing.192.168.3.230 (IP 192.168.3.230, label "ssd_node=ok").
2. You can also assign nodes to a single role. For example, to run the "PS" role on nodes with the label "ssd_node=ok" and the "Worker" role on nodes with the label "gpu_node=ok", use the options "--ps-selector" and "--worker-selector":
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--worker-selector gpu_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Then check the job's status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 23s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 23s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:30162
```
The "PS" instance is running on cn-beijing.192.168.3.230 (IP 192.168.3.230, label "ssd_node=ok") and the "Worker" instance is running on cn-beijing.192.168.3.228 (IP 192.168.3.228, label "gpu_node=ok").
3. If you use "--selector" in the "arena submit tfjob" command together with "--ps-selector" (or "--worker-selector", "--evaluator-selector", "--chief-selector"), the value of "--ps-selector" overrides the value of "--selector" for that role. For example:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--selector gpu_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
The "PS" instance will run on nodes with the label "ssd_node=ok", and the other instances will run on nodes with the label "gpu_node=ok". To verify this, use the following command to check the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 39s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 39s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:32105
```
As you can see, the "PS" instance is running on a node with the label "ssd_node=ok", and the other instances are running on nodes with the label "gpu_node=ok".
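The precedence between "--selector" and the role-specific options can be sketched as a simple lookup with a fallback. This is an illustration of the behavior described above, assuming a role-specific value fully replaces the global one; `effective_selectors` is a hypothetical helper, not an Arena function.

```python
from typing import Dict, Optional

ROLES = ("ps", "worker", "evaluator", "chief")


def effective_selectors(global_selector: Optional[str],
                        role_selectors: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pick the selector each role ends up with.

    Assumption: a role-specific value (e.g. from --ps-selector) replaces the
    value of --selector for that role; all other roles fall back to --selector.
    """
    return {role: role_selectors.get(role, global_selector) for role in ROLES}


if __name__ == "__main__":
    # Mirrors: --ps-selector ssd_node=ok --selector gpu_node=ok
    for role, selector in effective_selectors("gpu_node=ok", {"ps": "ssd_node=ok"}).items():
        print(role, selector)
```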

View File

@ -1,85 +0,0 @@
Arena supports submitting jobs that tolerate k8s nodes with taints (currently only MPI jobs and TF jobs are supported).
Some usage examples follow.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Add taints to the k8s nodes. For example, add the taint "gpu_node=invalid:NoSchedule" to node "cn-beijing.192.168.3.228" and node "cn-beijing.192.168.3.229", and the taint "ssd_node=invalid:NoSchedule" to node "cn-beijing.192.168.3.230". After this, pods without a matching toleration can no longer be scheduled to these nodes.
```
# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.228 tainted
# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.229 tainted
# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
node/cn-beijing.192.168.3.230 tainted
```
3. When submitting a job, you can let it tolerate tainted nodes with the "--toleration" option:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.230
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
The job is running on node cn-beijing.192.168.3.230 (IP 192.168.3.230, taint "ssd_node=invalid").
4. You can use "--toleration" multiple times. For example, "--toleration gpu_node --toleration ssd_node" in the arena submit command means that the job tolerates nodes with the taint "gpu_node=invalid" as well as nodes with the taint "ssd_node=invalid".
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--toleration gpu_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job status:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.229
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
5. You can use "--toleration all" to tolerate all node taints.
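The effect of "--toleration" on where a job may land can be sketched as a set check: a node is usable only if every taint on it is tolerated. This is a simplified model that matches taints by key only, which is an assumption about how the option is interpreted; `tolerates` and `schedulable_nodes` are hypothetical helpers, and `NODE_TAINTS` mirrors the taints added in step 2.

```python
from typing import Dict, Iterable, List

# Taint keys added with `kubectl taint nodes` in the example above.
NODE_TAINTS: Dict[str, List[str]] = {
    "cn-beijing.192.168.3.228": ["gpu_node"],
    "cn-beijing.192.168.3.229": ["gpu_node"],
    "cn-beijing.192.168.3.230": ["ssd_node"],
}


def tolerates(taint_keys: Iterable[str], tolerations: Iterable[str]) -> bool:
    """Return True if every taint key is tolerated; 'all' tolerates everything."""
    tolerated = set(tolerations)
    if "all" in tolerated:
        return True
    return set(taint_keys) <= tolerated


def schedulable_nodes(tolerations: Iterable[str],
                      node_taints: Dict[str, List[str]] = NODE_TAINTS) -> List[str]:
    """List the tainted worker nodes a job with these tolerations may run on."""
    tolerations = list(tolerations)
    return sorted(name for name, taints in node_taints.items()
                  if tolerates(taints, tolerations))


if __name__ == "__main__":
    print(schedulable_nodes(["ssd_node"]))
    print(schedulable_nodes(["ssd_node", "gpu_node"]))
    print(schedulable_nodes(["all"]))
```

Note that a toleration only permits scheduling on a tainted node; it does not force the job onto it, which is why the two-toleration example above ends up spread across two nodes.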

Binary file not shown.


Binary file not shown.


View File

@ -1,80 +0,0 @@
# Serving Trained Model with arena
You can use arena to deploy your trained model as a RESTful API. To illustrate the usage, we use a sample project, [fast-style-transfer](https://github.com/floydhub/fast-style-transfer). To save time, we use its trained model and add the model to the docker image.
### 1.Serve Mode
We use the app.py script in the project to start the RESTful server. You can use arena to deploy the trained model:
```
# arena serve custom \
--name=fast-style-transfer \
--gpus=1 \
--version=alpha \
--replicas=1 \
--restful-port=5000 \
--image=happy365/fast-style-transfer:latest \
"python app.py"
```
Check the status of the custom serving job:
```
# arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
fast-style-transfer CUSTOM alpha 1 0 172.21.8.94 grpc:8001,restful:5000
```
Because the docker image is very large, pulling it takes some time. We can use kubectl to check the pod status:
```
# kubectl get po
NAME READY STATUS RESTARTS AGE
fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj 0/1 ContainerCreating 0 6m44s
```
### 2.Access the service
We can use a client to access the service. Run the following command to create a client:
```
# kubectl run sample-client \
--generator=run-pod/v1 \
--image=happy365/arena-serve-custem-sample-client:latest \
--command -- \
/bin/sleep infinity
```
Then we can query the status of sample-client:
```
# kubectl get po sample-client
NAME READY STATUS RESTARTS AGE
sample-client 1/1 Running 0 87s
```
Next, look up the service name. It is a combination of the job name and the version (the sample job name is fast-style-transfer and the version is alpha):
```
# kubectl get svc fast-style-transfer-alpha
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fast-style-transfer-alpha ClusterIP 172.21.1.114 <none> 5000/TCP 31m
```
Now we can use the "kubectl exec" command to log in to the container:
```
# kubectl exec -ti sample-client -- /bin/sh
#
```
Then we use the "curl" command to access the custom serving job:
```
# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
```
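For clients that prefer a script over curl, the same request can be sketched with the Python standard library. This is a hedged illustration, not part of the sample client image: `service_url`, `build_multipart`, and `stylize` are hypothetical helpers, and the form field name `file` is taken from the curl command above.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def service_url(job_name: str, version: str, port: int) -> str:
    """Build the in-cluster URL from the <job name>-<version> service name."""
    return f"http://{job_name}-{version}:{port}"


def build_multipart(field: str, filename: str, payload: bytes,
                    boundary: str = "") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def stylize(input_path: str, output_path: str, url: str) -> None:
    """POST the input image to the service and save the response body."""
    with open(input_path, "rb") as src:
        payload = src.read()
    body, content_type = build_multipart("file", os.path.basename(input_path), payload)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as dst:
        dst.write(response.read())


if __name__ == "__main__":
    # Only reachable from inside the cluster, e.g. from the sample-client pod.
    stylize("/root/input/beijing.jpg", "/root/output/beijing_out.jpg",
            service_url("fast-style-transfer", "alpha", 5000))
```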
The input is an image named "beijing.jpg" ![beijing.jpg](15-custom-serving-sample-beijing.jpg), stored in "/root/input"; the output is stored in "/root/output". You can use the "kubectl cp" command to copy the output image from the container to the host:
```
# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
```
Now you can view the image at ~/beijing_out.jpg. Here is "beijing_out.jpg": ![beijing_out.jpg](15-custom-serving-sample-beijing_out.jpg)

View File

@ -1,73 +0,0 @@
# Assign configuration files for jobs
You can pass configuration files to containers when submitting jobs.
This feature only supports the following job types:
* tfjob
* mpijob
## 1.usage
You can use `--config-file <host_path_file>:<container_path_file>` to assign a configuration file to a container. The following rules apply:
* if `<host_path_file>` is given without `<container_path_file>`, `<container_path_file>` is taken to be the same as `<host_path_file>`
* `<container_path_file>` must be a file with an absolute path
* you can use `--config-file` more than once in a command, e.g. `--config-file /tmp/test1.conf:/etc/config/test1.conf --config-file /tmp/test2.conf:/etc/config/test2.conf`
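The rules above can be sketched as a small parser for a single `--config-file` value. This is an illustration of the stated rules only, not Arena's implementation; `parse_config_file` is a hypothetical helper, and rejecting a relative container path with an error is an assumption about how the rule is enforced.

```python
import posixpath
from typing import List, Tuple


def parse_config_file(spec: str) -> Tuple[str, str]:
    """Split '<host_path_file>:<container_path_file>' into its two paths.

    If the container path is omitted, it defaults to the host path.
    The container path must be absolute.
    """
    host_path, _, container_path = spec.partition(":")
    if not host_path:
        raise ValueError(f"missing host path in {spec!r}")
    if not container_path:
        container_path = host_path
    if not posixpath.isabs(container_path):
        raise ValueError(f"container path must be absolute, got {container_path!r}")
    return host_path, container_path


def parse_config_files(specs: List[str]) -> List[Tuple[str, str]]:
    """Parse every value given through repeated --config-file options."""
    return [parse_config_file(spec) for spec in specs]


if __name__ == "__main__":
    print(parse_config_files([
        "/tmp/test1.conf:/etc/config/test1.conf",
        "/tmp/test2.conf",
    ]))
```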
## 2.sample
First, we create a test file named "test-config.json" at "/tmp/test-config.json". We want to push this file to the containers of a tfjob (or mpijob), where its path will be "/etc/config/config.json".
```
# cat /tmp/test-config.json
{
"key": "job-config"
}
```
Second, use the following command to create the tfjob:
```
# arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--config-file /tmp/test-config.json:/etc/config/config.json \
"python /app/main.py"
```
Wait a moment, then get the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 16s tf-ps-0 192.168.7.18
tf RUNNING TFJOB 16s tf-worker-0 192.168.7.16
Your tensorboard will be available on:
http://192.168.7.10:31825
```
Use kubectl to check whether the file is in the containers:
```
# kubectl exec -ti tf-ps-0 -- cat /etc/config/config.json
{
"key": "job-config"
}
# kubectl exec -ti tf-worker-0 -- cat /etc/config/config.json
{
"key": "job-config"
}
```
As you can see, the file is in the containers.

View File

@ -1,95 +0,0 @@
This example shows how to use `Arena` to submit a PyTorch stand-alone job. The example downloads the source code from a git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a PyTorch training job. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
# Single gpu card
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-local-git-pytorchjob created
configmap/pytorch-local-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-local-git created
INFO[0000] The Job pytorch-local-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-local-git --type pytorchjob` to check the job status
```
> The source code will be downloaded and extracted to the directory `code/` under the working directory. The default working directory is `/root`; you can specify a different one with `--workingDir`.
> If you are using a private git repo, you can use the following command:
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
```
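The script path used in the commands above (`/root/code/mnist-pytorch/mnist.py`) follows from where the synced code lands. The sketch below derives that directory from the sync source and the working directory. The repository-name subdirectory is inferred from the example path rather than documented, and `synced_code_dir` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def synced_code_dir(sync_source: str, working_dir: str = "/root") -> str:
    """Return <working dir>/code/<repo name> for a git sync source.

    Assumption: the repository is checked out into a subdirectory named after
    the repository (without the .git suffix), as the example path suggests.
    """
    repo = posixpath.basename(urlparse(sync_source).path.rstrip("/"))
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return posixpath.join(working_dir, "code", repo)


if __name__ == "__main__":
    code_dir = synced_code_dir("https://code.aliyun.com/370272561/mnist-pytorch.git")
    print(posixpath.join(code_dir, "mnist.py"))
```

If you pass a different `--workingDir`, the path in the training command has to change accordingly.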
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-local-git SUCCEEDED PYTORCHJOB 21h N/A
```
4. Get the details of this job.
```
➜ arena get pytorch-local-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 35s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-local-git SUCCEEDED PYTORCHJOB 23h pytorch-local-git-master-0 172.16.0.210
```
5. Check logs.
```
➜ arena logs pytorch-local-git
WORLD_SIZE: 1, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
...
```

View File

@ -1,131 +0,0 @@
This example shows how to use `Arena` to submit a PyTorch distributed job. The example downloads the source code from a git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a PyTorch distributed training job with 2 nodes and one GPU card. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-git \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-dist-git-pytorchjob created
configmap/pytorch-dist-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-dist-git created
INFO[0000] The Job pytorch-dist-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status
```
> The source code is downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can specify a different one with `--workingDir`.
> `workers` is the total number of nodes participating in the training (an integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in the pytorch-operator). The default value is 1, so the parameter can be omitted for a stand-alone job.
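As a rough illustration of how the flags above fit together, the following Python sketch assembles the same command line as a list of arguments. The helper name `build_submit_command` is hypothetical and not part of Arena; it only uses flags that appear in this example, and it does not run anything.

```python
import shlex


def build_submit_command(name, image, sync_source, train_cmd, gpus=1, workers=1):
    """Assemble the argv for `arena submit pytorch` (hypothetical helper).

    Only flags shown in the example above are used.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1 (the count includes the rank0 master)")
    argv = [
        "arena", "submit", "pytorch",
        f"--name={name}",
        f"--gpus={gpus}",
    ]
    # With a single worker the flag can be left out (stand-alone job).
    if workers > 1:
        argv.append(f"--workers={workers}")
    argv += [
        f"--image={image}",
        "--sync-mode=git",
        f"--sync-source={sync_source}",
        train_cmd,  # the training command is the final positional argument
    ]
    return argv


argv = build_submit_command(
    name="pytorch-dist-git",
    image="registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime",
    sync_source="https://code.aliyun.com/370272561/mnist-pytorch.git",
    train_cmd="python /root/code/mnist-pytorch/mnist.py --backend gloo",
    workers=2,
)
print(shlex.join(argv))
```

On a machine where `arena` is installed, the resulting list could be handed to `subprocess.run`; keeping it as a list avoids shell quoting problems with the training command.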
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h N/A
```
4. Get the details of this job. There are 2 instances of this job, and instance `pytorch-dist-git-master-0` is rank0. Arena simplifies the process of submitting distributed jobs with `PyTorch-Operator`.
`PyTorch-Operator` creates a `Service` for the `master` instance so that other nodes can reach it by the `Service` name, and injects the environment variables `MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, and `RANK` into each instance. These are used to initialize the distributed process group in PyTorch (`dist.init_process_group`). `MASTER_PORT` is assigned automatically. `MASTER_ADDR` is "localhost" in the `master` instance and the `Service` name of the `master` in the other instances. `WORLD_SIZE` is the total number of instances. `RANK` is the serial number of the current node: 0 for the `master`, and the index in the instance name suffix plus one for a `Worker` instance. For example, in the following output, the `RANK` of instance `pytorch-dist-git-worker-0` is `0 + 1 = 1`.
In Arena, the value of `--workers` includes the `master` instance, because the `master` also takes part in training.
```
➜ arena get pytorch-dist-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h pytorch-dist-git-master-0 172.16.0.210
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h pytorch-dist-git-worker-0 172.16.0.210
```
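The rank rule described in this step can be sketched in a few lines of Python. Both function names below are hypothetical helpers written for illustration; only the four environment variable names and the instance naming pattern come from the text above.

```python
import os
import re


def expected_rank(instance_name):
    """Derive the rank from an instance name such as `<job>-worker-0`.

    The master is rank 0; a worker's rank is its index suffix plus one.
    """
    match = re.search(r"-(master|worker)-(\d+)$", instance_name)
    if match is None:
        raise ValueError(f"unrecognized instance name: {instance_name!r}")
    role, index = match.group(1), int(match.group(2))
    return 0 if role == "master" else index + 1


def read_dist_env(environ=os.environ):
    """Collect the variables injected into each instance into a dict."""
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
    }


print(expected_rank("pytorch-dist-git-master-0"))
print(expected_rank("pytorch-dist-git-worker-0"))
```

In a training script, `torch.distributed.init_process_group(backend="gloo")` with its default `env://` initialization reads these same variables, so the script normally does not need to parse them itself; `read_dist_env` is only useful for logging or sanity checks.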
5. Check logs.
```
➜ arena logs pytorch-dist-git
WORLD_SIZE: 2, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
Train Epoch: 1 [3200/60000 (5%)] loss=1.4142
Train Epoch: 1 [3840/60000 (6%)] loss=1.0009
...
```
> For a distributed job with multiple instances, the default output is the log of rank0 (the `master` instance). To view the log of a specific instance, pass its name with `-i`, for example:
```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0
WORLD_SIZE: 2, CURRENT_RANK: 1
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
Train Epoch: 1 [3200/60000 (5%)] loss=1.4142
```
> In addition, you can view only the last few lines of the log by passing the number of lines with `-t`, for example:
```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5
Train Epoch: 1 [58880/60000 (98%)] loss=0.2048
Train Epoch: 1 [59520/60000 (99%)] loss=0.0646
accuracy=0.9661
```
> For more parameters, see `arena logs --help`.
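If you want to post-process the training log, for example to track the loss over time, the `Train Epoch` lines shown above are regular enough to parse. The sketch below is an assumption-based example: the function name `parse_train_lines` is hypothetical, and the pattern only matches the line format printed by this particular sample script.

```python
import re

# Matches lines such as: Train Epoch: 1 [640/60000 (1%)] loss=2.2135
TRAIN_LINE = re.compile(
    r"Train Epoch: (\d+) \[(\d+)/(\d+) \((\d+)%\)\] loss=([0-9.]+)"
)


def parse_train_lines(text):
    """Return (epoch, samples_seen, total_samples, loss) for each matching line."""
    records = []
    for line in text.splitlines():
        match = TRAIN_LINE.search(line)
        if match:
            epoch, seen, total, _percent, loss = match.groups()
            records.append((int(epoch), int(seen), int(total), float(loss)))
    return records


sample = """Train Epoch: 1 [58880/60000 (98%)] loss=0.2048
Train Epoch: 1 [59520/60000 (99%)] loss=0.0646
accuracy=0.9661
"""
for record in parse_train_lines(sample):
    print(record)
```

Lines that do not match, such as the NumPy `FutureWarning` messages or the final `accuracy=` line, are skipped.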


@ -1,75 +0,0 @@
This example shows how to use `Arena` to submit a distributed PyTorch job and visualize it with `Tensorboard`. The source code is downloaded from a Git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a distributed PyTorch training job with 2 nodes, each using one GPU card. This example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-tensorboard \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--tensorboard \
--logdir=/root/logs \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-dist-tensorboard-pytorchjob created
configmap/pytorch-dist-tensorboard-pytorchjob labeled
service/pytorch-dist-tensorboard-tensorboard created
deployment.apps/pytorch-dist-tensorboard-tensorboard created
pytorchjob.kubeflow.org/pytorch-dist-tensorboard created
INFO[0000] The Job pytorch-dist-tensorboard has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-tensorboard --type pytorchjob` to check the job status
```
> The source code is downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can specify a different one with `--workingDir`.
> `workers` is the total number of nodes participating in the training (an integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in the pytorch-operator). The default value is 1, so the parameter can be omitted for a stand-alone job.
> `logdir` indicates where the tensorboard reads the event logs of PyTorch.
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h N/A
```
4. Get the details of this job.
```
➜ arena get pytorch-dist-tensorboard
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 15m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-master-0 172.16.0.210
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-worker-0 172.16.0.210
Your tensorboard will be available on:
http://172.16.0.205:30583
```
> Notice: you can access the tensorboard at `172.16.0.205:30583`. If you can't reach it directly from your laptop, consider using `sshuttle`. For example:
```
# you can install sshuttle==0.74 in your mac with python2.7
➜ pip install sshuttle==0.74
# 0/0 -> 0.0.0.0/0
➜ sshuttle -r root@39.104.17.205 0/0
```
![](19-pytorchjob-tensorboard.png)


@ -1,109 +0,0 @@
Here is an example of how to use `Arena` for machine learning training. It downloads the source code from a Git URL and uses Tensorboard to visualize the TensorFlow computation graph and plot quantitative metrics.
1. The first step is to check the available resources.
```
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2\. Now we can submit a training job with the `arena` CLI. It downloads the source code from the Git repository.
```
# arena submit tf \
--name=tf-tensorboard \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--env=TEST_TMPDIR=code/tensorflow-sample-code/ \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--tensorboard \
--logdir=/training_logs \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000"
configmap/tf-tensorboard-tfjob created
configmap/tf-tensorboard-tfjob labeled
service/tf-tensorboard-tensorboard created
deployment.extensions/tf-tensorboard-tensorboard created
tfjob.kubeflow.org/tf-tensorboard created
INFO[0001] The Job tf-tensorboard has been submitted successfully
INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status
```
> The source code is downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can specify a different one with `--workingDir`.
> `logdir` indicates where the tensorboard reads the event logs of TensorFlow.
3\. List all the jobs
```
# arena list
NAME STATUS TRAINER AGE NODE
tf-tensorboard RUNNING TFJOB 0s 192.168.1.119
```
4\. Check the resource usage of the job
```
# arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-tensorboard RUNNING TFJOB 26s 192.168.1.119 1 1
Total Allocated GPUs of Training Job:
0
Total Requested GPUs of Training Job:
1
```
5\. Check the resource usage of the cluster
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```
6\. Get the details of the specific job
```
# arena get tf-tensorboard
NAME STATUS TRAINER AGE INSTANCE NODE
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-586fcf4d6f-vtlxv 192.168.1.119
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-worker-0 192.168.1.119
Your tensorboard will be available on:
192.168.1.117:30670
```
> Notice: you can access the tensorboard at `192.168.1.117:30670`. If you can't reach it directly from your laptop, consider using `sshuttle`. For example: `sshuttle -r root@47.89.59.51 192.168.0.0/16`
![](2-tensorboard.jpg)
Congratulations! You've run the training job with `arena` successfully, and you can also check the tensorboard easily.


@ -1,123 +0,0 @@
This example shows how to use `Arena` to submit a distributed PyTorch job and mount an NFS data volume. The source code is downloaded from a Git URL.
1. Set up an NFS server (refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html).
```shell
# install nfs server
➜ yum install nfs-utils -y
# Create local directory of NFS server
➜ mkdir -p /root/nfs/data
# Configure nfs server
➜ cat /etc/exports
/root/nfs/data *(rw,no_root_squash)
# Start nfs server
➜ systemctl start nfs; systemctl start rpcbind
➜ systemctl enable nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
```
2. Download training data to shared directory of NFS.
```shell
# Get information of NFS server by showmount, 172.16.0.200 is the host ip of NFS server
➜ showmount -e 172.16.0.200
Export list for 172.16.0.200:
/root/nfs/data *
# Enter shared directory
➜ cd /root/nfs/data
# Prepare training data to shared directory
➜ pwd
/root/nfs/data
# MNIST -> That's the training data we need
➜ ll
total 8.0K
drwxr-xr-x 4 502 games 4.0K Jun 17 16:05 data
drwxr-xr-x 4 root root 4.0K Jun 23 15:17 MNIST
```
3. Create PV.
```shell
# Note: Typesetting may cause yaml indentation problems
➜ cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pytorchdata
  labels:
    pytorchdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 172.16.0.200
    path: "/root/nfs/data"
➜ kubectl create -f nfs-pv.yaml
persistentvolume/pytorchdata created
➜ kubectl get pv | grep pytorchdata
pytorchdata 10Gi RWX Retain Bound default/pytorchdata 7m38s
```
4. Create PVC.
```shell
➜ cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      pytorchdata: nas-mnist
➜ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim/pytorchdata created
➜ kubectl get pvc | grep pytorchdata
pytorchdata Bound pytorchdata 10Gi RWX 2m3s
```
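The PVC above requests 5Gi but ends up bound to the 10Gi PV, because the claim's label selector, access mode, and size request are all satisfied by that volume. The Python sketch below models this check in a deliberately simplified way; the function names and the dictionary layout are assumptions for illustration, and the real Kubernetes binding logic considers more fields (for example the storage class and volume mode).

```python
def parse_gi(quantity):
    """Parse a quantity such as '10Gi' into an integer number of GiB.

    Only the 'Gi' suffix used in this example is supported.
    """
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def pv_satisfies_claim(pv, pvc):
    """Simplified model: labels, access modes, and capacity must all match."""
    labels_ok = all(
        pv["labels"].get(key) == value
        for key, value in pvc["match_labels"].items()
    )
    modes_ok = set(pvc["access_modes"]) <= set(pv["access_modes"])
    size_ok = parse_gi(pv["capacity"]) >= parse_gi(pvc["request"])
    return labels_ok and modes_ok and size_ok


pv = {
    "labels": {"pytorchdata": "nas-mnist"},
    "access_modes": ["ReadWriteMany"],
    "capacity": "10Gi",
}
pvc = {
    "match_labels": {"pytorchdata": "nas-mnist"},
    "access_modes": ["ReadWriteMany"],
    "request": "5Gi",
}
print(pv_satisfies_claim(pv, pvc))
```

Once bound, `kubectl get pvc` reports the capacity of the volume (10Gi) rather than the requested 5Gi, which matches the output above.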
5. Check the data volume.
```shell
➜ arena data list
NAME ACCESSMODE DESCRIPTION OWNER AGE
pytorchdata ReadWriteMany this is the mnist demo Tom 2m
```
6. Submit the PyTorch job and mount the distributed storage volume with `--data pvc_name:container_path`.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-data \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--data=pytorchdata:/mnist_data \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data"
configmap/pytorch-data-pytorchjob created
configmap/pytorch-data-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-data created
INFO[0000] The Job pytorch-data has been submitted successfully
INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status
```
7. Get the status of volume `pytorchdata` in one of the instances with `kubectl describe`.
```shell
# Get the details of this job
➜ arena get pytorch-data
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 56s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-master-0 172.16.0.210
pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-worker-0 172.16.0.210
# Get status of volume `pytorchdata` from `pytorch-data-master-0`
➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3
```
![](20-pytorchjob-distributed-data.png)


@ -1,54 +0,0 @@
## Arena supports assigning pytorch jobs to particular k8s nodes
1. Get k8s cluster information:
```shell
➜ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-huhehaote.172.16.0.205 Ready master 4h19m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206 Ready master 4h18m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207 Ready master 4h17m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208 Ready <none> 4h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209 Ready <none> 4h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210 Ready <none> 4h13m v1.16.9-aliyun.1
```
2. Label the nodes, for example:
```shell
# 172.16.0.208 label gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=ok
node/cn-huhehaote.172.16.0.208 labeled
# 172.16.0.209 label gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=ok
node/cn-huhehaote.172.16.0.209 labeled
# 172.16.0.210 label ssd_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=ok
node/cn-huhehaote.172.16.0.210 labeled
```
3. When submitting a PyTorch job, you can use `--selector` to decide which nodes the job runs on.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-selector \
--gpus=1 \
--workers=2 \
--selector gpu_node=ok \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-selector-pytorchjob created
configmap/pytorch-selector-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-selector created
INFO[0000] The Job pytorch-selector has been submitted successfully
INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status
```
4. Get the job details. You can see that the job runs only on the node with IP 172.16.0.209, which has the label `gpu_node=ok`.
```shell
➜ arena get pytorch-selector
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 14s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-master-0 172.16.0.209
pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-worker-0 172.16.0.209
```
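The effect of `--selector gpu_node=ok` can be pictured as a simple label filter over the nodes. The following Python sketch is a simplified model, not Arena or scheduler code: the function name `nodes_matching` is hypothetical, and the label data mirrors the `kubectl label` commands from step 2.

```python
def nodes_matching(selector, node_labels):
    """Return the names of nodes whose labels contain every selector pair."""
    return sorted(
        name
        for name, labels in node_labels.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


# Labels applied in step 2 of this example.
node_labels = {
    "cn-huhehaote.172.16.0.208": {"gpu_node": "ok"},
    "cn-huhehaote.172.16.0.209": {"gpu_node": "ok"},
    "cn-huhehaote.172.16.0.210": {"ssd_node": "ok"},
}

print(nodes_matching({"gpu_node": "ok"}, node_labels))
```

Under this model the job is eligible for nodes `.208` and `.209`; which of the eligible nodes the scheduler actually picks depends on other factors, and in the output above both instances landed on `.209`.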


@ -1,96 +0,0 @@
## Arena supports submitting a pytorch job that tolerates k8s nodes with taints
1. Get k8s cluster information:
```shell
➜ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-huhehaote.172.16.0.205 Ready master 5h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206 Ready master 5h12m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207 Ready master 5h11m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208 Ready <none> 5h7m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209 Ready <none> 5h7m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210 Ready <none> 5h7m v1.16.9-aliyun.1
```
2. Add some taints to the k8s nodes, for example:
```shell
# taint --> gpu_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.208 tainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.209 tainted
# taint --> ssd_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.210 tainted
```
3. If you tainted the wrong nodes, or want to restore a node's schedulability, you can remove the taints with the following commands:
```shell
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node-
node/cn-huhehaote.172.16.0.208 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node-
node/cn-huhehaote.172.16.0.209 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node-
node/cn-huhehaote.172.16.0.210 untainted
```
4. When submitting a job, you can use the `--toleration` option to let the job run on nodes with taints, for example `--toleration=gpu_node`. This parameter can be used multiple times with different taint keys.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-toleration \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--tensorboard \
--logdir=/root/logs \
--toleration gpu_node \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-toleration-pytorchjob created
configmap/pytorch-toleration-pytorchjob labeled
service/pytorch-toleration-tensorboard created
deployment.apps/pytorch-toleration-tensorboard created
pytorchjob.kubeflow.org/pytorch-toleration created
INFO[0000] The Job pytorch-toleration has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status
```
5. Get the details of this job.
```shell
➜ arena get pytorch-toleration
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-master-0 172.16.0.209
pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-worker-0 172.16.0.209
Your tensorboard will be available on:
http://172.16.0.205:32091
```
6. You can use `--toleration all` to tolerate all node taints.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-toleration-all \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--toleration all \
"python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo"
configmap/pytorch-toleration-all-pytorchjob created
configmap/pytorch-toleration-all-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-toleration-all created
INFO[0000] The Job pytorch-toleration-all has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-toleration-all
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 33s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-toleration-all RUNNING PYTORCHJOB 33s pytorch-toleration-all-master-0 172.16.0.210
```
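The interplay of taints and `--toleration` in the steps above can be sketched as a small filter. This is a simplified model under stated assumptions: it only considers the taint key (as the `--toleration gpu_node` usage suggests), treats every taint as `NoSchedule`, and the function name `schedulable_nodes` is hypothetical. How Arena translates the flag into Kubernetes toleration objects is not shown in this example.

```python
def schedulable_nodes(node_taints, tolerations):
    """Return nodes whose taint keys are all tolerated.

    `tolerations` is a set of taint keys; the special value "all"
    tolerates every taint, mirroring `--toleration all`.
    """
    if "all" in tolerations:
        return sorted(node_taints)
    return sorted(
        name
        for name, taint_keys in node_taints.items()
        if all(key in tolerations for key in taint_keys)
    )


# Taint keys applied in step 2 of this example.
node_taints = {
    "cn-huhehaote.172.16.0.208": {"gpu_node"},
    "cn-huhehaote.172.16.0.209": {"gpu_node"},
    "cn-huhehaote.172.16.0.210": {"ssd_node"},
}

print(schedulable_nodes(node_taints, {"gpu_node"}))
print(schedulable_nodes(node_taints, {"all"}))
```

With `{"gpu_node"}` only `.208` and `.209` qualify, which is consistent with both instances of `pytorch-toleration` running on `.209`; with `{"all"}` the `ssd_node`-tainted `.210` becomes eligible as well.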


@ -1,49 +0,0 @@
## Assign configuration files for pytorch jobs
You can pass configuration files to containers when submitting jobs.
1. Prepare the configuration file on the machine from which you submit the job.
```shell
# prepare your config-file
➜ cat /tmp/test-config.json
{
    "key": "job-config"
}
```
2. Submit the job, and specify the configuration file to mount by `--config-file`.
```shell
# arena submit job by --config-file ${host-config-file}:${container-config-file}
# This parameter can be used multiple times to mount multiple configuration files
➜ arena --loglevel info submit pytorch \
--name=pytorch-config-file \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--config-file /tmp/test-config.json:/etc/config/config.json \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo"
configmap/pytorch-config-file-pytorchjob created
configmap/pytorch-config-file-pytorchjob labeled
configmap/pytorch-config-file-a9cbad1b8719778 created
pytorchjob.kubeflow.org/pytorch-config-file created
INFO[0000] The Job pytorch-config-file has been submitted successfully
INFO[0000] You can run `arena get pytorch-config-file --type pytorchjob` to check the job status
```
3. Get the details of this job.
```shell
➜ arena get pytorch-config-file --type pytorchjob
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 51s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-config-file RUNNING PYTORCHJOB 51s pytorch-config-file-master-0 172.16.0.210
```
4. Use kubectl to check whether the file is in the container:
```
➜ kubectl exec -ti pytorch-config-file-master-0 -- cat /etc/config/config.json
{
    "key": "job-config"
}
```
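Inside the container, the training script can read the mounted file like any other JSON file. The sketch below is illustrative only: `parse_config_file_spec` and `load_job_config` are hypothetical helper names, and the sample `mnist.py` used in this example is not known to read this file.

```python
import json


def parse_config_file_spec(spec):
    """Split a `host_path:container_path` value as passed to --config-file."""
    host_path, sep, container_path = spec.partition(":")
    if not sep or not host_path or not container_path:
        raise ValueError(f"expected host_path:container_path, got {spec!r}")
    return host_path, container_path


def load_job_config(path):
    """Load the JSON configuration file mounted into the container."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


host_path, container_path = parse_config_file_spec(
    "/tmp/test-config.json:/etc/config/config.json"
)
print(host_path, container_path)
```

In the running container, `load_job_config("/etc/config/config.json")` would return the dictionary with the `key` entry shown above.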


@ -1,130 +0,0 @@
## Arena supports Priority and Preemption for pytorch jobs
1. Create the `PriorityClass` objects with the yaml below. Two priorities are defined here: `critical` and `medium`.
```shell
# Declarations of critical and medium
➜ cat priorityClass.yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
# Create two priority objects: critical and medium
➜ kubectl create -f priorityClass.yaml
priorityclass.scheduling.k8s.io/critical created
priorityclass.scheduling.k8s.io/medium created
```
2. Check the available resources. There are 3 nodes in total, and each node has 4 GPU cards.
```shell
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
3. Submit a GPU job with `medium` priority that uses 3 nodes with 4 cards each, which occupies all of the GPU resources. To make the effect easier to observe, increase the number of training epochs so that the job runs longer.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-medium \
--gpus=4 \
--workers=3 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=medium \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200"
configmap/pytorch-priority-medium-pytorchjob created
configmap/pytorch-priority-medium-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-medium created
INFO[0000] The Job pytorch-priority-medium has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status
```
4. Get the details of this job. You can see that the task is running.
```shell
➜ arena get pytorch-priority-medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-master-0 172.16.0.208
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-0 172.16.0.210
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-1 172.16.0.209
```
5. Check the GPU card usage. All cards are occupied.
```shell
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 4
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 4
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 4
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
12/12 (100%)
```
6. Submit a job with priority of `critical` to initiate preemption.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-critical \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=critical \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50"
configmap/pytorch-priority-critical-pytorchjob created
configmap/pytorch-priority-critical-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-critical created
INFO[0000] The Job pytorch-priority-critical has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-priority-critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 22s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-critical RUNNING PYTORCHJOB 22s pytorch-priority-critical-master-0 172.16.0.208
```
8. Check the job status of `medium` priority. It has become `FAILED`. One instance has been deleted due to preemption.
```shell
➜ arena get pytorch-priority-medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-master-0 172.16.0.210
pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-worker-0 172.16.0.209
```
9. Check the events of `pytorch-priority-medium`. You can see that its instance `pytorch-priority-medium-worker-1` has been evicted. The eviction happened because `pytorch-priority-critical-master-0` requested a GPU on the same node and the node had no free GPU, so the lower-priority job was preempted by the higher-priority job.
```shell
➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1
```
![](24-pytorchjob-preempted.png)

@ -1,40 +0,0 @@
## Specify the pod clean-up policy for finished PyTorch jobs
1. Submit a job with `--clean-task-policy` set to `All`. After the job finishes (`SUCCEEDED` or `FAILED`), all of its instances (pods) are deleted. The default is `None`, which retains all pods.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-clean-policy \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--clean-task-policy=All \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-clean-policy-pytorchjob created
configmap/pytorch-clean-policy-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-clean-policy created
INFO[0000] The Job pytorch-clean-policy has been submitted successfully
INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status
```
2. Get the job details. After the job finishes, the instance `pytorch-clean-policy-master-0` is deleted.
```shell
# RUNNING
➜ arena get pytorch-clean-policy
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 18s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-clean-policy RUNNING PYTORCHJOB 18s pytorch-clean-policy-master-0 172.16.0.209
# FINISHED
➜ arena get pytorch-clean-policy
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 37s
NAME STATUS TRAINER AGE INSTANCE NODE
```
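A scripted check of the clean-up result could count the instance rows that remain in the `arena get` output. The sketch below assumes the output ends with the instance table, as in the listings above; `instance_count` is a hypothetical helper name.

```shell
# instance_count: read `arena get <job>` output on stdin and print the number
# of non-empty rows that follow the "NAME STATUS TRAINER ..." header line.
instance_count() {
  awk 'seen && NF > 0 { n++ }
       $1 == "NAME" && $2 == "STATUS" { seen = 1 }
       END { print n + 0 }'
}

# The finished job above lists no instances after the header.
instance_count <<'EOF'
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 37s
NAME STATUS TRAINER AGE INSTANCE NODE
EOF
```

Any trailing lines after the table (for example a TensorBoard address) would be counted as rows, so the helper only suits output shaped like the example.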

@ -1,168 +0,0 @@
# Submit the training jobs with ImagePullSecrets
You can use images from a private registry when submitting jobs (including TensorBoard images).
Assume the following images are in your private registry.
```shell
# pytorch
registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime
# tf
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu
# mpi
registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
# tensorboard (--tensorboard-image)
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel
```
## Contents
* <a href="#create_secret">Create ImagePullSecrets</a>
* <a href="#tfjob">TFJob With Secret</a>
* <a href="#mpijob">MPIJob With Secret</a>
* <a href="#pytorchjob">PyTorchJob With Secret</a>
* <a href="#arenaConfig">Load imagePullSecrets from configuration of Arena</a>
## <a name="create_secret">Create ImagePullSecrets</a>
* Create a [Secret](https://kubernetes.io/docs/concepts/configuration/secret/) with kubectl. In this case, it's [imagePullSecrets](https://kubernetes.io/docs/concepts/containers/images/).
```shell script
kubectl create secret docker-registry [$Reg_Secret] --docker-server=[$Registry] --docker-username=[$Username] --docker-password=[$Password] --docker-email=[$Email]
```
> Note:
> * `[$Reg_Secret]` is the name of the secret; you can choose any name.
> * `[$Registry]` is the address of your private registry.
> * `[$Username]` is your user name for the private registry.
> * `[$Password]` is your password for the private registry.
> * `[$Email]` is your email address (optional).

For example:
```shell
kubectl create secret docker-registry \
lumo-secret \
--docker-server=registry.cn-huhehaote.aliyuncs.com \
--docker-username=******@test.aliyunid.com \
--docker-password=******
secret/lumo-secret created
```
You can check that the secret was created.
```shell
# kubectl get secrets | grep lumo-secret
lumo-secret kubernetes.io/dockerconfigjson 1 52s
```
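For orientation, a `docker-registry` secret stores a small JSON document under the `.dockerconfigjson` key, in which the `auth` field is the Base64 encoding of `username:password`. The sketch below builds an equivalent document locally; `dockerconfig_json` is a hypothetical helper, the credentials are placeholders, and the optional `email` field is omitted.

```shell
# dockerconfig_json SERVER USER PASSWORD
# Print a Docker config JSON document similar to the one stored in the secret.
# Assumes the values are short and contain no characters that need JSON escaping.
dockerconfig_json() {
  auth=$(printf '%s:%s' "$2" "$3" | base64)
  printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}\n' \
    "$1" "$2" "$3" "$auth"
}

# Placeholder values, not real credentials.
dockerconfig_json registry.example.com user pass
```

To inspect the real secret, `kubectl get secret lumo-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode` prints the stored document.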
## <a name="tfjob">TFJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit tf job.
```shell
arena submit tf \
--name=tf-git-with-secret \
--working-dir=/root \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--logdir=/mnist_data/tf_data/logs \
--image-pull-secrets=lumo-secret \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --log_dir /mnist_data/tf_data/logs --data_dir /mnist_data/tf_data/"
```
> Note:
> To use several `imagePullSecrets`, pass `--image-pull-secrets` multiple times.
```shell
arena submit tf \
--name=tf-git-with-secret \
... \
--image-pull-secrets=lumo-secret \
--image-pull-secrets=king-secret \
--image-pull-secrets=test-secret
...
```
2. Get the details of the job.
```shell
# arena get tf-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 17s
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git-with-secret RUNNING TFJOB 17s tf-git-with-secret-chief-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30080
```
## <a name="mpijob">MPIJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit mpi job.
```shell
arena submit mpi \
--name=mpi-dist-with-secret \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--sync-mode=git \
--sync-source=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--image-pull-secrets=lumo-secret \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Get the details of the job.
```shell
# arena get mpi-dist-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 9m
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-launcher-v8sgt 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-0 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-1 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30450
```
## <a name="pytorchjob">PyTorchJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit pytorch job.
```shell
arena submit pytorch \
--name=pytorch-git-with-secret \
--gpus=1 \
--working-dir=/root \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--logdir=/mnist_data/pytorch_data/logs \
--image-pull-secrets=lumo-secret \
"python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend nccl --dir /mnist_data/pytorch_data/logs --data /mnist_data/pytorch_data/"
```
2. Get the details of the job.
```shell
# arena get pytorch-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-git-with-secret RUNNING PYTORCHJOB 2m pytorch-git-with-secret-master-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:31155
```
## <a name="arenaConfig">Load imagePullSecrets from configuration of Arena</a>
If you don't want to pass `--image-pull-secrets` on every submission, you can set the secrets in the Arena configuration instead.
Open the file `~/.arena/config` (create it if it does not exist) and add the following line.
```shell
imagePullSecrets=lumo-secret,king-secret
```
> Note:
> `--image-pull-secrets` overrides the `imagePullSecrets` value in `~/.arena/config`.
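
To see which secrets a given configuration file would supply, the comma-separated value can be split into one name per line. This sketch assumes the `key=value` layout shown above; `config_pull_secrets` is a hypothetical helper, and the example reads a temporary file instead of `~/.arena/config`.

```shell
# config_pull_secrets FILE
# Print each secret named on the imagePullSecrets line of FILE, one per line.
config_pull_secrets() {
  sed -n 's/^[[:space:]]*imagePullSecrets[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}

# Demonstrate on a throwaway copy of the configuration shown above.
cfg=$(mktemp)
echo 'imagePullSecrets=lumo-secret,king-secret' > "$cfg"
config_pull_secrets "$cfg"
rm -f "$cfg"
```

Each printed name can then be checked with `kubectl get secret <name>` before submitting a job.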

@ -1,62 +0,0 @@
This guide walks through the steps to deploy and serve a custom model with KFServing.
1. Setup
Follow the KFServing [guide](https://github.com/kubeflow/kfserving#install-kfserving) to install KFServing. As a prerequisite, make sure that 8 GB of memory and 4 CPU cores are available in your environment.
2. Submit your serving job to KFServing.
```shell script
arena serve kfserving --name=max-object-detector --port=5000 --image=codait/max-object-detector --model-type=custom
configmap/max-object-detector-202008221942-kfserving created
configmap/max-object-detector-202008221942-kfserving labeled
inferenceservice.serving.kubeflow.org/max-object-detector-202008221942 created
```
3. List the serving job you just submitted.
```shell script
arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
max-object-detector KFSERVING 202008221942 1 1 10.97.52.65 http:80
```
4. Test the model service.
##### Determine the ingress IP and ports
The first step is to [determine the ingress IP](https://github.com/kubeflow/kfserving/blob/master/README.md#determine-the-ingress-ip-and-ports) and ports and set `INGRESS_HOST` and `INGRESS_PORT`.
This example uses the [codait/max-object-detector](https://github.com/IBM/MAX-Object-Detector) image. The MAX Object Detector API server expects a POST request to the `/model/predict` endpoint that includes an image as multipart/form-data and an optional `threshold` query string.
```shell script
MODEL_NAME=max-object-detector-202008221942
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
INGRESS_HOST=localhost
INGRESS_PORT=80
curl -v -F "image=@27-kfserving-custom.jpg" http://${INGRESS_HOST}:${INGRESS_PORT}/model/predict -H "Host: ${SERVICE_HOSTNAME}"
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 80 (#0)
> POST /model/predict HTTP/1.1
> Host: max-object-detector-202008221942.default.example.com
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 125769
> Content-Type: multipart/form-data; boundary=------------------------56b67bc60fc7bdc7
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 380
< content-type: application/json
< date: Sun, 23 Aug 2020 03:27:14 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 3566
<
{"status": "ok", "predictions": [{"label_id": "1", "label": "person", "probability": 0.9440352320671082, "detection_box": [0.12420991063117981, 0.12507185339927673, 0.8423266410827637, 0.5974075794219971]}, {"label_id": "18", "label": "dog", "probability": 0.8645510673522949, "detection_box": [0.10447663068771362, 0.17799144983291626, 0.8422801494598389, 0.7320016026496887]}]}
* Connection #0 to host localhost left intact
* Closing connection 0
```
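The `cut -d "/" -f 3` step above strips the scheme from the service URL so that only the host name remains for the `Host` header. A minimal illustration, assuming the URL matches the `Host` value in the trace above (`url_host` is a hypothetical helper):

```shell
# url_host URL: print the host part of a URL such as http://host/path.
# Splitting on "/" gives "http:", an empty field, then the host, so field 3 is the host.
url_host() {
  printf '%s\n' "$1" | cut -d "/" -f 3
}

url_host "http://max-object-detector-202008221942.default.example.com"
```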
5. Delete the serving job.
```shell script
arena serve delete max-object-detector --version=202008221942
inferenceservice.serving.kubeflow.org "max-object-detector-202008221942" deleted
configmap "max-object-detector-202008221942-kfserving" deleted
INFO[0001] The Serving job max-object-detector with version 202008221942 has been deleted successfully
```

@ -1,175 +0,0 @@
This guide walks through the steps to submit an elastic training job with Horovod.
1. Build the image for the training environment.
You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
Alternatively, you can build your own image by following [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).
2. Submit an elastic training job. The example code is from [tensorflow2_mnist_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/tensorflow2_mnist_elastic.py).
```shell script
arena submit etjob \
--name=elastic-training \
--gpus=1 \
--workers=3 \
--max-workers=9 \
--min-workers=1 \
--image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
--working-dir=/examples \
"horovodrun
-np \$((\${workers}*\${gpus}))
--min-np \$((\${minWorkers}*\${gpus}))
--max-np \$((\${maxWorkers}*\${gpus}))
--host-discovery-script /usr/local/bin/discover_hosts.sh
python /examples/elastic/tensorflow2_mnist_elastic.py
"
```
Output:
```
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
```
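The `horovodrun` process counts in the submitted command are products of a worker count and the GPUs per worker. The arithmetic can be checked locally; this sketch assumes that `workers`, `minWorkers`, `maxWorkers`, and `gpus` take the values passed to `arena submit` above, and `np_flags` is a hypothetical helper.

```shell
# np_flags WORKERS MIN_WORKERS MAX_WORKERS GPUS
# Print the process-count flags that the expressions in the job command evaluate to.
np_flags() {
  echo "-np $(($1 * $4)) --min-np $(($2 * $4)) --max-np $(($3 * $4))"
}

np_flags 3 1 9 1   # values from the submit command above
```

Inside double quotes, `\$` yields a literal `$`, so the local shell passes the expressions through unexpanded; they are presumably evaluated later inside the job.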
3. List your job.
```shell script
arena list
```
Output:
```
NAME STATUS TRAINER AGE NODE
elastic-training RUNNING ETJOB 52s 192.168.0.116
```
4. Get your job details.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 1m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 1m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-2 192.168.0.116
```
5. Check logs
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2170 Loss: 0.021992
Tue Sep 8 08:32:50 2020[0]<stdout>:Step #2180 Loss: 0.000902
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2180 Loss: 0.023190
Tue Sep 8 08:32:50 2020[2]<stdout>:Step #2180 Loss: 0.013149
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2190 Loss: 0.029536
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2190 Loss: 0.017537
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2190 Loss: 0.018273
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2200 Loss: 0.038399
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2200 Loss: 0.007017
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2200 Loss: 0.017495
```
6. Scale out your job. This adds one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599548177-scaleout created
configmap/elastic-training-1599548177-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully
```
7. Get your job details. The new worker (`elastic-training-worker-3`) is now `RUNNING`.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 2m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 2m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-2 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-3 192.168.0.117
```
8. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3140 Loss: 0.014412
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3140 Loss: 0.004425
Tue Sep 8 08:33:33 2020[3]<stdout>:Step #3150 Loss: 0.000513
Tue Sep 8 08:33:33 2020[2]<stdout>:Step #3150 Loss: 0.062282
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3150 Loss: 0.020650
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3150 Loss: 0.008056
Tue Sep 8 08:33:34 2020[3]<stdout>:Step #3160 Loss: 0.002170
Tue Sep 8 08:33:34 2020[2]<stdout>:Step #3160 Loss: 0.009676
Tue Sep 8 08:33:34 2020[1]<stdout>:Step #3160 Loss: 0.051425
Tue Sep 8 08:33:34 2020[0]<stdout>:Step #3160 Loss: 0.023769
```
9. Scale in your job. This removes one worker from the job.
```shell script
arena scalein etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599554041-scalein created
configmap/elastic-training-1599554041-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1599554041 created
INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully
```
10. Get your job details. We can see that `elastic-training-worker-3` has been removed.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 3m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 3m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 3m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 3m elastic-training-worker-2 192.168.0.116
```
11. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5210 Loss: 0.005627
Tue Sep 8 08:34:43 2020[2]<stdout>:Step #5220 Loss: 0.002142
Tue Sep 8 08:34:43 2020[1]<stdout>:Step #5220 Loss: 0.002978
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5220 Loss: 0.011404
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5230 Loss: 0.000689
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5230 Loss: 0.024597
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5230 Loss: 0.040936
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5240 Loss: 0.000125
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5240 Loss: 0.026498
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5240 Loss: 0.000308
```

@ -1,182 +0,0 @@
This guide walks through the steps to submit an elastic training job with Horovod.
1. Build the image for the training environment.
You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
Alternatively, you can build your own image by following [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).
2. Submit an elastic training job. The example code is from [pytorch_synthetic_benchmark_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch_synthetic_benchmark_elastic.py).
```shell script
arena submit etjob \
--name=elastic-training-synthetic \
--gpus=1 \
--workers=3 \
--max-workers=9 \
--min-workers=1 \
--image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
--working-dir=/examples \
"horovodrun
--verbose
--log-level=DEBUG
-np \$((\${workers}*\${gpus}))
--min-np \$((\${minWorkers}*\${gpus}))
--max-np \$((\${maxWorkers}*\${gpus}))
--start-timeout 100
--elastic-timeout 1000
--host-discovery-script /usr/local/bin/discover_hosts.sh
python /examples/elastic/pytorch_synthetic_benchmark_elastic.py
--num-iters=10000
--num-warmup-batches=0"
```
Output:
```
configmap/elastic-training-synthetic-etjob created
configmap/elastic-training-synthetic-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training-synthetic created
INFO[0000] The Job elastic-training-synthetic has been submitted successfully
INFO[0000] You can run `arena get elastic-training-synthetic --type etjob` to check the job status
```
3. List your job.
```shell script
arena list
```
Output:
```
NAME STATUS TRAINER AGE NODE
elastic-training-synthetic RUNNING ETJOB 2m 192.168.0.112
```
4. Get your job details.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-2 192.168.0.116
```
5. Check logs
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:24:20 2020[0]<stdout>:Iter #54: 95.3 img/sec per GPU
Tue Sep 8 09:24:23 2020[0]<stdout>:Iter #55: 95.3 img/sec per GPU
Tue Sep 8 09:24:27 2020[0]<stdout>:Iter #56: 94.6 img/sec per GPU
Tue Sep 8 09:24:30 2020[0]<stdout>:Iter #57: 97.1 img/sec per GPU
Tue Sep 8 09:24:33 2020[0]<stdout>:Iter #58: 99.7 img/sec per GPU
Tue Sep 8 09:24:36 2020[0]<stdout>:Iter #59: 99.8 img/sec per GPU
Tue Sep 8 09:24:40 2020[0]<stdout>:Iter #60: 98.0 img/sec per GPU
Tue Sep 8 09:24:43 2020[0]<stdout>:Iter #61: 97.1 img/sec per GPU
Tue Sep 8 09:24:46 2020[0]<stdout>:Iter #62: 96.1 img/sec per GPU
Tue Sep 8 09:24:50 2020[0]<stdout>:Iter #63: 100.4 img/sec per GPU
```
6. Scale out your job. This adds one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557124-scaleout created
configmap/elastic-training-synthetic-1599557124-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-synthetic-1599557124 created
INFO[0000] The scaleout job elastic-training-synthetic-1599557124 has been submitted successfully
```
7. Get your job details. The new worker (`elastic-training-synthetic-worker-3`) is now `RUNNING`.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 5m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-2 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-3 192.168.0.112
```
8. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:26:03 2020[0]<stdout>:Iter #76: 65.0 img/sec per GPU
Tue Sep 8 09:26:08 2020[0]<stdout>:Iter #77: 64.0 img/sec per GPU
Tue Sep 8 09:26:13 2020[0]<stdout>:Iter #78: 65.4 img/sec per GPU
Tue Sep 8 09:26:18 2020[0]<stdout>:Iter #79: 64.4 img/sec per GPU
Tue Sep 8 09:26:23 2020[0]<stdout>:Iter #80: 62.9 img/sec per GPU
Tue Sep 8 09:26:28 2020[0]<stdout>:Iter #81: 64.0 img/sec per GPU
Tue Sep 8 09:26:33 2020[0]<stdout>:Iter #82: 64.4 img/sec per GPU
Tue Sep 8 09:26:38 2020[0]<stdout>:Iter #83: 64.9 img/sec per GPU
Tue Sep 8 09:26:43 2020[0]<stdout>:Iter #84: 62.7 img/sec per GPU
Tue Sep 8 09:26:48 2020[0]<stdout>:Iter #85: 64.2 img/sec per GPU
```
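The log tails show roughly 95 to 100 img/sec per GPU before the scale-out and about 64 afterwards. To compare such samples numerically, the `img/sec` values can be averaged. A minimal sketch, assuming the log line format shown above; `avg_img_per_sec` is a hypothetical helper.

```shell
# avg_img_per_sec: read benchmark log lines on stdin and print the mean of the
# numbers that precede "img/sec", formatted with one decimal place.
avg_img_per_sec() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "img/sec") { sum += $(i-1); n++ } }
       END { if (n) printf "%.1f\n", sum / n }'
}

# Two lines copied from the output above.
avg_img_per_sec <<'EOF'
Tue Sep 8 09:26:03 2020[0]<stdout>:Iter #76: 65.0 img/sec per GPU
Tue Sep 8 09:26:08 2020[0]<stdout>:Iter #77: 64.0 img/sec per GPU
EOF
```

On a live job the input would come from `arena logs elastic-training-synthetic --tail 10`. Lines without an `img/sec` field are ignored, and nothing is printed if no sample is found.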
9. Scale in your job. This removes one worker from the job.
```shell script
arena scalein etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557271-scalein created
configmap/elastic-training-synthetic-1599557271-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-synthetic-1599557271 created
INFO[0000] The scalein job elastic-training-synthetic-1599557271 has been submitted successfully
```
10. Get your job details. We can see that `elastic-training-synthetic-worker-3` has been removed.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 7m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-2 192.168.0.116
```
11. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
DEBUG:root:host elastic-training-synthetic-worker-3 has been blacklisted, ignoring exit from local_rank=0
Process 3 exit with status code 134.
Tue Sep 8 09:27:56 2020[0]<stdout>:Iter #97: 96.0 img/sec per GPU
Tue Sep 8 09:28:00 2020[0]<stdout>:Iter #98: 95.4 img/sec per GPU
Tue Sep 8 09:28:03 2020[0]<stdout>:Iter #99: 96.9 img/sec per GPU
Tue Sep 8 09:28:06 2020[0]<stdout>:Iter #100: 97.2 img/sec per GPU
Tue Sep 8 09:28:10 2020[0]<stdout>:Iter #101: 98.5 img/sec per GPU
Tue Sep 8 09:28:13 2020[0]<stdout>:Iter #102: 95.8 img/sec per GPU
Tue Sep 8 09:28:16 2020[0]<stdout>:Iter #103: 97.3 img/sec per GPU
Tue Sep 8 09:28:20 2020[0]<stdout>:Iter #104: 97.3 img/sec per GPU
Tue Sep 8 09:28:23 2020[0]<stdout>:Iter #105: 98.9 img/sec per GPU
```

@ -1,72 +0,0 @@
Arena supports and simplifies distributed TensorFlow Training (PS/worker mode).
1. To run distributed TensorFlow training, you need to specify:
- GPUs of each worker (only for GPU workload)
- The number of workers (required)
- The number of PS (required)
- The docker image of worker (required)
- The docker image of PS (required)
- The Port of Worker (default is 22222)
- The Port of PS (default is 22223)
The following command is an example. It defines 2 workers and 1 PS, and each worker has 1 GPU. The source code for the workers and the PS is pulled from git, and TensorBoard is enabled.
```
# arena submit tf \
--name=tf-dist-git \
--gpus=1 \
--workers=2 \
--worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--ps-image=tensorflow/tensorflow:1.5.0-devel \
--tensorboard \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"
configmap/tf-dist-git-tfjob created
configmap/tf-dist-git-tfjob labeled
service/tf-dist-git-tensorboard created
deployment.extensions/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0001] The Job tf-dist-git has been submitted successfully
INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
```
**Note**: If the job or a pod fails and the logs show that the git repository could not be cloned, the likely cause is cross-border network connectivity, especially if the containers run in certain countries such as China. This is not caused by arena.
2\. Get the details of the specific job
```
# arena get tf-dist-git
NAME STATUS TRAINER AGE INSTANCE NODE
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-594d59789c-lrfsk 192.168.1.119
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-ps-0 192.168.1.118
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-0 192.168.1.119
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-1 192.168.1.120
Your tensorboard will be available on:
192.168.1.117:32298
```
3\. Check the tensorboard
![](3-tensorboard.jpg)
4\. Get the TFJob dashboard
```
# arena logviewer tf-dist-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-dist-git-tfjob
```
![](4-tfjob-logviewer-distributed.jpg)
Congratulations! You've run the distributed training job with `arena` successfully.

@ -1,78 +0,0 @@
A distributed TensorFlow job has several roles: Worker, PS, Chief, and Evaluator. Sometimes you need to control the order in which they are created; for example, you may need to create the Worker role first and the PS role second. This guide shows how.
1. Assume that you want to submit a distributed TensorFlow job with four roles (Worker, PS, Chief, Evaluator) and that the roles must start in the order "Worker,Chief,PS,Evaluator". To do so, add the `--role-sequence` option when submitting the job, as in the following command:
```
$ arena submit tfjob \
--name=tf-distributed-test \
--role-sequence "Worker,Chief,PS,Evaluator" \
--chief \
--evaluator \
--gpus=1 \
--workers=1 \
--worker-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--tensorboard-image="registry.cn-hongkong.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel" \
"python /app/main.py"
```
The option `--role-sequence Worker,Chief,PS,Evaluator` is the same as `--role-sequence w,c,p,e`, where "w" represents "Worker", "c" represents "Chief", "p" represents "PS", and "e" represents "Evaluator".
2. Make sure at least one pod belonging to the tfjob "tf-distributed-test" has the annotation "job-role-sequence=Worker,Chief,PS,Evaluator":
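The short and long forms can be related with a small lookup. The sketch below only illustrates the mapping described here and is not how Arena itself parses the option; `expand_role_sequence` is a hypothetical helper.

```shell
# expand_role_sequence SEQ
# Expand single-letter role codes (w, c, p, e) in a comma-separated sequence
# to their full names; any other entry is passed through unchanged.
expand_role_sequence() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r role; do
    case "$role" in
      w) echo Worker ;;
      c) echo Chief ;;
      p) echo PS ;;
      e) echo Evaluator ;;
      *) echo "$role" ;;
    esac
  done | paste -s -d , -
}

expand_role_sequence w,c,p,e
```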
```
$ kubectl get po -l tf-job-name=tf-distributed-test
NAME READY STATUS RESTARTS AGE
tf-distributed-test-chief-0 0/1 ContainerCreating 0 5m47s
tf-distributed-test-evaluator-0 0/1 ContainerCreating 0 5m47s
tf-distributed-test-ps-0 1/1 Running 0 5m47s
tf-distributed-test-worker-0 0/1 ContainerCreating 0 5m47s
$ kubectl get po tf-distributed-test-worker-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    job-role-sequence: Worker,Chief,PS,Evaluator
    kubernetes.io/psp: ack.privileged
    requestGPUsOfJobOwner: "3"
  creationTimestamp: 2021-02-22T03:07:49Z
....
```
3. You can validate it by querying the tf-operator logs.
```
$ kubectl get po -n arena-system
NAME READY STATUS RESTARTS AGE
et-operator-576887864c-lvmrs 1/1 Running 1 19d
mpi-operator-66b4cf9b76-kl2fm 1/1 Running 0 26d
pytorch-operator-8545c46f98-cffgw 1/1 Running 4 26d
tf-job-dashboard-78478bfc45-msbzn 1/1 Running 0 19d
tf-job-operator-554d594cff-5vxfg 1/1 Running 0 101m
```
Query the logs of tf-job-operator-554d594cff-5vxfg.
```
$ kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system | grep "the Role Sequence" | tail -n 1
{"filename":"tensorflow/controller.go:453","job":"default.tf-distributed-test","level":"info","msg":"the Role Sequence of job tf-distributed-test is: [Worker Chief PS Evaluator]","time":"2021-02-01T13:22:23Z","uid":"7db02629-4591-4e0c-a938-c6e4a1cfc074"}
```
As you can see, the order in which tf-operator handles the tfjob roles matches the sequence you specified.
If you don't want to specify the role sequence every time you submit a tfjob, you can save it in the arena configuration file `~/.arena/config`, for example:
```
tfjob_role_sequence = Worker,PS,Chief,Evaluator
```
or
```
tfjob_role_sequence = w,p,c,e
```

@ -1,128 +0,0 @@
## Support Multiple Users
In some scenarios, you may want multiple users to use arena, with each user holding different permissions on the Kubernetes cluster. This guide shows how to set that up.
Assume that three users will use arena and that their privileges are as described in the following table:
| User Name | User Namespace | Quota | Additional Privileges |
| --------- | -------------- | ----- |---------------------- |
| alex | workplace1 | - |-|
| bob | workplace2 |limits.cpu: "10",limits.memory: "20Gi",requests.cpu: "5",requests.memory: "10Gi" |list the jobs in the cluster scope|
| tom | workplace3 |requests.nvidia.com/gpu: 20|list the jobs in the namespace scope|
The following steps describe how to generate the kubeconfig files for these users.
1. Prepare the user configuration file. You can use `~/charts/user/values.yaml` or `/charts/user/values.yaml` as a reference when writing your own.
The user alex doesn't need a user configuration file, because alex uses the default configuration.
The user bob's configuration file is defined as:
```
quota:
  limits.cpu: "10"
  requests.cpu: "5"
  requests.memory: "10Gi"
  limits.memory: "20Gi"
clusterRoles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
Save it to /tmp/bob.yaml.
The user tom's user configuration file is defined as:
```
quota:
  requests.nvidia.com/gpu: 5
roles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
```
Save it to /tmp/tom.yaml.
2. Generate the user kubeconfig files. The script `arena-gen-kubeconfig.sh` can help:
```
$ arena-gen-kubeconfig.sh -h
Usage:
arena-gen-kubeconfig.sh [OPTION1] [OPTION2] ...
Options:
--user-name <USER_NAME> Specify the user name
--user-namespace <USER_NAMESPACE> Specify the user namespace
--user-config <USER_CONFIG> Specify the user config,refer the ~/charts/user/values.yaml or /charts/user/values.yaml
--force If the user has been existed,force to update the user
--delete Delete the user
--output <KUBECONFIG|USER_MANIFEST_YAML> Specify the output kubeconfig file or the user manifest yaml
--admin-kubeconfig <ADMIN_KUBECONFIG> Specify the Admin kubeconfig file
--cluster-url <CLUSTER_URL> Specify the Cluster URL,if not specified,the script will detect the cluster url
--create-user-yaml Only generate the user manifest yaml,don't apply it and create kubeconfig file
```
First, create the kubeconfig file for alex:
```
$ arena-gen-kubeconfig.sh --user-name alex --user-namespace workplace1 --output /tmp/alex.kubeconfig --force
2021-02-08/11:38:44 DEBUG found arena charts in /Users/yangjunfeng/charts
2021-02-08/11:38:44 DEBUG the user configuration not set,use the default configuration file
resourcequota/arena-quota-alex created
serviceaccount/alex created
clusterrole.rbac.authorization.k8s.io/arena:workplace1:alex configured
clusterrolebinding.rbac.authorization.k8s.io/arena:workplace1:alex configured
role.rbac.authorization.k8s.io/arena:alex created
rolebinding.rbac.authorization.k8s.io/arena:alex created
configmap/arena-user-alex created
Cluster "https://192.168.1.42:6443" set.
User "alex" set.
Context "registry" created.
Switched to context "registry".
2021-02-08/11:38:48 DEBUG kubeconfig written to file /tmp/alex.kubeconfig
```
As you can see, the kubeconfig file has been created (/tmp/alex.kubeconfig).
Second, create the kubeconfig file for user bob:
```
$ arena-gen-kubeconfig.sh --user-name bob --user-namespace workplace2 --user-config /tmp/bob.yaml --output /tmp/bob.kubeconfig --force
```
The kubeconfig file is stored at /tmp/bob.kubeconfig.
Third, create the kubeconfig file for user tom:
```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --user-config /tmp/tom.yaml --output /tmp/tom.kubeconfig --force
```
The kubeconfig file is stored at /tmp/tom.kubeconfig.
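The three invocations above follow one pattern, so they can be generated from a list of users. The sketch below only prints the commands and does not run them; the helper name `kubeconfig_cmd` is hypothetical, and the flags and file paths are the ones used in this guide.

```shell
# Hypothetical helper: print the arena-gen-kubeconfig.sh command for one user.
# Usage: kubeconfig_cmd <user> <namespace> [user_config]
kubeconfig_cmd() {
  user=$1
  namespace=$2
  user_config=${3:-}
  cmd="arena-gen-kubeconfig.sh --user-name $user --user-namespace $namespace"
  # A user on the default configuration (alex here) needs no --user-config.
  if [ -n "$user_config" ]; then
    cmd="$cmd --user-config $user_config"
  fi
  echo "$cmd --output /tmp/$user.kubeconfig --force"
}

# Print (not run) the commands for the three users in this guide.
kubeconfig_cmd alex workplace1
kubeconfig_cmd bob workplace2 /tmp/bob.yaml
kubeconfig_cmd tom workplace3 /tmp/tom.yaml
```

Printing first makes it easy to review the commands before running them with cluster admin rights.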
3. To make a kubeconfig file take effect, set the KUBECONFIG environment variable, for example:
```
$ export KUBECONFIG=/tmp/alex.kubeconfig
```
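When you switch between several users' kubeconfig files, a mistyped path silently leaves you with a configuration that does not work. A minimal guard might look like the following; the function name `use_kubeconfig` is hypothetical and is not part of arena.

```shell
# Hypothetical helper: point the current shell at a user's kubeconfig,
# refusing to do so if the file is missing or unreadable.
use_kubeconfig() {
  kubeconfig_file=$1
  if [ ! -r "$kubeconfig_file" ]; then
    echo "kubeconfig not readable: $kubeconfig_file" >&2
    return 1
  fi
  export KUBECONFIG="$kubeconfig_file"
  echo "KUBECONFIG=$KUBECONFIG"
}
# Example: use_kubeconfig /tmp/alex.kubeconfig
```

Define the function in your shell profile so that the exported variable applies to the shell where you run arena.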
4. Now you can use arena to submit your training jobs.
5. To delete a user, run a command like:
```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --delete
```


@ -1,110 +0,0 @@
`arena` can mount multiple data volumes into a training job. The following example mounts a data volume into a training job.
1. Create `/data` on the NFS server and prepare the MNIST data:
```
# mkdir -p /nfs
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
# mkdir -p /nfs/data
# cd /nfs/data
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz
# cd /
# umount /nfs
```
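Before unmounting, it may be worth confirming that all four archives were downloaded, since a failed `wget` is easy to miss. The check below is a sketch; the function name `check_mnist_files` is hypothetical, and you pass it the directory you downloaded the files into.

```shell
# Hypothetical helper: report which of the four MNIST archives are missing
# or empty in the given directory. Returns non-zero if any is missing.
check_mnist_files() {
  data_dir=$1
  missing=0
  for f in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz \
           train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz; do
    if [ ! -s "$data_dir/$f" ]; then
      echo "missing or empty: $data_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
# Example: check_mnist_files <download_directory>
```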
2\. Create the Persistent Volume. Change `NFS_SERVER_IP` to the address of your NFS server.
```
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfdata
  labels:
    tfdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"
# kubectl create -f nfs-pv.yaml
```
3\. Create the Persistent Volume Claim.
```
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfdata: nas-mnist
# kubectl create -f nfs-pvc.yaml
```
> Note: adding the `description` and `owner` annotations is recommended.
4\. Check the data volume:
```
# arena data list
NAME ACCESSMODE DESCRIPTION OWNER AGE
tfdata ReadWriteMany this is for mnist demo myteam 43d
```
5\. Now submit a distributed training job with `arena`. It downloads the source code from the Git repository and mounts the data volume `tfdata` at `/mnist_data`.
```
# arena submit tf --name=tf-dist-data \
--gpus=1 \
--workers=2 \
--workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--psImage=tensorflow/tensorflow:1.5.0-devel \
--tensorboard \
--data=tfdata:/mnist_data \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
```
> `--data` specifies the data volume to mount into all tasks of the job, in the form `<name_of_datasource>:<mount_point_on_job>`. In this example, the data volume is `tfdata` and the target directory is `/mnist_data`.
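To make the `--data` format concrete, the sketch below splits a value into its two parts with plain shell parameter expansion. It only illustrates the format and is not how arena itself parses the flag; the function name `parse_data_spec` is hypothetical.

```shell
# Hypothetical helper: split a --data value into data volume name and
# mount point, rejecting values that contain no colon.
parse_data_spec() {
  spec=$1
  case "$spec" in
    *:*) ;;
    *) echo "expected <name>:<mount_point>, got: $spec" >&2; return 1 ;;
  esac
  data_name=${spec%%:*}    # text before the first colon
  mount_point=${spec#*:}   # text after the first colon
  echo "$data_name $mount_point"
}

parse_data_spec "tfdata:/mnist_data"   # prints: tfdata /mnist_data
```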
6\. The logs show that the training data is extracted from `/mnist_data` instead of being downloaded from the internet.
```
# arena logs tf-dist-data
...
Extracting /mnist_data/train-images-idx3-ubyte.gz
Extracting /mnist_data/train-labels-idx1-ubyte.gz
Extracting /mnist_data/t10k-images-idx3-ubyte.gz
Extracting /mnist_data/t10k-labels-idx1-ubyte.gz
...
Accuracy at step 960: 0.9753
Accuracy at step 970: 0.9739
Accuracy at step 980: 0.9756
Accuracy at step 990: 0.9777
Adding run metadata for 999
```
