Compare commits

...

80 Commits

Author SHA1 Message Date
dependabot[bot] f8ee31410c
chore(deps): bump actions/setup-java from 4 to 5 (#1366)
Bumps [actions/setup-java](https://github.com/actions/setup-java) from 4 to 5.
- [Release notes](https://github.com/actions/setup-java/releases)
- [Commits](https://github.com/actions/setup-java/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-java
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-26 02:37:19 +00:00
dependabot[bot] ec5255280c
chore(deps): bump actions/checkout from 4 to 5 (#1359)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:36:12 +00:00
dependabot[bot] d1f7be63ab
chore(deps): bump actions/download-artifact from 4 to 5 (#1356)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:35:12 +00:00
dependabot[bot] a190ca253b
chore(deps): bump github.com/spf13/pflag from 1.0.6 to 1.0.7 (#1352)
---
updated-dependencies:
- dependency-name: github.com/spf13/pflag
  dependency-version: 1.0.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-23 05:34:59 +00:00
dependabot[bot] 695c2c67f0
chore(deps): bump golang.org/x/crypto from 0.39.0 to 0.40.0 (#1351)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.39.0 to 0.40.0.
- [Commits](https://github.com/golang/crypto/compare/v0.39.0...v0.40.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-22 02:37:58 +00:00
Yi Chen 75ec421d62
Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4 (#1350)
* Bump golang version from 1.23.10 to 1.24.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix go vet check

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

* Retrieve Helm version from go.mod file

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-07-11 14:56:52 +00:00
Yi Chen 25d7b1109e
Release v0.15.1 (#1344)
* Release v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-26 06:52:17 +00:00
dependabot[bot] d2d5f77a97
chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 (#1334)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.31.0 to 0.39.0.
- [Commits](https://github.com/golang/crypto/compare/v0.31.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:30:16 +00:00
dependabot[bot] c4ccb4ca7e
chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 (#1343)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.1 to 0.65.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.1...v0.65.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-version: 0.65.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:20:15 +00:00
Yi Chen aa33dc51b7
Bump golang version from 1.22.7 to 1.23.10 (#1345)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-25 06:06:16 +00:00
Yi Chen 9e84dad37a
Fix golangci-lint issues (#1341)
* Bump golangci-lint version from v1.57.2 to v2.1.6

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add golangci-lint.yaml

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix golangci-lint issues

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 17:04:14 +00:00
Yi Chen c9d5653de3
Add support for configuring tolerations (#1337)
* Add support for configuring tolerations

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add basic Helm chart unittests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add Helm chart unit tests to GitHub CI workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 13:01:13 +00:00
Yi Chen 4618e321ab
Update uninstall bash script (#1335)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:58:14 +00:00
Yi Chen ca7bf97da4
[CI] Add CI workflow for releasing Arena images (#1340)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:57:14 +00:00
Yi Chen 1c633d76ff
Remove kubernetes artifacts (#1329)
* Remove Kubernetes artifacts

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:53:14 +00:00
Yi Chen 3693f59663
Release v0.15.0 (#1332)
* Release v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 15:12:14 +00:00
Syspretor fa2fad7d6e
Feat: support separate affinity policy configuration for PS and worke… (#1331)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 12:03:14 +00:00
Syspretor 8f4a602ce6
Feat: support affinity policy for kserve and tfjob (#1319)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 11:33:15 +00:00
Leoyzen ad85546c23
Add custom device support for kserve and kserving. (#1315)
* add custom device support for kserving.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

* add custom device support for kserve.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

---------

Signed-off-by: Leoyzen <leoyzen@gmail.com>
2025-06-04 02:45:14 +00:00
Yi Chen babcb76f91
Make number of replicas of tf-operator deployment configurable (#1323)
* Make tf-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of tf-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 02:39:14 +00:00
Yi Chen ba7a09ace6
Make number of replicas of cron-operator deployment configurable (#1325)
* Make cron-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of cron-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove '--enable-leader-election=true' from args

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-03 13:16:14 +00:00
Yi Chen 545f86bfe9
Delete all services when the TFJob is terminated (#1316)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-29 12:57:19 +00:00
co63oc 568e3845f5
Fix typos in multiple files (#1310)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-13 08:56:21 +00:00
co63oc 8b84559944
Fix typos in multiple files (#1304)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-12 12:45:38 +00:00
Yi Chen ee2384b911
fix: service account should use release namespace (#1308)
* Use release namespace

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove namespace from cluster scoped resource

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-12 12:23:38 +00:00
Yi Chen 2fbb3d7ed4
feat: add new value for using localtime in cron-operator (#1296)
* feat: add new value for using localtime in cron-operator

Signed-off-by: Yi Chen <github@chenyicn.net>

* Rename localTime to useHostTimezone

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-04-03 07:31:33 +00:00
Yi Chen 19b5133e6e
refactor: use helm lib instead of helm binary (#1207)
* Delete func ListAllReleasesWithDetail

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleaseMap

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleases

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func DeleteRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add some helm util functions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func InstallRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func CheckRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GetChartVersion

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GenerateHelmTemplate

Signed-off-by: Yi Chen <github@chenyicn.net>

* Move all helm related functions into util.go

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add missed import statements and run go mod tidy

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update copyright header

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add flag --helm-binary for forward compatibility

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 09:19:27 +00:00
Yi Chen 8d413b5861
Add stale bot to mark stale issues and PRs (#1141)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 05:14:26 +00:00
dependabot[bot] 2f6e202bbf
Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 (#1290)
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.23 to 1.7.27.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.23...v1.7.27)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-21 04:58:26 +00:00
Yi Chen f3d52fa73a
Add basic e2e tests (#1225)
* Add basic e2e tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 04:02:27 +00:00
Yi Chen ece85b8ce3
fix: job status displays incorrectly (#1289)
* fix: job status displays incorrectly

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add go unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* logging job status

Signed-off-by: Yi Chen <github@chenyicn.net>

* Adjust the order of running and queuing conditions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Use constants instead of hard encoded status

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-20 09:51:27 +00:00
Yi Chen d497232013
Release v0.14.2 (#1282)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 02:26:01 +00:00
Yi Chen 9407f9b1a0
Update pytorch operator image (#1281)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 01:56:01 +00:00
co63oc d9bf195879
Fix typos (#1276)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-03-06 03:11:39 +00:00
Yi Chen 19abf194bb
Release v0.14.1 (#1275)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 03:06:45 +00:00
Yi Chen 1f9350d78c
unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled (#1273)
* unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled

Signed-off-by: Yi Chen <github@chenyicn.net>

* Group constants into one const block

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 02:34:45 +00:00
Yi Chen 23e9731b52
fix: pytorchjob does not support backoff limit (#1272)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-19 06:57:41 +00:00
Yi Chen d6b177b93d
fix: format of tensorflow standalone training docs is messed up (#1265)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:18:29 +00:00
Yi Chen 0ca2670770
fix: device value does not support k8s resource quantity (#1267)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:17:29 +00:00
dependabot[bot] 7d7f75ad2d
Bump github.com/golang/glog from 1.2.3 to 1.2.4 (#1263)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.3...v1.2.4)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 10:25:29 +00:00
DBMxrco 4b21f7299b
docs: fixed typo (#1257)
Signed-off-by: DBMxrco <marcoflet@yahoo.com>
2025-02-12 08:34:29 +00:00
Yi Chen 36a59bba67
Release v0.14.0 (#1264)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:43:28 +00:00
Yi Chen ccdbf44815
Add changelog for v0.13.1 (#1248)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:34:28 +00:00
dependabot[bot] 36b17b4175
Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 (#1254)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.16.0 to 2.16.5.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.16.0...v2.16.5)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 06:26:29 +00:00
gujing 1058d48063
rename parameter (#1262)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2025-02-12 06:02:30 +00:00
AlanFokCo ce9c5f3bff
Update the version of elastic-job-supervisor in arena-artifacts (#1247)
Signed-off-by: AlanFokCo <892249240@qq.com>
2025-01-13 09:32:08 +00:00
Yi Chen 970afbd209
Add PyTorch mnist example (#1237)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:31:16 +00:00
Yi Chen f1bb3bcdbb
feat: add linux/arm64 support for et-operator image (#1241)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:00:16 +00:00
Yi Chen b814410627
feat: add linux/arm64 support for cron-operator image (#1240)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:59:16 +00:00
Yi Chen 38218aa3a0
feat: add linux/arm64 support for mpi-operator image (#1239)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:26:16 +00:00
Yi Chen 13fa5c8dc8
feat: add linux/arm64 support for tf-operator image (#1238)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 09:03:16 +00:00
Yi Chen f098f1af85
Release v0.13.0 (#1232)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 08:33:15 +00:00
Yi Chen b0e411cab5
Update pytorch-operator image (#1234)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 07:55:15 +00:00
dependabot[bot] 5e18210479
Bump github.com/stretchr/testify from 1.9.0 to 1.10.0 (#1233)
Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.9.0 to 1.10.0.
- [Release notes](https://github.com/stretchr/testify/releases)
- [Commits](https://github.com/stretchr/testify/compare/v1.9.0...v1.10.0)

---
updated-dependencies:
- dependency-name: github.com/stretchr/testify
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 13:55:12 +00:00
Yi Chen 13df29407c
Update tfjob standalone training job doc (#1222)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:29:11 +00:00
Yi Chen 0a701eb03d
Remove archived docs (#1208)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:26:12 +00:00
Yi Chen 0482946a0c
Add changelog for v0.12.1 (#1224)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:25:12 +00:00
dependabot[bot] 0d4b513d65
Bump golang.org/x/crypto from 0.29.0 to 0.31.0 (#1231)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.29.0 to 0.31.0.
- [Commits](https://github.com/golang/crypto/compare/v0.29.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:09:13 +00:00
dependabot[bot] e8b9fcd10d
Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 (#1227)
Bumps google.golang.org/protobuf from 1.35.1 to 1.36.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:02:12 +00:00
Yi Chen 190c18e840
feat: add support for torchrun (#1228)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:32:11 +00:00
Yi Chen dc0929f32f
Avoid listing jobs and statefulsets when get pytorchjob (#1229)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:29:11 +00:00
Yi Chen 74ade74d3e
Release v0.12.1 (#1215)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:37:29 +00:00
Yi Chen 316e33c999
Update cron operator image (#1214)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:35:29 +00:00
dependabot[bot] fc47e460e1
Bump golang.org/x/crypto from 0.28.0 to 0.29.0 (#1206)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.28.0 to 0.29.0.
- [Commits](https://github.com/golang/crypto/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-18 15:06:23 +00:00
Yi Chen 1cba9b99dc
Add docs for releasing arena (#1201)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:29:23 +00:00
Yi Chen 866ec44648
Publish releases only on master branch (#1210)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:28:23 +00:00
cheyang ac164b85bf
Support MPI Job with generic devices (#1209)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-18 03:03:22 +00:00
Qianlong d61a784a13
Fix the functionality of generating kubeconfig (#1204) (#1205)
Signed-off-by: 向先 <wangqianlong.wql@alibaba-inc.com>
Co-authored-by: 向先 <wangqianlong.wql@alibaba-inc.com>
2024-11-16 15:45:21 +00:00
dependabot[bot] 74fd3f2ad3
bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 (#1202)
---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-15 09:38:20 +00:00
TzZtzt a765b1c5a0
Fix etjob rendering error when using local logging dir (#1203)
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
2024-11-13 06:17:17 +00:00
Yi Chen 0838d54757
Add go mod vendor check to integration test (#1198)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:23:16 +00:00
Yi Chen ca735b6152
Add changelog for v0.12.0 (#1199)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:11:17 +00:00
Yi Chen 969ad681a3
Update tf-operator image to fix clean pod policy issues (#1200)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 01:55:16 +00:00
dependabot[bot] 29b2d6d2c5
Bump mkdocs-material from 9.5.42 to 9.5.44 (#1190)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.42 to 9.5.44.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.42...9.5.44)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 06:07:15 +00:00
cheyang 22a3df5023
Support distributed serving with vendor update (#1194)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-11 06:06:15 +00:00
lianhui lin 68b71f9006
Feat: add support for distributed serving type (#1187)
* Feat: support distributed serving type

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix command check

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix lint problem

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

---------

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
Co-authored-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
2024-11-07 10:20:12 +00:00
dependabot[bot] 70278ce8f7
Bump github.com/prometheus/common from 0.60.0 to 0.60.1 (#1182)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.0 to 0.60.1.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.0...v0.60.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 06:43:12 +00:00
dependabot[bot] 8e008a4916
Bump github.com/golang/glog from 1.2.2 to 1.2.3 (#1189)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.2 to 1.2.3.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.2...v1.2.3)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 03:12:12 +00:00
Yi Chen 46a795e3db
Fix: unable to set cleanPodPolicy to All when submitting TFJob (#1191)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:53:12 +00:00
Yi Chen 76ca05975e
Add changelog for v0.11.0 (#1181)
* Add changelog for v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump version to v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:05:12 +00:00
7,414 changed files with 2,288,343 additions and 13,526 deletions


@@ -3,7 +3,7 @@ name: Check Release
 on:
   pull_request:
     branches:
-      - release-*
+      - master
     paths:
       - VERSION
@@ -21,7 +21,7 @@ jobs:
     steps:
       - name: Checkout source code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
         with:
           fetch-depth: 0


@@ -20,7 +20,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout source code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
       - name: Set up Go
         uses: actions/setup-go@v5
@@ -35,11 +35,19 @@ jobs:
             exit 1
           fi
+      - name: Run go mod vendor
+        run: |
+          go mod vendor
+          if ! git diff --quiet; then
+            echo "Please run 'go mod vendor' to make vendored copy of dependencies"
+            exit 1
+          fi
       - name: Run go fmt check
         run: |
           make go-fmt
           if ! git diff --quiet; then
-            echo "Please run 'make go-fmt' to run go fmt aganist code"
+            echo "Please run 'make go-fmt' to run go fmt against code"
             exit 1
           fi
@@ -47,7 +55,7 @@ jobs:
         run: |
           make go-vet
           if ! git diff --quiet; then
-            echo "Please run 'make go-vet' to run go vet aganist code"
+            echo "Please run 'make go-vet' to run go vet against code"
             exit 1
           fi
@@ -55,10 +63,14 @@ jobs:
         run: |
           make go-lint
-      - name: Run unit tests
+      - name: Run Go unit tests
         run: |
           make unit-test
+      - name: Run Helm unit tests
+        run: |
+          make helm-unittest
       - name: Build arena binary
         run: |
           make arena
@@ -67,9 +79,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout source code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
-      - uses: actions/setup-java@v4
+      - uses: actions/setup-java@v5
         with:
           distribution: zulu
           java-version: 8
@@ -83,7 +95,7 @@ jobs:
     steps:
       - name: Checkout source code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
       - uses: actions/setup-python@v5
         with:
@@ -93,3 +105,33 @@ jobs:
         run: |
           pip install -r docs/requirements.txt
           mkdocs build --strict
+  e2e-test:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+      - name: Set up Go
+        uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+      - name: Set up Kind cluster
+        uses: helm/kind-action@v1
+        with:
+          node_image: kindest/node:v1.29.10
+          config: arena-artifacts/ci/kind-config.yaml
+      - name: Install arena client
+        run: |
+          make arena-installer
+          tar -zxf arena-installer-*.tar.gz
+          arena-installer-*/install.sh --only-binary
+      - name: Run e2e tests
+        run: |
+          make e2e-test


@@ -3,10 +3,14 @@ name: Release
 on:
   push:
     branches:
-      - release-*
+      - master
     paths:
       - VERSION
+env:
+  IMAGE_REGISTRY: ghcr.io
+  IMAGE_REPOSITORY: ${{ github.repository }}
 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
   cancel-in-progress: true
@@ -26,7 +30,7 @@ jobs:
         - arm64
     steps:
       - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
      - name: Read version from VERSION file
        run: |
@@ -49,15 +53,135 @@ jobs:
           if-no-files-found: error
           overwrite: true
-  push_tag:
+  build-arena-image:
+    name: Build Arena container image
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        platform:
+          - linux/amd64
+          - linux/arm64
+    steps:
+      - name: Prepare
+        run: |
+          platform=${{ matrix.platform }}
+          echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
+      - name: Checkout source code
+        uses: actions/checkout@v5
+      - name: Read version from VERSION file
+        run: |
+          VERSION=$(cat VERSION)
+          echo "VERSION=${VERSION}" >> $GITHUB_ENV
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
+          tags: |
+            type=semver,pattern={{version}},value=${{ env.VERSION }}
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v3
+      - name: Set up Docker buildx
+        uses: docker/setup-buildx-action@v3
+      - name: Login to container registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.IMAGE_REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - name: Build and push by digest
+        id: build
+        uses: docker/build-push-action@v6
+        with:
+          platforms: ${{ matrix.platform }}
+          labels: ${{ steps.meta.outputs.labels }}
+          outputs: type=image,name=${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }},push-by-digest=true,name-canonical=true,push=true
+      - name: Export digest
+        run: |
+          mkdir -p /tmp/digests
+          digest="${{ steps.build.outputs.digest }}"
+          touch "/tmp/digests/${digest#sha256:}"
+      - name: Upload digest
+        uses: actions/upload-artifact@v4
+        with:
+          name: digests-${{ env.PLATFORM_PAIR }}
+          path: /tmp/digests/*
+          if-no-files-found: error
+          retention-days: 1
+  release-image:
     needs:
       - package-arena-installer
+      - build-arena-image
     runs-on: ubuntu-latest
     steps:
       - name: Checkout source code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
       - name: Read version from VERSION file
         run: |
           VERSION=$(cat VERSION)
           echo "VERSION=${VERSION}" >> $GITHUB_ENV
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v5
+        with:
+          images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
+          tags: |
+            type=semver,pattern={{version}},value=${{ env.VERSION }}
+      - name: Download digests
+        uses: actions/download-artifact@v5
+        with:
+          path: /tmp/digests
+          pattern: digests-*
+          merge-multiple: true
+      - name: Set up Docker buildx
+        uses: docker/setup-buildx-action@v3
+      - name: Login to container registry
+        uses: docker/login-action@v3
+        with:
+          registry: ${{ env.IMAGE_REGISTRY }}
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - name: Create manifest list and push
+        working-directory: /tmp/digests
+        run: |
+          docker buildx imagetools create $(jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON") \
+            $(printf '${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}@sha256:%s ' *)
+      - name: Inspect image
+        run: |
+          docker buildx imagetools inspect ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}:${{ steps.meta.outputs.version }}
+  push_tag:
+    needs:
+      - package-arena-installer
+      - release-image
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout source code
+        uses: actions/checkout@v5
         with:
           fetch-depth: 0
@@ -77,7 +201,7 @@ jobs:
           git tag -a ${TAG} -m "Release v${VERSION}"
           git push origin ${TAG}
-  draft_relase:
+  draft_release:
     needs:
       - push_tag
@@ -88,7 +212,7 @@
     steps:
       - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v5
      - name: Configure Git
        run: |
@@ -101,7 +225,7 @@
           echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
      - name: Download arena installer tarballs
-        uses: actions/download-artifact@v4
+        uses: actions/download-artifact@v5
       with:
         pattern: arena-installer-${{ env.VERSION }}-{linux,darwin}-{amd64,arm64}

.github/workflows/stale.yaml (new file)

@@ -0,0 +1,43 @@
+# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
+#
+# You can adjust the behavior by modifying this file.
+# For more information, see:
+# https://github.com/actions/stale
+name: Mark stale issues and pull requests
+on:
+  schedule:
+    - cron: "0 0 * * 0"
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+    steps:
+      - uses: actions/stale@v9
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          days-before-stale: 360
+          days-before-close: 180
+          stale-issue-message: >
+            This issue has been automatically marked as stale because it has not had
+            recent activity. It will be closed if no further activity occurs. Thank you
+            for your contributions.
+          close-issue-message: >
+            This issue has been automatically closed because it has not had recent
+            activity. Please comment "/reopen" to reopen it.
+          stale-issue-label: lifecycle/stale
+          exempt-issue-labels: lifecycle/frozen
+          stale-pr-message: >
+            This pull request has been automatically marked as stale because it has not had
+            recent activity. It will be closed if no further activity occurs. Thank you
+            for your contributions.
+          close-pr-message: >
+            This pull request has been automatically closed because it has not had recent
+            activity. Please comment "/reopen" to reopen it.
+          stale-pr-label: lifecycle/stale
+          exempt-pr-labels: lifecycle/frozen

.golangci.yaml (new file)

@@ -0,0 +1,76 @@
+version: "2"
+run:
+  # Timeout for total work, e.g. 30s, 5m, 5m30s.
+  # If the value is lower or equal to 0, the timeout is disabled.
+  # Default: 0 (disabled)
+  timeout: 2m
+linters:
+  # Enable specific linters.
+  # https://golangci-lint.run/usage/linters/#enabled-by-default
+  enable:
+    # Detects places where loop variables are copied.
+    - copyloopvar
+    # Checks for duplicate words in the source code.
+    - dupword
+    # Tool for detection of FIXME, TODO and other comment keywords.
+    # - godox
+    # Enforces consistent import aliases.
+    - importas
+    # Find code that shadows one of Go's predeclared identifiers.
+    - predeclared
+    # Check that struct tags are well aligned.
+    - tagalign
+    # Remove unnecessary type conversions.
+    - unconvert
+    # Checks Go code for unused constants, variables, functions and types.
+    - unused
+  # Disable specific linters.
+  disable:
+    # Errcheck is a program for checking for unchecked errors in Go code.
+    - errcheck
+  settings:
+    importas:
+      # List of aliases
+      alias:
+        - pkg: k8s.io/api/admissionregistration/v1
+          alias: admissionregistrationv1
+        - pkg: k8s.io/api/apps/v1
+          alias: appsv1
+        - pkg: k8s.io/api/batch/v1
+          alias: batchv1
+        - pkg: k8s.io/api/core/v1
+          alias: corev1
+        - pkg: k8s.io/api/extensions/v1beta1
+          alias: extensionsv1beta1
+        - pkg: k8s.io/api/networking/v1
+          alias: networkingv1
+        - pkg: k8s.io/apimachinery/pkg/apis/meta/v1
+          alias: metav1
+        - pkg: sigs.k8s.io/controller-runtime
+          alias: ctrl
+  exclusions:
+    # Which file paths to exclude: they will be analyzed, but issues from them won't be reported.
+    # "/" will be replaced by the current OS file path separator to properly work on Windows.
+    # Default: []
+    paths:
+      - pkg/operators
+issues:
+  # Maximum issues count per one linter.
+  # Set to 0 to disable.
+  # Default: 50
+  max-issues-per-linter: 50
+  # Maximum count of issues with the same text.
+  # Set to 0 to disable.
+  # Default: 3
+  max-same-issues: 10
+formatters:
+  enable:
+    # Check import statements are formatted according to the 'goimport' command.
+    - goimports


@@ -1,5 +1,177 @@
 # Changelog
+## [v0.15.1](https://github.com/kubeflow/arena/tree/v0.15.1) (2025-06-25)
+### Features
+- Add support for configuring tolerations ([#1337](https://github.com/kubeflow/arena/pull/1337) by [@ChenYi015](https://github.com/ChenYi015))
+### Misc
+- Remove kubernetes artifacts ([#1329](https://github.com/kubeflow/arena/pull/1329) by [@ChenYi015](https://github.com/ChenYi015))
+- [CI] Add CI workflow for releasing Arena images ([#1340](https://github.com/kubeflow/arena/pull/1340) by [@ChenYi015](https://github.com/ChenYi015))
+- Update uninstall bash script ([#1335](https://github.com/kubeflow/arena/pull/1335) by [@ChenYi015](https://github.com/ChenYi015))
+- Fix golangci-lint issues ([#1341](https://github.com/kubeflow/arena/pull/1341) by [@ChenYi015](https://github.com/ChenYi015))
+- Bump golang version from 1.22.7 to 1.23.10 ([#1345](https://github.com/kubeflow/arena/pull/1345) by [@ChenYi015](https://github.com/ChenYi015))
+- chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 ([#1343](https://github.com/kubeflow/arena/pull/1343) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 ([#1334](https://github.com/kubeflow/arena/pull/1334) by [@dependabot[bot]](https://github.com/apps/dependabot))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.15.0...v0.15.1)
+## [v0.15.0](https://github.com/kubeflow/arena/tree/v0.15.0) (2025-06-04)
+### Features
+- refactor: use helm lib instead of helm binary ([#1207](https://github.com/kubeflow/arena/pull/1207) by [@ChenYi015](https://github.com/ChenYi015))
+- feat: add new value for using localtime in cron-operator ([#1296](https://github.com/kubeflow/arena/pull/1296) by [@ChenYi015](https://github.com/ChenYi015))
+- Delete all services when the TFJob is terminated ([#1316](https://github.com/kubeflow/arena/pull/1316) by [@ChenYi015](https://github.com/ChenYi015))
+- Make number of replicas of cron-operator deployment configurable ([#1325](https://github.com/kubeflow/arena/pull/1325) by [@ChenYi015](https://github.com/ChenYi015))
+- Make number of replicas of tf-operator deployment configurable ([#1323](https://github.com/kubeflow/arena/pull/1323) by [@ChenYi015](https://github.com/ChenYi015))
+- Add custom device support for kserve and kserving. ([#1315](https://github.com/kubeflow/arena/pull/1315) by [@Leoyzen](https://github.com/Leoyzen))
+- Feat: support affinity policy for kserve and tfjob ([#1319](https://github.com/kubeflow/arena/pull/1319) by [@Syspretor](https://github.com/Syspretor))
+- Feat: support separate affinity policy configuration for PS and worke… ([#1331](https://github.com/kubeflow/arena/pull/1331) by [@Syspretor](https://github.com/Syspretor))
+### Bug Fixes
+- fix: job status displays incorrectly ([#1289](https://github.com/kubeflow/arena/pull/1289) by [@ChenYi015](https://github.com/ChenYi015))
+- fix: service account should use release namespace ([#1308](https://github.com/kubeflow/arena/pull/1308) by [@ChenYi015](https://github.com/ChenYi015))
+### Misc
+- Add basic e2e tests ([#1225](https://github.com/kubeflow/arena/pull/1225) by [@ChenYi015](https://github.com/ChenYi015))
+- Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 ([#1290](https://github.com/kubeflow/arena/pull/1290) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Add stale bot to mark stale issues and PRs ([#1141](https://github.com/kubeflow/arena/pull/1141) by [@ChenYi015](https://github.com/ChenYi015))
+- Fix typos in multiple files ([#1304](https://github.com/kubeflow/arena/pull/1304) by [@co63oc](https://github.com/co63oc))
+- Fix typos in multiple files ([#1310](https://github.com/kubeflow/arena/pull/1310) by [@co63oc](https://github.com/co63oc))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.2...v0.15.0)
+## [v0.14.2](https://github.com/kubeflow/arena/tree/v0.14.2) (2025-03-10)
+### Misc
+- Fix typos ([#1276](https://github.com/kubeflow/arena/pull/1276) by [@co63oc](https://github.com/co63oc))
+- Update pytorch operator image ([#1281](https://github.com/kubeflow/arena/pull/1281) by [@ChenYi015](https://github.com/ChenYi015))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.1...v0.14.2)
+## [v0.14.1](https://github.com/kubeflow/arena/tree/v0.14.1) (2025-02-24)
+### Bug Fixes
+- fix: device value does not support k8s resource quantity ([#1267](https://github.com/kubeflow/arena/pull/1267) by [@ChenYi015](https://github.com/ChenYi015))
+- fix: pytorchjob does not support backoff limit ([#1272](https://github.com/kubeflow/arena/pull/1272) by [@ChenYi015](https://github.com/ChenYi015))
+- unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled ([#1273](https://github.com/kubeflow/arena/pull/1273) by [@ChenYi015](https://github.com/ChenYi015))
+### Misc
+- docs: fixed typo ([#1257](https://github.com/kubeflow/arena/pull/1257) by [@DBMxrco](https://github.com/DBMxrco))
+- Bump github.com/golang/glog from 1.2.3 to 1.2.4 ([#1263](https://github.com/kubeflow/arena/pull/1263) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- fix: format of tensorflow standalone training docs is messed up ([#1265](https://github.com/kubeflow/arena/pull/1265) by [@ChenYi015](https://github.com/ChenYi015))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.0...v0.14.1)
+## [v0.14.0](https://github.com/kubeflow/arena/tree/v0.14.0) (2025-02-12)
+### Features
+- rename parameter ([#1262](https://github.com/kubeflow/arena/pull/1262) by [@gujingit](https://github.com/gujingit))
+### Misc
+- Add changelog for v0.13.1 ([#1248](https://github.com/kubeflow/arena/pull/1248) by [@ChenYi015](https://github.com/ChenYi015))
+- Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 ([#1254](https://github.com/kubeflow/arena/pull/1254) by [@dependabot[bot]](https://github.com/apps/dependabot))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.1...v0.14.0)
+## [v0.13.1](https://github.com/kubeflow/arena/tree/v0.13.1) (2025-01-13)
+### Misc
+- feat: add linux/arm64 support for tf-operator image ([#1238](https://github.com/kubeflow/arena/pull/1238) by [@ChenYi015](https://github.com/ChenYi015))
+- feat: add linux/arm64 support for mpi-operator image ([#1239](https://github.com/kubeflow/arena/pull/1239) by [@ChenYi015](https://github.com/ChenYi015))
+- feat: add linux/arm64 support for cron-operator image ([#1240](https://github.com/kubeflow/arena/pull/1240) by [@ChenYi015](https://github.com/ChenYi015))
+- feat: add linux/arm64 support for et-operator image ([#1241](https://github.com/kubeflow/arena/pull/1241) by [@ChenYi015](https://github.com/ChenYi015))
+- Add PyTorch mnist example ([#1237](https://github.com/kubeflow/arena/pull/1237) by [@ChenYi015](https://github.com/ChenYi015))
+- Update the version of elastic-job-supervisor in arena-artifacts ([#1247](https://github.com/kubeflow/arena/pull/1247) by [@AlanFokCo](https://github.com/AlanFokCo))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.0...v0.13.1)
+## [v0.13.0](https://github.com/kubeflow/arena/tree/v0.13.0) (2024-12-23)
+### New Features
+- feat: add support for torchrun ([#1228](https://github.com/kubeflow/arena/pull/1228) by [@ChenYi015](https://github.com/ChenYi015))
+- Update pytorch-operator image ([#1234](https://github.com/kubeflow/arena/pull/1234) by [@ChenYi015](https://github.com/ChenYi015))
+### Bug Fix
+- Avoid listing jobs and statefulsets when get pytorchjob ([#1229](https://github.com/kubeflow/arena/pull/1229) by [@ChenYi015](https://github.com/ChenYi015))
+### Misc
+- Update tfjob standalone training job doc ([#1222](https://github.com/kubeflow/arena/pull/1222) by [@ChenYi015](https://github.com/ChenYi015))
+- Remove archived docs ([#1208](https://github.com/kubeflow/arena/pull/1208) by [@ChenYi015](https://github.com/ChenYi015))
+- Add changelog for v0.12.1 ([#1224](https://github.com/kubeflow/arena/pull/1224) by [@ChenYi015](https://github.com/ChenYi015))
+- Bump golang.org/x/crypto from 0.29.0 to 0.31.0 ([#1231](https://github.com/kubeflow/arena/pull/1231) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 ([#1227](https://github.com/kubeflow/arena/pull/1227) by [@dependabot[bot]](https://github.com/apps/dependabot))
+[Full Changelog](https://github.com/kubeflow/arena/compare/v0.12.1...v0.13.0)
+## [v0.12.1](https://github.com/kubeflow/arena/tree/v0.12.1) (2024-11-25)
+### New Features
+- Support MPI Job with generic devices ([#1209](https://github.com/kubeflow/arena/pull/1209) by [@cheyang](https://github.com/cheyang))
+### Bug Fix
+- Update tf-operator image to fix clean pod policy issues ([#1200](https://github.com/kubeflow/arena/pull/1200) by [@ChenYi015](https://github.com/ChenYi015))
+- Fix etjob rendering error when using local logging dir ([#1203](https://github.com/kubeflow/arena/pull/1203) by [@TrafalgarZZZ](https://github.com/TrafalgarZZZ))
+- Fix the functionality of generating kubeconfig (#1204) ([#1205](https://github.com/kubeflow/arena/pull/1205) by [@wqlparallel](https://github.com/wqlparallel))
+- Update cron operator image ([#1214](https://github.com/kubeflow/arena/pull/1214) by [@ChenYi015](https://github.com/ChenYi015))
+### Misc
+- Add changelog for v0.12.0 ([#1199](https://github.com/kubeflow/arena/pull/1199) by [@ChenYi015](https://github.com/ChenYi015))
+- Add go mod vendor check to integration test ([#1198](https://github.com/kubeflow/arena/pull/1198) by [@ChenYi015](https://github.com/ChenYi015))
+- bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 ([#1202](https://github.com/kubeflow/arena/pull/1202) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Publish releases only on master branch ([#1210](https://github.com/kubeflow/arena/pull/1210) by [@ChenYi015](https://github.com/ChenYi015))
+- Add docs for releasing arena ([#1201](https://github.com/kubeflow/arena/pull/1201) by [@ChenYi015](https://github.com/ChenYi015))
+- Bump golang.org/x/crypto from 0.28.0 to 0.29.0 ([#1206](https://github.com/kubeflow/arena/pull/1206) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Release v0.12.1 ([#1215](https://github.com/kubeflow/arena/pull/1215) by [@ChenYi015](https://github.com/ChenYi015))
+[Full Changelog](https://github.com/kubeflow/arena/compare/29b2d6d2...v0.12.1)
+## [v0.12.0](https://github.com/kubeflow/arena/tree/v0.12.0) (2024-11-11)
+### New Features
+- Feat: add support for distributed serving type ([#1187](https://github.com/kubeflow/arena/pull/1187) by [@linnlh](https://github.com/linnlh))
+- Support distributed serving with vendor update ([#1194](https://github.com/kubeflow/arena/pull/1194) by [@cheyang](https://github.com/cheyang))
+### Misc
+- Bump github.com/golang/glog from 1.2.2 to 1.2.3 ([#1189](https://github.com/kubeflow/arena/pull/1189) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Bump github.com/prometheus/common from 0.60.0 to 0.60.1 ([#1182](https://github.com/kubeflow/arena/pull/1182) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Bump mkdocs-material from 9.5.42 to 9.5.44 ([#1190](https://github.com/kubeflow/arena/pull/1190) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Release v0.12.0 ([#1197](https://github.com/kubeflow/arena/pull/1197) by [@ChenYi015](https://github.com/ChenYi015))
+[Full Changelog](https://github.com/kubeflow/arena/compare/46a795e3...v0.12.0)
+## [v0.11.0](https://github.com/kubeflow/arena/tree/v0.11.0) (2024-10-24)
+### New Features
+- Support ray job ([#1123](https://github.com/kubeflow/arena/pull/1123) by [@qile123](https://github.com/qile123))
+### Misc
+- Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 ([#1176](https://github.com/kubeflow/arena/pull/1176) by [@dependabot[bot]](https://github.com/apps/dependabot))
+- Bump mkdocs-material from 9.5.40 to 9.5.42 ([#1179](https://github.com/kubeflow/arena/pull/1179) by [@dependabot[bot]](https://github.com/apps/dependabot))
+[Full Changelog](https://github.com/kubeflow/arena/compare/e15cb18...v0.11.0)
 ## [v0.10.1](https://github.com/kubeflow/arena/tree/v0.10.1) (2024-10-14)
 ### Bug Fixes


@@ -1,6 +1,6 @@
 ARG BASE_IMAGE=debian:12-slim
-FROM golang:1.22.7 as builder
+FROM golang:1.24.0 AS builder
 ARG TARGETOS


@@ -3,7 +3,7 @@ ARG BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3
 ARG USER=root
-FROM golang:1.22.7 as build
+FROM golang:1.23.10 AS build
 RUN mkdir -p /go/src/github.com/kubeflow/arena


@@ -2,7 +2,7 @@ ARG BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-no
 ARG USER=jovyan
-FROM golang:1.22.7 as build
+FROM golang:1.23.10 AS build
 RUN mkdir -p /go/src/github.com/kubeflow/arena


@@ -18,8 +18,8 @@ DIST_DIR ?= $(CURRENT_DIR)/bin
 ARENA_CLI_NAME ?= arena
 JOB_MONITOR ?= jobmon
 ARENA_UNINSTALL ?= arena-uninstall
-OS ?= linux
-ARCH ?= amd64
+OS ?= $(shell go env GOOS)
+ARCH ?= $(shell go env GOARCH)
 VERSION ?= $(shell cat VERSION)
 BUILD_DATE := $(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
@@ -34,17 +34,26 @@ PACKR_CMD := $(shell if [ "`which packr`" ]; then echo "packr"; else echo "go ru
 LOCALBIN ?= $(CURRENT_DIR)/bin
 # Location to put temp files
 TEMPDIR ?= $(CURRENT_DIR)/tmp
+# ARENA_ARTIFACTS
+ARENA_ARTIFACTS_CHART_PATH ?= $(CURRENT_DIR)/arena-artifacts
 # Versions
 GOLANG_VERSION=$(shell grep -e '^go ' go.mod | cut -d ' ' -f 2)
-KUBECTL_VERSION ?= 1.28.4
-HELM_VERSION ?= 3.13.3
-GOLANGCI_LINT_VERSION ?= 1.57.2
+KUBECTL_VERSION ?= v1.28.4
+HELM_VERSION ?= $(shell grep -e 'helm.sh/helm/v3 ' go.mod | cut -d ' ' -f 2)
+HELM_UNITTEST_VERSION ?= 0.5.1
+KIND_VERSION ?= v0.23.0
+KIND_K8S_VERSION ?= v1.29.3
+ENVTEST_VERSION ?= release-0.18
+ENVTEST_K8S_VERSION ?= 1.29.3
+GOLANGCI_LINT_VERSION ?= v2.1.6
 # Binaries
 ARENA ?= arena-v$(VERSION)-$(OS)-$(ARCH)
-KUBECTL ?= kubectl-v$(KUBECTL_VERSION)-$(OS)-$(ARCH)
-HELM ?= helm-v$(HELM_VERSION)-$(OS)-$(ARCH)
+KUBECTL ?= kubectl-$(KUBECTL_VERSION)-$(OS)-$(ARCH)
+HELM ?= helm-$(HELM_VERSION)-$(OS)-$(ARCH)
+KIND ?= $(LOCALBIN)/kind-$(KIND_VERSION)
+ENVTEST ?= $(LOCALBIN)/setup-envtest-$(ENVTEST_VERSION)
 GOLANGCI_LINT ?= golangci-lint-$(GOLANGCI_LINT_VERSION)
 # Tarballs
@@ -113,6 +122,9 @@ endif
 help: ## Display this help.
 	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-30s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
+.PHONY: all
+all: go-fmt go-vet go-lint unit-test e2e-test
 ##@ Development
 go-fmt: ## Run go fmt against code.
@@ -136,7 +148,12 @@ go-lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes.
 .PHONY: unit-test
 unit-test: ## Run go unit tests.
 	@echo "Running go test..."
-	go test ./... -coverprofile cover.out
+	go test $(shell go list ./... | grep -v /e2e) -coverprofile cover.out
+.PHONY: e2e-test
+e2e-test: envtest ## Run the e2e tests against a Kind k8s instance that is spun up.
+	@echo "Running e2e tests..."
+	go test ./test/e2e/ -v -ginkgo.v -timeout 30m
 # Build the project
 .PHONY: default
@@ -166,8 +183,7 @@ clean: ## Clean up all downloaded and generated files.
 	rm -rf $(LOCALBIN) $(TEMPDIR)
 .PHONY: arena
-arena: $(LOCALBIN)/$(ARENA) ## Build arena CLI for current platform.
-$(LOCALBIN)/$(ARENA): $(LOCALBIN)
+arena: $(LOCALBIN) ## Build arena CLI for current platform.
 	@echo "Building arena CLI..."
 	CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build -tags netgo -ldflags '${LDFLAGS}' -o $(LOCALBIN)/$(ARENA) cmd/arena/main.go
@@ -219,30 +235,41 @@ build-dependabot:
 arena-installer: $(ARENA_INSTALLER_TARBALL) ## Build arena installer tarball
 $(ARENA_INSTALLER_TARBALL): arena kubectl helm
 	echo "Building arena installer tarball..." && \
 	rm -rf $(TEMPDIR)/$(ARENA_INSTALLER) && \
 	mkdir -p $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
 	cp $(LOCALBIN)/$(ARENA) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena && \
 	cp $(LOCALBIN)/$(KUBECTL) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/kubectl && \
 	cp $(LOCALBIN)/$(HELM) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/helm && \
 	cp -R charts $(TEMPDIR)/$(ARENA_INSTALLER) && \
 	cp -R arena-artifacts $(TEMPDIR)/$(ARENA_INSTALLER) && \
 	cp -R kubernetes-artifacts $(TEMPDIR)/$(ARENA_INSTALLER) && \
 	cp arena-gen-kubeconfig.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
 	cp install.sh $(TEMPDIR)/$(ARENA_INSTALLER) && \
 	cp uninstall.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena-uninstall && \
 	tar -zcf $(ARENA_INSTALLER).tar.gz -C $(TEMPDIR) $(ARENA_INSTALLER) && \
 	echo "Successfully saved arena installer to $(ARENA_INSTALLER).tar.gz."
+##@ Helm
+.PHONY: helm-unittest
+helm-unittest: helm-unittest-plugin ## Run Helm chart unittests.
+	set -x && $(LOCALBIN)/$(HELM) unittest $(ARENA_ARTIFACTS_CHART_PATH) --strict --file "tests/**/*_test.yaml" --chart-tests-path $(CURRENT_DIR)
 ##@ Dependencies
 .PHONY: golangci-lint
 golangci-lint: $(LOCALBIN)/$(GOLANGCI_LINT) ## Download golangci-lint locally if necessary.
 $(LOCALBIN)/$(GOLANGCI_LINT): $(LOCALBIN)
-	$(call go-install-tool,$(LOCALBIN)/$(GOLANGCI_LINT),github.com/golangci/golangci-lint/cmd/golangci-lint,${GOLANGCI_LINT_VERSION})
+	$(call go-install-tool,$(LOCALBIN)/$(GOLANGCI_LINT),github.com/golangci/golangci-lint/v2/cmd/golangci-lint,${GOLANGCI_LINT_VERSION})
+.PHONY: envtest
+envtest: $(ENVTEST) ## Download setup-envtest locally if necessary.
+$(ENVTEST): $(LOCALBIN)
+	$(call go-install-tool,$(ENVTEST),sigs.k8s.io/controller-runtime/tools/setup-envtest,$(ENVTEST_VERSION))
 .PHONY: kubectl
 kubectl: $(LOCALBIN)/$(KUBECTL)
 $(LOCALBIN)/$(KUBECTL): $(LOCALBIN) $(TEMPDIR)
-	$(eval KUBECTL_URL=https://dl.k8s.io/release/v$(KUBECTL_VERSION)/bin/$(OS)/$(ARCH)/kubectl)
+	$(eval KUBECTL_URL=https://dl.k8s.io/release/$(KUBECTL_VERSION)/bin/$(OS)/$(ARCH)/kubectl)
 	$(eval KUBECTL_SHA_URL=$(KUBECTL_URL).sha256)
 	cd $(TEMPDIR) && \
@@ -278,11 +305,18 @@ $(LOCALBIN)/$(HELM): $(LOCALBIN) $(TEMPDIR)
 	fi && \
 	echo "Verifying checksum..." && \
 	cat $(HELM).tar.gz.sha256sum | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
-	echo "Extrat helm tarball and move it to bin directory..." && \
+	echo "Extract helm tarball and move it to bin directory..." && \
 	tar -zxf $(HELM).tar.gz && \
 	cp ${OS}-${ARCH}/helm $(LOCALBIN)/$(HELM) && \
 	echo "Successfully installed helm to $(LOCALBIN)/$(HELM)."
+.PHONY: helm-unittest-plugin
+helm-unittest-plugin: helm ## Download helm unittest plugin locally if necessary.
+	if [ -z "$(shell $(LOCALBIN)/$(HELM) plugin list | grep unittest)" ]; then \
+		echo "Installing helm unittest plugin"; \
+		$(LOCALBIN)/$(HELM) plugin install https://github.com/helm-unittest/helm-unittest.git --version $(HELM_UNITTEST_VERSION); \
+	fi
 # go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
 # $1 - target path with name of binary (ideally with version)
 # $2 - package url which can be installed
@@ -290,7 +324,7 @@ $(LOCALBIN)/$(HELM): $(LOCALBIN) $(TEMPDIR)
 define go-install-tool
 @[ -f $(1) ] || { \
 set -e; \
-package=$(2)@v$(3) ;\
+package=$(2)@$(3) ;\
 echo "Downloading $${package}" ;\
 GOBIN=$(LOCALBIN) go install $${package} ;\
 mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\


@@ -1,6 +1,6 @@
 # Arena
-[![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml)[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
+[![GitHub release](https://img.shields.io/github/v/release/kubeflow/arena)](https://github.com/kubeflow/arena/releases) [![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml) [![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
 View the [Arena documentation](https://arena-docs.readthedocs.io/en/latest).
@@ -59,7 +59,7 @@ Then you can analyze the profile by following [Go CPU profiling: pprof and speed
 ## Adopters
-If you are intrested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuousely discuss new requirements and feature design with you in advance.
+If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information on [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuously discuss new requirements and feature design with you in advance.
 ## FAQ


@@ -49,13 +49,13 @@ Objectives: "Simplify the user experience of the data scientists and provide a l
 * Submit and manage Model Serving with [KF Serving](https://github.com/kubeflow/kfserving)
-Objectives: "Make Arena support the same Operator compatiable with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
+Objectives: "Make Arena support the same Operator compatible with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
 * Compatibility:
   * v1aphla2 and v1 TFJob
   * v1alpha1 and v1aphla2 MPIJob
-Objectives: "Enchance the software quality of Arena so it can be in the quick iteration"
+Objectives: "Enhance the software quality of Arena so it can be in the quick iteration"
 * Refactor the source code
 * Move Training implementation from `cmd` into `pkg`


@@ -1 +1 @@
-0.10.1
+0.15.1


@@ -1,16 +0,0 @@
-# Adopters Of Arena
-Below are the adopters of project Arena. If you are using Arena to improve efficiency and productivity in Machine Learning with Kubernetes, please feel free to add yourself into the following list by a pull request. There're several phases as follow:
-* **Evaluation:** Known Arena, that's interesting; evaluating the features/scopes of Arena
-* **Testing:** Take Arena as one of candidates, testing Kubernetes cluster with Arena
-* **Staging:** Decide to use Arena, testing it in pre-product environment
-* **Production:** Already put Arena into product environment
-| Organization | Contact | Phases | Description of Use |
-| ------------ | ------- | ----------- | ------------------ |
-| [Weibo](https://www.weibo.com) | [@phoenixwu0229](https://github.com/phoenixwu0229) | **Production** | Weibo ML Platform |
-| [HUYA](https://www.huya.com) | [@BobLiu20](https://github.com/bobliu20) | **Production** | HUYA AI Platform |
-| [Microsoft](https://www.microsoft.com) | [@chaowangnk1](https://github.com/chaowangnk1) | **Testing** | AzureML DataCache internal benchmark system |
-| [Unisound](https://www.unisound.com) | [@xieydd](https://github.com/xieydd) | **Production** | Unisound ATLAS AI Platform |
-| [DOUYU](https://www.douyu.com) | [@gongcan1219](https://github.com/gongcan1219) | **Production** | DOUYU AI Platform |

View File

@ -1,40 +0,0 @@
## arena
arena is the command line interface to Arena
### Synopsis
arena is the command line interface to Arena
```
arena [flags]
```
### Options
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
-h, --help help for arena
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena completion](arena_completion.md) - output shell completion code for the specified shell (bash or zsh)
* [arena data](arena_data.md) - manage data.
* [arena delete](arena_delete.md) - delete a training job and its associated pods
* [arena get](arena_get.md) - display details of a training job
* [arena list](arena_list.md) - list all the training jobs
* [arena logs](arena_logs.md) - print the logs for a task of the training job
* [arena logviewer](arena_logviewer.md) - display Log Viewer URL of a training job
* [arena prune](arena_prune.md) - prune history job
* [arena serve](arena_serve.md) - Serve a job.
* [arena submit](arena_submit.md) - Submit a job.
* [arena top](arena_top.md) - Display Resource (GPU) usage.
* [arena version](arena_version.md) - Print version information
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,43 +0,0 @@
## arena completion
output shell completion code for the specified shell (bash or zsh)
### Synopsis
Write bash or zsh shell completion code to standard output.
For bash, ensure you have bash completions installed and enabled.
To access completions in your current shell, run
$ source <(arena completion bash)
Alternatively, write it to a file and source it in .bash_profile
For zsh, output to a file in a directory referenced by the $fpath shell
variable.
```
arena completion SHELL [flags]
```
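For example, the following are typical ways to wire this up (the zsh completion directory is an assumption; use any directory on your `$fpath`):
```
# bash: load completions into the current shell
source <(arena completion bash)

# zsh: write the completion script to a directory on $fpath
arena completion zsh > ~/.zsh/completions/_arena
```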
### Options
```
-h, --help help for completion
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,39 +0,0 @@
## arena data
manage data.
### Synopsis
manage data volumes.
Available Commands:
list,ls List the data volumes.
```
arena data [flags]
```
### Options
```
-h, --help help for data
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena data list](arena_data_list.md) - list all the data volume.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena data list
list all the data volume.
### Synopsis
list all the data volume.
```
arena data list [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena data](arena_data.md) - manage data.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena delete
delete a training job and its associated pods
### Synopsis
delete a training job and its associated pods
```
arena delete a training job [flags]
```
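For example, deleting a TFJob by name might look like this (the job name `tf-git` is a placeholder):
```
arena delete tf-git --type tfjob
```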
### Options
```
-h, --help help for delete
--type string The training type to delete, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena get
display details of a training job
### Synopsis
display details of a training job
```
arena get training job [flags]
```
### Options
```
-e, --events Specify if show pending pod's events.
-h, --help help for get
-o, --output string Output format. One of: json|yaml|wide
      --type string     The training type to get, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena list
list all the training jobs
### Synopsis
list all the training jobs
```
arena list [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,41 +0,0 @@
## arena logs
print the logs for a task of the training job
### Synopsis
print the logs for a task of the training job
```
arena logs training job [flags]
```
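For example, following the most recent lines from one task instance might look like this (the job and instance names are placeholders):
```
# stream the last 100 log lines of a specific worker instance
arena logs tf-git -i tf-git-tfjob-worker-0 --tail 100 -f
```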
### Options
```
-f, --follow Specify if the logs should be streamed.
-h, --help help for logs
-i, --instance string Specify the task instance to get log
--since string Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used.
--since-time string Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used.
--tail int Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines otherwise 10, if a selector is provided. (default -1)
--timestamps Include timestamps on each line in the log output
--type string The training type to show logging, the possible option is tfjob, mpijob, horovodjob or standalonejob. (optional)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena logviewer
display Log Viewer URL of a training job
### Synopsis
display Log Viewer URL of a training job
```
arena logviewer job [flags]
```
### Options
```
-h, --help help for logviewer
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena prune
prune history job
### Synopsis
prune history job
```
arena prune history job [flags]
```
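For example, pruning finished jobs that have lived longer than three days might look like this (the duration is illustrative):
```
arena prune -s 72h
```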
### Options
```
-h, --help help for prune
-s, --since duration Clean job that live longer than relative duration like 5s, 2m, or 3h. (default -1ns)
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,43 +0,0 @@
## arena serve
Serve a job.
### Synopsis
serve a job.
Available Commands:
tensorflow,tf Submit a TensorFlow Serving Job.
tensorrt,trt Submit a TensorRT Job
```
arena serve [flags]
```
### Options
```
-h, --help help for serve
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena serve delete](arena_serve_delete.md) - delete a serving job and its associated pods
* [arena serve list](arena_serve_list.md) - list all the serving jobs
* [arena serve tensorflow](arena_serve_tensorflow.md) - Submit tensorflow serving job to deploy and serve machine learning models.
* [arena serve tensorrt](arena_serve_tensorrt.md) - Submit tensorRT inference serving job to deploy and serve machine learning models.
* [arena serve traffic-split](arena_serve_traffic-split.md) - Adjust traffic routing dynamically for tfserving jobs
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena serve delete
delete a serving job and its associated pods
### Synopsis
delete a serving job and its associated pods
```
arena serve delete a serving job [flags]
```
### Options
```
-h, --help help for delete
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,34 +0,0 @@
## arena serve list
list all the serving jobs
### Synopsis
list all the serving jobs
```
arena serve list [flags]
```
### Options
```
-h, --help help for list
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,54 +0,0 @@
## arena serve tensorflow
Submit tensorflow serving job to deploy and serve machine learning models.
### Synopsis
Submit tensorflow serving job to deploy and serve machine learning models.
```
arena serve tensorflow [flags]
```
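Putting a few of the flags below together, a minimal submission could look like the following sketch; the serving name, model name, data source, and mount path are hypothetical:
```
arena serve tensorflow \
    --servingName=mymodel \
    --modelName=mymodel \
    --modelPath=/tfmodel/mymodel \
    --data=model-datasource:/tfmodel \
    --versionPolicy=latest \
    --replicas=2
```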
### Options
```
--command string the command will inject to container's command.
--cpu string the request cpu of each replica to run the serve.
-d, --data stringArray specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
--enableIstio enable Istio for serving or not (disable Istio by default)
-e, --envs stringArray the environment variables
--exposeService expose service using Istio gateway for external access or not (not expose by default)
--gpumemory int the limit GPU memory of each replica to run the serve.
--gpus int the limit GPU count of each replica to run the serve.
-h, --help help for tensorflow
--image string the docker image name of serve job, and the default image is tensorflow/serving:latest (default "tensorflow/serving:latest")
--imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent")
--memory string the request memory of each replica to run the serve.
--modelConfigFile string Corresponding with --model_config_file in tensorflow serving
--modelName string the model name for serving
--modelPath string the model path for serving in the container
--port int the port of tensorflow gRPC listening port (default 8500)
--replicas int the replicas number of the serve job. (default 1)
--restfulPort int the port of tensorflow RESTful listening port (default 8501)
--servingName string the serving name
--servingVersion string the serving version
--versionPolicy string support latest, latest:N, specific:N, all
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,55 +0,0 @@
## arena serve tensorrt
Submit tensorRT inference serving job to deploy and serve machine learning models.
### Synopsis
Submit tensorRT inference serving job to deploy and serve machine learning models.
```
arena serve tensorrt [flags]
```
### Options
```
--allowMetrics Open Metric
--command string the command will inject to container's command.
--cpu string the request cpu of each replica to run the serve.
-d, --data stringArray specify the trained models datasource to mount for serving, like <name_of_datasource>:<mount_point_on_job>
--enableIstio enable Istio for serving or not (disable Istio by default)
-e, --envs stringArray the environment variables
--exposeService expose service using Istio gateway for external access or not (not expose by default)
--gpumemory int the limit GPU memory of each replica to run the serve.
--gpus int the limit GPU count of each replica to run the serve.
--grpcPort int the port of grpc serving server (default 8001)
-h, --help help for tensorrt
--httpPort int the port of http serving server (default 8000)
--image string the docker image name of serve job, and the default image is registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3 (default "registry.cn-beijing.aliyuncs.com/xiaozhou/tensorrt-serving:18.12-py3")
--imagePullPolicy string the policy to pull the image, and the default policy is IfNotPresent (default "IfNotPresent")
--memory string the request memory of each replica to run the serve.
--metricPort int the port of metrics server (default 8002)
--modelName string the model name for serving
--modelPath string the model path for serving in the container
--modelStore string the path of tensorRT model path
--replicas int the replicas number of the serve job. (default 1)
--servingName string the serving name
--servingVersion string the serving version
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,36 +0,0 @@
## arena serve traffic-router-split
Adjust traffic routing dynamically for tfserving jobs
### Synopsis
Adjust traffic routing dynamically for tfserving jobs
```
arena serve traffic-router-split [flags]
```
### Options
```
-h, --help help for traffic-router-split
--servingName string the serving name
--versions string Model versions which the traffic will be routed to, e.g. [1,2,3] (default "[]")
--weights string Weight percentage values for each model version which the traffic will be routed to,e.g. [70,20,10] (default "[]")
```
### Options inherited from parent commands
```
--arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
--namespace string the namespace of the job (default "default")
--pprof enable cpu profile
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 7-Sep-2018

View File

@ -1,37 +0,0 @@
## arena serve traffic-split
Adjust traffic routing dynamically for tfserving jobs
### Synopsis
Adjust traffic routing dynamically for tfserving jobs
```
arena serve traffic-split [flags]
```
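For example, routing 80% of traffic to model version 1 and 20% to version 2 might look like this (the serving name is a placeholder):
```
arena serve traffic-split \
    --servingName=mymodel \
    --servingVersions=1,2 \
    --weights=80,20
```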
### Options
```
-h, --help help for traffic-split
--servingName string the serving name
--servingVersions string Model versions which the traffic will be routed to, e.g. 1,2,3
--weights string Weight percentage values for each model version which the traffic will be routed to,e.g. 70,20,10
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena serve](arena_serve.md) - Serve a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,47 +0,0 @@
## arena submit
Submit a job.
### Synopsis
Submit a job.
Available Commands:
tfjob,tf Submit a TFJob.
horovod,hj Submit a Horovod Job.
mpijob,mpi Submit a MPIJob.
standalonejob,sj Submit a standalone Job.
tfserving,tfserving Submit a Serving Job.
sparkjob,spark Submit a Spark Job.
```
arena submit [flags]
```
### Options
```
-h, --help help for submit
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena submit horovodjob](arena_submit_horovodjob.md) - Submit horovodjob as training job.
* [arena submit mpijob](arena_submit_mpijob.md) - Submit MPIjob as training job.
* [arena submit standalonejob](arena_submit_standalonejob.md) - Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
* [arena submit tfjob](arena_submit_tfjob.md) - Submit TFJob as training job.
* [arena submit sparkjob](arena_submit_sparkjob.md) - Submit SparkJob as training job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,51 +0,0 @@
## arena submit horovodjob
Submit horovodjob as training job.
### Synopsis
Submit horovodjob as training job.
```
arena submit horovodjob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for horovodjob
--image string the docker image name of training job
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sshPort int ssh port.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,53 +0,0 @@
## arena submit mpijob
Submit MPIjob as training job.
### Synopsis
Submit MPIjob as training job.
```
arena submit mpijob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for mpijob
--image string the docker image name of training job
--logdir string the training logs dir, default is /training_logs (default "/training_logs")
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--tensorboard enable tensorboard
--tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena submit sparkjob
Submit SparkJob as training job.
### Synopsis
Submit SparkJob as training job.
```
arena submit sparkjob [flags]
```
### Options
```
--image string the docker image name of training job
--jar string jar path in image
--main-class string main class of your jar
--name string override name
--workers int the worker number to run the distributed training. (default 1)
```
### Options inherited from parent commands
```
--arenaNamespace string The namespace of arena system service, like TFJob (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
--namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.

View File

@ -1,52 +0,0 @@
## arena submit standalonejob(deprecated)
**Warning: standalonejob has been deprecated,please use [tfjob](../userguide/1-tfjob-standalone.md) instead.**
Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
### Synopsis
Submit StandaloneJob as training job. And it will be deprecated soon, please use tfjob instead.
```
arena submit standalonejob [flags]
```
### Options
```
-a, --annotation stringArray the annotations
--cpu string the cpu resource to use for the training, like 1 for 1 core.
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--gpus int the GPU count of each worker to run the training.
-h, --help help for standalonejob
--image string the docker image name of training job
--memory string the memory resource to use for the training, like 1Gi.
--name string override name
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,68 +0,0 @@
## arena submit tfjob
Submit TFJob as training job.
### Synopsis
Submit TFJob as training job.
```
arena submit tfjob [flags]
```
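As a sketch, a distributed submission with one parameter server and TensorBoard enabled could combine the flags below; the training script path is a placeholder:
```
arena submit tfjob \
    --name=tf-dist \
    --workers=2 \
    --gpus=1 \
    --ps=1 \
    --tensorboard \
    --image=tensorflow/tensorflow:1.5.0-devel-gpu \
    "python /app/main.py"
```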
### Options
```
-a, --annotation stringArray the annotations
--chief enable chief, which is required for estimator.
--chief-cpu string the cpu resource to use for the Chief, like 1 for 1 core.
--chief-memory string the memory resource to use for the Chief, like 1Gi.
--chief-port int the port of the chief.
--clean-task-policy string How to clean tasks after Training is done, only support Running, None. (default "Running")
-d, --data stringArray specify the datasource to mount to the job, like <name_of_datasource>:<mount_point_on_job>
--data-dir stringArray the data dir. If you specify /data, it means mounting hostpath /data into container path /data
-e, --env stringArray the environment variables
--evaluator enable evaluator, which is optional for estimator.
--evaluator-cpu string the cpu resource to use for the evaluator, like 1 for 1 core.
--evaluator-memory string the memory resource to use for the evaluator, like 1Gi.
--gpus int the GPU count of each worker to run the training.
-h, --help help for tfjob
--image string the docker image name of training job
--logdir string the training logs dir, default is /training_logs (default "/training_logs")
--name string override name
--ps int the number of the parameter servers.
--ps-cpu string the cpu resource to use for the parameter servers, like 1 for 1 core.
--ps-image string the docker image for tensorflow workers
--ps-memory string the memory resource to use for the parameter servers, like 1Gi.
--ps-port int the port of the parameter server.
--rdma enable RDMA
--retry int retry times.
--sync-image string the docker image of syncImage
--sync-mode string syncMode: support rsync, hdfs, git
--sync-source string sync-source: for rsync, it's like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it's like https://github.com/kubeflow/tf-operator.git
--tensorboard enable tensorboard
--tensorboard-image string the docker image for tensorboard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
--worker-cpu string the cpu resource to use for the worker, like 1 for 1 core.
--worker-image string the docker image for tensorflow workers
--worker-memory string the memory resource to use for the worker, like 1Gi.
--worker-port int the port of the worker.
--workers int the worker number to run the distributed training. (default 1)
--working-dir string working directory to extract the code. If using syncMode, the $workingDir/code contains the code (default "/root")
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena submit](arena_submit.md) - Submit a job.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,41 +0,0 @@
## arena top
Display Resource (GPU) usage.
### Synopsis
Display Resource (GPU) usage.
Available Commands:
node Display Resource (GPU) usage of nodes
job Display Resource (GPU) usage of pods
```
arena top [flags]
```
### Options
```
-h, --help help for top
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
* [arena top job](arena_top_job.md) - Display Resource (GPU) usage of jobs.
* [arena top node](arena_top_node.md) - Display Resource (GPU) usage of nodes.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,37 +0,0 @@
## arena top job
Display Resource (GPU) usage of jobs.
### Synopsis
Display Resource (GPU) usage of jobs.
```
arena top job [flags]
```
### Options
```
--allNamespaces show all the namespaces
-h, --help help for job
-i, --instance string Display instance top info
-r, --refresh Display continuously
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena top](arena_top.md) - Display Resource (GPU) usage.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena top node
Display Resource (GPU) usage of nodes.
### Synopsis
Display Resource (GPU) usage of nodes.
```
arena top node [flags]
```
### Options
```
-d, --details Display details
-h, --help help for node
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena top](arena_top.md) - Display Resource (GPU) usage.
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,35 +0,0 @@
## arena version
Print version information
### Synopsis
Print version information
```
arena version [flags]
```
### Options
```
-h, --help help for version
--short print just the version number
```
### Options inherited from parent commands
```
--arena-namespace string The namespace of arena system service, like tf-operator (default "arena-system")
--config string Path to a kube config. Only required if out-of-cluster
--loglevel string Set the logging level. One of: debug|info|warn|error (default "info")
-n, --namespace string the namespace of the job (default "default")
--pprof enable cpu profile
--trace enable trace
```
### SEE ALSO
* [arena](arena.md) - arena is the command line interface to Arena
###### Auto generated by spf13/cobra on 24-Apr-2019

View File

@ -1,50 +0,0 @@
## The TFJob plugin framework
Use this framework if you'd like to customize or enhance the TFJob with your own chart or code.
## Developer Workflow
### Step 1: Implement the following function (optional)
```
// Customized runtime for tf training
type tfRuntime interface {
// check the tfjob args
check(tf *submitTFJobArgs) (err error)
// transform the tfjob
transform(tf *submitTFJobArgs) (err error)
getChartName() string
}
```
You can refer to the implementation of the default tf runtime in [../../cmd/arena/commands/training_plugin_interface.go](training_plugin_interface.go).
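As an illustration, a minimal `mock` runtime satisfying this interface might look like the sketch below; the validation logic is hypothetical, and `submitTFJobArgs` is the argument struct from the default runtime linked above:
```
// mockRuntime is a hypothetical tfRuntime implementation for a `mock` chart.
type mockRuntime struct{}

// check validates the tfjob args before the chart is rendered.
// (assumes "errors" is imported)
func (m mockRuntime) check(tf *submitTFJobArgs) (err error) {
	if tf == nil {
		return errors.New("no TFJob arguments provided")
	}
	return nil
}

// transform mutates the tfjob args, e.g. to inject defaults; a no-op here.
func (m mockRuntime) transform(tf *submitTFJobArgs) (err error) {
	return nil
}

// getChartName points the submission at the copied `mock` chart directory.
func (m mockRuntime) getChartName() string {
	return "mock"
}
```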
### Step 2. Create your own chart
If you don't need custom code for `check` or `transform`, you can simply create the chart in the same directory as tfjob and mpijob. For example, for a chart named `mock`:
```
cd /charts
cp -r tfjob mock
```
## User Workflow
Just run the command, specifying the annotation `runtime={your runtime}`:
```
arena submit tf \
--name=test \
--annotation="runtime=mock" \
--workers=1 \
--chief \
--chief-cpu=4 \
--evaluator \
--evaluator-cpu=4 \
--worker-cpu=2 \
"python test.py"
```

View File

@ -1,118 +0,0 @@
## Setup
This documentation assumes you have a Kubernetes cluster already available.
If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena doesn't have to run within the Kubernetes cluster; it can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.
### Requirements
* Linux OS
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`
2\. Download the latest installer from [Release Page](https://github.com/kubeflow/arena/releases), and rename it to `arena-installer.tar.gz`
3\. Untar the installer package
```
# tar -xvf arena-installer.tar.gz
```
4\. Set up environment variables for customization
4.1\. If you'd like to run training and serving on the host network
```
export USE_HOSTNETWORK=true
```
4.2\. If you'd like to customize the Kubernetes namespace of the arena infrastructure
```
export NAMESPACE={your namespace}
```
4.3\. If you'd like to use your private docker registry instead of `ACR (Alibaba Cloud Container Registry)`:
```
export DOCKER_REGISTRY={your docker registry}
```
4.4\. If you'd like to deploy prometheus in `ACK (Alibaba Container Service for Kubernetes)`
```
export USE_PROMETHEUS=true
export PLATFORM=ack
```
4.5\. If you'd like to use a cloud load balancer
```
export USE_LOADBALANCER=true
```
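For example, a customized installation might export several of these variables together before running the installer; the registry and namespace values here are placeholders:
```
export USE_HOSTNETWORK=true
export NAMESPACE=arena
export DOCKER_REGISTRY=registry.example.com/arena
export USE_LOADBALANCER=true
```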
5\. Install arena
```
# cd arena-installer
# sudo ./install.sh
```
6\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.
```
yum install bash-completion -y
```
On Debian or Ubuntu Linux you may need to install with
```
apt-get install bash-completion
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
On MacOS, please use bash
You can install it with Homebrew:
```
brew install bash-completion@2
```
To add arena autocompletion to your profile so it is automatically loaded in future shells, run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
chmod u+x ~/.bashrc
```
For MacOS, add the following to your `~/.bashrc` file:
```
echo "source $(brew --prefix)/etc/profile.d/bash_completion.sh" >> ~/.bashrc
```
Then you can use [tab] to auto-complete the command:
```
# arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
# arena get [tab]
caffe-1080ti-1 tf1
```

View File

@ -1,157 +0,0 @@
## Setup
This documentation assumes you have a Kubernetes cluster already available.
If you need help setting up a Kubernetes cluster please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena doesn't have to run within the Kubernetes cluster; it can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster there, you can also use `arena` to manage training jobs.
### Requirements
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`
2\. Install kubectl client
Please follow [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
3\. Install Helm client
- Download Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack it (tar -zxvf helm-v2.14.1-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)
Then run `arena-helm list` to check whether the Kubernetes cluster can be managed successfully by helm.
```
# arena-helm list
# echo $?
0
```
4\. Download the charts
```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```
5\. Install TFJob Controller
```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```
6\. Install Dashboard
```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```
7\. Install MPIJob Controller
```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```
8\. Build arena
Prerequisites:
- Go >= 1.8
```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```
The `arena` binary is located in the `arena/bin` directory. You may want to add the directory to `$PATH`.
9\. Install and configure kube-arbitrator for gang scheduling (optional)
```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```
10\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.
```
yum install bash-completion -y
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
To add arena autocompletion to your profile so it is automatically loaded in future shells, run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
```
Then you can use [tab] to auto-complete the command:
```
# arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
# arena get [tab]
caffe-1080ti-1 tf1
```
11\. Enable Host network for training (optional)
Training does not use the host network (`useHostNetwork`) by default. If you'd like to run training on the host network, you can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```
12\. Enable Loadbalancer in the public cloud (optional)
Kubernetes can be run on AWS, GCE, Azure, and Alibaba Cloud, and `LoadBalancer` is supported by their cloud providers. If you want to access tensorboard on the internet directly, you can run the command below:
```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers.
13\. Enable Ingress in the public cloud (optional)
If you have an ingress controller configured, you can access tensorboard through ingress. You can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers.
14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)
```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```
> Warning: this may cause the docker images to be out of date if they have already been downloaded on the node.

View File

@ -1,154 +0,0 @@
## Setup
This documentation assumes you have a Kubernetes cluster already available.
If you need help setting up a Kubernetes cluster, please refer to [Kubernetes Setup](https://kubernetes.io/docs/setup/).
If you want to use GPUs, be sure to follow the Kubernetes [instructions for enabling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).
Arena doesn't have to run within the Kubernetes cluster; it can also run on your laptop. If you can run `kubectl` to manage the Kubernetes cluster, you can also use `arena` to manage training jobs.
### Requirements
* Kubernetes >= 1.11, kubectl >= 1.11
* helm version [v2.14.1](https://docs.helm.sh/using_helm/#installing-helm) or later
* tiller with the same version as helm should also be installed (https://docs.helm.sh/using_helm/#installing-tiller)
### Steps
1\. Prepare a kubeconfig file by using `export KUBECONFIG=/etc/kubernetes/admin.conf` or creating a `~/.kube/config`
2\. Install the kubectl client
Please follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
3\. Install the Helm client
- Download the Helm client from [github.com](https://github.com/helm/helm/releases)
- Unpack the downloaded file (tar -zxvf helm-v2.8.2-linux-amd64.tgz)
- Find the `helm` binary in the unpacked directory and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/arena-helm)
Then run `helm list` to check whether helm can manage the Kubernetes cluster successfully.
```
#helm list
#echo $?
0
```
4\. Download the charts
```
mkdir /charts
git clone https://github.com/kubeflow/arena.git
cp -r arena/charts/* /charts
```
5\. Install the TFJob controller
```
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
```
6\. Install the dashboard (optional)
```
kubectl create -f arena/kubernetes-artifacts/dashboard/dashboard.yaml
```
7\. Install the MPIJob controller
```
kubectl create -f arena/kubernetes-artifacts/mpi-operator/mpi-operator.yaml
```
8\. Install arena
Prerequisites:
- Go >= 1.8
```
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
cd arena
make
```
The `arena` binary is located in the `arena/bin` directory. You may want to add the directory to `$PATH`.
9\. Install and configure kube-arbitrator for gang scheduling (optional)
```
kubectl create -f arena/kubernetes-artifacts/kube-batchd/kube-batched.yaml
```
10\. Enable shell autocompletion
On Linux, please use bash
On CentOS Linux, you may need to install the bash-completion package, which is not installed by default.
```
yum install bash-completion -y
```
To add arena autocompletion to your current shell, run `source <(arena completion bash)`.
To add arena autocompletion to your profile so it is automatically loaded in future shells, run:
```
echo "source <(arena completion bash)" >> ~/.bashrc
```
Then you can use [TAB] to auto-complete the command
```
#arena list
NAME STATUS TRAINER AGE NODE
tf1 PENDING TFJOB 0s N/A
caffe-1080ti-1 RUNNING HOROVOD 45s 192.168.1.120
#arena get [tab]
caffe-1080ti-1 tf1
```
11\. Enable the host network for training (optional)
Training does not use the host network (`useHostNetwork`) by default. If you'd like to run training on the host network, you can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/useHostNetwork/s/false/true/g"
```
12\. Enable Loadbalancer in the public cloud
Kubernetes can be run on AWS, GCE, Azure, and Alibaba Cloud, and `LoadBalancer` is supported by their cloud providers. If you want to access tensorboard on the internet directly, you can run the command below:
```
find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers.
13\. Enable Ingress in the public cloud
Kubernetes can be run on AWS, GCE, Azure, and Alibaba Cloud, and `Ingress` is supported by their cloud providers. If you want to access tensorboard through a unified entry point on the internet, you can run the command below:
```
find /charts/ -name values.yaml | xargs sed -i "/ingress/s/false/true/g"
```
> Warning: exposing the service to the internet is discouraged, because the service can easily be attacked by hackers.
14\. Change imagePullPolicy from `Always` to `IfNotPresent` (optional)
```
find /charts/ -name values.yaml| xargs sed -i "s/Always/IfNotPresent/g"
```
> Warning: this may cause the container images to be out of date if they have already been downloaded on the node.

Binary file not shown.


View File

@ -1,138 +0,0 @@
Here is an example of how you can use `Arena` for machine learning training. It will download the source code from a git URL.
1. The first step is to check the available resources:
```
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2\. Now we can submit a training job with `arena`; it will download the source code from GitHub
```
# arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory of the working directory. The default working directory is `/root`; you can also specify one by using `--workingDir`. Also, you may specify the branch you are pulling code from by adding `--env GIT_SYNC_BRANCH=main` to the parameters while submitting the job.
> If you are using the private git repo, you can use the following command:
```
# arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```
Notice: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to sync up source code. You can set the environment variables defined in the git-sync project.
3\. List all the jobs
```
# arena list
NAME STATUS TRAINER AGE NODE
tf-git RUNNING tfjob 0s 192.168.1.120
```
4\. Check the resource usage of the job
```
# arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-git RUNNING TFJOB 17s 192.168.1.120 1 1
Total Allocated GPUs of Training Job:
1
Total Requested GPUs of Training Job:
1
```
5\. Check the resource usage of the cluster
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 1
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```
6\. Get the details of the specific job
```
# arena get tf-git
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git RUNNING TFJOB 5s tf-git-tfjob-worker-0 192.168.1.120
```
7\. Check logs
```
# arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
```
8\. Get more information about the training job from the logviewer
```
# arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
```
![](1-tfjob-logviewer.jpg)
Congratulations! You've run the first training job with `arena` successfully.

View File

@ -1,45 +0,0 @@
Arena supports RDMA for distributed training. We can allocate an RDMA device for worker jobs by adding the parameter `--rdma`.
1. Deploy the RDMA device plugin
```
# Deploy RDMA device plugin
kubectl create -f kubernetes-artifacts/rdma/rdma-config.yaml
kubectl create -f kubernetes-artifacts/rdma/device-plugin.yaml
```
2\. Label your node that has an InfiniBand device
```
# Label RDMA NODE
kubectl label node <your node> accelerator/rdma=true
```
```
# Check Device plugin status
kubectl -n arena-system get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
rdma-sriov-dp-ds 1 1 1 1 1 accelerator/rdma=true 46d
```
3\. Enable arena RDMA config
```
find /charts/ -name values.yaml | xargs sed -i "/enableRDMA/s/false/true/g"
```
4\. Submit a TensorFlow training job using RDMA
```
# arena submit mpi --name=mpi-dist \
--rdma \
--gpus=1 \
--workers=2 \
--image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--syncMode=git \
--syncSource=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3
--save_summaries_steps=10"
```

View File

@ -1,201 +0,0 @@
Arena supports and simplifies running distributed Spark jobs.
### 1. To run a distributed Spark job, you need to specify:
- The Spark job image, which contains the main class jar (required)
- The main class of your jar (required)
- The jar path in the container (required)
- The number of executors (default: 1)
- The CPU request of the driver pod (default: 1)
- The memory request of the driver pod (default: 500m)
- The CPU request of the executor pod (default: 1)
- The memory request of the executor pod (default: 500m)
### 2. How to create a Spark job image
Arena Spark jobs are based on spark-on-k8s-operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). You can create a Spark job image with the `docker-image-tool` script (https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images), as sketched below.
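For reference, a typical invocation of that script from an unpacked Spark distribution looks roughly like the following; the registry and tag are placeholders:
```
./bin/docker-image-tool.sh -r registry.example.com/myrepo -t v2.4.0 build
./bin/docker-image-tool.sh -r registry.example.com/myrepo -t v2.4.0 push
```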
### 3. How to use Arena Spark jobs
##### install spark operator
```
# arena-system is the default namespace; if it does not exist, please create it.
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-operator.yaml
```
##### create the rbac for the spark job
The Spark job needs the service account `spark` to create executors.
```
kubectl create -f arena/kubernetes-artifacts/spark-operator/spark-rbac.yaml
```
The default namespace is `default`. If you want to run Spark jobs in another namespace, you can change the namespace in spark-rbac.yaml and create a new service account.
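If you only need the service account itself in another namespace, a quick sketch is shown below; the namespace is a placeholder, and the role bindings in spark-rbac.yaml must be adjusted to match:
```
kubectl -n my-spark-namespace create serviceaccount spark
```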
##### submit a spark job
```
arena submit sparkjob --name=demo --image=registry.aliyuncs.com/acs/spark:v2.4.0 --main-class=org.apache.spark.examples.SparkPi --jar=local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```
The result looks like the following.
```
configmap/demo-sparkjob created
configmap/demo-sparkjob labeled
sparkapplication.sparkoperator.k8s.io/demo created
INFO[0005] The Job demo has been submitted successfully
INFO[0005] You can run `arena get demo --type sparkjob` to check the job status
```
##### get spark job status
```
arena get --type=sparkjob demo
```
When the job succeeds, you will see the result below.
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 15s
NAME STATUS TRAINER AGE INSTANCE NODE
demo1 SUCCEEDED SPARKJOB 1h demo1-driver N/A
```
##### watch the logs of the spark job
```
arena logs -f demo
```
You will get the logs of the Spark driver pod.
```
2019-05-08T08:25:21.904409561Z ++ id -u
2019-05-08T08:25:21.904639867Z + myuid=0
2019-05-08T08:25:21.904649704Z ++ id -g
2019-05-08T08:25:21.904901542Z + mygid=0
2019-05-08T08:25:21.904909072Z + set +e
2019-05-08T08:25:21.905241846Z ++ getent passwd 0
2019-05-08T08:25:21.905608733Z + uidentry=root:x:0:0:root:/root:/bin/ash
2019-05-08T08:25:21.905623028Z + set -e
2019-05-08T08:25:21.905629226Z + '[' -z root:x:0:0:root:/root:/bin/ash ']'
2019-05-08T08:25:21.905633894Z + SPARK_K8S_CMD=driver
2019-05-08T08:25:21.905757494Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.90622059Z + shift 1
2019-05-08T08:25:21.906232126Z + SPARK_CLASSPATH=':/opt/spark/jars/*'
2019-05-08T08:25:21.906236316Z + env
2019-05-08T08:25:21.906239651Z + grep SPARK_JAVA_OPT_
2019-05-08T08:25:21.90624307Z + sort -t_ -k4 -n
2019-05-08T08:25:21.906585896Z + sed 's/[^=]*=\(.*\)/\1/g'
2019-05-08T08:25:21.906908601Z + readarray -t SPARK_EXECUTOR_JAVA_OPTS
2019-05-08T08:25:21.906917535Z + '[' -n '' ']'
2019-05-08T08:25:21.906999069Z + '[' -n '' ']'
2019-05-08T08:25:21.907003871Z + PYSPARK_ARGS=
2019-05-08T08:25:21.907006605Z + '[' -n '' ']'
2019-05-08T08:25:21.907008951Z + R_ARGS=
2019-05-08T08:25:21.907012105Z + '[' -n '' ']'
2019-05-08T08:25:21.907148385Z + '[' '' == 2 ']'
2019-05-08T08:25:21.907994286Z + '[' '' == 3 ']'
2019-05-08T08:25:21.908014459Z + case "$SPARK_K8S_CMD" in
2019-05-08T08:25:21.908018653Z + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
2019-05-08T08:25:21.908023924Z + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.20.90.160 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi spark-internal
2019-05-08T08:25:23.326681135Z 2019-05-08 08:25:23 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-05-08T08:25:23.829843117Z 2019-05-08 08:25:23 INFO SparkContext:54 - Running Spark version 2.4.0
2019-05-08T08:25:23.8529898Z 2019-05-08 08:25:23 INFO SparkContext:54 - Submitted application: Spark Pi
2019-05-08T08:25:23.94670344Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls to: root
2019-05-08T08:25:23.946735076Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls to: root
2019-05-08T08:25:23.946740267Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing view acls groups to:
2019-05-08T08:25:23.946744543Z 2019-05-08 08:25:23 INFO SecurityManager:54 - Changing modify acls groups to:
2019-05-08T08:25:23.946748767Z 2019-05-08 08:25:23 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-05-08T08:25:24.273960575Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2019-05-08T08:25:24.307632934Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering MapOutputTracker
2019-05-08T08:25:24.339548141Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-05-08T08:25:24.339577986Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-05-08T08:25:24.340887925Z 2019-05-08 08:25:24 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-05-08T08:25:24.359682519Z 2019-05-08 08:25:24 INFO DiskBlockManager:54 - Created local directory at /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/blockmgr-5532fd8b-64b9-492c-b94d-308b55d60a71
2019-05-08T08:25:24.388529744Z 2019-05-08 08:25:24 INFO MemoryStore:54 - MemoryStore started with capacity 110.0 MB
2019-05-08T08:25:24.413347888Z 2019-05-08 08:25:24 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-05-08T08:25:24.560654618Z 2019-05-08 08:25:24 INFO log:192 - Logging initialized @2462ms
2019-05-08T08:25:24.654721075Z 2019-05-08 08:25:24 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-05-08T08:25:24.680943254Z 2019-05-08 08:25:24 INFO Server:419 - Started @2586ms
2019-05-08T08:25:24.715867156Z 2019-05-08 08:25:24 INFO AbstractConnector:278 - Started ServerConnector@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:24.715897312Z 2019-05-08 08:25:24 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2019-05-08T08:25:24.76123501Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1450078a{/jobs,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.762173789Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@534ca02b{/jobs/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.763361524Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29a23c3d{/jobs/job,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.764374535Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6fe46b62{/jobs/job/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.764919809Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@591fd34d{/stages,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.765687152Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@61e45f87{/stages/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.766434602Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c9b78e3{/stages/stage,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769934319Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5491f68b{/stages/stage/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769949155Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@736ac09a{/stages/pool,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.769966711Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ecd665{/stages/pool/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.77037559Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@45394b31{/storage,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.772696599Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1ec7d8b3{/storage/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.772709487Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b0ca5e1{/storage/rdd,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.773014833Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb3131b{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.77546416Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54dcbb9f{/environment,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.775478151Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74fef3f7{/environment/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.775882882Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a037324{/executors,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.780702953Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@69eb86b4{/executors/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.780717178Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@585ac855{/executors/threadDump,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.78072195Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5bb8f9e2{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.793805533Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6a933be2{/static,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808511998Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@378bd86d{/,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808532751Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2189e7a7{/api,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808537695Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@644abb8f{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.80854206Z 2019-05-08 08:25:24 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a411233{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-05-08T08:25:24.808546336Z 2019-05-08 08:25:24 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://demo1-1557303918993-driver-svc.default.svc:4040
2019-05-08T08:25:24.834767942Z 2019-05-08 08:25:24 INFO SparkContext:54 - Added JAR file:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar at spark://demo1-1557303918993-driver-svc.default.svc:7078/jars/spark-examples_2.11-2.4.0.jar with timestamp 1557303924832
2019-05-08T08:25:26.274526541Z 2019-05-08 08:25:26 INFO ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes.
2019-05-08T08:25:26.455658752Z 2019-05-08 08:25:26 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2019-05-08T08:25:26.47651031Z 2019-05-08 08:25:26 INFO NettyBlockTransferService:54 - Server created on demo1-1557303918993-driver-svc.default.svc:7079
2019-05-08T08:25:26.476533172Z 2019-05-08 08:25:26 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-05-08T08:25:26.503099521Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.506168762Z 2019-05-08 08:25:26 INFO BlockManagerMasterEndpoint:54 - Registering block manager demo1-1557303918993-driver-svc.default.svc:7079 with 110.0 MB RAM, BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.529524775Z 2019-05-08 08:25:26 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.529543725Z 2019-05-08 08:25:26 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, demo1-1557303918993-driver-svc.default.svc, 7079, None)
2019-05-08T08:25:26.661414752Z 2019-05-08 08:25:26 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c777e7b{/metrics/json,null,AVAILABLE,@Spark}
2019-05-08T08:25:30.459756195Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.20.90.161:52168) with ID 1
2019-05-08T08:25:30.534179215Z 2019-05-08 08:25:30 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
2019-05-08T08:25:30.679510273Z 2019-05-08 08:25:30 INFO BlockManagerMasterEndpoint:54 - Registering block manager 172.20.90.161:36718 with 110.0 MB RAM, BlockManagerId(1, 172.20.90.161, 36718, None)
2019-05-08T08:25:30.906713226Z 2019-05-08 08:25:30 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2019-05-08T08:25:30.93537711Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
2019-05-08T08:25:30.936000643Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2019-05-08T08:25:30.936506781Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Parents of final stage: List()
2019-05-08T08:25:30.938152322Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Missing parents: List()
2019-05-08T08:25:30.958509715Z 2019-05-08 08:25:30 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2019-05-08T08:25:31.128459296Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 110.0 MB)
2019-05-08T08:25:31.172704042Z 2019-05-08 08:25:31 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 110.0 MB)
2019-05-08T08:25:31.178025215Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on demo1-1557303918993-driver-svc.default.svc:7079 (size: 1256.0 B, free: 110.0 MB)
2019-05-08T08:25:31.182000364Z 2019-05-08 08:25:31 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-05-08T08:25:31.202640906Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
2019-05-08T08:25:31.203502967Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2019-05-08T08:25:31.245126257Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 172.20.90.161, executor 1, partition 0, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.805815672Z 2019-05-08 08:25:31 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 172.20.90.161:36718 (size: 1256.0 B, free: 110.0 MB)
2019-05-08T08:25:31.946492966Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, 172.20.90.161, executor 1, partition 1, PROCESS_LOCAL, 7878 bytes)
2019-05-08T08:25:31.957903365Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 727 ms on 172.20.90.161 (executor 1) (1/2)
2019-05-08T08:25:31.99308236Z 2019-05-08 08:25:31 INFO TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 47 ms on 172.20.90.161 (executor 1) (2/2)
2019-05-08T08:25:31.994764897Z 2019-05-08 08:25:31 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2019-05-08T08:25:31.995390219Z 2019-05-08 08:25:31 INFO DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.998 s
2019-05-08T08:25:32.003622135Z 2019-05-08 08:25:32 INFO DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 1.094511 s
2019-05-08T08:25:32.005407995Z Pi is roughly 3.1436157180785904
2019-05-08T08:25:32.011499948Z 2019-05-08 08:25:32 INFO AbstractConnector:318 - Stopped Spark@7e97551f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-05-08T08:25:32.014105609Z 2019-05-08 08:25:32 INFO SparkUI:54 - Stopped Spark web UI at http://demo1-1557303918993-driver-svc.default.svc:4040
2019-05-08T08:25:32.01861939Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend:54 - Shutting down all executors
2019-05-08T08:25:32.019973046Z 2019-05-08 08:25:32 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Asking each executor to shut down
2019-05-08T08:25:32.025136562Z 2019-05-08 08:25:32 WARN ExecutorPodsWatchSnapshotSource:87 - Kubernetes client has been closed (this is expected if the application is shutting down.)
2019-05-08T08:25:32.087137746Z 2019-05-08 08:25:32 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-05-08T08:25:32.097659039Z 2019-05-08 08:25:32 INFO MemoryStore:54 - MemoryStore cleared
2019-05-08T08:25:32.098360561Z 2019-05-08 08:25:32 INFO BlockManager:54 - BlockManager stopped
2019-05-08T08:25:32.104432515Z 2019-05-08 08:25:32 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-05-08T08:25:32.10761075Z 2019-05-08 08:25:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-05-08T08:25:32.114734944Z 2019-05-08 08:25:32 INFO SparkContext:54 - Successfully stopped SparkContext
2019-05-08T08:25:32.117170277Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Shutdown hook called
2019-05-08T08:25:32.118273045Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bdb4e416-5ab7-420c-905e-ef43c30fb187
2019-05-08T08:25:32.120019227Z 2019-05-08 08:25:32 INFO ShutdownHookManager:54 - Deleting directory /var/data/spark-118b216d-2d39-4287-ad71-5b5d7c7195c9/spark-06dbab1f-13aa-474c-a1db-8845e14627bf
```
##### delete spark job
```
arena delete --type=sparkjob demo
```
You will see that the spark job has been deleted.
```
sparkapplication.sparkoperator.k8s.io "demo1" deleted
time="2019-05-08T17:27:06+08:00" level=info msg="The Job demo1 has been deleted successfully"
configmap "demo1-sparkjob" deleted
```
Congratulations! You've run the distributed spark job with `arena` successfully.


@ -1,156 +0,0 @@
# Arena supports and simplifies volcano jobs.
Volcano is a batch system built on Kubernetes. It provides a suite of mechanisms currently missing from
Kubernetes that are commonly required by many classes of batch & elastic workload including:
1. machine learning/deep learning,
2. bioinformatics/genomics, and
3. other "big data" applications.
## prerequisites
- a k8s deployment
- deploy volcano following the steps in kubernetes-artifacts/volcano-operator/README.md
### 1. To run a batch/distributed volcano job, you may need to specify:
```
--minAvailable int The minimal available pods to run for this Job. default value is 1 (default 1)
--name string override name
--queue string Specifies the queue that will be used in the scheduler, default queue is used this leaves empty (default "default")
--schedulerName string Specifies the scheduler Name, default is volcano when not specified (default "volcano")
--taskCPU string cpu request for each task replica / pod. default value is 250m (default "250m")
--taskImages strings the docker images of different tasks of volcano job. default used 3 tasks with ubuntu,nginx and busybox images (default [ubuntu,nginx,busybox])
--taskMemory string memory request for each task replica/pod.default value is 128Mi) (default "128Mi")
--taskName string the task name of volcano job, default value is task (default "task")
--taskPort int the task port number. default value is 2222 (default 2222)
--taskReplicas int the task replica's number to run the distributed tasks. default value is 1 (default 1)
```
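Putting several of these flags together, an illustrative example (the job name and values are hypothetical):
```
arena submit volcanojob --name demo-batch \
  --taskName task \
  --taskImages busybox,busybox \
  --taskReplicas 2 \
  --taskCPU 500m \
  --taskMemory 256Mi \
  --minAvailable 2
```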
### 2. More information about volcano jobs
Arena volcano jobs are based on [volcano](https://github.com/volcano-sh/volcano).
You can get more information about volcano from https://volcano.sh/
### 3. How to use Arena volcano job
##### install volcano
Deploy volcano following the steps in kubernetes-artifacts/volcano-operator/README.md.
To install the chart with the release name `volcano-release`:
```bash
$ helm install --name volcano-release kubernetes-artifacts/volcano-operator
```
To verify that all deployments are running, use the command below:
```bash
kubectl get deployment --all-namespaces | grep {release_name}
```
We should get output similar to the example below, where the three deployments for the controller, admission, and scheduler are running.
```bash
NAME READY UP-TO-DATE AVAILABLE AGE
{release_name}-admission 1/1 1 1 4s
{release_name}-controllers 1/1 1 1 4s
{release_name}-scheduler 1/1 1 1 4s
```
To verify that all pods are running, use the command below:
```bash
kubectl get pods --all-namespaces | grep {release_name}
```
We should get output similar to the example below, where the pods for admission, admission-init, controllers, and scheduler are running.
```bash
NAMESPACE NAME READY STATUS RESTARTS AGE
default volcano-release-admission-cbfdb8549-dz5hg 1/1 Running 0 33s
default volcano-release-admission-init-7xmzd 0/1 Completed 0 33s
default volcano-release-controllers-7967fffb8d-7vnn9 1/1 Running 0 33s
default volcano-release-scheduler-746f6557d8-9pfg6 1/1 Running 0 33s
```
##### submit a volcano job
```
arena submit volcanojob --name=demo
```
The output looks like this:
```
configmap/demo-volcanojob created
configmap/demo-volcanojob labeled
job.batch.volcano.sh/demo created
INFO[0003] The Job demo has been submitted successfully
INFO[0003] You can run `arena get demo --type volcanojob` to check the job status
```
If we want to provide more command-line parameters:
```
./bin/arena submit volcanojob --name demo12 --taskImages busybox,busybox --taskReplicas 2
```
In the above case it creates two tasks, each with 2 replicas, as shown below:
```
arena get --type volcanojob demo12
```
The result is as below.
```
STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-0-0 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-0-1 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-1-0 11.245.101.184
demo12 SUCCEEDED VOLCANOJOB 2m demo12-task-1-1 11.245.101.184
```
##### get volcano job status
```
arena get --type=volcanojob demo
```
When the job is running or has succeeded, you will see the result below.
```
STATUS: RUNNING/SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 45s
NAME STATUS TRAINER AGE INSTANCE NODE
demo SUCCEEDED VOLCANOJOB 59s demo-task-0-0 11.245.101.184
demo RUNNING VOLCANOJOB 59s demo-task-1-0 11.245.101.184
demo SUCCEEDED VOLCANOJOB 59s demo-task-2-0 11.245.101.184
```
##### list arena jobs
```
arena list
```
We can observe the data below:
```
NAME STATUS TRAINER AGE NODE
demo RUNNING VOLCANOJOB 2m 11.245.101.184
```
##### delete volcano job
```
arena delete --type=volcanojob demo
```
You will see that the volcano job has been deleted.
```
job.batch.volcano.sh "demo" deleted
configmap "demo-volcanojob" deleted
INFO[0000] The Job demo has been deleted successfully
```
Congratulations! You've run the batch/distributed volcano job with `arena` successfully.


@ -1,169 +0,0 @@
# Arena supports Priority and Preemption for MPIJob
## prerequisites
- k8s > 1.11
1. Create a `PriorityClass` with the yaml below:
```yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
```
Save the template above in a file named `pc.yaml`, and create the `PriorityClass`:
```
kubectl create -f pc.yaml
```
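You can check that both classes exist:
```
kubectl get priorityclass critical medium
```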
2. There is only 1 GPU available in the Kubernetes cluster:
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
192.168.0.20 192.168.0.20 master 0 0
192.168.0.21 192.168.0.21 master 0 0
192.168.0.22 192.168.0.22 master 0 0
192.168.0.23 192.168.0.23 <none> 1 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
```
3. Run the MPI training job with `medium` priority. The following command is an example:
```
# arena submit mpi \
--name=medium \
--priority=medium \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status
```
4. Get the details of the specific job:
```
# arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s
NAME STATUS TRAINER AGE INSTANCE NODE
medium RUNNING MPIJOB 58s medium-launcher-sz5xj 192.168.0.23
medium RUNNING MPIJOB 58s medium-worker-0 192.168.0.23
```
5. The only GPU is used by the MPI training job `medium`:
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default medium-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 1/1 (100%)
```
6. Run the MPI training job with `critical` priority:
```
# arena submit mpi \
--name=critical \
--priority=critical \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
```
7. Check the MPI training job `medium` and find that it has been preempted by `critical-worker-0`:
```
# kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN TYPE REASON OBJECT MESSAGE
15m Normal Scheduled pod/medium-worker-0 Successfully assigned default/medium-worker-0 to 192.168.0.23
14m Normal Pulled pod/medium-worker-0 Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m Normal Created pod/medium-worker-0 Created container mpi
14m Normal Started pod/medium-worker-0 Started container mpi
2m32s Normal Preempted pod/medium-worker-0 by default/critical-worker-0 on node 192.168.0.23
2m32s Normal Killing pod/medium-worker-0 Stopping container mpi
```
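To confirm which priority class each pod carries, you can inspect the pod spec (the pod name is taken from this example):
```
kubectl get pod critical-worker-0 -o jsonpath='{.spec.priorityClassName}{"\n"}'
```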
8. Check the details of the MPI training job `medium`; it has turned to FAILED:
```
# arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m
NAME STATUS TRAINER AGE INSTANCE NODE
medium FAILED MPIJOB 20m medium-launcher-sz5xj 192.168.0.23
```
9. Check the details of the MPI training job `critical`; it is running:
```
# arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m
NAME STATUS TRAINER AGE INSTANCE NODE
critical RUNNING MPIJOB 10m critical-launcher-mfffs 192.168.0.23
critical RUNNING MPIJOB 10m critical-worker-0 192.168.0.23
```
10. We can see that the only GPU is used by the MPI training job `critical`:
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default critical-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
```
Congratulations! You've run jobs with priority and preemption using `arena` successfully.


@ -1,160 +0,0 @@
Arena supports assigning jobs to particular k8s nodes (currently only MPI jobs and TF jobs are supported).
Here are some usage examples.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Give labels to some nodes. For example, give the label "gpu_node=ok" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the label "ssd_node=ok" to node "cn-beijing.192.168.3.230":
```
# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
node/cn-beijing.192.168.3.228 labeled
# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
node/cn-beijing.192.168.3.229 labeled
# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
node/cn-beijing.192.168.3.230 labeled
```
## for MPI job
1. When submitting a job, you can choose the nodes it runs on with the "--selector" option:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--selector gpu_node=ok \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 21s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 21s mpi-dist-launcher-7jn4q 192.168.3.229
mpi-dist RUNNING MPIJOB 21s mpi-dist-worker-0 192.168.3.229
Your tensorboard will be available on:
http://192.168.3.225:31611
```
The job instances are running on node cn-beijing.192.168.3.229 (IP 192.168.3.229).
3. You can use "--selector" multiple times. For example, you can use "--selector gpu_node=ok --selector ssd_node=ok" in an arena submit command, which means that the job should run on nodes that carry both the label "gpu_node=ok" and the label "ssd_node=ok", as sketched below.
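A sketch of such a command (the job name and training command are hypothetical; note that in the cluster labeled above no node carries both labels, so this particular job would stay pending, and it only illustrates the syntax):
```
arena submit mpi --name=mpi-dist-both \
  --gpus=1 \
  --workers=1 \
  --selector gpu_node=ok \
  --selector ssd_node=ok \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
  "mpirun tail -f /dev/null"
```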
## for tf job
1. Because there are four roles ("PS", "Worker", "Evaluator", "Chief") in a tf job, you can use "--selector" to assign nodes, and it takes effect for all roles. For example:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--selector ssd_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Use the following command to check the job status:
```
# arena get tf
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 24s tf-ps-0 192.168.3.230
tf PENDING TFJOB 24s tf-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:31867
```
The job instances (including "PS" and "Worker") are running on cn-beijing.192.168.3.230 (IP 192.168.3.230, label "ssd_node=ok").
2. You can also assign nodes per role. For example, if you want to run the "PS" role on nodes with the label ssd_node=ok and the "Worker" role on nodes with the label gpu_node=ok, you can use the options "--ps-selector" and "--worker-selector":
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--worker-selector gpu_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Then check the job's status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 23s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 23s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:30162
```
the "PS" job is running on cn-beijing.192.168.3.230(ip is 192.168.3.230,label is "ssd_node=ok") and the "Worker" job is running on cn-beijing.192.168.3.228(ip is 192.168.3.228,label is "gpu_node=ok")
3. If you use "--selector" in an "arena submit tf" command and also use "--ps-selector" (or "--worker-selector", "--evaluator-selector", "--chief-selector"), the role-specific value overrides the value of "--selector". For example:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--selector gpu_node=ok \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
"PS" job will be running on nodes whose label is "ssd_node=ok",other jobs will be running on nodes whose label is "gpu_node=ok",now verify our conclusions,use follow command to check job status.
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 39s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 39s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:32105
```
As you can see, the "PS" instance is running on a node with the label "ssd_node=ok", and the other instances are running on nodes with the label "gpu_node=ok".


@ -1,85 +0,0 @@
Arena supports submitting a job that tolerates k8s node taints (currently only MPI jobs and TF jobs are supported).
Here are some usage examples.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Add taints to some k8s nodes. For example, add the taint "gpu_node=invalid:NoSchedule" to nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229", and the taint "ssd_node=invalid:NoSchedule" to node "cn-beijing.192.168.3.230"; now no k8s pods can be scheduled to these nodes:
```
# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.228 tainted
# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.229 tainted
# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
node/cn-beijing.192.168.3.230 tainted
```
3. When submitting a job, you can tolerate tainted nodes with the "--toleration" option:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.230
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
The job instances are running on node cn-beijing.192.168.3.230 (IP 192.168.3.230, taint "ssd_node=invalid").
4. You can use "--toleration" multiple times. For example, you can use "--toleration gpu_node --toleration ssd_node" in an arena submit command, which means that the job tolerates nodes carrying the taint "gpu_node=invalid" as well as nodes carrying the taint "ssd_node=invalid".
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--toleration gpu_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job status:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.229
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
5. You can use "--toleration all" to tolerate all node taints, as sketched below.
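A sketch of such a command (the job name and training command are hypothetical):
```
arena submit mpi --name=mpi-dist-all \
  --gpus=1 \
  --workers=1 \
  --toleration all \
  --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
  "mpirun tail -f /dev/null"
```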

Binary file not shown (image, 183 KiB).

Binary file not shown (image, 290 KiB).


@ -1,80 +0,0 @@
# Serving Trained Model with arena
You can use arena to deploy your trained model as RESTful APIs. To illustrate the usage, we use the sample project [fast-style-transfer](https://github.com/floydhub/fast-style-transfer). To save time, we use its trained model and add the model to the docker image.
### 1. Serve mode
We use the app.py script in the project to start the RESTful server. You can use arena to deploy the trained model:
```
# arena serve custom \
--name=fast-style-transfer \
--gpus=1 \
--version=alpha \
--replicas=1 \
--restful-port=5000 \
--image=happy365/fast-style-transfer:latest \
"python app.py"
```
Check the status of the custom serving job:
```
# arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
fast-style-transfer CUSTOM alpha 1 0 172.21.8.94 grpc:8001,restful:5000
```
Because the docker image is very large, pulling it takes some time; we can use kubectl to check the pod status:
```
# kubectl get po
NAME READY STATUS RESTARTS AGE
fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj 0/1 ContainerCreating 0 6m44s
```
### 2. Access the service
We can use a client to access the service. Run the following command to create a client:
```
# kubectl run sample-client \
--generator=run-pod/v1 \
--image=happy365/arena-serve-custem-sample-client:latest \
--command -- \
/bin/sleep infinity
```
Then we can query the status of the sample-client:
```
# kubectl get po sample-client
NAME READY STATUS RESTARTS AGE
sample-client 1/1 Running 0 87s
```
We need the service name, which is a combination of the job name and the version (the sample job name is fast-style-transfer and the version is alpha):
```
# kubectl get svc fast-style-transfer-alpha
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fast-style-transfer-alpha ClusterIP 172.21.1.114 <none> 5000/TCP 31m
```
Now we can use the "kubectl exec" command to log in to the container:
```
# kubectl exec -ti sample-client /bin/sh
#
```
then we use "curl" command to access the custom serving job:
```
# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
```
The input is an image named "beijing.jpg" ![beijing.jpg](15-custom-serving-sample-beijing.jpg), stored in "/root/input"; the output is stored in "/root/output". You can use the "kubectl cp" command to copy the output image from the container to the host:
```
# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
```
Now you can view the image at ~/beijing_out.jpg; here is "beijing_out.jpg" ![beijing_out.jpg](15-custom-serving-sample-beijing_out.jpg)


@ -1,73 +0,0 @@
# Assign configuration files for jobs
You can pass configuration files to containers when submitting jobs.
This feature only supports the following job types:
* tfjob
* mpijob
## 1. Usage
You can use `--config-file <host_path_file>:<container_path_file>` to assign a configuration file to a container. There are some rules:
* If <host_path_file> is assigned and <container_path_file> is not, <container_path_file> is taken to be the same as <host_path_file>.
* <container_path_file> must be a file with an absolute path.
* You can use `--config-file` more than once in a command, e.g. "--config-file /tmp/test1.conf:/etc/config/test1.conf --config-file /tmp/test2.conf:/etc/config/test2.conf".
## 2. Sample
First, we create a test file named "test-config.json" whose path is "/tmp/test-config.json". We want to push this file to the containers of a tfjob (or mpijob), with the in-container path "/etc/config/config.json".
```
# cat /tmp/test-config.json
{
"key": "job-config"
}
```
Second, use the following command to create the tfjob:
```
# arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--workerImage=cheyang/tf-mnist-distributed:gpu \
--psImage=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--config-file /tmp/test-config.json:/etc/config/config.json \
"python /app/main.py"
```
Wait a minute, then get the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 16s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 16s tf-ps-0 192.168.7.18
tf RUNNING TFJOB 16s tf-worker-0 192.168.7.16
Your tensorboard will be available on:
http://192.168.7.10:31825
```
Use kubectl to check whether the file is in the containers:
```
# kubectl exec -ti tf-ps-0 -- cat /etc/config/config.json
{
"key": "job-config"
}
# kubectl exec -ti tf-worker-0 -- cat /etc/config/config.json
{
"key": "job-config"
}
```
As you can see, the file is in the containers.


@ -1,95 +0,0 @@
This example shows how to use `Arena` to submit a pytorch stand-alone job. The example downloads the source code from a git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a pytorch training job; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
# Single gpu card
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-local-git-pytorchjob created
configmap/pytorch-local-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-local-git created
INFO[0000] The Job pytorch-local-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-local-git --type pytorchjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it by using `--workingDir`.
> If you are using a private git repo, you can use the following command:
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-local-git \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
```
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-local-git SUCCEEDED PYTORCHJOB 21h N/A
```
4. Get the details of this job.
```
➜ arena get pytorch-local-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 35s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-local-git SUCCEEDED PYTORCHJOB 23h pytorch-local-git-master-0 172.16.0.210
```
5. Check logs.
```
➜ arena logs pytorch-local-git
WORLD_SIZE: 1, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
...
```


@ -1,131 +0,0 @@
This example shows how to use `Arena` to submit a pytorch distributed job. The example downloads the source code from a git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a pytorch distributed training job with 2 nodes and one gpu card each; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-git \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-dist-git-pytorchjob created
configmap/pytorch-dist-git-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-dist-git created
INFO[0000] The Job pytorch-dist-git has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-git --type pytorchjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it by using `--workingDir`.
> `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in pytorch-operator). The default value is 1, in which case it can be omitted and the job runs as a stand-alone job.
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h N/A
```
4. Get the details of this job. There are 2 instances of this job, and the instance `pytorch-dist-git-master-0` is rank0. Arena simplifies the process of submitting distributed jobs with `PyTorch-Operator`.
A `Service` is created for the `master` instance so that other nodes can reach it through the `Service` name in `PyTorch-Operator`, and the following environment variables are injected into each instance: `MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, and `RANK`. These are what PyTorch needs to initialize the distributed process group (`dist.init_process_group`). `MASTER_PORT` is assigned automatically; `MASTER_ADDR` is "localhost" in the `master` instance and the `Service` name of the `master` in the other instances; `WORLD_SIZE` is the total number of instances; and `RANK` is the serial number of the current node: 0 for the `master`, and for a `Worker` instance the index in its instance name suffix plus one. For example, in the following example, the `RANK` of instance `pytorch-dist-git-worker-0` is `0 + 1 = 1`.
In Arena, the value of the parameter `--workers` includes the `master` instance, because the `master` also takes part in training.
```
➜ arena get pytorch-dist-git
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h pytorch-dist-git-master-0 172.16.0.210
pytorch-dist-git SUCCEEDED PYTORCHJOB 23h pytorch-dist-git-worker-0 172.16.0.210
```
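While the instances are still running, you can look at the injected variables directly; a sketch using the worker pod name from this example:
```
kubectl exec pytorch-dist-git-worker-0 -- env | grep -E '^(MASTER_ADDR|MASTER_PORT|WORLD_SIZE|RANK)='
```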
5. Check logs.
```
➜ arena logs pytorch-dist-git
WORLD_SIZE: 2, CURRENT_RANK: 0
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
Train Epoch: 1 [3200/60000 (5%)] loss=1.4142
Train Epoch: 1 [3840/60000 (6%)] loss=1.0009
...
```
> For a distributed job with multiple instances, the default output is the log of rank0 (the `master` instance). If you want to view the log of a specific instance, select it with `-i <instance-name>`, for example:
```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0
WORLD_SIZE: 2, CURRENT_RANK: 1
args: Namespace(backend='gloo', batch_size=64, data='/root/code/mnist-pytorch', dir='/root/code/mnist-pytorch/logs', epochs=1, log_interval=10, lr=0.01, momentum=0.5, no_cuda=False, save_model=False, seed=1, test_batch_size=1000)
Using CUDA
Using distributed PyTorch with gloo backend
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Train Epoch: 1 [0/60000 (0%)] loss=2.3000
Train Epoch: 1 [640/60000 (1%)] loss=2.2135
Train Epoch: 1 [1280/60000 (2%)] loss=2.1705
Train Epoch: 1 [1920/60000 (3%)] loss=2.0767
Train Epoch: 1 [2560/60000 (4%)] loss=1.8681
Train Epoch: 1 [3200/60000 (5%)] loss=1.4142
```
> In addition, users can view the last few lines of the logs through the parameter `-t <lines>`, for example:
```
➜ arena logs pytorch-dist-git -i pytorch-dist-git-worker-0 -t 5
Train Epoch: 1 [58880/60000 (98%)] loss=0.2048
Train Epoch: 1 [59520/60000 (99%)] loss=0.0646
accuracy=0.9661
```
> For more parameters, see `arena logs --help`.


@ -1,75 +0,0 @@
This example shows how to use `Arena` to submit a pytorch distributed job and visualize it with `Tensorboard`. The sample downloads the source code from a git URL.
1. The first step is to check the available resources.
```
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2. Submit a pytorch distributed training job with 2 nodes and one gpu card each; this example downloads the source code from [Alibaba Cloud code](https://code.aliyun.com/370272561/mnist-pytorch.git).
```
➜ arena --loglevel info submit pytorch \
--name=pytorch-dist-tensorboard \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--tensorboard \
--logdir=/root/logs \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-dist-tensorboard-pytorchjob created
configmap/pytorch-dist-tensorboard-pytorchjob labeled
service/pytorch-dist-tensorboard-tensorboard created
deployment.apps/pytorch-dist-tensorboard-tensorboard created
pytorchjob.kubeflow.org/pytorch-dist-tensorboard created
INFO[0000] The Job pytorch-dist-tensorboard has been submitted successfully
INFO[0000] You can run `arena get pytorch-dist-tensorboard --type pytorchjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it by using `--workingDir`.
> `workers` is the total number of nodes participating in the training (a positive integer greater than or equal to 1), including the rank0 node used to establish communication (corresponding to the `master` node in pytorch-operator). The default value is 1, in which case it can be omitted and the job runs as a stand-alone job.
> `logdir` indicates where tensorboard reads the event logs of PyTorch.
3. List all the jobs.
```
➜ arena list
NAME STATUS TRAINER AGE NODE
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h N/A
```
4. Get the details of the this job.
```
➜ arena get pytorch-dist-tensorboard
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 15m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-master-0 172.16.0.210
pytorch-dist-tensorboard SUCCEEDED PYTORCHJOB 22h pytorch-dist-tensorboard-worker-0 172.16.0.210
Your tensorboard will be available on:
http://172.16.0.205:30583
```
> Notice: you can access the tensorboard by using `172.16.0.205:30583`. You can consider `sshuttle` if you can't access the tensorboard directly from your laptop. For example:
```
# you can install sshuttle==0.74 on your Mac with Python 2.7
➜ pip install sshuttle==0.74
# 0/0 -> 0.0.0.0/0
➜ sshuttle -r root@39.104.17.205 0/0
```
![](19-pytorchjob-tensorboard.png)

Binary file not shown (image, 879 KiB).

Binary file not shown (image, 413 KiB).


@ -1,109 +0,0 @@
Here is an example of how you can use `Arena` for machine learning training. It downloads the source code from a git URL and uses Tensorboard to visualize the TensorFlow computation graph and plot quantitative metrics.
1. The first step is to check the available resources:
```
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```
There are 3 available nodes with GPU for running training jobs.
2\. Now we can submit a training job with the `arena` CLI; it will download the source code from github:
```
# arena submit tf \
--name=tf-tensorboard \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--env=TEST_TMPDIR=code/tensorflow-sample-code/ \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--tensorboard \
--logdir=/training_logs \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000"
configmap/tf-tensorboard-tfjob created
configmap/tf-tensorboard-tfjob labeled
service/tf-tensorboard-tensorboard created
deployment.extensions/tf-tensorboard-tensorboard created
tfjob.kubeflow.org/tf-tensorboard created
INFO[0001] The Job tf-tensorboard has been submitted successfully
INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status
```
> The source code will be downloaded and extracted to the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`.
> `logdir` indicates the directory from which tensorboard reads the TensorFlow event logs.
3\. List all the jobs
```
# arena list
NAME STATUS TRAINER AGE NODE
tf-tensorboard RUNNING TFJOB 0s 192.168.1.119
```
4\. Check the resource usage of the job
```
# arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-tensorboard RUNNING TFJOB 26s 192.168.1.119 1 1
Total Allocated GPUs of Training Job:
0
Total Requested GPUs of Training Job:
1
```
5\. Check the resource usage of the cluster
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```
6\. Get the details of the specific job
```
# arena get tf-tensorboard
NAME STATUS TRAINER AGE INSTANCE NODE
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-586fcf4d6f-vtlxv 192.168.1.119
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-worker-0 192.168.1.119
Your tensorboard will be available on:
192.168.1.117:30670
```
> Notice: you can access the tensorboard by using `192.168.1.117:30670`. You can consider `sshuttle` if you can't access the tensorboard directly from your laptop. For example: `sshuttle -r root@47.89.59.51 192.168.0.0/16`
![](2-tensorboard.jpg)
Congratulations! You've run the training job with `arena` successfully, and you can also check the tensorboard easily.

View File

@ -1,123 +0,0 @@
This example shows how to use `Arena` to submit a pytorch distributed job and mount an NFS data volume. The sample downloads the source code from a git URL.
1. Set up an NFS server (refer to: https://www.cnblogs.com/weifeng1463/p/10037803.html).
```shell
# install nfs server
➜ yum install nfs-utils -y
# Create local directory of NFS server
➜ mkdir -p /root/nfs/data
# Configure nfs server
➜ cat /etc/exports
/root/nfs/data *(rw,no_root_squash)
# Start nfs server
➜ systemctl start nfs; systemctl start rpcbind
➜ systemctl enable nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
```
2. Download training data to shared directory of NFS.
```shell
# Get information of NFS server by showmount, 172.16.0.200 is the host ip of NFS server
➜ showmount -e 172.16.0.200
Export list for 172.16.0.200:
/root/nfs/data *
# Enter shared directory
➜ cd /root/nfs/data
# Prepare training data to shared directory
➜ pwd
/root/nfs/data
# MNIST -> That's the training data we need
➜ ll
total 8.0K
drwxr-xr-x 4 502 games 4.0K Jun 17 16:05 data
drwxr-xr-x 4 root root 4.0K Jun 23 15:17 MNIST
```
3. Create PV.
```shell
# Note: Typesetting may cause yaml indentation problems
➜ cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pytorchdata
  labels:
    pytorchdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 172.16.0.200
    path: "/root/nfs/data"
➜ kubectl create -f nfs-pv.yaml
persistentvolume/pytorchdata created
➜ kubectl get pv | grep pytorchdata
pytorchdata 10Gi RWX Retain Bound default/pytorchdata 7m38s
```
4. Create PVC.
```shell
➜ cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pytorchdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      pytorchdata: nas-mnist
➜ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim/pytorchdata created
➜ kubectl get pvc | grep pytorchdata
pytorchdata Bound pytorchdata 10Gi RWX 2m3s
```
5. Check the data volume.
```shell
➜ arena data list
NAME ACCESSMODE DESCRIPTION OWNER AGE
pytorchdata ReadWriteMany this is the mnist demo Tom 2m
```
6. Submit the pytorch job, mounting the distributed storage volume with `--data pvc_name:container_path`.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-data \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--data=pytorchdata:/mnist_data \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --data /mnist_data/data"
configmap/pytorch-data-pytorchjob created
configmap/pytorch-data-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-data created
INFO[0000] The Job pytorch-data has been submitted successfully
INFO[0000] You can run `arena get pytorch-data --type pytorchjob` to check the job status
```
7. Get the status of volume `pytorchdata` in one of the instances with `kubectl describe`.
```shell
# Get the details of the this job
➜ arena get pytorch-data
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 56s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-master-0 172.16.0.210
pytorch-data SUCCEEDED PYTORCHJOB 1m pytorch-data-worker-0 172.16.0.210
# Get status of volume `pytorchdata` from `pytorch-data-master-0`
➜ kubectl describe pod pytorch-data-master-0 | grep pytorchdata -C 3
```
![](20-pytorchjob-distributed-data.png)

View File

@ -1,54 +0,0 @@
## Arena supports assigning pytorch jobs to particular k8s nodes
1. Get k8s cluster information:
```shell
➜ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-huhehaote.172.16.0.205 Ready master 4h19m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206 Ready master 4h18m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207 Ready master 4h17m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208 Ready <none> 4h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209 Ready <none> 4h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210 Ready <none> 4h13m v1.16.9-aliyun.1
```
2. Give labels to nodes, for example:
```shell
# 172.16.0.208 label gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node=ok
node/cn-huhehaote.172.16.0.208 labeled
# 172.16.0.209 label gpu_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.209 gpu_node=ok
node/cn-huhehaote.172.16.0.209 labeled
# 172.16.0.210 label ssd_node=ok
➜ kubectl label nodes cn-huhehaote.172.16.0.210 ssd_node=ok
node/cn-huhehaote.172.16.0.210 labeled
```
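To verify that the labels took effect, you can list the nodes by label (a quick check with plain kubectl):
```shell
# list only the nodes carrying the gpu_node=ok label
➜ kubectl get nodes -l gpu_node=ok
```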
3. When submitting a pytorch job, you can use `--selector` to decide which nodes the job runs on:
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-selector \
--gpus=1 \
--workers=2 \
--selector gpu_node=ok \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-selector-pytorchjob created
configmap/pytorch-selector-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-selector created
INFO[0000] The Job pytorch-selector has been submitted successfully
INFO[0000] You can run `arena get pytorch-selector --type pytorchjob` to check the job status
```
4. Get the job details; you can see that the job runs only on the node with IP 172.16.0.209, which carries the label `gpu_node=ok`.
```shell
➜ arena get pytorch-selector
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 14s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-master-0 172.16.0.209
pytorch-selector PENDING PYTORCHJOB 14s pytorch-selector-worker-0 172.16.0.209
```
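If you later want to remove this placement constraint, deleting the label is enough; a sketch:
```shell
# remove the gpu_node label from a node (note the trailing minus)
➜ kubectl label nodes cn-huhehaote.172.16.0.208 gpu_node-
```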

View File

@ -1,96 +0,0 @@
## Arena supports submitting a pytorch job that tolerates k8s node taints
1. Get k8s cluster information:
```shell
➜ kubectl get node
NAME STATUS ROLES AGE VERSION
cn-huhehaote.172.16.0.205 Ready master 5h13m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.206 Ready master 5h12m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.207 Ready master 5h11m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.208 Ready <none> 5h7m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.209 Ready <none> 5h7m v1.16.9-aliyun.1
cn-huhehaote.172.16.0.210 Ready <none> 5h7m v1.16.9-aliyun.1
```
2. Add some taints to k8s nodes, for example:
```shell
# taint --> gpu_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.208 tainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.209 tainted
# taint --> ssd_node
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node=invalid:NoSchedule
node/cn-huhehaote.172.16.0.210 tainted
```
3. If we tainted the wrong nodes or want to restore a node's schedulability, we can remove the taints with the following commands:
```shell
➜ kubectl taint nodes cn-huhehaote.172.16.0.208 gpu_node-
node/cn-huhehaote.172.16.0.208 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.209 gpu_node-
node/cn-huhehaote.172.16.0.209 untainted
➜ kubectl taint nodes cn-huhehaote.172.16.0.210 ssd_node-
node/cn-huhehaote.172.16.0.210 untainted
```
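You can check the current taints on a node with `kubectl describe`; for example (expected output after the taints from step 2 are applied):
```shell
➜ kubectl describe node cn-huhehaote.172.16.0.208 | grep Taints
Taints:             gpu_node=invalid:NoSchedule
```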
4. When submitting a job, you can tolerate tainted nodes with the `--toleration` option, for example `--toleration gpu_node`. This parameter can be used multiple times with different taint keys.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-toleration \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--tensorboard \
--logdir=/root/logs \
--toleration gpu_node \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo --dir /root/logs"
configmap/pytorch-toleration-pytorchjob created
configmap/pytorch-toleration-pytorchjob labeled
service/pytorch-toleration-tensorboard created
deployment.apps/pytorch-toleration-tensorboard created
pytorchjob.kubeflow.org/pytorch-toleration created
INFO[0000] The Job pytorch-toleration has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration --type pytorchjob` to check the job status
```
5. Get the details of this job.
```shell
➜ arena get pytorch-toleration
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-master-0 172.16.0.209
pytorch-toleration RUNNING PYTORCHJOB 2m pytorch-toleration-worker-0 172.16.0.209
Your tensorboard will be available on:
http://172.16.0.205:32091
```
6. You can use `--toleration all` to tolerate all node taints.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-toleration-all \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--toleration all \
"python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend gloo"
configmap/pytorch-toleration-all-pytorchjob created
configmap/pytorch-toleration-all-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-toleration-all created
INFO[0000] The Job pytorch-toleration-all has been submitted successfully
INFO[0000] You can run `arena get pytorch-toleration-all --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-toleration-all
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 33s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-toleration-all RUNNING PYTORCHJOB 33s pytorch-toleration-all-master-0 172.16.0.210
```

View File

@ -1,49 +0,0 @@
## Assign configuration files for pytorch jobs
You can pass configuration files to containers when submitting jobs.
1. Prepare the configuration file to be mounted on the submitted machine.
```shell
# prepare your config-file
➜ cat /tmp/test-config.json
{
"key": "job-config"
}
```
2. Submit the job, and specify the configuration file to mount with `--config-file`.
```shell
# submit the job with --config-file ${host-config-file}:${container-config-file}
# this parameter can be repeated to mount multiple configuration files
➜ arena --loglevel info submit pytorch \
--name=pytorch-config-file \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--config-file /tmp/test-config.json:/etc/config/config.json \
"python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo"
configmap/pytorch-config-file-pytorchjob created
configmap/pytorch-config-file-pytorchjob labeled
configmap/pytorch-config-file-a9cbad1b8719778 created
pytorchjob.kubeflow.org/pytorch-config-file created
INFO[0000] The Job pytorch-config-file has been submitted successfully
INFO[0000] You can run `arena get pytorch-config-file --type pytorchjob` to check the job status
```
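As the comments above note, `--config-file` can be repeated; a hypothetical sketch mounting a second file (the extra file and its target path are made up for illustration):
```shell
➜ arena --loglevel info submit pytorch \
    --name=pytorch-config-file \
    ... \
    --config-file /tmp/test-config.json:/etc/config/config.json \
    --config-file /tmp/extra-config.json:/etc/config/extra.json \
    "python /root/code/mnist-pytorch/mnist.py --epochs 50 --backend gloo"
```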
3. Get the details of this job.
```shell
➜ arena get pytorch-config-file --type pytorchjob
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 51s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-config-file RUNNING PYTORCHJOB 51s pytorch-config-file-master-0 172.16.0.210
```
4. Use kubectl to check whether the file is in the container:
```
➜ kubectl exec -ti pytorch-config-file-master-0 -- cat /etc/config/config.json
{
"key": "job-config"
}
```

View File

@ -1,130 +0,0 @@
## Arena supports Priority and Preemption for pytorch jobs
1. Create `PriorityClass` objects with the yaml below. There are two priorities defined here: `critical` and `medium`.
```shell
# declarations of critical and medium
➜ cat priorityClass.yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
  name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
  name: medium
value: 1000000
# Create two priority objects: critical and medium
➜ kubectl create -f priorityClass.yaml
priorityclass.scheduling.k8s.io/critical created
priorityclass.scheduling.k8s.io/medium created
```
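You can confirm that both objects exist before submitting jobs:
```shell
➜ kubectl get priorityclass critical medium
# output will look similar to:
NAME       VALUE     GLOBAL-DEFAULT   AGE
critical   1100000   false            10s
medium     1000000   false            10s
```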
2. Check the available resources. There are 3 GPU nodes in total, and each node has 4 GPU cards.
```shell
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 0
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 0
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/12 (0%)
```
3. Submit a `medium`-priority GPU job that uses 3 nodes with 4 cards each, occupying all GPU resources. To verify the preemption effect, we increase the number of training epochs so the job runs long enough to observe.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-medium \
--gpus=4 \
--workers=3 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=medium \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 200"
configmap/pytorch-priority-medium-pytorchjob created
configmap/pytorch-priority-medium-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-medium created
INFO[0000] The Job pytorch-priority-medium has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-medium --type pytorchjob` to check the job status
```
4. Get the details of this job. You can see that it is running.
```shell
➜ arena get pytorch-priority-medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-master-0 172.16.0.208
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-0 172.16.0.210
pytorch-priority-medium RUNNING PYTORCHJOB 3m pytorch-priority-medium-worker-1 172.16.0.209
```
5. Check the GPU usage; all cards are occupied.
```shell
➜ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-huhehaote.172.16.0.205 172.16.0.205 master ready 0 0
cn-huhehaote.172.16.0.206 172.16.0.206 master ready 0 0
cn-huhehaote.172.16.0.207 172.16.0.207 master ready 0 0
cn-huhehaote.172.16.0.208 172.16.0.208 <none> ready 4 4
cn-huhehaote.172.16.0.209 172.16.0.209 <none> ready 4 4
cn-huhehaote.172.16.0.210 172.16.0.210 <none> ready 4 4
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
12/12 (100%)
```
6. Submit a job with priority of `critical` to initiate preemption.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-priority-critical \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--priority=critical \
"python /root/code/mnist-pytorch/mnist.py --backend gloo --epochs 50"
configmap/pytorch-priority-critical-pytorchjob created
configmap/pytorch-priority-critical-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-priority-critical created
INFO[0000] The Job pytorch-priority-critical has been submitted successfully
INFO[0000] You can run `arena get pytorch-priority-critical --type pytorchjob` to check the job status
```
7. Get the details of this job.
```shell
➜ arena get pytorch-priority-critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 22s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-critical RUNNING PYTORCHJOB 22s pytorch-priority-critical-master-0 172.16.0.208
```
8. Check the status of the `medium`-priority job. It has become `FAILED`, and one instance has been deleted due to preemption.
```shell
➜ arena get pytorch-priority-medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-master-0 172.16.0.210
pytorch-priority-medium FAILED PYTORCHJOB 2m pytorch-priority-medium-worker-0 172.16.0.209
```
9. Check the events of `pytorch-priority-medium`; you can see that its `pytorch-priority-medium-worker-1` has been evicted. The reason is that `pytorch-priority-critical-master-0` also requested the resources of this node, and the node had no spare GPU resources, so the low-priority job was preempted by the high-priority job.
```shell
➜ kubectl get events --field-selector involvedObject.name=pytorch-priority-medium-worker-1
```
![](24-pytorchjob-preempted.png)

View File

@ -1,40 +0,0 @@
## Specify the pod clean-up policy for a finished pytorch job
1. Submit a job, and specify `--clean-task-policy` as `All`. After the job finishes (`SUCCEEDED` or `FAILED`), all its instances (pods) will be deleted. The default is `None`, in which case all pods are retained.
```shell
➜ arena --loglevel info submit pytorch \
--name=pytorch-clean-policy \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--clean-task-policy=All \
"python /root/code/mnist-pytorch/mnist.py --backend gloo"
configmap/pytorch-clean-policy-pytorchjob created
configmap/pytorch-clean-policy-pytorchjob labeled
pytorchjob.kubeflow.org/pytorch-clean-policy created
INFO[0000] The Job pytorch-clean-policy has been submitted successfully
INFO[0000] You can run `arena get pytorch-clean-policy --type pytorchjob` to check the job status
```
2. Get the job details. After the job finishes, the instance `pytorch-clean-policy-master-0` is deleted.
```shell
# RUNNING
➜ arena get pytorch-clean-policy
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 18s
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-clean-policy RUNNING PYTORCHJOB 18s pytorch-clean-policy-master-0 172.16.0.209
# FINISHED
➜ arena get pytorch-clean-policy
STATUS: SUCCEEDED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 37s
NAME STATUS TRAINER AGE INSTANCE NODE
```
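You can double-check with kubectl that no pods of the job remain (a quick sketch):
```shell
# after the job finishes with --clean-task-policy=All, this returns nothing
➜ kubectl get pods | grep pytorch-clean-policy
```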

View File

@ -1,168 +0,0 @@
# Submit the training jobs with ImagePullSecrets
You can use a private registry when submitting jobs (including tensorboard images).
Assume the following images are in your private registry.
```shell
# pytorch
registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime
# tf
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu
# mpi
registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5
# tensorboard (--tensorboard-image)
registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel
```
## Contents
* <a href="#create_secret">Create ImagePullSecrets</a>
* <a href="#tfjob">TFJob With Secret</a>
* <a href="#mpijob">MPIJob With Secret</a>
* <a href="#pytorchjob">PyTorchJob With Secret</a>
* <a href="#arenaConfig">Load imagePullSecrets from configuration of Arena</a>
## <a name="create_secret">Create ImagePullSecrets</a>
* Create a [Secret](https://kubernetes.io/docs/concepts/configuration/secret/) with kubectl. In this case, it's [imagePullSecrets](https://kubernetes.io/docs/concepts/containers/images/).
```shell script
kubectl create secret docker-registry [$Reg_Secret] --docker-server=[$Registry] --docker-username=[$Username] --docker-password=[$Password] --docker-email=[$Email]
```
> Note:
> [$Reg_Secret] is the name of the secret, which you can define yourself.
> [$Registry] is your private registry address.
> [$Username] is the username of your private registry.
> [$Password] is the password of your private registry.
> [$Email] is your email address (optional).
For Example:
```shell
kubectl create secret docker-registry \
lumo-secret \
--docker-server=registry.cn-huhehaote.aliyuncs.com \
--docker-username=******@test.aliyunid.com \
--docker-password=******
secret/lumo-secret created
```
You can check that the secret was created.
```shell
# kubectl get secrets | grep lumo-secret
lumo-secret kubernetes.io/dockerconfigjson 1 52s
```
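If you need to inspect what the secret contains, you can decode it (a sketch using the standard dockerconfigjson layout):
```shell
# print the registry credentials stored in the secret
kubectl get secret lumo-secret -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode
```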
## <a name="tfjob">TFJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit tf job.
```shell
arena submit tf \
--name=tf-git-with-secret \
--working-dir=/root \
--gpus=1 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--logdir=/mnist_data/tf_data/logs \
--image-pull-secrets=lumo-secret \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --log_dir /mnist_data/tf_data/logs --data_dir /mnist_data/tf_data/"
```
> Note:
> If you have many `imagePullSecrets` to use, you can use `--image-pull-secrets` multiple times.
```shell
arena submit tf \
--name=tf-git-with-secret \
... \
--image-pull-secrets=lumo-secret \
--image-pull-secrets=king-secret \
--image-pull-secrets=test-secret
...
```
2. Get the details of the job.
```shell
# arena get tf-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 17s
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git-with-secret RUNNING TFJOB 17s tf-git-with-secret-chief-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30080
```
## <a name="mpijob">MPIJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit mpi job.
```shell
arena submit mpi \
--name=mpi-dist-with-secret \
--gpus=1 \
--workers=2 \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--sync-mode=git \
--sync-source=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--image-pull-secrets=lumo-secret \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Get the details of the job.
```shell
# arena get mpi-dist-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 9m
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-launcher-v8sgt 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-0 172.16.0.201
mpi-dist-with-secret RUNNING MPIJOB 9m mpi-dist-with-secret-worker-1 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:30450
```
## <a name="pytorchjob">PyTorchJob With Secret</a>
Submit the job by using `--image-pull-secrets` to specify the imagePullSecrets.
1. Submit pytorch job.
```shell
arena submit pytorch \
--name=pytorch-git-with-secret \
--gpus=1 \
--working-dir=/root \
--image=registry.cn-huhehaote.aliyuncs.com/lumo/pytorch-with-tensorboard-secret:1.5.1-cuda10.1-cudnn7-runtime \
--sync-mode=git \
--sync-source=https://code.aliyun.com/370272561/mnist-pytorch.git \
--data=training-data:/mnist_data \
--tensorboard \
--tensorboard-image=registry.cn-huhehaote.aliyuncs.com/lumo/tensorflow:1.12.0-devel \
--logdir=/mnist_data/pytorch_data/logs \
--image-pull-secrets=lumo-secret \
"python /root/code/mnist-pytorch/mnist.py --epochs 10 --backend nccl --dir /mnist_data/pytorch_data/logs --data /mnist_data/pytorch_data/"
```
2. Get the details of the job.
```shell
# arena get pytorch-git-with-secret
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
pytorch-git-with-secret RUNNING PYTORCHJOB 2m pytorch-git-with-secret-master-0 172.16.0.202
Your tensorboard will be available on:
http://172.16.0.198:31155
```
## <a name="arenaConfig">Load imagePullSecrets from configuration of Arena</a>
If you don't want to pass `--image-pull-secrets` every time you submit a job, you can set it in the Arena configuration instead.
Open the file `~/.arena/config` (create it if it doesn't exist) and add the following line:
```shell
imagePullSecrets=lumo-secret,king-secret
```
> Note:
> `--image-pull-secrets` overrides the setting in `~/.arena/config`.

View File

@ -1,62 +0,0 @@
This guide walks through the steps to deploy and serve a custom model with kfserving
1. Setup
Follow the KFServing [guide](https://github.com/kubeflow/kfserving#install-kfserving) to install KFServing. For the prerequisites, you should ensure 8 GB of memory and 4 CPU cores are available in your environment.
2. Submit your serving job to KFServing
```shell script
arena serve kfserving --name=max-object-detector --port=5000 --image=codait/max-object-detector --model-type=custom
configmap/max-object-detector-202008221942-kfserving created
configmap/max-object-detector-202008221942-kfserving labeled
inferenceservice.serving.kubeflow.org/max-object-detector-202008221942 created
```
3. List the serving job you just submitted
```shell script
arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
max-object-detector KFSERVING 202008221942 1 1 10.97.52.65 http:80
```
4. Test the model service
##### Determine the ingress IP and ports
The first step is to [determine the ingress IP](https://github.com/kubeflow/kfserving/blob/master/README.md#determine-the-ingress-ip-and-ports) and ports, and set INGRESS_HOST and INGRESS_PORT.
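A minimal sketch for a cluster where KFServing sits behind an Istio ingress gateway in the `istio-system` namespace (adjust to your environment):
```shell script
# assumes the ingress gateway is exposed via a LoadBalancer service
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```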
This example uses the [codait/max-object-detector](https://github.com/IBM/MAX-Object-Detector) image. The Max Object Detector api server expects a POST request to the /model/predict endpoint that includes an image multipart/form-data and an optional threshold query string.
```shell script
MODEL_NAME=max-object-detector-202008221942
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -o jsonpath='{.status.url}' | cut -d "/" -f 3)
INGRESS_HOST=localhost
INGRESS_PORT=80
curl -v -F "image=@27-kfserving-custom.jpg" http://${INGRESS_HOST}:${INGRESS_PORT}/model/predict -H "Host: ${SERVICE_HOSTNAME}"
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 80 (#0)
> POST /model/predict HTTP/1.1
> Host: max-object-detector-202008221942.default.example.com
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 125769
> Content-Type: multipart/form-data; boundary=------------------------56b67bc60fc7bdc7
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 380
< content-type: application/json
< date: Sun, 23 Aug 2020 03:27:14 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 3566
<
{"status": "ok", "predictions": [{"label_id": "1", "label": "person", "probability": 0.9440352320671082, "detection_box": [0.12420991063117981, 0.12507185339927673, 0.8423266410827637, 0.5974075794219971]}, {"label_id": "18", "label": "dog", "probability": 0.8645510673522949, "detection_box": [0.10447663068771362, 0.17799144983291626, 0.8422801494598389, 0.7320016026496887]}]}
* Connection #0 to host localhost left intact
* Closing connection 0
```
5. Delete the serving job
```shell script
arena serve delete max-object-detector --version=202008221942
inferenceservice.serving.kubeflow.org "max-object-detector-202008221942" deleted
configmap "max-object-detector-202008221942-kfserving" deleted
INFO[0001] The Serving job max-object-detector with version 202008221942 has been deleted successfully
```

View File

@ -1,175 +0,0 @@
This guide walks through the steps to submit an elastic training job with horovod.
1. Build an image for the training environment.
You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
In addition, you can also build your own image with the help of this document [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).
2. Submit an elastic training job. The example code is from [tensorflow2_mnist_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/tensorflow2_mnist_elastic.py)
```shell script
arena submit etjob \
--name=elastic-training \
--gpus=1 \
--workers=3 \
--max-workers=9 \
--min-workers=1 \
--image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
--working-dir=/examples \
"horovodrun
-np \$((\${workers}*\${gpus}))
--min-np \$((\${minWorkers}*\${gpus}))
--max-np \$((\${maxWorkers}*\${gpus}))
--host-discovery-script /usr/local/bin/discover_hosts.sh
python /examples/elastic/tensorflow2_mnist_elastic.py
"
```
Output:
```
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
```
3. List your job.
```shell script
arena list
```
Output:
```
NAME STATUS TRAINER AGE NODE
elastic-training RUNNING ETJOB 52s 192.168.0.116
```
4. Get your job details.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 1m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 1m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 1m elastic-training-worker-2 192.168.0.116
```
5. Check logs
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2170 Loss: 0.021992
Tue Sep 8 08:32:50 2020[0]<stdout>:Step #2180 Loss: 0.000902
Tue Sep 8 08:32:50 2020[1]<stdout>:Step #2180 Loss: 0.023190
Tue Sep 8 08:32:50 2020[2]<stdout>:Step #2180 Loss: 0.013149
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2190 Loss: 0.029536
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2190 Loss: 0.017537
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2190 Loss: 0.018273
Tue Sep 8 08:32:51 2020[2]<stdout>:Step #2200 Loss: 0.038399
Tue Sep 8 08:32:51 2020[0]<stdout>:Step #2200 Loss: 0.007017
Tue Sep 8 08:32:51 2020[1]<stdout>:Step #2200 Loss: 0.017495
```
6. Scale out your job. This will add one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599548177-scaleout created
configmap/elastic-training-1599548177-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1599548177 created
INFO[0000] The scaleout job elastic-training-1599548177 has been submitted successfully
```
7. Get your job details. We can see the new worker (elastic-training-worker-3) is "RUNNING".
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 2m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 2m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 2m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-2 192.168.0.116
elastic-training RUNNING ETJOB 2m elastic-training-worker-3 192.168.0.117
```
8. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3140 Loss: 0.014412
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3140 Loss: 0.004425
Tue Sep 8 08:33:33 2020[3]<stdout>:Step #3150 Loss: 0.000513
Tue Sep 8 08:33:33 2020[2]<stdout>:Step #3150 Loss: 0.062282
Tue Sep 8 08:33:33 2020[1]<stdout>:Step #3150 Loss: 0.020650
Tue Sep 8 08:33:33 2020[0]<stdout>:Step #3150 Loss: 0.008056
Tue Sep 8 08:33:34 2020[3]<stdout>:Step #3160 Loss: 0.002170
Tue Sep 8 08:33:34 2020[2]<stdout>:Step #3160 Loss: 0.009676
Tue Sep 8 08:33:34 2020[1]<stdout>:Step #3160 Loss: 0.051425
Tue Sep 8 08:33:34 2020[0]<stdout>:Step #3160 Loss: 0.023769
```
9. Scale in your job. This will remove one worker from the job.
```shell script
arena scalein etjob --name="elastic-training" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-1599554041-scalein created
configmap/elastic-training-1599554041-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1599554041 created
INFO[0000] The scalein job elastic-training-1599554041 has been submitted successfully
```
10. Get your job details. We can see that `elastic-training-worker-3` has been removed.
```shell script
arena get elastic-training
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training RUNNING ETJOB 3m elastic-training-launcher 192.168.0.116
elastic-training RUNNING ETJOB 3m elastic-training-worker-0 192.168.0.114
elastic-training RUNNING ETJOB 3m elastic-training-worker-1 192.168.0.116
elastic-training RUNNING ETJOB 3m elastic-training-worker-2 192.168.0.116
```
11. Check logs.
```shell script
arena logs elastic-training --tail 10
```
Output:
```
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5210 Loss: 0.005627
Tue Sep 8 08:34:43 2020[2]<stdout>:Step #5220 Loss: 0.002142
Tue Sep 8 08:34:43 2020[1]<stdout>:Step #5220 Loss: 0.002978
Tue Sep 8 08:34:43 2020[0]<stdout>:Step #5220 Loss: 0.011404
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5230 Loss: 0.000689
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5230 Loss: 0.024597
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5230 Loss: 0.040936
Tue Sep 8 08:34:44 2020[0]<stdout>:Step #5240 Loss: 0.000125
Tue Sep 8 08:34:44 2020[2]<stdout>:Step #5240 Loss: 0.026498
Tue Sep 8 08:34:44 2020[1]<stdout>:Step #5240 Loss: 0.000308
```

View File

@ -1,182 +0,0 @@
This guide walks through the steps to submit an elastic training job with horovod.
1. Build an image for the training environment.
You can use the `registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1` image directly.
In addition, you can also build your own image with the help of this document [elastic-training-sample-image](https://code.aliyun.com/370272561/elastic-training-sample-image).
2. Submit an elastic training job. The example code is from [pytorch_synthetic_benchmark_elastic.py](https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch_synthetic_benchmark_elastic.py)
```shell script
arena submit etjob \
--name=elastic-training-synthetic \
--gpus=1 \
--workers=3 \
--max-workers=9 \
--min-workers=1 \
--image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
--working-dir=/examples \
"horovodrun
--verbose
--log-level=DEBUG
-np \$((\${workers}*\${gpus}))
--min-np \$((\${minWorkers}*\${gpus}))
--max-np \$((\${maxWorkers}*\${gpus}))
--start-timeout 100
--elastic-timeout 1000
--host-discovery-script /usr/local/bin/discover_hosts.sh
python /examples/elastic/pytorch_synthetic_benchmark_elastic.py
--num-iters=10000
--num-warmup-batches=0"
```
Output:
```
configmap/elastic-training-synthetic-etjob created
configmap/elastic-training-synthetic-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training-synthetic created
INFO[0000] The Job elastic-training-synthetic has been submitted successfully
INFO[0000] You can run `arena get elastic-training-synthetic --type etjob` to check the job status
```
3. List your job.
```shell script
arena list
```
Output:
```
NAME STATUS TRAINER AGE NODE
elastic-training-synthetic RUNNING ETJOB 2m 192.168.0.112
```
4. Get your job details.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 3m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 3m elastic-training-synthetic-worker-2 192.168.0.116
```
5. Check logs
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:24:20 2020[0]<stdout>:Iter #54: 95.3 img/sec per GPU
Tue Sep 8 09:24:23 2020[0]<stdout>:Iter #55: 95.3 img/sec per GPU
Tue Sep 8 09:24:27 2020[0]<stdout>:Iter #56: 94.6 img/sec per GPU
Tue Sep 8 09:24:30 2020[0]<stdout>:Iter #57: 97.1 img/sec per GPU
Tue Sep 8 09:24:33 2020[0]<stdout>:Iter #58: 99.7 img/sec per GPU
Tue Sep 8 09:24:36 2020[0]<stdout>:Iter #59: 99.8 img/sec per GPU
Tue Sep 8 09:24:40 2020[0]<stdout>:Iter #60: 98.0 img/sec per GPU
Tue Sep 8 09:24:43 2020[0]<stdout>:Iter #61: 97.1 img/sec per GPU
Tue Sep 8 09:24:46 2020[0]<stdout>:Iter #62: 96.1 img/sec per GPU
Tue Sep 8 09:24:50 2020[0]<stdout>:Iter #63: 100.4 img/sec per GPU
```
6. Scale out your job. This will add one worker to the job.
```shell script
arena scaleout etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557124-scaleout created
configmap/elastic-training-synthetic-1599557124-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-synthetic-1599557124 created
INFO[0000] The scaleout job elastic-training-synthetic-1599557124 has been submitted successfully
```
7. Get your job details. We can see the new worker (elastic-training-synthetic-worker-3) is "RUNNING".
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 5m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-2 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 5m elastic-training-synthetic-worker-3 192.168.0.112
```
8. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
Tue Sep 8 09:26:03 2020[0]<stdout>:Iter #76: 65.0 img/sec per GPU
Tue Sep 8 09:26:08 2020[0]<stdout>:Iter #77: 64.0 img/sec per GPU
Tue Sep 8 09:26:13 2020[0]<stdout>:Iter #78: 65.4 img/sec per GPU
Tue Sep 8 09:26:18 2020[0]<stdout>:Iter #79: 64.4 img/sec per GPU
Tue Sep 8 09:26:23 2020[0]<stdout>:Iter #80: 62.9 img/sec per GPU
Tue Sep 8 09:26:28 2020[0]<stdout>:Iter #81: 64.0 img/sec per GPU
Tue Sep 8 09:26:33 2020[0]<stdout>:Iter #82: 64.4 img/sec per GPU
Tue Sep 8 09:26:38 2020[0]<stdout>:Iter #83: 64.9 img/sec per GPU
Tue Sep 8 09:26:43 2020[0]<stdout>:Iter #84: 62.7 img/sec per GPU
Tue Sep 8 09:26:48 2020[0]<stdout>:Iter #85: 64.2 img/sec per GPU
```
9. Scale in your job. This will remove one worker from the job.
```shell script
arena scalein etjob --name="elastic-training-synthetic" --count=1 --timeout=1m
```
Output:
```
configmap/elastic-training-synthetic-1599557271-scalein created
configmap/elastic-training-synthetic-1599557271-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-synthetic-1599557271 created
INFO[0000] The scalein job elastic-training-synthetic-1599557271 has been submitted successfully
```
10. Get your job details. We can see that `elastic-training-synthetic-worker-3` has been removed.
```shell script
arena get elastic-training-synthetic
```
Output:
```
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 7m
NAME STATUS TRAINER AGE INSTANCE NODE
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-launcher 192.168.0.112
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-0 192.168.0.116
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-1 192.168.0.117
elastic-training-synthetic RUNNING ETJOB 7m elastic-training-synthetic-worker-2 192.168.0.116
```
11. Check logs.
```shell script
arena logs elastic-training-synthetic --tail 10
```
Output:
```
DEBUG:root:host elastic-training-synthetic-worker-3 has been blacklisted, ignoring exit from local_rank=0
Process 3 exit with status code 134.
Tue Sep 8 09:27:56 2020[0]<stdout>:Iter #97: 96.0 img/sec per GPU
Tue Sep 8 09:28:00 2020[0]<stdout>:Iter #98: 95.4 img/sec per GPU
Tue Sep 8 09:28:03 2020[0]<stdout>:Iter #99: 96.9 img/sec per GPU
Tue Sep 8 09:28:06 2020[0]<stdout>:Iter #100: 97.2 img/sec per GPU
Tue Sep 8 09:28:10 2020[0]<stdout>:Iter #101: 98.5 img/sec per GPU
Tue Sep 8 09:28:13 2020[0]<stdout>:Iter #102: 95.8 img/sec per GPU
Tue Sep 8 09:28:16 2020[0]<stdout>:Iter #103: 97.3 img/sec per GPU
Tue Sep 8 09:28:20 2020[0]<stdout>:Iter #104: 97.3 img/sec per GPU
Tue Sep 8 09:28:23 2020[0]<stdout>:Iter #105: 98.9 img/sec per GPU
```

View File

@ -1,72 +0,0 @@
Arena supports and simplifies distributed TensorFlow Training (PS/worker mode).
1. To run a distributed Tensorflow Training, you need to specify:
- GPUs of each worker (only for GPU workload)
- The number of workers (required)
- The number of PS (required)
- The docker image of worker (required)
- The docker image of PS (required)
- The Port of Worker (default is 22222)
- The Port of PS (default is 22223)
The following command is an example. In this example, it defines 2 workers and 1 PS, and each worker has 1 GPU. The source code of the worker and PS is located in git, and tensorboard is enabled.
```
# arena submit tf \
--name=tf-dist-git \
--gpus=1 \
--workers=2 \
--worker-image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--ps-image=tensorflow/tensorflow:1.5.0-devel \
--tensorboard \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir=/training_logs --data_dir=code/tensorflow-sample-code/data"
configmap/tf-dist-git-tfjob created
configmap/tf-dist-git-tfjob labeled
service/tf-dist-git-tensorboard created
deployment.extensions/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0001] The Job tf-dist-git has been submitted successfully
INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
```
**Note**: If the job or pod fails and the logs show that the git code could not be cloned, the cause is usually cross-border network connectivity (for example, when running containers in some regions such as China), not arena itself.
2\. Get the details of the specific job
```
# arena get tf-dist-git
NAME STATUS TRAINER AGE INSTANCE NODE
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-594d59789c-lrfsk 192.168.1.119
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-ps-0 192.168.1.118
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-0 192.168.1.119
tf-dist-git RUNNING tfjob 55s tf-dist-git-tfjob-worker-1 192.168.1.120
Your tensorboard will be available on:
192.168.1.117:32298
```
3\. Check the tensorboard
![](3-tensorboard.jpg)
4\. Get the TFJob dashboard
```
# arena logviewer tf-dist-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-dist-git-tfjob
```
![](4-tfjob-logviewer-distributed.jpg)
Congratulations! You've run the distributed training job with `arena` successfully.

View File

@ -1,78 +0,0 @@
The distributed Tensorflow job has several roles: Worker, PS, Chief and Evaluator. Sometimes you may need to control the sequence in which they are created, for example creating the "Worker" role first and the "PS" role second. This guide shows you how.
1. Now, assume that you want to submit a distributed Tensorflow job whose four roles are Worker, PS, Chief and Evaluator, and you need the role starting sequence to be "Worker,Chief,PS,Evaluator". You only need to add the option "--role-sequence" when submitting the job. The following command is an example:
```
$ arena submit tfjob \
--name=tf-distributed-test \
--role-sequence "Worker,Chief,PS,Evaluator" \
--chief \
--evaluator \
--gpus=1 \
--workers=1 \
--worker-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--tensorboard-image="registry.cn-hongkong.aliyuncs.com/ai-samples/tensorflow:1.12.0-devel" \
"python /app/main.py"
```
the "--role-sequence Worker,Chief,PS,Evaluator" is the same as "--role-sequence w,c,p,e" and "w" represents "Worker", "c" represents "Chief", "p" represents "PS" and "e" represents "Evaluator".
2. Make sure at least one pod belonging to the tfjob "tf-distributed-test" has annotation "job-role-sequence=Worker,Chief,PS,Evaluator":
```
$ kubectl get po -l tf-job-name=tf-distributed-test
NAME READY STATUS RESTARTS AGE
tf-distributed-test-chief-0 0/1 ContainerCreating 0 5m47s
tf-distributed-test-evaluator-0 0/1 ContainerCreating 0 5m47s
tf-distributed-test-ps-0 1/1 Running 0 5m47s
tf-distributed-test-worker-0 0/1 ContainerCreating 0 5m47s
$ kubectl get po tf-distributed-test-worker-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    job-role-sequence: Worker,Chief,PS,Evaluator
    kubernetes.io/psp: ack.privileged
    requestGPUsOfJobOwner: "3"
  creationTimestamp: 2021-02-22T03:07:49Z
  ....
```
3. You can validate it by querying the tf-operator logs.
```
$ kubectl get po -n arena-system
NAME READY STATUS RESTARTS AGE
et-operator-576887864c-lvmrs 1/1 Running 1 19d
mpi-operator-66b4cf9b76-kl2fm 1/1 Running 0 26d
pytorch-operator-8545c46f98-cffgw 1/1 Running 4 26d
tf-job-dashboard-78478bfc45-msbzn 1/1 Running 0 19d
tf-job-operator-554d594cff-5vxfg 1/1 Running 0 101m
```
Query the logs of tf-job-operator-554d594cff-5vxfg.
```
$ kubectl logs tf-job-operator-554d594cff-5vxfg -n arena-system | grep "the Role Sequence" | tail -n 1
{"filename":"tensorflow/controller.go:453","job":"default.tf-distributed-test","level":"info","msg":"the Role Sequence of job tf-distributed-test is: [Worker Chief PS Evaluator]","time":"2021-02-01T13:22:23Z","uid":"7db02629-4591-4e0c-a938-c6e4a1cfc074"}
```
As you can see, the sequence in which the tf-operator handles the tfjob roles matches the sequence you specified.
If you don't want to specify the role sequence every time you submit a tfjob, you can save it in the arena configuration file "~/.arena/config", like:
```
tfjob_role_sequence = Worker,PS,Chief,Evaluator
```
or
```
tfjob_role_sequence = w,p,c,e
```

View File

@ -1,128 +0,0 @@
## Support Multiple Users
In some usage scenarios, you may want multiple users to use arena, each with different permissions to operate the kubernetes cluster. This guide shows how to achieve this.
Now, assume there are 3 arena users whose privileges are described in the following table:
| User Name | User Namespace | Quota | Additional Privileges |
| --------- | -------------- | ----- |---------------------- |
| alex | workplace1 | - |-|
| bob | workplace2 |limits.cpu: "10",limits.memory: "20Gi",requests.cpu: "5",requests.memory: "10Gi" |list the jobs in the cluster scope|
| tom | workplace3 |requests.nvidia.com/gpu: 20|list the jobs in the namespace scope|
The following steps describe how to generate the kubeconfig files for these users.
1. Prepare the user configuration file. You can refer to `~/charts/user/values.yaml` or `/charts/user/values.yaml` when writing your own.
The user alex doesn't need a user configuration file, because he uses the default configuration.
The user bob's configuration file is defined as:
```
quota:
  limits.cpu: "10"
  requests.cpu: "5"
  requests.memory: "10Gi"
  limits.memory: "20Gi"
clusterRoles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
```
and store it at /tmp/bob.yaml.
The user tom's configuration file is defined as:
```
quota:
  requests.nvidia.com/gpu: 5
roles:
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - list
```
and store it at /tmp/tom.yaml.
2. Generate the user kubeconfig; the script `arena-gen-kubeconfig.sh` can help you:
```
$ arena-gen-kubeconfig.sh -h
Usage:
arena-gen-kubeconfig.sh [OPTION1] [OPTION2] ...
Options:
--user-name <USER_NAME> Specify the user name
--user-namespace <USER_NAMESPACE> Specify the user namespace
--user-config <USER_CONFIG> Specify the user config,refer the ~/charts/user/values.yaml or /charts/user/values.yaml
--force If the user has been existed,force to update the user
--delete Delete the user
--output <KUBECONFIG|USER_MANIFEST_YAML> Specify the output kubeconfig file or the user manifest yaml
--admin-kubeconfig <ADMIN_KUBECONFIG> Specify the Admin kubeconfig file
--cluster-url <CLUSTER_URL> Specify the Cluster URL,if not specified,the script will detect the cluster url
--create-user-yaml Only generate the user manifest yaml,don't apply it and create kubeconfig file
```
Firstly, create the kubeconfig file of alex:
```
$ arena-gen-kubeconfig.sh --user-name alex --user-namespace workplace1 --output /tmp/alex.kubeconfig --force
2021-02-08/11:38:44 DEBUG found arena charts in /Users/yangjunfeng/charts
2021-02-08/11:38:44 DEBUG the user configuration not set,use the default configuration file
resourcequota/arena-quota-alex created
serviceaccount/alex created
clusterrole.rbac.authorization.k8s.io/arena:workplace1:alex configured
clusterrolebinding.rbac.authorization.k8s.io/arena:workplace1:alex configured
role.rbac.authorization.k8s.io/arena:alex created
rolebinding.rbac.authorization.k8s.io/arena:alex created
configmap/arena-user-alex created
Cluster "https://192.168.1.42:6443" set.
User "alex" set.
Context "registry" created.
Switched to context "registry".
2021-02-08/11:38:48 DEBUG kubeconfig written to file /tmp/alex.kubeconfig
```
As you can see, the kubeconfig file has been created (/tmp/alex.kubeconfig).
Secondly, create the kubeconfig file of user bob:
```
$ arena-gen-kubeconfig.sh --user-name bob --user-namespace workplace2 --user-config /tmp/bob.yaml --output /tmp/bob.kubeconfig --force
```
The kubeconfig file will be stored at /tmp/bob.kubeconfig.
Thirdly, create the kubeconfig file of user tom:
```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --user-config /tmp/tom.yaml --output /tmp/tom.kubeconfig --force
```
The kubeconfig file will be stored at /tmp/tom.kubeconfig.
3. Make the kubeconfig file take effect by setting the KUBECONFIG environment variable:
```
$ export KUBECONFIG=/tmp/alex.kubeconfig
```
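You can then check what the generated user is actually allowed to do (a quick check with plain kubectl):
```
# list the permissions granted to the current kubeconfig user
$ kubectl auth can-i --list --namespace workplace1
```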
4. Now you can use arena to submit your training jobs.
5. If you want to delete a user, execute a command like:
```
$ arena-gen-kubeconfig.sh --user-name tom --user-namespace workplace3 --delete
```

View File

@ -1,110 +0,0 @@
`arena` allows mounting multiple data volumes into training jobs. Here is an example that mounts a data volume into the training job.
1. You need to create `/data` on the NFS server and prepare the `mnist` data:
```
# mkdir -p /nfs
# mount -t nfs -o vers=4.0 NFS_SERVER_IP:/ /nfs
# mkdir -p /data
# cd /data
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-images-idx3-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/t10k-labels-idx1-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-images-idx3-ubyte.gz
# wget https://raw.githubusercontent.com/cheyang/tensorflow-sample-code/master/data/train-labels-idx1-ubyte.gz
# cd /
# umount /nfs
```
2\. Create a Persistent Volume. Modify `NFS_SERVER_IP` to your own.
```
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfdata
  labels:
    tfdata: nas-mnist
spec:
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: NFS_SERVER_IP
    path: "/data"
# kubectl create -f nfs-pv.yaml
```
3\. Create Persistent Volume Claim.
```
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfdata
  annotations:
    description: "this is the mnist demo"
    owner: Tom
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfdata: nas-mnist
# kubectl create -f nfs-pvc.yaml
```
> Notice: it is suggested to add `description` and `owner`.
4\. Check the data volume
```
# arena data list
NAME ACCESSMODE DESCRIPTION OWNER AGE
tfdata ReadWriteMany this is for mnist demo myteam 43d
```
5\. Now we can submit a distributed training job with `arena`; it will download the source code from the git repository and mount the data volume `tfdata` to `/mnist_data`.
```
# arena submit tf --name=tf-dist-data \
--gpus=1 \
--workers=2 \
--workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--psImage=tensorflow/tensorflow:1.5.0-devel \
--tensorboard \
--data=tfdata:/mnist_data \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
```
> `--data` specifies the data volume to mount into all the tasks of the job, in the form `<name_of_datasource>:<mount_point_on_job>`. In this example, the data volume is `tfdata`, and the target directory is `/mnist_data`.
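Since `--data` can be repeated, you can mount several volumes at once; a hypothetical sketch (the `tfoutput` claim is made up for illustration, and the elided flags are as in step 5):
```
# arena submit tf --name=tf-dist-data \
    ... \
    --data=tfdata:/mnist_data \
    --data=tfoutput:/training_logs \
    "python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs --data_dir /mnist_data"
```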
6\. From the logs, we find that the training data is extracted from `/mnist_data` instead of being downloaded from the internet directly.
```
# arena logs tf-dist-data
...
Extracting /mnist_data/train-images-idx3-ubyte.gz
Extracting /mnist_data/train-labels-idx1-ubyte.gz
Extracting /mnist_data/t10k-images-idx3-ubyte.gz
Extracting /mnist_data/t10k-labels-idx1-ubyte.gz
...
Accuracy at step 960: 0.9753
Accuracy at step 970: 0.9739
Accuracy at step 980: 0.9756
Accuracy at step 990: 0.9777
Adding run metadata for 999
```


@ -1,56 +0,0 @@
Arena supports and simplifies distributed TensorFlow training (MPI mode).
1. To run distributed training with MPI support, you need to specify:
- GPUs of each worker (only for GPU workload)
- The number of workers (required)
- The docker image of MPI worker (required)
The following command is an example. In this example, it defines 2 workers, each worker has 1 GPU, and TensorBoard is enabled.
```
# arena submit mpi \
--name=mpi-dist \
--gpus=1 \
--workers=2 \
--image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--env=GIT_SYNC_BRANCH=cnn_tf_v1.9_compatible \
--sync-mode=git \
--sync-source=https://github.com/tensorflow/benchmarks.git \
--tensorboard \
"mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2\. Get the details of the specific job
```
# arena get mpi-dist
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-launcher-ndnw8 192.168.1.120
mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-worker-0 192.168.1.119
mpi-dist RUNNING MPIJOB 1d mpi-dist-mpijob-worker-1 192.168.1.120
Your tensorboard will be available on:
192.168.1.117:32559
```
3\. Check the tensorboard
![](5-mpi-tensorboard.jpg)
4\. Get the MPI dashboard
```
# arena logviewer mpi-dist
Your LogViewer will be available on:
192.168.1.119:9090/#!/log/default/mpi-dist-mpijob-launcher-ndnw8/mpi?namespace=default
```
![](5-mpijob-logviewer.jpg)
Congratulations! You've run the distributed MPI training job with `arena` successfully.


@ -1,67 +0,0 @@
Arena supports distributed TensorFlow training with gang scheduling by using [kube-arbitrator](https://github.com/kubernetes-incubator/kube-arbitrator).
When running distributed TensorFlow, it is better to ensure `all` or `nothing` scheduling; gang scheduling helps with such cases.
> Notice: the current [kubernetes gang scheduler](https://github.com/kubernetes-incubator/kube-arbitrator/tree/release-0.1) is not production ready. For example, it doesn't support Pod Affinity and PodFitsHostPorts for scheduling.
> Limitation: when using the gang scheduler, the TensorBoard feature doesn't work well.
1. To enable the gang scheduler, edit `/charts/tfjob/values.yaml`
Change `enableGangScheduler: false` to `enableGangScheduler: true`.
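After the change, the relevant excerpt of the values file would read:
```
# /charts/tfjob/values.yaml (excerpt)
enableGangScheduler: true
```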
2. To run distributed TensorFlow training, you need to specify:
- GPUs of each worker (only for GPU workload)
- The number of workers (required)
- The number of PS (required)
- The docker image of worker (required)
- The docker image of PS (required)
- The Port of Worker (default is 22222)
- The Port of PS (default is 22223)
The following command is an example. In this example, it defines 2 workers and 1 PS, and each worker has 1 GPU. The source code of the worker and PS is located in git, and TensorBoard is enabled.
```
# arena submit tf --name=tf-dist-git \
--gpus=1 \
--workers=2 \
--workerImage=tensorflow/tensorflow:1.5.0-devel-gpu \
--syncMode=git \
--syncSource=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--ps=1 \
--psImage=tensorflow/tensorflow:1.5.0-devel \
"python code/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --log_dir /training_logs"
configmap/tf-dist-git-tfjob created
configmap/tf-dist-git-tfjob labeled
service/tf-dist-git-tensorboard created
deployment.extensions/tf-dist-git-tensorboard created
tfjob.kubeflow.org/tf-dist-git created
INFO[0001] The Job tf-dist-git has been submitted successfully
INFO[0001] You can run `arena get tf-dist-git --type tfjob` to check the job status
```
If there are not enough resources, all the instances of the job stay `PENDING`. Without the gang scheduler, you would see some of the pods `RUNNING` and others `PENDING`.
```
# arena get tf-dist-data
NAME STATUS TRAINER AGE INSTANCE NODE
tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-ps-0 N/A
tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-0 N/A
tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-1 N/A
tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-2 N/A
tf-dist-data PENDING TFJOB 0s tf-dist-data-tfjob-worker-3 N/A
```
When there are enough resources, the instances become `RUNNING`:
```
NAME STATUS TRAINER AGE INSTANCE NODE
tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-ps-0 192.168.1.115
tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-0 192.168.1.119
tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-1 192.168.1.118
tf-dist-data RUNNING TFJOB 4s tf-dist-data-tfjob-worker-2 192.168.1.120
```


@ -1,140 +0,0 @@
You can also use the high-level TensorFlow API `tf.estimator.Estimator` class to run distributed TensorFlow with good modularity by using `Arena`.
1. Create the Persistent Volume. Modify `NFS_SERVER_IP` to your own.
```
# cat nfs-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: tfdata
labels:
tfdata: nas-mnist
spec:
persistentVolumeReclaimPolicy: Retain
capacity:
storage: 10Gi
accessModes:
- ReadWriteMany
nfs:
server: NFS_SERVER_IP
path: "/data"
# kubectl create -f nfs-pv.yaml
```
2\. Create Persistent Volume Claim.
```
# cat nfs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: tfdata
annotations:
description: "this is the mnist demo"
owner: Tom
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
selector:
matchLabels:
tfdata: nas-mnist
# kubectl create -f nfs-pvc.yaml
```
> Notice: it is suggested to add a `description` and an `owner`
3\. Check the data volume
```
# arena data list
NAME ACCESSMODE DESCRIPTION OWNER AGE
tfdata ReadWriteMany this is for mnist demo myteam 43d
```
4\. To run distributed TensorFlow training, you need to specify:
- GPUs of each worker (Include chief and evaluator)
- Enable chief (required)
- Enable Evaluator (optional)
- The number of workers (required)
- The number of PS (required)
- The docker image of worker and master (required)
- The docker image of PS (required)
- The Port of Chief (default is 22221)
- The Port of Worker (default is 22222)
- The Port of PS (default is 22223)
The following command is an example. In this example, it defines 1 chief, 1 worker, 1 PS, and 1 evaluator, and each worker has 1 GPU. The source code of the worker and PS is located in git, and TensorBoard is enabled.
```
# arena submit tf --name=tf-estimator \
--gpus=1 \
--workers=1 \
--chief \
--evaluator \
--data=tfdata:/data/mnist \
--logdir=/data/mnist/models \
--workerImage=tensorflow/tensorflow:1.9.0-devel-gpu \
--syncMode=git \
--syncSource=https://github.com/cheyang/models.git \
--ps=1 \
--psImage=tensorflow/tensorflow:1.9.0-devel \
--tensorboard \
"bash code/models/dist_mnist_estimator.sh --data_dir=/data/mnist/MNIST_data --model_dir=/data/mnist/models"
configmap/tf-estimator-tfjob created
configmap/tf-estimator-tfjob labeled
service/tf-estimator-tensorboard created
deployment.extensions/tf-estimator-tensorboard created
tfjob.kubeflow.org/tf-estimator created
INFO[0001] The Job tf-estimator has been submitted successfully
INFO[0001] You can run `arena get tf-estimator --type tfjob` to check the job status
```
> `--data` specifies the data volume to mount to all the tasks of the job, in the format `<name_of_datasource>:<mount_point_on_job>`. In this example, the data volume is `tfdata`, and the target directory is `/data/mnist`.
5\. From the logs, we can see that the training has started:
```
# arena logs tf-estimator
2018-09-27T00:37:01.576672145Z 2018-09-27 00:37:01.576562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:chief/replica:0/task:0/device:GPU:0 with 15123 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-09-27T00:37:01.578669608Z 2018-09-27 00:37:01.578523: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:22222}
2018-09-27T00:37:01.578685739Z 2018-09-27 00:37:01.578550: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-estimator-tfjob-ps-0:22223}
2018-09-27T00:37:01.578705274Z 2018-09-27 00:37:01.578562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tf-estimator-tfjob-worker-0:22222}
2018-09-27T00:37:01.579637826Z 2018-09-27 00:37:01.579454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:334] Started server with target: grpc://localhost:22222
2018-09-27T00:37:01.701520696Z I0927 00:37:01.701258 140281586534144 tf_logging.py:115] Calling model_fn.
2018-09-27T00:37:02.172552485Z I0927 00:37:02.172167 140281586534144 tf_logging.py:115] Done calling model_fn.
2018-09-27T00:37:02.173930978Z I0927 00:37:02.173732 140281586534144 tf_logging.py:115] Create CheckpointSaverHook.
2018-09-27T00:37:02.431259294Z I0927 00:37:02.430984 140281586534144 tf_logging.py:115] Graph was finalized.
2018-09-27T00:37:02.4472109Z 2018-09-27 00:37:02.447018: I tensorflow/core/distributed_runtime/master_session.cc:1150] Start master session b0a6d2587e64ebef with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }
...
2018-09-27T00:37:33.250440133Z I0927 00:37:33.250036 140281586534144 tf_logging.py:115] global_step/sec: 21.8175
2018-09-27T00:37:33.253100942Z I0927 00:37:33.252873 140281586534144 tf_logging.py:115] loss = 0.09276967, step = 500 (4.583 sec)
2018-09-27T00:37:37.764446795Z I0927 00:37:37.764101 140281586534144 tf_logging.py:115] Saving checkpoints for 600 into /data/mnist/models/model.ckpt.
2018-09-27T00:37:38.064104604Z I0927 00:37:38.063472 140281586534144 tf_logging.py:115] Loss for final step: 0.24215397.
```
6\. Check the training status and tensorboard
```
# arena get tf-estimator
NAME STATUS TRAINER AGE INSTANCE NODE
tf-estimator SUCCEEDED TFJOB 5h tf-estimator-tfjob-chief-0 N/A
tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-evaluator-0 192.168.1.120
tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-ps-0 192.168.1.119
tf-estimator RUNNING TFJOB 5h tf-estimator-tfjob-worker-0 192.168.1.118
Your tensorboard will be available on:
192.168.1.117:31366
```
7\. Check the tensorboard from 192.168.1.117:31366 in this sample
![](8-tfjob-estimator-tensorboard.jpg)


@ -1,65 +0,0 @@
The command `arena top job <job name>` displays GPU monitoring metrics. Before using it, you must deploy Prometheus and a node exporter for GPU metrics.
1\. Deploy Prometheus
```
kubectl apply -f kubernetes-artifacts/prometheus/prometheus.yaml
```
2\. Deploy GPU node exporter
* If your cluster is an ACK (Alibaba Cloud Kubernetes) cluster, you can simply run:
```
# change the gpu exporter nodeSelector to the aliyun label
sed -i 's|accelerator/nvidia_gpu|aliyun.accelerator/nvidia_count|g' kubernetes-artifacts/prometheus/gpu-exporter.yaml
```
* If your cluster is not an ACK cluster, you need to label your GPU nodes:
```
# label all your GPU nodes
kubectl label node <your GPU node> accelerator/nvidia_gpu=true
```
* Deploy the GPU exporter:
```
kubectl apply -f kubernetes-artifacts/prometheus/gpu-exporter.yaml
```
> Notice: the prometheus and gpu-exporter components should be deployed in the `kube-system` namespace so that `arena top job <job name>` can work.
3\. You can check the GPU metrics with a Prometheus query:
```
# kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nvidia_gpu_num_devices","app":"node-gpu-exporter","instance":"172.16.1.144:9445","job":"kubernetes-service-endpoints","k8s_app":"node-gpu-exporter","kubernetes_name":"node-gpu-exporter","node_name":"mynode"},"value":[1543202894.919,"2"]}]}}
```
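If `jq` is installed, the same query can be piped through it to pull out just the metric values:
```
# kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices' | jq '.data.result[].value'
```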
4\. Submit a training job with arena:
```
arena submit tf --name=style-transfer \
--gpus=2 \
--workers=2 \
--workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
--workingDir=/neural-style \
--ps=1 \
--psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
"python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000000"
```
5\. Check GPU metrics for the job you deployed
```
# arena top job style-transfer
INSTANCE NAME STATUS NODE GPU(Device Index) GPU(Duty Cycle) GPU(Memory MiB)
style-transfer-tfjob-ps-0 Running 192.168.0.95 N/A N/A N/A
style-transfer-tfjob-worker-0 Running 192.168.0.98 0 98% 15641MiB / 16276MiB
1 0% 15481MiB / 16276MiB
style-transfer-tfjob-worker-1 Running 192.168.0.99 0 98% 15641MiB / 16276MiB
1 0% 15481MiB / 16276MiB
```


@ -1,139 +0,0 @@

This example shows how to use `Arena` to train a machine learning model. The example downloads the source code from a git URL.
1. The first step is to check the available GPU resources:
```
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
```
There are 3 available nodes with GPUs for running training jobs.
2\. Now we can submit a training job with `arena`; this example downloads the source code from GitHub:
```
#arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir=code/tensorflow-sample-code/data"
configmap/tf-git-tfjob created
configmap/tf-git-tfjob labeled
tfjob.kubeflow.org/tf-git created
INFO[0000] The Job tf-git has been submitted successfully
INFO[0000] You can run `arena get tf-git --type tfjob` to check the job status
```
> This downloads the source code and extracts it into the `code/` directory under the working directory. The default working directory is `/root`; you can also specify it with `--workingDir`. You can also choose the branch to pull by adding `--env GIT_SYNC_BRANCH=main` to the submit command. Note: new GitHub repositories use `main` as the default branch instead of `master`.
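For example, the same job pinned to the `main` branch would be submitted like this (a sketch; whether that branch exists in this sample repository is not guaranteed):
```
# arena submit tf \
    --name=tf-git \
    --gpus=1 \
    --image=tensorflow/tensorflow:1.5.0-devel-gpu \
    --sync-mode=git \
    --sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
    --env=GIT_SYNC_BRANCH=main \
    "python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```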
> If you are using a private git repository, you can use the following command:
```
#arena submit tf \
--name=tf-git \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--sync-mode=git \
--sync-source=https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git \
--env=GIT_SYNC_USERNAME=yourname \
--env=GIT_SYNC_PASSWORD=yourpwd \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py"
```
Note: `arena` uses [git-sync](https://github.com/kubernetes/git-sync/blob/master/cmd/git-sync/main.go) to synchronize the source code. You can set the environment variables defined in the git-sync project.
3\. List all jobs:
```
#arena list
NAME STATUS TRAINER AGE NODE
tf-git RUNNING tfjob 0s 192.168.1.120
```
4\. Check the GPU resources used by the job:
```
#arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-git RUNNING TFJOB 17s 192.168.1.120 1 1
Total Allocated GPUs of Training Job:
1
Total Requested GPUs of Training Job:
1
```
5\. Check the GPU resources used by the cluster:
```
#arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 1
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
```
6\. Get the details of a specific job:
```
#arena get tf-git
NAME STATUS TRAINER AGE INSTANCE NODE
tf-git RUNNING TFJOB 5s tf-git-tfjob-worker-0 192.168.1.120
```
7\. Check the logs:
```
#arena logs tf-git
2018-07-22T23:56:20.841129509Z WARNING:tensorflow:From code/tensorflow-sample-code/tfjob/docker/mnist/main.py:119: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
2018-07-22T23:56:20.841211064Z Instructions for updating:
2018-07-22T23:56:20.841217002Z
2018-07-22T23:56:20.841221287Z Future major versions of TensorFlow will allow gradients to flow
2018-07-22T23:56:20.841225581Z into the labels input on backprop by default.
2018-07-22T23:56:20.841229492Z
...
2018-07-22T23:57:11.842929868Z Accuracy at step 920: 0.967
2018-07-22T23:57:11.842933859Z Accuracy at step 930: 0.9646
2018-07-22T23:57:11.842937832Z Accuracy at step 940: 0.967
2018-07-22T23:57:11.842941362Z Accuracy at step 950: 0.9674
2018-07-22T23:57:11.842945487Z Accuracy at step 960: 0.9693
2018-07-22T23:57:11.842949067Z Accuracy at step 970: 0.9687
2018-07-22T23:57:11.842952818Z Accuracy at step 980: 0.9688
2018-07-22T23:57:11.842956775Z Accuracy at step 990: 0.9649
2018-07-22T23:57:11.842961076Z Adding run metadata for 999
```
8\. The log viewer provides more information about the training job:
```
#arena logviewer tf-git
Your LogViewer will be available on:
192.168.1.120:8080/tfjobs/ui/#/default/tf-git-tfjob
```
![](1-tfjob-logviewer.jpg)
Congratulations! You have successfully run your first training job with `arena`.


@ -1,168 +0,0 @@
# Example: MPIJob preemption with Arena
## Prerequisites
- k8s > 1.11
1. Create `PriorityClass` objects with the following YAML, which defines two priorities, `critical` and `medium`:
```yaml
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the critical app
kind: PriorityClass
metadata:
name: critical
value: 1100000
---
apiVersion: scheduling.k8s.io/v1beta1
description: Used for the medium app
kind: PriorityClass
metadata:
name: medium
value: 1000000
```
Save the content above to a file named `pc.yaml`, then create the objects with:
```
kubectl create -f pc.yaml
```
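Under the hood, a `PriorityClass` takes effect through the standard `priorityClassName` field on a pod; the `--priority` flag used below has arena wire this up for the job's pods, roughly as in this sketch:
```yaml
# Sketch of what the submitted pods end up carrying (set by arena, not written by hand):
spec:
  priorityClassName: medium
```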
2. The arena command shows that there is only one available GPU card in the current Kubernetes cluster:
```
# arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
192.168.0.20 192.168.0.20 master 0 0
192.168.0.21 192.168.0.21 master 0 0
192.168.0.22 192.168.0.22 master 0 0
192.168.0.23 192.168.0.23 <none> 1 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
```
3. Submit an MPI training job with priority `medium`, as in the following example:
```
# arena submit mpi \
--name=medium \
--priority=medium \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
configmap/medium-mpijob created
configmap/medium-mpijob labeled
mpijob.kubeflow.org/medium created
INFO[0000] The Job medium has been submitted successfully
INFO[0000] You can run `arena get medium --type mpijob` to check the job status
```
4. Check the job's status:
```
# arena get medium
STATUS: RUNNING
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 58s
NAME STATUS TRAINER AGE INSTANCE NODE
medium RUNNING MPIJOB 58s medium-launcher-sz5xj 192.168.0.23
medium RUNNING MPIJOB 58s medium-worker-0 192.168.0.23
```
5. You can see that the job occupies the only GPU card:
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default medium-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster: 1/1 (100%)
```
6. Submit another MPI training job, this time with priority `critical`:
```
# arena submit mpi \
--name=critical \
--priority=critical \
--gpus=1 \
--workers=1 \
--image=registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
"mpirun tail -f /dev/null"
```
7. Check the events of the MPI training job `medium`: it has been evicted. It was evicted because the pods of the higher-priority job `critical` also requested GPU resources, and with only one GPU available in the cluster, `medium-worker-0` of the lower-priority job `medium` was preempted:
```
# kubectl get events --field-selector involvedObject.name=medium-worker-0
LAST SEEN TYPE REASON OBJECT MESSAGE
15m Normal Scheduled pod/medium-worker-0 Successfully assigned default/medium-worker-0 to 192.168.0.23
14m Normal Pulled pod/medium-worker-0 Container image "registry.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5" already present on machine
14m Normal Created pod/medium-worker-0 Created container mpi
14m Normal Started pod/medium-worker-0 Started container mpi
2m32s Normal Preempted pod/medium-worker-0 by default/critical-worker-0 on node 192.168.0.23
2m32s Normal Killing pod/medium-worker-0 Stopping container mpi
```
8. Check the details of the MPI training job `medium`: the job is now in the FAILED state.
```
# arena get medium
STATUS: FAILED
NAMESPACE: default
PRIORITY: medium
TRAINING DURATION: 12m
NAME STATUS TRAINER AGE INSTANCE NODE
medium FAILED MPIJOB 20m medium-launcher-sz5xj 192.168.0.23
```
9. Check the details of the MPI training job `critical`: the job is running.
```
# arena get critical
STATUS: RUNNING
NAMESPACE: default
PRIORITY: critical
TRAINING DURATION: 10m
NAME STATUS TRAINER AGE INSTANCE NODE
critical RUNNING MPIJOB 10m critical-launcher-mfffs 192.168.0.23
critical RUNNING MPIJOB 10m critical-worker-0 192.168.0.23
```
10. You can also see with `arena top node -d` that the GPU is now occupied by the MPI training job `critical`.
```
# arena top node -d
NAME: cn-hangzhou.192.168.0.23
IPADDRESS: 192.168.0.23
ROLE: <none>
NAMESPACE NAME GPU REQUESTS GPU LIMITS
default critical-worker-0 1 1
Total GPUs In Node cn-hangzhou.192.168.0.23: 1
Allocated GPUs In Node cn-hangzhou.192.168.0.23: 1 (100%)
-----------------------------------------------------------------------------------------
```
Congratulations! You can now achieve MPIJob priority preemption with arena.


@ -1,159 +0,0 @@
Arena supports assigning submitted jobs to specific nodes (currently only mpi and tf jobs are supported).
Some usage examples follow.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Label some k8s nodes. For example, label the nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229" with "gpu_node=ok", and label the node "cn-beijing.192.168.3.230" with "ssd_node=ok".
```
# kubectl label nodes cn-beijing.192.168.3.228 gpu_node=ok
node/cn-beijing.192.168.3.228 labeled
# kubectl label nodes cn-beijing.192.168.3.229 gpu_node=ok
node/cn-beijing.192.168.3.229 labeled
# kubectl label nodes cn-beijing.192.168.3.230 ssd_node=ok
node/cn-beijing.192.168.3.230 labeled
```
## MPI jobs
1. When submitting jobs, you can use the "--selector" option to determine which nodes they run on:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--selector gpu_node=ok \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
2. Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 21s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 21s mpi-dist-launcher-7jn4q 192.168.3.229
mpi-dist RUNNING MPIJOB 21s mpi-dist-worker-0 192.168.3.229
Your tensorboard will be available on:
http://192.168.3.225:31611
```
The job is now running on node cn-beijing.192.168.3.229 (IP 192.168.3.229).
3. You can use the "--selector" option multiple times. For example, passing "--selector gpu_node=ok --selector ssd_node=ok" to the arena submit command means the job must run on nodes that carry both the "gpu_node=ok" and the "ssd_node=ok" labels, as in the sketch below.
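A sketch of such a submission (with the labels above, no node carries both labels, so this particular job would stay pending until one does):
```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --selector gpu_node=ok \
    --selector ssd_node=ok \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod"
```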
## TF jobs
1. Since a tf job has four kinds of roles ("PS", "Worker", "Evaluator", "Chief"), you can use "--selector" to specify which nodes the job runs on:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--selector ssd_node=ok \
--work-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Check the status with the following command:
```
# arena get tf
STATUS: PENDING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 24s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 24s tf-ps-0 192.168.3.230
tf PENDING TFJOB 24s tf-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:31867
```
Both the "PS" and the "Worker" have been scheduled to node cn-beijing.192.168.3.230 (IP 192.168.3.230, labeled "ssd_node=ok").
2. You can also choose nodes per role. For example, to run the "PS" on nodes labeled "ssd_node=ok" and the "Worker" on nodes labeled "gpu_node=ok", use "--ps-selector" and "--worker-selector":
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--worker-selector gpu_node=ok \
--work-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
Check the job status:
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 23s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 23s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 23s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:30162
```
"PS" job运行在节点cn-beijing.192.168.3.230(ip是192.168.3.230,标签是"ssd_node=ok")"Worker" job运行在节点cn-beijing.192.168.3.228(ip是192.168.3.228,标签是"gpu_node=ok")上。
3.如果你同时使用"--selector"和"--ps-selector"(或者"--worker-selector","--evaluator-selector","chief-selector"),那么"--ps-selector"的值会覆盖"--selector"的值。,例如:
```
arena submit tfjob \
--name=tf \
--gpus=1 \
--workers=1 \
--ps-selector ssd_node=ok \
--selector gpu_node=ok \
--work-image=cheyang/tf-mnist-distributed:gpu \
--ps-image=cheyang/tf-mnist-distributed:cpu \
--ps=1 \
--tensorboard \
--loglevel debug \
"python /app/main.py"
```
In principle "--selector" applies to all roles, so in the command above every role would be scheduled to nodes labeled gpu_node=ok; but because "--ps-selector" is also given, the "PS" is scheduled to nodes labeled ssd_node=ok instead of gpu_node=ok.
```
# arena get tf
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 39s
NAME STATUS TRAINER AGE INSTANCE NODE
tf RUNNING TFJOB 39s tf-ps-0 192.168.3.230
tf RUNNING TFJOB 39s tf-worker-0 192.168.3.228
Your tensorboard will be available on:
http://192.168.3.225:32105
```
As you can see, the "PS" is scheduled to a node labeled "ssd_node=ok", and the other roles are scheduled to nodes labeled "gpu_node=ok".


@ -1,83 +0,0 @@
Arena supports running submitted jobs on tainted k8s nodes (currently only mpi and tf jobs are supported).
Some usage examples follow.
1. Query the k8s cluster information:
```
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.3.225 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.226 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.227 Ready master 2d23h v1.12.6-aliyun.1
cn-beijing.192.168.3.228 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.229 Ready <none> 2d22h v1.12.6-aliyun.1
cn-beijing.192.168.3.230 Ready <none> 2d22h v1.12.6-aliyun.1
```
2. Taint some k8s nodes. For example, taint the nodes "cn-beijing.192.168.3.228" and "cn-beijing.192.168.3.229" with "gpu_node=invalid:NoSchedule", and taint the node "cn-beijing.192.168.3.230" with "ssd_node=invalid:NoSchedule". Now no pods can be scheduled onto these nodes.
```
# kubectl taint nodes cn-beijing.192.168.3.228 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.228 tainted
# kubectl taint nodes cn-beijing.192.168.3.229 gpu_node=invalid:NoSchedule
node/cn-beijing.192.168.3.229 tainted
# kubectl taint nodes cn-beijing.192.168.3.230 ssd_node=invalid:NoSchedule
node/cn-beijing.192.168.3.230 tainted
```
3. When submitting a job, you can use "--toleration" to tolerate some tainted k8s nodes:
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job information:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.230
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
The job is now running on node cn-beijing.192.168.3.230 (IP 192.168.3.230, tainted with ssd_node=invalid).
4. You can use "--toleration" multiple times in one command. For example, "--toleration gpu_node --toleration ssd_node" tolerates both nodes tainted with "gpu_node" and nodes tainted with "ssd_node":
```
# arena submit mpi --name=mpi-dist \
--gpus=1 \
--workers=1 \
--toleration ssd_node \
--toleration gpu_node \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
--tensorboard \
--loglevel debug \
"mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod --train_dir=/training_logs --summary_verbosity=3 --save_summaries_steps=10"
```
Query the job status:
```
# arena get mpi-dist
STATUS: RUNNING
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 29s
NAME STATUS TRAINER AGE INSTANCE NODE
mpi-dist RUNNING MPIJOB 29s mpi-dist-launcher-jgms7 192.168.3.229
mpi-dist RUNNING MPIJOB 29s mpi-dist-worker-0 192.168.3.230
Your tensorboard will be available on:
http://192.168.3.225:30052
```
5. You can use "--toleration all" to tolerate all taints on all nodes, as in the sketch below.
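A sketch of a submission using it (reusing the image from the earlier steps):
```
# arena submit mpi --name=mpi-dist \
    --gpus=1 \
    --workers=1 \
    --toleration all \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
    "mpirun python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod"
```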


@ -1,75 +0,0 @@
# Serving a trained model with arena
You can use arena to deploy a trained model and access it through a RESTful API. To illustrate, we use the sample project [fast-style-transfer](https://github.com/floydhub/fast-style-transfer); to save time, we take the model this project has already trained and bake it into the docker image.
### 1. Deploy the trained model
The script app.py in the project starts a RESTful server. You can deploy the model with the following command:
```
# arena serve custom \
--name=fast-style-transfer \
--gpus=1 \
--version=alpha \
--replicas=1 \
--restful-port=5000 \
--image=happy365/fast-style-transfer:latest \
"python app.py"
```
Check the status of the serving job:
```
# arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
fast-style-transfer CUSTOM alpha 1 0 172.21.8.94 grpc:8001,restful:5000
```
Because the docker image is fairly large, pulling it takes some time. We can use "kubectl" to check the pod status:
```
# kubectl get po
NAME READY STATUS RESTARTS AGE
fast-style-transfer-alpha-custom-serving-845ffbf7dd-btbhj 0/1 ContainerCreating 0 6m44s
```
### 2. Access the service
We can use a container equipped with the curl command as a client to access the service we just created. First, create the client:
```
# kubectl run sample-client \
--generator=run-pod/v1 \
--image=happy365/arena-serve-custem-sample-client:latest \
--command -- \
/bin/sleep infinity
```
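Note: the `--generator` flag has been removed from newer versions of kubectl, where `kubectl run` creates a bare pod by default; on recent clusters the equivalent command is simply:
```
# kubectl run sample-client \
    --image=happy365/arena-serve-custem-sample-client:latest \
    --command -- \
    /bin/sleep infinity
```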
Then query the client's status:
```
# kubectl get po sample-client
NAME READY STATUS RESTARTS AGE
sample-client 1/1 Running 0 87s
```
Before accessing the custom service from the client, we need to look up the service name, which combines the job name and the version (in this example, the job name is fast-style-transfer and the version is alpha):
```
# kubectl get svc fast-style-transfer-alpha
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fast-style-transfer-alpha ClusterIP 172.21.1.114 <none> 5000/TCP 31m
```
Now we can use kubectl exec to enter the container:
```
# kubectl exec -ti sample-client /bin/sh
#
```
Then, inside the container, use the curl command to access the custom service created by arena:
```
# curl -o /root/output/beijing_out.jpg -F "file=@/root/input/beijing.jpg" http://fast-style-transfer-alpha:5000
```
In the command above, the input file is "beijing.jpg" ![beijing.jpg](15-custom-serving-sample-beijing.jpg), stored under "/root/input", and the output is written to "/root/output/beijing_out.jpg". Now exit the container and run kubectl cp on the master node to copy the result out of the container:
```
# kubectl cp sample-client:/root/output/beijing_out.jpg ~/beijing_out.jpg
```
The image "beijing_out.jpg" ![beijing_out.jpg](15-custom-serving-sample-beijing_out.jpg) will be copied to the current user's home directory.
