Compare commits

415 Commits

Author SHA1 Message Date
dependabot[bot] f8ee31410c
chore(deps): bump actions/setup-java from 4 to 5 (#1366)
Bumps [actions/setup-java](https://github.com/actions/setup-java) from 4 to 5.
- [Release notes](https://github.com/actions/setup-java/releases)
- [Commits](https://github.com/actions/setup-java/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-java
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-26 02:37:19 +00:00
dependabot[bot] ec5255280c
chore(deps): bump actions/checkout from 4 to 5 (#1359)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:36:12 +00:00
dependabot[bot] d1f7be63ab
chore(deps): bump actions/download-artifact from 4 to 5 (#1356)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-08-14 03:35:12 +00:00
dependabot[bot] a190ca253b
chore(deps): bump github.com/spf13/pflag from 1.0.6 to 1.0.7 (#1352)
---
updated-dependencies:
- dependency-name: github.com/spf13/pflag
  dependency-version: 1.0.7
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-23 05:34:59 +00:00
dependabot[bot] 695c2c67f0
chore(deps): bump golang.org/x/crypto from 0.39.0 to 0.40.0 (#1351)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.39.0 to 0.40.0.
- [Commits](https://github.com/golang/crypto/compare/v0.39.0...v0.40.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-22 02:37:58 +00:00
Yi Chen 75ec421d62
Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4 (#1350)
* Bump golang version from 1.23.10 to 1.24.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix go vet check

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump helm.sh/helm/v3 from 3.16.3 to 3.18.4

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

* Retrieve Helm version from go.mod file

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-07-11 14:56:52 +00:00
Yi Chen 25d7b1109e
Release v0.15.1 (#1344)
* Release v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.1

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-26 06:52:17 +00:00
dependabot[bot] d2d5f77a97
chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 (#1334)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.31.0 to 0.39.0.
- [Commits](https://github.com/golang/crypto/compare/v0.31.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:30:16 +00:00
dependabot[bot] c4ccb4ca7e
chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 (#1343)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.1 to 0.65.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.1...v0.65.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-version: 0.65.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-25 06:20:15 +00:00
Yi Chen aa33dc51b7
Bump golang version from 1.22.7 to 1.23.10 (#1345)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-25 06:06:16 +00:00
Yi Chen 9e84dad37a
Fix golangci-lint issues (#1341)
* Bump golangci-lint version from v1.57.2 to v2.1.6

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add golangci-lint.yaml

Signed-off-by: Yi Chen <github@chenyicn.net>

* Fix golangci-lint issues

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 17:04:14 +00:00
Yi Chen c9d5653de3
Add support for configuring tolerations (#1337)
* Add support for configuring tolerations

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add basic Helm chart unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add Helm chart unit tests to GitHub CI workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 13:01:13 +00:00
Yi Chen 4618e321ab
Update uninstall bash script (#1335)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:58:14 +00:00
Yi Chen ca7bf97da4
[CI] Add CI workflow for releasing Arena images (#1340)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:57:14 +00:00
Yi Chen 1c633d76ff
Remove kubernetes artifacts (#1329)
* Remove Kubernetes artifacts

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-23 12:53:14 +00:00
Yi Chen 3693f59663
Release v0.15.0 (#1332)
* Release v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add changelog for v0.15.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 15:12:14 +00:00
Syspretor fa2fad7d6e
Feat: support separate affinity policy configuration for PS and worke… (#1331)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 12:03:14 +00:00
Syspretor 8f4a602ce6
Feat: support affinity policy for kserve and tfjob (#1319)
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
Co-authored-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
2025-06-04 11:33:15 +00:00
Leoyzen ad85546c23
Add custom device support for kserve and kserving. (#1315)
* add custom device support for kserving.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

* add custom device support for kserve.

Signed-off-by: Leoyzen <leoyzen@gmail.com>

---------

Signed-off-by: Leoyzen <leoyzen@gmail.com>
2025-06-04 02:45:14 +00:00
Yi Chen babcb76f91
Make number of replicas of tf-operator deployment configurable (#1323)
* Make tf-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of tf-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-04 02:39:14 +00:00
Yi Chen ba7a09ace6
Make number of replicas of cron-operator deployment configurable (#1325)
* Make cron-operator replicas configurable

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make replicas of cron-operator spread out across different nodes

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove '--enable-leader-election=true' from args

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-06-03 13:16:14 +00:00
Yi Chen 545f86bfe9
Delete all services when the TFJob is terminated (#1316)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-29 12:57:19 +00:00
co63oc 568e3845f5
Fix typos in multiple files (#1310)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-13 08:56:21 +00:00
co63oc 8b84559944
Fix typos in multiple files (#1304)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-05-12 12:45:38 +00:00
Yi Chen ee2384b911
fix: service account should use release namespace (#1308)
* Use release namespace

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove namespace from cluster scoped resource

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-05-12 12:23:38 +00:00
Yi Chen 2fbb3d7ed4
feat: add new value for using localtime in cron-operator (#1296)
* feat: add new value for using localtime in cron-operator

Signed-off-by: Yi Chen <github@chenyicn.net>

* Rename localTime to useHostTimezone

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-04-03 07:31:33 +00:00
Yi Chen 19b5133e6e
refactor: use helm lib instead of helm binary (#1207)
* Delete func ListAllReleasesWithDetail

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleaseMap

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func ListReleases

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func DeleteRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add some helm util functions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func InstallRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Delete func CheckRelease

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GetChartVersion

Signed-off-by: Yi Chen <github@chenyicn.net>

* Refactor func GenerateHelmTemplate

Signed-off-by: Yi Chen <github@chenyicn.net>

* Move all helm related functions into util.go

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add missed import statements and run go mod tidy

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update copyright header

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add flag --helm-binary for forward compatibility

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 09:19:27 +00:00
Yi Chen 8d413b5861
Add stale bot to mark stale issues and PRs (#1141)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 05:14:26 +00:00
dependabot[bot] 2f6e202bbf
Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 (#1290)
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.23 to 1.7.27.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.23...v1.7.27)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-21 04:58:26 +00:00
Yi Chen f3d52fa73a
Add basic e2e tests (#1225)
* Add basic e2e tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Run go mod vendor

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-21 04:02:27 +00:00
Yi Chen ece85b8ce3
fix: job status displays incorrectly (#1289)
* fix: job status displays incorrectly

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add go unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* logging job status

Signed-off-by: Yi Chen <github@chenyicn.net>

* Adjust the order of running and queuing conditions

Signed-off-by: Yi Chen <github@chenyicn.net>

* Use constants instead of hard-coded status

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-20 09:51:27 +00:00
Yi Chen d497232013
Release v0.14.2 (#1282)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 02:26:01 +00:00
Yi Chen 9407f9b1a0
Update pytorch operator image (#1281)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-03-10 01:56:01 +00:00
co63oc d9bf195879
Fix typos (#1276)
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-03-06 03:11:39 +00:00
Yi Chen 19abf194bb
Release v0.14.1 (#1275)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 03:06:45 +00:00
Yi Chen 1f9350d78c
unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled (#1273)
* unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled

Signed-off-by: Yi Chen <github@chenyicn.net>

* Group constants into one const block

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-24 02:34:45 +00:00
Yi Chen 23e9731b52
fix: pytorchjob does not support backoff limit (#1272)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-19 06:57:41 +00:00
Yi Chen d6b177b93d
fix: format of tensorflow standalone training docs is messed up (#1265)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:18:29 +00:00
Yi Chen 0ca2670770
fix: device value does not support k8s resource quantity (#1267)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 12:17:29 +00:00
dependabot[bot] 7d7f75ad2d
Bump github.com/golang/glog from 1.2.3 to 1.2.4 (#1263)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.3...v1.2.4)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 10:25:29 +00:00
DBMxrco 4b21f7299b
docs: fixed typo (#1257)
Signed-off-by: DBMxrco <marcoflet@yahoo.com>
2025-02-12 08:34:29 +00:00
Yi Chen 36a59bba67
Release v0.14.0 (#1264)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:43:28 +00:00
Yi Chen ccdbf44815
Add changelog for v0.13.1 (#1248)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-02-12 06:34:28 +00:00
dependabot[bot] 36b17b4175
Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 (#1254)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.16.0 to 2.16.5.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.16.0...v2.16.5)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-12 06:26:29 +00:00
gujing 1058d48063
rename parameter (#1262)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2025-02-12 06:02:30 +00:00
AlanFokCo ce9c5f3bff
Update the version of elastic-job-supervisor in arena-artifacts (#1247)
Signed-off-by: AlanFokCo <892249240@qq.com>
2025-01-13 09:32:08 +00:00
Yi Chen 970afbd209
Add PyTorch mnist example (#1237)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:31:16 +00:00
Yi Chen f1bb3bcdbb
feat: add linux/arm64 support for et-operator image (#1241)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 11:00:16 +00:00
Yi Chen b814410627
feat: add linux/arm64 support for cron-operator image (#1240)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:59:16 +00:00
Yi Chen 38218aa3a0
feat: add linux/arm64 support for mpi-operator image (#1239)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 10:26:16 +00:00
Yi Chen 13fa5c8dc8
feat: add linux/arm64 support for tf-operator image (#1238)
Signed-off-by: Yi Chen <github@chenyicn.net>
2025-01-02 09:03:16 +00:00
Yi Chen f098f1af85
Release v0.13.0 (#1232)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 08:33:15 +00:00
Yi Chen b0e411cab5
Update pytorch-operator image (#1234)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-23 07:55:15 +00:00
dependabot[bot] 5e18210479
Bump github.com/stretchr/testify from 1.9.0 to 1.10.0 (#1233)
Bumps [github.com/stretchr/testify](https://github.com/stretchr/testify) from 1.9.0 to 1.10.0.
- [Release notes](https://github.com/stretchr/testify/releases)
- [Commits](https://github.com/stretchr/testify/compare/v1.9.0...v1.10.0)

---
updated-dependencies:
- dependency-name: github.com/stretchr/testify
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 13:55:12 +00:00
Yi Chen 13df29407c
Update tfjob standalone training job doc (#1222)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:29:11 +00:00
Yi Chen 0a701eb03d
Remove archived docs (#1208)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:26:12 +00:00
Yi Chen 0482946a0c
Add changelog for v0.12.1 (#1224)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-20 05:25:12 +00:00
dependabot[bot] 0d4b513d65
Bump golang.org/x/crypto from 0.29.0 to 0.31.0 (#1231)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.29.0 to 0.31.0.
- [Commits](https://github.com/golang/crypto/compare/v0.29.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:09:13 +00:00
dependabot[bot] e8b9fcd10d
Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 (#1227)
Bumps google.golang.org/protobuf from 1.35.1 to 1.36.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-20 05:02:12 +00:00
Yi Chen 190c18e840
feat: add support for torchrun (#1228)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:32:11 +00:00
Yi Chen dc0929f32f
Avoid listing jobs and statefulsets when get pytorchjob (#1229)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-12-19 11:29:11 +00:00
Yi Chen 74ade74d3e
Release v0.12.1 (#1215)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:37:29 +00:00
Yi Chen 316e33c999
Update cron operator image (#1214)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-25 11:35:29 +00:00
dependabot[bot] fc47e460e1
Bump golang.org/x/crypto from 0.28.0 to 0.29.0 (#1206)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.28.0 to 0.29.0.
- [Commits](https://github.com/golang/crypto/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-18 15:06:23 +00:00
Yi Chen 1cba9b99dc
Add docs for releasing arena (#1201)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:29:23 +00:00
Yi Chen 866ec44648
Publish releases only on master branch (#1210)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-18 12:28:23 +00:00
cheyang ac164b85bf
Support MPI Job with generic devices (#1209)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-18 03:03:22 +00:00
Qianlong d61a784a13
Fix the functionality of generating kubeconfig (#1204) (#1205)
Signed-off-by: 向先 <wangqianlong.wql@alibaba-inc.com>
Co-authored-by: 向先 <wangqianlong.wql@alibaba-inc.com>
2024-11-16 15:45:21 +00:00
dependabot[bot] 74fd3f2ad3
bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 (#1202)
---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-15 09:38:20 +00:00
TzZtzt a765b1c5a0
Fix etjob rendering error when using local logging dir (#1203)
Signed-off-by: trafalgarzzz <trafalgarz@outlook.com>
2024-11-13 06:17:17 +00:00
Yi Chen 0838d54757
Add go mod vendor check to integration test (#1198)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:23:16 +00:00
Yi Chen ca735b6152
Add changelog for v0.12.0 (#1199)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 02:11:17 +00:00
Yi Chen 969ad681a3
Update tf-operator image to fix clean pod policy issues (#1200)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-12 01:55:16 +00:00
dependabot[bot] 29b2d6d2c5
Bump mkdocs-material from 9.5.42 to 9.5.44 (#1190)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.42 to 9.5.44.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.42...9.5.44)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 06:07:15 +00:00
cheyang 22a3df5023
Support distributed serving with vendor update (#1194)
Signed-off-by: cheyang <cheyang@163.com>
2024-11-11 06:06:15 +00:00
lianhui lin 68b71f9006
Feat: add support for distributed serving type (#1187)
* Feat: support distributed serving type

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix command check

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

* Fix lint problem

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>

---------

Signed-off-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
Co-authored-by: 林联辉 <linlianhui.llh@alibaba-inc.com>
2024-11-07 10:20:12 +00:00
dependabot[bot] 70278ce8f7
Bump github.com/prometheus/common from 0.60.0 to 0.60.1 (#1182)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.60.0 to 0.60.1.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.60.0...v0.60.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 06:43:12 +00:00
dependabot[bot] 8e008a4916
Bump github.com/golang/glog from 1.2.2 to 1.2.3 (#1189)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.2.2 to 1.2.3.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.2.2...v1.2.3)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 03:12:12 +00:00
Yi Chen 46a795e3db
Fix: unable to set cleanPodPolicy to All when submitting TFJob (#1191)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:53:12 +00:00
Yi Chen 76ca05975e
Add changelog for v0.11.0 (#1181)
* Add changelog for v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* Bump version to v0.11.0

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-11-07 02:05:12 +00:00
dependabot[bot] dce03cc700
Bump mkdocs-material from 9.5.40 to 9.5.42 (#1179)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.40 to 9.5.42.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.40...9.5.42)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-24 11:52:31 +00:00
qile123 7885f46081
Support ray job (#1123)
Signed-off-by: taiku <ljh404177@alibaba-inc.com>
Co-authored-by: 泰酷 <ljh404177@alibaba-inc.com>
2024-10-24 10:34:31 +00:00
dependabot[bot] 8d6c23d14c
Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 (#1176)
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.20.4 to 1.20.5.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/v1.20.5/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.20.4...v1.20.5)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-24 05:19:30 +00:00
Yi Chen bd1b0da049
Add changelog for v0.10.1 (#1175)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-23 15:36:30 +00:00
Yi Chen e15cb18aeb
Remove redundant run_arena.sh file (#1172)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 12:51:17 +00:00
Yi Chen 82fd0ba7e5
fix: failed to sync cache due to status subresource missed in tfjob CRD (#1173)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 12:48:17 +00:00
dependabot[bot] a1b7285e1d
Bump google.golang.org/protobuf from 1.34.2 to 1.35.1 (#1163)
Bumps google.golang.org/protobuf from 1.34.2 to 1.35.1.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 12:33:17 +00:00
dependabot[bot] 522a0c610f
Bump mkdocs-material from 9.5.38 to 9.5.40 (#1166)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.38 to 9.5.40.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.38...9.5.40)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 12:32:17 +00:00
Yi Chen b8af066a2f
Migrate docker image to ACREE (#1171)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:59:16 +00:00
Yi Chen 42b8fcae2e
Add changelog for v0.10.0 (#1158)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:58:17 +00:00
Yi Chen 45c8e1b150
fix: unsupported success policy when success policy is not specified (#1170)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:57:16 +00:00
Yi Chen fdcfd18a98
fix: keep arena installer after installing the binary (#1164)
* Release v0.10.0

Signed-off-by: Yi Chen <github@chenyicn.net>

* fix: keep arena installer after installing the binary

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update tf-operator image

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-10-14 09:56:17 +00:00
dependabot[bot] 41fb18b640
Bump golang.org/x/crypto from 0.27.0 to 0.28.0 (#1162)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.27.0 to 0.28.0.
- [Commits](https://github.com/golang/crypto/compare/v0.27.0...v0.28.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-08 02:46:07 +00:00
dependabot[bot] bf49baae30
Bump github.com/prometheus/common from 0.59.1 to 0.60.0 (#1160)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.59.1 to 0.60.0.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.59.1...v0.60.0)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-08 02:43:07 +00:00
dependabot[bot] bd159b2d0f
Bump github.com/go-resty/resty/v2 from 2.15.2 to 2.15.3 (#1156)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.2 to 2.15.3.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.2...v2.15.3)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-29 10:07:37 +00:00
dependabot[bot] 7c10b6756c
Bump mkdocs-material from 9.5.36 to 9.5.38 (#1153)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.36 to 9.5.38.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.36...9.5.38)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-29 09:54:37 +00:00
Yi Chen 0d95df6f1e
Bump golang from 1.21 to 1.22.7 (#1142)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-29 09:36:37 +00:00
Yi Chen 11b771b417
Add success policy to TF training job (#1148)
* Add successPolicy field to tfjob CRD

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add successPolicy to TFJob charts

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add success-policy flags

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-29 09:30:37 +00:00
AlanFokCo 223e534b91
[Bugfix] Make PytorchJob devices format to key=value (#1155)
Signed-off-by: huozhixin.hzx <huozhixin.hzx@alibaba-inc.com>
Co-authored-by: huozhixin.hzx <huozhixin.hzx@alibaba-inc.com>
2024-09-27 08:45:36 +00:00
dependabot[bot] 7197b5cb40
Bump mkdocs-material from 9.5.35 to 9.5.36 (#1151)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.35 to 9.5.36.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.35...9.5.36)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-23 13:37:32 +00:00
dependabot[bot] b2c5686543
Bump github.com/go-resty/resty/v2 from 2.15.1 to 2.15.2 (#1150)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.1 to 2.15.2.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.1...v2.15.2)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-23 12:11:32 +00:00
dependabot[bot] dfd3268cc6
Bump github.com/go-resty/resty/v2 from 2.15.0 to 2.15.1 (#1147)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.15.0 to 2.15.1.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.15.0...v2.15.1)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-19 12:36:28 +00:00
dependabot[bot] 513894a1f0
Bump mkdocs-material from 9.5.34 to 9.5.35 (#1145)
Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.5.34 to 9.5.35.
- [Release notes](https://github.com/squidfunk/mkdocs-material/releases)
- [Changelog](https://github.com/squidfunk/mkdocs-material/blob/master/CHANGELOG)
- [Commits](https://github.com/squidfunk/mkdocs-material/compare/9.5.34...9.5.35)

---
updated-dependencies:
- dependency-name: mkdocs-material
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 16:41:28 +00:00
dependabot[bot] 064927ef5c
Bump github.com/go-resty/resty/v2 from 2.14.0 to 2.15.0 (#1143)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.14.0 to 2.15.0.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.14.0...v2.15.0)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 01:52:27 +00:00
dependabot[bot] a9ed5f6eaf
Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.4 (#1144)
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.20.0 to 1.20.4.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.20.0...v1.20.4)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-18 01:42:28 +00:00
Yi Chen b2380e60dc
Bump client-java from 10.0.1 to 11.0.1 (#1132)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-13 11:41:22 +00:00
Yi Chen bf53ba33ea
docs: fix broken links and add CI for checking document build status (#1131)
* Fix broken links in docs

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add CI for building docs

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-13 11:40:22 +00:00
dependabot[bot] 305005ebdf
Bump github.com/prometheus/common from 0.45.0 to 0.59.1 (#1138)
Bumps [github.com/prometheus/common](https://github.com/prometheus/common) from 0.45.0 to 0.59.1.
- [Release notes](https://github.com/prometheus/common/releases)
- [Changelog](https://github.com/prometheus/common/blob/main/RELEASE.md)
- [Commits](https://github.com/prometheus/common/compare/v0.45.0...v0.59.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/common
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 03:30:22 +00:00
dependabot[bot] b70297a03a
Bump github.com/kserve/kserve from 0.13.0 to 0.13.1 (#1135)
Bumps [github.com/kserve/kserve](https://github.com/kserve/kserve) from 0.13.0 to 0.13.1.
- [Release notes](https://github.com/kserve/kserve/releases)
- [Commits](https://github.com/kserve/kserve/compare/v0.13.0...v0.13.1)

---
updated-dependencies:
- dependency-name: github.com/kserve/kserve
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:46:22 +00:00
dependabot[bot] ded5780b29
Bump github.com/go-resty/resty/v2 from 2.12.0 to 2.14.0 (#1134)
Bumps [github.com/go-resty/resty/v2](https://github.com/go-resty/resty) from 2.12.0 to 2.14.0.
- [Release notes](https://github.com/go-resty/resty/releases)
- [Commits](https://github.com/go-resty/resty/compare/v2.12.0...v2.14.0)

---
updated-dependencies:
- dependency-name: github.com/go-resty/resty/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:36:22 +00:00
dependabot[bot] c1f39aba1f
Bump github.com/spf13/cobra from 1.8.0 to 1.8.1 (#1137)
Bumps [github.com/spf13/cobra](https://github.com/spf13/cobra) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/spf13/cobra/releases)
- [Commits](https://github.com/spf13/cobra/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: github.com/spf13/cobra
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:25:22 +00:00
dependabot[bot] 94fc66024f
Bump golang.org/x/crypto from 0.21.0 to 0.27.0 (#1126)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.21.0 to 0.27.0.
- [Commits](https://github.com/golang/crypto/compare/v0.21.0...v0.27.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:24:22 +00:00
dependabot[bot] c3e73610b0
Bump github.com/golang/glog from 1.1.2 to 1.2.2 (#1139)
Bumps [github.com/golang/glog](https://github.com/golang/glog) from 1.1.2 to 1.2.2.
- [Release notes](https://github.com/golang/glog/releases)
- [Commits](https://github.com/golang/glog/compare/v1.1.2...v1.2.2)

---
updated-dependencies:
- dependency-name: github.com/golang/glog
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-13 01:17:22 +00:00
Yi Chen e279bad1cf
chore: add issue templates and update dependabot config (#1140)
* Update issue and pull request templates

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update dependabot config

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add issue label bot

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 13:40:22 +00:00
Yi Chen 3409e5b1e4
Increase RSA key bit size from 1024 to 2048 (#1130)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 06:33:21 +00:00
Yi Chen 3afe470d8d
chore: remove travis and circle CI (#1129)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 06:08:21 +00:00
Yi Chen f11dae2a6f
Update Makefile and release workflow (#1128)
* Update .gitignore

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add .dockerignore

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Dockerfile for packaging arena installer

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update integration test workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add check release workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add release workflow

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Makefile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Dockerfile

Signed-off-by: Yi Chen <github@chenyicn.net>

* Make run_arena.sh executable

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-12 04:07:21 +00:00
Yi Chen a80b33508f
Bump arena Java SDK version to 1.0.8 (#1124)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-09-04 10:07:15 +00:00
lizhiboo 6c2373d32e
#1121 Support multiple type devices (#1122)
Signed-off-by: lizhiboo <lizhiboo@yeah.net>
2024-09-03 05:50:14 +00:00
yu lin b500f9eda2
Remove docker dependency (#1113)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-31 08:46:04 +00:00
Yi Chen 98a43dc6d9
Fix submitting spark training jobs and update docs (#1112)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-07-30 03:34:56 +00:00
yu lin 881780fb08
Release arena v0.9.16 (#1110)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-25 13:07:55 +00:00
yu lin 9064896a91
Fix incorrect TensorBoard images. (#1109)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-22 07:12:00 +00:00
yu lin c9dbc8f968
Support config security context for KServe (#1108)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-17 09:08:56 +00:00
yu lin 5748fe4136
Add env-from-secret to read environment variables from secret (#1107)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-17 02:09:56 +00:00
yu lin 33181529ab
Add a demo for using arena CLI in container. (#1105)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-16 03:17:55 +00:00
yu lin 5e8b6ddbff
Support setting shared memory for training job. (#1104)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-11 08:51:21 +00:00
yu lin a3a348c00a
Upgrade the kubernetes dependencies to v1.28 and go version to 1.21 (#1102)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-07-11 03:05:21 +00:00
Yi Chen 7acbb8c408
Add @ChenYi015 as Arena approvers (#1103)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-07-11 02:29:20 +00:00
yu lin 19c9090bd7
Support setting the init-container-image for pytorch-operator (#1097)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-06-18 01:36:58 +00:00
gujing 48eed0fe82
change kserve prom svc to ClusterIP (#1096)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-06-17 12:32:58 +00:00
yu lin 3926187d64
fix arena makefile and dockerfile (#1091)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 18:06:13 +08:00
yu lin 95d4bbeb94
Add license (#1090)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 08:59:34 +00:00
yu lin dbf740f8cb
Remove vendor (#1089)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-30 07:17:33 +00:00
yu lin 64808b67e6
Fix gpu-exporter and prometheus demo (#1087)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-29 06:58:15 +00:00
yu lin 37d8ab4d50
Update Arena Java SDK fastjson version (#1088)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-15 08:53:58 +00:00
yu lin a031bae968
Fix get kserve job panic (#1086)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-05-13 06:09:18 +00:00
yu lin f31e1b0be0
Release arena v0.9.15 (#1078)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-28 05:11:48 +00:00
yu lin 5034f390d2
Fix Helm template failure caused by commands containing quotes. (#1075)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-19 03:12:47 +00:00
gujing 43b60eddb7
Feature/kserve custom metrics prometheus (#1073)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-04-17 08:35:27 +00:00
yu lin 1398c8f307
Upgrade helm version to v3.13.3 (#1072)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-17 08:32:27 +00:00
gujing acac0fbb25
fix --command parameter (#1074)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-04-17 07:51:27 +00:00
yu lin 451030cfcb
Fix port cannot be allocated when submitting a tfjob using the go sdk. (#1071)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-16 07:48:51 +00:00
yu lin adb43b8d74
Release arena v0.9.14 (#1070)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-10 20:47:00 +08:00
Yi Chen fed8afc602
Update model manage documentation (#1066)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-10 09:05:29 +00:00
Yi Chen dd69d9c1af
Fix: model information does not display correctly when getting a training job (#1068)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-10 07:31:29 +00:00
yu lin 768218e8f5
Fix readthedocs build failure. (#1069)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-04-10 07:24:28 +00:00
Yi Chen d1e62ffa3a
Update model manage (#1062)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-09 11:36:28 +00:00
Yi Chen c114755222
Add support for MLflow model manage (#1058)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-04-02 02:26:22 +00:00
Yi Chen 12f205ef89
⚠️ Breaking Changes: Migrate model subcommand to model analyze (#1060)
Signed-off-by: Yi Chen <github@chenyicn.net>
2024-03-27 06:14:20 +00:00
yu lin 5ac396c7ab
Release arena v0.9.13 (#1057)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-03-18 09:06:34 +00:00
gujing 8b05634bea
support update --data in kserve serving job (#1049)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-18 08:20:34 +00:00
gujing b7f0ecf50e
support config request resources in kserve runtime (#1050)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-07 09:14:14 +00:00
gujing 57093a20fb
delete cm if job failed (#1051)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-03-07 09:10:15 +00:00
yu lin 70f4a13547
Support for updating the nodeSelector and toleration in GO SDK. (#1043)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-28 03:26:00 +00:00
yu lin d648a2a8cf
Upgrade Kubernetes version 1.26.4 and go version 1.20.12 (#1042)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-22 02:08:22 +00:00
yu lin 0a7501c542
Support Kubernetes 1.26 and KServe 0.11.2 (#1041)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-21 02:47:53 +00:00
gujing e4631c492d
Add @gujingit as Arena approvers (#1040)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
2024-02-21 02:24:53 +00:00
yu lin 6fd3d0e022
Upgrade Go version to v1.20 (#1032)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-20 10:25:40 +00:00
gujing f27a6780ce
feat: add backend param for triton serving (#1039) 2024-02-18 03:58:48 +00:00
Alex Wang ed2aea2f86
add denkensk as approver (#1038) 2024-02-05 07:18:17 +00:00
yu lin a707f81ef6
Release arena v0.9.12 (#1037)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-04 15:27:32 +08:00
yu lin 23b4fe9090
Add CI to run Go unit test. (#1035)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-02-04 14:57:26 +08:00
gujing 3e7e915c16
update tritonserver base image to nvcr.io/nvidia/tritonserver:24.01-py3 (#1036) 2024-02-04 06:55:16 +00:00
gujing 8739eb536c
Feature/inferenceservice (#1034)
* chore: update inferenceservice yaml

* chore: update copyright
2024-02-01 03:27:14 +00:00
yu lin 875d0022b5
Add CI to run the tests for Go. (#1031)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-22 02:23:02 +00:00
yu lin 1449e75f92
chore: fix go lint. (#1030)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-19 01:55:59 +00:00
yu lin 10e1e629af
chore: go fmt (#1028)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-17 14:38:54 +00:00
yu lin ff24a10944
chore: Update OWNERS (#1027)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 12:34:18 +00:00
xieydd 8db2d49353
Add xieydd as approver (#1026)
Signed-off-by: xieydd <xieydd@gmail.com>
2024-01-16 09:55:18 +00:00
yu lin cdf1bb3102
Compatible with training-operator CRD. (#1024)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 09:16:18 +00:00
yu lin 67a9150c56
Update Arena 2024 Roadmap. (#1025)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-16 17:15:25 +08:00
yu lin 0df51d7492
Add @Syulin7 to Approvers. (#1022)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-10 12:39:06 +00:00
yu lin 7f31c6b209
[Discussion] Arena 2024 Roadmap. (#1020)
Signed-off-by: Syulin7 <735122171@qq.com>
2024-01-10 12:06:07 +00:00
yu lin ce87d1095d
Fix release doc and job status. (#1011)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-12-04 14:54:33 +08:00
yu lin c4d37efa2b
Fix patch ownerReference (#1004)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-10-13 16:17:18 +08:00
yu lin a577b6d6ce
Fix incorrect job status display when kube-queue is enabled. (#1003)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-10-13 14:41:26 +08:00
yu lin 261cf3a362
Update kserve document. (#994)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-18 20:34:37 +08:00
yu lin 4dc39d6b52
Update kserve document. (#993)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-18 16:05:32 +08:00
yu lin a7e6a0fc19
Fix update triton server replicas. (#991)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-07 05:57:32 +00:00
yu lin 4afe00e05a
Fix install.sh to support control-plane label. (#989)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-09-05 17:39:41 +08:00
yu lin 46093aec39
Fix circleci. (#986)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-30 11:53:09 +08:00
yu lin bf33adad6d
Fix circleci. (#985)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-30 11:36:07 +08:00
yu lin 650d2ef0f8
Support maxSurge, livenessProbe, readinessProbe. (#983)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-29 08:04:34 +00:00
yu lin 14fa45c995
Add KServe document. (#984)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-29 06:02:34 +00:00
yu lin 2029700bd8
Update install.sh to support new label. (#982)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-28 09:41:23 +00:00
yu lin de8cb950de
Set KServe inference service version by default. (#981)
* Support KServe inference service

Signed-off-by: Syulin7 <735122171@qq.com>

* Set KServe inference service version by default.

Signed-off-by: Syulin7 <735122171@qq.com>

---------

Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-28 02:27:22 +00:00
yu lin 3fe9ae4026
Support KServe inference service (#976)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-16 09:38:04 +00:00
yu lin 81a8bf85c9
Update dependent component version. (#971)
* Update dependent component version.

Signed-off-by: Syulin7 <735122171@qq.com>

* Update dependent component version.

Signed-off-by: Syulin7 <735122171@qq.com>

* Update vendor.

Signed-off-by: Syulin7 <735122171@qq.com>

---------

Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-09 09:45:33 +00:00
yu lin 4b5c18cab9
Update golang version to 1.16 (#970)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-01 15:08:37 +08:00
yu lin 2669f364ee
Update arena version to 0.9.10 (#969)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-08-01 03:33:08 +00:00
yu lin a45f3a5fcf
Enable create secret for deepspeedjob, etjob. (#967)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-07-28 06:49:33 +00:00
yu lin a6a8f3003d
Support launcher-annotation and worker-annotation. (#968)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-07-28 03:12:33 +00:00
yu lin ce4a78dc91
fix --data-dir is not taking effect in custom-serving. (#964)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-06-28 16:18:31 +08:00
yu lin 516d8cbe7b
Update arena version to 0.9.9 (#963)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-06-27 07:00:15 +00:00
yu lin 47c4420e84
[FIX] update serve duplicate create env and toleration (#962)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-06-20 13:05:40 +00:00
yu lin 51151af1c3
fix evaluator node selector (#961)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-06-19 13:08:39 +00:00
yu lin 16c2746bfd
Support new training type deepspeed. (#960)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-06-19 11:19:39 +00:00
yu lin 37745b5610
Support job set image pull policy (#957)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-05-26 02:21:20 +00:00
yu lin 016da2a495
Fix panic when pod fails to start (#956)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-05-26 02:19:21 +00:00
yu lin c167d3ea08
Update SDK and JAVA SDK Unit test (#955)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-05-15 12:53:31 +00:00
AlanFokCo cd1f02eb57
Fix evaluatejob job yaml in charts. (#954) 2023-05-11 11:41:01 +00:00
AlanFokCo 908501acea
Move policy v1beta1 to v1. (#953)
* Move policy v1beta1 to v1.

* Use go template to define policy
2023-04-23 02:23:44 +00:00
yu lin 29298ca25a
Add DeepSpeed base image dockerfile. (#952)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-14 03:19:15 +00:00
yu lin d51fe2eecb
Support Cron tfjob set ttlAfterFinished. (#951)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-07 07:06:15 +00:00
yu lin b58010a509
Update arena version to 0.9.7 (#950)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-06 10:05:40 +00:00
yu lin eaf1e7851d
support set TTLSecondsAfterFinished in Builder (#949)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-06 09:52:40 +00:00
yu lin b3c2c7f9f3
Update arena version to 0.9.6 (#948)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-04 01:45:01 +00:00
yu lin 0c2d171290
Add ownerReference for configmap and tensorboard (#947)
* Add ownerReference for configmap and tensorboard and fix etjob gitImage

Signed-off-by: Syulin7 <735122171@qq.com>

* Update et-operator image

Signed-off-by: Syulin7 <735122171@qq.com>

---------

Signed-off-by: Syulin7 <735122171@qq.com>
2023-04-03 12:04:01 +00:00
AlanFokCo 09a57151f2
Add imagePullSecret and shareMemory for arena serve. (#945)
* Add imagePullSecret and shareMemory for arena serve.

* Add support of --load-model and --extend-command for arena serve triton.

* add usage information for extend-command
2023-03-20 03:38:10 +00:00
yu lin c3948e250d
Support TTLSecondsAfterFinished (#946)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-03-13 06:25:23 +00:00
yu lin f2780f4cea
Support TFJob StartingDeadlineSeconds (#944)
Support upgrade tfjob crd

Signed-off-by: Syulin7 <735122171@qq.com>
2023-03-09 13:00:09 +00:00
yu lin 925cac7e19
Support TFJob/PytorchJob ActiveDeadlineSeconds (#930)
Signed-off-by: Syulin7 <735122171@qq.com>
2023-03-07 07:46:11 +00:00
AlanFokCo d3e59d1703
Update java sdk to 1.0.4. (#943)
* update java sdk to 1.0.4

* modify test file
2023-03-07 06:04:11 +00:00
AlanFokCo f7e889e3f6
Fix serve update when limits is null (#935)
* Fix serve update when limits is null

* update arena version
2023-03-06 11:11:34 +00:00
AlanFokCo 5eb3b9ca7c
Fix/fix gang pod label (#934) 2023-03-06 11:10:35 +00:00
tingshua-yts 85e40e0451
patch for job supervisor (#931)
* add jobsupervisor

* add check job-supervisor in install.sh

* add resource to values

* fix with comment

* update et-operator & job-supervisor version

* modify job supervisor name

* add elastic-job-supervisor

* add new line of file

* fix name bug

* revert job-supervisor command

* modify command

* revert install.sh
2023-03-03 11:17:00 +00:00
tingshua-yts eece0452f5
add jobsupervisor charts (#916)
* add jobsupervisor

* add check job-supervisor in install.sh

* add resource to values

* fix with comment

* update et-operator & job-supervisor version

* modify job supervisor name

* add elastic-job-supervisor

* add new line of file
2023-03-03 15:20:55 +08:00
AlanFokCo 840f678201
Fix arena java sdk env bug. (#924)
* fix javasdk exec command bug

* Add terminal command when run execCommand
2023-02-28 09:55:53 +08:00
tingshua-yts e195c230d1
optimize arena submit etjob for spot (#912) 2023-02-22 09:22:40 +00:00
AlanFokCo 4c677350a9
Fix arena serve update bug. (#910) 2023-02-21 02:37:39 +00:00
AlanFokCo eed3aeb499
Fix model serve args bug. (#907)
* fix model serve args bug

* fix
2023-02-17 08:37:35 +00:00
AlanFokCo ef1ea85a59
Add toleration dedup for arena serve update. (#905) 2023-02-16 09:05:06 +00:00
AlanFokCo 3c0c15ee98
Fix/fix serve triton bugs (#902)
* fix deployment yaml bug

* update arena VERSION
2023-02-08 13:04:32 +00:00
AlanFokCo c3da54dbc4
Modify the support method of Toleration (#899)
* modify set toleration way and add toleration for update serve

* support toleration for update serve

* update charts yaml
2023-02-08 03:44:32 +00:00
AlanFokCo e11d8f7715
Arena support submit parameters useHostNetwork useHostIPC useHostPID. (#900) 2023-02-08 03:39:34 +00:00
AlanFokCo ac04ff5947
Modify helm package url in Dockerfile. (#901) 2023-02-08 02:46:32 +00:00
AlanFokCo 60d1fd4fc6
Add support for arena serve update. (#894) 2023-01-14 10:09:20 +00:00
AlanFokCo 5c113cae4f
Update arena tf and pytorch operator images. (#886) 2022-12-15 10:39:22 +00:00
AlanFokCo 9114bccc93
Support custom scheduler name (#884) 2022-12-06 13:48:09 +00:00
tingshua-yts 99e2649e45
add empty-dir and empty-dir-subpath-expr arg (#880)
* add --empty-dir && --empty-dir-subpath-expr

* add java test

* modify chart

* modify empty-dir to temp-dir
2022-12-02 09:15:16 +00:00
AlanFokCo 28663a84d1
Add display of startTime and endTime on getting and topping job. (#879) 2022-11-29 08:28:42 +00:00
AlanFokCo df37e209c5
make serve update to zero available (#877) 2022-11-23 15:33:43 +00:00
AlanFokCo 774f02adcb
Add disable tf config annotation when job is standalone. (#873)
* add disable tf config annotation when job is standalone.

* modify set annotation
2022-11-22 03:31:41 +00:00
AlanFokCo 44b6bdca06
Update images and support clean all policy for tfjob. (#871)
* update tf and pytorch operator image

* support clean all policy for tfjob
2022-11-18 04:21:21 +00:00
happy2048 5af7539d48
skip to update crd when upgrade arena-artifacts (#861) 2022-10-25 08:58:29 +00:00
happy2048 06ee271376
update version to 0.9.1 (#855) 2022-10-24 08:47:42 +00:00
happy2048 cb27b5df22
change repo from kube-ai to acs (#854) 2022-10-24 07:45:41 +00:00
tingshua-yts 0fc1ef29b1
modify restful-serving to http-serving of serve deployment service port name (#848)
* modify restful-serving to http-serving for serve svc

* modify chart version

* modify custom service metric name
2022-10-12 09:20:01 +00:00
tingshua-yts 6cbdaa3afb
Feature/add subpathexpr (#843) 2022-09-05 08:46:39 +00:00
tingshua-yts 2b5da64346
Feature/add resource limit for tfjob (#836)
* add cpu limit

* modify chart version

* add changelog
2022-08-15 02:55:14 +00:00
happy2048 9e70decd6c
refactor installation (#837) 2022-08-08 11:42:52 +00:00
tingshua-yts 7243c9bce6
add pod info (#832)
* add pod info

* update version
2022-08-04 13:08:06 +00:00
GreenHand 4145068c94
add evaluator and tensorboard to pod group (#835) 2022-08-04 06:22:06 +00:00
AlanFokCo 65072b4f6a
Add support of gpu-core in top-node command. (#826)
* Provide support of gpu-core for serving job

* Add support of gpu-core in top-node command.

* add description of gpu-core

* reformat code
2022-08-01 03:21:46 +00:00
heluocs 3ea26bbeb0
verify k8s resources valid (#821) 2022-08-01 02:41:46 +00:00
ZhixinHuo 91114cc834
Provide support of gpu-core for serving job (#824) 2022-07-27 06:07:10 +00:00
heluocs fcda9bb6f5
feature/java sdk submit training job support --label (#793)
* java sdk submit training job support --label

* upgrade fastjson
2022-06-16 01:34:43 +00:00
heluocs add6ac606f
feature/cron workload support custom labels (#784)
* cron support custom labels

* fix update serving bug

* support update serving annotations and labels
2022-06-13 07:53:59 +00:00
heluocs d4c97527ac
Feature/modeljob adapt helm3 (#767)
* fix modeljob with helm3

* upgrade cron-operator to v0.1.2
2022-04-29 03:25:56 +00:00
happy2048 447b534163
fix bug: failed to run pytorch with rdma (#757) 2022-03-15 10:21:40 +00:00
heluocs 05d75f1f05
release v0.9.0 (#756) 2022-03-14 03:13:56 +00:00
happy2048 28c8263823
fix conflict file (#752) 2022-03-14 02:43:56 +00:00
heluocs 015ff4a3a5
Feature/refactor model evaluate (#748)
* support update serving job in java sdk

* support arena model evaluate

* mark arena evaluate as deprecated

* fix get deployment bug
2022-03-14 02:42:56 +00:00
happy2048 62415ec062
fix: arena artifacts error (#755) 2022-03-14 02:33:56 +00:00
happy2048 af3504b070
upgrade helm and kubectl (#751) 2022-03-03 08:08:28 +00:00
happy2048 95356164c6
fix mpi-operator crash error (#743) 2022-01-26 08:21:43 +00:00
heluocs 8cbeadb3b2
add arena model analyze job (#742)
* add arena model analyze job, include arena model[profile|optimize|benchmark] etc.

* update the model analyze job docs
2022-01-26 07:45:43 +00:00
happy2048 37bfe9bb32
fix error when kubedl-operator already exists (#739) 2022-01-19 09:01:20 +00:00
happy2048 4f1c62cc94
support prometheus url token (#738) 2022-01-19 09:00:20 +00:00
happy2048 48c950ae04
support crd v1 and v1beta1 (#737) 2022-01-17 03:18:28 +00:00
happy2048 9066e985ea
add arena-artifacts to adapt k8s 1.22 (#735) 2022-01-10 07:01:16 +00:00
heluocs df2c3ea9de
return status of evaluatejob (#729)
* return status of evaluatejob

* java sdk support evaluate api

* fix bug where the shell default value is not set in the client
2022-01-10 03:18:15 +00:00
heluocs 60756c65ca
add --clean-task-policy for mpijob (#727)
* add --clean-task-policy for mpijob

* fix --image-pull-secret bug in args builder
2021-12-23 02:23:41 +00:00
heluocs 32d391771e
upgrade java sdk (#713)
* support custom shell

* update arena java sdk

* upgrade git-sync image to support git token
2021-12-13 03:37:36 +00:00
happy2048 d0ad02d39d
release v0.8.9 (#690) 2021-11-23 03:39:06 +00:00
Alex Wang 5a47651b12
support queued before scheduled (#708) 2021-11-15 04:00:41 +00:00
heluocs eb2933b4b0
Feature/support update deployed serving (#703)
* serving support --config-file and --shell

* support arena serve update [tensorflow|triton|custom]
2021-11-15 03:52:42 +00:00
Alex Wang b2cd6c5287
Support gang scheduling in tf job (#706)
* fix spell errors

* support coscheduling in tf job

* update doc for gang scheduling in tf job
2021-11-08 02:01:02 -08:00
happy2048 4ce2d1b7d7
read admin users from arena-configmap (#697) 2021-11-02 01:58:59 -07:00
winger 5e8ec652b9
arena top job shows wrong gpu request when chief or evaluator is enabled (#700) 2021-10-29 01:56:51 -07:00
heluocs a14ad59363
update change log (#696) 2021-10-25 23:42:43 -07:00
Kai Zhang daded93e9e
Update owners (#695)
* add happy2048 as approver, add heluocs as reviewer

* remove heluocs
2021-10-25 20:54:42 -07:00
heluocs 65f646f860
fix no write permission of /tmp in tensorflow serving container (#692)
* update model-config-file mount path

add shell param in arena go api

* update arena docs
2021-10-26 11:09:54 +08:00
happy2048 3a8be3dc5c
fix bug: gpushare does not work for serving (#689) 2021-10-12 00:39:44 -07:00
happy2048 d06b92df6b
fix bug: don't install cronjob crd when kubedl already exists (#683) 2021-09-27 23:34:10 -07:00
happy2048 1d00b1fcb7
release v0.8.8 (#681) 2021-09-27 20:37:10 -07:00
heluocs 3786dfa757
specify shell type by user with --shell in training job (#680)
* specify shell type by user with --shell in training job

* update chart version
2021-09-26 01:20:38 -07:00
happy2048 194352f5bc
fix bug: disable nvidia ENV for none gpu request job (#679)
* fix bug: disable nvidia ENV for none gpu request job

* update chart version
2021-09-26 12:34:02 +08:00
heluocs 42d6d76d3e
add evaluate job (#659) 2021-09-23 05:59:36 -07:00
Kai Zhang f506738418
update go mod and gitignore (#676) 2021-09-21 21:06:44 -07:00
heluocs ce06540d5d
feature/tensorflow serving support monitoring (#661)
* tensorflow serving support --monitoring-config-file
cron tfjob add labels
serving service support custom labels

* custom serving support metrics port
2021-09-13 01:26:01 -07:00
heluocs b67e40c456
triton support custom command (#656) 2021-08-31 20:13:44 -07:00
cheyang ed6de0d5c7
Update docs for installation, To #36006921 (#657)
Signed-off-by: cheyang <cheyang@163.com>
2021-08-27 14:05:02 +08:00
happy2048 f73ef6250e
support launcher mounts pvcs (#637) 2021-08-10 20:17:25 -07:00
cheyang 892217e74c
Update helm download repo, To #36006291 (#636)
Signed-off-by: cheyang <cheyang@163.com>
2021-08-08 11:52:38 +08:00
happy2048 c24e3d535b
release v0.8.7 (#635) 2021-08-06 05:35:43 -07:00
happy2048 70486e5e7d
fix bug: allocated gpus is 0 when using arena top node (#628) 2021-08-04 21:11:18 -07:00
happy2048 e51b97eb2b
isolate users in namespaces (#625) 2021-08-04 00:51:42 -07:00
Alex Wang 4ba23b122f
add annotation in tf job level (#617)
* add annotation in tfjob

* update chart version
2021-07-28 00:07:45 -07:00
heluocs dd265dae42
support annotations/nodeSelector/tolerations in tensorflow serving an… (#619)
* support annotations/nodeSelector/tolerations in tensorflow serving and nvidia triton

* add changelog
2021-07-28 00:00:45 -07:00
happy2048 a2bec8c2e6
release v0.8.6 (#607) 2021-07-05 01:57:51 -07:00
heluocs cb249e1285
delete previous installed cron-operator when install new version (#606) 2021-07-05 01:56:51 -07:00
happy2048 7ab7410ff0
fix bug: requestGPUMemory is missing for serving job (#605)
* fix bug: requestGPUMemory is missing for serving job

* fix doc error
2021-07-05 00:59:51 -07:00
heluocs 86cb696826
refactor cron (#603)
* refactor cron

* add PsGPU in tfjob_builder and cron_tfjob_builder
2021-07-04 20:22:50 -07:00
heluocs 5bc27110bf
Feature/add uuid and creation timestamp in job info (#599)
* refactor code and return uuid in job info

* add request cpus in serving job instance

* add uuid and creation timestamp in returned job and instance info

* add cron tfjob doc

* resolve conflict

* delete the default pvc in triton
2021-06-30 19:52:23 -07:00
happy2048 00ed936b5b
fix bug: display gpu error for serving job (#598) 2021-06-21 23:56:54 -07:00
happy2048 efa4f6d040
display gpus for serving job (#596) 2021-06-21 19:41:54 -07:00
happy2048 e2789d70d1
fix bug: install.sh not work on mac when --host-network is enabled (#595) 2021-06-21 19:40:54 -07:00
happy2048 11d09241f4
fix bug: tfserving display error (#590) 2021-06-17 04:56:28 -07:00
cheyang 3d6c09a8df
Fix istio issue (#589)
* Add sample, To #34516710

* Add sample, To #34516710

* Add sample, To #34516710

* Fix istio support for tf serving, To #34516710

* Fix istio support for tf serving, To #34516710
2021-06-16 18:56:04 +08:00
cheyang b328b87ddb
Fix istio support for tf serving, To #34516710 (#588) 2021-06-15 21:51:52 +08:00
happy2048 ea7c4ea672
fix bug: go vet error (#586) 2021-06-14 01:01:36 -07:00
happy2048 71ec536fcd
change display format for tfserving (#585) 2021-06-14 15:19:29 +08:00
cheyang eaf5106bd0
Fix tf serving typo (#584)
* Fix tf serving typo, To #34515520

Signed-off-by: cheyang <cheyang@163.com>

* Fix tf serving typo, To #34515520

Signed-off-by: cheyang <cheyang@163.com>

* Fix tf serving typo, To #34515520

Signed-off-by: cheyang <cheyang@163.com>

* Fix tf serving typo, To #34515520

Signed-off-by: cheyang <cheyang@163.com>
2021-06-13 23:14:49 +08:00
happy2048 264b96a3fd
add release notes for v0.8.5 (#577) 2021-06-07 10:56:12 +08:00
winger 77c4d32450
add "FAQ: Failed To Install Arena on Mac" to install doc (#576) 2021-06-04 02:44:36 -07:00
happy2048 f8aea8c690
update arena version to 0.8.5 (#573) 2021-06-03 23:32:36 -07:00
happy2048 641ba829b3
add uninstall script (#571) 2021-06-03 20:07:37 -07:00
uzuku 586faff1be
Fix cmd typo in installation/binary.md (#569) 2021-06-03 20:06:36 -07:00
heluocs a3f3694fae
add cron rbac (#568) 2021-05-31 19:58:04 -07:00
heluocs 3453b57f32
support nvidia triton server (#559)
* support nvidia triton server

* add nvidia triton docs
2021-05-31 19:38:04 -07:00
happy2048 3324674757
fix bug: cannot display total gpu memory (#561) 2021-05-25 03:15:09 -07:00
happy2048 4fb97ce04a
update version to 0.8.4 (#557) 2021-05-12 01:29:26 -07:00
heluocs 2fbb6080b8
support get and list cron from cache (#554) 2021-05-11 02:30:42 -07:00
happy2048 bdce431b90
change the calculation of gpushare allocated gpus (#552)
* change calculation of gpushare allocated gpus

* update version to 0.8.3
2021-05-07 03:50:07 -07:00
heluocs ed79092ffd
fix arena get tfjob instances missing created by cron (#551) 2021-05-06 19:41:06 -07:00
happy2048 07c1439ce3
fix bug: error in arena top node (#550) 2021-05-06 02:29:06 -07:00
cheyang 48129d11f9
Update v0.8.2, To #33889330 (#548) 2021-04-27 21:14:03 +08:00
heluocs 54f3d37879
support cron job (#546)
* support cron job

* delete debug msg

* check training job create time exists

* update cron type

* fix submit args imagePullSecrets error

* fix imagePullSecrets args error

* fix cron command description
2021-04-27 01:28:13 -07:00
happy2048 f4d5df00d0
add --rdma to install.sh (#543) 2021-04-21 20:00:41 -07:00
happy2048 c3f42edf10
refact install.sh (#542) 2021-04-21 19:29:42 -07:00
cheyang 20e9fe4efa
Create 0.8.1 (#541) 2021-04-18 22:21:53 +08:00
xiaozhouX 4dca70ddc9
update gpu exporter yaml, support both docker and containerd runtime (#538) 2021-04-16 23:08:14 -07:00
Alex Wang 9bba0d8c58
add gpu topology doc (#540) 2021-04-16 23:00:14 -07:00
Thomas bf8e065b68
chore: Fix broken link (#527)
* Fix broken link

* Fix broken link

* chore: Fix broken link
2021-04-14 23:11:12 -07:00
Luo Yili 86deea17c6
Add NJU PASA Lab in ADOPTERS.md (#497)
Co-authored-by: cheyang <cheyang@163.com>
2021-04-14 23:09:12 -07:00
happy2048 61fc2c159c
update client-go to v0.18.5 (#537)
update version to 0.9.0
2021-04-14 23:04:12 -07:00
Alex Wang c0fd8e46ba
mpi job support gpu topology scheduling (#535) 2021-04-13 04:35:03 -07:00
OyutianO c3eb2cf573
remove initializers args in spark-operator.yaml (#530) 2021-04-07 19:46:02 -07:00
Thomas a1c50041e1
Fix broken link (#522)
* Fix broken link

* Fix broken link
2021-04-02 04:44:20 -07:00
cheyang ba37c8a984
Fix CVE-2020-8570 (#526)
Signed-off-by: cheyang <cheyang@163.com>
2021-04-02 15:30:07 +08:00
heluocs cebb2cee7a
refactor arena java sdk (#519) 2021-03-31 04:26:18 -07:00
happy2048 551d2d058d
release v0.8.0 (#518) 2021-03-29 03:11:49 -07:00
happy2048 435b517e16
fix bug: spark job not work (#517) 2021-03-29 02:31:50 -07:00
haijohn f68991bc50
chore: do not track __pycache__, dist, build, egg-info folder (#516) 2021-03-28 23:12:49 -07:00
happy2048 4b02fa9607
add java sdk and python sdk (#505) 2021-03-26 00:59:46 -07:00
happy2048 ed0d1ab840
fix bug: error for getting training job logs when no chief pod (#501) 2021-03-15 02:34:47 -07:00
cheyang 0195afcacf
Polish description of adopter page (#503)
Signed-off-by: cheyang <cheyang@163.com>
2021-03-15 15:27:42 +08:00
Bin Fan d28009bd98
Update ADOPTERS.md (#502) 2021-03-15 12:23:23 +08:00
happy2048 8227e03805
remove unused files (#491) 2021-03-05 19:29:49 -08:00
happy2048 ad6688db30
reduce the execution time of operating serving jobs (#489) 2021-03-05 03:03:48 -08:00
happy2048 c183c9bdd4
change link of ADOPTERS.md (#485) 2021-02-28 22:42:40 -08:00
heluocs aa7450f787
add seldon core support (#486) 2021-02-28 22:36:39 -08:00
xiaozhouX 5bf35204de
fix et-operator image version (#484) 2021-02-24 02:10:50 -08:00
happy2048 b6f754f716
archived docs (#483) 2021-02-24 02:08:50 -08:00
happy2048 5f5b9de7ed
refact documentation (#482) 2021-02-23 03:16:03 -08:00
datadamon c1899105ff
Update ADOPTERS.md (#479) 2021-02-23 00:44:03 -08:00
happy2048 535193746b
complete the doc of tfjob with role sequence (#476) 2021-02-21 22:03:44 -08:00
OyutianO 2480fe92de
add cpu and mem apis for pytorchjob (#475) 2021-02-21 22:01:42 -08:00
happy2048 bf26dee3c9
optimize code to reduce execution time (#463) 2021-02-18 22:24:25 -08:00
xieydd 50c26e9fbe
Add Unisound in Adopters (#462) 2021-02-19 14:22:33 +08:00
happy2048 2720e6f5b0
add script to generate arena user (#459) 2021-02-08 01:25:12 -08:00
OyutianO 7af62c8e6c
Add cpu and memory command flag for pytorchjob (#458) 2021-02-08 01:23:12 -08:00
happy2048 4ed2d203f2
support creationTimestamp in arena get command (#457) 2021-02-08 01:21:12 -08:00
xiaozhouX ca3934b452
Feature/update et operator crd (#455)
* update et-operator crd

* update scaleout and scalein, skip install cm
2021-02-05 01:48:52 -08:00
happy2048 d1eda4e5a2
fix some bugs (#456)
fix bug: panic for arena attach command when the job has no instances

fix bug: virtual kubelet is not a real gpu node, but not skip it
2021-02-05 01:36:52 -08:00
chaowangnk1 b605824639
add microsoft as user (#453) 2021-02-03 20:26:28 -08:00
happy2048 d0e39fc8c1
support lower letters of tfjob roles (#452) 2021-02-02 23:00:28 -08:00
happy2048 2fbc4b892e
support specifying role sequence of tfjob (#451) 2021-02-01 22:14:28 -08:00
Bob.Liu 545dd90f47
Add HUYA info in adopters page (#450)
Add HUYA info in adopters page
2021-02-01 03:23:49 -08:00
cheyang 95a674686b
Add adopters page (#449)
Signed-off-by: cheyang <cheyang@163.com>
2021-01-30 22:45:19 +08:00
cheyang 3fc7c97ead
Update version (#448)
Signed-off-by: cheyang <cheyang@163.com>
2021-01-30 13:08:14 +08:00
xiaozhouX fcf0f8b387
cacheClient use dynamicMapper (#444) 2021-01-27 19:01:07 -08:00
happy2048 3559f56b57
fix bug: make sure et-operator is installed in arena-system namespace (#447) 2021-01-27 08:11:40 -08:00
xiaozhouX 386e37ba9f
fix tfjob hostNetwork, remove ps constraint (#445) 2021-01-26 23:41:39 -08:00
DavidSpek c6f5800d09
add dependabot config script (#403)
* add dependabot config script

* replace with new python script

* add main function
2021-01-24 23:56:54 -08:00
happy2048 8027003fea
add the release note v0.7.0 (#410) 2021-01-24 19:08:53 -08:00
happy2048 336c06d1c7
make sure all k8s object can get or list from cache client (#409) 2021-01-24 16:26:53 -08:00
xiaozhouX 95c9b6bdf4
update edl operator (#407)
* remove default scaler script

* remove default script for et scale job

* update et operator
2021-01-24 16:24:53 -08:00
xiaozhouX b2404c4a75
Add cacheClient in daemon mode (#404) 2021-01-22 08:03:26 -08:00
happy2048 9825b10f58
update go version of CI to 1.14.10 (#408) 2021-01-22 02:53:27 -08:00
OyutianO a43477d13c
update dependency of k8s and client-go to v1.16.9 (#401) 2021-01-19 06:43:45 -08:00
happy2048 dac8d1cf91
fix bug: error in getting all node gpu metrics (#397)
fix bug: divide 0 in 'top node' when total gpu memory is 0
2021-01-13 06:52:36 -08:00
happy2048 7eea1ab194
fix some bugs for supporting arms prometheus (#396)
fix bug: labels of aliyun arms-prometheus has been changed
fix bug: errors in getting running time of pod
fix bug: trainer type is not capitalized of arena top job
2021-01-04 01:31:55 -08:00
happy2048 0c7961c616
arena top node supports gpu metrics (#395) 2020-12-25 08:48:28 -08:00
happy2048 6e2f9973ce
update version to 0.7.0 (#394)
fix bug: kubeconfig file path includes ~

update golang docker image version to 1.14

optimize 'arena get' and 'arena top node'
2020-12-15 09:07:48 -08:00
happy2048 5b0b6f4f79
refact top node and top job (#392)
fix bug: missing double quotes

fix bug: missing --namesoace and --config
2020-12-13 00:15:27 -08:00
happy2048 692077ce91
refact serving jobs (#389) 2020-12-04 07:43:59 -08:00
happy2048 039149693e
refact sparkjob (#388) 2020-11-30 02:16:50 -08:00
happy2048 340100e9d0
refact etjob (#387) 2020-11-29 06:44:48 -08:00
happy2048 5282382ef4
refact volcanojob (#386) 2020-11-28 03:24:47 -08:00
happy2048 4ac5fe6e15
refact horovodjob (#385) 2020-11-27 07:56:48 -08:00
happy2048 fac5593c11
refact mpijob (#384) 2020-11-27 02:54:50 -08:00
happy2048 2f7e48aa52
refact pytorchjob (#383) 2020-11-26 08:10:20 -08:00
happy2048 c3e0582ab5
refact tfjob (#382) 2020-11-22 03:19:34 -08:00
happy2048 b7576c75e4
add package viper (#381) 2020-11-21 22:41:35 -08:00
happy2048 50813f7ea8
fix bug: pytorchjob don't support arena-go-sdk (#378) 2020-11-21 20:39:33 -08:00
happy2048 10f90525a8
fix bug: to fix commit(1290baf51b) error (#379) 2020-11-21 19:53:34 -08:00
xiaozhouX d12e011acb
add default value for protocol in trainingJob CR (#375) 2020-11-10 19:32:31 +08:00
happy2048 a9e5644d45
add some functions to support arena-go-sdk (#374) 2020-11-09 20:58:40 -08:00
heluocs 5b32b154a1
fix arena submit distribute training command (#366) 2020-11-07 20:17:37 -08:00
OyutianO 1290baf51b
Support GangScheduling Native in PytorchJob (#370)
* update pytorchjob version

* add pod-gropu paras in pytorchjob.yaml

* add pytrochjob gang schedule feature

* add strconv import

* add test

* add test

* add test

* add test

* add test

* add test

* add test

* Test passed

* modify CHANGELOG.md
2020-11-07 20:13:38 -08:00
happy2048 57ce0cc091
replace dep to go module (#371)
* replace dep to go module

* add GO111MODULE=off to Makefile
2020-11-07 19:39:37 -08:00
cheyang 65e9b24b09
Update v0.6.0, To #30961382 (#372) 2020-11-08 09:00:18 +08:00
Yuze Ma cc19ff2198
added specification for branch during job submitting (#365)
* added specification for branch during job submitting

* Added branch specification for tutorial in Chinese
2020-10-11 19:10:47 -07:00
xiaozhouX 267298d715
update et-controller image version (#362) 2020-10-09 01:25:06 -07:00
King 3427848463
add support etjob (#361)
* add support edljob

* replace 'EDL/edl' with 'ET/et'

* add user guide about elastic training job.
2020-09-10 00:59:45 -07:00
hwk42 aec9ffa7d7
add kfserving support feature (#358) 2020-08-27 09:07:54 -07:00
happy2048 2f76b96c15
support image regionalization (#356) 2020-08-19 06:27:12 -07:00
happy2048 f246c57316
disabled launcher of mpi job to mount pvc (#355) 2020-08-16 23:24:18 -07:00
King bf5cb7fa93
bugfix: fix typo of imagepullsecrets (#354) 2020-07-27 21:21:07 -07:00
King d0ee8aeec4
hotfix ci error about struct field tag (#351) 2020-07-24 01:02:23 -07:00
King 3732c5819a
add general function to process common params #29467442 (#349) 2020-07-24 00:04:23 -07:00
King e689b57561
load imagePullSecrets from arenaConfigs #29122384 (#348) 2020-07-22 20:09:40 -07:00
King 64667df022
add userguide about imagePullSecrets #29122384 (#347) 2020-07-22 16:25:40 -07:00
King 34c0cb4155
add support imagePullSecrets for mpi #29122384 (#346) 2020-07-20 22:59:13 -07:00
King 819d2e74ae
add support imagePullSecrets for tf #29122384 (#344) 2020-07-20 22:55:14 -07:00
King d8ea539f38
add support imagePullSecrets for pytorch #29122384 (#342) 2020-07-20 19:49:15 -07:00
King f1f6f4e694
add a common submit param for imagePullSecrets #29122384 (#345) 2020-07-19 23:48:51 -07:00
King b2d3c24274
update FAQ about pytorch (#343) 2020-07-19 20:28:50 -07:00
King 196a19acd7
add gpu memory unit for arena top node -s/-s -d #28523962 (#339) 2020-07-16 15:21:00 -07:00
cheyang e22162d6f9
Update the version (#338) 2020-07-10 07:47:22 +08:00
cheyang 70d16402e1
Add change log (#337) 2020-07-08 00:43:07 -07:00
cheyang 68662f2d90
Features/pytorch (#336)
* add install pytorch operator at install.sh #28687833 (#322)

* support submit pytorch job by git #28289726 (#324)

add pytorch-operator api & client about submiting pytorch job #28289726

submit pytorch job #28289726

workerCount - 1, because the master is also considered as a worker #28289726

update charts about submiting pytorch job #28289726

add charts, pytorch operator api&client, and support submit pytorchjob
by git  #28289726

* Query&Delete pytorchjob (#325)

* support submit pytorch job by git #28289726

add pytorch-operator api & client about submiting pytorch job #28289726

submit pytorch job #28289726

workerCount - 1, because the master is also considered as a worker #28289726

update charts about submiting pytorch job #28289726

add charts, pytorch operator api&client, and support submit pytorchjob
by git  #28289726

* support list pytorchjob

modify deps of pytorch operator support list pytorchjob

modify function setConfigDefaults of pytorch operator client

find chief pod

fix list status

fix list status, add log

support list pytorchjob #28881677

support list pytorchjob #28881677

* support get pytorchjob #28881677

* support delete pytorchjob #28887283

* remove useless code from pytorch-operator crd and add submit help info about pytorchjob #28183365 (#326)

and support pytorchjob toleration for pytorchjob

* remove types.PatchType at fake_pytorchjob.go for suppressing go test warn (#327)

* Add userguide for pytorchjob (#328)

* add tips for dataset that pvc name is not support repeated #29005224 (#329)

* add support running cleanPolicy for pytorchjob #29175794 (#334)

Co-authored-by: King <jiaqianjing@gmail.com>
2020-07-07 03:17:58 -07:00
8358 changed files with 2015526 additions and 406969 deletions


@ -1,25 +0,0 @@
# Golang CircleCI 2.0 configuration file
#
# Check https://circleci.com/docs/2.0/language-go/ for more details
version: 2
jobs:
  build:
    docker:
      - image: circleci/golang:1.10
    working_directory: /go/src/github.com/kubeflow/arena
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: false
      - run:
          name: run tests
          command: |
            test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
            go vet ./...
            go test -race -v ./...
      - run: docker build -t acs/arena:$CIRCLE_BUILD_NUM -f Dockerfile.install .
      - run:
          name: codecov
          command: |
            go test -race -coverprofile=coverage.txt -covermode=atomic ./...
            bash <(curl -s https://codecov.io/bash)

18
.dockerignore Normal file

@ -0,0 +1,18 @@
bin/
docs/
jupyter/
samples/
sdk/
.gitignore
.readthedocs.yaml
Dockerfile*
LICENSE
OWNERS
README.md
README_cn.md
ROADMAP.md
ROADMAP_cn.md
cover.out
demo.jpg
mkdocs.yml
prow_config.yaml

48
.github/ISSUE_TEMPLATE/bug_report.yaml vendored Normal file

@ -0,0 +1,48 @@
name: Bug Report
description: Tell us about a problem you are experiencing with Arena
labels: ["kind/bug", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Arena bug report!
- type: textarea
id: problem
attributes:
label: What happened?
description: |
Please provide as much info as possible.
Not doing so may result in your bug not being addressed in a timely manner.
validations:
required: true
- type: textarea
id: expected
attributes:
label: What did you expect to happen?
validations:
required: true
- type: textarea
id: environment
attributes:
label: Environment
value: |
Kubernetes version:
```bash
$ kubectl version
```
Arena version:
```bash
$ arena version
```
validations:
required: true
- type: input
id: votes
attributes:
label: Impacted by this bug?
value: Give it a 👍 We prioritize the issues with most 👍

6
.github/ISSUE_TEMPLATE/config.yaml vendored Normal file

@ -0,0 +1,6 @@
blank_issues_enabled: true
contact_links:
  - name: Arena Documentation
    url: https://arena-docs.readthedocs.io/en/stable
    about: Much help can be found in the docs


@ -0,0 +1,28 @@
name: Feature Request
description: Suggest an idea for Arena
labels: ["kind/feature", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Arena feature request!
- type: textarea
id: feature
attributes:
label: What you would like to be added?
description: |
A clear and concise description of what you want to add to Arena.
Please consider to write Arena enhancement proposal if it is a large feature request.
validations:
required: true
- type: textarea
id: rationale
attributes:
label: Why is this needed?
validations:
required: true
- type: input
id: votes
attributes:
label: Love this feature?
value: Give it a 👍 We prioritize the features with most 👍

27
.github/ISSUE_TEMPLATE/question.yaml vendored Normal file

@ -0,0 +1,27 @@
name: Question
description: Ask question about Arena
labels: ["kind/question", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this question!
- type: textarea
id: feature
attributes:
label: What question do you want to ask?
description: |
A clear and concise description of what you want to ask about Arena.
validations:
required: true
- type: textarea
id: rationale
attributes:
label: Any additional context?
validations:
required: false
- type: input
id: votes
attributes:
label: Have the same question?
value: Give it a 👍 We prioritize the question with most 👍

29
.github/PULL_REQUEST_TEMPLATE.md vendored Normal file

@ -0,0 +1,29 @@
<!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines: https://www.kubeflow.org/docs/about/contributing
2. To know more about Arena, check the developer guide:
https://arena-docs.readthedocs.io/en/latest/
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-->
## Purpose of this PR
<!-- Provide a clear and concise description of the changes. Explain the motivation behind these changes and link to relevant issues or discussions. -->
**Proposed changes:**
- <Change 1>
- <Change 2>
- <Change 3>
## Change Category
<!-- Indicate the type of change by marking the applicable boxes. -->
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] Feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that could affect existing functionality)
- [ ] Documentation update
### Rationale
<!-- Provide reasoning for the changes if not already covered in the description above. -->

26
.github/dependabot.yml vendored Normal file

@ -0,0 +1,26 @@
version: 2
updates:
  - package-ecosystem: gomod
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: maven
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: docker
    directory: /
    schedule:
      interval: daily
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: daily

5
.github/issue_label_bot.yaml vendored Normal file

@ -0,0 +1,5 @@
# For https://mlbot.net a Github bot that labels issues using KubeFlow
label-alias:
  bug: kind/bug
  feature_request: kind/feature
  question: kind/question

69
.github/workflows/check-release.yaml vendored Normal file

@ -0,0 +1,69 @@
name: Check Release
on:
  pull_request:
    branches:
      - master
    paths:
      - VERSION
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
env:
  SEMVER_PATTERN: '^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'
  ARENA_ARTIFACTS_CHART: arena-artifacts
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout source code
        uses: actions/checkout@v5
        with:
          fetch-depth: 0
      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
      - name: Check whether version matches semver pattern
        run: |
          VERSION=$(cat VERSION)
          if [[ ${VERSION} =~ ${{ env.SEMVER_PATTERN }} ]]; then
            echo "Version '${VERSION}' matches semver pattern."
          else
            echo "Version '${VERSION}' does not match semver pattern."
            exit 1
          fi
          echo "VERSION=${VERSION}" >> $GITHUB_ENV
      - name: Check arena artifacts chart version and appVersion
        run: |
          CHART_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^version:' | awk '{print $2}')
          CHART_APP_VERSION=$(cat ${{ env.ARENA_ARTIFACTS_CHART }}/Chart.yaml | grep -e '^appVersion:' | awk '{print $2}')
          if [[ ${CHART_VERSION} == ${VERSION} ]]; then
            echo "Chart version '${CHART_VERSION}' matches version '${VERSION}'."
          else
            echo "Chart version '${CHART_VERSION}' does not match version '${VERSION}'."
            exit 1
          fi
          if [[ ${CHART_APP_VERSION} == ${VERSION} ]]; then
            echo "Chart appVersion '${CHART_APP_VERSION}' matches version '${VERSION}'."
          else
            echo "Chart appVersion '${CHART_APP_VERSION}' does not match version '${VERSION}'."
            exit 1
          fi
      - name: Check if tag exists
        run: |
          git fetch --tags
          if git tag -l | grep -q "^v${VERSION}$"; then
            echo "Tag 'v${VERSION}' already exists."
            exit 1
          else
            echo "Tag 'v${VERSION}' does not exist."
          fi
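
Reviewer note: the version gate in this workflow can be tried locally before touching VERSION. The sketch below is not part of the diff. It copies the `SEMVER_PATTERN` value from the workflow and, as an assumption for portability, swaps the workflow's bash `[[ ... =~ ... ]]` test for `grep -E`; the helper name `is_release_version` is hypothetical.

```bash
#!/usr/bin/env bash
# Pattern copied from SEMVER_PATTERN in check-release.yaml.
SEMVER_PATTERN='^([0-9]+)\.([0-9]+)\.([0-9]+)(-rc\.([0-9]+))?$'

# is_release_version is a hypothetical helper, not a function in the repository.
# It succeeds only for MAJOR.MINOR.PATCH with an optional -rc.N suffix.
is_release_version() {
  printf '%s\n' "$1" | grep -Eq "${SEMVER_PATTERN}"
}

# Try a few candidate VERSION values.
for v in 0.9.0 0.10.1-rc.2 v0.9.0 0.9 0.9.0-beta.1; do
  if is_release_version "${v}"; then
    echo "${v}: matches"
  else
    echo "${v}: rejected"
  fi
done
```

Note that a leading `v` is rejected: the workflows add the `v` prefix themselves when they build the tag name.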

137
.github/workflows/integration.yaml vendored Normal file

@ -0,0 +1,137 @@
name: Integration Test
on:
pull_request:
branches:
- master
- release-*
push:
branches:
- master
- release-*
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.actor }}
cancel-in-progress: true
jobs:
build-arena:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Run go mod tidy
run: |
go mod tidy
if ! git diff --quiet; then
echo "Please run 'go mod tidy' to add missing and remove unused dependencies"
exit 1
fi
- name: Run go mod vendor
run: |
go mod vendor
if ! git diff --quiet; then
echo "Please run 'go mod vendor' to make vendored copy of dependencies"
exit 1
fi
- name: Run go fmt check
run: |
make go-fmt
if ! git diff --quiet; then
echo "Please run 'make go-fmt' to run go fmt against code"
exit 1
fi
- name: Run go vet check
run: |
make go-vet
if ! git diff --quiet; then
echo "Please run 'make go-vet' to run go vet against code"
exit 1
fi
- name: Run golangci-lint
run: |
make go-lint
- name: Run Go unit tests
run: |
make unit-test
- name: Run Helm unit tests
run: |
make helm-unittest
- name: Build arena binary
run: |
make arena
build-java-sdk:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- uses: actions/setup-java@v5
with:
distribution: zulu
java-version: 8
- name: Build Java SDK
run: |
make java-sdk
build-docs:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Build docs
run: |
pip install -r docs/requirements.txt
mkdocs build --strict
e2e-test:
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Set up Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod
- name: Set up Kind cluster
uses: helm/kind-action@v1
with:
node_image: kindest/node:v1.29.10
config: arena-artifacts/ci/kind-config.yaml
- name: Install arena client
run: |
make arena-installer
tar -zxf arena-installer-*.tar.gz
arena-installer-*/install.sh --only-binary
- name: Run e2e tests
run: |
make e2e-test

242
.github/workflows/release.yaml vendored Normal file

@ -0,0 +1,242 @@
name: Release
on:
push:
branches:
- master
paths:
- VERSION
env:
IMAGE_REGISTRY: ghcr.io
IMAGE_REPOSITORY: ${{ github.repository }}
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
package-arena-installer:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
os:
- linux
- darwin
arch:
- amd64
- arm64
steps:
- name: Checkout
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Get git commit id
run: |
COMMIT=$(git rev-parse --short HEAD)
echo "COMMIT=${COMMIT}" >>${GITHUB_ENV}
- name: Build arena installer tarball
run: |
make arena-installer OS=${{ matrix.os }} ARCH=${{ matrix.arch }}
- uses: actions/upload-artifact@v4
with:
name: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}
path: arena-installer-${{ env.VERSION }}-${{ matrix.os }}-${{ matrix.arch }}.tar.gz
if-no-files-found: error
overwrite: true
build-arena-image:
name: Build Arena container image
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
platform:
- linux/amd64
- linux/arm64
steps:
- name: Prepare
run: |
platform=${{ matrix.platform }}
echo "PLATFORM_PAIR=${platform//\//-}" >> $GITHUB_ENV
- name: Checkout source code
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> $GITHUB_ENV
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
tags: |
type=semver,pattern={{version}},value=${{ env.VERSION }}
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker buildx
uses: docker/setup-buildx-action@v3
- name: Login to container registry
uses: docker/login-action@v3
with:
registry: ${{ env.IMAGE_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push by digest
id: build
uses: docker/build-push-action@v6
with:
platforms: ${{ matrix.platform }}
labels: ${{ steps.meta.outputs.labels }}
outputs: type=image,name=${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }},push-by-digest=true,name-canonical=true,push=true
- name: Export digest
run: |
mkdir -p /tmp/digests
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
- name: Upload digest
uses: actions/upload-artifact@v4
with:
name: digests-${{ env.PLATFORM_PAIR }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
release-image:
needs:
- build-arena-image
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> $GITHUB_ENV
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}
tags: |
type=semver,pattern={{version}},value=${{ env.VERSION }}
- name: Download digests
uses: actions/download-artifact@v5
with:
path: /tmp/digests
pattern: digests-*
merge-multiple: true
- name: Set up Docker buildx
uses: docker/setup-buildx-action@v3
- name: Login to container registry
uses: docker/login-action@v3
with:
registry: ${{ env.IMAGE_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Create manifest list and push
working-directory: /tmp/digests
run: |
docker buildx imagetools create $(jq -cr '.tags | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON") \
$(printf '${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}@sha256:%s ' *)
- name: Inspect image
run: |
docker buildx imagetools inspect ${{ env.IMAGE_REGISTRY }}/${{ env.IMAGE_REPOSITORY }}:${{ steps.meta.outputs.version }}
push_tag:
needs:
- package-arena-installer
- release-image
runs-on: ubuntu-latest
steps:
- name: Checkout source code
uses: actions/checkout@v5
with:
fetch-depth: 0
- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Create and push tag
run: |
TAG="v${VERSION}"
git tag -a ${TAG} -m "Release v${VERSION}"
git push origin ${TAG}
draft_release:
needs:
- push_tag
permissions:
contents: write
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v5
- name: Configure Git
run: |
git config user.name "$GITHUB_ACTOR"
git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
- name: Read version from VERSION file
run: |
VERSION=$(cat VERSION)
echo "VERSION=${VERSION}" >> ${GITHUB_ENV}
- name: Download arena installer tarballs
uses: actions/download-artifact@v5
with:
pattern: arena-installer-${{ env.VERSION }}-{linux,darwin}-{amd64,arm64}
- name: Release
uses: softprops/action-gh-release@v2
with:
token: ${{ secrets.GITHUB_TOKEN }}
tag_name: v${{ env.VERSION }}
prerelease: ${{ contains(env.VERSION, 'rc') }}
target_commitish: ${{ github.sha }}
draft: true
files: |
arena-installer-*/arena-installer-*.tar.gz
fail_on_unmatched_files: true
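
Reviewer note: the per-platform image jobs hand their digests to `release-image` through artifact names and empty marker files, which relies on two small string transformations. The sketch below is not part of the diff. The function names are hypothetical, and `tr` is used as a portable stand-in for the bash substitution `${platform//\//-}` that the workflow itself uses.

```bash
#!/usr/bin/env bash
# Sketch of the string handling behind the "Prepare" and "Export digest" steps.
# Function names are hypothetical; only the transformations mirror the workflow.

# linux/amd64 -> linux-amd64, so the value can be used in an artifact name.
platform_pair() {
  printf '%s\n' "$1" | tr '/' '-'
}

# sha256:<hex> -> <hex>, which the workflow uses as the name of an empty file
# under /tmp/digests.
digest_filename() {
  printf '%s\n' "${1#sha256:}"
}

platform_pair "linux/arm64"
digest_filename "sha256:0123abcd"
```

In the `release-image` job, the `printf '...@sha256:%s ' *` call then turns each marker file name back into a full image reference for `docker buildx imagetools create`.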

43
.github/workflows/stale.yaml vendored Normal file

@ -0,0 +1,43 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests
on:
schedule:
- cron: "0 0 * * 0"
jobs:
stale:
runs-on: ubuntu-latest
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v9
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
days-before-stale: 360
days-before-close: 180
stale-issue-message: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-issue-message: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-issue-label: lifecycle/stale
exempt-issue-labels: lifecycle/frozen
stale-pr-message: >
This pull request has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-pr-message: >
This pull request has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-pr-label: lifecycle/stale
exempt-pr-labels: lifecycle/frozen
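
Note on the schedule above: with `days-before-stale: 360`, `days-before-close: 180` and a weekly cron (`0 0 * * 0`, Sundays at 00:00 UTC), an untouched issue is labeled on the first Sunday run after 360 idle days and closed on the first Sunday run after a further 180 idle days. The Python sketch below estimates those dates. The function names are hypothetical, and the sketch assumes the close window starts on the day the stale label is applied and that no exempt label or new activity intervenes.

```python
from datetime import date, timedelta

DAYS_BEFORE_STALE = 360  # days-before-stale in the workflow
DAYS_BEFORE_CLOSE = 180  # days-before-close in the workflow


def next_sunday_run(day: date) -> date:
    """Return the first Sunday on or after `day` (cron "0 0 * * 0")."""
    # date.weekday(): Monday == 0 ... Sunday == 6
    return day + timedelta(days=(6 - day.weekday()) % 7)


def stale_timeline(last_activity: date) -> tuple[date, date]:
    """Estimate when an untouched issue is labeled stale and then closed.

    Assumes the close window is counted from the day the label is applied.
    """
    stale_on = next_sunday_run(last_activity + timedelta(days=DAYS_BEFORE_STALE))
    close_on = next_sunday_run(stale_on + timedelta(days=DAYS_BEFORE_CLOSE))
    return stale_on, close_on


if __name__ == "__main__":
    stale_on, close_on = stale_timeline(date(2025, 1, 1))
    print("labeled stale:", stale_on)
    print("closed:", close_on)
```

Because the job only runs weekly, the effective delays can be up to six days longer than the configured values.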

27
.gitignore vendored

@ -1,6 +1,25 @@
bin/
**/*.tgz
**/.DS_Store
.idea
.kube
Library
public/
site/
tmp/
sdk/arena-python-sdk/dist/
sdk/arena-python-sdk/build/
sdk/arena-python-sdk/arenasdk.egg-info/
.hugo_build.lock
.kube
*.tgz
*.tar.gz
# Python
__pycache__/
# Go
cover.out
# IDE files
.idea/
.vscode/
# MacOS
.DS_Store

76
.golangci.yaml Normal file

@ -0,0 +1,76 @@
version: "2"
run:
# Timeout for total work, e.g. 30s, 5m, 5m30s.
# If the value is lower or equal to 0, the timeout is disabled.
# Default: 0 (disabled)
timeout: 2m
linters:
# Enable specific linters.
# https://golangci-lint.run/usage/linters/#enabled-by-default
enable:
# Detects places where loop variables are copied.
- copyloopvar
# Checks for duplicate words in the source code.
- dupword
# Tool for detection of FIXME, TODO and other comment keywords.
# - godox
# Enforces consistent import aliases.
- importas
# Find code that shadows one of Go's predeclared identifiers.
- predeclared
# Check that struct tags are well aligned.
- tagalign
# Remove unnecessary type conversions.
- unconvert
# Checks Go code for unused constants, variables, functions and types.
- unused
# Disable specific linters.
disable:
# Errcheck is a program for checking for unchecked errors in Go code.
- errcheck
settings:
importas:
# List of aliases
alias:
- pkg: k8s.io/api/admissionregistration/v1
alias: admissionregistrationv1
- pkg: k8s.io/api/apps/v1
alias: appsv1
- pkg: k8s.io/api/batch/v1
alias: batchv1
- pkg: k8s.io/api/core/v1
alias: corev1
- pkg: k8s.io/api/extensions/v1beta1
alias: extensionsv1beta1
- pkg: k8s.io/api/networking/v1
alias: networkingv1
- pkg: k8s.io/apimachinery/pkg/apis/meta/v1
alias: metav1
- pkg: sigs.k8s.io/controller-runtime
alias: ctrl
exclusions:
# Which file paths to exclude: they will be analyzed, but issues from them won't be reported.
# "/" will be replaced by the current OS file path separator to properly work on Windows.
# Default: []
paths:
- pkg/operators
issues:
# Maximum issues count per one linter.
# Set to 0 to disable.
# Default: 50
max-issues-per-linter: 50
# Maximum count of issues with the same text.
# Set to 0 to disable.
# Default: 3
max-same-issues: 10
formatters:
enable:
# Check import statements are formatted according to the 'goimport' command.
- goimports
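
Note on the `importas` settings above: the alias table asks that, for example, `k8s.io/api/core/v1` be imported as `corev1` whenever an alias is given. The Python sketch below illustrates that rule on single Go import lines. It is not how golangci-lint implements the check; the names `ALIASES`, `IMPORT_LINE` and `check_import` are hypothetical, only a subset of the configured aliases is listed, and unaliased imports are assumed to be allowed because the config does not set `no-unaliased`.

```python
import re
from typing import Optional

# Subset of the alias table from the importas settings above.
ALIASES = {
    "k8s.io/api/core/v1": "corev1",
    "k8s.io/apimachinery/pkg/apis/meta/v1": "metav1",
    "sigs.k8s.io/controller-runtime": "ctrl",
}

# Matches one line of a Go import block: an optional alias, then a quoted path.
IMPORT_LINE = re.compile(r'^\s*(?:(\w+)\s+)?"([^"]+)"\s*$')


def check_import(line: str) -> Optional[str]:
    """Return a message if `line` imports a listed package under another alias."""
    match = IMPORT_LINE.match(line)
    if match is None:
        return None
    alias, path = match.groups()
    expected = ALIASES.get(path)
    # Unlisted packages, unaliased imports and correct aliases all pass.
    if expected is None or alias is None or alias == expected:
        return None
    return f'import "{path}" should be aliased as {expected}, not {alias}'


if __name__ == "__main__":
    for line in ('v1 "k8s.io/api/core/v1"', 'corev1 "k8s.io/api/core/v1"'):
        print(repr(line), "->", check_import(line))
```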

23
.readthedocs.yaml Normal file

@ -0,0 +1,23 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
mkdocs:
configuration: mkdocs.yml
# Optionally build your docs in additional formats such as PDF
formats:
- pdf
# Optionally set the version of Python and requirements required to build your docs
python:
install:
- requirements: docs/requirements.txt


@ -1,16 +0,0 @@
language: go
go:
- "1.10"
go_import_path: github.com/kubeflow/arena
# let us have speedy Docker-based Travis workers
sudo: false
script:
- go build -o bin/arena cmd/arena/*.go
- go vet ./...
- go test -v ./...
- test -z "$(go fmt ./... 2>/dev/null | tee /dev/stderr)" || (echo "please format Go code with 'gofmt'")
- go test -race -v ./...


@ -1,46 +1,236 @@
## [Release 0.3.0]
# Changelog
### Added
## [v0.15.1](https://github.com/kubeflow/arena/tree/v0.15.1) (2025-06-25)
- Add Priority class support for MPIJob and TFJob
- Display Unhealthy GPU devices
- Integrate GPUShare capablities
- Upgrade TFJob to V1 (commit id: d746bde)
- Add Customize Serving
- Add GPUsharing features for Serving Job
### Features
## [Release 0.2.0]
- Add support for configuring tolerations ([#1337](https://github.com/kubeflow/arena/pull/1337) by [@ChenYi015](https://github.com/ChenYi015))
### Added
### Misc
- Add spark and volcano Job
- Add multiple users and add PodSecurityContext for Training Job
- Add TensorRT
- Remove kubernetes artifacts ([#1329](https://github.com/kubeflow/arena/pull/1329) by [@ChenYi015](https://github.com/ChenYi015))
- [CI] Add CI workflow for releasing Arena images ([#1340](https://github.com/kubeflow/arena/pull/1340) by [@ChenYi015](https://github.com/ChenYi015))
- Update uninstall bash script ([#1335](https://github.com/kubeflow/arena/pull/1335) by [@ChenYi015](https://github.com/ChenYi015))
- Fix golangci-lint issues ([#1341](https://github.com/kubeflow/arena/pull/1341) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang version from 1.22.7 to 1.23.10 ([#1345](https://github.com/kubeflow/arena/pull/1345) by [@ChenYi015](https://github.com/ChenYi015))
- chore(deps): bump github.com/prometheus/common from 0.60.1 to 0.65.0 ([#1343](https://github.com/kubeflow/arena/pull/1343) by [@dependabot[bot]](https://github.com/apps/dependabot))
- chore(deps): bump golang.org/x/crypto from 0.38.0 to 0.39.0 ([#1334](https://github.com/kubeflow/arena/pull/1334) by [@dependabot[bot]](https://github.com/apps/dependabot))
### Changed
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.15.0...v0.15.1)
- Refactoring code to remove dependency of helm create
- Enhance cluster management
## [v0.15.0](https://github.com/kubeflow/arena/tree/v0.15.0) (2025-06-04)
## [Release 0.1.0]
### Features
### Added
- refactor: use helm lib instead of helm binary ([#1207](https://github.com/kubeflow/arena/pull/1207) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add new value for using localtime in cron-operator ([#1296](https://github.com/kubeflow/arena/pull/1296) by [@ChenYi015](https://github.com/ChenYi015))
- Delete all services when the TFJob is terminated ([#1316](https://github.com/kubeflow/arena/pull/1316) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of cron-operator deployment configurable ([#1325](https://github.com/kubeflow/arena/pull/1325) by [@ChenYi015](https://github.com/ChenYi015))
- Make number of replicas of tf-operator deployment configurable ([#1323](https://github.com/kubeflow/arena/pull/1323) by [@ChenYi015](https://github.com/ChenYi015))
- Add custom device support for kserve and kserving. ([#1315](https://github.com/kubeflow/arena/pull/1315) by [@Leoyzen](https://github.com/Leoyzen))
- Feat: support affinity policy for kserve and tfjob ([#1319](https://github.com/kubeflow/arena/pull/1319) by [@Syspretor](https://github.com/Syspretor))
- Feat: support separate affinity policy configuration for PS and worke… ([#1331](https://github.com/kubeflow/arena/pull/1331) by [@Syspretor](https://github.com/Syspretor))
- Add TFJob v1alpha2 for Solo/Distributed Training, and support binpack and spread mode
- Add Download Source Code from Git for Training
- Add Tensorboard
- Add top node/job for checking GPU allocations in Kubernetes
- Add MPIJob v1alpha1 for Solo/Distributed Training
- Add gang scheduling support for TFJob
- Add Data
- Add RDMA support
### Bug Fixes
### Changed
- fix: job status displays incorrectly ([#1289](https://github.com/kubeflow/arena/pull/1289) by [@ChenYi015](https://github.com/ChenYi015))
- fix: service account should use release namespace ([#1308](https://github.com/kubeflow/arena/pull/1308) by [@ChenYi015](https://github.com/ChenYi015))
### Removed
### Misc
### Fixed
- Add basic e2e tests ([#1225](https://github.com/kubeflow/arena/pull/1225) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/containerd/containerd from 1.7.23 to 1.7.27 ([#1290](https://github.com/kubeflow/arena/pull/1290) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Add stale bot to mark stale issues and PRs ([#1141](https://github.com/kubeflow/arena/pull/1141) by [@ChenYi015](https://github.com/ChenYi015))
- Fix typos in multiple files ([#1304](https://github.com/kubeflow/arena/pull/1304) by [@co63oc](https://github.com/co63oc))
- Fix typos in multiple files ([#1310](https://github.com/kubeflow/arena/pull/1310) by [@co63oc](https://github.com/co63oc))
### Deprecated
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.2...v0.15.0)
- HorovodJob is going to remove when MPIJob is production ready
## [v0.14.2](https://github.com/kubeflow/arena/tree/v0.14.2) (2025-03-10)
### Misc
- Fix typos ([#1276](https://github.com/kubeflow/arena/pull/1276) by [@co63oc](https://github.com/co63oc))
- Update pytorch operator image ([#1281](https://github.com/kubeflow/arena/pull/1281) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.1...v0.14.2)
## [v0.14.1](https://github.com/kubeflow/arena/tree/v0.14.1) (2025-02-24)
### Bug Fixes
- fix: device value does not support k8s resource quantity ([#1267](https://github.com/kubeflow/arena/pull/1267) by [@ChenYi015](https://github.com/ChenYi015))
- fix: pytorchjob does not support backoff limit ([#1272](https://github.com/kubeflow/arena/pull/1272) by [@ChenYi015](https://github.com/ChenYi015))
- unset env NVIDIA_VISIBLE_DEVICES when gpushare is enabled ([#1273](https://github.com/kubeflow/arena/pull/1273) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- docs: fixed typo ([#1257](https://github.com/kubeflow/arena/pull/1257) by [@DBMxrco](https://github.com/DBMxrco))
- Bump github.com/golang/glog from 1.2.3 to 1.2.4 ([#1263](https://github.com/kubeflow/arena/pull/1263) by [@dependabot[bot]](https://github.com/apps/dependabot))
- fix: format of tensorflow standalone training docs is messed up ([#1265](https://github.com/kubeflow/arena/pull/1265) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.14.0...v0.14.1)
## [v0.14.0](https://github.com/kubeflow/arena/tree/v0.14.0) (2025-02-12)
### Features
- rename parameter ([#1262](https://github.com/kubeflow/arena/pull/1262) by [@gujingit](https://github.com/gujingit))
### Misc
- Add changelog for v0.13.1 ([#1248](https://github.com/kubeflow/arena/pull/1248) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/go-resty/resty/v2 from 2.16.0 to 2.16.5 ([#1254](https://github.com/kubeflow/arena/pull/1254) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.1...v0.14.0)
## [v0.13.1](https://github.com/kubeflow/arena/tree/v0.13.1) (2025-01-13)
### Misc
- feat: add linux/arm64 support for tf-operator image ([#1238](https://github.com/kubeflow/arena/pull/1238) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for mpi-operator image ([#1239](https://github.com/kubeflow/arena/pull/1239) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for cron-operator image ([#1240](https://github.com/kubeflow/arena/pull/1240) by [@ChenYi015](https://github.com/ChenYi015))
- feat: add linux/arm64 support for et-operator image ([#1241](https://github.com/kubeflow/arena/pull/1241) by [@ChenYi015](https://github.com/ChenYi015))
- Add PyTorch mnist example ([#1237](https://github.com/kubeflow/arena/pull/1237) by [@ChenYi015](https://github.com/ChenYi015))
- Update the version of elastic-job-supervisor in arena-artifacts ([#1247](https://github.com/kubeflow/arena/pull/1247) by [@AlanFokCo](https://github.com/AlanFokCo))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.13.0...v0.13.1)
## [v0.13.0](https://github.com/kubeflow/arena/tree/v0.13.0) (2024-12-23)
### New Features
- feat: add support for torchrun ([#1228](https://github.com/kubeflow/arena/pull/1228) by [@ChenYi015](https://github.com/ChenYi015))
- Update pytorch-operator image ([#1234](https://github.com/kubeflow/arena/pull/1234) by [@ChenYi015](https://github.com/ChenYi015))
### Bug Fix
- Avoid listing jobs and statefulsets when get pytorchjob ([#1229](https://github.com/kubeflow/arena/pull/1229) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Update tfjob standalone training job doc ([#1222](https://github.com/kubeflow/arena/pull/1222) by [@ChenYi015](https://github.com/ChenYi015))
- Remove archived docs ([#1208](https://github.com/kubeflow/arena/pull/1208) by [@ChenYi015](https://github.com/ChenYi015))
- Add changelog for v0.12.1 ([#1224](https://github.com/kubeflow/arena/pull/1224) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.29.0 to 0.31.0 ([#1231](https://github.com/kubeflow/arena/pull/1231) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.35.1 to 1.36.0 ([#1227](https://github.com/kubeflow/arena/pull/1227) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.12.1...v0.13.0)
## [v0.12.1](https://github.com/kubeflow/arena/tree/v0.12.1) (2024-11-25)
### New Features
- Support MPI Job with generic devices ([#1209](https://github.com/kubeflow/arena/pull/1209) by [@cheyang](https://github.com/cheyang))
### Bug Fix
- Update tf-operator image to fix clean pod policy issues ([#1200](https://github.com/kubeflow/arena/pull/1200) by [@ChenYi015](https://github.com/ChenYi015))
- Fix etjob rendering error when using local logging dir ([#1203](https://github.com/kubeflow/arena/pull/1203) by [@TrafalgarZZZ](https://github.com/TrafalgarZZZ))
- Fix the functionality of generating kubeconfig (#1204) ([#1205](https://github.com/kubeflow/arena/pull/1205) by [@wqlparallel](https://github.com/wqlparallel))
- Update cron operator image ([#1214](https://github.com/kubeflow/arena/pull/1214) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Add changelog for v0.12.0 ([#1199](https://github.com/kubeflow/arena/pull/1199) by [@ChenYi015](https://github.com/ChenYi015))
- Add go mod vendor check to integration test ([#1198](https://github.com/kubeflow/arena/pull/1198) by [@ChenYi015](https://github.com/ChenYi015))
- bump github.com/go-resty/resty/v2 from 2.15.3 to 2.16.0 ([#1202](https://github.com/kubeflow/arena/pull/1202) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Publish releases only on master branch ([#1210](https://github.com/kubeflow/arena/pull/1210) by [@ChenYi015](https://github.com/ChenYi015))
- Add docs for releasing arena ([#1201](https://github.com/kubeflow/arena/pull/1201) by [@ChenYi015](https://github.com/ChenYi015))
- Bump golang.org/x/crypto from 0.28.0 to 0.29.0 ([#1206](https://github.com/kubeflow/arena/pull/1206) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.1 ([#1215](https://github.com/kubeflow/arena/pull/1215) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/29b2d6d2...v0.12.1)
## [v0.12.0](https://github.com/kubeflow/arena/tree/v0.12.0) (2024-11-11)
### New Features
- Feat: add support for distributed serving type ([#1187](https://github.com/kubeflow/arena/pull/1187) by [@linnlh](https://github.com/linnlh))
- Support distributed serving with vendor update ([#1194](https://github.com/kubeflow/arena/pull/1194) by [@cheyang](https://github.com/cheyang))
### Misc
- Bump github.com/golang/glog from 1.2.2 to 1.2.3 ([#1189](https://github.com/kubeflow/arena/pull/1189) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.60.0 to 0.60.1 ([#1182](https://github.com/kubeflow/arena/pull/1182) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.42 to 9.5.44 ([#1190](https://github.com/kubeflow/arena/pull/1190) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.12.0 ([#1197](https://github.com/kubeflow/arena/pull/1197) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/46a795e3...v0.12.0)
## [v0.11.0](https://github.com/kubeflow/arena/tree/v0.11.0) (2024-10-24)
### New Features
- Support ray job ([#1123](https://github.com/kubeflow/arena/pull/1123) by [@qile123](https://github.com/qile123))
### Misc
- Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 ([#1176](https://github.com/kubeflow/arena/pull/1176) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.40 to 9.5.42 ([#1179](https://github.com/kubeflow/arena/pull/1179) by [@dependabot[bot]](https://github.com/apps/dependabot))
[Full Changelog](https://github.com/kubeflow/arena/compare/e15cb18...v0.11.0)
## [v0.10.1](https://github.com/kubeflow/arena/tree/v0.10.1) (2024-10-14)
### Bug Fixes
- fix: keep arena installer after installing the binary ([#1164](https://github.com/kubeflow/arena/pull/1164) by [@ChenYi015](https://github.com/ChenYi015))
- fix: unsupported success policy when success policy is not specified ([#1170](https://github.com/kubeflow/arena/pull/1170) by [@ChenYi015](https://github.com/ChenYi015))
- fix: failed to sync cache due to status subresouce missed in tfjob CRD ([#1173](https://github.com/kubeflow/arena/pull/1173) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Bump github.com/prometheus/common from 0.59.1 to 0.60.0 ([#1160](https://github.com/kubeflow/arena/pull/1160) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.27.0 to 0.28.0 ([#1162](https://github.com/kubeflow/arena/pull/1162) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Migrate docker image to ACREE ([#1171](https://github.com/kubeflow/arena/pull/1171) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.38 to 9.5.40 ([#1166](https://github.com/kubeflow/arena/pull/1166) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump google.golang.org/protobuf from 1.34.2 to 1.35.1 ([#1163](https://github.com/kubeflow/arena/pull/1163) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Remove redundant run_arena.sh file ([#1172](https://github.com/kubeflow/arena/pull/1172) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.10.0...v0.10.1)
## [v0.10.0](https://github.com/kubeflow/arena/tree/v0.10.0) (2024-09-29)
### New Features
- Support multiple type devices ([#1122](https://github.com/kubeflow/arena/pull/1122) by [@lizhiboo](https://github.com/lizhiboo))
- Increase RSA key bit size from 1024 to 2048 ([#1130](https://github.com/kubeflow/arena/pull/1130) by [@ChenYi015](https://github.com/ChenYi015))
- Add success policy to TF training job ([#1148](https://github.com/kubeflow/arena/pull/1148) by [@ChenYi015](https://github.com/ChenYi015))
### Bug Fixes
- Fix submitting spark training jobs and update docs ([#1112](https://github.com/kubeflow/arena/pull/1112) by [@ChenYi015](https://github.com/ChenYi015))
- docs: fix broken links and add CI for checking document build status ([#1131](https://github.com/kubeflow/arena/pull/1131) by [@ChenYi015](https://github.com/ChenYi015))
- [Bugfix] Make PytorchJob devices format to key=value ([#1155](https://github.com/kubeflow/arena/pull/1155) by [@AlanFokCo](https://github.com/AlanFokCo))
### SDK
- Bump arena Java SDK version to 1.0.8 ([#1124](https://github.com/kubeflow/arena/pull/1124) by [@ChenYi015](https://github.com/ChenYi015))
### Misc
- Remove docker dependency ([#1113](https://github.com/kubeflow/arena/pull/1113) by [@Syulin7](https://github.com/Syulin7))
- Update Makefile and release workflow ([#1128](https://github.com/kubeflow/arena/pull/1128) by [@ChenYi015](https://github.com/ChenYi015))
- chore: remove travis and circle CI ([#1129](https://github.com/kubeflow/arena/pull/1129) by [@ChenYi015](https://github.com/ChenYi015))
- chore: add issue templates and update depenabot bot ([#1140](https://github.com/kubeflow/arena/pull/1140) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/golang/glog from 1.1.2 to 1.2.2 ([#1139](https://github.com/kubeflow/arena/pull/1139) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang.org/x/crypto from 0.21.0 to 0.27.0 ([#1126](https://github.com/kubeflow/arena/pull/1126) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/spf13/cobra from 1.8.0 to 1.8.1 ([#1137](https://github.com/kubeflow/arena/pull/1137) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.12.0 to 2.14.0 ([#1134](https://github.com/kubeflow/arena/pull/1134) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/kserve/kserve from 0.13.0 to 0.13.1 ([#1135](https://github.com/kubeflow/arena/pull/1135) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/prometheus/common from 0.45.0 to 0.59.1 ([#1138](https://github.com/kubeflow/arena/pull/1138) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump client-java from 10.0.1 to 11.0.1 ([#1132](https://github.com/kubeflow/arena/pull/1132) by [@ChenYi015](https://github.com/ChenYi015))
- Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.4 ([#1144](https://github.com/kubeflow/arena/pull/1144) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.14.0 to 2.15.0 ([#1143](https://github.com/kubeflow/arena/pull/1143) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.34 to 9.5.35 ([#1145](https://github.com/kubeflow/arena/pull/1145) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.0 to 2.15.1 ([#1147](https://github.com/kubeflow/arena/pull/1147) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.1 to 2.15.2 ([#1150](https://github.com/kubeflow/arena/pull/1150) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump mkdocs-material from 9.5.35 to 9.5.36 ([#1151](https://github.com/kubeflow/arena/pull/1151) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump golang from 1.21 to 1.22.7 ([#1142](https://github.com/kubeflow/arena/pull/1142) by [@ChenYi015](https://github.com/ChenYi015))
- Bump mkdocs-material from 9.5.36 to 9.5.38 ([#1153](https://github.com/kubeflow/arena/pull/1153) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Bump github.com/go-resty/resty/v2 from 2.15.2 to 2.15.3 ([#1156](https://github.com/kubeflow/arena/pull/1156) by [@dependabot[bot]](https://github.com/apps/dependabot))
- Release v0.10.0 ([#1157](https://github.com/kubeflow/arena/pull/1157) by [@ChenYi015](https://github.com/ChenYi015))
[Full Changelog](https://github.com/kubeflow/arena/compare/v0.9.16...v0.10.0)

41
Dockerfile Normal file
View File

@ -0,0 +1,41 @@
ARG BASE_IMAGE=debian:12-slim
FROM golang:1.24.0 AS builder
ARG TARGETOS
ARG TARGETARCH
WORKDIR /workspace
COPY . .
RUN set -eux && \
VERSION=$(cat VERSION) && \
make arena-installer OS=${TARGETOS} ARCH=${TARGETARCH} && \
mv arena-installer-${VERSION}-${TARGETOS}-${TARGETARCH}.tar.gz arena-installer.tar.gz
FROM ${BASE_IMAGE}
ARG TARGETOS
ARG TARGETARCH
WORKDIR /root
RUN apt-get update \
&& apt-get install -y tini \
&& rm -rf /var/lib/apt/lists/*
COPY --from=builder /workspace/arena-installer.tar.gz .
RUN set -eux && \
tar -zxvf arena-installer.tar.gz && \
mv arena-installer-*-${TARGETOS}-${TARGETARCH} arena-installer && \
arena-installer/install.sh --only-binary && \
rm -rf arena-installer.tar.gz
COPY entrypoint.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]


@ -1,84 +0,0 @@
#**********************************************************************
# Builder
#
# Create a go runtime for building arena
ARG GOLANG_VERSION=1.10
ARG KUBE_VERSION=v1.11.2
ARG HELM_VERSION=v2.14.1
ARG VERSION=v0.3.0-rc
ARG OS_ARCH=linux-amd64
ARG COMMIT=stable
ARG TARGET=cli-$OS_ARCH
FROM golang:$GOLANG_VERSION-stretch as build
ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET
ENV KUBE_VERSION $KUBE_VERSION
ENV HELM_VERSION $HELM_VERSION
ENV VERSION $VERSION
ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
RUN mkdir -p /go/src/github.com/kubeflow/arena
WORKDIR /go/src/github.com/kubeflow/arena
COPY . .
RUN make $TARGET
RUN wget https://storage.googleapis.com/kubernetes-helm/helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
tar -xvf helm-$HELM_VERSION-$OS_ARCH.tar.gz && \
mv $OS_ARCH/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm && \
chmod u+x /go/src/github.com/kubeflow/arena/install.sh
RUN OS=$(echo $OS_ARCH | cut -f1 -d-) && \
ARCH=$(echo $OS_ARCH | cut -f2 -d-) && \
curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBE_VERSION}/bin/${OS}/${ARCH}/kubectl && \
chmod +x /usr/local/bin/kubectl
#**********************************************************************
#
# Create arena pacakge
#
FROM centos:7
ARG KUBE_VERSION
ARG HELM_VERSION
ARG OS_ARCH
ARG TARGET
ARG COMMIT
ARG VERSION
ENV OS_ARCH $OS_ARCH
ENV COMMIT $COMMIT
ENV TARGET $TARGET
ENV VERSION $VERSION
ENV ARENA_HOME /arena-installer
ENV ARENA_TARFILE /arena-installer-$VERSION-$COMMIT-$OS_ARCH.tar.gz
RUN mkdir -p $ARENA_HOME/bin
COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena $ARENA_HOME/bin/arena
COPY --from=build /go/src/github.com/kubeflow/arena/install.sh $ARENA_HOME/install.sh
COPY --from=build /usr/local/bin/helm $ARENA_HOME/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts $ARENA_HOME/kubernetes-artifacts
COPY --from=build /usr/local/bin/kubectl $ARENA_HOME/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts $ARENA_HOME/charts
RUN tar -zcvf $ARENA_TARFILE $ARENA_HOME


@ -1,39 +0,0 @@
FROM golang:1.10-stretch as build
RUN mkdir -p /go/src/github.com/kubeflow/arena
WORKDIR /go/src/github.com/kubeflow/arena
COPY . .
RUN make
RUN wget https://storage.googleapis.com/kubernetes-helm/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.11.2
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM centos:7
COPY --from=build /go/src/github.com/kubeflow/arena/bin/arena /usr/local/bin/arena
COPY --from=build /usr/local/bin/helm /usr/local/bin/helm
COPY --from=build /go/src/github.com/kubeflow/arena/kubernetes-artifacts /root/kubernetes-artifacts
COPY --from=build /usr/local/bin/kubectl /usr/local/bin/kubectl
COPY --from=build /go/src/github.com/kubeflow/arena/charts /charts
ADD run_arena.sh /usr/local/bin
RUN chmod u+x /usr/local/bin/run_arena.sh
RUN yum install bash-completion -y && \
echo "source <(arena completion bash)" >> ~/.bashrc
ENTRYPOINT ["/usr/local/bin/run_arena.sh"]


@ -3,7 +3,7 @@ ARG BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3
ARG USER=root
FROM golang:1.10-stretch as build
FROM golang:1.23.10 AS build
RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -12,12 +12,12 @@ COPY . .
RUN make
RUN wget https://storage.googleapis.com/kubernetes-helm/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.11.2
ENV K8S_VERSION v1.28.4
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM $BASE_IMAGE


@ -2,7 +2,7 @@ ARG BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-no
ARG USER=jovyan
FROM golang:1.10-stretch as build
FROM golang:1.23.10 AS build
RUN mkdir -p /go/src/github.com/kubeflow/arena
@ -11,12 +11,12 @@ COPY . .
RUN make
RUN wget https://storage.googleapis.com/kubernetes-helm/helm-v2.14.1-linux-amd64.tar.gz && \
tar -xvf helm-v2.14.1-linux-amd64.tar.gz && \
RUN wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz && \
tar -xvf helm-v3.13.3-linux-amd64.tar.gz && \
mv linux-amd64/helm /usr/local/bin/helm && \
chmod u+x /usr/local/bin/helm
ENV K8S_VERSION v1.11.2
ENV K8S_VERSION v1.28.4
RUN curl -o /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${K8S_VERSION}/bin/linux/amd64/kubectl && chmod +x /usr/local/bin/kubectl
FROM $BASE_IMAGE
@ -35,4 +35,4 @@ RUN apt-get update && \
echo "source /etc/bash_completion" >> /etc/bash.bashrc && \
echo "source <(arena completion bash)" >> /etc/bash.bashrc
USER $USER
USER $USER

15
FAQ.md

@ -1,15 +0,0 @@
# FAQ
## Common problems and solutions where arena doesn't launch:
- ``` error: unable to recognize "/tmp/tf-dist-git.yaml392889812": no matches for kind "TFJob" in version "kubeflow.org/v1alpha2"```
### Solution
```
git clone https://github.com/kubeflow/arena.git
kubectl delete -f kubernetes-artifacts/tf-operator/tf-operator.yaml
kubectl create -f kubernetes-artifacts/tf-operator/tf-operator.yaml
```
## Common questions:
### Does arena support pytorch
Not yet, although support for using kfserving is planned for 2019. More updates will be available here.

509
Gopkg.lock generated

@ -1,509 +0,0 @@
# This file is autogenerated, do not edit; changes may be undone by the next 'dep ensure'.
[[projects]]
name = "cloud.google.com/go"
packages = ["compute/metadata"]
revision = "97efc2c9ffd9fe8ef47f7f3203dc60bbca547374"
version = "v0.28.0"
[[projects]]
name = "github.com/PuerkitoBio/purell"
packages = ["."]
revision = "0bcb03f4b4d0a9428594752bd2a3b9aa0a9d4bd4"
version = "v1.1.0"
[[projects]]
branch = "master"
name = "github.com/PuerkitoBio/urlesc"
packages = ["."]
revision = "de5bf2ad457846296e2031421a34e2568e304e35"
[[projects]]
name = "github.com/cpuguy83/go-md2man"
packages = ["md2man"]
revision = "20f5889cbdc3c73dbd2862796665e7c465ade7d1"
version = "v1.0.8"
[[projects]]
name = "github.com/davecgh/go-spew"
packages = ["spew"]
revision = "346938d642f2ec3594ed81d874461961cd0faa76"
version = "v1.1.0"
[[projects]]
name = "github.com/emicklei/go-restful"
packages = [
".",
"log"
]
revision = "3eb9738c1697594ea6e71a7156a9bb32ed216cf0"
version = "v2.8.0"
[[projects]]
name = "github.com/ghodss/yaml"
packages = ["."]
revision = "0ca9ea5df5451ffdf184b4428c902747c2c11cd7"
version = "v1.0.0"
[[projects]]
name = "github.com/go-openapi/jsonpointer"
packages = ["."]
revision = "3a0015ad55fa9873f41605d3e8f28cd279c32ab2"
version = "0.15.0"
[[projects]]
name = "github.com/go-openapi/jsonreference"
packages = ["."]
revision = "3fb327e6747da3043567ee86abd02bb6376b6be2"
version = "0.15.0"
[[projects]]
branch = "master"
name = "github.com/go-openapi/spec"
packages = ["."]
revision = "f1468acb3b29cdd5c5f6fa29435d2d2d6e6c9ff1"
[[projects]]
name = "github.com/go-openapi/swag"
packages = ["."]
revision = "2b0bd4f193d011c203529df626a65d63cb8a79e8"
version = "0.15.0"
[[projects]]
name = "github.com/gogo/protobuf"
packages = [
"gogoproto",
"proto",
"protoc-gen-gogo/descriptor",
"sortkeys",
"types"
]
revision = "636bf0302bc95575d69441b25a2603156ffdddf1"
version = "v1.1.1"
[[projects]]
branch = "master"
name = "github.com/golang/glog"
packages = ["."]
revision = "23def4e6c14b4da8ac2ed8007337bc5eb5007998"
[[projects]]
name = "github.com/golang/protobuf"
packages = [
"proto",
"ptypes",
"ptypes/any",
"ptypes/duration",
"ptypes/timestamp"
]
revision = "b4deda0973fb4c70b50d226b1af49f3da59f5265"
version = "v1.1.0"
[[projects]]
branch = "master"
name = "github.com/google/btree"
packages = ["."]
revision = "e89373fe6b4a7413d7acd6da1725b83ef713e6e4"
[[projects]]
branch = "master"
name = "github.com/google/gofuzz"
packages = ["."]
revision = "24818f796faf91cd76ec7bddd72458fbced7a6c1"
[[projects]]
name = "github.com/googleapis/gnostic"
packages = [
"OpenAPIv2",
"compiler",
"extensions"
]
revision = "7c663266750e7d82587642f65e60bc4083f1f84e"
version = "v0.2.0"
[[projects]]
branch = "master"
name = "github.com/gregjones/httpcache"
packages = [
".",
"diskcache"
]
revision = "9cad4c3443a7200dd6400aef47183728de563a38"
[[projects]]
branch = "master"
name = "github.com/hashicorp/golang-lru"
packages = [
".",
"simplelru"
]
revision = "0fb14efe8c47ae851c0034ed7a448854d3d34cf3"
[[projects]]
name = "github.com/imdario/mergo"
packages = ["."]
revision = "9316a62528ac99aaecb4e47eadd6dc8aa6533d58"
version = "v0.3.5"
[[projects]]
name = "github.com/inconshreveable/mousetrap"
packages = ["."]
revision = "76626ae9c91c4f2a10f34cad8ce83ea42c93bb75"
version = "v1.0"
[[projects]]
name = "github.com/json-iterator/go"
packages = ["."]
revision = "ab8a2e0c74be9d3be70b3184d9acc634935ded82"
version = "1.1.4"
[[projects]]
branch = "master"
name = "github.com/mailru/easyjson"
packages = [
"buffer",
"jlexer",
"jwriter"
]
revision = "d5012789d6659eeed305f54c1b1542e7b65829e6"
[[projects]]
name = "github.com/mitchellh/go-homedir"
packages = ["."]
revision = "af06845cf3004701891bf4fdb884bfe4920b3727"
version = "v1.1.0"
[[projects]]
name = "github.com/modern-go/concurrent"
packages = ["."]
revision = "bacd9c7ef1dd9b15be4a9909b8ac7a4e313eec94"
version = "1.0.3"
[[projects]]
name = "github.com/modern-go/reflect2"
packages = ["."]
revision = "4b7aa43c6742a2c18fdef89dd197aaae7dac7ccd"
version = "1.0.1"
[[projects]]
branch = "master"
name = "github.com/petar/GoLLRB"
packages = ["llrb"]
revision = "53be0d36a84c2a886ca057d34b6aa4468df9ccb4"
[[projects]]
name = "github.com/peterbourgon/diskv"
packages = ["."]
revision = "5f041e8faa004a95c88a202771f4cc3e991971e6"
version = "v2.0.1"
[[projects]]
name = "github.com/pmezard/go-difflib"
packages = ["difflib"]
revision = "792786c7400a136282c1664665ae0a8db921c6c2"
version = "v1.0.0"
[[projects]]
name = "github.com/russross/blackfriday"
packages = ["."]
revision = "55d61fa8aa702f59229e6cff85793c22e580eaf5"
version = "v1.5.1"
[[projects]]
name = "github.com/sirupsen/logrus"
packages = ["."]
revision = "3e01752db0189b9157070a0e1668a620f9a85da2"
version = "v1.0.6"
[[projects]]
name = "github.com/spf13/cobra"
packages = [
".",
"doc"
]
revision = "ef82de70bb3f60c65fb8eebacbb2d122ef517385"
version = "v0.0.3"
[[projects]]
name = "github.com/spf13/pflag"
packages = ["."]
revision = "583c0c0531f06d5278b7d917446061adc344b5cd"
version = "v1.0.1"
[[projects]]
name = "github.com/stretchr/testify"
packages = ["assert"]
revision = "ffdc059bfe9ce6a4e144ba849dbedead332c6053"
version = "v1.3.0"
[[projects]]
branch = "master"
name = "golang.org/x/crypto"
packages = ["ssh/terminal"]
revision = "c126467f60eb25f8f27e5a981f32a87e3965053f"
[[projects]]
branch = "master"
name = "golang.org/x/net"
packages = [
"context",
"context/ctxhttp",
"http/httpguts",
"http2",
"http2/hpack",
"idna"
]
revision = "3673e40ba22529d22c3fd7c93e97b0ce50fa7bdd"
[[projects]]
branch = "master"
name = "golang.org/x/oauth2"
packages = [
".",
"google",
"internal",
"jws",
"jwt"
]
revision = "d2e6202438beef2727060aa7cabdd924d92ebfd9"
[[projects]]
branch = "master"
name = "golang.org/x/sys"
packages = [
"unix",
"windows"
]
revision = "bd9dbc187b6e1dacfdd2722a87e83093c2d7bd6e"
[[projects]]
name = "golang.org/x/text"
packages = [
"collate",
"collate/build",
"internal/colltab",
"internal/gen",
"internal/tag",
"internal/triegen",
"internal/ucd",
"language",
"secure/bidirule",
"transform",
"unicode/bidi",
"unicode/cldr",
"unicode/norm",
"unicode/rangetable",
"width"
]
revision = "f21a4dfb5e38f5895301dc265a8def02365cc3d0"
version = "v0.3.0"
[[projects]]
branch = "master"
name = "golang.org/x/time"
packages = ["rate"]
revision = "9d24e82272b4f38b78bc8cff74fa936d31ccd8ef"
[[projects]]
name = "google.golang.org/appengine"
packages = [
".",
"internal",
"internal/app_identity",
"internal/base",
"internal/datastore",
"internal/log",
"internal/modules",
"internal/remote_api",
"internal/urlfetch",
"urlfetch"
]
revision = "b1f26356af11148e710935ed1ac8a7f5702c7612"
version = "v1.1.0"
[[projects]]
name = "gopkg.in/inf.v0"
packages = ["."]
revision = "d2d2541c53f18d2a059457998ce2876cc8e67cbf"
version = "v0.9.1"
[[projects]]
name = "gopkg.in/yaml.v2"
packages = ["."]
revision = "5420a8b6744d3b0345ab293f6fcba19c978f1183"
version = "v2.2.1"
[[projects]]
branch = "release-1.0"
name = "istio.io/api"
packages = ["networking/v1alpha3"]
revision = "76349c53b87f03f1e610b3aa3843dba3c38138d7"
[[projects]]
name = "k8s.io/api"
packages = [
"admissionregistration/v1alpha1",
"admissionregistration/v1beta1",
"apps/v1",
"apps/v1beta1",
"apps/v1beta2",
"authentication/v1",
"authentication/v1beta1",
"authorization/v1",
"authorization/v1beta1",
"autoscaling/v1",
"autoscaling/v2beta1",
"batch/v1",
"batch/v1beta1",
"batch/v2alpha1",
"certificates/v1beta1",
"core/v1",
"events/v1beta1",
"extensions/v1beta1",
"networking/v1",
"policy/v1beta1",
"rbac/v1",
"rbac/v1alpha1",
"rbac/v1beta1",
"scheduling/v1alpha1",
"scheduling/v1beta1",
"settings/v1alpha1",
"storage/v1",
"storage/v1alpha1",
"storage/v1beta1"
]
revision = "2d6f90ab1293a1fb871cf149423ebb72aa7423aa"
version = "kubernetes-1.11.1"
[[projects]]
name = "k8s.io/apimachinery"
packages = [
"pkg/api/errors",
"pkg/api/meta",
"pkg/api/resource",
"pkg/apis/meta/internalversion",
"pkg/apis/meta/v1",
"pkg/apis/meta/v1/unstructured",
"pkg/apis/meta/v1beta1",
"pkg/conversion",
"pkg/conversion/queryparams",
"pkg/fields",
"pkg/labels",
"pkg/runtime",
"pkg/runtime/schema",
"pkg/runtime/serializer",
"pkg/runtime/serializer/json",
"pkg/runtime/serializer/protobuf",
"pkg/runtime/serializer/recognizer",
"pkg/runtime/serializer/streaming",
"pkg/runtime/serializer/versioning",
"pkg/selection",
"pkg/types",
"pkg/util/cache",
"pkg/util/clock",
"pkg/util/diff",
"pkg/util/errors",
"pkg/util/framer",
"pkg/util/intstr",
"pkg/util/json",
"pkg/util/mergepatch",
"pkg/util/net",
"pkg/util/runtime",
"pkg/util/sets",
"pkg/util/strategicpatch",
"pkg/util/validation",
"pkg/util/validation/field",
"pkg/util/wait",
"pkg/util/yaml",
"pkg/version",
"pkg/watch",
"third_party/forked/golang/json",
"third_party/forked/golang/reflect"
]
revision = "103fd098999dc9c0c88536f5c9ad2e5da39373ae"
version = "kubernetes-1.11.0"
[[projects]]
name = "k8s.io/client-go"
packages = [
"discovery",
"discovery/fake",
"kubernetes",
"kubernetes/scheme",
"kubernetes/typed/admissionregistration/v1alpha1",
"kubernetes/typed/admissionregistration/v1beta1",
"kubernetes/typed/apps/v1",
"kubernetes/typed/apps/v1beta1",
"kubernetes/typed/apps/v1beta2",
"kubernetes/typed/authentication/v1",
"kubernetes/typed/authentication/v1beta1",
"kubernetes/typed/authorization/v1",
"kubernetes/typed/authorization/v1beta1",
"kubernetes/typed/autoscaling/v1",
"kubernetes/typed/autoscaling/v2beta1",
"kubernetes/typed/batch/v1",
"kubernetes/typed/batch/v1beta1",
"kubernetes/typed/batch/v2alpha1",
"kubernetes/typed/certificates/v1beta1",
"kubernetes/typed/core/v1",
"kubernetes/typed/events/v1beta1",
"kubernetes/typed/extensions/v1beta1",
"kubernetes/typed/networking/v1",
"kubernetes/typed/policy/v1beta1",
"kubernetes/typed/rbac/v1",
"kubernetes/typed/rbac/v1alpha1",
"kubernetes/typed/rbac/v1beta1",
"kubernetes/typed/scheduling/v1alpha1",
"kubernetes/typed/scheduling/v1beta1",
"kubernetes/typed/settings/v1alpha1",
"kubernetes/typed/storage/v1",
"kubernetes/typed/storage/v1alpha1",
"kubernetes/typed/storage/v1beta1",
"pkg/apis/clientauthentication",
"pkg/apis/clientauthentication/v1alpha1",
"pkg/apis/clientauthentication/v1beta1",
"pkg/version",
"plugin/pkg/client/auth/exec",
"plugin/pkg/client/auth/gcp",
"rest",
"rest/watch",
"testing",
"third_party/forked/golang/template",
"tools/auth",
"tools/cache",
"tools/clientcmd",
"tools/clientcmd/api",
"tools/clientcmd/api/latest",
"tools/clientcmd/api/v1",
"tools/metrics",
"tools/pager",
"tools/reference",
"transport",
"util/buffer",
"util/cert",
"util/connrotation",
"util/flowcontrol",
"util/homedir",
"util/integer",
"util/jsonpath",
"util/retry"
]
revision = "1f13a808da65775f22cbf47862c4e5898d8f4ca1"
version = "kubernetes-1.11.2"
[[projects]]
branch = "release-1.9"
name = "k8s.io/kube-openapi"
packages = [
"pkg/common",
"pkg/util/proto"
]
revision = "7ee50c0aa8059d610950c952a9ed7a5e33ab336a"
[solve-meta]
analyzer-name = "dep"
analyzer-version = 1
inputs-digest = "bc45b05152e3777b384e1572fc44170e726479c0d0dd312187c7e007f6a114d5"
solver-name = "gps-cdcl"
solver-version = 1


@ -1,63 +0,0 @@
# Gopkg.toml example
#
# Refer to https://golang.github.io/dep/docs/Gopkg.toml.html
# for detailed Gopkg.toml documentation.
#
# required = ["github.com/user/thing/cmd/thing"]
# ignored = ["github.com/user/project/pkgX", "bitbucket.org/user/project/pkgA/pkgY"]
#
# [[constraint]]
# name = "github.com/user/project"
# version = "1.0.0"
#
# [[constraint]]
# name = "github.com/user/project2"
# branch = "dev"
# source = "github.com/myfork/project2"
#
# [[override]]
# name = "github.com/x/y"
# version = "2.4.0"
#
# [prune]
# non-go = false
# go-tests = true
# unused-packages = true
[[constraint]]
name = "github.com/sirupsen/logrus"
version = "1.0.5"
[[constraint]]
name = "github.com/spf13/cobra"
version = "0.0.2"
[[constraint]]
name = "gopkg.in/yaml.v2"
version = "2.2.1"
[[constraint]]
name = "k8s.io/api"
version = "kubernetes-1.11.2"
[[constraint]]
name = "k8s.io/client-go"
version = "kubernetes-1.11.2"
[[constraint]]
name = "k8s.io/apimachinery"
version = "kubernetes-1.11.2"
[[constraint]]
name = "github.com/gogo/protobuf"
version = "v1.1.1"
[[constraint]]
name = "istio.io/api"
branch = "release-1.0"
[prune]
go-tests = true
unused-packages = true

Makefile

@ -1,18 +1,64 @@
PACKAGE=github.com/kubeflow/arena
CURRENT_DIR=$(shell pwd)
DIST_DIR=${CURRENT_DIR}/bin
ARENA_CLI_NAME=arena
JOB_MONITOR=jobmon
OS_ARCH?=linux-amd64
.SILENT:
VERSION=$(shell cat ${CURRENT_DIR}/VERSION)
BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
GIT_COMMIT=$(shell git rev-parse HEAD)
GIT_SHORT_COMMIT=$(shell git rev-parse --short HEAD)
DOCKER_BUILD_DATE=$(shell date -u +'%Y%m%d%H%M%S')
GIT_TAG=$(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
GIT_TREE_STATE=$(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
PACKR_CMD=$(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
GOBIN=$(shell go env GOPATH)/bin
else
GOBIN=$(shell go env GOBIN)
endif
# Setting SHELL to bash allows bash commands to be executed by recipes.
# Options are set to exit when a recipe line exits non-zero or a piped command fails.
SHELL = /usr/bin/env bash -o pipefail
.SHELLFLAGS = -ec
PACKAGE ?= github.com/kubeflow/arena
CURRENT_DIR ?= $(shell pwd)
DIST_DIR ?= $(CURRENT_DIR)/bin
ARENA_CLI_NAME ?= arena
JOB_MONITOR ?= jobmon
ARENA_UNINSTALL ?= arena-uninstall
OS ?= $(shell go env GOOS)
ARCH ?= $(shell go env GOARCH)
VERSION ?= $(shell cat VERSION)
BUILD_DATE := $(shell date -u +'%Y-%m-%dT%H:%M:%SZ')
GIT_COMMIT := $(shell git rev-parse HEAD)
GIT_SHORT_COMMIT := $(shell git rev-parse --short HEAD)
DOCKER_BUILD_DATE := $(shell date -u +'%Y%m%d%H%M%S')
GIT_TAG := $(shell if [ -z "`git status --porcelain`" ]; then git describe --exact-match --tags HEAD 2>/dev/null; fi)
GIT_TREE_STATE := $(shell if [ -z "`git status --porcelain`" ]; then echo "clean" ; else echo "dirty"; fi)
PACKR_CMD := $(shell if [ "`which packr`" ]; then echo "packr"; else echo "go run vendor/github.com/gobuffalo/packr/packr/main.go"; fi)
# Location to install binaries
LOCALBIN ?= $(CURRENT_DIR)/bin
# Location to put temp files
TEMPDIR ?= $(CURRENT_DIR)/tmp
# ARENA_ARTIFACTS
ARENA_ARTIFACTS_CHART_PATH ?= $(CURRENT_DIR)/arena-artifacts
# Versions
GOLANG_VERSION=$(shell grep -e '^go ' go.mod | cut -d ' ' -f 2)
KUBECTL_VERSION ?= v1.28.4
HELM_VERSION ?= $(shell grep -e 'helm.sh/helm/v3 ' go.mod | cut -d ' ' -f 2)
HELM_UNITTEST_VERSION ?= 0.5.1
KIND_VERSION ?= v0.23.0
KIND_K8S_VERSION ?= v1.29.3
ENVTEST_VERSION ?= release-0.18
ENVTEST_K8S_VERSION ?= 1.29.3
GOLANGCI_LINT_VERSION ?= v2.1.6
# Binaries
ARENA ?= arena-v$(VERSION)-$(OS)-$(ARCH)
KUBECTL ?= kubectl-$(KUBECTL_VERSION)-$(OS)-$(ARCH)
HELM ?= helm-$(HELM_VERSION)-$(OS)-$(ARCH)
KIND ?= $(LOCALBIN)/kind-$(KIND_VERSION)
ENVTEST ?= $(LOCALBIN)/setup-envtest-$(ENVTEST_VERSION)
GOLANGCI_LINT ?= golangci-lint-$(GOLANGCI_LINT_VERSION)
# Tarballs
ARENA_INSTALLER ?= arena-installer-$(VERSION)-$(OS)-$(ARCH)
ARENA_INSTALLER_TARBALL ?= $(ARENA_INSTALLER).tar.gz
BUILDER_IMAGE=arena-builder
BASE_IMAGE=registry.aliyuncs.com/kubeflow-images-public/tensorflow-1.12.0-notebook-gpu:v0.4.0
@ -31,8 +77,12 @@ override LDFLAGS += \
-extldflags "-static"
# docker image publishing options
IMAGE_REGISTRY ?= docker.io
IMAGE_REPOSITORY ?= kubeflow/arena
IMAGE_TAG ?= $(VERSION)
IMAGE ?= $(IMAGE_REGISTRY)/$(IMAGE_REPOSITORY):$(IMAGE_TAG)
DOCKER_PUSH=false
IMAGE_TAG=latest
BASE_IMAGE ?= debian:12-slim
ifneq (${GIT_TAG},)
IMAGE_TAG=${GIT_TAG}
@ -55,44 +105,117 @@ ifdef IMAGE_NAMESPACE
IMAGE_PREFIX=${IMAGE_NAMESPACE}/
endif
##@ General
# The help target prints out all targets with their descriptions organized
# beneath their categories. The categories are represented by '##@' and the
# target descriptions by '##'. The awk command is responsible for reading the
# entire set of makefiles included in this invocation, looking for lines of the
# file as xyz: ## something, and then pretty-format the target and help. Then,
# if there's a line with ##@ something, that gets pretty-printed as a category.
# More info on the usage of ANSI control characters for terminal formatting:
# https://en.wikipedia.org/wiki/ANSI_escape_code#SGR_parameters
# More info on the awk command:
# http://linuxcommand.org/lc3_adv_awk.php
.PHONY: help
help: ## Display this help.
@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-30s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
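The parsing rules used by the `help` target above can be tried outside of make. The following is a minimal sketch, not part of the repository: the sample Makefile content and the `render_help` function name are assumptions made for illustration, the ANSI colour codes are left out so the output is easy to read, and the non-greedy `?` in the target pattern is dropped for portability across awk implementations.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample Makefile, used only for this demonstration.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
##@ General
help: ## Display this help.
##@ Development
go-fmt: ## Run go fmt against code.
internal-target:
EOF

# Same idea as the help target: lines of the form "target: ## text" become
# help entries, and lines starting with "##@" become category headings.
render_help() {
  awk 'BEGIN {FS = ":.*##"}
       /^[a-zA-Z_0-9-]+:.*##/ { printf "  %-30s %s\n", $1, $2 }
       /^##@/ { printf "\n%s\n", substr($0, 5) }' "$1"
}

help_output="$(render_help "$sample")"
echo "$help_output"
rm -f "$sample"
```

Targets without a `##` description, such as `internal-target` in the sample, are simply not listed, which is a convenient way to keep helper targets out of the help text.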
.PHONY: all
all: go-fmt go-vet go-lint unit-test e2e-test
##@ Development
go-fmt: ## Run go fmt against code.
@echo "Running go fmt..."
go fmt ./...
go-vet: ## Run go vet against code.
@echo "Running go vet..."
go vet ./...
.PHONY: go-lint
go-lint: golangci-lint ## Run golangci-lint linter.
@echo "Running golangci-lint run..."
$(LOCALBIN)/$(GOLANGCI_LINT) run --timeout 5m ./...
.PHONY: go-lint-fix
go-lint-fix: golangci-lint ## Run golangci-lint linter and perform fixes.
@echo "Running golangci-lint run --fix..."
$(LOCALBIN)/$(GOLANGCI_LINT) run --fix --timeout 5m ./...
.PHONY: unit-test
unit-test: ## Run go unit tests.
@echo "Running go test..."
go test $(shell go list ./... | grep -v /e2e) -coverprofile cover.out
.PHONY: e2e-test
e2e-test: envtest ## Run the e2e tests against a Kind k8s instance that is spun up.
@echo "Running e2e tests..."
go test ./test/e2e/ -v -ginkgo.v -timeout 30m
# Build the project
.PHONY: default
default:
ifeq ($(OS),Windows_NT)
default: cli-windows
default: arena-windows
else
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S),Linux)
$(info "Building on Linux")
default: cli-linux-amd64
default: arena-linux-amd64
else ifeq ($(UNAME_S),Darwin)
$(info "Building on Darwin")
default: cli-darwin-amd64
default: arena-darwin-amd64
else
$(error "The OS is not supported")
endif
endif
.PHONY: cli-linux-amd64
cli-linux-amd64:
mkdir -p bin
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} cmd/arena/*.go
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${JOB_MONITOR} cmd/job-monitor/*.go
##@ Build
.PHONY: cli-darwin-amd64
cli-darwin-amd64:
mkdir -p bin
CGO_ENABLED=0 GOOS=darwin go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
$(LOCALBIN):
mkdir -p $(LOCALBIN)
.PHONY: cli-windows
cli-windows:
mkdir -p bin
CGO_ENABLED=0 GOARCH=amd64 GOOS=windows go build -tags 'netgo' -ldflags '${LDFLAGS}' -o ${DIST_DIR}/${ARENA_CLI_NAME} ./cmd/arena/*.go
$(TEMPDIR):
mkdir -p $(TEMPDIR)
clean: ## Clean up all downloaded and generated files.
rm -rf $(LOCALBIN) $(TEMPDIR)
.PHONY: install-image
install-image:
docker build -t cheyang/arena:${VERSION}-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT} -f Dockerfile.install .
.PHONY: arena
arena: $(LOCALBIN) ## Build arena CLI for current platform.
@echo "Building arena CLI..."
CGO_ENABLED=0 GOOS=$(OS) GOARCH=$(ARCH) go build -tags netgo -ldflags '${LDFLAGS}' -o $(LOCALBIN)/$(ARENA) cmd/arena/main.go
.PHONY: java-sdk
java-sdk: ## Build Java SDK.
echo "Building arena Java SDK..."
mvn package -Dmaven.test.skip=true -Dgpg.skip -f sdk/arena-java-sdk
.PHONY: docker-build
docker-build: ## Build docker image.
docker build \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--tag $(IMAGE) \
-f Dockerfile \
.
.PHONY: docker-push
docker-push: ## Push docker image.
docker push $(IMAGE)
.PHONY: docker-buildx
PLATFORMS ?= linux/amd64,linux/arm64
docker-buildx: ## Build and push docker images for multiple platforms.
- $(CONTAINER_TOOL) buildx create --name arena-builder
$(CONTAINER_TOOL) buildx use arena-builder
- $(CONTAINER_TOOL) buildx build --push \
--platform=$(PLATFORMS) \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--tag $(IMAGE) \
-f Dockerfile \
.
- $(CONTAINER_TOOL) buildx rm arena-builder
.PHONY: notebook-image-kubeflow
notebook-image-kubeflow:
@ -104,18 +227,106 @@ notebook-image:
docker build --build-arg "BASE_IMAGE=tensorflow/tensorflow:1.12.0-devel-py3" -t cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu -f Dockerfile.notebook.cpu .
docker tag cheyang/arena:${VERSION}-notebook-${DOCKER_BUILD_DATE}-${GIT_SHORT_COMMIT}-cpu cheyang/arena-notebook:cpu
# make OS_ARCH=darwin-amd64 build-pkg for mac
.PHONY: build-pkg
build-pkg:
docker rm -f arena-pkg || true
docker build --build-arg "KUBE_VERSION=v1.11.2" \
--build-arg "HELM_VERSION=v2.14.1" \
--build-arg "COMMIT=${GIT_SHORT_COMMIT}" \
--build-arg "VERSION=${VERSION}" \
--build-arg "OS_ARCH=${OS_ARCH}" \
--build-arg "GOLANG_VERSION=1.10" \
--build-arg "TARGET=cli-${OS_ARCH}" \
-t arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} -f Dockerfile.build .
docker run -itd --name=arena-pkg arena-build:${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH} /bin/bash
docker cp arena-pkg:/arena-installer-${VERSION}-${GIT_SHORT_COMMIT}-${OS_ARCH}.tar.gz .
docker rm -f arena-pkg
.PHONY: build-dependabot
build-dependabot:
python3 hack/create_dependabot.py
.PHONY: arena-installer
arena-installer: $(ARENA_INSTALLER_TARBALL) ## Build arena installer tarball
$(ARENA_INSTALLER_TARBALL): arena kubectl helm
echo "Building arena installer tarball..." && \
rm -rf $(TEMPDIR)/$(ARENA_INSTALLER) && \
mkdir -p $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
cp $(LOCALBIN)/$(ARENA) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena && \
cp $(LOCALBIN)/$(KUBECTL) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/kubectl && \
cp $(LOCALBIN)/$(HELM) $(TEMPDIR)/$(ARENA_INSTALLER)/bin/helm && \
cp -R charts $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp -R arena-artifacts $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp arena-gen-kubeconfig.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin && \
cp install.sh $(TEMPDIR)/$(ARENA_INSTALLER) && \
cp uninstall.sh $(TEMPDIR)/$(ARENA_INSTALLER)/bin/arena-uninstall && \
tar -zcf $(ARENA_INSTALLER).tar.gz -C $(TEMPDIR) $(ARENA_INSTALLER) && \
echo "Successfully saved arena installer to $(ARENA_INSTALLER).tar.gz."
##@ Helm
.PHONY: helm-unittest
helm-unittest: helm-unittest-plugin ## Run Helm chart unittests.
set -x && $(LOCALBIN)/$(HELM) unittest $(ARENA_ARTIFACTS_CHART_PATH) --strict --file "tests/**/*_test.yaml" --chart-tests-path $(CURRENT_DIR)
##@ Dependencies
.PHONY: golangci-lint
golangci-lint: $(LOCALBIN)/$(GOLANGCI_LINT) ## Download golangci-lint locally if necessary.
$(LOCALBIN)/$(GOLANGCI_LINT): $(LOCALBIN)
$(call go-install-tool,$(LOCALBIN)/$(GOLANGCI_LINT),github.com/golangci/golangci-lint/v2/cmd/golangci-lint,${GOLANGCI_LINT_VERSION})
.PHONY: envtest
envtest: $(ENVTEST) ## Download setup-envtest locally if necessary.
$(ENVTEST): $(LOCALBIN)
$(call go-install-tool,$(ENVTEST),sigs.k8s.io/controller-runtime/tools/setup-envtest,$(ENVTEST_VERSION))
.PHONY: kubectl
kubectl: $(LOCALBIN)/$(KUBECTL)
$(LOCALBIN)/$(KUBECTL): $(LOCALBIN) $(TEMPDIR)
$(eval KUBECTL_URL=https://dl.k8s.io/release/$(KUBECTL_VERSION)/bin/$(OS)/$(ARCH)/kubectl)
$(eval KUBECTL_SHA_URL=$(KUBECTL_URL).sha256)
cd $(TEMPDIR) && \
echo "Download $(KUBECTL) if not present..." && \
if [ ! -f $(KUBECTL) ]; then \
curl -sSLo $(KUBECTL) $(KUBECTL_URL); \
fi && \
echo "Download $(KUBECTL).sha256 if not present..." && \
if [ ! -f $(KUBECTL).sha256 ]; then \
curl -sSLo $(KUBECTL).sha256 $(KUBECTL_SHA_URL); \
fi && \
echo "Verifying checksum..." && \
echo -n "$$(cat $(KUBECTL).sha256)  $(KUBECTL)" | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
echo "Make kubectl executable and move it to bin directory..." && \
chmod +x $(KUBECTL) && \
cp $(KUBECTL) $(LOCALBIN) && \
echo "Successfully installed kubectl to $(LOCALBIN)/$(KUBECTL)."
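The download-then-verify pattern in the `kubectl` rule above can be sketched in isolation. This is an illustrative sketch under stated assumptions, not the project's script: the function names `sha256_of` and `verify_sha256` are hypothetical, a local temporary file stands in for the downloaded binary, and the check line uses the conventional `<digest>  <file>` layout with two spaces.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the SHA-256 digest of a file, using whichever tool is installed.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# verify_sha256 FILE EXPECTED_DIGEST
# Feeds a "<digest>  <file>" line to the checksum tool in check mode and
# returns non-zero when the digest does not match the file.
verify_sha256() {
  local file="$1" expected="$2"
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s  %s\n' "$expected" "$file" | sha256sum --check --quiet
  else
    printf '%s  %s\n' "$expected" "$file" | shasum -a 256 --check --quiet
  fi
}

# A local file stands in for the downloaded artifact (assumption).
workdir="$(mktemp -d)"
printf 'hello\n' > "$workdir/artifact"
good="$(sha256_of "$workdir/artifact")"
bad="0000000000000000000000000000000000000000000000000000000000000000"

if verify_sha256 "$workdir/artifact" "$good"; then good_status=ok; else good_status=failed; fi
if verify_sha256 "$workdir/artifact" "$bad" 2>/dev/null; then bad_status=ok; else bad_status=failed; fi

echo "matching digest: $good_status"
echo "tampered digest: $bad_status"
rm -rf "$workdir"
```

Failing the build on a mismatch, as the rule does with `|| (echo ... && false)`, prevents a corrupted or tampered download from being copied into the bin directory.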
.PHONY: helm
helm: $(LOCALBIN)/$(HELM)
$(LOCALBIN)/$(HELM): $(LOCALBIN) $(TEMPDIR)
$(eval HELM_URL=https://get.helm.sh/$(HELM).tar.gz)
$(eval HELM_SHA_URL=https://get.helm.sh/$(HELM).tar.gz.sha256sum)
cd $(TEMPDIR) && \
echo "Download $(HELM).tar.gz if not present..." && \
if [ ! -f $(HELM).tar.gz ]; then \
wget -qO $(HELM).tar.gz $(HELM_URL); \
fi && \
echo "Download $(HELM).tar.gz.sha256sum if not present..." && \
if [ ! -f $(HELM).tar.gz.sha256sum ]; then \
wget -qO $(HELM).tar.gz.sha256sum $(HELM_SHA_URL); \
fi && \
echo "Verifying checksum..." && \
cat $(HELM).tar.gz.sha256sum | shasum -a 256 --check --quiet || (echo "Checksum verification failed, exiting." && false) && \
echo "Extract helm tarball and move it to bin directory..." && \
tar -zxf $(HELM).tar.gz && \
cp ${OS}-${ARCH}/helm $(LOCALBIN)/$(HELM) && \
echo "Successfully installed helm to $(LOCALBIN)/$(HELM)."
.PHONY: helm-unittest-plugin
helm-unittest-plugin: helm ## Download helm unittest plugin locally if necessary.
if [ -z "$(shell $(LOCALBIN)/$(HELM) plugin list | grep unittest)" ]; then \
echo "Installing helm unittest plugin"; \
$(LOCALBIN)/$(HELM) plugin install https://github.com/helm-unittest/helm-unittest.git --version $(HELM_UNITTEST_VERSION); \
fi
# go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
# $1 - target path with name of binary (ideally with version)
# $2 - package url which can be installed
# $3 - specific version of package
define go-install-tool
@[ -f $(1) ] || { \
set -e; \
package=$(2)@$(3) ;\
echo "Downloading $${package}" ;\
GOBIN=$(LOCALBIN) go install $${package} ;\
mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\
}
endef
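The `go-install-tool` macro above installs a tool under its plain name and then renames it to a versioned path, so that several versions can sit side by side. The following shell sketch mirrors that logic under stated assumptions: the function names `unversioned_path` and `install_versioned` are hypothetical, and only the path computation is exercised here because the install step needs a Go toolchain and network access.

```shell
#!/usr/bin/env bash
set -euo pipefail

# unversioned_path TARGET VERSION
# Strips the trailing "-VERSION" from TARGET, like the sed call in the macro.
unversioned_path() {
  local target="$1" version="$2"
  echo "$target" | sed "s/-${version}\$//"
}

# install_versioned TARGET PACKAGE VERSION
# Installs PACKAGE@VERSION next to TARGET and renames the binary to TARGET.
# Does nothing when TARGET already exists.
install_versioned() {
  local target="$1" package="$2" version="$3"
  if [ -f "$target" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$target")"
  local bindir
  bindir="$(cd "$(dirname "$target")" && pwd)"  # go install needs an absolute GOBIN
  local versioned="$bindir/$(basename "$target")"
  echo "Downloading ${package}@${version}"
  GOBIN="$bindir" go install "${package}@${version}"
  mv "$(unversioned_path "$versioned" "$version")" "$versioned"
}

# Only the rename computation is demonstrated here.
plain="$(unversioned_path "bin/setup-envtest-release-0.18" "release-0.18")"
echo "$plain"
```

With a Go toolchain available, a call such as `install_versioned "$PWD/bin/setup-envtest-release-0.18" sigs.k8s.io/controller-runtime/tools/setup-envtest release-0.18` would correspond to the `envtest` rule.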

OWNERS

@ -1,8 +1,11 @@
approvers:
- cheyang
- wsxiaozhang
- denverdino
- Syulin7
- xieydd
- denkensk
- gujingit
- ChenYi015
reviewers:
- GarnettWang
- wsxiaozhang
- xiaozhouX
- osswangxining


@ -1,9 +1,8 @@
# Arena
[![CircleCI](https://circleci.com/gh/kubeflow/arena.svg?style=svg)](https://circleci.com/gh/kubeflow/arena)
[![Build Status](https://travis-ci.org/kubeflow/arena.svg?branch=master)](https://travis-ci.org/kubeflow/arena)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
[![GitHub release](https://img.shields.io/github/v/release/kubeflow/arena)](https://github.com/kubeflow/arena/releases) [![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml) [![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
View the [Arena documentation](https://arena-docs.readthedocs.io/en/latest).
## Overview
@ -17,34 +16,15 @@ For the Chinese version, please refer to [中文文档](README_cn.md)
## Setup
You can follow up the [Installation guide](docs/installation/INSTALL_FROM_BINARY.md)
Follow the [Installation guide](https://arena-docs.readthedocs.io/en/latest/installation).
## User Guide
Arena is a command-line interface to run and monitor the machine learning training jobs and check their results in an easy way. Currently it supports solo/distributed training.
- [1. Run a training Job with source code from git](docs/userguide/1-tfjob-standalone.md)
- [2. Run a training Job with tensorboard](docs/userguide/2-tfjob-tensorboard.md)
- [3. Run a distributed training Job](docs/userguide/3-tfjob-distributed.md)
- [4. Run a distributed training Job with external data](docs/userguide/4-tfjob-distributed-data.md)
- [5. Run a distributed training Job based on MPI](docs/userguide/5-mpijob-distributed.md)
- [6. Run a distributed TensorFlow training job with gang scheduler](docs/userguide/6-tfjob-gangschd.md)
- [7. Run TensorFlow Serving](docs/userguide/7-tf-serving.md)
- [8. Run TensorFlow Estimator](docs/userguide/8-tfjob-estimator.md)
- [9. Monitor GPUs of the training job ](docs/userguide/9-top-job-gpu-metric.md)
- [10. Run a distributed training job with RDMA](docs/userguide/10-rdma-integration.md)
- [11. Run a distributed spark job](docs/userguide/11-sparkjob-distributed.md)
- [12. Run a Volcano job](docs/userguide/12-volcanojob.md)
- [13. Preempted mpi job](docs/userguide/13-preempted-mpijob.md)
- [14. Submit jobs with node selectors](docs/userguide/14-submit-with-node-selector.md)
- [15. Submit jobs with tolerating taints](docs/userguide/14-submit-with-node-toleration.md)
- [16. Run a custom serving job](docs/userguide/15-custom-serving-sample.md)
- [17. Run a training Job with configuration files](docs/userguide/16-assign-config-file.md)
Arena is a command-line interface to run and monitor machine learning training jobs and check their results in an easy way. Please refer to the [User Guide](https://arena-docs.readthedocs.io/en/latest/training) to manage your training jobs.
## Demo
[![](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
[![arena demo](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
## Developing
@ -52,7 +32,7 @@ Prerequisites:
- Go >= 1.8
```
```shell
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
@ -62,11 +42,11 @@ make
`arena` binary is located in directory `arena/bin`. You may want to add the directory to `$PATH`.
Then you can follow [Installation guide for developer](docs/installation/INSTALL_FROM_SOURCE.md)
Then you can follow [Installation guide for developer](https://arena-docs.readthedocs.io/en/latest/installation)
## CPU Profiling
```
```shell
# set profile rate (HZ)
export PROFILE_RATE=1000
@ -77,16 +57,18 @@ INFO[0000] Dump cpu profile file into /tmp/cpu_profile
Then you can analyze the profile by following [Go CPU profiling: pprof and speedscope](https://coder.today/go-profiling-pprof-and-speedscope-b05b812cc429)
## Adopters
If you are interested in Arena and would like to share your experiences with others, you are warmly welcome to add your information to the [ADOPTERS.md](docs/about/ADOPTERS.md) page. We will continuously discuss new requirements and feature design with you in advance.
## FAQ
Please refer to [FAQ](FAQ.md)
Please refer to [FAQ](https://arena-docs.readthedocs.io/en/latest/faq).
## CLI Document
Please refer to [arena.md](docs/cli/arena.md)
Please refer to [arena.md](docs/cli/arena.md).
## RoadMap
See [RoadMap](ROADMAP.md)
See [RoadMap](ROADMAP.md).


@ -1,9 +1,6 @@
# Arena
[![CircleCI](https://circleci.com/gh/kubeflow/arena.svg?style=svg)](https://circleci.com/gh/kubeflow/arena)
[![Build Status](https://travis-ci.org/kubeflow/arena.svg?branch=master)](https://travis-ci.org/kubeflow/arena)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
[![Integration Test](https://github.com/kubeflow/arena/actions/workflows/integration.yaml/badge.svg)](https://github.com/kubeflow/arena/actions/workflows/integration.yaml)[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/arena)](https://goreportcard.com/report/github.com/kubeflow/arena)
## Overview
@ -13,28 +10,25 @@ Arena is a command-line tool that lets data scientists easily run and
In short, Arena's goal is to make data scientists feel as if they are working on a single machine, while in fact they can also harness the power of a GPU cluster.
## Setup
You can follow the [Installation Guide](docs/installation_cn/README.md).
You can follow the [Installation Guide](https://arena-docs.readthedocs.io/en/latest/installation).
## User Guide
Arena is a command-line interface for easily running and monitoring machine learning training jobs and conveniently checking their results. It currently supports standalone and distributed training.
- [1. Run a training job with source code from git](docs/userguide_cn/1-tfjob-standalone.md)
- [2. Run a training job with tensorboard](docs/userguide_cn/2-tfjob-tensorboard.md)
- [3. Run a distributed training job](docs/userguide_cn/3-tfjob-distributed.md)
- [4. Run a distributed training job with external data](docs/userguide_cn/4-tfjob-distributed-data.md)
- [5. Run a distributed training job based on MPI](docs/userguide_cn/5-mpijob-distributed.md)
- [6. Run a distributed TensorFlow training job with gang scheduler](docs/userguide_cn/6-tfjob-gangschd.md)
- [7. Run TensorFlow Serving](docs/userguide_cn/7-tf-serving.md)
- [8. Run TensorFlow Estimator](docs/userguide_cn/8-tfjob-estimator.md)
- [1. Run a training job with source code from git](https://arena-docs.readthedocs.io/en/latest/training/tfjob/standalone/)
- [2. Run a training job with tensorboard](https://arena-docs.readthedocs.io/en/latest/training/tfjob/tensorboard/)
- [3. Run a distributed training job](https://arena-docs.readthedocs.io/en/latest/training/tfjob/distributed/)
- [4. Run a distributed training job with external data](https://arena-docs.readthedocs.io/en/latest/training/tfjob/dataset/)
- [5. Run a distributed training job based on MPI](https://arena-docs.readthedocs.io/en/latest/training/mpijob/distributed/)
- [6. Run a distributed TensorFlow training job with gang scheduler](https://arena-docs.readthedocs.io/en/latest/training/etjob/elastictraining-tensorflow2-mnist/)
- [7. Run TensorFlow Serving](https://arena-docs.readthedocs.io/en/latest/serving/tfserving/serving/)
## Demo
[![](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
[![arena demo](demo.jpg)](http://cloud.video.taobao.com/play/u/2987821887/p/1/e/6/t/1/50210690772.mp4)
## Developing
@ -42,7 +36,7 @@ Arena is a command-line interface that makes it easy to run and monitor machine
- Go >= 1.8
```
```shell
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
cd $(go env GOPATH)/src/github.com/kubeflow
git clone https://github.com/kubeflow/arena.git
@ -59,4 +53,3 @@ make
## Roadmap
See the [Roadmap](ROADMAP.md)


@ -1,10 +1,40 @@
# Arena 2019 Roadmap
# Kubeflow Arena Roadmap
## Kubeflow Arena 2024 Roadmap
This document defines a high level roadmap for Arena development.
### 2019
* ObjectiveSimplify the user experience by deeply integrating with the Kubeflow Ecosystem.
* Kubeflow Integration
* Prepare Arena for release v1.0.0 alongside kubeflow v1.10.
* Develop a seamless integration with the Training Operator to help simplify model training using command line.
* Integrate with Kubeflow Pipelines to facilitate model training from a Pipeline.
* Enable mode serving with KServe.
* Add documentation to Kubeflow website:
* Installation, uninstallation, and upgrade processes.
* Guide for tfjob, mpijob, pytorchJob.
#### Core CUJs
* ObjectiveAmplify the Extensibility of the Arena for Different ML frameworks, AIGC models and platforms.
* Support DeepSpeed Training Job.
* Support for submitting and managing llm fine-tuning jobs, like DeepSpeed.
* Enable model serving for an expanded set of models like Baichuan, LLaMA, ChatGLM, Falcon, and more.
* Extend platform support to include ARM.
* Integrate [Fluid project](https://github.com/fluid-cloudnative/fluid) for efficient data management.
* Add support for Ray Job management with [Kuberay](https://github.com/ray-project/kuberay).
* Objective: Boost Performance and Stability.
* Regularly publish recommended practices documentation.
* Enhance the Arena SDK.
* Enhance code quality by:
* Reducing repetitive code.
* Improving unit tests.
* Implement automated end-to-end tests:
* Add integration tests using GitHub Actions.
* Strive for more than 60% test coverage of supported features.
## Kubeflow Arena 2019 Roadmap
### Core CUJs
Objectives: "Make Arena easy to integrate with external systems."
@ -19,13 +49,13 @@ Objectives: "Simplify the user experience of the data scientists and provide a l
* Submit and manage Model Serving with [KF Serving](https://github.com/kubeflow/kfserving)
Objectives: "Make Arena support the same Operator compatiable with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
Objectives: "Make Arena support the same Operator compatible with different API version, so the upgrade of operator doesn't impact the existing users' experiences."
* Compatibility:
* v1alpha2 and v1 TFJob
* v1alpha1 and v1alpha2 MPIJob
Objectives: "Enchance the software quality of Arena so it can be in the quick iteration"
Objectives: "Enhance the software quality of Arena so it can be in the quick iteration"
* Refactor the source code
* Move Training implementation from `cmd` into `pkg`

View File

@ -1 +1 @@
0.4.0
0.15.1

View File

@ -0,0 +1,28 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
# helm-unittest
tests
.debug
__snapshot__

View File

@ -0,0 +1,67 @@
apiVersion: v2
name: arena-artifacts
description: A Helm chart for installing arena dependencies
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.15.1
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: 0.15.1
dependencies:
  - name: tf-operator
    alias: tf
    version: 0.1.0
    repository: "@tf-operator"
    condition: tf.enabled,global.tf.enabled
  - name: tf-dashboard
    alias: tfdashboard
    version: 0.1.0
    repository: "@tf-dashbard"
    condition: tfdashboard.enabled,global.tfdashboard.enabled
  - name: cron-operator
    alias: cron
    version: 0.1.0
    repository: "@cron-operator"
    condition: cron.enabled,global.cron.enabled
  - name: et-operator
    alias: et
    version: 0.1.1
    repository: "@et-operator"
    condition: et.enabled,global.et.enabled
  - name: mpi-operator
    alias: mpi
    version: 0.1.0
    repository: "@mpi-operator"
    condition: mpi.enabled,global.mpi.enabled
  - name: pytorch-operator
    alias: pytorch
    version: 0.1.0
    repository: "@pytorch-operator"
    condition: pytorch.enabled,global.pytorch.enabled
  - name: gpu-exporter
    alias: exporter
    version: 0.1.0
    repository: "@gpu-exporter"
    condition: exporter.enabled,global.exporter.enabled
  - name: elastic-job-supervisor
    alias: elastic-job-supervisor
    version: 0.1.0
    repository: "@elastic-job-supervisor"
    condition: elastic-job-supervisor.enabled,global.elastic-job-supervisor.enabled
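
As a hedged illustration of how the `condition` entries in this chart are typically used: Helm evaluates the first path in each comma-separated list that exists in the parent chart's values and resolves to a boolean, then enables or disables that dependency accordingly. The override file below is only a sketch. The file name and the on/off choices are assumptions, not defaults taken from this chart.

```yaml
# values-override.yaml (hypothetical file name)
# Each top-level key is a dependency alias declared in Chart.yaml.
tf:
  enabled: true      # install tf-operator
mpi:
  enabled: true      # install mpi-operator
pytorch:
  enabled: false     # skip pytorch-operator
exporter:
  enabled: false     # skip gpu-exporter
```

A file like this would be passed to `helm install` or `helm upgrade` with the `-f` flag. If a path such as `tf.enabled` is absent from the values, Helm falls back to the next path in the list, here `global.tf.enabled`.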

View File

@ -0,0 +1,223 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.6.0
git-repo: http://gitlab.alibaba-inc.com/kube-ai/kubedlpro.git
git-branch: feature/k8s-1.22
git-commit: 4f076d22
creationTimestamp: null
name: crons.apps.kubedl.io
spec:
group: apps.kubedl.io
names:
kind: Cron
listKind: CronList
plural: crons
singular: cron
scope: Namespaced
versions:
- additionalPrinterColumns:
- jsonPath: .status.conditions[-1:].type
name: State
type: string
- jsonPath: .metadata.creationTimestamp
name: Age
type: date
name: v1alpha1
schema:
openAPIV3Schema:
description: Cron is the Schema for the crons API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
type: object
spec:
description: CronSpec defines the desired state of Cron
properties:
concurrencyPolicy:
description: 'Specifies how to treat concurrent executions of a Task.
Valid values are: - "Allow" (default): allows CronJobs to run concurrently;
- "Forbid": forbids concurrent runs, skipping next run if previous
run hasn''t finished yet; - "Replace": cancels currently running
job and replaces it with a new one'
type: string
deadline:
description: Deadline is the timestamp that a cron job can keep scheduling
until then.
format: date-time
type: string
historyLimit:
description: The number of finished job history to retain. This is
a pointer to distinguish between explicit zero and not specified.
format: int32
type: integer
schedule:
description: The schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
type: string
suspend:
description: This flag tells the controller to suspend subsequent
executions, it does not apply to already started executions. Defaults
to false.
type: boolean
template:
description: Specifies the job that will be created when executing
a CronTask.
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this
representation of an object. Servers should convert recognized
schemas to the latest internal value, and may reject unrecognized
values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource
this object represents. Servers may infer this from the endpoint
the client submits requests to. Cannot be updated. In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
workload:
description: Workload is the specification of the desired cron
job with specific types.
type: object
x-kubernetes-preserve-unknown-fields: true
type: object
required:
- schedule
- template
type: object
status:
description: CronStatus defines the observed state of Cron
properties:
active:
description: A list of currently running jobs.
items:
description: 'ObjectReference contains enough information to let
you inspect or modify the referred object. --- New uses of this
type are discouraged because of difficulty describing its usage
when embedded in APIs. 1. Ignored fields. It includes many fields
which are not generally honored. For instance, ResourceVersion
and FieldPath are both very rarely valid in actual usage. 2.
Invalid usage help. It is impossible to add specific help for
individual usage. In most embedded usages, there are particular restrictions
like, "must refer only to types A and B" or "UID not honored"
or "name must be restricted". Those cannot be well described
when embedded. 3. Inconsistent validation. Because the usages
are different, the validation rules are different by usage, which
makes it hard for users to predict what will happen. 4. The fields
are both imprecise and overly precise. Kind is not a precise
mapping to a URL. This can produce ambiguity during interpretation
and require a REST mapping. In most cases, the dependency is
on the group,resource tuple and the version of the actual
struct is irrelevant. 5. We cannot easily change it. Because
this type is embedded in many locations, updates to this type will
affect numerous schemas. Don''t make new APIs embed an underspecified
API type they do not control. Instead of using this type, create
a locally provided and used type that is well-focused on your
reference. For example, ServiceReferences for admission registration:
https://github.com/kubernetes/api/blob/release-1.17/admissionregistration/v1/types.go#L533
.'
properties:
apiVersion:
description: API version of the referent.
type: string
fieldPath:
description: 'If referring to a piece of an object instead of
an entire object, this string should contain a valid JSON/Go
field access statement, such as desiredState.manifest.containers[2].
For example, if the object reference is to a container within
a pod, this would take on a value like: "spec.containers{name}"
(where "name" refers to the name of the container that triggered
the event) or if no container name is specified "spec.containers[2]"
(container with index 2 in this pod). This syntax is chosen
only to have some well-defined way of referencing a part of
an object. TODO: this design is not final and this field is
subject to change in the future.'
type: string
kind:
description: 'Kind of the referent. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
name:
description: 'Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names'
type: string
namespace:
description: 'Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/'
type: string
resourceVersion:
description: 'Specific resourceVersion to which this reference
is made, if any. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency'
type: string
uid:
description: 'UID of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids'
type: string
type: object
type: array
history:
description: History is a list of scheduled cron job with its digest
records.
items:
properties:
created:
description: Created is the creation timestamp of job.
format: date-time
type: string
finished:
description: Finished is the failed or succeeded timestamp of
job.
format: date-time
type: string
object:
description: Object is the reference of the historical scheduled
cron job.
properties:
apiGroup:
description: APIGroup is the group for the resource being
referenced. If APIGroup is not specified, the specified
Kind must be in the core API group. For any other third-party
types, APIGroup is required.
type: string
kind:
description: Kind is the type of resource being referenced
type: string
name:
description: Name is the name of resource being referenced
type: string
required:
- kind
- name
type: object
status:
description: Status is the final status when job finished.
type: string
required:
- object
- status
type: object
type: array
lastScheduleTime:
description: Information when was the last time the job was successfully
scheduled.
format: date-time
type: string
type: object
type: object
served: true
storage: true
subresources:
status: {}
status:
acceptedNames:
kind: ""
plural: ""
conditions: []
storedVersions: []
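
For orientation, a minimal `Cron` object that would satisfy the schema above might look like the following sketch. Only `schedule` and `template` are required by the schema. The object name, the embedded TFJob, and the container image are hypothetical, and `workload` is accepted as a free-form object (`x-kubernetes-preserve-unknown-fields: true`), so its contents are not validated by this CRD.

```yaml
apiVersion: apps.kubedl.io/v1alpha1
kind: Cron
metadata:
  name: cron-sample                  # hypothetical name
spec:
  schedule: "*/5 * * * *"            # required: standard cron expression (every 5 minutes)
  concurrencyPolicy: Forbid          # skip a run if the previous one is still active
  historyLimit: 5                    # number of finished jobs to retain
  suspend: false
  template:                          # required: the job created on each tick
    apiVersion: kubeflow.org/v1      # assumption: a TFJob served by the installed operator
    kind: TFJob
    workload:                        # free-form; not validated by this CRD
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: example.com/tf-mnist:latest   # hypothetical image
```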

View File

@ -0,0 +1,186 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.6.0
git-repo: https://github.com/AliyunContainerService/et-operator.git
git-branch: master
git-commit: "1499985"
creationTimestamp: null
name: scaleins.kai.alibabacloud.com
spec:
group: kai.alibabacloud.com
names:
kind: ScaleIn
listKind: ScaleInList
plural: scaleins
singular: scalein
scope: Namespaced
versions:
- additionalPrinterColumns:
- jsonPath: .status.conditions[-1:].type
name: Phase
type: string
- jsonPath: .metadata.creationTimestamp
name: Age
type: date
name: v1alpha1
schema:
openAPIV3Schema:
description: ScaleIn is the Schema for the scaleins API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
type: object
spec:
description: ScaleInSpec defines the desired state of ScaleIn
properties:
backoffLimit:
description: Optional number of retries to execute script.
format: int32
type: integer
env:
items:
properties:
name:
type: string
value:
type: string
type: object
type: array
script:
type: string
selector:
properties:
name:
type: string
type: object
timeout:
description: Optional number of timeout to execute script.
format: int32
type: integer
toDelete:
properties:
count:
type: integer
podNames:
items:
type: string
type: array
type: object
type: object
status:
description: Most recently observed status of the PyTorchJob. Read-only
(modified by the system).
properties:
completionTime:
description: Represents time when the job was completed. It is not
guaranteed to be set in happens-before order across separate operations.
It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
conditions:
description: Conditions is an array of current observed job conditions.
items:
description: JobCondition describes the state of the job at a certain
point.
properties:
lastTransitionTime:
description: Last time the condition transitioned from one status
to another.
format: date-time
type: string
lastUpdateTime:
description: The last time this condition was updated.
format: date-time
type: string
message:
description: A human readable message indicating details about
the transition.
type: string
reason:
description: The reason for the condition's last transition.
type: string
status:
description: Status of the condition, one of True, False, Unknown.
type: string
type:
description: Type of job condition.
type: string
required:
- status
- type
type: object
type: array
currentScaler:
description: record scaleout/scalein name when scaling. e.g. (default/scaleout-sample)
type: string
lastReconcileTime:
description: Represents last time when the job was reconciled. It
is not guaranteed to be set in happens-before order across separate
operations. It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
phase:
description: record trainingjob current phase
type: string
replicaStatuses:
additionalProperties:
description: ReplicaStatus represents the current observed state
of the replica.
properties:
active:
description: The number of actively running pods.
format: int32
type: integer
failed:
description: The number of pods which reached phase Failed.
format: int32
type: integer
succeeded:
description: The number of pods which reached phase Succeeded.
format: int32
type: integer
type: object
description: ReplicaStatuses is map of ReplicaType and ReplicaStatus,
specifies the status of each replica.
type: object
restartCount:
description: The number of times the Job has been restarted
format: int32
type: integer
startTime:
description: Represents time when the job was acknowledged by the job controller. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
toDeletePods:
description: record delete pods for scalein
items:
type: string
type: array
required:
- conditions
- replicaStatuses
- restartCount
type: object
type: object
served: true
storage: true
subresources:
status: {}
status:
acceptedNames:
kind: ""
plural: ""
conditions: []
storedVersions: []
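
A hedged sketch of a `ScaleIn` request that fits the spec schema above. It assumes `selector.name` refers to the training job to shrink, which the schema itself does not state. The object and job names are hypothetical.

```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleIn
metadata:
  name: scalein-sample          # hypothetical name
spec:
  selector:
    name: elastic-training      # assumption: name of the target training job
  toDelete:
    count: 1                    # remove one worker
  timeout: 300                  # optional; the schema does not state the unit
  backoffLimit: 3               # optional: retries when executing the script
```

The schema also allows `toDelete.podNames`, a list of strings, for naming specific pods instead of giving a count.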

View File

@ -0,0 +1,182 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.6.0
git-repo: https://github.com/AliyunContainerService/et-operator.git
git-branch: master
git-commit: "1499985"
creationTimestamp: null
name: scaleouts.kai.alibabacloud.com
spec:
group: kai.alibabacloud.com
names:
kind: ScaleOut
listKind: ScaleOutList
plural: scaleouts
singular: scaleout
scope: Namespaced
versions:
- additionalPrinterColumns:
- jsonPath: .status.conditions[-1:].type
name: Phase
type: string
- jsonPath: .metadata.creationTimestamp
name: Age
type: date
name: v1alpha1
schema:
openAPIV3Schema:
description: ScaleOut is the Schema for the scaleouts API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
type: object
spec:
description: ScaleOutSpec defines the desired state of ScaleOut
properties:
backoffLimit:
description: Optional number of retries to execute script.
format: int32
type: integer
env:
items:
properties:
name:
type: string
value:
type: string
type: object
type: array
script:
type: string
selector:
properties:
name:
type: string
type: object
timeout:
description: Optional number of timeout to execute script.
format: int32
type: integer
toAdd:
properties:
count:
format: int32
type: integer
type: object
type: object
status:
description: Most recently observed status of the PyTorchJob. Read-only
(modified by the system).
properties:
addPods:
items:
type: string
type: array
completionTime:
description: Represents time when the job was completed. It is not
guaranteed to be set in happens-before order across separate operations.
It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
conditions:
description: Conditions is an array of current observed job conditions.
items:
description: JobCondition describes the state of the job at a certain
point.
properties:
lastTransitionTime:
description: Last time the condition transitioned from one status
to another.
format: date-time
type: string
lastUpdateTime:
description: The last time this condition was updated.
format: date-time
type: string
message:
description: A human readable message indicating details about
the transition.
type: string
reason:
description: The reason for the condition's last transition.
type: string
status:
description: Status of the condition, one of True, False, Unknown.
type: string
type:
description: Type of job condition.
type: string
required:
- status
- type
type: object
type: array
currentScaler:
description: record scaleout/scalein name when scaling. e.g. (default/scaleout-sample)
type: string
lastReconcileTime:
description: Represents last time when the job was reconciled. It
is not guaranteed to be set in happens-before order across separate
operations. It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
phase:
description: record trainingjob current phase
type: string
replicaStatuses:
additionalProperties:
description: ReplicaStatus represents the current observed state
of the replica.
properties:
active:
description: The number of actively running pods.
format: int32
type: integer
failed:
description: The number of pods which reached phase Failed.
format: int32
type: integer
succeeded:
description: The number of pods which reached phase Succeeded.
format: int32
type: integer
type: object
description: ReplicaStatuses is map of ReplicaType and ReplicaStatus,
specifies the status of each replica.
type: object
restartCount:
description: The number of times the Job has been restarted
format: int32
type: integer
startTime:
description: Represents time when the job was acknowledged by the job controller. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC.
format: date-time
type: string
required:
- conditions
- replicaStatuses
- restartCount
type: object
type: object
served: true
storage: true
subresources:
status: {}
status:
acceptedNames:
kind: ""
plural: ""
conditions: []
storedVersions: []
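
The matching `ScaleOut` request differs from `ScaleIn` only in using `toAdd` instead of `toDelete`. The sketch below makes the same assumption that `selector.name` names the target training job. All names are hypothetical.

```yaml
apiVersion: kai.alibabacloud.com/v1alpha1
kind: ScaleOut
metadata:
  name: scaleout-sample         # hypothetical name
spec:
  selector:
    name: elastic-training      # assumption: name of the target training job
  toAdd:
    count: 2                    # add two workers
  timeout: 300                  # optional; the schema does not state the unit
  backoffLimit: 3               # optional: retries when executing the script
```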

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,223 @@
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.6.0
git-repo: http://gitlab.alibaba-inc.com/kube-ai/kubedlpro.git
git-branch: feature/k8s-1.22
git-commit: 4f076d22
creationTimestamp: null
name: crons.apps.kubedl.io
spec:
group: apps.kubedl.io
names:
kind: Cron
listKind: CronList
plural: crons
singular: cron
scope: Namespaced
versions:
- additionalPrinterColumns:
- jsonPath: .status.conditions[-1:].type
name: State
type: string
- jsonPath: .metadata.creationTimestamp
name: Age
type: date
name: v1alpha1
schema:
openAPIV3Schema:
description: Cron is the Schema for the crons API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
type: object
spec:
description: CronSpec defines the desired state of Cron
properties:
concurrencyPolicy:
description: 'Specifies how to treat concurrent executions of a Task.
Valid values are: - "Allow" (default): allows CronJobs to run concurrently;
- "Forbid": forbids concurrent runs, skipping next run if previous
run hasn''t finished yet; - "Replace": cancels currently running
job and replaces it with a new one'
type: string
deadline:
description: Deadline is the timestamp that a cron job can keep scheduling
until then.
format: date-time
type: string
historyLimit:
description: The number of finished job history to retain. This is
a pointer to distinguish between explicit zero and not specified.
format: int32
type: integer
schedule:
description: The schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
type: string
suspend:
description: This flag tells the controller to suspend subsequent
executions, it does not apply to already started executions. Defaults
to false.
type: boolean
template:
description: Specifies the job that will be created when executing
a CronTask.
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this
representation of an object. Servers should convert recognized
schemas to the latest internal value, and may reject unrecognized
values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource
this object represents. Servers may infer this from the endpoint
the client submits requests to. Cannot be updated. In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
workload:
description: Workload is the specification of the desired cron
job with specific types.
type: object
x-kubernetes-preserve-unknown-fields: true
type: object
required:
- schedule
- template
type: object
status:
description: CronStatus defines the observed state of Cron
properties:
active:
description: A list of currently running jobs.
items:
description: 'ObjectReference contains enough information to let
you inspect or modify the referred object. --- New uses of this
type are discouraged because of difficulty describing its usage
when embedded in APIs. 1. Ignored fields. It includes many fields
which are not generally honored. For instance, ResourceVersion
and FieldPath are both very rarely valid in actual usage. 2.
Invalid usage help. It is impossible to add specific help for
individual usage. In most embedded usages, there are particular restrictions
like, "must refer only to types A and B" or "UID not honored"
or "name must be restricted". Those cannot be well described
when embedded. 3. Inconsistent validation. Because the usages
are different, the validation rules are different by usage, which
makes it hard for users to predict what will happen. 4. The fields
are both imprecise and overly precise. Kind is not a precise
mapping to a URL. This can produce ambiguity during interpretation
and require a REST mapping. In most cases, the dependency is
on the group,resource tuple and the version of the actual
struct is irrelevant. 5. We cannot easily change it. Because
this type is embedded in many locations, updates to this type will
affect numerous schemas. Don''t make new APIs embed an underspecified
API type they do not control. Instead of using this type, create
a locally provided and used type that is well-focused on your
reference. For example, ServiceReferences for admission registration:
https://github.com/kubernetes/api/blob/release-1.17/admissionregistration/v1/types.go#L533
.'
properties:
apiVersion:
description: API version of the referent.
type: string
fieldPath:
description: 'If referring to a piece of an object instead of
an entire object, this string should contain a valid JSON/Go
field access statement, such as desiredState.manifest.containers[2].
For example, if the object reference is to a container within
a pod, this would take on a value like: "spec.containers{name}"
(where "name" refers to the name of the container that triggered
the event) or if no container name is specified "spec.containers[2]"
(container with index 2 in this pod). This syntax is chosen
only to have some well-defined way of referencing a part of
an object. TODO: this design is not final and this field is
subject to change in the future.'
type: string
kind:
description: 'Kind of the referent. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
name:
description: 'Name of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names'
type: string
namespace:
description: 'Namespace of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/'
type: string
resourceVersion:
description: 'Specific resourceVersion to which this reference
is made, if any. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency'
type: string
uid:
description: 'UID of the referent. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids'
type: string
type: object
type: array
history:
description: History is a list of scheduled cron job with its digest
records.
items:
properties:
created:
description: Created is the creation timestamp of job.
format: date-time
type: string
finished:
description: Finished is the failed or succeeded timestamp of
job.
format: date-time
type: string
object:
description: Object is the reference of the historical scheduled
cron job.
properties:
apiGroup:
description: APIGroup is the group for the resource being
referenced. If APIGroup is not specified, the specified
Kind must be in the core API group. For any other third-party
types, APIGroup is required.
type: string
kind:
description: Kind is the type of resource being referenced
type: string
name:
description: Name is the name of resource being referenced
type: string
required:
- kind
- name
type: object
status:
description: Status is the final status when job finished.
type: string
required:
- object
- status
type: object
type: array
lastScheduleTime:
description: Information when was the last time the job was successfully
scheduled.
format: date-time
type: string
type: object
type: object
served: true
storage: true
subresources:
status: {}
status:
acceptedNames:
kind: ""
plural: ""
conditions: []
storedVersions: []

View File

@ -0,0 +1,231 @@
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.4.1
creationTimestamp: null
name: scaleins.kai.alibabacloud.com
spec:
additionalPrinterColumns:
- JSONPath: .status.conditions[-1:].type
name: Phase
type: string
- JSONPath: .metadata.creationTimestamp
name: Age
type: date
group: kai.alibabacloud.com
names:
kind: ScaleIn
listKind: ScaleInList
plural: scaleins
singular: scalein
scope: Namespaced
subresources:
status: {}
validation:
openAPIV3Schema:
properties:
apiVersion:
type: string
kind:
type: string
metadata:
type: object
spec:
properties:
backoffLimit:
format: int32
type: integer
env:
items:
properties:
name:
type: string
value:
type: string
type: object
type: array
script:
type: string
selector:
properties:
name:
type: string
type: object
timeout:
format: int32
type: integer
toDelete:
properties:
count:
type: integer
podNames:
items:
type: string
type: array
type: object
type: object
type: object
version: v1alpha1
versions:
- name: v1alpha1
served: true
storage: true
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.4.1
  creationTimestamp: null
  name: scaleouts.kai.alibabacloud.com
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: Phase
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kai.alibabacloud.com
  names:
    kind: ScaleOut
    listKind: ScaleOutList
    plural: scaleouts
    singular: scaleout
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          type: string
        kind:
          type: string
        metadata:
          type: object
        spec:
          properties:
            backoffLimit:
              format: int32
              type: integer
            env:
              items:
                properties:
                  name:
                    type: string
                  value:
                    type: string
                type: object
              type: array
            script:
              type: string
            selector:
              properties:
                name:
                  type: string
              type: object
            timeout:
              format: int32
              type: integer
            toAdd:
              properties:
                count:
                  format: int32
                  type: integer
              type: object
          type: object
      type: object
  version: v1alpha1
  versions:
  - name: v1alpha1
    served: true
    storage: true
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.4.1
  creationTimestamp: null
  name: trainingjobs.kai.alibabacloud.com
spec:
  additionalPrinterColumns:
  - JSONPath: .status.phase
    name: Phase
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kai.alibabacloud.com
  names:
    kind: TrainingJob
    listKind: TrainingJobList
    plural: trainingjobs
    singular: trainingjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        apiVersion:
          type: string
        kind:
          type: string
        metadata:
          type: object
        spec:
          properties:
            cleanPodPolicy:
              type: string
            etReplicaSpecs:
              properties:
                launcher:
                  properties:
                    replicas:
                      format: int32
                      maximum: 1
                      minimum: 1
                      type: integer
                    restartPolicy:
                      type: string
                  type: object
                worker:
                  properties:
                    maxReplicas:
                      format: int32
                      minimum: 1
                      type: integer
                    minReplicas:
                      format: int32
                      minimum: 1
                      type: integer
                    replicas:
                      format: int32
                      minimum: 1
                      type: integer
                    restartPolicy:
                      type: string
                  type: object
              required:
              - launcher
              - worker
              type: object
              x-kubernetes-preserve-unknown-fields: true
            launcherAttachMode:
              type: string
            slotsPerWorker:
              format: int32
              type: integer
          required:
          - etReplicaSpecs
          type: object
      type: object
  version: v1alpha1
  versions:
  - name: v1alpha1
    served: true
    storage: true

@ -0,0 +1,47 @@
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: mpijobs.kubeflow.org
spec:
  group: kubeflow.org
  version: v1alpha1
  scope: Namespaced
  subresources:
    status: {}
  names:
    plural: mpijobs
    singular: mpijob
    kind: MPIJob
    shortNames:
    - mj
    - mpij
  validation:
    openAPIV3Schema:
      properties:
        spec:
          title: The MPIJob spec
          description: Either `gpus` or `replicas` should be specified, but not both
          oneOf:
          - properties:
              gpus:
                title: Total number of GPUs
                description: Valid values are 1, 2, 4, or any multiple of 8
                oneOf:
                - type: integer
                  enum:
                  - 1
                  - 2
                  - 4
                - type: integer
                  multipleOf: 8
                  minimum: 8
            required:
            - gpus
          - properties:
              replicas:
                title: Total number of replicas
                description: The GPU resource limit should be specified for each replica
                type: integer
                minimum: 1
            required:
            - replicas
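# Illustrative sketch, not part of the original manifest: two hypothetical
# MPIJob objects that should each pass the `oneOf` rule above, assuming the
# API server enforces this schema. The names `example-gpus` and
# `example-replicas` are made up, and only the fields relevant to this
# validation are shown; other fields the operator needs are omitted.
# Setting both `gpus` and `replicas`, or a value such as `gpus: 3`, should be
# rejected by the schema.
#
# apiVersion: kubeflow.org/v1alpha1
# kind: MPIJob
# metadata:
#   name: example-gpus
# spec:
#   gpus: 16        # a multiple of 8; 1, 2, and 4 are the only other valid values
# ---
# apiVersion: kubeflow.org/v1alpha1
# kind: MPIJob
# metadata:
#   name: example-replicas
# spec:
#   replicas: 2     # must be at least 1; `gpus` is left unset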

@ -0,0 +1,43 @@
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  versions:
  - name: v1
    served: true
    storage: true

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: cron-operator
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v0.1.1"

@ -0,0 +1,74 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cron-operator
  namespace: {{ .Release.Namespace }}
  labels:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: cron-operator
      {{- include "arena.labels" . | nindent 6 }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      namespace: {{ .Release.Namespace }}
      labels:
        app: cron-operator
        {{- include "arena.labels" . | nindent 8 }}
    spec:
      containers:
        - name: cron
          image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          args:
            - --workloads=Cron
          ports:
            - containerPort: 8443
              name: metrics
              protocol: TCP
          {{- with .Values.resources }}
          resources:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- if .Values.useHostTimezone }}
          volumeMounts:
            - name: volume-localtime
              mountPath: /etc/localtime
              readOnly: true
          {{- end }}
      {{- if .Values.useHostTimezone }}
      volumes:
        - name: volume-localtime
          hostPath:
            path: /etc/localtime
      {{- end }}
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  kustomize.component: tf-job-operator
                  {{- include "arena.labels" . | nindent 18 }}
              topologyKey: kubernetes.io/hostname
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      serviceAccountName: cron-operator

@ -0,0 +1,267 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cron-operator
  namespace: {{ .Release.Namespace }}
  labels:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: cron-operator-role
  labels:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - kubeflow.org
  resources:
  - tfjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - xdl.kubedl.io
  resources:
  - xdljobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - xdl.kubedl.io
  resources:
  - xdljobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - xgboostjob.kubeflow.org
  resources:
  - xgboostjobs
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - xgboostjob.kubeflow.org
  resources:
  - xgboostjobs/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - "apps.kubedl.io"
  resources:
  - crons
  - crons/status
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: cron-operator-rolebinding
  labels:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cron-operator-role
subjects:
- kind: ServiceAccount
  name: cron-operator
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,19 @@
---
apiVersion: v1
kind: Service
metadata:
  name: cron-operator
  namespace: {{ .Release.Namespace }}
  labels:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: metrics
    protocol: TCP
    name: metrics
  selector:
    app: cron-operator
    {{- include "arena.labels" . | nindent 4 }}

@ -0,0 +1,21 @@
# Default values for cron-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
# -- Replicas of cron-operator deployment.
replicas: 1
# -- Whether to use host timezone in the container.
useHostTimezone: false
# -- Resources for cron-operator pods.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 2Gi
# -- Tolerations for cron-operator pods.
tolerations: []
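# Illustrative sketch, not part of the original file: a hypothetical override
# (for example in a separate values file passed to Helm with `-f`) that mounts
# the node's /etc/localtime into the container and adds one toleration. The
# taint key and value below are made up. If this chart is consumed as a
# subchart, the same keys would sit under the subchart's name in the parent
# chart's values.
#
# useHostTimezone: true
# tolerations:
#   - key: example.com/dedicated
#     operator: Equal
#     value: training
#     effect: NoSchedule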

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: elastic-job-supervisor
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 1.2.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v1.2.0"

@ -0,0 +1,50 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: elastic-job-supervisor
    {{- include "arena.labels" . | nindent 4 }}
  name: elastic-job-supervisor
  namespace: {{ .Release.Namespace }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elastic-job-supervisor
      {{- include "arena.labels" . | nindent 6 }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        {{- include "arena.labels" . | nindent 8 }}
        app: elastic-job-supervisor
    spec:
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
        - command:
            - /job-supervisor
          image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          name: elastic-job-supervisor
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: elastic-job-supervisor
      serviceAccountName: elastic-job-supervisor
      terminationGracePeriodSeconds: 30

@ -0,0 +1,64 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-job-supervisor
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-job-supervisor
  labels:
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - events
  verbs:
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - kubeflow.org
  resources:
  - '*'
  verbs:
  - '*'
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - '*'
  verbs:
  - '*'
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-job-supervisor
  labels:
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-job-supervisor
subjects:
- kind: ServiceAccount
  name: elastic-job-supervisor
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,3 @@
# Default values for elastic-job-supervisor
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: et-operator
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.1
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v0.1.1"

@ -0,0 +1,46 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: et-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: et-operator
  namespace: {{ .Release.Namespace }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: et-operator
      {{- include "arena.labels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "arena.labels" . | nindent 8 }}
        app: et-operator
    spec:
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
        - args:
            - --enable-leader-election
            - --create-ssh-secret={{ .Values.createSSHSecret }}
            - --init-container-image={{ .Values.initContainerImage }}
          command:
            - /manager
          image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          name: manager
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      serviceAccountName: et-operator
      terminationGracePeriodSeconds: 10

@ -0,0 +1,255 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: et-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: et-operator
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: et-operator-leader-election
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: et-operator
  labels:
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - services/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - scaleins
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - scaleins/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - scaleouts
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - scaleouts/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - trainingjobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - kai.alibabacloud.com
  resources:
  - trainingjobs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  verbs:
  - create
  - get
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  verbs:
  - create
  - get
  - list
  - watch
{{- if .Values.createSSHSecret }}
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - watch
  - create
{{- end }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: et-operator-leader-election
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: et-operator-leader-election
subjects:
- kind: ServiceAccount
  name: et-operator
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: et-operator
  labels:
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: et-operator
subjects:
- kind: ServiceAccount
  name: et-operator
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,3 @@
# Default values for et-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: gpu-exporter
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"

@ -0,0 +1,69 @@
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ack-prometheus-gpu-exporter
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "arena.labels" . | nindent 6 }}
      k8s-app: ack-prometheus-gpu-exporter
  template:
    metadata:
      labels:
        {{- include "arena.labels" . | nindent 8 }}
        k8s-app: ack-prometheus-gpu-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_name
                operator: Exists
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
      - env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
        imagePullPolicy: {{ .Values.imagePullPolicy }}
        securityContext:
          privileged: true
        name: node-gpu-exporter
        ports:
        - containerPort: 9445
          name: http-metrics
          protocol: TCP
        resources:
          limits:
            cpu: 300m
            memory: 300Mi
          requests:
            cpu: 200m
            memory: 50Mi
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: docker-sock
      hostPID: true
      restartPolicy: Always
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: File
        name: docker-sock

@ -0,0 +1,17 @@
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: ack-prometheus-gpu-exporter
    {{- include "arena.labels" . | nindent 4 }}
  name: node-gpu-exporter
  namespace: {{ .Release.Namespace }}
spec:
  ports:
  - name: http-metrics
    port: 9445
    protocol: TCP
    targetPort: 9445
  selector:
    k8s-app: ack-prometheus-gpu-exporter
  type: ClusterIP

@ -0,0 +1,22 @@
{{- if .Capabilities.APIVersions.Has "monitoring.coreos.com/v1" -}}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-gpu-exporter
  labels:
    k8s-app: ack-prometheus-gpu-exporter
    {{- include "arena.labels" . | nindent 4 }}
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      k8s-app: ack-prometheus-gpu-exporter
  namespaceSelector:
    matchNames:
    - {{ .Release.Namespace }}
    # any: true
  endpoints:
  - port: http-metrics
    interval: "45s"
    path: /metrics
{{- end }}

@ -0,0 +1 @@

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: mpi-operator
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v1.0.0-aliyun"

@ -0,0 +1,46 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpi-operator
  namespace: {{ .Release.Namespace }}
  labels:
    app: mpi-operator
    {{- include "arena.labels" . | nindent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mpi-operator
      {{- include "arena.labels" . | nindent 6 }}
  template:
    metadata:
      labels:
        app: mpi-operator
        {{- include "arena.labels" . | nindent 8 }}
    spec:
      serviceAccountName: mpi-operator
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
      - name: mpi-operator
        image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
        imagePullPolicy: {{ .Values.imagePullPolicy }}
        args:
        - --gpus-per-node
        - "8"
        - --kubectl-delivery-image
        - {{ include "arena.imagePrefix" . }}/{{ .Values.kubectlDelivery.image }}:{{ .Values.kubectlDelivery.tag }}
        - --alsologtostderr
        - --v=5
        resources:
          {{- toYaml .Values.resources | nindent 10 }}

@ -0,0 +1,103 @@
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-operator
  labels:
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - serviceaccounts
  verbs:
  - create
  - list
  - watch
# This is needed for the launcher Role.
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
# This is needed for the launcher Role.
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  - rolebindings
  verbs:
  - create
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - create
  - list
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - list
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - get
- apiGroups:
  - kubeflow.org
  resources:
  - mpijobs
  - mpijobs/status
  verbs:
  - "*"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mpi-operator
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mpi-operator
  labels:
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mpi-operator
subjects:
- kind: ServiceAccount
  name: mpi-operator
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,4 @@
# Default values for mpi-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: pytorch-operator
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "v0.7.0"

@ -0,0 +1,48 @@
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-operator
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "arena.labels" . | nindent 4 }}
spec:
  replicas: 1
  selector:
    matchLabels:
      name: pytorch-operator
      {{- include "arena.labels" . | nindent 6 }}
  template:
    metadata:
      labels:
        name: pytorch-operator
        {{- include "arena.labels" . | nindent 8 }}
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
        {{- with .Values.global.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- with .Values.tolerations }}
        {{- . | toYaml | nindent 6 }}
        {{- end }}
        {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
        - command:
            - /pytorch-operator.v1
            - --alsologtostderr
            - -v=1
            - --monitoring-port=8443
            - --threadiness=4
            - --init-container-image={{ .Values.initContainerImage }}
          # image: gcr.io/kubeflow-images-public/pytorch-operator:v0.6.0-18-g5e36a57
          image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          name: pytorch-operator
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      serviceAccountName: pytorch-operator

@ -0,0 +1,70 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: pytorch-operator
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,22 @@
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: pytorch-operator
  namespace: {{ .Release.Namespace }}
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: pytorch-operator
    {{- include "arena.labels" . | nindent 4 }}
  type: ClusterIP

@ -0,0 +1,3 @@
# Default values for pytorch-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: tf-dashboard
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"

@ -0,0 +1,47 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-dashboard
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      kustomize.component: tf-job-operator
      {{- include "arena.labels" . | nindent 6 }}
  template:
    metadata:
      labels:
        kustomize.component: tf-job-operator
        name: tf-job-dashboard
        {{- include "arena.labels" . | nindent 8 }}
    spec:
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      tolerations:
      {{- with .Values.global.tolerations }}
      {{- . | toYaml | nindent 6 }}
      {{- end }}
      {{- with .Values.tolerations }}
      {{- . | toYaml | nindent 6 }}
      {{- end }}
      {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      containers:
        - command:
            - /opt/tensorflow_k8s/dashboard/backend
          env:
            - name: KUBEFLOW_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
          imagePullPolicy: {{ .Values.imagePullPolicy }}
          name: tf-job-dashboard
          ports:
            - containerPort: 8080
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
      serviceAccountName: tf-job-dashboard

@ -0,0 +1,83 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: tf-job-dashboard
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-dashboard
  namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: tf-job-dashboard
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-dashboard
rules:
- apiGroups:
  - tensorflow.org
  - kubeflow.org
  resources:
  - tfjobs
  - tfjobs/status
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - '*'
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  - pods/log
  - namespaces
  verbs:
  - '*'
- apiGroups:
  - apps
  - extensions
  resources:
  - deployments
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: tf-job-dashboard
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-dashboard
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tf-job-dashboard
subjects:
- kind: ServiceAccount
  name: tf-job-dashboard
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,26 @@
apiVersion: v1
kind: Service
metadata:
  annotations:
    getambassador.io/config: |-
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: tfjobs-ui-mapping
      prefix: /tfjobs/
      rewrite: /tfjobs/
      service: tf-job-dashboard.kubeflow
  labels:
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-dashboard
  namespace: {{ .Release.Namespace }}
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    {{- include "arena.labels" . | nindent 4 }}
    kustomize.component: tf-job-operator
    name: tf-job-dashboard
  type: ClusterIP

@ -0,0 +1,3 @@
# Default values for tf-dashboard.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/

@ -0,0 +1,24 @@
apiVersion: v2
name: tf-operator
description: A Helm chart for Kubernetes
# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"

@ -0,0 +1,13 @@
apiVersion: v1
data:
  controller_config_file.yaml: |-
    {
        "grpcServerFilePath": "/opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py"
    }
kind: ConfigMap
metadata:
  labels:
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-operator-config
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,71 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-job-operator
  namespace: {{ .Release.Namespace }}
  labels:
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      kustomize.component: tf-job-operator
      {{- include "arena.labels" . | nindent 6 }}
  template:
    metadata:
      labels:
        kustomize.component: tf-job-operator
        name: tf-job-operator
        {{- include "arena.labels" . | nindent 8 }}
    spec:
      containers:
      - name: tf-job-operator
        image: {{ include "arena.imagePrefix" . }}/{{ .Values.image }}:{{ .Values.tag }}
        imagePullPolicy: {{ .Values.imagePullPolicy }}
        command:
        - /opt/kubeflow/tf-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        - --threadiness=4
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
      volumes:
      - name: config-volume
        configMap:
          name: tf-job-operator-config
      nodeSelector:
        {{- include "arena.nodeSelector" . | nindent 8 }}
        {{- include "arena.nonEdgeNodeSelector" . | nindent 8 }}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  kustomize.component: tf-job-operator
                  {{- include "arena.labels" . | nindent 18 }}
              topologyKey: kubernetes.io/hostname
      tolerations:
      {{- with .Values.global.tolerations }}
      {{- . | toYaml | nindent 6 }}
      {{- end }}
      {{- with .Values.tolerations }}
      {{- . | toYaml | nindent 6 }}
      {{- end }}
      {{- include "arena.tolerateNonEdgeNodeSelector" . | nindent 6 }}
      serviceAccountName: tf-job-operator

@ -0,0 +1,102 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tf-job-operator
  namespace: {{ .Release.Namespace }}
  labels:
    app: tf-job-operator
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tf-job-operator
  labels:
    app: tf-job-operator
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
rules:
- apiGroups:
  - tensorflow.org
  - kubeflow.org
  resources:
  - tfjobs
  - tfjobs/status
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tf-job-operator
  labels:
    app: tf-job-operator
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tf-job-operator
subjects:
- kind: ServiceAccount
  name: tf-job-operator
  namespace: {{ .Release.Namespace }}

@ -0,0 +1,23 @@
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: tf-job-operator
    kustomize.component: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  name: tf-job-operator
  namespace: {{ .Release.Namespace }}
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    kustomize.component: tf-job-operator
    name: tf-job-operator
    {{- include "arena.labels" . | nindent 4 }}
  type: ClusterIP

@ -0,0 +1,6 @@
# Default values for tf-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
# -- Replicas of tf-operator deployment.
replicas: 1

@ -0,0 +1,5 @@
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker

@ -0,0 +1,2 @@
global:
  imagePrefix: registry-us-east-1.ack.aliyuncs.com

@ -0,0 +1,49 @@
{{- define "arena.imagePrefix" -}}
{{- if eq .Values.global.clusterProfile "Edge" }}
{{- .Values.global.imagePrefix }}
{{- else if .Values.global.pullImageByVPCNetwork }}
{{- .Values.global.imagePrefix | replace "registry." "registry-vpc." }}
{{- else }}
{{- .Values.global.imagePrefix }}
{{- end }}
{{- end }}
{{- define "arena.nodeSelector" }}
{{- range $nodeKey,$nodeVal := .Values.nodeSelector }}
{{ $nodeKey }}: "{{ $nodeVal }}"
{{- end }}
{{- range $nodeKey,$nodeVal := .Values.global.nodeSelector }}
{{ $nodeKey }}: "{{ $nodeVal }}"
{{- end }}
{{- end }}
{{- define "arena.nonEdgeNodeSelector" }}
{{- if eq .Values.global.clusterProfile "Edge" }}
alibabacloud.com/is-edge-worker: "false"
{{- end }}
{{- end }}
{{- define "arena.tolerateNonEdgeNodeSelector" }}
{{- if eq .Values.global.clusterProfile "Edge" }}
- key: node-role.alibabacloud.com/addon
  operator: Exists
  effect: NoSchedule
{{- end }}
{{- end }}
{{- define "arena.version" }}
{{- .Values.binary.tag }}
{{- end }}
{{- define "arena.labels" -}}
helm.sh/chart: arena-artifacts
app.kubeflow.org/managed-by: arena
{{- end }}
{{- define "crd.api" }}
{{- if .Capabilities.APIVersions.Has "apiextensions.k8s.io/v1beta1" -}}
v1beta1
{{- else -}}
v1
{{- end }}
{{- end }}

@ -0,0 +1,10 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: arena-config
  namespace: {{ .Release.Namespace }}
  labels:
    app.kubeflow.org: arena
    {{- include "arena.labels" . | nindent 4 }}
data:
  adminUsers: ""

@ -0,0 +1,93 @@
{{- if .Values.binary.enabled }}
{{- if gt (int .Values.binary.masterCount) 0 }}
apiVersion: batch/v1
kind: Job
metadata:
  namespace: {{ .Release.Namespace }}
  name: binary-installer-{{ include "arena.version" . }}
  labels:
    app: binary-installer
    name: binary-installer-{{ include "arena.version" . }}
    {{- include "arena.labels" . | nindent 4 }}
spec:
  parallelism: {{ .Values.binary.masterCount }}
  backoffLimit: {{ .Values.binary.retry }}
  template:
    metadata:
      labels:
        app: binary-installer
        name: binary-installer-{{ include "arena.version" . }}
        {{- include "arena.labels" . | nindent 8 }}
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/control-plane
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        operator: Exists
        key: node.cloudprovider.kubernetes.io/uninitialized
      - key: node-role.alibabacloud.com/addon
        operator: Exists
        effect: NoSchedule
      restartPolicy: Never
      containers:
      - name: installer
        image: {{ include "arena.imagePrefix" . }}/{{ .Values.binary.image }}:{{ .Values.binary.tag }}
        imagePullPolicy: {{ .Values.binary.imagePullPolicy }}
        securityContext:
          privileged: true
        command:
        - sh
        - -c
        - |
          rm -rf /usr/local/arena-installer/arena-installer
          cp -a /root/arena-installer /usr/local/arena-installer
          options='--only-binary --region-id {{ include "arena.imagePrefix" . }}'
          {{- if .Values.binary.hostNetwork }}
          options="$options --host-network"
          {{- end }}
          {{- if .Values.binary.rdma }}
          options="$options --rdma"
          {{- end }}
          nsenter -t 1 -i -p -n -u -m -- /usr/local/arena-installer/arena-installer/install.sh $options
        env:
        volumeMounts:
        - name: arena-installer
          mountPath: /usr/local/arena-installer
        - name: kube
          mountPath: /root/.kube
      volumes:
      - hostPath:
          path: /usr/local/arena-installer
          type: DirectoryOrCreate
        name: arena-installer
      - hostPath:
          path: /root/.kube
          type: Directory
        name: kube
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - binary-installer-{{ include "arena.version" . }}
            topologyKey: "kubernetes.io/hostname"
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
{{- end }}
{{- end }}

@ -0,0 +1,114 @@
#
# Copyright 2025 The Kubeflow authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
suite: Test cron operator deployment
templates:
  - charts/cron/templates/operator-dp.yaml
release:
  name: arena-artifacts
  namespace: arena-system
set:
  cron:
    enabled: true
tests:
  - it: Should add tolerations if `global.tolerations` is set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if `cron.tolerations` is set
    set:
      cron:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if both `global.tolerations` and `cron.tolerations` are set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
      cron:
        tolerations:
          - key: key3
            operator: Equal
            value: value3
            effect: NoSchedule
          - key: key4
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
            - key: key3
              operator: Equal
              value: value3
              effect: NoSchedule
            - key: key4
              operator: Exists
              effect: NoSchedule

@ -0,0 +1,110 @@
#
# Copyright 2025 The Kubeflow authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
suite: Test elastic job supervisor deployment
templates:
  - charts/elastic-job-supervisor/templates/deployment.yaml
release:
  name: arena-artifacts
  namespace: arena-system
tests:
  - it: Should add tolerations if `global.tolerations` is set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if `elastic-job-supervisor.tolerations` is set
    set:
      elastic-job-supervisor:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if both `global.tolerations` and `elastic-job-supervisor.tolerations` are set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
      elastic-job-supervisor:
        tolerations:
          - key: key3
            operator: Equal
            value: value3
            effect: NoSchedule
          - key: key4
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
            - key: key3
              operator: Equal
              value: value3
              effect: NoSchedule
            - key: key4
              operator: Exists
              effect: NoSchedule

@ -0,0 +1,114 @@
#
# Copyright 2025 The Kubeflow authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
suite: Test et operator deployment
templates:
  - charts/et/templates/operator-dp.yaml
release:
  name: arena-artifacts
  namespace: arena-system
set:
  et:
    enabled: true
tests:
  - it: Should add tolerations if `global.tolerations` is set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if `et.tolerations` is set
    set:
      et:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
  - it: Should add tolerations if both `global.tolerations` and `et.tolerations` are set
    set:
      global:
        tolerations:
          - key: key1
            operator: Equal
            value: value1
            effect: NoSchedule
          - key: key2
            operator: Exists
            effect: NoSchedule
      et:
        tolerations:
          - key: key3
            operator: Equal
            value: value3
            effect: NoSchedule
          - key: key4
            operator: Exists
            effect: NoSchedule
    asserts:
      - equal:
          path: spec.template.spec.tolerations
          value:
            - key: key1
              operator: Equal
              value: value1
              effect: NoSchedule
            - key: key2
              operator: Exists
              effect: NoSchedule
            - key: key3
              operator: Equal
              value: value3
              effect: NoSchedule
            - key: key4
              operator: Exists
              effect: NoSchedule

Some files were not shown because too many files have changed in this diff.