Commit Graph

221 Commits

Author SHA1 Message Date
Aldo Culquicondor 70a866ee52
Downgrade v2 API to v2beta1 (#378)
To leave the path open for improving the API without having to release a v3.
2021-07-16 11:29:46 -04:00
Yuan Tang d7f7421ba7
Remove go report badge (#377) 2021-07-15 22:42:18 -04:00
Aldo Culquicondor e80137c286
Consolidate validation and defaulting logic (#376)
Validation happens in a single place, improving coverage
2021-07-15 14:28:38 -07:00
Aldo Culquicondor e9547bf98b
Revert horovod example to v1 (#374) 2021-07-15 14:19:38 -07:00
Aldo Culquicondor 6afa62ca0b
Add integration tests for v2 controller (#375)
* Do inter-pod communication through SSH

The controller generates keys and mounts them to the containers. The container images must know how to place the credentials and set file permissions.

* Use init-container instead of entrypoint

* Fix scheme for recorder and defaults

* Add integration tests for v2 controller
2021-07-15 06:43:51 -07:00
ezioliao 30cdf43f93
kubectl-delivery: wait for all workers to become ready (#371)
2. update kubectl_delivery controller unit test.
2021-06-29 04:07:45 -07:00
Aldo Culquicondor 4e7b23eb15
Upgrade v2 dependencies to k8s 1.19 (#370) 2021-06-22 14:15:31 -07:00
Aldo Culquicondor 5de1dbba34
Merge v2 Makefile into main one (#369)
This removes the need for a separate Dockerfile
2021-06-22 16:59:42 -04:00
Xiaoyu Zhai 53e29222a7
Fix kubectl-delivery security issues (#368) 2021-06-22 13:40:06 -04:00
Lei Xue 336fa6d726
Update gomod for sample-controller (#361)
Signed-off-by: Lei Xue <vfs@live.com>
2021-06-22 09:00:46 -04:00
Aldo Culquicondor 3d4a4bdb51
Fork v2 controller and API in a new module (#366) 2021-06-22 08:58:51 -04:00
Aldo Culquicondor 39d2108515
Propose a new architecture with focus on scalability and robustness (#360)
* Propose a new architecture with focus on scalability and robustness

* incorporate comments and fix assumptions about security
2021-06-21 10:34:54 -07:00
Chu Xiangyang 6ee71d45dd
Add gpu resource pattern (#363)
* Add gpu resource pattern

* add tests for isGPULauncher
2021-05-24 08:15:09 -04:00
Aldo Culquicondor b453a9b395
Add development options to build system (#362)
and fix CRD and ClusterRole on kustomize
2021-05-22 18:52:30 -07:00
Wang Zhang 680cd4db0f
Add python sdk and auto-generate script (#357) 2021-05-13 20:20:43 -04:00
Ce Gao 298b527bb0
chore: Add Wang Zhang as a reviewer (#355) 2021-04-21 04:41:41 -07:00
junfan.zhang d56bdb56d0
Fix typo (#349) 2021-04-21 14:50:22 +08:00
Wang Zhang 69b1d3b384
fix trimleft issue when deleting redundant pods (#353) 2021-04-18 20:21:15 +08:00
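The commit title above does not say what the trimleft issue was. One common pitfall that fits the description, shown here only as a hedged illustration, is that Go's `strings.TrimLeft` treats its second argument as a set of characters rather than as a literal prefix. The pod name and prefix below are hypothetical and are not taken from the operator's code.

```go
package main

import (
	"fmt"
	"strings"
)

// trimWithCutset strips every leading character that appears anywhere in
// prefix, because strings.TrimLeft interprets its argument as a cutset.
func trimWithCutset(name, prefix string) string {
	return strings.TrimLeft(name, prefix)
}

// trimWithPrefix strips prefix only when name starts with that exact string.
func trimWithPrefix(name, prefix string) string {
	return strings.TrimPrefix(name, prefix)
}

func main() {
	// Hypothetical pod name and prefix, chosen to expose the difference.
	name, prefix := "job-bob-0", "job-"

	fmt.Println(trimWithCutset(name, prefix)) // "0": the cutset {j,o,b,-} also eats "bob-"
	fmt.Println(trimWithPrefix(name, prefix)) // "bob-0": only the literal prefix is removed
}
```

When a literal prefix has to be removed from a generated name, `strings.TrimPrefix` is usually the safer choice.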
Peng Gao 1ff487d111
Update metrics doc (#350)
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2021-04-13 05:06:03 -07:00
Peng Gao 424088cef4
Add mpi job info metric (#347)
Add mpiJobInfoGauge to collect MPIJob information.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2021-04-07 10:13:21 +08:00
Yuan Tang 129bb84852
chore: Remove inactive owners (#346) 2021-04-03 21:55:21 -07:00
shinytang6 50c8ce152f
feat: upgrade common & volcano version (#345)
Signed-off-by: shinytang6 <1074461480@qq.com>
2021-04-01 10:58:19 -07:00
Yannis Zarkadas b367aa5588
MPI Operator: Consolidate manifests (#340)
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
2021-03-17 23:11:16 -07:00
Wang Zhang f788e75925
propose elastic training with horovod with mpi-operator (#335) 2021-03-11 22:21:25 -08:00
Wang Zhang d708607eed
fix --min-np and --max-np args in elastic example (#338) 2021-03-10 12:48:24 -08:00
Wang Zhang 97e867b618
Elastic deploy (#337) 2021-03-10 05:02:24 -08:00
Wang Zhang 1f9a999f19
fix elastic horovod example issue on command and image (#334)
fix role updating issue when worker replicas increase

fix test issue
2021-03-09 14:26:18 +08:00
Rui Fang 0eed7da66d
Remove launcherRunsWorkload flag (#331) 2021-03-08 21:14:48 -05:00
Wang Zhang 9b32e14211
add elastic horovod support for v1 (#332)
modify test cases

add example for elastic horovod

fix issues from comments
2021-03-04 17:05:44 +08:00
Yannis Zarkadas c39111e53d
MPI Operator: Move manifests development upstream (#326)
* manifests: Move manifests development upstream

As part of the work of wg-manifests for 1.3
(https://github.com/kubeflow/manifests/issues/1735), we are moving manifests
development to upstream repos. This gives application developers full
ownership of their manifests, tracked in a single place.

This commit copies the manifests for application `MPI Operator`
from path `apps/mpi-job/upstream` of kubeflow/manifests to path
`manifests` of the upstream repo (https://github.com/kubeflow/mpi-operator).

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>

* README: Update README to point to new manifests location

Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
2021-03-02 06:27:48 -08:00
Peng Gao fdae816fee
Add created jobs count metric (#320)
There is no metric to count running jobs. The metric
introduced is equal to failed + successful + running.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2021-01-26 04:28:35 -08:00
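The identity stated in the commit message above (created = failed + successful + running) means the number of running jobs can be derived from the three counters. A minimal sketch of that arithmetic follows; the type and field names are hypothetical and are not the operator's actual metric names.

```go
package main

import "fmt"

// jobCounts holds hypothetical counter values read from the metrics endpoint.
// The field names are illustrative only.
type jobCounts struct {
	Created    int
	Successful int
	Failed     int
}

// running derives the number of running jobs from the identity
// created = failed + successful + running.
func (c jobCounts) running() int {
	return c.Created - c.Successful - c.Failed
}

func main() {
	c := jobCounts{Created: 10, Successful: 6, Failed: 1}
	fmt.Println(c.running()) // prints 3
}
```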
Rong Ou ed76e46292
Update horovod version for the tensorflow benchmark example (#317) 2021-01-14 13:47:00 -05:00
Yaoyang Liu c12cb211b8
fix: non-running worker will not be removed under CleanPodPolicyRunning (#313) 2020-12-29 11:40:28 +08:00
Naveen 08acc4bd77
Implemented codeql for code scanning (#311)
Implemented code scanning using CodeQL.
2020-12-19 10:26:25 -08:00
Yuan Tang 802bdc22c0
Add link to GitHub Actions builds on master branch (#309) 2020-12-17 18:02:24 -08:00
Naveen 4697266869
Fixed golangci lint issue (#308)
Fixed the golangci lint issue by fixing the warning.

Also update .gitignore to ignore the coverage file generated by
the tests.
2020-12-17 20:46:12 -05:00
Naveen 05370aadbb
Migrate to GitHub Pipelines from travis (#305)
* Update to include golangci-lint and go mod tidy

Run go mod tidy to make sure the mod files are clean.

golangci-lint makes it possible to run the linter locally instead of just on
CI. The existing code has lint issues, which is why the linter is applied
only to new revisions.

Resolves #304

See also #303

* Implemented GitHub actions

Implemented GitHub actions for build, test, go mod tidy, coverage and
golangci-lint.

* Removed the .travis.yml

Removed the .travis.yml because of migration to GitHub actions.
2020-12-17 18:47:10 -05:00
Naveen 27790c84ff
Cleans up the unused go.sum dependencies (#303) 2020-12-15 16:24:20 -08:00
Naveen 19173091b0
Go fmt changes that caused the git tree to be dirty (#302)
These go fmt changes caused the git tree to be dirty.
2020-12-15 15:20:20 -08:00
Peng Gao 463380c8f0
Support setting queue name (#300)
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2020-12-03 06:19:00 -08:00
divinerapier b4a36be802
fix: it panics if workerSpec.Replicas is unset (#298)
Signed-off-by: divinerapier <sihao.fang@outlook.com>
2020-11-30 06:44:50 -08:00
Yuan Tang 75f424a802
Explicitly pass namespaces to MPI Operator's deployment (#295) 2020-10-19 18:34:28 -04:00
Yuan Tang 5a4d077018
Update Caicloud to Bytedance in adopters list (#292)
* Update Caicloud to Bytedance in adopters list

* Update url

* Update ADOPTERS.md
2020-10-07 09:48:16 -07:00
Andrey Velichkevich 947d396a9c
Change horovod example to V1 (#290)
* Add Horovod v1 example

Add Horovod version to Dockerimage
Add HP in args

* Upload horovod example to docker kubeflow hub
2020-09-21 20:51:40 -04:00
Andrey Velichkevich 21523bb3ab
Fix additional printer columns in v1 CRD (#289) 2020-09-17 09:38:48 -07:00
Tim Deng 07bbb45de9
add support for using Intel MPI(2019.7) and MVAPICH2 (#283)
* + support for IntelMPI and MPICH
+ local minikube test pass
+ add new Spec "mpiDistribution"
@ 2020/7/27

* * fix ineffectual assignment
* change email address

* * update variable name

* * fix some spelling and naming problems

* + add more notes

* + auto filter prefix parameters

* * fix some spelling problem
* update notes about hostfile generating

* + mpich-format hostfile split

* + generate hosts for hydra to resolve hostname

* * update notes

* * fix sh script
+ move hosts sending and merging here
* use special type instead of string

* * check return value

* * update options' name

* + add unit test for generateHosts

* ^ fixed lint reported errors
2020-08-03 04:27:40 -07:00
Lei Xue 0edc69b150
add QPS and burst (#280)
Signed-off-by: Lei Xue <vfs@live.com>
2020-07-27 03:36:16 -07:00
allenfan 42f43a97a2
support run workload on launcher (#276)
* support run workload on launcher

* update unit test

* update isGPULauncher method
2020-07-17 11:34:51 -07:00
Daniel 40697cb2f5
Avoid potential nil pointer. (#275)
Signed-off-by: 屈骏 <qujun@tiduyun.com>
2020-06-30 01:56:06 -07:00
Yuan Tang e9e9abea33
Rename Ant Financial to Ant Group (#273) 2020-06-25 09:48:39 -07:00