213 Commits

Author SHA1 Message Date
Yuki Iwai fe7a35dffa
tenzen-y steps down from Katib approver role (#2561)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-07-28 13:27:49 +00:00
dependabot[bot] dd107108b5
Bump golang.org/x/crypto from 0.31.0 to 0.35.0 (#2543)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.31.0 to 0.35.0.
- [Commits](https://github.com/golang/crypto/compare/v0.31.0...v0.35.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.35.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-16 20:08:39 +00:00
Yuki Iwai 8e887b8719
chore: Upgrade Go version to 1.23 (#2526)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-07-16 19:17:38 +00:00
Andrey Velichkevich 5d70808886
feat(docs): Guide to report security vulnerabilities (#2556)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-07-15 16:29:38 +00:00
Andrey Velichkevich ba2cf7d1ec
chore(docs): Add OpenSSF Badge (#2555)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-07-13 22:06:21 +00:00
Hezhi (Helen) Xie 73b8c5c029
[GSoC] Add e2e test for `tune` api with LLM hyperparameter optimization (#2420)
* add e2e test for tune api

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* upgrade training-operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* specify the version of training operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix num_labels error and update the version of training operator controller

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the version of training operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* debug

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check import path of HuggingFaceModelParams

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the version of training operator sdk

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the name of experiment

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add step of checking pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add check

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check reason for imagepullbackoff

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* revert timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* extend timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update training operator sdk version

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the function of getting logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add the step of describing pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check disk space

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase timeout limit

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of controller and events

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change work directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of kubelet

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of kubelet

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase cpu

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of training operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the use of resources

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check the logs of container 'pytorch' and 'storage_initializer'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix error of checking use of resources

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add other checks to find the error reason

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set 'storage_config'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reduce the number of tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* Check container runtime logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set the driver of minikube as docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* set the driver of minikube to none

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check logs of pod

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check memory usage

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* increase 'termination_grace_period_seconds' in podspec

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix annotations error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* restart docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete restarting docker

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* use original docker data directory

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update installation of Katib SDK with extra requires

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test trainer image built with cpu

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add action of free up disk space (including move docker data directory)

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete unnecessary checks and update the part of fetching pod description and logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete fetching pod logs

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add blank line at the end of free-up-disk-space yaml file

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update experiment name

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update test function name to be consistent with experiment name

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* move import statements inside the function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* apply pprint for the logging output

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update experiment names

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the sequence of arguments in 'trial_template'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test example in user guide

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix access token error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the error of setup

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the error of setup

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reverse back

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2025-06-26 14:13:16 +00:00
dependabot[bot] 5cd9592335
Bump brace-expansion in /pkg/ui/v1beta1/frontend (#2551)
Bumps [brace-expansion](https://github.com/juliangruber/brace-expansion). These dependencies needed to be updated together.

Updates `brace-expansion` from 2.0.1 to 2.0.2
- [Release notes](https://github.com/juliangruber/brace-expansion/releases)
- [Commits](https://github.com/juliangruber/brace-expansion/compare/v2.0.1...v2.0.2)

Updates `brace-expansion` from 1.1.11 to 2.0.2
- [Release notes](https://github.com/juliangruber/brace-expansion/releases)
- [Commits](https://github.com/juliangruber/brace-expansion/compare/v2.0.1...v2.0.2)

---
updated-dependencies:
- dependency-name: brace-expansion
  dependency-version: 2.0.2
  dependency-type: indirect
- dependency-name: brace-expansion
  dependency-version: 2.0.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-12 04:40:51 +00:00
Vikas Saxena 9421f2322b
New fixing kustomize5 warning (#2549)
* ran kustomize edit fix for katib-cert-manager

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* ran kustomize edit fix for katib-external-db

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed up comments

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed build error with katib-cert-manager

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed up comments in katib-external-db

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed warnings in katib-leader-election

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed warnings in katib-openshift

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed warnings in katib-standalone-postgres

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed warnings in katib-with-kubeflow

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixed up comments

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

* fixing diffs errors in the updated code

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>

---------

Signed-off-by: Vikas Saxena <Vikas.Saxena.2006@gmail.com>
2025-05-13 04:31:20 +00:00
Ayush Gupta 1ebd5e4453
Fix Istio sidecar injection by moving from annotations to labels (#2527)
* Fix Istio sidecar injection by moving from annotations to labels

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* Update Istio sidecar injection from annotations to labels across the codebase
Replace annotations with labels for Istio sidecar injection according to Istio recommendations. Update conformance tests, examples, constants, composers, and utilities to use the new label-based approach consistently.

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* fix: Update SuggestionLabels function and composer implementation for Istio label injection

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* Fix linting issues in mpi-job-horovod.py

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

* update: function moved from annotations to labels

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>

---------

Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
2025-05-09 17:52:41 +00:00
Harshvir Potpose c9513c633d
Fix PSS restricted warnings (#2528)
* fix pss warnings

Signed-off-by: Harshvir Potpose <hpotpose62@gmail.com>

* fix mysql

Signed-off-by: Harshvir Potpose <hpotpose62@gmail.com>

---------

Signed-off-by: Harshvir Potpose <hpotpose62@gmail.com>
2025-04-29 16:33:02 +00:00
M!l!nd dd4acfc2ce
feat: add `CITATION.cff` file (#2547)
* feat: add `CITATION.cff` file

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* Update CITATION.cff

Co-authored-by: Shao Wang <2690692950@qq.com>
Signed-off-by: M!l!nd <99114125+milinddethe15@users.noreply.github.com>

---------

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>
Signed-off-by: M!l!nd <99114125+milinddethe15@users.noreply.github.com>
Co-authored-by: Shao Wang <2690692950@qq.com>
2025-04-18 16:47:24 +00:00
dependabot[bot] 349b571541
Bump github.com/golang-jwt/jwt/v4 from 4.5.1 to 4.5.2 (#2533)
Bumps [github.com/golang-jwt/jwt/v4](https://github.com/golang-jwt/jwt) from 4.5.1 to 4.5.2.
- [Release notes](https://github.com/golang-jwt/jwt/releases)
- [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md)
- [Commits](https://github.com/golang-jwt/jwt/compare/v4.5.1...v4.5.2)

---
updated-dependencies:
- dependency-name: github.com/golang-jwt/jwt/v4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-15 12:36:24 +00:00
Helber Belmiro 8e965f11d8
chore(test): Removed the no longer needed trigger-rerun-test.yaml (#2540)
Signed-off-by: Helber Belmiro <helber.belmiro@gmail.com>
2025-04-09 16:41:20 +00:00
Andrey Velichkevich 6578306795
chore(docs): Add Changelog Katib v0.18.0 (#2537)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-03-29 21:48:31 +00:00
saileshd1402 54764d6aa4
Revert GHCR changes for Notebook examples (#2536)
Signed-off-by: sailesh duddupudi <saileshradar@gmail.com>
2025-03-24 22:06:03 +00:00
Mahdi Khashan db4b68bf56
[feature] move manifest image references to ghcr (#2529)
* move to ghcr

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* move images to ghcr

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* manifests

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* change registry in all path

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* update script

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* slight fix

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <58775404+mahdikhashan@users.noreply.github.com>
2025-03-24 17:11:50 +00:00
Mahdi Khashan 1f76bb3bbf
[feature] migrate docker images to ghcr (#2520)
* update custom action

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* define token as input

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* clean up meta job

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* change build-and-publish-images.yaml

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* remove secret from workflow call

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* remove docker credentials from publish* images

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert meta step changes

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert changes

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* update

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* add dockerhub as a job

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert secrets

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert docker secrets

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert docker secrets

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* consolidate/merge registries

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix inputs

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert docker path name

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
2025-03-18 19:12:14 +00:00
dependabot[bot] 4884253067
Bump axios from 1.7.9 to 1.8.3 in /pkg/ui/v1beta1/frontend (#2524)
Bumps [axios](https://github.com/axios/axios) from 1.7.9 to 1.8.3.
- [Release notes](https://github.com/axios/axios/releases)
- [Changelog](https://github.com/axios/axios/blob/v1.x/CHANGELOG.md)
- [Commits](https://github.com/axios/axios/compare/v1.7.9...v1.8.3)

---
updated-dependencies:
- dependency-name: axios
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 22:35:41 +00:00
dependabot[bot] 9e430ceaf5
Bump @babel/helpers from 7.25.0 to 7.26.10 in /pkg/ui/v1beta1/frontend (#2523)
Bumps [@babel/helpers](https://github.com/babel/babel/tree/HEAD/packages/babel-helpers) from 7.25.0 to 7.26.10.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.26.10/packages/babel-helpers)

---
updated-dependencies:
- dependency-name: "@babel/helpers"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 22:31:34 +00:00
Gary Miguel c18035e104
Support old-style TensorFlow events (tensorboard) (#2467)
* Support old-style TensorFlow events (tensorboard)

Fixes: https://github.com/kubeflow/katib/issues/2466
Signed-off-by: Gary Miguel <garymm@garymm.org>

* format

Signed-off-by: Gary Miguel <garymm@garymm.org>

* test

Signed-off-by: Gary Miguel <garymm@garymm.org>

* don't continue loops

Signed-off-by: Gary Miguel <garymm@garymm.org>

* format

Signed-off-by: Gary Miguel <garymm@garymm.org>

---------

Signed-off-by: Gary Miguel <garymm@garymm.org>
2025-02-15 00:59:37 +00:00
Anish Asthana 3c88967299
Add 'KEP Usage' KEP and template link (#2509)
Signed-off-by: Anish Asthana <anishasthana1@gmail.com>
2025-02-14 23:07:37 +00:00
Andrey Velichkevich 338a5c107b
Add Changelog for Katib v0.18.0-rc.0 (#2515)
* Add Changelog for Katib v0.18.0-rc.0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add sections for GSoC projects

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update CHANGELOG.md

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-02-13 19:01:36 +00:00
Andrey Velichkevich 302020c29e
Bump Katib Python SDK to 0.18.0rc0 version (#2514)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-02-13 18:05:36 +00:00
Andrey Velichkevich 7b4652058d
[SDK] Support PyTorchJob as a Trial Worker (#2512)
* [SDK] Support PyTorchJob as Trial Worker

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix pod spec for Job

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Set default restart_policy to Never

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix primary_container_name for PyTorchJob

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add unit tests for PyTorchJob as Trial

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add e2e test for PyTorchJob as Trial

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Bump kubeflow-training SDK

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Deploy Training Operator with server side apply

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Decrease CPUs for E2E

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Install Training Operator for tune workflow

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix comments

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2025-02-13 11:10:36 +00:00
Shashank Mittal 6389cbadf1
[GSOC] `optuna` suggestion service logic update (#2446)
* unit test fixed

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* Update pkg/suggestion/v1beta1/hyperopt/base_service.py

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* comment fixed

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* initial logic update

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added unit and e2e tests for optuna suggestion service update

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* refactored code

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added parameter for logUniform and minor changes

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

---------

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-02-10 16:18:06 +00:00
Shao Wang c2b5b52762
fix(webhook): fix validation message in experiment webhook (#2507)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-02-05 03:09:37 +00:00
Aydan Pirani 4d2a23073a
Set experiment names at a max of 40 characters. (#2468)
Signed-off-by: Aydan Pirani <aydanpirani@gmail.com>
2025-02-04 17:05:36 +00:00
Mahdi Khashan 3e736dc54d
[CI] optimize katib ui dockerfile (#2505)
* fix flakiness

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix flakiness 2

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix flakiness 3

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* use alpine for first stage

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* use alpine git

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* no security audit

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* force npm ci

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
2025-02-01 20:42:33 +00:00
Shashank Mittal bf034636fa
[GSOC] `hyperopt` suggestion service logic update (#2412)
* resolved merge conflicts

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* DISTRIBUTION_UNKNOWN enum set to 0 in gRPC api

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* convert parameter method fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

validation fix

add e2e tests for hyperopt

added e2e test to workflow

* convert feasibleSpace func updated

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* renamed DISTRIBUTION_UNKNOWN to DISTRIBUTION_UNSPECIFIED

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added more test cases for hyperopt distributions

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added support for NORMAL and LOG_NORMAL in hyperopt suggestion service

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added e2e tests for NORMAL and LOG_NORMAL

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

sigma calculation fixed

fix

parse new arguments to mnist.py

* hyperopt-suggestion example update

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* updated logic for log distributions

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* updated logic for log distributions

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* e2e test fixed

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added support for parameter distributions for Parameter type INT

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* unit test fixed

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* Update pkg/suggestion/v1beta1/hyperopt/base_service.py

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* comment fixed

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added unit tests for INT parameter type

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* completed param unit test cases

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* handled default case for normal distributions when min or max are not specified

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fixed validation logic for min and max

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* removed unnecessary test params

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fixes

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* added comments

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* set default distribution as uniform

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* line omit

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* removed empty spaces from yaml files

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

---------

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2025-01-30 21:26:52 +00:00
Hezhi (Helen) Xie 741238d712
Install typing-extensions v4.10.0 to fix Python test error (#2504)
* update the version of typing-extensions

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update comment

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2025-01-30 15:58:53 +00:00
Shao Wang 28e466e1b8
[GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection (#2437)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-01-29 10:23:51 +00:00
Mahdi Khashan 09523cdfad
[SDK] improve PVC creation name error (#2496)
* improve pvc name error message by failing early and clear message with correct name example

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix lint

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* fix lint

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* raise value error for wrong name format by reconciliation

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* revert created utils

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* improve test case name

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* improve value error message

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

* improve code flow

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>

---------

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
2025-01-28 00:32:50 +00:00
Du Xinmin 0133983d4a
Sort experiments by descending creation date by default in katib-ui (#2498)
* Sort experiments by descending creation date by default in katib-ui

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: Update "renders every Experiment name into the table" test to not check order

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: Update "renders every Experiment name into the table" test in order of startTime

Signed-off-by: Xinmin Du <2812493086@qq.com>

---------

Signed-off-by: Xinmin Du <2812493086@qq.com>
2025-01-27 23:14:51 +00:00
Hezhi (Helen) Xie 40e1e651f2
[GSoC] Add unit tests for `tune` API (#2423)
* add unit tests for tune api

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update unit tests and fix api errors

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* test

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update unit tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* undo changes to Makefile

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete debug code

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update unit test

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the version of training operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* adjust 'list_namespaced_persistent_volume_claim' to be called with keyword argument

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* create constant for namespace when check pvc creation error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add type check for 'trainer_parameters'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update test names

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add verification for key Experiment information & add 'kubeflow-training[huggingface]' into dependencies

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add verification for objective metric name

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete unnecessary changes

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* unify objective function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* unify objective function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2025-01-24 20:38:21 +00:00
Hezhi (Helen) Xie 2567939fc9
[SDK] Update `tune` API (#2497)
* fix tune api error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete check for

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2025-01-22 15:19:27 +00:00
dependabot[bot] f46cee565b
Bump axios from 1.7.2 to 1.7.9 in /pkg/ui/v1beta1/frontend (#2486)
Bumps [axios](https://github.com/axios/axios) from 1.7.2 to 1.7.9.
- [Release notes](https://github.com/axios/axios/releases)
- [Changelog](https://github.com/axios/axios/blob/v1.x/CHANGELOG.md)
- [Commits](https://github.com/axios/axios/compare/v1.7.2...v1.7.9)

---
updated-dependencies:
- dependency-name: axios
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-22 13:53:52 +00:00
dependabot[bot] d87b41f4b0
Bump express from 4.19.2 to 4.21.2 in /pkg/ui/v1beta1/frontend (#2477)
Bumps [express](https://github.com/expressjs/express) from 4.19.2 to 4.21.2.
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.2/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.19.2...4.21.2)

---
updated-dependencies:
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-22 13:14:43 +00:00
royliang aa04cf4335
Update MutatingWebhookConfiguration: Switch from objectSelector to AdmissionWebhookMatchConditions (#2241)
Signed-off-by: lianghao208 <roylizard3@gmail.com>
2025-01-22 12:34:49 +00:00
Caio Almeida 59af784f50
chore: supporting the listen-address parameter on db-manager (#2465)
Signed-off-by: Caio Almeida <caio.f.r.amd@gmail.com>
2025-01-22 00:03:41 +00:00
Tsz Lung Chung 224aa9d7a0
fix(api): resolve all api violation exceptions in katib api (#2482)
Signed-off-by: truc0 <22969604+truc0@users.noreply.github.com>
2025-01-21 14:23:11 +00:00
dependabot[bot] 93bee4dc25
Bump golang.org/x/net from 0.27.0 to 0.33.0 (#2476)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.27.0 to 0.33.0.
- [Commits](https://github.com/golang/net/compare/v0.27.0...v0.33.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-15 05:27:12 +00:00
Du Xinmin 0cab624e6e
Upgrade klog to v2 (#2470)
* Upgrade klog dependency to v2

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: fix conflict with k8s update

Signed-off-by: Xinmin Du <2812493086@qq.com>

---------

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <dux.m.in@sjtu.edu.cn>
Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
2025-01-15 05:22:12 +00:00
dependabot[bot] 1412c56059
Bump golang.org/x/crypto from 0.21.0 to 0.31.0 (#2464)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.21.0 to 0.31.0.
- [Commits](https://github.com/golang/crypto/compare/v0.21.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-15 01:55:22 +00:00
Du Xinmin e5482959fc
Ignore cache exporting errors in the image building workflows (#2487)
Signed-off-by: Xinmin Du <2812493086@qq.com>
2025-01-14 23:13:08 +00:00
Shao Wang 3b554aaf64
Upgrade grpcio version to v1.64.1 (#2483)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-01-14 20:35:42 +00:00
Shao Wang bf4a0b2c41
Upgrade Kubernetes to v1.31.3 (#2478)
* chore(ci): add k8s version 1.31.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(Makefile): upgrade envtest version to 1.31 & setup-envtest to release-0.19.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update k8s related package in go.mod

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: make generate.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): add SkipNameValidation option to test frame.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor(grpc): remove deprecated grpc.Dial implementation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(dependency): remove dependency on k8s v1.28

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: add type assertion to ptr.To

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-01-14 11:06:08 +00:00
Shao Wang eb8af4d502
fix(trial): use propagated gomega to improve debuggability. (#2432)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-01-10 18:57:44 +00:00
Shao Wang 9889b33599
Upgrade Kubernetes to v1.30.7 (#2463)
* chore: update go.mod & go mod tidy.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: replace source.Kind and EnqueueRequestForXxx with typed func call.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update admission.Decoder in webhook.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update Makefile.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update codegen script.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: execute update-codegen.sh.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update openapigen & generate new openapi definitions.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update k8s version in CI.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(codegen): output CODEGEN_PKG.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(codegen): move shell check annotation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(ci): change k8s version in go test to 1.30.0.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove toolchain declaration in go.mod

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove codegen dependency in openapigen.sh.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix bugs in recursive dir detection.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove a blank line.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: remove klog/v2

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(codegen): add three dots in the comment.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(codegen): fix package dependency on k8s.io/code-generator.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(Makefile): add go-mod-download.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2025-01-10 18:46:06 +00:00
Mahdi Khashan 9531372530
[DOCS] move llm hyperparameter optimisation design image to the proposal directory and rename it (#2472)
- remove redundant folder

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
2025-01-08 17:43:21 +00:00
Shao Wang 336396436a
fix(ui): update None Collector with Push Collector. (#2418)
* fix(ui): update None Collector with Push Collector.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(ui): replace some remaining None MC.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-12-04 08:28:00 +00:00
Tariq Hasan 5212949244
fix: Resolve errors in e2e tests for cypress in Katib UI (#2384)
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
2024-12-03 14:02:59 +00:00
Shao Wang fce751a90e
doc(example): fix the broken link. (#2433)
* fix: fix the broken link.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(doc): update guidance in multi-users pipelines setup.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-12-02 13:32:58 +00:00
Shao Wang 3e3e0f8cdc
fix: remove remaining MXNet dependency. (#2456)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-12-02 13:23:57 +00:00
Andrey Velichkevich dc3398dbd4
Remove Dropout layer from ENAS Trial container to fix E2E tests (#2455)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-12-02 08:13:57 +00:00
dependabot[bot] 2b41ae62ab
Bump github.com/golang-jwt/jwt/v4 from 4.5.0 to 4.5.1 (#2449)
Bumps [github.com/golang-jwt/jwt/v4](https://github.com/golang-jwt/jwt) from 4.5.0 to 4.5.1.
- [Release notes](https://github.com/golang-jwt/jwt/releases)
- [Changelog](https://github.com/golang-jwt/jwt/blob/main/VERSION_HISTORY.md)
- [Commits](https://github.com/golang-jwt/jwt/compare/v4.5.0...v4.5.1)

---
updated-dependencies:
- dependency-name: github.com/golang-jwt/jwt/v4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-05 00:45:09 +00:00
Gonçalo Montalvão Marques 706a6f2190
docs: remove katib workflow (#2443)
Signed-off-by: Gonçalo Montalvão Marques <9379664+gonmmarques@users.noreply.github.com>
2024-10-15 15:14:18 +00:00
Andrey Velichkevich 0bc143ad1a
Promote @Electronic-Waste and @helenxie-bit as Katib reviewers (#2439)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-10-11 19:07:12 +00:00
Andrey Velichkevich 719ae382c1
Update README and out-of-date docs (#2438)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-10-10 19:50:10 +00:00
Shao Wang 867c40a1b0
[GSoC] Compatibility Changes in Trial Controller (#2394)
* chore: add condition branch in requeue logic.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add ReportObservationLog in katib_manager_util.go.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add ReportTrialUnavailableMetrics func.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: insert unavailable value into Katib DB.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: add nil condition judgement.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: add nil condition judgement in trial_controller_util.go

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(trial): delete nil check of MC kind in the Trial controller.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(trial): init MC in newFakeTrialBatchJob to avoid nil condition in trial reconcile loop.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): fix lint error in controller.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): add integration test for Push MC.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(trial): retry reconciliation when reporting unavailable metrics failed.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): fix EXPECT order.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(trial): add errReportMetricsFailed.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* Update pkg/controller.v1beta1/trial/trial_controller.go

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>

* Update pkg/controller.v1beta1/trial/trial_controller_util.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>

* Update pkg/controller.v1beta1/trial/trial_controller.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): rename errors pkg.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): update the order of UT.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): use different names for UTs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): separate Push MC UTs with original UTs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): fix line error with gofmt.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): reserve one UT for Push MC.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): fix typo error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(trial): make some tiny changes.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): move cancel func to t.Cleanup.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): use the propagated gomega instance to improve debuggability.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(trial): use gofmt to reformat code.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-09-19 07:24:28 +00:00
Hezhi (Helen) Xie bc09cfd412
[SDK] Fix types error (#2424)
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2024-09-05 16:09:15 +00:00
Hezhi (Helen) Xie e251a07cb9
[GSoC] Update `tune` API for LLM hyperparameters optimization (#2393)
* update tune api for llm hyperparameters optimization

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* resolve conflict

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the problem of dependency

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix the format of import statement

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* adjust the blank lines

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete the trainer to reuse it in Training Operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update constants

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update metrics format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the type of  and

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the message of 'ImportError'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add TODO of PVC creation

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the name of pvc

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reuse constants from Training Operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* keep 'parameters' and update validation

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update for test

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* reuse 'get_container_spec' and 'get_pod_template_spec' from Training Operator

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* format with black

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix Lint error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix Lint errors

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete types

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix e2e test error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add TODO

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* format with max line length

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* format docstring

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add helper functions

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* run test again

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* run test again

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* run test again

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix dict substitution in training_parameters

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix typo

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* resolve conflicts and add check for case of no parameters

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix flake8 error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update isort file to black and fix typo

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* modify the set of metrics format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update tune API

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add types.TrainerResources class

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix flake8 error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* resolve conflict

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete properties of 'TrainerResources'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format error

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update types

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add import of 'TrainerResources' in '__init__.py' of katib

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* revert changes and rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check pvc and pv status of katib deployments

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* check pvc and pv status of katib deployments

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* recommit changes

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update minikube version when setup

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete the code that disables formatting for the tune function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update according to andrey's feedback

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* add helper function in utils

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* move metrics_collector_spec back & update helper functions & add return type for helper functions

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* rerun tests

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix some typos

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* simplify the definition of 'TrainerResources'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2024-09-03 11:35:14 +00:00
Shao Wang a524f33830
[SDK] fix grpc related bugs in Python SDK (#2398)
* fix: fix bugs in report_metrics.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix bugs in tune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix bugs in get_trial_metrics.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update .gitignore and setup.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update Makefile.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat: add report_metrics_test.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* feat: add UTs for get_trial_metrics.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update post_gen.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor: rebase to master.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): use single katib_client.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): add TODO for import rewrite.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): fix lint error with black.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): fix lint error with isort.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(sdk): reformat import in katib_client_test.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-08-23 11:12:58 +00:00
Ignas Baranauskas 0e2ba6efc1
Changes isort profile to black, to be fully compatible and adds 'pkg' dir for black and flake8 (#2413)
* Change the isort profile to black, and add pkg dir for black and flake8

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

* Fix the formatting

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

* Fix flake8 lint issues

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

---------

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
2024-08-22 15:33:57 +00:00
Shashank Mittal 4964d04208
[GSOC] Add validator for feasible space distribution (#2404)
* added validator for feasible space distribution

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

validation logic fixed

added unit test

added unit test for valid distribution

requested changes made

Update pkg/webhook/v1beta1/experiment/validator/validator.go

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

fmt

* fmt fix

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

---------

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
2024-08-20 17:16:56 +00:00
Tariq Hasan abd1c428c7
Introduced error constants and replaced reflect with cmp (#2289)
* introduced error constants and replaced reflect with cmp

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* fix order of mock method calls

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

---------

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
2024-08-18 18:32:53 +00:00
Shashank Mittal 2f5bda2da9
[GSOC] added Unknown distribution and convertDistribution in suggestion client (#2403)
* added Unknown distribution and convertDistribution in suggestion client

added unit tests

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

* removed custom compare func

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

---------

Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
2024-08-18 16:27:54 +00:00
Shao Wang 4a385f515a
[Test] Refactor `inject_webhook_test.go` according to the Developer Guide (#2401)
* test(webhook): save current work.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor(test/webhook): refactor inject_webhook_test.go.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(webhook): fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(webhook): add UT deleted by accident.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-08-16 18:46:28 +00:00
Ignas Baranauskas e9e6e0c0b1
Enhance pre-commit hooks with flake8 and black (#2407)
* Add black formatter and flake8 linter to pre-commit

Also adds the flake8 config file

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

* Fixes black formatting

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

* Fixes flake8 linting errors

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>

---------

Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
2024-08-16 10:13:28 +00:00
dependabot[bot] 8eb0e86385
Bump github.com/docker/docker from 26.1.4+incompatible to 26.1.5+incompatible (#2405)
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 26.1.4+incompatible to 26.1.5+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v26.1.4...v26.1.5)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-09 19:55:38 +00:00
Shao Wang b6f7cfd9a7
[SDK] test: Add e2e test for tune function. (#2399)
* fix(sdk): fix error field metrics_collector in tune function.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): Add e2e tests for tune function.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add missing field parameters.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor(test/sdk): add run-e2e-tune-api.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): delete tune testing code in run-e2e-experiment.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add blank lines.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add verbose and temporarily delete e2e-experiment test.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add namespace_labels.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add time.sleep(5).

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add error output.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): build random image for tune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): delete extra debug log.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor(test/sdk): create separate workflow for tune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): change api to API.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): change the permission of scripts.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): delete exit code & comment image pulling.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): delete image pulling phase.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): refactor workflow file to use template.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): mark experiments and trial-images as not required.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): pass tune-api param to setup-minikube.sh.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): fix err in template-e2e-test.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): add debug logs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(sdk): reorder params and delete logs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-08-06 17:50:39 +00:00
Shashank Mittal 51b246fa1c
[GSOC] Support for various Parameter distributions in Katib (#2334)
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

modified feasibleSpace

refactored proposal based on comments

comparison table updated

extra heading removed
2024-07-31 08:03:05 +00:00
Shashank Mittal 6a17c3e35a
[GSoC] Added `DistributionType` to Experiment API (#2377)
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>

modified feasibleSpace

Removed Categorical from Distribution
2024-07-31 04:37:05 +00:00
dependabot[bot] 9a8c9d480f
Bump github.com/docker/docker from 24.0.9+incompatible to 26.1.4+incompatible (#2400)
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 24.0.9+incompatible to 26.1.4+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v24.0.9...v26.1.4)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-07-30 15:47:56 +00:00
Shashank Mittal ffc005855d
added `Distribution` field to feasibleSpace in `api.proto` (#2397)
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
2024-07-26 02:40:55 +00:00
Hezhi Xie 2c57522758
[GSoC] Create LLM Hyperparameters Optimization API Proposal (#2333)
* create llm hyperparameters tuning api proposal

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update llm hyperparameters tuning api proposal

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update proposal

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix some typos

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the path of image and delete parameter 'resources_per_worker' from tune api

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete objective function and adjust the design of tune API

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* Update docs/proposals/llm-hyperparameter-optimization-api.md

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* Move 'Advanced Functionalities' to 'Non-Goals', and update 'Implementation' part

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update 'pytorch_config'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* change the name of 'pytorch_config' to 'resources_per_trial'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* adjust format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* adjust format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* adjust format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update implementation part and the type of 'resources_per_trial'

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update the example

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update 'resources_per_trial'& add one more option for defining objective function

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix typo errors

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* delete 'WIP' tag

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update example

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update example

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* update example

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

* fix format

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-07-25 14:10:54 +00:00
Shao Wang a6c37e4f3a
fix: remove the dependency of `protocmp` in `google.golang.org/protobuf/testing/protocmp`. (#2391)
Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-07-24 16:03:53 +00:00
Shao Wang a8840f26f8
[GSoC] Add New Parameter in `tune` (#2369)
* chore: add metrics_collector_config in tune function.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* rebase: rebase feat/new-param-tune to master.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add metrics collector kind list in comment.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: always pass Trial name to the training container.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: delete passing env variable logics in katib_client.py

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: passing env variable KATIB_TRIAL_NAME in the webhook of pod.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: pass env variable KATIB_TRIAL_NAME only to the primary container.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add report_metrics in post_gen.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: change nil error to allErrs(deleted by accident).

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix lint error in inject_webhook.go.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: wrap env variables passing logics into mutatePodEnv.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add unit tests for mutatePodEnv.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: delete protocmp.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-07-18 17:51:57 +00:00
Alex a3dd708541
Begin enabling pre-commit hooks (#2242)
* Begin enabling pre-commit hooks

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

* Address PR feedback

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>

---------

Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
2024-07-18 17:04:58 +00:00
jaffe 206fe1c106
Update Instructions for Argo Workflows (#2382)
Signed-off-by: jaffe-fly <flydemailbox@163.com>
2024-07-17 15:32:57 +00:00
Ikko Eltociear Ashimine 7be8b243f6
docs: update suggestion.md (#2387)
implmentation -> implementation

Signed-off-by: Ikko Ashimine <ashimine_ikko_bp@tenso.com>
2024-07-17 14:13:57 +00:00
Andrey Velichkevich 0b4e7c1780
Add command to re-run GitHub Actions tests (#2385)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-07-15 15:38:55 +00:00
Andrey Velichkevich 33f60c8ac0
Bump Katib Python SDK to 0.17.0 version (#2379)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-07-15 15:14:55 +00:00
Andrey Velichkevich da3238d310
Add Changelog for Katib v0.17.0 (#2380)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-07-15 15:09:54 +00:00
Tariq Hasan db17214cf0
Replaced hpcloud with nxadm for tail package in Go (#2375)
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
2024-07-10 00:13:12 +00:00
Shao Wang 154a85b740
[GSoC] New Interface `report_metrics` in Python SDK (#2371)
* chore: add report_metrics.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: modify the code according to the first review.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add validation for metrics value & rename katib_report_metrics.py to report_metrics.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update import path in __init__.py.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: delete blank line.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: update RuntimeError doc string & correct spelling error & add new line.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: delete blank in the last line.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-07-05 23:41:48 +00:00
Shao Wang f06906d338
[GSoC] KEP for Project 6: Push-based Metrics Collection for Katib (#2328)
* doc: initial commit of gsoc proposal(project 6).

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* doc: complete KEP for gsoc proposal(Project 6).

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add non-goals and examples.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add .

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add compatibility changes in trial controller.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update architecture figure.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: update doc after the review in 10th, June.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add code link and remove namespace env variable.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: modify proposal after the review in 14th, June.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: delete WIP label.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add timeout param into report_metrics.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: metrics_collector_config spelling.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-06-28 22:54:42 +00:00
Curtis e83628bb49
Use ErrorList for experiment validator (#2329)
Signed-off-by: Kun Chang <curtis@mail.ustc.edu.cn>
2024-06-27 11:03:11 +00:00
Andrey Velichkevich 57ed828702
Add Changelog for Katib v0.17.0-rc.1 (#2370)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-25 16:26:13 +00:00
Vihang Mehta 7eb73b6b19
Remove default caBundle value (#2368)
Signed-off-by: Vihang Mehta <vihang@gimletlabs.ai>
2024-06-24 14:09:09 +00:00
Andrey Velichkevich 8bbac200a8
Bump Katib Python SDK to 0.17.0rc1 version (#2365)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-20 18:51:00 +00:00
Tariq Hasan 99ba1d58cf
Add unit test for `create_experiment` in the `katib_client` module (#2325)
* added logger for katib_client module

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* added API_VERSION as a constant

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* updated the KatibClient constructor to match the TrainingClient constructor

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* added test for create_experiment in katib_client

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

---------

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
2024-06-20 15:34:00 +00:00
Andrey Velichkevich 5a0b7db651
Remove code generation from release script (#2363)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-20 10:27:00 +00:00
Andrey Velichkevich f8b8d8d484
[SDK] Fix empty list for env variables and numpy version (#2360)
* [SDK] Fix empty list for env variables

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix numpy version in tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-18 06:57:58 +00:00
Yuki Iwai 8a342460f2
Upgrade the protobuf version to >=4.21.12,<5 (#2358)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-06-17 10:51:57 +00:00
coldWater 0d190b9437
Replace gRPC code generation tool from Znly/protoc to Buf (#2344)
* Replace gRPC code generation tool from Znly/protoc to Buf

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* del build.sh

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* cleanup

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix test

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* refine

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* rm outer yaml

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

---------

Signed-off-by: forsaken628 <forsaken628@gmail.com>
2024-06-15 15:18:33 +00:00
coldWater e6bd3e7b5b
Replace already closed github.com/golang/mock with go.uber.org/mock (#2357)
* replace gomock

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* revert

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

---------

Signed-off-by: forsaken628 <forsaken628@gmail.com>
2024-06-14 13:54:09 +00:00
coldWater b02aed8ec6
Use cache-dependency-path in actions/setup-go for CI workflow (#2355)
Signed-off-by: forsaken628 <forsaken628@gmail.com>
2024-06-14 07:06:08 +00:00
coldWater 4e4ce6f731
Fix TestReconcileBatchJob (#2350)
* update

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* update

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* update

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* update

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* cleanup

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* update

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* use gomock

Signed-off-by: forsaken628 <forsaken628@gmail.com>

---------

Signed-off-by: forsaken628 <forsaken628@gmail.com>
2024-06-14 06:41:09 +00:00
Andrey Velichkevich 7959ffd548
[SDK] Explain Python version support cycle (#2354)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-13 08:25:08 +00:00
coldWater d69d04e77e
Migrate KatibCertGenerator to OPA CertController (#2345)
* Migrate KatibCertGenerator to OPA CertController

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* fix

Signed-off-by: forsaken628 <forsaken628@gmail.com>

* typo

Signed-off-by: forsaken628 <forsaken628@gmail.com>

---------

Signed-off-by: forsaken628 <forsaken628@gmail.com>
2024-06-12 10:10:07 +00:00
Andrey Velichkevich 2a9ffb169b
Update Slack Invitation (#2349)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-06-11 11:18:08 +00:00
Hezhi Xie 87aec69b9f
Fix apple silicon rosetta error when building images from the source code (#2342)
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2024-06-05 11:59:03 +00:00
Yuki Iwai 55e283ea1b
Drop Python 3.7 and Support Python 3.11 in the SDK (#2337)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-05-29 13:39:15 +00:00
Jerry-yz 328bc5ca6a
fix katib use crds token pipeline trial template guide (#2330)
Signed-off-by: Jerry-yz <yz386071268@gmail.com>
2024-05-29 09:42:16 +00:00
Andrey Velichkevich 199e8a41f5
Update GitHub template to better triage Issues (#2335)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-05-29 02:04:16 +00:00
Andrey Velichkevich a1046db880
Fix Scikit-Learn Version for Skopt Tests (#2336)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-05-29 00:15:11 +00:00
Andrey Velichkevich c4c3eb5243
Add Changelog for Katib v0.17.0-rc.0 (#2319)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-05-13 15:22:18 +00:00
Mehrshad 8c9a33a2f7
Update outdated actions (#2324)
Signed-off-by: Mehrshad <code.rezaei@gmail.com>
2024-05-07 06:20:43 +00:00
Tariq Hasan 1551ca3975
Make test fields private in Go unit tests (#2316)
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
2024-04-30 14:38:50 +00:00
Andrey Velichkevich af900202c6
Bump Katib Python SDK to 0.17.0rc0 Version (#2318)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-04-30 14:06:50 +00:00
Andrey Velichkevich ea46a7f2b7
Support ARM64 arch for release images (#2315)
* Support ARM arch for release images

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update Developer Doc

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-04-24 22:48:44 +00:00
dependabot[bot] 2d308b72c3
Bump golang.org/x/net from 0.19.0 to 0.23.0 (#2312)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.19.0 to 0.23.0.
- [Commits](https://github.com/golang/net/compare/v0.19.0...v0.23.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-19 13:55:49 +00:00
Yuki Iwai 21320b6d57
Upgrade Go version to v1.22 (#2309)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-15 12:58:51 +00:00
Yuki Iwai 025ce256a4
Drop Kubernetes v1.26, and support Kubernetes v1.29 (#2308)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-15 10:55:51 +00:00
Yuki Iwai 1365e473c5
Drop Kubernetes v1.25, and Support Kubernetes v1.28 (#2303)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-11 16:14:47 +00:00
Andrey Velichkevich 086093fed7
[SDK] Fix env per Trial parameter in tune API (#2304)
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2024-04-11 07:09:47 +00:00
Shao Wang 7df05c23a5
fix: clean up UTs for file metrics collector (#2285)
* chore: replace testDir with tempDir.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: expose and compare errors.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor: integrate test generation func into testCases.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor: update error comparing mechanism.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: make some changes under the review of yuki.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-04-03 06:42:22 +00:00
Yuki Iwai 9680b8c73f
Upgrade TensorFlow version to v2.16.1 (#2282)
* Upgrade TensorFlow version to v2.16.1

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace deprecated ImageDataGenerator with new data augmentation approach

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-02 20:09:22 +00:00
Yuki Iwai 8629a3ce05
CI: Enable parallel mode for the coveralls (#2297)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-04-02 14:50:22 +00:00
Bharath K Balaji 36150bc3e9
Python SDK - Generate Name functionality for creating experiments. (#2272)
* added dco

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* updated condition

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* added exception to catch missing name and generateName

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* updated experiment_name in create_experiment

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* py sdk create_exp - added type validation

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* added dco

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* updated condition

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* added exception to catch missing name and generateName

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* updated experiment_name in create_experiment

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

* py sdk create_exp - added type validation

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>

---------

Signed-off-by: Bharath Krishna <bharathk005@gmail.com>
2024-04-02 14:18:22 +00:00
dependabot[bot] 250e9d176f
Bump golang.org/x/net from 0.10.0 to 0.17.0 (#2233)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0.
- [Commits](https://github.com/golang/net/compare/v0.10.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 20:04:16 +00:00
dependabot[bot] 1df32f2b24
Bump google.golang.org/grpc from 1.53.0 to 1.56.3 (#2236)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.53.0 to 1.56.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.53.0...v1.56.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 18:47:17 +00:00
dependabot[bot] 0a5c9e5191
Bump golang.org/x/crypto from 0.1.0 to 0.17.0 (#2249)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.1.0 to 0.17.0.
- [Commits](https://github.com/golang/crypto/compare/v0.1.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 16:56:17 +00:00
dependabot[bot] b3e4715c33
Bump jose from 2.0.6 to 2.0.7 in /pkg/ui/v1beta1/frontend (#2275)
Bumps [jose](https://github.com/panva/jose) from 2.0.6 to 2.0.7.
- [Release notes](https://github.com/panva/jose/releases)
- [Changelog](https://github.com/panva/jose/blob/v2.0.7/CHANGELOG.md)
- [Commits](https://github.com/panva/jose/compare/v2.0.6...v2.0.7)

---
updated-dependencies:
- dependency-name: jose
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 15:41:18 +00:00
dependabot[bot] ec86f23311
Bump google.golang.org/protobuf from 1.30.0 to 1.33.0 (#2284)
Bumps google.golang.org/protobuf from 1.30.0 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 15:34:18 +00:00
dependabot[bot] 51c9350847
Bump github.com/docker/docker from 24.0.0+incompatible to 24.0.9+incompatible (#2292)
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 24.0.0+incompatible to 24.0.9+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v24.0.0...v24.0.9)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-25 15:14:17 +00:00
dependabot[bot] ae894507c9
Bump follow-redirects from 1.15.4 to 1.15.6 in /pkg/ui/v1beta1/frontend (#2287)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.15.4 to 1.15.6.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.15.4...v1.15.6)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-18 23:17:35 +00:00
Yuki Iwai 6f372f6808
Upgrade Python version to 3.11 (#2278)
* Upgrade Python version to 3.11

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Upgrade the numpy version to 1.25.2

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Increase resource requests for the ENAS suggestion service

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update pytest CI

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Prepare dedicated pytest for skopt

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-03-12 19:54:11 +00:00
Shao Wang 5837b8a90e
chore: add unit testcases for files in Text format. (#2274)
* chore: add unit testcases for files in Text format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: adjust file layout using gofmt.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: combine test for JSON and TEXT file format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: rename file-gen functions.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* refactor: update cmd.Diff params and log outputs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: add more valid and invalid testcases for TEXT format.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: convert testcase name to const.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore: compact dir generation & deletion operations into funcs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: delete constants used only once.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix: fix ci error in errorcheck.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
2024-03-12 18:13:11 +00:00
Yuki Iwai 679e6fb8f8
Upgrade PyTorch version to v2.2.1 (#2279)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-03-12 00:07:10 +00:00
Chen Pin-han 61406a5397
Fix tensor devices for DARTS Trial (#2273)
* Update architect.py

72907153+sifa1024@users.noreply.github.com

Signed-off-by: Chen Pin-Han <72907153+sifa1024@users.noreply.github.com>

* Update run_trial.py

72907153+sifa1024@users.noreply.github.com

Signed-off-by: Chen Pin-Han <72907153+sifa1024@users.noreply.github.com>

* Update architect.py

72907153+sifa1024@users.noreply.github.com

Signed-off-by: Chen Pin-Han <72907153+sifa1024@users.noreply.github.com>

---------

Signed-off-by: Chen Pin-Han <72907153+sifa1024@users.noreply.github.com>
2024-03-10 03:15:40 +00:00
Curtis a2f3fcae55
Add environment variable option to set postgres ssl mode (#2266)
Signed-off-by: Kun Chang <curtis@mail.ustc.edu.cn>
2024-03-05 19:31:07 +00:00
Yuki Iwai 03a400128a
Upgrade google/go-containerregistry/pkg/authn/k8schain (#2252)
Signed-off-by: tenzen-y <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-03-05 09:45:07 +00:00
Yuki Iwai fc858d15dd
Remove MXNet examples (#2267)
* UT: Replace MXNet example with PyTorch example

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* CI: Replace MXNet examples with PyTorch examples

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2024-03-04 10:45:07 +00:00
Matteo Mortari 8df3c5c838
typo fix stale.yaml (#2257)
Message `close-pr-message` was likely a wrong copy-paste from stale.

This aligns `close-` messages.
2024-02-05 14:40:17 +00:00
dependabot[bot] 19268062f1
Bump axios and wait-on in /pkg/ui/v1beta1/frontend (#2254)
Bumps [axios](https://github.com/axios/axios) to 1.6.5 and updates ancestor dependency [wait-on](https://github.com/jeffbski/wait-on). These dependencies need to be updated together.


Updates `axios` from 0.27.2 to 1.6.5
- [Release notes](https://github.com/axios/axios/releases)
- [Changelog](https://github.com/axios/axios/blob/v1.x/CHANGELOG.md)
- [Commits](https://github.com/axios/axios/compare/v0.27.2...v1.6.5)

Updates `wait-on` from 7.0.1 to 7.2.0
- [Release notes](https://github.com/jeffbski/wait-on/releases)
- [Commits](https://github.com/jeffbski/wait-on/compare/v7.0.1...v7.2.0)

---
updated-dependencies:
- dependency-name: axios
  dependency-type: indirect
- dependency-name: wait-on
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-09 21:47:06 +00:00
dependabot[bot] 10f17fedfb
Bump follow-redirects from 1.14.8 to 1.15.4 in /pkg/ui/v1beta1/frontend (#2253)
Bumps [follow-redirects](https://github.com/follow-redirects/follow-redirects) from 1.14.8 to 1.15.4.
- [Release notes](https://github.com/follow-redirects/follow-redirects/releases)
- [Commits](https://github.com/follow-redirects/follow-redirects/compare/v1.14.8...v1.15.4)

---
updated-dependencies:
- dependency-name: follow-redirects
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-09 20:46:33 +00:00
Sunghyuk Kay d92c168baa
DB: Add environment variable option to skip DB table creation (#2245)
* DB: Add env to skip DB creation

* DB: Rename var name & Remove new function

* Migration -> Initialization
* Remove GetBoolEnvOrDefault

* DB: Rearrange dependencies
2024-01-04 16:21:13 +00:00
Yuki Iwai bf9a1b09e9
Add Technical and style guide to the contribution guide (#2250)
Signed-off-by: tenzen-y <yuki.iwai.tz@gmail.com>
2024-01-04 14:41:12 +00:00
Yuki Iwai 75ea35cc0f
Install typing-extensions v4.6.3 for Optuna (#2251)
Signed-off-by: tenzen-y <yuki.iwai.tz@gmail.com>
2024-01-04 13:32:12 +00:00
Andrey Velichkevich 4617346302
Remove legacy BO code (#2246)
2023-12-06 02:46:06 +00:00
Shi Pengcheng f4c8861c81
[SDK] Add `env` & `env_from` in client tune (#2235)
* add env & env_from spec

* unify env and env_from specs
2023-11-17 09:33:08 +00:00
Andrey Velichkevich fbe7c786e9
Add Changelog for Katib v0.16.0 (#2239)
2023-11-03 03:07:52 +00:00
Andrey Velichkevich f62e40dbd3
Bump Katib Python SDK to 0.16.0 version (#2238)
2023-11-03 03:06:52 +00:00
Andrey Velichkevich 700e64e053
Fix Optuna Validation for CMA-ES (#2240)
* Fix Optuna Validation for CMA-ES

* Fix Optuna test
2023-11-02 18:48:32 +00:00
dependabot[bot] d2e311fc03
Bump debug from 4.2.0 to 4.3.4 in /pkg/ui/v1beta1/frontend (#2230)
Bumps [debug](https://github.com/debug-js/debug) from 4.2.0 to 4.3.4.
- [Release notes](https://github.com/debug-js/debug/releases)
- [Commits](https://github.com/debug-js/debug/compare/4.2.0...4.3.4)

---
updated-dependencies:
- dependency-name: debug
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-17 17:07:56 +00:00
dependabot[bot] cf7fe2e47e
Bump @babel/traverse from 7.15.4 to 7.23.2 in /pkg/ui/v1beta1/frontend (#2234)
Bumps [@babel/traverse](https://github.com/babel/babel/tree/HEAD/packages/babel-traverse) from 7.15.4 to 7.23.2.
- [Release notes](https://github.com/babel/babel/releases)
- [Changelog](https://github.com/babel/babel/blob/main/CHANGELOG.md)
- [Commits](https://github.com/babel/babel/commits/v7.23.2/packages/babel-traverse)

---
updated-dependencies:
- dependency-name: "@babel/traverse"
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-17 12:44:56 +00:00
Shi Pengcheng 50a3f4110d
[SDK] Add 'algorithm_settings' in client tune (#2227)
2023-10-05 10:22:15 +00:00
Alex 520a39701b
[SDK] Raise more human-readable name conflict exception (#2199)
Co-authored-by: andreafehrman <andrea.k.fehrman@vanderbilt.edu>
Co-authored-by: harrisonfritz <harrisonmichaelfritz@gmail.com>
2023-09-07 22:21:33 +00:00
Andrey Velichkevich e3e0aa24ae
Add Katib ROADMAP 2022/2023 (#2153)
* Add Katib ROADMAP 2022/2023

* Add multi-objective optimization

* Add Scalability Improvements

* Remove Katib CRD naming
2023-08-24 22:40:54 +00:00
Andrey Velichkevich 2843a814a6
Update Ubuntu to 22.04 for E2E Tests (#2222)
* Update Ubuntu to 22.04 for E2E Tests

* Update Ubuntu for all Tests
2023-08-24 20:06:16 +00:00
Andrey Velichkevich 373f6e6d7d
Run Stale Action Every 5th Hour (#2221)
2023-08-23 15:18:46 +00:00
Andrey Velichkevich ea27fa7fee
Add Stale GitHub Action (#2220)
2023-08-21 17:15:35 +00:00
Yuki Iwai 87a0161c2c
Use the controller-runtime logger in the cert-generator (#2219)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-18 17:05:50 +00:00
Andrey Velichkevich 1f5fb48c6e
Add Changelog for Katib v0.16.0-rc.1 (#2218)
2023-08-17 00:33:38 +00:00
Andrey Velichkevich b107b2cf4e
Add Changelog for Katib v0.16.0-rc.0 (#2204)
2023-08-16 22:31:37 +00:00
Andrey Velichkevich 2f3ffc7d23
Bump Katib Python SDK to 0.16.0rc1 version (#2217)
2023-08-16 16:07:03 +00:00
dependabot[bot] 2ae992a111
Bump d3-color and @swimlane/ngx-charts in /pkg/ui/v1beta1/frontend (#2210)
Bumps [d3-color](https://github.com/d3/d3-color) to 3.1.0 and updates ancestor dependency [@swimlane/ngx-charts](https://github.com/swimlane/ngx-charts). These dependencies need to be updated together.


Updates `d3-color` from 2.0.0 to 3.1.0
- [Release notes](https://github.com/d3/d3-color/releases)
- [Commits](https://github.com/d3/d3-color/compare/v2.0.0...v3.1.0)

Updates `@swimlane/ngx-charts` from 19.2.0 to 20.4.1
- [Changelog](https://github.com/swimlane/ngx-charts/blob/master/docs/changelog.md)
- [Commits](https://github.com/swimlane/ngx-charts/compare/19.2.0...20.4.1)

---
updated-dependencies:
- dependency-name: d3-color
  dependency-type: indirect
- dependency-name: "@swimlane/ngx-charts"
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-08-16 04:37:03 +00:00
Yuki Iwai 29887c13a0
Upgrade Tensorflow version to v2.13.0 (#2201)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 23:09:03 +00:00
Yuki Iwai c33494bc8f
Start waiting for certs to be ready before sending data to the channel (#2209)
Start waiting for certs to be ready before sending data to the channel

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 22:17:03 +00:00
Yuki Iwai aa772b607d
Remove a katib-webhook-cert Secret from components (#2207)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 22:06:03 +00:00
Yuki Iwai 1b68744276
Bug: Wait for the certs to be mounted inside the container (#2198)
* Wait for the certs to be mounted inside the container

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Initialize fullServiceDomain when adding certgenerator to the manager

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Output logs every 15 seconds if the certs don't yet exist in the container

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 21:15:03 +00:00
Yuki Iwai 2ae3eb5adf
E2E: Add additional checks to verify if the components are ready (#2202)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 18:28:02 +00:00
Yuki Iwai 4dbb49f536
Skip to inject the metrics-collector pods to the katib controller (#2203)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-15 18:13:03 +00:00
Andrey Velichkevich 7f0d9229fa
Bump Katib Python SDK to 0.16.0rc0 version (#2205)
2023-08-15 15:18:03 +00:00
Yuki Iwai 888bec38f4
Sending an empty data to the certsReady channel (#2196)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-05 14:17:53 +00:00
Alex 923d0fcca8
[SDK] Enable resource specification for trial containers (#2192)
Co-authored-by: shipengcheng1230 <shipengcheng1230@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2023-08-05 10:46:54 +00:00
Andrey Velichkevich 114485dc04
Change failurePolicy to Fail for Katib Webhooks (#2018)
2023-08-04 23:27:53 +00:00
Yuki Iwai 06740a00e9
Consolidate the katib-cert-generator to the katib-controller (#2185)
* Consolidate the katib-cert-generator to the katib-controller

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use deployed secret instead of creating a new secret when the cert-generator saves certs on secret

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename secretName with webhookSecretName in the .init.certGenerator

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix manifests

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove unneeded comments

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Restore unintentionally deleted log

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Rename package cert-generator with certgenerator

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Add test cases to check if the enable is set to true when the webhookServiceName or webhookSecretName is set

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update the developer guide

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Swap liveness probe and readiness probe

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Introduce SSA to the cert-generator

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use the same member names between CertGenerator and KatibConfig

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Disable leader election on the cert-generator

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Drop unneeded fields from SSA patches

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-04 19:31:20 +00:00
mChowdhury-91 f074329a14
Default Resume Policy to never from UI (#2195)
2023-08-04 18:05:20 +00:00
Yuki Iwai 74cf5b8d4e
Upgrade Go version to v1.20 (#2190)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-03 11:20:19 +00:00
Yuki Iwai c731fd29d5
Replace grpc_health_probe with the built-in gRPC container probe feature (#2189)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-03 11:19:20 +00:00
Yuki Iwai c749d27c70
Allow install binaries for the arm64 in the envtest (#2188)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-01 20:41:07 +00:00
Yuki Iwai e69235daa1
Implement KatibConfig API (#2176)
* Implement KatibConfig API

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace 'collectorKind' with 'kind'

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace 'metricsCollectorSidecars' with 'metricsCollectors'

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix a typo

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make the init.controller.leaderElection non-pointer

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Make the init.controller.injectSecurityContext non-pointer

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update a comment for the future works

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Update manifest for the katib-leader-election

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix a comment for the KatibConfig API

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Replace 'configapi' with 'configv1beta1'

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove debug code

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put constant for the default KatibConfig value on /pkg/apis/config/v1beta1

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Use 'sigs.k8s.io/yaml' instead of 'github.com/ghodss/yaml'

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Avoid to depend on k8s.io/utils directly

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix a typo

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Refactor katib-config using kustomize vars

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Fix a typo

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Put KatibConfig on every install

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

* Remove configMapGenerator from the katib-with-kubeflow

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

---------

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-01 18:45:08 +00:00
Yuki Iwai f1e3f3adcd
Drop Kubernetes v1.24 and support Kubernetes v1.27 (#2182)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-08-01 16:32:06 +00:00
Alex b7295cb548
[SDK] Add namespace parameter to KatibClient (#2183)
* [SDK] Add namespace parameter to KatibClient

Co-authored-by: andreafehrman <andrea.k.fehrman@vanderbilt.edu>
Co-authored-by: ryanrusson <ryan.russon@gmail.com>

* Update sdk/python/v1beta1/kubeflow/katib/api/katib_client.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Co-authored-by: andreafehrman <andrea.k.fehrman@vanderbilt.edu>
Co-authored-by: ryanrusson <ryan.russon@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
2023-08-01 15:18:08 +00:00
Yuki Iwai d67c07b7a1
Drop Kubernetes v1.23 and support Kubernetes v1.26 (#2177)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-07-31 20:15:29 +00:00
Yuki Iwai a6938481b1
Replace action to setup minikube with medyagh/setup-minikube (#2178)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-07-31 15:42:30 +00:00
Andrey Velichkevich a20bc85b94
[UI] Remove Deprecated Katib UI (#2179)
* [UI] Remove Deprecated Katib UI

* Fix UI Developer doc
2023-07-25 09:53:29 +00:00
dependabot[bot] 89bd21f710
Bump word-wrap from 1.2.3 to 1.2.4 in /pkg/new-ui/v1beta1/frontend (#2174)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-19 11:21:24 +00:00
dependabot[bot] eb901c192e
Bump word-wrap from 1.2.3 to 1.2.4 in /pkg/ui/v1beta1/frontend (#2173)
Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4.
- [Release notes](https://github.com/jonschlinkert/word-wrap/releases)
- [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4)

---
updated-dependencies:
- dependency-name: word-wrap
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-19 09:56:23 +00:00
dependabot[bot] f740889569
Bump webpack from 5.74.0 to 5.88.2 in /pkg/ui/v1beta1/frontend (#2172)
Bumps [webpack](https://github.com/webpack/webpack) from 5.74.0 to 5.88.2.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](https://github.com/webpack/webpack/compare/v5.74.0...v5.88.2)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-18 18:48:22 +00:00
dependabot[bot] 3b7c77a582
Bump golang.org/x/net from 0.5.0 to 0.7.0 (#2122)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.5.0 to 0.7.0.
- [Release notes](https://github.com/golang/net/releases)
- [Commits](https://github.com/golang/net/compare/v0.5.0...v0.7.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-18 18:46:22 +00:00
dependabot[bot] 067c119337
Bump google.golang.org/grpc from 1.47.0 to 1.53.0 (#2167)
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.47.0 to 1.53.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.47.0...v1.53.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-18 15:30:26 +00:00
dependabot[bot] c5552362dc
Bump semver from 5.7.1 to 5.7.2 in /pkg/new-ui/v1beta1/frontend (#2170)
Bumps [semver](https://github.com/npm/node-semver) from 5.7.1 to 5.7.2.
- [Release notes](https://github.com/npm/node-semver/releases)
- [Changelog](https://github.com/npm/node-semver/blob/v5.7.2/CHANGELOG.md)
- [Commits](https://github.com/npm/node-semver/compare/v5.7.1...v5.7.2)

---
updated-dependencies:
- dependency-name: semver
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-18 15:12:22 +00:00
dependabot[bot] 86602b56ed
Bump semver from 6.3.0 to 6.3.1 in /pkg/ui/v1beta1/frontend (#2169)
Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1.
- [Release notes](https://github.com/npm/node-semver/releases)
- [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md)
- [Commits](https://github.com/npm/node-semver/compare/v6.3.0...v6.3.1)

---
updated-dependencies:
- dependency-name: semver
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-18 13:00:22 +00:00
dependabot[bot] 6bb3a3f3f3
Bump tough-cookie from 4.1.2 to 4.1.3 in /pkg/ui/v1beta1/frontend (#2168)
Bumps [tough-cookie](https://github.com/salesforce/tough-cookie) from 4.1.2 to 4.1.3.
- [Release notes](https://github.com/salesforce/tough-cookie/releases)
- [Changelog](https://github.com/salesforce/tough-cookie/blob/master/CHANGELOG.md)
- [Commits](https://github.com/salesforce/tough-cookie/compare/v4.1.2...v4.1.3)

---
updated-dependencies:
- dependency-name: tough-cookie
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-10 08:55:54 +00:00
Andrey Velichkevich ede6e7410c
[UI] Fix Trial Logs when Kubernetes Job Fails (#2164)
* [UI] Fix Trial Logs when Kubernetes Job Fails

* Return error when Pod is in the Pending state
2023-06-20 02:20:40 +00:00
Andrew Scribner 37b237f560
Remove Charmed Operators for Katib (#2161)
This PR removes the Charmed Operators for Katib, as well as the associated tests.  In the past this repo was the source of truth for these operators, but they have since been maintained [here](https://github.com/canonical/katib-operators/) and we've done a poor job of keeping the repos in sync.  This commit removes the redundancy.
2023-06-07 17:31:58 +00:00
pheianox 6e0069bc7e
Add PITS Global Data Recovery Services to the adopters list (#2160)
* Add PITS Global Data Recovery Services to the adopters list

* Apply alphabetical order in the adopters list
2023-05-26 15:44:21 +00:00
dependabot[bot] 0102f1fc1f
Bump socket.io-parser from 4.2.2 to 4.2.3 in /pkg/new-ui/v1beta1/frontend (#2158)
Bumps [socket.io-parser](https://github.com/socketio/socket.io-parser) from 4.2.2 to 4.2.3.
- [Release notes](https://github.com/socketio/socket.io-parser/releases)
- [Changelog](https://github.com/socketio/socket.io-parser/blob/main/CHANGELOG.md)
- [Commits](https://github.com/socketio/socket.io-parser/compare/4.2.2...4.2.3)

---
updated-dependencies:
- dependency-name: socket.io-parser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-24 08:31:19 +00:00
dependabot[bot] b9dc63efb5
Bump github.com/docker/distribution from 2.8.1+incompatible to 2.8.2+incompatible (#2154)
Bumps [github.com/docker/distribution](https://github.com/docker/distribution) from 2.8.1+incompatible to 2.8.2+incompatible.
- [Release notes](https://github.com/docker/distribution/releases)
- [Commits](https://github.com/docker/distribution/compare/v2.8.1...v2.8.2)

---
updated-dependencies:
- dependency-name: github.com/docker/distribution
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-15 20:16:31 +00:00
dependabot[bot] 279f6794dc
Bump github.com/docker/docker from 20.10.16+incompatible to 20.10.24+incompatible (#2142)
Bumps [github.com/docker/docker](https://github.com/docker/docker) from 20.10.16+incompatible to 20.10.24+incompatible.
- [Release notes](https://github.com/docker/docker/releases)
- [Commits](https://github.com/docker/docker/compare/v20.10.16...v20.10.24)

---
updated-dependencies:
- dependency-name: github.com/docker/docker
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-10 03:35:43 +00:00
dependabot[bot] 6351e80614
Bump engine.io and socket.io in /pkg/new-ui/v1beta1/frontend (#2152)
Bumps [engine.io](https://github.com/socketio/engine.io) and [socket.io](https://github.com/socketio/socket.io). These dependencies needed to be updated together.

Updates `engine.io` from 6.2.1 to 6.4.2
- [Release notes](https://github.com/socketio/engine.io/releases)
- [Changelog](https://github.com/socketio/engine.io/blob/main/CHANGELOG.md)
- [Commits](https://github.com/socketio/engine.io/compare/6.2.1...6.4.2)

Updates `socket.io` from 4.5.1 to 4.6.1
- [Release notes](https://github.com/socketio/socket.io/releases)
- [Changelog](https://github.com/socketio/socket.io/blob/main/CHANGELOG.md)
- [Commits](https://github.com/socketio/socket.io/compare/4.5.1...4.6.1)

---
updated-dependencies:
- dependency-name: engine.io
  dependency-type: indirect
- dependency-name: socket.io
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-05 11:33:24 +00:00
Andrey Velichkevich fcea7a36cb
SDK: Import all Kubernetes Models (#2148)
2023-04-20 16:56:40 +00:00
nagar-ajay 195ce776a9
Fix conformance docker image (#2147)
2023-04-16 18:17:19 +00:00
nagar-ajay be965ae9c2
Containerize tests for katib-conformance (#2146)
2023-04-14 12:55:16 +00:00
nagar-ajay 7a4c118410
Namespace and trial pod annotations as CLI argument (#2138)
* disable istio sidecar injection for example manifests

* add namespace as command line arg to python test script

* revert disable istio sidecar injection

* add option to pass trial pod annotations

* split command over multiple lines

* remove redundant config loading

* add resource limit to containers of random experiment's trial spec pod

* update code to support already present annotations

* raise NotImplementedError if trialSpec is different from Job

* add metrics-collector-injection to namespace under test if missing
2023-04-10 17:41:54 +00:00
Yuki Iwai 1d3ab5726f
Relax dependencies restriction for the gRPC libraries (#2140)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-04-03 16:48:01 +00:00
Andrey Velichkevich d2d9cab1ca
Add SDK Breaking Change to Changelog (#2133)
2023-03-24 13:58:22 +00:00
Andrey Velichkevich c8fe90ea0f
Add Changelog for Katib v0.15.0 (#2129)
2023-03-24 11:38:22 +00:00
Yuki Iwai af0f775079
Increase the free spaces in CI (#2131)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-03-23 14:33:22 +00:00
Andrey Velichkevich acedc82aad
Bump Katib Python SDK to 0.15.0 version (#2130)
2023-03-22 18:09:44 +00:00
Elena Zioga 2e27185f82
kwa(front): Support all namespaces (#2119)
Add support for all-namespaces in KWA.

Signed-off-by: Elena Zioga <elena@arrikto.com>
2023-02-24 12:50:25 +00:00
Andrey Velichkevich 622af87b42
Add Changelog for Katib v0.15.0-rc.1 (#2123)
2023-02-23 20:16:24 +00:00
Andrey Velichkevich cff0002e6a
Add Changelog for Katib v0.15.0-rc.0 (#2106)
* Add Changelog for Katib v0.15.0-rc.0

* Move Optuna Grid Algorithm to the Core

* Add Breaking and Major Changes
2023-02-23 15:53:24 +00:00
Orfeas Kourkakis b6afce7d89
kwa(front): Update the use of SnackBarService (#2113)
* build: Update COMMIT file

Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>

* kwa(front): Update the use of SnackBarService

Update the use of SnackBarService in order to pass required data via a
`config` object and provide MAT_SNACK_BAR_DEFAULT_OPTIONS.

Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>

---------

Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
2023-02-22 13:33:42 +00:00
Yuki Iwai 22babe4eb1
UI: Remove an unused import, EventV1beta1Api (#2116)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-17 22:09:36 +00:00
Andrey Velichkevich 1f3dce9032
Bump Katib Python SDK to 0.15.0rc1 version (#2121)
2023-02-15 19:50:36 +00:00
Elena Zioga 1429d61b00
[kwa-trials-logs] Create the LOGS tab of Trial's details page in KWA (#2101)
* backend: Update error message when no logs could be found

* Update the message the backend sends to not just expose that logs are
  not there because 'retain' might not be set, but also because the
  cluster was scaled down.

Signed-off-by: Elena Zioga <elena@arrikto.com>

* frontend: Add LOGS tab in Trial details page

In this commit:

* Create a distinct LOGS tab, which displays the trial's logs in the
  Trial details page.
* Don't show the backend's error popup for logs, but show the message
  error in the admonition.

Signed-off-by: Elena Zioga <elena@arrikto.com>

---------

Signed-off-by: Elena Zioga <elena@arrikto.com>
2023-02-14 19:39:25 +00:00
dependabot[bot] 6064c14806
Bump http-cache-semantics from 4.1.0 to 4.1.1 in /pkg/new-ui/v1beta1/frontend (#2107)
Bumps [http-cache-semantics](https://github.com/kornelski/http-cache-semantics) from 4.1.0 to 4.1.1.
- [Release notes](https://github.com/kornelski/http-cache-semantics/releases)
- [Commits](https://github.com/kornelski/http-cache-semantics/compare/v4.1.0...v4.1.1)

---
updated-dependencies:
- dependency-name: http-cache-semantics
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-02-14 15:34:26 +00:00
Yuki Iwai 099756684f
Reformat katib-operators (#2114)
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
2023-02-14 15:14:26 +00:00
Andrey Velichkevich 22b740802a
Bump Katib Python SDK to 0.15.0rc0 version (#2105)
2023-01-28 00:10:02 +00:00
774 changed files with 50113 additions and 62121 deletions

@ -4,5 +4,3 @@ docs
manifests
pkg/ui/*/frontend/node_modules
pkg/ui/*/frontend/build
pkg/new-ui/*/frontend/node_modules
pkg/new-ui/*/frontend/build

.flake8 Normal file

@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
# E203 is ignored to avoid conflicts with Black's formatting, as it's not PEP 8 compliant
extend-ignore = W503, E203

@ -1,26 +0,0 @@
---
name: Bug report
about: Tell us about a problem you are experiencing
---
/kind bug
**What steps did you take and what happened:**
[A clear and concise description of what the bug is.]
**What did you expect to happen:**
**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]
**Environment:**
- Katib version (check the Katib controller image version):
- Kubernetes version: (`kubectl version`):
- OS (`uname -a`):
---
<!-- Don't delete this message to encourage users to support your issue! -->
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

.github/ISSUE_TEMPLATE/bug_report.yaml vendored Normal file

@ -0,0 +1,50 @@
name: Bug Report
description: Tell us about a problem you are experiencing with Katib
labels: ["kind/bug", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Katib bug report!
- type: textarea
id: problem
attributes:
label: What happened?
description: |
Please provide as much info as possible. Not doing so may result in your bug not being
addressed in a timely manner.
validations:
required: true
- type: textarea
id: expected
attributes:
label: What did you expect to happen?
validations:
required: true
- type: textarea
id: environment
attributes:
label: Environment
value: |
Kubernetes version:
```bash
$ kubectl version
```
Katib controller version:
```bash
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
```
Katib Python SDK version:
```bash
$ pip show kubeflow-katib
```
validations:
required: true
- type: input
id: votes
attributes:
label: Impacted by this bug?
value: Give it a 👍 We prioritize the issues with most 👍

@ -1,9 +1,12 @@
blank_issues_enabled: false
blank_issues_enabled: true
contact_links:
- name: Katib Documentation
url: https://www.kubeflow.org/docs/components/katib/
about: Much help can be found in the docs
- name: AutoML Slack Channel
url: https://kubeflow.slack.com/archives/C018PMV53NW
about: Ask the Katib community on Slack
- name: Kubeflow Katib Slack Channel
url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels
about: Ask the Katib community on CNCF Slack
- name: Kubeflow Katib Community Meeting
url: https://bit.ly/2PWVCkV
about: Join the Kubeflow AutoML working group meeting

@ -1,18 +0,0 @@
---
name: Feature enhancement request
about: Suggest an idea for this project
---
/kind feature
**Describe the solution you'd like**
[A clear and concise description of what you want to happen.]
**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]
---
<!-- Don't delete this message to encourage users to support your issue! -->
Love this feature? Give it a 👍 We prioritize the features with the most 👍

@ -0,0 +1,28 @@
name: Feature Request
description: Suggest an idea for Katib
labels: ["kind/feature", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Katib feature request!
- type: textarea
id: feature
attributes:
label: What you would like to be added?
description: |
A clear and concise description of what you want to add to Katib.
Please consider to write Katib enhancement proposal if it is a large feature request.
validations:
required: true
- type: textarea
id: rationale
attributes:
label: Why is this needed?
validations:
required: true
- type: input
id: votes
attributes:
label: Love this feature?
value: Give it a 👍 We prioritize the features with most 👍

@ -1,6 +1,6 @@
<!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing
2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/docs/developer-guide.md
2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-->

.github/stale.yml vendored

@ -1,20 +0,0 @@
# Configuration for stale probot https://probot.github.io/apps/stale/
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 90
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 20
# Issues with these labels will never be considered stale
exemptLabels:
- lifecycle/frozen
# Label to use when marking an issue as stale
staleLabel: lifecycle/stale
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.

@ -1,5 +1,5 @@
# Reusable workflows for publishing Katib images.
name: Build And Publish Images
name: Build and Publish Images
on:
workflow_call:
@ -21,31 +21,50 @@ on:
jobs:
build-and-publish:
name: Publish Image
runs-on: ubuntu-latest
name: Build and Publish Images
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Docker Login
# Trigger workflow only for kubeflow/katib repository with specific branch (master, release-.*) or tag (v.*).
if: >-
github.repository == 'kubeflow/katib' &&
(github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v'))
uses: docker/login-action@v2
- name: Set Publish Condition
id: publish-condition
shell: bash
run: |
if [[ "${{ github.repository }}" == 'kubeflow/katib' && \
( "${{ github.ref }}" == 'refs/heads/master' || \
"${{ github.ref }}" =~ ^refs/heads/release- || \
"${{ github.ref }}" =~ ^refs/tags/v ) ]]; then
echo "should_publish=true" >> $GITHUB_OUTPUT
else
echo "should_publish=false" >> $GITHUB_OUTPUT
fi
- name: GHCR Login
if: steps.publish-condition.outputs.should_publish == 'true'
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: DockerHub Login
if: steps.publish-condition.outputs.should_publish == 'true'
uses: docker/login-action@v3
with:
registry: docker.io
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Publish Component ${{ inputs.component-name }}
# Trigger workflow only for kubeflow/katib repository with specific branch (master, release-.*) or tag (v.*).
if: >-
github.repository == 'kubeflow/katib' &&
(github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release-') || startsWith(github.ref, 'refs/tags/v'))
if: steps.publish-condition.outputs.should_publish == 'true'
id: publish
uses: ./.github/workflows/template-publish-image
with:
image: docker.io/kubeflowkatib/${{ inputs.component-name }}
image: |
ghcr.io/kubeflow/katib/${{ inputs.component-name }}
docker.io/kubeflowkatib/${{ inputs.component-name }}
dockerfile: ${{ inputs.dockerfile }}
platforms: ${{ inputs.platforms }}
push: true
@ -54,7 +73,9 @@ jobs:
if: steps.publish.outcome == 'skipped'
uses: ./.github/workflows/template-publish-image
with:
image: docker.io/kubeflowkatib/${{ inputs.component-name }}
image: |
ghcr.io/kubeflow/katib/${{ inputs.component-name }}
docker.io/kubeflowkatib/${{ inputs.component-name }}
dockerfile: ${{ inputs.dockerfile }}
platforms: ${{ inputs.platforms }}
push: false
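
Note on the "Set Publish Condition" step above: its branch and tag check can be tried locally as a plain bash function. This is only a sketch; the function name `should_publish` (borrowed from the step's output name) and the sample repository and ref values are illustrative assumptions, not part of the workflow.

```bash
#!/usr/bin/env bash
# Hypothetical local helper mirroring the "Set Publish Condition" step.
# Prints "true" only for the kubeflow/katib repository on master,
# a release-* branch, or a v* tag; prints "false" otherwise.
should_publish() {
  local repository="$1" ref="$2"
  if [[ "$repository" == 'kubeflow/katib' && \
        ( "$ref" == 'refs/heads/master' || \
          "$ref" =~ ^refs/heads/release- || \
          "$ref" =~ ^refs/tags/v ) ]]; then
    echo "true"
  else
    echo "false"
  fi
}

# Illustrative inputs (assumed values, not taken from a real run).
should_publish "kubeflow/katib" "refs/heads/master"
should_publish "kubeflow/katib" "refs/tags/v0.0.0-example"
should_publish "example-fork/katib" "refs/heads/master"
should_publish "kubeflow/katib" "refs/pull/1/merge"
```

Computing the condition once and reusing it through `steps.publish-condition.outputs.should_publish` keeps the login and publish steps from repeating the same multi-line `if:` expression.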

@ -3,28 +3,25 @@ name: E2E Test with darts-cnn-cifar10
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: "3.7"
python-version: "3.11"
- name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test
@ -36,6 +33,6 @@ jobs:
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
# Comma Delimited
experiments: ["darts-cpu"]

@ -3,22 +3,19 @@ name: E2E Test with enas-cnn-cifar10
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
@ -36,6 +33,6 @@ jobs:
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
# Comma Delimited
experiments: ["enas-cpu"]

@ -1,45 +0,0 @@
name: E2E Test with mxnet-mnist
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: "3.9"
- name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test
with:
experiments: ${{ matrix.experiments }}
# Comma Delimited
trial-images: mxnet-mnist
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
# Comma Delimited
experiments:
# suggestion-hyperopt
- "long-running-resume,from-volume-resume,median-stop"
# others
- "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"

@ -3,22 +3,19 @@ name: E2E Test with pytorch-mnist
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
@ -37,8 +34,13 @@ jobs:
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
# Comma Delimited
experiments:
# suggestion-hyperopt
- "long-running-resume,from-volume-resume,median-stop"
# others
- "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
- "hyperopt-distribution,optuna-distribution"
- "file-metrics-collector,pytorchjob-mnist"
- "median-stop-with-json-format,file-metrics-collector-with-json-format"
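
Note on the matrix above: GitHub Actions expands a `matrix` into one job per combination of its values. Assuming all five experiment entries shown in this hunk end up in the resulting file (the hunk's line counts suggest so), the expansion can be sketched in bash as a nested loop. The variable names below are illustrative only.

```bash
#!/usr/bin/env bash
# Values copied from the matrix in the hunk above.
kubernetes_versions=("v1.29.2" "v1.30.7" "v1.31.3")
experiment_groups=(
  "long-running-resume,from-volume-resume,median-stop"
  "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
  "hyperopt-distribution,optuna-distribution"
  "file-metrics-collector,pytorchjob-mnist"
  "median-stop-with-json-format,file-metrics-collector-with-json-format"
)

# One job per (kubernetes-version, experiments) pair.
job_count=0
for version in "${kubernetes_versions[@]}"; do
  for group in "${experiment_groups[@]}"; do
    job_count=$((job_count + 1))
    echo "job ${job_count}: kubernetes=${version} experiments=${group}"
  done
done
echo "total jobs: ${job_count}"
```

Each comma-delimited entry is passed to the e2e template as a single `experiments` value, so grouping experiments into one string keeps the job count lower than listing every experiment separately.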

@ -3,22 +3,19 @@ name: E2E Test with simple-pbt
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
@ -36,6 +33,6 @@ jobs:
fail-fast: false
matrix:
# Detail: https://hub.docker.com/r/kindest/node
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
# Comma Delimited
experiments: ["simple-pbt"]

@ -3,22 +3,19 @@ name: E2E Test with tf-mnist-with-summaries
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
@ -36,6 +33,6 @@ jobs:
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
# Comma Delimited
experiments: ["tfjob-mnist-with-summaries"]

@ -0,0 +1,40 @@
name: E2E Test with tune API
on:
pull_request:
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
- name: Install Katib SDK with extra requires
shell: bash
run: |
pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'
- name: Run e2e test with tune API
uses: ./.github/workflows/template-e2e-test
with:
tune-api: true
training-operator: true
strategy:
fail-fast: false
matrix:
# Detail: https://hub.docker.com/r/kindest/node
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

@ -7,16 +7,13 @@ concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
@ -28,11 +25,11 @@ jobs:
with:
experiments: random
# Comma Delimited
trial-images: mxnet-mnist
trial-images: pytorch-mnist-cpu
katib-ui: true
database-type: postgres
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.23.13", "v1.24.7", "v1.25.3"]
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

@ -0,0 +1,49 @@
name: Free-Up Disk Space
description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker
runs:
using: composite
steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
echo "Disk usage before cleanup:"
df -hT
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf /usr/local/share/boost
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift
echo "Disk usage after cleanup:"
df -hT
- name: Prune docker images
shell: bash
run: |
docker image prune -a -f
docker system df
df -hT
- name: Move docker data directory
shell: bash
run: |
echo "Stopping docker service ..."
sudo systemctl stop docker
DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
DOCKER_ROOT_DIR=/mnt/docker
echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
echo "Starting docker service ..."
sudo systemctl daemon-reload
sudo systemctl start docker
echo "Docker service status:"
sudo systemctl --no-pager -l -o short status docker
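
Note on the "Move docker data directory" step above: the move-then-symlink pattern can be tried without `sudo` or Docker in a scratch directory. This is a sketch under that assumption; the `SANDBOX` paths and the `example.txt` file are hypothetical stand-ins for `/var/lib/docker`, `/mnt/docker`, and the Docker data.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory standing in for the runner's filesystem (hypothetical paths).
SANDBOX="$(mktemp -d)"
DEFAULT_DIR="${SANDBOX}/var/lib/docker"
TARGET_DIR="${SANDBOX}/mnt/docker"

# Create the "default" data directory with one file in it, plus the mount point.
mkdir -p "${DEFAULT_DIR}" "${SANDBOX}/mnt"
echo "layer-data" > "${DEFAULT_DIR}/example.txt"

# Move the data to the larger location, then leave a symlink at the old path
# so anything that still uses the default path keeps working.
mv "${DEFAULT_DIR}" "${TARGET_DIR}"
ln -s "${TARGET_DIR}" "${DEFAULT_DIR}"

# Reading through the old path now resolves to the moved data.
cat "${DEFAULT_DIR}/example.txt"
ls -l "${DEFAULT_DIR}"
```

The workflow stops Docker before the move and restarts it afterwards so the daemon never sees a half-moved data directory. The scratch directory in this sketch is left in place and can be deleted by hand.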

@ -4,7 +4,7 @@ on:
push:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
jobs:
algorithm:

@ -0,0 +1,24 @@
name: Publish Katib Conformance Test Images
on:
- push
- pull_request
jobs:
core:
name: Publish Image
uses: ./.github/workflows/build-and-publish-images.yaml
with:
component-name: ${{ matrix.component-name }}
platforms: linux/amd64,linux/arm64
dockerfile: ${{ matrix.dockerfile }}
secrets:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
strategy:
fail-fast: false
matrix:
include:
- component-name: katib-conformance
dockerfile: Dockerfile.conformance

@ -25,9 +25,7 @@ jobs:
- component-name: katib-db-manager
dockerfile: cmd/db-manager/v1beta1/Dockerfile
- component-name: katib-ui
dockerfile: cmd/new-ui/v1beta1/Dockerfile
- component-name: cert-generator
dockerfile: cmd/cert-generator/v1beta1/Dockerfile
dockerfile: cmd/ui/v1beta1/Dockerfile
- component-name: file-metrics-collector
dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
- component-name: tfevent-metrics-collector


@ -4,7 +4,7 @@ on:
push:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
jobs:
trial:
@ -22,9 +22,6 @@ jobs:
fail-fast: false
matrix:
include:
- trial-name: mxnet-mnist
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/mxnet-mnist/Dockerfile
- trial-name: pytorch-mnist-cpu
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu

.github/workflows/stale.yaml

@ -0,0 +1,42 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests
on:
schedule:
- cron: "0 */5 * * *"
jobs:
stale:
runs-on: ubuntu-22.04
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v5
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
days-before-stale: 90
days-before-close: 20
stale-issue-message: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-issue-message: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-issue-label: lifecycle/stale
exempt-issue-labels: lifecycle/frozen
stale-pr-message: >
This pull request has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-pr-message: >
This pull request has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-pr-label: lifecycle/stale
exempt-pr-labels: lifecycle/frozen
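
The timing in this workflow can be checked with a little arithmetic. The sketch below uses only the values shown above (`days-before-stale`, `days-before-close`, and the cron hour field `*/5`); the variable names are illustrative and not part of the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values copied from the workflow above.
days_before_stale=90
days_before_close=20

# An item is labelled stale after 90 idle days and closed 20 idle days later.
days_until_close=$((days_before_stale + days_before_close))
echo "closed after ${days_until_close} days without activity"

# Hours matched by the cron hour field "*/5" (every fifth hour starting at 0).
cron_hours=""
for ((hour = 0; hour < 24; hour += 5)); do
  cron_hours="${cron_hours:+${cron_hours} }${hour}"
done
echo "job starts at minute 0 of hours (UTC): ${cron_hours}"
```

Items carrying the `lifecycle/frozen` label are exempt and never enter this countdown.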


@ -4,15 +4,17 @@ description: Run e2e test using the minikube cluster
inputs:
experiments:
required: true
required: false
description: comma delimited experiment name
default: ""
training-operator:
required: false
description: whether to deploy training-operator or not
default: false
trial-images:
required: true
required: false
description: comma delimited trial image name
default: ""
katib-ui:
required: true
description: whether to deploy katib-ui or not
@ -21,13 +23,17 @@ inputs:
required: false
description: mysql or postgres
default: mysql
tune-api:
required: true
description: whether to execute tune-api test or not
default: false
runs:
using: composite
steps:
- name: Setup Minikube Cluster
shell: bash
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.tune-api }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
- name: Setup Katib
shell: bash
@ -35,4 +41,9 @@ runs:
- name: Run E2E Experiment
shell: bash
run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
run: |
if "${{ inputs.tune-api }}"; then
./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
else
./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
fi
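
The new `run` block branches on the `tune-api` input. Because the input is interpolated as a string, `if "${{ inputs.tune-api }}"; then` becomes `if "true"; then` or `if "false"; then`, and bash runs that word as a command. The sketch below shows the same selection with an explicit string comparison, which also behaves predictably for an empty value. The helper name `select_e2e_script` is hypothetical; the script paths are the ones from the diff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: choose the e2e script from the `tune-api` input value.
select_e2e_script() {
  local tune_api="$1"
  if [[ "${tune_api}" == "true" ]]; then
    echo "./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh"
  else
    # Anything other than the literal string "true" falls back to experiments.
    echo "./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh"
  fi
}

select_e2e_script "true"
select_e2e_script "false"
select_e2e_script ""
```

This is an alternative formulation for illustration, not what the workflow in the diff does.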


@ -19,15 +19,31 @@ inputs:
runs:
using: composite
steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift
echo "Disk usage after cleanup:"
df -h
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
- name: Set Up Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
- name: Add Docker Tags
id: meta
uses: docker/metadata-action@v4
uses: docker/metadata-action@v5
with:
images: ${{ inputs.image }}
tags: |
@ -35,12 +51,12 @@ runs:
type=sha,prefix=v1beta1-
- name: Build and Push
uses: docker/build-push-action@v3
uses: docker/build-push-action@v5
with:
context: .
file: ${{ inputs.dockerfile }}
push: ${{ inputs.push }}
tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha
cache-to: type=gha,mode=max
cache-to: type=gha,mode=max,ignore-error=true
platforms: ${{ inputs.platforms }}


@ -15,20 +15,31 @@ inputs:
runs:
using: composite
steps:
- name: Setup Minikube Cluster
uses: manusa/actions-setup-minikube@v2.7.2
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space
- name: Setup kubectl
uses: azure/setup-kubectl@v4
with:
minikube version: v1.28.0
kubernetes version: ${{ inputs.kubernetes-version }}
start args: --wait-timeout=60s
version: ${{ inputs.kubernetes-version }}
- name: Setup Minikube Cluster
uses: medyagh/setup-minikube@v0.0.18
with:
network-plugin: cni
cni: flannel
driver: none
github token: ${{ env.GITHUB_TOKEN }}
kubernetes-version: ${{ inputs.kubernetes-version }}
minikube-version: 1.34.0
start-args: --wait-timeout=120s
- name: Setup Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
- name: Setup Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}


@ -1,114 +0,0 @@
name: Charmed Katib
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
charmed:
name: Lint and Test
runs-on: ubuntu-20.04
steps:
- name: Check out code
uses: actions/checkout@v3
- name: Install dependencies
run: |
set -eux
sudo apt update
sudo apt install -y python3-pip
sudo apt install python3-setuptools
sudo pip3 install black flake8
sudo snap install juju --classic
sudo snap install juju-bundle --classic
sudo snap install juju-wait --classic
sudo pip3 install charmcraft==1.3.1
- name: Check black
run: black --check operators/*/src
- name: Check flake8
run: cd operators && flake8 ./katib*/src
- uses: balchua/microk8s-actions@v0.2.2
with:
channel: "1.21/stable"
addons: '["dns", "storage", "rbac"]'
- name: Set Up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Build Docker images
run: |
set -eux
images=("katib-controller" "katib-ui" "katib-db-manager")
folders=("katib-controller" "new-ui" "db-manager")
for idx in {0..2}; do
docker buildx build . \
-t docker.io/kubeflowkatib/${images[$idx]}:latest \
-f cmd/${folders[$idx]}/v1beta1/Dockerfile \
--load
docker save docker.io/kubeflowkatib/${images[$idx]} > ${images[$idx]}.tar
microk8s ctr image import ${images[$idx]}.tar
done
- name: Deploy Katib
env:
CHARMCRAFT_DEVELOPER: "1"
run: |
set -eux
cd operators/
git clone git://git.launchpad.net/canonical-osm
cp -r canonical-osm/charms/interfaces/juju-relation-mysql mysql
sg microk8s -c 'juju bootstrap microk8s uk8s'
juju add-model kubeflow
juju bundle deploy --build --destructive-mode --serial
juju wait -wvt 600
- name: Test Katib
run: kubectl apply -f examples/v1beta1/hp-tuning/random.yaml
- name: Get pod statuses
run: kubectl get all -A
if: failure()
- name: Get juju status
run: juju status
if: failure()
- name: Get katib-controller workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-controller
if: failure()
- name: Get katib-controller operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-controller
if: failure()
- name: Get katib-ui workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-ui
if: failure()
- name: Get katib-ui operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-ui
if: failure()
- name: Get katib-db-manager workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-db-manager
if: failure()
- name: Get katib-db-manager operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-db-manager
if: failure()
- name: Upload charmcraft logs
uses: actions/upload-artifact@v2
with:
name: charmcraft-logs
path: /tmp/charmcraft-log-*
if: failure()


@ -3,7 +3,7 @@ name: Go Test
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@ -12,7 +12,7 @@ concurrency:
jobs:
generatetests:
name: Generate And Format Test
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
GOPATH: ${{ github.workspace }}/go
defaults:
@ -20,21 +20,22 @@ jobs:
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
- name: Setup Go
uses: actions/setup-go@v3
uses: actions/setup-go@v5
with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
- name: Check Go Modules, Generated Go/Python codes, and Format
run: make check
unittests:
name: Unit Test
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
env:
GOPATH: ${{ github.workspace }}/go
defaults:
@ -42,14 +43,15 @@ jobs:
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
with:
path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
- name: Setup Go
uses: actions/setup-go@v3
uses: actions/setup-go@v5
with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
- name: Run Go test
run: go mod download && make test ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}
@ -59,9 +61,19 @@ jobs:
with:
path-to-profile: coverage.out
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
parallel: true
strategy:
fail-fast: false
matrix:
# Detail: `setup-envtest list --arch amd64`
kubernetes-version: ["1.23.5", "1.24.2", "1.25.0"]
# Detail: `setup-envtest list`
kubernetes-version: ["1.29.3", "1.30.0", "1.31.0"]
# notifies that all test jobs are finished.
finish:
needs: unittests
runs-on: ubuntu-22.04
steps:
- uses: shogo82148/actions-goveralls@v1
with:
parallel-finished: true


@ -3,7 +3,7 @@ name: Lint Files
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@ -12,19 +12,19 @@ concurrency:
jobs:
lint:
name: Lint
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Check YAML files
run: make yamllint
- name: Check shell scripts
run: make shellcheck
- name: Run pre-commit
uses: pre-commit/action@v3.0.1


@ -3,7 +3,7 @@ name: Frontend Test
on:
pull_request:
paths:
- pkg/new-ui/v1beta1/frontend/**
- pkg/ui/v1beta1/frontend/**
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@ -12,43 +12,43 @@ concurrency:
jobs:
test:
name: Code format and lint
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v3
uses: actions/setup-node@v4
with:
node-version: 12.18.1
node-version: 16.20.2
- name: Format katib code
run: |
npm install prettier --prefix ./pkg/new-ui/v1beta1/frontend
npm install prettier --prefix ./pkg/ui/v1beta1/frontend
make prettier-check
- name: Lint katib code
run: |
cd pkg/new-ui/v1beta1/frontend
cd pkg/ui/v1beta1/frontend
npm run lint-check
frontend-unit-tests:
name: Frontend Unit Tests
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v3
uses: actions/setup-node@v4
with:
node-version: 12.18.1
node-version: 16.20.2
- name: Fetch Kubeflow and install common code dependencies
run: |
COMMIT=$(cat pkg/new-ui/v1beta1/frontend/COMMIT)
COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
cd kubeflow
git checkout $COMMIT
@ -59,13 +59,13 @@ jobs:
- name: Install KWA dependencies
run: |
cd pkg/new-ui/v1beta1/frontend
cd pkg/ui/v1beta1/frontend
npm i
npm link kubeflow
- name: Run unit tests
run: |
cd pkg/new-ui/v1beta1/frontend
cd pkg/ui/v1beta1/frontend
npm run test:prod
frontend-ui-tests:
@ -73,15 +73,15 @@ jobs:
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Setup node version to 12
uses: actions/setup-node@v3
uses: actions/checkout@v4
- name: Setup node version to 16
uses: actions/setup-node@v4
with:
node-version: 12
node-version: 16
- name: Fetch Kubeflow and install common code dependencies
run: |
COMMIT=$(cat pkg/new-ui/v1beta1/frontend/COMMIT)
COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
cd kubeflow
git checkout $COMMIT
@ -91,11 +91,11 @@ jobs:
npm link ./dist/kubeflow
- name: Install KWA dependencies
run: |
cd pkg/new-ui/v1beta1/frontend
cd pkg/ui/v1beta1/frontend
npm i
npm link kubeflow
- name: Serve UI & run Cypress tests in Chrome and Firefox
run: |
cd pkg/new-ui/v1beta1/frontend
cd pkg/ui/v1beta1/frontend
npm run start & npx wait-on http://localhost:4200
npm run ui-test-ci-all


@ -3,7 +3,7 @@ name: Python Test
on:
pull_request:
paths-ignore:
- "pkg/new-ui/v1beta1/frontend/**"
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
@ -12,16 +12,36 @@ concurrency:
jobs:
test:
name: Test
runs-on: ubuntu-latest
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Run Python test
run: make pytest
# The skopt service doesn't work appropriately with Python 3.11.
# So, we need to run the test with Python 3.9.
# TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
# REF: https://github.com/kubeflow/katib/issues/2280
test-skopt:
name: Test Skopt
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Run Python test
run: make pytest
run: make pytest-skopt

.gitignore

@ -78,3 +78,6 @@ $RECYCLE.BIN/
## Vendor dir
vendor
# Jupyter Notebooks.
**/.ipynb_checkpoints

.pre-commit-config.yaml

@ -0,0 +1,38 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
args: [--allow-multiple-documents]
- id: check-json
- repo: https://github.com/pycqa/isort
rev: 5.11.5
hooks:
- id: isort
name: isort
entry: isort --profile black
- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
files: (sdk|examples|pkg)/.*
- repo: https://github.com/pycqa/flake8
rev: 7.1.1
hooks:
- id: flake8
files: (sdk|examples|pkg)/.*
exclude: |
(?x)^(
.*zz_generated.deepcopy.*|
.*pb.go|
pkg/apis/manager/.*pb2(?:_grpc)?.py(?:i)?|
pkg/apis/v1beta1/openapi_generated.go|
pkg/mock/.*|
pkg/client/controller/.*|
sdk/python/v1beta1/kubeflow/katib/configuration.py|
sdk/python/v1beta1/kubeflow/katib/rest.py|
sdk/python/v1beta1/kubeflow/katib/__init__.py|
sdk/python/v1beta1/kubeflow/katib/exceptions.py|
sdk/python/v1beta1/kubeflow/katib/api_client.py|
sdk/python/v1beta1/kubeflow/katib/models/.*
)$


@ -17,3 +17,4 @@ Please keep the list in alphabetical order.
| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |


@ -1,10 +1,764 @@
# Changelog
## [v0.14.0](https://github.com/kubeflow/katib/tree/v0.14.0) (2022-08-18)
# [v0.18.0](https://github.com/kubeflow/katib/tree/v0.18.0) (2025-03-25)
# New Features
## Breaking Changes
## Core Features
- Move Katib manifest image references to ghcr ([#2535](https://github.com/kubeflow/katib/pull/2535) by [@saileshd1402](https://github.com/saileshd1402))
- Migrate docker images to ghcr ([#2531](https://github.com/kubeflow/katib/pull/2531) by [@mahdikhashan](https://github.com/mahdikhashan))
- Upgrade Kubernetes to v1.31.3 ([#2478](https://github.com/kubeflow/katib/pull/2478) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade Kubernetes to v1.30.7 ([#2463](https://github.com/kubeflow/katib/pull/2463) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Drop Python 3.7 and Support Python 3.11 in the SDK ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Hyperparameter Optimization for LLMs
- [DOCS] move llm hyperparameter optimisation design image to the proposal directory and rename it ([#2472](https://github.com/kubeflow/katib/pull/2472) by [@mahdikhashan](https://github.com/mahdikhashan))
- [GSoC] Update `tune` API for LLM hyperparameters optimization ([#2393](https://github.com/kubeflow/katib/pull/2393) by [@helenxie-bit](https://github.com/helenxie-bit))
- [GSoC] Create LLM Hyperparameters Optimization API Proposal ([#2333](https://github.com/kubeflow/katib/pull/2333) by [@helenxie-bit](https://github.com/helenxie-bit))
### Support for Advanced Distributions for HPO
- [GSOC] `optuna` suggestion service logic update ([#2446](https://github.com/kubeflow/katib/pull/2446) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] `hyperopt` suggestion service logic update ([#2412](https://github.com/kubeflow/katib/pull/2412) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Add validator for feasible space distribution ([#2404](https://github.com/kubeflow/katib/pull/2404) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] added Unknown distribution and convertDistribution in suggestion client ([#2403](https://github.com/kubeflow/katib/pull/2403) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Support for various Parameter distributions in Katib ([#2334](https://github.com/kubeflow/katib/pull/2334) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Added `DistributionType` to Experiment API ([#2377](https://github.com/kubeflow/katib/pull/2377) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
### Push-based Metrics Collector
- [GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection ([#2437](https://github.com/kubeflow/katib/pull/2437) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Compatibility Changes in Trial Controller ([#2394](https://github.com/kubeflow/katib/pull/2394) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] New Interface `report_metrics` in Python SDK ([#2371](https://github.com/kubeflow/katib/pull/2371) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] KEP for Project 6: Push-based Metrics Collection for Katib ([#2328](https://github.com/kubeflow/katib/pull/2328) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Add New Parameter in `tune` ([#2369](https://github.com/kubeflow/katib/pull/2369) by [@Electronic-Waste](https://github.com/Electronic-Waste))
### SDK Updates
- [SDK] Support PyTorchJob as a Trial Worker ([#2512](https://github.com/kubeflow/katib/pull/2512) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] test: Add e2e test for tune function. ([#2399](https://github.com/kubeflow/katib/pull/2399) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] improve PVC creation name error ([#2496](https://github.com/kubeflow/katib/pull/2496) by [@mahdikhashan](https://github.com/mahdikhashan))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Explain Python version support cycle ([#2354](https://github.com/kubeflow/katib/pull/2354) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- fix(webhook): fix validation message in experiment webhook ([#2507](https://github.com/kubeflow/katib/pull/2507) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Install typing-extensions v4.10.0 to fix Python test error ([#2504](https://github.com/kubeflow/katib/pull/2504) by [@helenxie-bit](https://github.com/helenxie-bit))
- [SDK] Update `tune` API ([#2497](https://github.com/kubeflow/katib/pull/2497) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix(api): resolve all api voilation exceptions in katib api ([#2482](https://github.com/kubeflow/katib/pull/2482) by [@truc0](https://github.com/truc0))
- fix(trial): use propagated gomega to improve debuggability. ([#2432](https://github.com/kubeflow/katib/pull/2432) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix(ui): update None Collector with Push Collector. ([#2418](https://github.com/kubeflow/katib/pull/2418) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: Resolve errors in e2e tests for cypress in Katib UI ([#2384](https://github.com/kubeflow/katib/pull/2384) by [@tariq-hasan](https://github.com/tariq-hasan))
- doc(example): fix the broken link. ([#2433](https://github.com/kubeflow/katib/pull/2433) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: remove remaining MXNet dependency. ([#2456](https://github.com/kubeflow/katib/pull/2456) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Remove Dropout layer from ENAS Trial container to fix E2E tests ([#2455](https://github.com/kubeflow/katib/pull/2455) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] fix grpc related bugs in Python SDK ([#2398](https://github.com/kubeflow/katib/pull/2398) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] Fix types error ([#2424](https://github.com/kubeflow/katib/pull/2424) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix: remove the dependency of `protocmp` in `google.golang.org/protobuf/testing/protocmp`. ([#2391](https://github.com/kubeflow/katib/pull/2391) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix apple silicon rosetta error when building images from the source code ([#2342](https://github.com/kubeflow/katib/pull/2342) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix katib use crds token pipeline trail template guide ([#2330](https://github.com/kubeflow/katib/pull/2330) by [@Jerry-yz](https://github.com/Jerry-yz))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Support old-style TensorFlow events (tensorboard) ([#2517](https://github.com/kubeflow/katib/pull/2517) by [@garymm](https://github.com/garymm))
- Set experiment names at a max of 40 characters. ([#2468](https://github.com/kubeflow/katib/pull/2468) by [@AydanPirani](https://github.com/AydanPirani))
- [CI] optimize katib ui dockerfile ([#2505](https://github.com/kubeflow/katib/pull/2505) by [@mahdikhashan](https://github.com/mahdikhashan))
- Sort experiments by descending creation date by default in katib-ui ([#2498](https://github.com/kubeflow/katib/pull/2498) by [@Doris-xm](https://github.com/Doris-xm))
- [GSoC] Add unit tests for `tune` API ([#2423](https://github.com/kubeflow/katib/pull/2423) by [@helenxie-bit](https://github.com/helenxie-bit))
- Update MutatingWebhookConfiguration: Switch from objectSelector to AdmissionWebhookMatchConditions ([#2241](https://github.com/kubeflow/katib/pull/2241) by [@lianghao208](https://github.com/lianghao208))
- chore: supporting the listen-address parameter on db-manager ([#2465](https://github.com/kubeflow/katib/pull/2465) by [@caiofralmeida](https://github.com/caiofralmeida))
- Upgrade klog to v2 ([#2470](https://github.com/kubeflow/katib/pull/2470) by [@Doris-xm](https://github.com/Doris-xm))
- Ignore cache exporting errors in the image building workflows ([#2487](https://github.com/kubeflow/katib/pull/2487) by [@Doris-xm](https://github.com/Doris-xm))
- Upgrade grpcio version to v1.64.1 ([#2483](https://github.com/kubeflow/katib/pull/2483) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- docs: remove katib workflow ([#2443](https://github.com/kubeflow/katib/pull/2443) by [@gonmmarques](https://github.com/gonmmarques))
- Migrate KatibCertGenerator to OPA CertController ([#2345](https://github.com/kubeflow/katib/pull/2345) by [@forsaken628](https://github.com/forsaken628))
- Promote @Electronic-Waste and @helenxie-bit as Katib reviewers ([#2439](https://github.com/kubeflow/katib/pull/2439) by [@andreyvelich](https://github.com/andreyvelich))
- Update README and out-of-date docs ([#2438](https://github.com/kubeflow/katib/pull/2438) by [@andreyvelich](https://github.com/andreyvelich))
- Changes isort profile to black, to be fully compatible and adds 'pkg' dir for black and flake8 ([#2413](https://github.com/kubeflow/katib/pull/2413) by [@Ygnas](https://github.com/Ygnas))
- Introduced error constants and replaced reflect with cmp ([#2289](https://github.com/kubeflow/katib/pull/2289) by [@tariq-hasan](https://github.com/tariq-hasan))
- [Test] Refactor `inject_webhook_test.go` according to the Developer Guide ([#2401](https://github.com/kubeflow/katib/pull/2401) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Enhance pre-commit hooks with flake8 and black ([#2407](https://github.com/kubeflow/katib/pull/2407) by [@Ygnas](https://github.com/Ygnas))
- added `Distribution` field to feasibleSpace in `api.proto` ([#2397](https://github.com/kubeflow/katib/pull/2397) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- Begin enabling pre-commit hooks ([#2242](https://github.com/kubeflow/katib/pull/2242) by [@droctothorpe](https://github.com/droctothorpe))
- Update Instructions for Argo Workflows ([#2382](https://github.com/kubeflow/katib/pull/2382) by [@jaffe-fly](https://github.com/jaffe-fly))
- docs: update suggestion.md ([#2387](https://github.com/kubeflow/katib/pull/2387) by [@eltociear](https://github.com/eltociear))
- Add command to re-run GitHub Actions tests ([#2385](https://github.com/kubeflow/katib/pull/2385) by [@andreyvelich](https://github.com/andreyvelich))
- Bump Katib Python SDK to 0.17.0 version ([#2379](https://github.com/kubeflow/katib/pull/2379) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0 ([#2380](https://github.com/kubeflow/katib/pull/2380) by [@andreyvelich](https://github.com/andreyvelich))
- Replaced hpcloud with nxadm for tail package in Go ([#2375](https://github.com/kubeflow/katib/pull/2375) by [@tariq-hasan](https://github.com/tariq-hasan))
- Use ErrorList for experiment validator ([#2329](https://github.com/kubeflow/katib/pull/2329) by [@ckcd](https://github.com/ckcd))
- Add Changelog for Katib v0.17.0-rc.1 ([#2370](https://github.com/kubeflow/katib/pull/2370) by [@andreyvelich](https://github.com/andreyvelich))
- Remove default caBundle value ([#2368](https://github.com/kubeflow/katib/pull/2368) by [@vihangm](https://github.com/vihangm))
- Bump Katib Python SDK to 0.17.0rc1 version ([#2365](https://github.com/kubeflow/katib/pull/2365) by [@andreyvelich](https://github.com/andreyvelich))
- Add unit test for `create_experiment` in the `katib_client` module ([#2325](https://github.com/kubeflow/katib/pull/2325) by [@tariq-hasan](https://github.com/tariq-hasan))
- Remove code generation from release script ([#2363](https://github.com/kubeflow/katib/pull/2363) by [@andreyvelich](https://github.com/andreyvelich))
- Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Update Slack Invitation ([#2349](https://github.com/kubeflow/katib/pull/2349) by [@andreyvelich](https://github.com/andreyvelich))
- Update GitHub template to better triage Issues ([#2335](https://github.com/kubeflow/katib/pull/2335) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0-rc.0 ([#2319](https://github.com/kubeflow/katib/pull/2319) by [@andreyvelich](https://github.com/andreyvelich))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Make test fields private in Go unit tests ([#2316](https://github.com/kubeflow/katib/pull/2316) by [@tariq-hasan](https://github.com/tariq-hasan))
- Bump Katib Python SDK to 0.17.0rc0 Version ([#2318](https://github.com/kubeflow/katib/pull/2318) by [@andreyvelich](https://github.com/andreyvelich))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0...v0.18.0)
# [v0.18.0-rc.0](https://github.com/kubeflow/katib/tree/v0.18.0-rc.0) (2025-02-13)
## Breaking Changes
- Upgrade Kubernetes to v1.31.3 ([#2478](https://github.com/kubeflow/katib/pull/2478) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade Kubernetes to v1.30.7 ([#2463](https://github.com/kubeflow/katib/pull/2463) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Drop Python 3.7 and Support Python 3.11 in the SDK ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Hyperparameter Optimization for LLMs
- [DOCS] move llm hyperparameter optimisation design image to the proposal directory and rename it ([#2472](https://github.com/kubeflow/katib/pull/2472) by [@mahdikhashan](https://github.com/mahdikhashan))
- [GSoC] Update `tune` API for LLM hyperparameters optimization ([#2393](https://github.com/kubeflow/katib/pull/2393) by [@helenxie-bit](https://github.com/helenxie-bit))
- [GSoC] Create LLM Hyperparameters Optimization API Proposal ([#2333](https://github.com/kubeflow/katib/pull/2333) by [@helenxie-bit](https://github.com/helenxie-bit))
### Support for Advanced Distributions for HPO
- [GSoC] `optuna` suggestion service logic update ([#2446](https://github.com/kubeflow/katib/pull/2446) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] `hyperopt` suggestion service logic update ([#2412](https://github.com/kubeflow/katib/pull/2412) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Add validator for feasible space distribution ([#2404](https://github.com/kubeflow/katib/pull/2404) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] added Unknown distribution and convertDistribution in suggestion client ([#2403](https://github.com/kubeflow/katib/pull/2403) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Support for various Parameter distributions in Katib ([#2334](https://github.com/kubeflow/katib/pull/2334) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Added `DistributionType` to Experiment API ([#2377](https://github.com/kubeflow/katib/pull/2377) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
### Push-based Metrics Collector
- [GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection ([#2437](https://github.com/kubeflow/katib/pull/2437) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Compatibility Changes in Trial Controller ([#2394](https://github.com/kubeflow/katib/pull/2394) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] New Interface `report_metrics` in Python SDK ([#2371](https://github.com/kubeflow/katib/pull/2371) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] KEP for Project 6: Push-based Metrics Collection for Katib ([#2328](https://github.com/kubeflow/katib/pull/2328) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Add New Parameter in `tune` ([#2369](https://github.com/kubeflow/katib/pull/2369) by [@Electronic-Waste](https://github.com/Electronic-Waste))
### SDK Updates
- [SDK] Support PyTorchJob as a Trial Worker ([#2512](https://github.com/kubeflow/katib/pull/2512) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] test: Add e2e test for tune function. ([#2399](https://github.com/kubeflow/katib/pull/2399) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] improve PVC creation name error ([#2496](https://github.com/kubeflow/katib/pull/2496) by [@mahdikhashan](https://github.com/mahdikhashan))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Explain Python version support cycle ([#2354](https://github.com/kubeflow/katib/pull/2354) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- fix(webhook): fix validation message in experiment webhook ([#2507](https://github.com/kubeflow/katib/pull/2507) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Install typing-extensions v4.10.0 to fix Python test error ([#2504](https://github.com/kubeflow/katib/pull/2504) by [@helenxie-bit](https://github.com/helenxie-bit))
- [SDK] Update `tune` API ([#2497](https://github.com/kubeflow/katib/pull/2497) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix(api): resolve all api violation exceptions in katib api ([#2482](https://github.com/kubeflow/katib/pull/2482) by [@truc0](https://github.com/truc0))
- fix(trial): use propagated gomega to improve debuggability. ([#2432](https://github.com/kubeflow/katib/pull/2432) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix(ui): update None Collector with Push Collector. ([#2418](https://github.com/kubeflow/katib/pull/2418) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: Resolve errors in e2e tests for cypress in Katib UI ([#2384](https://github.com/kubeflow/katib/pull/2384) by [@tariq-hasan](https://github.com/tariq-hasan))
- doc(example): fix the broken link. ([#2433](https://github.com/kubeflow/katib/pull/2433) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: remove remaining MXNet dependency. ([#2456](https://github.com/kubeflow/katib/pull/2456) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Remove Dropout layer from ENAS Trial container to fix E2E tests ([#2455](https://github.com/kubeflow/katib/pull/2455) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] fix grpc related bugs in Python SDK ([#2398](https://github.com/kubeflow/katib/pull/2398) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] Fix types error ([#2424](https://github.com/kubeflow/katib/pull/2424) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix: remove the dependency of `protocmp` in `google.golang.org/protobuf/testing/protocmp`. ([#2391](https://github.com/kubeflow/katib/pull/2391) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix apple silicon rosetta error when building images from the source code ([#2342](https://github.com/kubeflow/katib/pull/2342) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix katib use crds token pipeline trial template guide ([#2330](https://github.com/kubeflow/katib/pull/2330) by [@Jerry-yz](https://github.com/Jerry-yz))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Set experiment names at a max of 40 characters. ([#2468](https://github.com/kubeflow/katib/pull/2468) by [@AydanPirani](https://github.com/AydanPirani))
- [CI] optimize katib ui dockerfile ([#2505](https://github.com/kubeflow/katib/pull/2505) by [@mahdikhashan](https://github.com/mahdikhashan))
- Sort experiments by descending creation date by default in katib-ui ([#2498](https://github.com/kubeflow/katib/pull/2498) by [@Doris-xm](https://github.com/Doris-xm))
- [GSoC] Add unit tests for `tune` API ([#2423](https://github.com/kubeflow/katib/pull/2423) by [@helenxie-bit](https://github.com/helenxie-bit))
- Update MutatingWebhookConfiguration: Switch from objectSelector to AdmissionWebhookMatchConditions ([#2241](https://github.com/kubeflow/katib/pull/2241) by [@lianghao208](https://github.com/lianghao208))
- chore: supporting the listen-address parameter on db-manager ([#2465](https://github.com/kubeflow/katib/pull/2465) by [@caiofralmeida](https://github.com/caiofralmeida))
- Upgrade klog to v2 ([#2470](https://github.com/kubeflow/katib/pull/2470) by [@Doris-xm](https://github.com/Doris-xm))
- Ignore cache exporting errors in the image building workflows ([#2487](https://github.com/kubeflow/katib/pull/2487) by [@Doris-xm](https://github.com/Doris-xm))
- Upgrade grpcio version to v1.64.1 ([#2483](https://github.com/kubeflow/katib/pull/2483) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- docs: remove katib workflow ([#2443](https://github.com/kubeflow/katib/pull/2443) by [@gonmmarques](https://github.com/gonmmarques))
- Migrate KatibCertGenerator to OPA CertController ([#2345](https://github.com/kubeflow/katib/pull/2345) by [@forsaken628](https://github.com/forsaken628))
- Promote @Electronic-Waste and @helenxie-bit as Katib reviewers ([#2439](https://github.com/kubeflow/katib/pull/2439) by [@andreyvelich](https://github.com/andreyvelich))
- Update README and out-of-date docs ([#2438](https://github.com/kubeflow/katib/pull/2438) by [@andreyvelich](https://github.com/andreyvelich))
- Changes isort profile to black, to be fully compatible and adds 'pkg' dir for black and flake8 ([#2413](https://github.com/kubeflow/katib/pull/2413) by [@Ygnas](https://github.com/Ygnas))
- Introduced error constants and replaced reflect with cmp ([#2289](https://github.com/kubeflow/katib/pull/2289) by [@tariq-hasan](https://github.com/tariq-hasan))
- [Test] Refactor `inject_webhook_test.go` according to the Developer Guide ([#2401](https://github.com/kubeflow/katib/pull/2401) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Enhance pre-commit hooks with flake8 and black ([#2407](https://github.com/kubeflow/katib/pull/2407) by [@Ygnas](https://github.com/Ygnas))
- added `Distribution` field to feasibleSpace in `api.proto` ([#2397](https://github.com/kubeflow/katib/pull/2397) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- Begin enabling pre-commit hooks ([#2242](https://github.com/kubeflow/katib/pull/2242) by [@droctothorpe](https://github.com/droctothorpe))
- Update Instructions for Argo Workflows ([#2382](https://github.com/kubeflow/katib/pull/2382) by [@jaffe-fly](https://github.com/jaffe-fly))
- docs: update suggestion.md ([#2387](https://github.com/kubeflow/katib/pull/2387) by [@eltociear](https://github.com/eltociear))
- Add command to re-run GitHub Actions tests ([#2385](https://github.com/kubeflow/katib/pull/2385) by [@andreyvelich](https://github.com/andreyvelich))
- Bump Katib Python SDK to 0.17.0 version ([#2379](https://github.com/kubeflow/katib/pull/2379) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0 ([#2380](https://github.com/kubeflow/katib/pull/2380) by [@andreyvelich](https://github.com/andreyvelich))
- Replaced hpcloud with nxadm for tail package in Go ([#2375](https://github.com/kubeflow/katib/pull/2375) by [@tariq-hasan](https://github.com/tariq-hasan))
- Use ErrorList for experiment validator ([#2329](https://github.com/kubeflow/katib/pull/2329) by [@ckcd](https://github.com/ckcd))
- Add Changelog for Katib v0.17.0-rc.1 ([#2370](https://github.com/kubeflow/katib/pull/2370) by [@andreyvelich](https://github.com/andreyvelich))
- Remove default caBundle value ([#2368](https://github.com/kubeflow/katib/pull/2368) by [@vihangm](https://github.com/vihangm))
- Bump Katib Python SDK to 0.17.0rc1 version ([#2365](https://github.com/kubeflow/katib/pull/2365) by [@andreyvelich](https://github.com/andreyvelich))
- Add unit test for `create_experiment` in the `katib_client` module ([#2325](https://github.com/kubeflow/katib/pull/2325) by [@tariq-hasan](https://github.com/tariq-hasan))
- Remove code generation from release script ([#2363](https://github.com/kubeflow/katib/pull/2363) by [@andreyvelich](https://github.com/andreyvelich))
- Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Update Slack Invitation ([#2349](https://github.com/kubeflow/katib/pull/2349) by [@andreyvelich](https://github.com/andreyvelich))
- Update GitHub template to better triage Issues ([#2335](https://github.com/kubeflow/katib/pull/2335) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0-rc.0 ([#2319](https://github.com/kubeflow/katib/pull/2319) by [@andreyvelich](https://github.com/andreyvelich))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Make test fields private in Go unit tests ([#2316](https://github.com/kubeflow/katib/pull/2316) by [@tariq-hasan](https://github.com/tariq-hasan))
- Bump Katib Python SDK to 0.17.0rc0 Version ([#2318](https://github.com/kubeflow/katib/pull/2318) by [@andreyvelich](https://github.com/andreyvelich))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0...v0.18.0-rc.0)
# [v0.17.0](https://github.com/kubeflow/katib/tree/v0.17.0) (2024-07-12)
## Breaking Changes
- [SDK] Drop Python 3.7 and Support Python 3.11 ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
- [SDK] Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.26, and support Kubernetes v1.29 ([#2308](https://github.com/kubeflow/katib/pull/2308) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.25, and Support Kubernetes v1.28 ([#2303](https://github.com/kubeflow/katib/pull/2303) by [@tenzen-y](https://github.com/tenzen-y))
- Remove MXNet examples ([#2267](https://github.com/kubeflow/katib/pull/2267) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Core Features
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Support ARM64 arch for release images ([#2315](https://github.com/kubeflow/katib/pull/2315) by [@andreyvelich](https://github.com/andreyvelich))
- DB: Add environment variable option to skip DB table creation ([#2245](https://github.com/kubeflow/katib/pull/2245) by [@lkaybob](https://github.com/lkaybob))
- Add environment variable option to set postgres ssl mode ([#2266](https://github.com/kubeflow/katib/pull/2266) by [@ckcd](https://github.com/ckcd))
- Upgrade TensorFlow version to v2.16.1 ([#2282](https://github.com/kubeflow/katib/pull/2282) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v2.2.1 ([#2279](https://github.com/kubeflow/katib/pull/2279) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Features
- [SDK] Generate Name functionality for creating experiments. ([#2272](https://github.com/kubeflow/katib/pull/2272) by [@bharathk005](https://github.com/bharathk005))
- [SDK] Add `env` & `env_from` in client tune ([#2235](https://github.com/kubeflow/katib/pull/2235) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Add 'algorithm_settings' in client tune ([#2227](https://github.com/kubeflow/katib/pull/2227) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Raise more human-readable name conflict exception ([#2199](https://github.com/kubeflow/katib/pull/2199) by [@droctothorpe](https://github.com/droctothorpe))
## Bug Fixes
- Remove code generation from release script ([#2364](https://github.com/kubeflow/katib/pull/2364) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix env per Trial parameter in tune API ([#2304](https://github.com/kubeflow/katib/pull/2304) by [@andreyvelich](https://github.com/andreyvelich))
- Fix: clean up UTs for file metrics collector ([#2285](https://github.com/kubeflow/katib/pull/2285) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix tensor devices for DARTS Trial ([#2273](https://github.com/kubeflow/katib/pull/2273) by [@sifa1024](https://github.com/sifa1024))
- Typo fix stale.yaml ([#2257](https://github.com/kubeflow/katib/pull/2257) by [@tarilabs](https://github.com/tarilabs))
- Fix Optuna Validation for CMA-ES ([#2240](https://github.com/kubeflow/katib/pull/2240) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Upgrade Go version to v1.22 ([#2309](https://github.com/kubeflow/katib/pull/2309) by [@tenzen-y](https://github.com/tenzen-y))
- CI: Enable parallel mode for the coveralls ([#2297](https://github.com/kubeflow/katib/pull/2297) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.11 ([#2278](https://github.com/kubeflow/katib/pull/2278) by [@tenzen-y](https://github.com/tenzen-y))
- chore: add unit testcases for files in Text format. ([#2274](https://github.com/kubeflow/katib/pull/2274) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade google/go-containerregistry/pkg/authn/k8schain ([#2252](https://github.com/kubeflow/katib/pull/2252) by [@tenzen-y](https://github.com/tenzen-y))
- Add Technical and style guide to the contribution guide ([#2250](https://github.com/kubeflow/katib/pull/2250) by [@tenzen-y](https://github.com/tenzen-y))
- Install typing-extensions v4.6.3 for Optuna ([#2251](https://github.com/kubeflow/katib/pull/2251) by [@tenzen-y](https://github.com/tenzen-y))
- Remove legacy BO code ([#2246](https://github.com/kubeflow/katib/pull/2246) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0 ([#2239](https://github.com/kubeflow/katib/pull/2239) by [@andreyvelich](https://github.com/andreyvelich))
- Add Katib ROADMAP 2022/2023 ([#2153](https://github.com/kubeflow/katib/pull/2153) by [@andreyvelich](https://github.com/andreyvelich))
- Update Ubuntu to 22.04 for E2E Tests ([#2222](https://github.com/kubeflow/katib/pull/2222) by [@andreyvelich](https://github.com/andreyvelich))
- Run Stale Action Every 5th Hour ([#2221](https://github.com/kubeflow/katib/pull/2221) by [@andreyvelich](https://github.com/andreyvelich))
- Add Stale GitHub Action ([#2220](https://github.com/kubeflow/katib/pull/2220) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.1 ([#2218](https://github.com/kubeflow/katib/pull/2218) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.0 ([#2204](https://github.com/kubeflow/katib/pull/2204) by [@andreyvelich](https://github.com/andreyvelich))
- Use the controller-runtime logger in the cert-generator ([#2219](https://github.com/kubeflow/katib/pull/2219) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0...v0.17.0)
# [v0.17.0-rc.1](https://github.com/kubeflow/katib/tree/v0.17.0-rc.1) (2024-06-20)
## Breaking Changes
- [SDK] Drop Python 3.7 and Support Python 3.11 ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
- [SDK] Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
## Bug Fixes
- Remove code generation from release script ([#2364](https://github.com/kubeflow/katib/pull/2364) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0-rc.0...v0.17.0-rc.1)
# [v0.17.0-rc.0](https://github.com/kubeflow/katib/tree/v0.17.0-rc.0) (2024-04-29)
## Breaking Changes
- Drop Kubernetes v1.26, and support Kubernetes v1.29 ([#2308](https://github.com/kubeflow/katib/pull/2308) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.25, and Support Kubernetes v1.28 ([#2303](https://github.com/kubeflow/katib/pull/2303) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Core Features
- Support ARM64 arch for release images ([#2315](https://github.com/kubeflow/katib/pull/2315) by [@andreyvelich](https://github.com/andreyvelich))
- DB: Add environment variable option to skip DB table creation ([#2245](https://github.com/kubeflow/katib/pull/2245) by [@lkaybob](https://github.com/lkaybob))
- Add environment variable option to set postgres ssl mode ([#2266](https://github.com/kubeflow/katib/pull/2266) by [@ckcd](https://github.com/ckcd))
- Upgrade TensorFlow version to v2.16.1 ([#2282](https://github.com/kubeflow/katib/pull/2282) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v2.2.1 ([#2279](https://github.com/kubeflow/katib/pull/2279) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Features
- [SDK] Generate Name functionality for creating experiments. ([#2272](https://github.com/kubeflow/katib/pull/2272) by [@bharathk005](https://github.com/bharathk005))
- [SDK] Add `env` & `env_from` in client tune ([#2235](https://github.com/kubeflow/katib/pull/2235) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Add 'algorithm_settings' in client tune ([#2227](https://github.com/kubeflow/katib/pull/2227) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Raise more human-readable name conflict exception ([#2199](https://github.com/kubeflow/katib/pull/2199) by [@droctothorpe](https://github.com/droctothorpe))
## Bug Fixes
- [SDK] Fix env per Trial parameter in tune API ([#2304](https://github.com/kubeflow/katib/pull/2304) by [@andreyvelich](https://github.com/andreyvelich))
- Fix: clean up UTs for file metrics collector ([#2285](https://github.com/kubeflow/katib/pull/2285) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix tensor devices for DARTS Trial ([#2273](https://github.com/kubeflow/katib/pull/2273) by [@sifa1024](https://github.com/sifa1024))
- Typo fix stale.yaml ([#2257](https://github.com/kubeflow/katib/pull/2257) by [@tarilabs](https://github.com/tarilabs))
- Fix Optuna Validation for CMA-ES ([#2240](https://github.com/kubeflow/katib/pull/2240) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade Go version to v1.22 ([#2309](https://github.com/kubeflow/katib/pull/2309) by [@tenzen-y](https://github.com/tenzen-y))
- CI: Enable parallel mode for the coveralls ([#2297](https://github.com/kubeflow/katib/pull/2297) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.11 ([#2278](https://github.com/kubeflow/katib/pull/2278) by [@tenzen-y](https://github.com/tenzen-y))
- chore: add unit testcases for files in Text format. ([#2274](https://github.com/kubeflow/katib/pull/2274) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade google/go-containerregistry/pkg/authn/k8schain ([#2252](https://github.com/kubeflow/katib/pull/2252) by [@tenzen-y](https://github.com/tenzen-y))
- Remove MXNet examples ([#2267](https://github.com/kubeflow/katib/pull/2267) by [@tenzen-y](https://github.com/tenzen-y))
- Add Technical and style guide to the contribution guide ([#2250](https://github.com/kubeflow/katib/pull/2250) by [@tenzen-y](https://github.com/tenzen-y))
- Install typing-extensions v4.6.3 for Optuna ([#2251](https://github.com/kubeflow/katib/pull/2251) by [@tenzen-y](https://github.com/tenzen-y))
- Remove legacy BO code ([#2246](https://github.com/kubeflow/katib/pull/2246) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0 ([#2239](https://github.com/kubeflow/katib/pull/2239) by [@andreyvelich](https://github.com/andreyvelich))
- Add Katib ROADMAP 2022/2023 ([#2153](https://github.com/kubeflow/katib/pull/2153) by [@andreyvelich](https://github.com/andreyvelich))
- Update Ubuntu to 22.04 for E2E Tests ([#2222](https://github.com/kubeflow/katib/pull/2222) by [@andreyvelich](https://github.com/andreyvelich))
- Run Stale Action Every 5th Hour ([#2221](https://github.com/kubeflow/katib/pull/2221) by [@andreyvelich](https://github.com/andreyvelich))
- Add Stale GitHub Action ([#2220](https://github.com/kubeflow/katib/pull/2220) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.1 ([#2218](https://github.com/kubeflow/katib/pull/2218) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.0 ([#2204](https://github.com/kubeflow/katib/pull/2204) by [@andreyvelich](https://github.com/andreyvelich))
- Use the controller-runtime logger in the cert-generator ([#2219](https://github.com/kubeflow/katib/pull/2219) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0...v0.17.0-rc.0)
# [v0.16.0](https://github.com/kubeflow/katib/tree/v0.16.0) (2023-10-31)
## Breaking Changes
- Implement KatibConfig API ([#2176](https://github.com/kubeflow/katib/pull/2176) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.24 and support Kubernetes v1.27 ([#2182](https://github.com/kubeflow/katib/pull/2182) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.23 and support Kubernetes v1.26 ([#2177](https://github.com/kubeflow/katib/pull/2177) by [@tenzen-y](https://github.com/tenzen-y))
- Change failurePolicy to Fail for Katib Webhooks ([#2018](https://github.com/kubeflow/katib/pull/2018) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Core Features
- Consolidate the Katib Cert Generator to the Katib Controller ([#2185](https://github.com/kubeflow/katib/pull/2185) by [@tenzen-y](https://github.com/tenzen-y))
- Containerize tests for Katib Conformance ([#2146](https://github.com/kubeflow/katib/pull/2146) by [@nagar-ajay](https://github.com/nagar-ajay))
### UI Improvements
- [UI] Default Resume Policy to never from UI ([#2195](https://github.com/kubeflow/katib/pull/2195) by [@mChowdhury-91](https://github.com/mChowdhury-91))
- [UI] Remove Deprecated Katib UI ([#2179](https://github.com/kubeflow/katib/pull/2179) by [@andreyvelich](https://github.com/andreyvelich))
- [UI] Fix Trial Logs when Kubernetes Job Fails ([#2164](https://github.com/kubeflow/katib/pull/2164) by [@andreyvelich](https://github.com/andreyvelich))
- kwa(front): Support all namespaces ([#2119](https://github.com/kubeflow/katib/pull/2119) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Update the use of SnackBarService ([#2113](https://github.com/kubeflow/katib/pull/2113) by [@orfeas-k](https://github.com/orfeas-k))
- UI: Remove an unused import, EventV1beta1Api ([#2116](https://github.com/kubeflow/katib/pull/2116) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Improvements
- [SDK] Enable resource specification for trial containers ([#2192](https://github.com/kubeflow/katib/pull/2192) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Add namespace parameter to KatibClient ([#2183](https://github.com/kubeflow/katib/pull/2183) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Import all Kubernetes Models ([#2148](https://github.com/kubeflow/katib/pull/2148) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- Bug: Wait for the certs to be mounted inside the container ([#2213](https://github.com/kubeflow/katib/pull/2213) by [@tenzen-y](https://github.com/tenzen-y))
- Start waiting for certs to be ready before sending data to the channel ([#2215](https://github.com/kubeflow/katib/pull/2215) by [@tenzen-y](https://github.com/tenzen-y))
- E2E: Add additional checks to verify if the components are ready ([#2212](https://github.com/kubeflow/katib/pull/2212) by [@tenzen-y](https://github.com/tenzen-y))
- Remove a katib-webhook-cert Secret from components ([#2214](https://github.com/kubeflow/katib/pull/2214) by [@tenzen-y](https://github.com/tenzen-y))
- Skip to inject the metrics-collector pods to the Katib controller ([#2211](https://github.com/kubeflow/katib/pull/2211) by [@tenzen-y](https://github.com/tenzen-y))
- Sending an empty data to the certsReady channel ([#2196](https://github.com/kubeflow/katib/pull/2196) by [@tenzen-y](https://github.com/tenzen-y))
- Fix conformance docker image ([#2147](https://github.com/kubeflow/katib/pull/2147) by [@nagar-ajay](https://github.com/nagar-ajay))
## Documentation
- Add PITS Global Data Recovery Services to the adopters list ([#2160](https://github.com/kubeflow/katib/pull/2160) by [@ghost](https://github.com/ghost))
- Add SDK Breaking Change to Changelog ([#2133](https://github.com/kubeflow/katib/pull/2133) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0 ([#2129](https://github.com/kubeflow/katib/pull/2129) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.1 ([#2123](https://github.com/kubeflow/katib/pull/2123) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.0 ([#2106](https://github.com/kubeflow/katib/pull/2106) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade TensorFlow version to v2.13.0 ([#2216](https://github.com/kubeflow/katib/pull/2216) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Go version to v1.20 ([#2190](https://github.com/kubeflow/katib/pull/2190) by [@tenzen-y](https://github.com/tenzen-y))
- Replace grpc_health_probe with the built-in gRPC container probe feature ([#2189](https://github.com/kubeflow/katib/pull/2189) by [@tenzen-y](https://github.com/tenzen-y))
- Allow install binaries for the arm64 in the envtest ([#2188](https://github.com/kubeflow/katib/pull/2188) by [@tenzen-y](https://github.com/tenzen-y))
- Replace action to setup minikube with medyagh/setup-minikube ([#2178](https://github.com/kubeflow/katib/pull/2178) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Charmed Operators for Katib ([#2161](https://github.com/kubeflow/katib/pull/2161) by [@ca-scribner](https://github.com/ca-scribner))
- Namespace and trial pod annotations as CLI argument ([#2138](https://github.com/kubeflow/katib/pull/2138) by [@nagar-ajay](https://github.com/nagar-ajay))
- Relax dependencies restriction for the gRPC libraries ([#2140](https://github.com/kubeflow/katib/pull/2140) by [@tenzen-y](https://github.com/tenzen-y))
- Increase the free spaces in CI ([#2131](https://github.com/kubeflow/katib/pull/2131) by [@tenzen-y](https://github.com/tenzen-y))
- Reformat katib-operators ([#2114](https://github.com/kubeflow/katib/pull/2114) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0...v0.16.0)
# [v0.16.0-rc.1](https://github.com/kubeflow/katib/tree/v0.16.0-rc.1) (2023-08-16)
## New Features
- Upgrade TensorFlow version to v2.13.0 ([#2216](https://github.com/kubeflow/katib/pull/2216) by [@tenzen-y](https://github.com/tenzen-y))
## Bug Fixes
- Bug: Wait for the certs to be mounted inside the container ([#2213](https://github.com/kubeflow/katib/pull/2213) by [@tenzen-y](https://github.com/tenzen-y))
- Start waiting for certs to be ready before sending data to the channel ([#2215](https://github.com/kubeflow/katib/pull/2215) by [@tenzen-y](https://github.com/tenzen-y))
- E2E: Add additional checks to verify if the components are ready ([#2212](https://github.com/kubeflow/katib/pull/2212) by [@tenzen-y](https://github.com/tenzen-y))
- Remove a katib-webhook-cert Secret from components ([#2214](https://github.com/kubeflow/katib/pull/2214) by [@tenzen-y](https://github.com/tenzen-y))
- Skip to inject the metrics-collector pods to the Katib controller ([#2211](https://github.com/kubeflow/katib/pull/2211) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0-rc.0...v0.16.0-rc.1)
# [v0.16.0-rc.0](https://github.com/kubeflow/katib/tree/v0.16.0-rc.0) (2023-08-05)
## Breaking Changes
- Implement KatibConfig API ([#2176](https://github.com/kubeflow/katib/pull/2176) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.24 and support Kubernetes v1.27 ([#2182](https://github.com/kubeflow/katib/pull/2182) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.23 and support Kubernetes v1.26 ([#2177](https://github.com/kubeflow/katib/pull/2177) by [@tenzen-y](https://github.com/tenzen-y))
- Change failurePolicy to Fail for Katib Webhooks ([#2018](https://github.com/kubeflow/katib/pull/2018) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Core Features
- Consolidate the Katib Cert Generator to the Katib Controller ([#2185](https://github.com/kubeflow/katib/pull/2185) by [@tenzen-y](https://github.com/tenzen-y))
- Containerize tests for Katib Conformance ([#2146](https://github.com/kubeflow/katib/pull/2146) by [@nagar-ajay](https://github.com/nagar-ajay))
### UI Improvements
- [UI] Default Resume Policy to never from UI ([#2195](https://github.com/kubeflow/katib/pull/2195) by [@mChowdhury-91](https://github.com/mChowdhury-91))
- [UI] Remove Deprecated Katib UI ([#2179](https://github.com/kubeflow/katib/pull/2179) by [@andreyvelich](https://github.com/andreyvelich))
- [UI] Fix Trial Logs when Kubernetes Job Fails ([#2164](https://github.com/kubeflow/katib/pull/2164) by [@andreyvelich](https://github.com/andreyvelich))
- kwa(front): Support all namespaces ([#2119](https://github.com/kubeflow/katib/pull/2119) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Update the use of SnackBarService ([#2113](https://github.com/kubeflow/katib/pull/2113) by [@orfeas-k](https://github.com/orfeas-k))
- UI: Remove an unused import, EventV1beta1Api ([#2116](https://github.com/kubeflow/katib/pull/2116) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Improvements
- [SDK] Enable resource specification for trial containers ([#2192](https://github.com/kubeflow/katib/pull/2192) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Add namespace parameter to KatibClient ([#2183](https://github.com/kubeflow/katib/pull/2183) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Import all Kubernetes Models ([#2148](https://github.com/kubeflow/katib/pull/2148) by [@andreyvelich](https://github.com/andreyvelich))
## Bug fixes
- Sending empty data to the certsReady channel ([#2196](https://github.com/kubeflow/katib/pull/2196) by [@tenzen-y](https://github.com/tenzen-y))
- Fix conformance docker image ([#2147](https://github.com/kubeflow/katib/pull/2147) by [@nagar-ajay](https://github.com/nagar-ajay))
## Documentation
- Add PITS Global Data Recovery Services to the adopters list ([#2160](https://github.com/kubeflow/katib/pull/2160) by [@ghost](https://github.com/ghost))
- Add SDK Breaking Change to Changelog ([#2133](https://github.com/kubeflow/katib/pull/2133) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0 ([#2129](https://github.com/kubeflow/katib/pull/2129) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.1 ([#2123](https://github.com/kubeflow/katib/pull/2123) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.0 ([#2106](https://github.com/kubeflow/katib/pull/2106) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade Go version to v1.20 ([#2190](https://github.com/kubeflow/katib/pull/2190) by [@tenzen-y](https://github.com/tenzen-y))
- Replace grpc_health_probe with the built-in gRPC container probe feature ([#2189](https://github.com/kubeflow/katib/pull/2189) by [@tenzen-y](https://github.com/tenzen-y))
- Allow installing binaries for arm64 in envtest ([#2188](https://github.com/kubeflow/katib/pull/2188) by [@tenzen-y](https://github.com/tenzen-y))
- Replace action to setup minikube with medyagh/setup-minikube ([#2178](https://github.com/kubeflow/katib/pull/2178) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Charmed Operators for Katib ([#2161](https://github.com/kubeflow/katib/pull/2161) by [@ca-scribner](https://github.com/ca-scribner))
- Namespace and trial pod annotations as CLI argument ([#2138](https://github.com/kubeflow/katib/pull/2138) by [@nagar-ajay](https://github.com/nagar-ajay))
- Relax dependencies restriction for the gRPC libraries ([#2140](https://github.com/kubeflow/katib/pull/2140) by [@tenzen-y](https://github.com/tenzen-y))
- Increase the free space in CI ([#2131](https://github.com/kubeflow/katib/pull/2131) by [@tenzen-y](https://github.com/tenzen-y))
- Reformat katib-operators ([#2114](https://github.com/kubeflow/katib/pull/2114) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0...v0.16.0-rc.0)
# [v0.15.0](https://github.com/kubeflow/katib/tree/v0.15.0) (2023-03-22)
## Breaking Changes
- Use **Never** Resume Policy as Default ([#2102](https://github.com/kubeflow/katib/pull/2102) by [@andreyvelich](https://github.com/andreyvelich))
- Chocolate Suggestion Service is removed ([#2071](https://github.com/kubeflow/katib/pull/2071) by [@tenzen-y](https://github.com/tenzen-y))
- `request_number` is removed from the gRPC APIs ([#1994](https://github.com/kubeflow/katib/pull/1994) by [@johnugeorge](https://github.com/johnugeorge))
- Enabling Authorization in Katib UI ([#1983](https://github.com/kubeflow/katib/pull/1983) and [#2041](https://github.com/kubeflow/katib/pull/2041) by [@apo-ger](https://github.com/apo-ger))
- The new improved and refactored Katib SDK is not backward compatible ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Major Features
- Narrow down Katib RBAC rules ([#2091](https://github.com/kubeflow/katib/pull/2091) by [@johnugeorge](https://github.com/johnugeorge))
- Support Postgres as a Katib DB ([#1921](https://github.com/kubeflow/katib/pull/1921) by [@anencore94](https://github.com/anencore94))
- More Suggestion container fields in Katib Config ([#2000](https://github.com/kubeflow/katib/pull/2000) by [@fischor](https://github.com/fischor))
- Katib UI: Create the LOGS tab of Trial's details page ([#2117](https://github.com/kubeflow/katib/pull/2117) by [@elenzio9](https://github.com/elenzio9))
- Katib UI: Enable pagination/sorting/filtering ([#2017](https://github.com/kubeflow/katib/pull/2017) and [#2040](https://github.com/kubeflow/katib/pull/2040) by [@elenzio9](https://github.com/elenzio9))
- [SDK] Create Tune API in the Katib SDK ([#1951](https://github.com/kubeflow/katib/pull/1951) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Get Trial Metrics from Katib DB ([#2050](https://github.com/kubeflow/katib/pull/2050) by [@andreyvelich](https://github.com/andreyvelich))
### Core Features
- Add Conformance Program Doc for AutoML and Training WG ([#2048](https://github.com/kubeflow/katib/pull/2048) by [@andreyvelich](https://github.com/andreyvelich))
- Support for grid search algorithm in Optuna Suggestion Service ([#2060](https://github.com/kubeflow/katib/pull/2060) by [@tenzen-y](https://github.com/tenzen-y))
- Add Trial Labels During Pod Mutation ([#2047](https://github.com/kubeflow/katib/pull/2047) by [@andreyvelich](https://github.com/andreyvelich))
- Support for k8s v1.25 in CI ([#1997](https://github.com/kubeflow/katib/pull/1997) by [@johnugeorge](https://github.com/johnugeorge))
- Add the CI to build multi-platform container images ([#1956](https://github.com/kubeflow/katib/pull/1956) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.21 and introduce Kubernetes v1.24 ([#1953](https://github.com/kubeflow/katib/pull/1953) by [@tenzen-y](https://github.com/tenzen-y))
- Add --connect-timeout flag to katib-db-manager ([#1937](https://github.com/kubeflow/katib/pull/1937) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validations for DARTS suggestion service ([#1926](https://github.com/kubeflow/katib/pull/1926) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validation for Optuna suggestion service ([#1924](https://github.com/kubeflow/katib/pull/1924) by [@tenzen-y](https://github.com/tenzen-y))
### UI Improvements
- Make links in KWA's tables actual links ([#2090](https://github.com/kubeflow/katib/pull/2090) by [@elenzio9](https://github.com/elenzio9))
- frontend: Rework the trial graph using ECharts in KWA ([#2089](https://github.com/kubeflow/katib/pull/2089) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Add UI tests with Cypress ([#2088](https://github.com/kubeflow/katib/pull/2088) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Enable actions in experiment graph ([#2065](https://github.com/kubeflow/katib/pull/2065) by [@elenzio9](https://github.com/elenzio9))
- frontend: Show message in case of uncompleted trial instead of the graph ([#2063](https://github.com/kubeflow/katib/pull/2063) by [@elenzio9](https://github.com/elenzio9))
- frontend: Add source maps in the browser ([#2043](https://github.com/kubeflow/katib/pull/2043) by [@elenzio9](https://github.com/elenzio9))
- Backend for getting logs of a trial ([#2039](https://github.com/kubeflow/katib/pull/2039) by [@d-gol](https://github.com/d-gol))
- frontend: Show the successful trials in the experiment graph (#2013) ([#2033](https://github.com/kubeflow/katib/pull/2033) by [@elenzio9](https://github.com/elenzio9))
- frontend: Migrate from tslint to eslint in KWA ([#2042](https://github.com/kubeflow/katib/pull/2042) by [@elenzio9](https://github.com/elenzio9))
- Dedicated yaml tab for Trials ([#2034](https://github.com/kubeflow/katib/pull/2034) by [@elenzio9](https://github.com/elenzio9))
- KWA: Use new Editor component (Monaco) ([#2023](https://github.com/kubeflow/katib/pull/2023) by [@orfeas-k](https://github.com/orfeas-k))
- kwa(build): Introduce COMMIT file for building KWA ([#2014](https://github.com/kubeflow/katib/pull/2014) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Fix 500 error after detail page refresh (#1967) ([#2001](https://github.com/kubeflow/katib/pull/2001) by [@elenzio9](https://github.com/elenzio9))
- Introduce KWA's frontend component for kfp links ([#1991](https://github.com/kubeflow/katib/pull/1991) by [@elenzio9](https://github.com/elenzio9))
- UI: Rename and right align the age column ([#1989](https://github.com/kubeflow/katib/pull/1989) by [@elenzio9](https://github.com/elenzio9))
- Show the trials table's status column first ([#1990](https://github.com/kubeflow/katib/pull/1990) by [@elenzio9](https://github.com/elenzio9))
- UI: Make KWA's main table responsive and add toolbar ([#1982](https://github.com/kubeflow/katib/pull/1982) by [@elenzio9](https://github.com/elenzio9))
- UI: Fix unit tests ([#1977](https://github.com/kubeflow/katib/pull/1977) by [@elenzio9](https://github.com/elenzio9))
- UI: Format code ([#1979](https://github.com/kubeflow/katib/pull/1979) by [@orfeas-k](https://github.com/orfeas-k))
- Recreate the Experiments Parallel Coordinates Graph ([#1974](https://github.com/kubeflow/katib/pull/1974) by [@elenzio9](https://github.com/elenzio9))
- Improve UI API/controller logging to ease troubleshooting ([#1966](https://github.com/kubeflow/katib/pull/1966) by [@lukeogg](https://github.com/lukeogg))
### SDK Improvements
- [SDK] Use Katib SDK for E2E Tests ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Use Katib Client without Kube Config ([#2098](https://github.com/kubeflow/katib/pull/2098) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix namespace parameter in tune API ([#1981](https://github.com/kubeflow/katib/pull/1981) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Remove Final Keyword from constants ([#1980](https://github.com/kubeflow/katib/pull/1980) by [@andreyvelich](https://github.com/andreyvelich))
## Bug fixes
- Fix Release Script for Updating SDK Version ([#2104](https://github.com/kubeflow/katib/pull/2104) by [@andreyvelich](https://github.com/andreyvelich))
- [Fix] add early stopped trials in converter ([#2004](https://github.com/kubeflow/katib/pull/2004) by [@shaowei-su](https://github.com/shaowei-su))
- [bugfix] Fix value passing bug in New Experiment form ([#2027](https://github.com/kubeflow/katib/pull/2027) by [@orfeas-k](https://github.com/orfeas-k))
- Fix main process retrieve logic for early stopping ([#1988](https://github.com/kubeflow/katib/pull/1988) by [@shaowei-su](https://github.com/shaowei-su))
- [hotfix]: filter by name of experiment ([#1920](https://github.com/kubeflow/katib/pull/1920) by [@anencore94](https://github.com/anencore94))
- Fix push script to include new images ([#1911](https://github.com/kubeflow/katib/pull/1911) by [@johnugeorge](https://github.com/johnugeorge))
- fix: only validate Kubernetes Job ([#2025](https://github.com/kubeflow/katib/pull/2025) by [@zhixian82](https://github.com/zhixian82))
- Upgrade grpc-health-probe version to fix some security issues ([#2093](https://github.com/kubeflow/katib/pull/2093) by [@tenzen-y](https://github.com/tenzen-y))
- Format Katib Charm Operator ([#2115](https://github.com/kubeflow/katib/pull/2115) by [@tenzen-y](https://github.com/tenzen-y))
## Documentation
- Add CERN to adopters ([#2010](https://github.com/kubeflow/katib/pull/2010) by [@d-gol](https://github.com/d-gol))
- Add More Katib Presentations 2022 ([#2009](https://github.com/kubeflow/katib/pull/2009) by [@andreyvelich](https://github.com/andreyvelich))
- Add the documentation for simple-pbt ([#1978](https://github.com/kubeflow/katib/pull/1978) by [@tenzen-y](https://github.com/tenzen-y))
- Add the license to pbt ([#1958](https://github.com/kubeflow/katib/pull/1958) by [@tenzen-y](https://github.com/tenzen-y))
- Update the Katib version in docs ([#1950](https://github.com/kubeflow/katib/pull/1950) by [@tenzen-y](https://github.com/tenzen-y))
- Update CHANGELOG for v0.14.0 release ([#1932](https://github.com/kubeflow/katib/pull/1932) by [@johnugeorge](https://github.com/johnugeorge))
## Misc
- Update Training operator Image in CI ([#2103](https://github.com/kubeflow/katib/pull/2103) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Go libraries to resolve security issues ([#2094](https://github.com/kubeflow/katib/pull/2094) by [@tenzen-y](https://github.com/tenzen-y))
- Run e2e with various Python versions to verify Python SDK ([#2092](https://github.com/kubeflow/katib/pull/2092) by [@tenzen-y](https://github.com/tenzen-y))
- Add a --prefer-binary flag to 'pip install' command ([#2096](https://github.com/kubeflow/katib/pull/2096) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v1.13.0 ([#2082](https://github.com/kubeflow/katib/pull/2082) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade TensorFlow version ([#2079](https://github.com/kubeflow/katib/pull/2079) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.10 ([#2057](https://github.com/kubeflow/katib/pull/2057) by [@tenzen-y](https://github.com/tenzen-y))
- Pin the NumPy version with v1.23.5 in some images ([#2070](https://github.com/kubeflow/katib/pull/2070) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the actions-setup-minikube version to v2.7.2 ([#2064](https://github.com/kubeflow/katib/pull/2064) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Certificate Chain from Cert Generator ([#2045](https://github.com/kubeflow/katib/pull/2045) by [@andreyvelich](https://github.com/andreyvelich))
- Add resources to earlystopping container ([#2038](https://github.com/kubeflow/katib/pull/2038) by [@zhixian82](https://github.com/zhixian82))
- Add scripts to verify generated code and Go Modules ([#1999](https://github.com/kubeflow/katib/pull/1999) by [@tenzen-y](https://github.com/tenzen-y))
- [Test] Reduce Katib GitHub Action Runs ([#2036](https://github.com/kubeflow/katib/pull/2036) by [@andreyvelich](https://github.com/andreyvelich))
- gh-actions: Extend action to run Frontend Unit tests ([#1998](https://github.com/kubeflow/katib/pull/1998) by [@orfeas-k](https://github.com/orfeas-k))
- [chore] Upgrade docker/metadata-action, actions/checkout, and actions/setup-python version ([#1996](https://github.com/kubeflow/katib/pull/1996) by [@tenzen-y](https://github.com/tenzen-y))
- [chore] Upgrade Go version to v1.19 ([#1995](https://github.com/kubeflow/katib/pull/1995) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in simple-pbt image ([#1948](https://github.com/kubeflow/katib/pull/1948) by [@tenzen-y](https://github.com/tenzen-y))
- Support arm64 in darts-cnn-cifar10 image ([#1947](https://github.com/kubeflow/katib/pull/1947) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in enas-cnn-cifar10 image ([#1944](https://github.com/kubeflow/katib/pull/1944) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in pytorch-mnist image ([#1943](https://github.com/kubeflow/katib/pull/1943) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in mxnet-mnist image ([#1940](https://github.com/kubeflow/katib/pull/1940) by [@tenzen-y](https://github.com/tenzen-y))
- Use the katib-new-ui for Charmed gh-actions ([#1987](https://github.com/kubeflow/katib/pull/1987) by [@tenzen-y](https://github.com/tenzen-y))
- [feat] health check for katib-controller ([#1934](https://github.com/kubeflow/katib/pull/1934) by [@anencore94](https://github.com/anencore94))
- Upgrade Optuna from v2.x.x to v3.0.0 ([#1942](https://github.com/kubeflow/katib/pull/1942) by [@keisuke-umezawa](https://github.com/keisuke-umezawa))
- Add validation webhooks for maxFailedTrialCount and parallelTrialCount ([#1936](https://github.com/kubeflow/katib/pull/1936) by [@tenzen-y](https://github.com/tenzen-y))
- Introduce Automatic platform ARGs ([#1935](https://github.com/kubeflow/katib/pull/1935) by [@tenzen-y](https://github.com/tenzen-y))
- Update training operator image in CI ([#1933](https://github.com/kubeflow/katib/pull/1933) by [@johnugeorge](https://github.com/johnugeorge))
- Update Katib SDK version ([#1931](https://github.com/kubeflow/katib/pull/1931) by [@johnugeorge](https://github.com/johnugeorge))
- [chore] Upgrade Go version to v1.18 ([#1925](https://github.com/kubeflow/katib/pull/1925) by [@tenzen-y](https://github.com/tenzen-y))
- Add the pytorch-mnist with GPU support container image ([#1916](https://github.com/kubeflow/katib/pull/1916) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.14.0...v0.15.0)
# [v0.15.0-rc.1](https://github.com/kubeflow/katib/tree/v0.15.0-rc.1) (2023-02-15)
## New Features
- UI: Create the LOGS tab of Trial's details page ([#2117](https://github.com/kubeflow/katib/pull/2117) by [@elenzio9](https://github.com/elenzio9))
## Bug Fixes
- Format Katib Charm Operator ([#2115](https://github.com/kubeflow/katib/pull/2115) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0-rc.0...v0.15.0-rc.1)
# [v0.15.0-rc.0](https://github.com/kubeflow/katib/tree/v0.15.0-rc.0) (2023-01-27)
## Breaking Changes
- Use **Never** Resume Policy as Default ([#2102](https://github.com/kubeflow/katib/pull/2102) by [@andreyvelich](https://github.com/andreyvelich))
- Chocolate Suggestion Service is removed ([#2071](https://github.com/kubeflow/katib/pull/2071) by [@tenzen-y](https://github.com/tenzen-y))
- `request_number` is removed from the gRPC APIs ([#1994](https://github.com/kubeflow/katib/pull/1994) by [@johnugeorge](https://github.com/johnugeorge))
- The new improved and refactored Katib SDK is not backward compatible ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Major Features
- Narrow down Katib RBAC rules ([#2091](https://github.com/kubeflow/katib/pull/2091) by [@johnugeorge](https://github.com/johnugeorge))
- Support Postgres as a Katib DB ([#1921](https://github.com/kubeflow/katib/pull/1921) by [@anencore94](https://github.com/anencore94))
- More Suggestion container fields in Katib Config ([#2000](https://github.com/kubeflow/katib/pull/2000) by [@fischor](https://github.com/fischor))
- Katib UI: Enable pagination/sorting/filtering ([#2017](https://github.com/kubeflow/katib/pull/2017) and [#2040](https://github.com/kubeflow/katib/pull/2040) by [@elenzio9](https://github.com/elenzio9))
- Katib UI: Add authorization mechanisms ([#1983](https://github.com/kubeflow/katib/pull/1983) by [@apo-ger](https://github.com/apo-ger))
- [SDK] Create Tune API in the Katib SDK ([#1951](https://github.com/kubeflow/katib/pull/1951) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Get Trial Metrics from Katib DB ([#2050](https://github.com/kubeflow/katib/pull/2050) by [@andreyvelich](https://github.com/andreyvelich))
### Core Features
- Add Conformance Program Doc for AutoML and Training WG ([#2048](https://github.com/kubeflow/katib/pull/2048) by [@andreyvelich](https://github.com/andreyvelich))
- Support for grid search algorithm in Optuna Suggestion Service ([#2060](https://github.com/kubeflow/katib/pull/2060) by [@tenzen-y](https://github.com/tenzen-y))
- Add Trial Labels During Pod Mutation ([#2047](https://github.com/kubeflow/katib/pull/2047) by [@andreyvelich](https://github.com/andreyvelich))
- Support for k8s v1.25 in CI ([#1997](https://github.com/kubeflow/katib/pull/1997) by [@johnugeorge](https://github.com/johnugeorge))
- Add the CI to build multi-platform container images ([#1956](https://github.com/kubeflow/katib/pull/1956) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.21 and introduce Kubernetes v1.24 ([#1953](https://github.com/kubeflow/katib/pull/1953) by [@tenzen-y](https://github.com/tenzen-y))
- Add --connect-timeout flag to katib-db-manager ([#1937](https://github.com/kubeflow/katib/pull/1937) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validations for DARTS suggestion service ([#1926](https://github.com/kubeflow/katib/pull/1926) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validation for Optuna suggestion service ([#1924](https://github.com/kubeflow/katib/pull/1924) by [@tenzen-y](https://github.com/tenzen-y))
### UI Improvements
- Make links in KWA's tables actual links ([#2090](https://github.com/kubeflow/katib/pull/2090) by [@elenzio9](https://github.com/elenzio9))
- frontend: Rework the trial graph using ECharts in KWA ([#2089](https://github.com/kubeflow/katib/pull/2089) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Add UI tests with Cypress ([#2088](https://github.com/kubeflow/katib/pull/2088) by [@orfeas-k](https://github.com/orfeas-k))
- Update manifests to enable authorization check mechanisms for Katib UI in Kubeflow mode ([#2041](https://github.com/kubeflow/katib/pull/2041) by [@apo-ger](https://github.com/apo-ger))
- frontend: Enable actions in experiment graph ([#2065](https://github.com/kubeflow/katib/pull/2065) by [@elenzio9](https://github.com/elenzio9))
- frontend: Show message in case of uncompleted trial instead of the graph ([#2063](https://github.com/kubeflow/katib/pull/2063) by [@elenzio9](https://github.com/elenzio9))
- frontend: Add source maps in the browser ([#2043](https://github.com/kubeflow/katib/pull/2043) by [@elenzio9](https://github.com/elenzio9))
- Backend for getting logs of a trial ([#2039](https://github.com/kubeflow/katib/pull/2039) by [@d-gol](https://github.com/d-gol))
- frontend: Show the successful trials in the experiment graph (#2013) ([#2033](https://github.com/kubeflow/katib/pull/2033) by [@elenzio9](https://github.com/elenzio9))
- frontend: Migrate from tslint to eslint in KWA ([#2042](https://github.com/kubeflow/katib/pull/2042) by [@elenzio9](https://github.com/elenzio9))
- Dedicated yaml tab for Trials ([#2034](https://github.com/kubeflow/katib/pull/2034) by [@elenzio9](https://github.com/elenzio9))
- KWA: Use new Editor component (Monaco) ([#2023](https://github.com/kubeflow/katib/pull/2023) by [@orfeas-k](https://github.com/orfeas-k))
- kwa(build): Introduce COMMIT file for building KWA ([#2014](https://github.com/kubeflow/katib/pull/2014) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Fix 500 error after detail page refresh (#1967) ([#2001](https://github.com/kubeflow/katib/pull/2001) by [@elenzio9](https://github.com/elenzio9))
- Introduce KWA's frontend component for kfp links ([#1991](https://github.com/kubeflow/katib/pull/1991) by [@elenzio9](https://github.com/elenzio9))
- UI: Rename and right align the age column ([#1989](https://github.com/kubeflow/katib/pull/1989) by [@elenzio9](https://github.com/elenzio9))
- Show the trials table's status column first ([#1990](https://github.com/kubeflow/katib/pull/1990) by [@elenzio9](https://github.com/elenzio9))
- UI: Make KWA's main table responsive and add toolbar ([#1982](https://github.com/kubeflow/katib/pull/1982) by [@elenzio9](https://github.com/elenzio9))
- UI: Fix unit tests ([#1977](https://github.com/kubeflow/katib/pull/1977) by [@elenzio9](https://github.com/elenzio9))
- UI: Format code ([#1979](https://github.com/kubeflow/katib/pull/1979) by [@orfeas-k](https://github.com/orfeas-k))
- Recreate the Experiments Parallel Coordinates Graph ([#1974](https://github.com/kubeflow/katib/pull/1974) by [@elenzio9](https://github.com/elenzio9))
- Improve UI API/controller logging to ease troubleshooting ([#1966](https://github.com/kubeflow/katib/pull/1966) by [@lukeogg](https://github.com/lukeogg))
### SDK Improvements
- [SDK] Use Katib SDK for E2E Tests ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Use Katib Client without Kube Config ([#2098](https://github.com/kubeflow/katib/pull/2098) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix namespace parameter in tune API ([#1981](https://github.com/kubeflow/katib/pull/1981) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Remove Final Keyword from constants ([#1980](https://github.com/kubeflow/katib/pull/1980) by [@andreyvelich](https://github.com/andreyvelich))
## Bug fixes
- Fix Release Script for Updating SDK Version ([#2104](https://github.com/kubeflow/katib/pull/2104) by [@andreyvelich](https://github.com/andreyvelich))
- [Fix] add early stopped trials in converter ([#2004](https://github.com/kubeflow/katib/pull/2004) by [@shaowei-su](https://github.com/shaowei-su))
- [bugfix] Fix value passing bug in New Experiment form ([#2027](https://github.com/kubeflow/katib/pull/2027) by [@orfeas-k](https://github.com/orfeas-k))
- Fix main process retrieve logic for early stopping ([#1988](https://github.com/kubeflow/katib/pull/1988) by [@shaowei-su](https://github.com/shaowei-su))
- [hotfix]: filter by name of experiment ([#1920](https://github.com/kubeflow/katib/pull/1920) by [@anencore94](https://github.com/anencore94))
- Fix push script to include new images ([#1911](https://github.com/kubeflow/katib/pull/1911) by [@johnugeorge](https://github.com/johnugeorge))
- fix: only validate Kubernetes Job ([#2025](https://github.com/kubeflow/katib/pull/2025) by [@zhixian82](https://github.com/zhixian82))
- Upgrade grpc-health-probe version to fix some security issues ([#2093](https://github.com/kubeflow/katib/pull/2093) by [@tenzen-y](https://github.com/tenzen-y))
## Documentation
- Add CERN to adopters ([#2010](https://github.com/kubeflow/katib/pull/2010) by [@d-gol](https://github.com/d-gol))
- Add More Katib Presentations 2022 ([#2009](https://github.com/kubeflow/katib/pull/2009) by [@andreyvelich](https://github.com/andreyvelich))
- Add the documentation for simple-pbt ([#1978](https://github.com/kubeflow/katib/pull/1978) by [@tenzen-y](https://github.com/tenzen-y))
- Add the license to pbt ([#1958](https://github.com/kubeflow/katib/pull/1958) by [@tenzen-y](https://github.com/tenzen-y))
- Update the Katib version in docs ([#1950](https://github.com/kubeflow/katib/pull/1950) by [@tenzen-y](https://github.com/tenzen-y))
- Update CHANGELOG for v0.14.0 release ([#1932](https://github.com/kubeflow/katib/pull/1932) by [@johnugeorge](https://github.com/johnugeorge))
## Misc
- Update Training operator Image in CI ([#2103](https://github.com/kubeflow/katib/pull/2103) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Go libraries to resolve security issues ([#2094](https://github.com/kubeflow/katib/pull/2094) by [@tenzen-y](https://github.com/tenzen-y))
- Run e2e with various Python versions to verify Python SDK ([#2092](https://github.com/kubeflow/katib/pull/2092) by [@tenzen-y](https://github.com/tenzen-y))
- Add a --prefer-binary flag to 'pip install' command ([#2096](https://github.com/kubeflow/katib/pull/2096) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v1.13.0 ([#2082](https://github.com/kubeflow/katib/pull/2082) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade TensorFlow version ([#2079](https://github.com/kubeflow/katib/pull/2079) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.10 ([#2057](https://github.com/kubeflow/katib/pull/2057) by [@tenzen-y](https://github.com/tenzen-y))
- Pin the NumPy version with v1.23.5 in some images ([#2070](https://github.com/kubeflow/katib/pull/2070) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the actions-setup-minikube version to v2.7.2 ([#2064](https://github.com/kubeflow/katib/pull/2064) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Certificate Chain from Cert Generator ([#2045](https://github.com/kubeflow/katib/pull/2045) by [@andreyvelich](https://github.com/andreyvelich))
- Add resources to earlystopping container ([#2038](https://github.com/kubeflow/katib/pull/2038) by [@zhixian82](https://github.com/zhixian82))
- Add scripts to verify generated code and Go Modules ([#1999](https://github.com/kubeflow/katib/pull/1999) by [@tenzen-y](https://github.com/tenzen-y))
- [Test] Reduce Katib GitHub Action Runs ([#2036](https://github.com/kubeflow/katib/pull/2036) by [@andreyvelich](https://github.com/andreyvelich))
- gh-actions: Extend action to run Frontend Unit tests ([#1998](https://github.com/kubeflow/katib/pull/1998) by [@orfeas-k](https://github.com/orfeas-k))
- [chore] Upgrade docker/metadata-action, actions/checkout, and actions/setup-python version ([#1996](https://github.com/kubeflow/katib/pull/1996) by [@tenzen-y](https://github.com/tenzen-y))
- [chore] Upgrade Go version to v1.19 ([#1995](https://github.com/kubeflow/katib/pull/1995) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in simple-pbt image ([#1948](https://github.com/kubeflow/katib/pull/1948) by [@tenzen-y](https://github.com/tenzen-y))
- Support arm64 in darts-cnn-cifar10 image ([#1947](https://github.com/kubeflow/katib/pull/1947) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in enas-cnn-cifar10 image ([#1944](https://github.com/kubeflow/katib/pull/1944) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in pytorch-mnist image ([#1943](https://github.com/kubeflow/katib/pull/1943) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in mxnet-mnist image ([#1940](https://github.com/kubeflow/katib/pull/1940) by [@tenzen-y](https://github.com/tenzen-y))
- Use the katib-new-ui for Charmed gh-actions ([#1987](https://github.com/kubeflow/katib/pull/1987) by [@tenzen-y](https://github.com/tenzen-y))
- [feat] health check for katib-controller ([#1934](https://github.com/kubeflow/katib/pull/1934) by [@anencore94](https://github.com/anencore94))
- Upgrade Optuna from v2.x.x to v3.0.0 ([#1942](https://github.com/kubeflow/katib/pull/1942) by [@keisuke-umezawa](https://github.com/keisuke-umezawa))
- Add validation webhooks for maxFailedTrialCount and parallelTrialCount ([#1936](https://github.com/kubeflow/katib/pull/1936) by [@tenzen-y](https://github.com/tenzen-y))
- Introduce Automatic platform ARGs ([#1935](https://github.com/kubeflow/katib/pull/1935) by [@tenzen-y](https://github.com/tenzen-y))
- Update training operator image in CI ([#1933](https://github.com/kubeflow/katib/pull/1933) by [@johnugeorge](https://github.com/johnugeorge))
- Update Katib SDK version ([#1931](https://github.com/kubeflow/katib/pull/1931) by [@johnugeorge](https://github.com/johnugeorge))
- [chore] Upgrade Go version to v1.18 ([#1925](https://github.com/kubeflow/katib/pull/1925) by [@tenzen-y](https://github.com/tenzen-y))
- Add the pytorch-mnist with GPU support container image ([#1916](https://github.com/kubeflow/katib/pull/1916) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.14.0...v0.15.0-rc.0)
# [v0.14.0](https://github.com/kubeflow/katib/tree/v0.14.0) (2022-08-18)
## New Features
### Core Features
- Population based training ([#1833](https://github.com/kubeflow/katib/pull/1833) by [@a9p](https://github.com/a9p))
- Support JSON format logs in `file-metrics-collector` ([#1765](https://github.com/kubeflow/katib/pull/1765) by [@tenzen-y](https://github.com/tenzen-y))
@ -12,14 +766,12 @@
- Allow running examples on Apple Silicon M1 and fix image build errors for arm64 ([#1898](https://github.com/kubeflow/katib/pull/1898) by [@tenzen-y](https://github.com/tenzen-y))
- Configurable job name and service name for cert generator ([#1889](https://github.com/kubeflow/katib/pull/1889) by [@shaowei-su](https://github.com/shaowei-su))
## UI Features and Enhancements
### UI Features and Enhancements
- Add PBT to experiment creation form ([#1909](https://github.com/kubeflow/katib/pull/1909) by [@a9p](https://github.com/a9p))
- Distinct page for each Trial in the UI ([#1783](https://github.com/kubeflow/katib/pull/1783) by [@d-gol](https://github.com/d-gol))
# Bug fixes
## Bug fixes
- Add the pytorch-mnist with GPU support container image ([#1917](https://github.com/kubeflow/katib/pull/1917) by [@tenzen-y](https://github.com/tenzen-y))
- Fix push script to include new images ([#1912](https://github.com/kubeflow/katib/pull/1912) by [@johnugeorge](https://github.com/johnugeorge))
@ -28,19 +780,19 @@
- Reconcile trial assignments by comparing suggestion and trials being executed ([#1831](https://github.com/kubeflow/katib/pull/1831) by [@henrysecond1](https://github.com/henrysecond1))
- Increate the probes seconds in manifests ([#1845](https://github.com/kubeflow/katib/pull/1845) by [@haoxins](https://github.com/haoxins))
- Set upper constraint for Optuna ([#1852](https://github.com/kubeflow/katib/pull/1852) by [@himkt](https://github.com/himkt))
- Don't check if trial's metadata is in spec.parameters ([#1848](https://github.com/kubeflow/katib/pull/1848) by [@alexeygorobets](https://github.com/alexeygorobets))
# Documentation
## Documentation
- Fix the FPGA examples documentation ([#1841](https://github.com/kubeflow/katib/pull/1841) by [@eliaskoromilas](https://github.com/eliaskoromilas))
- Add CyberAgent to adopters ([#1894](https://github.com/kubeflow/katib/pull/1894) by [@tenzen-y](https://github.com/tenzen-y))
# Misc
## Misc
- Updating the training operator image in CI ([#1910](https://github.com/kubeflow/katib/pull/1910) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Python and Pytorch versions for some examples ([#1906](https://github.com/kubeflow/katib/pull/1906) by [@tenzen-y](https://github.com/tenzen-y))
- Linting for K8s YAML files ([#1901](https://github.com/kubeflow/katib/pull/1901) by [@Rishit-dagli](https://github.com/Rishit-dagli))
- Change integration test sysytem from KinD Cluster to Minikube Cluster ([#1899](https://github.com/kubeflow/katib/pull/1899) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade mysql version to v8.0.29 ([#1897](https://github.com/kubeflow/katib/pull/1897) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade tensorflow-aarch64 version to v2.9.1 ([#1891](https://github.com/kubeflow/katib/pull/1891) by [@tenzen-y](https://github.com/tenzen-y))
- chore: Upgrade Go libraries to resolve some security issues in the katib-controller ([#1888](https://github.com/kubeflow/katib/pull/1888) by [@tenzen-y](https://github.com/tenzen-y))
@ -61,12 +813,9 @@
- Add prometheus scraping and grafana support to charmed katib-controller operator ([#1839](https://github.com/kubeflow/katib/pull/1839) by [@jardon](https://github.com/jardon))
- Upgrade Black to fix linting ([#1842](https://github.com/kubeflow/katib/pull/1842) by [@jardon](https://github.com/jardon))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.13.0...v0.14.0).
# Change Log
Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...v0.14.0).
## [v0.13.0](https://github.com/kubeflow/katib/tree/v0.13.0) (2022-03-04)
# [v0.13.0](https://github.com/kubeflow/katib/tree/v0.13.0) (2022-03-04)
## New Features
@ -125,7 +874,6 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
- Fix default label for Training Operators ([#1813](https://github.com/kubeflow/katib/pull/1813) by [@andreyvelich](https://github.com/andreyvelich))
- Update supported Python version for Katib SDK ([#1798](https://github.com/kubeflow/katib/pull/1798) by [@tenzen-y](https://github.com/tenzen-y))
## Misc
- Use release tags for Trial images ([#1757](https://github.com/kubeflow/katib/pull/1757) by [@andreyvelich](https://github.com/andreyvelich))
@ -140,10 +888,9 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
- Add envtest to check `reconcileRBAC` ([#1678](https://github.com/kubeflow/katib/pull/1678) by [@tenzen-y](https://github.com/tenzen-y))
- Use golangci-lint as linter for Go ([#1671](https://github.com/kubeflow/katib/pull/1671) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0)
## [v0.13.0-rc.1](https://github.com/kubeflow/katib/tree/v0.13.0-rc.1) (2022-02-15)
# [v0.13.0-rc.1](https://github.com/kubeflow/katib/tree/v0.13.0-rc.1) (2022-02-15)
## Bug fixes
@ -152,7 +899,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.13.0-rc.0...v0.13.0-rc.1)
## [v0.13.0-rc.0](https://github.com/kubeflow/katib/tree/v0.13.0-rc.0) (2022-01-25)
# [v0.13.0-rc.0](https://github.com/kubeflow/katib/tree/v0.13.0-rc.0) (2022-01-25)
## New Features
@ -225,7 +972,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0-rc.0)
## [v0.12.0](https://github.com/kubeflow/katib/tree/v0.12.0) (2021-10-05)
# [v0.12.0](https://github.com/kubeflow/katib/tree/v0.12.0) (2021-10-05)
## New Features
@ -281,7 +1028,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0)
## [v0.12.0-rc.1](https://github.com/kubeflow/katib/tree/v0.12.0-rc.1) (2021-09-07)
# [v0.12.0-rc.1](https://github.com/kubeflow/katib/tree/v0.12.0-rc.1) (2021-09-07)
## Bug Fixes
@ -290,7 +1037,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0-rc.0...v0.12.0-rc.1)
## [v0.12.0-rc.0](https://github.com/kubeflow/katib/tree/v0.12.0-rc.0) (2021-08-19)
# [v0.12.0-rc.0](https://github.com/kubeflow/katib/tree/v0.12.0-rc.0) (2021-08-19)
## New Features
@ -344,7 +1091,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0-rc.0)
## [v0.11.1](https://github.com/kubeflow/katib/tree/v0.11.1) (2021-06-09)
# [v0.11.1](https://github.com/kubeflow/katib/tree/v0.11.1) (2021-06-09)
## Bug fixes
@ -358,7 +1105,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.0...v0.11.1)
## [v0.11.0](https://github.com/kubeflow/katib/tree/v0.11.0) (2021-03-22)
# [v0.11.0](https://github.com/kubeflow/katib/tree/v0.11.0) (2021-03-22)
## New Features
@ -415,7 +1162,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.1...v0.11.0)
## [v0.10.1](https://github.com/kubeflow/katib/tree/v0.10.1) (2021-03-02)
# [v0.10.1](https://github.com/kubeflow/katib/tree/v0.10.1) (2021-03-02)
## Features and Bug Fixes
@ -449,7 +1196,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.0...v0.10.1)
## [v0.10.0](https://github.com/kubeflow/katib/tree/v0.10.0) (2020-11-07)
# [v0.10.0](https://github.com/kubeflow/katib/tree/v0.10.0) (2020-11-07)
## New Features
@ -493,7 +1240,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.9.0...v0.10.0)
## [v0.9.0](https://github.com/kubeflow/katib/tree/v0.9.0) (2020-06-10)
# [v0.9.0](https://github.com/kubeflow/katib/tree/v0.9.0) (2020-06-10)
## Features and Bug Fixes
@ -750,7 +1497,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.6.0-rc.0...v0.9.0)
## [v0.6.0-rc.0](https://github.com/kubeflow/katib/tree/v0.6.0-rc.0) (2019-06-28)
# [v0.6.0-rc.0](https://github.com/kubeflow/katib/tree/v0.6.0-rc.0) (2019-06-28)
## Features and Bug Fixes
@ -1005,7 +1752,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/826657c14602a3f36263f3d6769451af0a75d18a...v0.6.0-rc.0)
## [0.2](https://github.com/kubeflow/katib/tree/0.2) (2018-08-20)
# [0.2](https://github.com/kubeflow/katib/tree/0.2) (2018-08-20)
## Features
@ -1032,7 +1779,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.2-alpha...826657c14602a3f36263f3d6769451af0a75d18a)
## [v0.1.2-alpha](https://github.com/kubeflow/katib/tree/v0.1.2-alpha) (2018-06-05)
# [v0.1.2-alpha](https://github.com/kubeflow/katib/tree/v0.1.2-alpha) (2018-06-05)
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.1-alpha...v0.1.2-alpha)
@ -1063,7 +1810,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
- Refine API [\#74](https://github.com/kubeflow/katib/pull/74) ([YujiOshima](https://github.com/YujiOshima))
- worker: Rename worker_interface to worker [\#70](https://github.com/kubeflow/katib/pull/70) ([gaocegege](https://github.com/gaocegege))
## [v0.1.1-alpha](https://github.com/kubeflow/katib/tree/v0.1.1-alpha) (2018-04-24)
# [v0.1.1-alpha](https://github.com/kubeflow/katib/tree/v0.1.1-alpha) (2018-04-24)
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.0-alpha...v0.1.1-alpha)
@ -1101,7 +1848,7 @@ Check the [Full Change Log](https://github.com/kubeflow/katib/compare/v0.13.0...
- New db log schema [\#35](https://github.com/kubeflow/katib/pull/35) ([YujiOshima](https://github.com/YujiOshima))
- Fix CI failures [\#27](https://github.com/kubeflow/katib/pull/27) ([gaocegege](https://github.com/gaocegege))
## [v0.1.0-alpha](https://github.com/kubeflow/katib/tree/v0.1.0-alpha) (2018-04-10)
# [v0.1.0-alpha](https://github.com/kubeflow/katib/tree/v0.1.0-alpha) (2018-04-10)
**Closed issues:**

43
CITATION.cff Normal file

@ -0,0 +1,43 @@
cff-version: 1.2.0
message: "If you use Katib in your scientific publication, please cite it as below."
authors:
- family-names: "George"
given-names: "Johnu"
- family-names: "Gao"
given-names: "Ce"
- family-names: "Liu"
given-names: "Richard"
- family-names: "Liu"
given-names: "Hou Gang"
- family-names: "Tang"
given-names: "Yuan"
- family-names: "Pydipaty"
given-names: "Ramdoot"
- family-names: "Saha"
given-names: "Amit Kumar"
title: "Katib"
type: software
repository-code: "https://github.com/kubeflow/katib"
preferred-citation:
type: misc
title: "A Scalable and Cloud-Native Hyperparameter Tuning System"
authors:
- family-names: "George"
given-names: "Johnu"
- family-names: "Gao"
given-names: "Ce"
- family-names: "Liu"
given-names: "Richard"
- family-names: "Liu"
given-names: "Hou Gang"
- family-names: "Tang"
given-names: "Yuan"
- family-names: "Pydipaty"
given-names: "Ramdoot"
- family-names: "Saha"
given-names: "Amit Kumar"
year: 2020
url: "https://arxiv.org/abs/2006.02085"
identifiers:
- type: "other"
value: "arXiv:2006.02085"

167
CONTRIBUTING.md Normal file

@ -0,0 +1,167 @@
# Developer Guide
This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following guides:
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- [How to configure Katib Experiment](https://kubeflow.org/docs/components/katib/experiment/).
- [Katib architecture and concepts](https://www.kubeflow.org/docs/components/katib/reference/architecture/)
for hyperparameter tuning and neural architecture search.
## Requirements
- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/) (24.0 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.11 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
- [pre-commit](https://pre-commit.com/)
## Build from source code
**Note:** To build multi-arch images, Docker Desktop must have the
[containerd image store](https://docs.docker.com/desktop/containerd/#enable-the-containerd-image-store)
enabled. Build the source code as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/katib-config.yaml).
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
```bash
make deploy
```
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
```bash
make undeploy
```
## Technical and style guide
The following guidelines apply primarily to Katib,
but other projects like [Training Operator](https://github.com/kubeflow/training-operator) might also adhere to them.
## Go Development
When coding:
- Follow [effective go](https://go.dev/doc/effective_go) guidelines.
- Run [`make check`](https://github.com/kubeflow/katib/blob/46173463027e4fd2e604e25d7075b2b31a702049/Makefile#L31) locally
to verify that your changes follow best practices before submitting PRs.
Testing:
- Use [`cmp.Diff`](https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff) instead of `reflect.DeepEqual` to provide useful comparisons.
- Define test cases as maps instead of slices to avoid dependencies on the running order.
Map key should be equal to the test case name.
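The map-based layout recommended above can be sketched as follows. This is a minimal, dependency-free illustration and not code from the Katib repository: `clamp` and `failedCases` are hypothetical names, and the loop compares plain integers with `!=`. In a real `_test.go` file the loop body would typically call `t.Run(name, ...)`, and structured values would be compared with `cmp.Diff`.

```go
package main

import "fmt"

// clamp is a hypothetical function under test: it limits v to the range [lo, hi].
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// failedCases runs every test case and returns a message for each failure.
// The cases live in a map keyed by the test case name, so Go's randomized
// map iteration order exposes any accidental dependency between cases.
func failedCases() []string {
	cases := map[string]struct {
		v, lo, hi int
		want      int
	}{
		"value below the range is raised to the lower bound": {v: -1, lo: 0, hi: 3, want: 0},
		"value inside the range is unchanged":                {v: 2, lo: 0, hi: 3, want: 2},
		"value above the range is lowered to the upper bound": {v: 9, lo: 0, hi: 3, want: 3},
	}

	var failed []string
	for name, tc := range cases {
		if got := clamp(tc.v, tc.lo, tc.hi); got != tc.want {
			failed = append(failed, fmt.Sprintf("%s: got %d, want %d", name, got, tc.want))
		}
	}
	return failed
}

func main() {
	failed := failedCases()
	for _, msg := range failed {
		fmt.Println("FAIL:", msg)
	}
	if len(failed) == 0 {
		fmt.Println("all cases passed")
	}
}
```

Because the map key doubles as the case name, a failure message identifies the case without a separate `name` field.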
## Modify controller APIs
If you want to modify Katib controller APIs, you have to
generate deepcopy, clientset, listers, informers, open-api and Python SDK with the changed APIs.
You can update the necessary files as follows:
```bash
make generate
```
## Controller Flags
Below is a list of command-line flags accepted by Katib controller:
| Name | Type | Default | Description |
| ------------ | ------ | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
| katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
## DB Manager Flags
Below is a list of command-line flags accepted by Katib DB Manager:
| Name | Type | Default | Description |
| --------------- | ------------- | -------------| ------------------------------------------------------------------- |
| connect-timeout | time.Duration | 60s | Time to wait for a database connection before returning an error |
| listen-address | string | 0.0.0.0:6789 | The network interface or IP address to receive incoming connections |
## Katib admission webhooks
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
1. `validator.experiment.katib.kubeflow.org` -
[Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
to validate the Katib Experiment before the creation.
1. `defaulter.experiment.katib.kubeflow.org` -
[Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
in the Katib Experiment before the creation.
1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
collector sidecar container to the training pod. Learn more about Katib's
metrics collector in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the control
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib Controller has the internal `cert-generator` to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.
Once the `cert-generator` has finished, the Katib controller starts registering controllers, such as the `experiment-controller`, with the manager.
You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
**Note:** Katib also supports [cert-manager](https://cert-manager.io/) for generating certificates for the admission webhooks instead of the `cert-generator`.
You can find the installation with the cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](../pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).
## Code Style
### pre-commit
Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
pre-commit`) and run `pre-commit install` from the root of the repository at
least once before creating git commits.
The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all-files`, in which case,
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
from the hooks.

32
Dockerfile.conformance Normal file

@ -0,0 +1,32 @@
# Copyright 2023 The Kubeflow Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Dockerfile for building the source code of conformance tests
FROM python:3.10-slim
WORKDIR /kubeflow/katib
COPY sdk/ /kubeflow/katib/sdk/
COPY examples/ /kubeflow/katib/examples/
COPY test/ /kubeflow/katib/test/
COPY pkg/ /kubeflow/katib/pkg/
COPY conformance/run.sh .
# Add test script.
RUN chmod +x run.sh
RUN pip install --prefer-binary -e sdk/python/v1beta1
ENTRYPOINT [ "./run.sh" ]


@ -5,25 +5,23 @@ HAS_SETUP_ENVTEST := $(shell command -v setup-envtest;)
HAS_MOCKGEN := $(shell command -v mockgen;)
COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD)
KATIB_REGISTRY := docker.io/kubeflowkatib
CPU_ARCH ?= amd64
ENVTEST_K8S_VERSION ?= 1.25
MOCKGEN_VERSION ?= $(shell grep 'github.com/golang/mock' go.mod | cut -d ' ' -f 2)
KATIB_REGISTRY := ghcr.io/kubeflow/katib
CPU_ARCH ?= linux/amd64,linux/arm64
ENVTEST_K8S_VERSION ?= 1.31
MOCKGEN_VERSION ?= $(shell grep 'go.uber.org/mock' go.mod | cut -d ' ' -f 2)
GO_VERSION=$(shell grep '^go' go.mod | cut -d ' ' -f 2)
GOPATH ?= $(shell go env GOPATH)
# for pytest
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/apis/manager/v1beta1/python:$(CURDIR)/pkg/apis/manager/health/python
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/metricscollector/v1beta1/common:$(CURDIR)/pkg/metricscollector/v1beta1/tfevent-metricscollector
TEST_TENSORFLOW_EVENT_FILE_PATH ?= $(CURDIR)/test/unit/v1beta1/metricscollector/testdata/tfevent-metricscollector/logs
# Run tests
.PHONY: test
test: envtest
KUBEBUILDER_ASSETS="$(shell setup-envtest --arch=amd64 use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out
KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out
envtest:
ifndef HAS_SETUP_ENVTEST
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@2c3a6fa2996c026b284c7fe2b055274cd9a556bc #v0.13.0
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@release-0.19
$(info "setup-envtest has been installed")
endif
$(info "setup-envtest has already installed")
@ -35,7 +33,7 @@ fmt:
lint:
ifndef HAS_LINT
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.50.1
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.64.7
$(info "golangci-lint has been installed")
endif
hack/verify-golangci-lint.sh
@ -81,24 +79,29 @@ endif
sync-go-mod:
go mod tidy -go $(GO_VERSION)
.PHONY: go-mod-download
go-mod-download:
go mod download
CONTROLLER_GEN = $(shell pwd)/bin/controller-gen
.PHONY: controller-gen
controller-gen:
@GOBIN=$(shell pwd)/bin GO111MODULE=on go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.16.5
# Run this if you update any existing controller APIs.
# 1. Generate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh)
# 2. Generate open-api for the APIs (hack/update-openapigen)
# 3. Generate Python SDK for Katib (hack/gen-python-sdk/gen-sdk.sh)
# 4. Generate gRPC manager APIs (pkg/apis/manager/v1beta1/build.sh and pkg/apis/manager/health/build.sh)
# 5. Generate Go mock codes
generate:
ifndef GOPATH
$(error GOPATH not defined, please define GOPATH. Run "go help gopath" to learn more about GOPATH)
endif
generate: go-mod-download controller-gen
ifndef HAS_MOCKGEN
go install github.com/golang/mock/mockgen@$(MOCKGEN_VERSION)
go install go.uber.org/mock/mockgen@$(MOCKGEN_VERSION)
$(info "mockgen has been installed")
endif
go generate ./pkg/... ./cmd/...
hack/gen-python-sdk/gen-sdk.sh
pkg/apis/manager/v1beta1/build.sh
pkg/apis/manager/health/build.sh
hack/update-proto.sh
hack/update-mockgen.sh
# Build images for the Katib v1beta1 components.
@ -116,14 +119,12 @@ push-latest: generate
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
# Build and push Katib images for the given tag.
push-tag: generate
push-tag:
ifeq ($(TAG),)
$(error TAG must be set. Usage: make push-tag TAG=<release-tag>)
endif
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG) $(CPU_ARCH)
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT) $(CPU_ARCH)
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG)
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
# Release a new version of Katib.
release:
@ -143,7 +144,7 @@ endif
# Prettier UI format check for Katib v1beta1.
prettier-check:
npm run format:check --prefix pkg/new-ui/v1beta1/frontend
npm run format:check --prefix pkg/ui/v1beta1/frontend
# Update boilerplate for the source code.
update-boilerplate:
@ -152,7 +153,6 @@ update-boilerplate:
prepare-pytest:
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/hyperopt/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/skopt/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/optuna/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/hyperband/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/nas/enas/v1beta1/requirements.txt
@ -160,13 +160,34 @@ prepare-pytest:
pip install --prefer-binary -r cmd/suggestion/pbt/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/earlystopping/medianstop/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt
# `TypeIs` was introduced in typing-extensions 4.10.0, and torch 2.6.0 requires typing-extensions>=4.10.0.
# REF: https://github.com/kubeflow/katib/pull/2504
# TODO (tenzen-y): Once we upgrade libraries depended on typing-extensions==4.5.0, we can remove this line.
pip install typing-extensions==4.10.0
prepare-pytest-testdata:
ifeq ("$(wildcard $(TEST_TENSORFLOW_EVENT_FILE_PATH))", "")
python examples/v1beta1/trial-images/tf-mnist-with-summaries/mnist.py --epochs 5 --batch-size 200 --log-path $(TEST_TENSORFLOW_EVENT_FILE_PATH)
endif
# TODO(Electronic-Waste): Remove the import rewrite when protobuf supports `python_package` option.
# REF: https://github.com/protocolbuffers/protobuf/issues/7061
pytest: prepare-pytest prepare-pytest-testdata
PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/suggestion
PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/earlystopping
PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/metricscollector
pytest ./test/unit/v1beta1/suggestion --ignore=./test/unit/v1beta1/suggestion/test_skopt_service.py
pytest ./test/unit/v1beta1/earlystopping
pytest ./test/unit/v1beta1/metricscollector
cp ./pkg/apis/manager/v1beta1/python/api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py
cp ./pkg/apis/manager/v1beta1/python/api_pb2_grpc.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
sed -i "s/api_pb2/kubeflow\.katib\.katib_api_pb2/g" ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
pytest ./sdk/python/v1beta1/kubeflow/katib
rm ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
# The skopt service doesn't work appropriately with Python 3.11.
# So, we need to run the test with Python 3.9.
# TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
# REF: https://github.com/kubeflow/katib/issues/2280
pytest-skopt:
pip install six
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/skopt/v1beta1/requirements.txt
pytest ./test/unit/v1beta1/suggestion/test_skopt_service.py

4
OWNERS

@ -1,8 +1,10 @@
approvers:
- andreyvelich
- gaocegege
- tenzen-y
- johnugeorge
reviewers:
- anencore94
- c-bata
- Electronic-Waste
emeritus_approvers:
- tenzen-y

164
README.md

@ -1,15 +1,18 @@
<h1 align="center">
<img src="./docs/images/logo-title.png" alt="logo" width="200">
<br>
</h1>
# Kubeflow Katib
[![Build Status](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml/badge.svg?branch=master)](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib)
[![Releases](https://img.shields.io/github/release-pre/kubeflow/katib.svg?sort=semver)](https://github.com/kubeflow/katib/releases)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://kubeflow.slack.com/archives/C018PMV53NW)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/9941/badge)](https://www.bestpractices.dev/projects/9941)
Katib is a Kubernetes-native project for automated machine learning (AutoML).
<h1 align="center">
<img src="./docs/images/logo-title.png" alt="logo" width="200">
<br>
</h1>
Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML).
Katib supports
[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization),
[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and
@ -18,8 +21,7 @@ Katib supports
Katib is agnostic to machine learning (ML) frameworks.
It can tune hyperparameters of applications written in any language of the
user's choice and natively supports many ML frameworks, such as
[TensorFlow](https://www.tensorflow.org/), [Apache MXNet](https://mxnet.apache.org/),
[PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
[TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
Katib can perform training jobs using any Kubernetes
[Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/)
@ -29,13 +31,13 @@ and many more.
Katib stands for `secretary` in Arabic.
# Search Algorithms
## Search Algorithms
Katib supports several search algorithms. Follow the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms)
to know more about each algorithm and check the
[Suggestion service guide](/docs/new-algorithm-service.md) to implement your
custom algorithm.
[this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib)
to implement your custom algorithm.
<table>
<tbody>
@ -137,142 +139,68 @@ custom algorithm.
</tbody>
</table>
To perform above algorithms Katib supports the following frameworks:
To perform the above algorithms Katib supports the following frameworks:
- [Goptuna](https://github.com/c-bata/goptuna)
- [Hyperopt](https://github.com/hyperopt/hyperopt)
- [Optuna](https://github.com/optuna/optuna)
- [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize)
# Installation
For the various Katib installs check the
[Kubeflow guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-setup).
Follow the next steps to install Katib standalone.
## Prerequisites
This is the minimal requirements to install Katib:
Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites)
for prerequisites to install Katib.
- Kubernetes >= 1.23
- `kubectl` >= 1.23
## Installation
## Latest Version
Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib)
for the detailed instructions on how to install Katib.
For the latest Katib version run this command:
### Installing the Control Plane
Run the following command to install the latest stable release of Katib control plane:
```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"
```
Run the following command to install the latest changes of Katib control plane:
```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
```
## Release Version
For the specific Katib release (for example `v0.14.0`) run this command:
```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.14.0"
```
Make sure that all Katib components are running:
```
$ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
katib-cert-generator-rw95w 0/1 Completed 0 35s
katib-controller-566595bdd8-hbxgf 1/1 Running 0 36s
katib-db-manager-57cd769cdb-4g99m 1/1 Running 0 36s
katib-mysql-7894994f88-5d4s5 1/1 Running 0 36s
katib-ui-5767cfccdc-pwg2x 1/1 Running 0 36s
```
For the Katib Experiments check the [complete examples list](./examples/v1beta1).
# Quickstart
### Installing the Python SDK
You can run your first HyperParameter Tuning Experiment using [Katib Python SDK](./sdk/python/v1beta1).
Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify creation of
hyperparameter tuning jobs for Data Scientists.
In the following example we are going to maximize a simple objective function:
$F(a,b) = 4a - b^2$. The larger $a$ and the smaller $b$, the larger the value of $F$.
Run the following command to install the latest stable release of Katib SDK:
```python
import kubeflow.katib as katib
# Step 1. Create an objective function.
def objective(parameters):
# Import required packages.
import time
time.sleep(5)
# Calculate objective function.
result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
# Katib parses metrics in this format: <metric-name>=<metric-value>.
print(f"result={result}")
# Step 2. Create HyperParameter search space.
parameters = {
"a": katib.search.int(min=10, max=20),
"b": katib.search.double(min=0.1, max=0.2)
}
# Step 3. Create Katib Experiment.
katib_client = katib.KatibClient()
name = "tune-experiment"
katib_client.tune(
name=name,
objective=objective,
parameters=parameters,
objective_metric_name="result",
max_trial_count=12
)
# Step 4. Get the best HyperParameters.
print(katib_client.get_optimal_hyperparameters(name))
```sh
pip install -U kubeflow-katib
```
# Documentation
## Getting Started
- Check
[the Katib getting started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-search-algorithm).
Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk)
to quickly create your first hyperparameter tuning Experiment using the Python SDK.
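Katib's Python SDK examples report metrics by printing lines of the form `<metric-name>=<metric-value>`. The snippet below is only a minimal sketch of that convention, not Katib's actual metrics collector: `parse_metric_line` is a hypothetical helper, and the objective reuses the $F(a,b) = 4a - b^2$ function from the earlier quickstart.

```python
def parse_metric_line(line):
    """Parse one `<metric-name>=<metric-value>` line.

    Returns a (name, value) tuple, or None if the line does not match.
    Hypothetical helper for illustration; not part of the Katib SDK.
    """
    name, sep, raw_value = line.strip().partition("=")
    if not sep or not name:
        return None
    try:
        return name.strip(), float(raw_value)
    except ValueError:
        # The right-hand side is not a number, so this is not a metric line.
        return None


def objective(parameters):
    # Same objective as the quickstart: F(a, b) = 4a - b^2.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Emit the metric in the <metric-name>=<metric-value> format.
    return f"result={result}"


line = objective({"a": "10", "b": "0.5"})
print(line)
print(parse_metric_line(line))
```

Lines that do not contain a numeric value after `=` are ignored by this sketch, so ordinary log output can be mixed with metric lines.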
- Learn about Katib **Concepts** in this
[guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-concepts).
## Community
- Learn about Katib **Interfaces** in this
[guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-interfaces).
The following links provide information on how to get involved in the community:
- Learn about Katib **Components** in this
[guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).
- Know more about Katib in the [presentations and demos list](./docs/presentations.md).
# Community
We are always growing our community and invite new users and AutoML enthusiasts
to contribute to the Katib project. The following links provide information
about getting involved in the community:
- Subscribe to the
[AutoML calendar](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)
to attend Working Group bi-weekly community meetings.
- Check the
[AutoML and Training Working Group meeting notes](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit).
- If you use Katib, please update [the adopters list](ADOPTERS.md).
- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV)
community meeting.
- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
Slack channel.
- Check out [who is using Katib](ADOPTERS.md) and [presentations about Katib project](docs/presentations.md).
## Contributing
Please feel free to test the system! [Developer guide](./docs/developer-guide.md)
is a good starting point for our developers.
## Blog posts
- [Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML](https://blog.kubeflow.org/katib/)
(by Andrey Velichkevich)
## Events
- [AutoML and Training WG Summit. 16th of July 2021](https://docs.google.com/document/d/1vGluSPHmAqEr8k9Dmm82RcQ-MVnqbYYSfnjMGB-aPuo/edit?usp=sharing)
Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
## Citation

@ -1,3 +1,45 @@
# Katib 2022/2023 Roadmap
## AutoML Features
- Support advanced HyperParameter tuning algorithms:
- Population Based Training (PBT) - [#1382](https://github.com/kubeflow/katib/issues/1382)
- Tree of Parzen Estimators (TPE)
- Multivariate TPE
- Sobol's Quasirandom Sequence
- Asynchronous Successive Halving - [ASHA](https://arxiv.org/pdf/1810.05934.pdf)
- Support multi-objective optimization - [#1549](https://github.com/kubeflow/katib/issues/1549)
- Support various HP distributions (log-uniform, uniform, normal) - [#1207](https://github.com/kubeflow/katib/issues/1207)
- Support Auto Model Compression - [#460](https://github.com/kubeflow/katib/issues/460)
- Support Auto Feature Engineering - [#475](https://github.com/kubeflow/katib/issues/475)
- Improve Neural Architecture Search design
## Backend and API Enhancements
- Conformance tests for Katib - [#2044](https://github.com/kubeflow/katib/issues/2044)
- Support push-based metrics collection in Katib - [#577](https://github.com/kubeflow/katib/issues/577)
- Support PostgreSQL as a Katib DB - [#915](https://github.com/kubeflow/katib/issues/915)
- Improve Katib scalability - [#1847](https://github.com/kubeflow/katib/issues/1847)
- Promote Katib APIs to the `v1` version
- Support multiple CRD versions (`v1beta1`, `v1`) with conversion webhook
## Improve Katib User Experience
- Simplify Katib Experiment creation with Katib SDK - [#1951](https://github.com/kubeflow/katib/pull/1951)
- Fully migrate to a new Katib UI - [Project 1](https://github.com/kubeflow/katib/projects/1)
- Expose Trial logs in Katib UI - [#971](https://github.com/kubeflow/katib/issues/971)
- Enhance Katib UI visualization metrics for AutoML Experiments
- Improve Katib Config UX - [#2150](https://github.com/kubeflow/katib/issues/2150)
## Integration with Kubeflow Components
- Kubeflow Pipeline as a Katib Trial target - [#1914](https://github.com/kubeflow/katib/issues/1914)
- Improve data passing when Katib Experiment is part of Kubeflow Pipeline - [#1846](https://github.com/kubeflow/katib/issues/1846)
# History
# Katib 2021 Roadmap
## New Features
@ -24,8 +66,6 @@
- Support multiple CRD version with conversion webhook
- MLMD integration with Katib Experiments
# History
# Katib 2020 Roadmap
## New Features

SECURITY.md

@ -0,0 +1,64 @@
# Security Policy
## Supported Versions
Kubeflow Katib versions are expressed as `vX.Y.Z`, where X is the major version,
Y is the minor version, and Z is the patch version, following the
[Semantic Versioning](https://semver.org/) terminology.
The Kubeflow Katib project maintains release branches for the most recent two minor releases.
Applicable fixes, including security fixes, may be backported to those two release branches,
depending on severity and feasibility.
Users are encouraged to stay updated with the latest releases to benefit from security patches and
improvements.
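The "most recent two minor releases" rule can be illustrated with a short sketch. This is not tooling from the Katib repository: the helper names and the tag list below are hypothetical, and only plain `vX.Y.Z` tags are considered (pre-release tags are skipped).

```python
import re

# Matches plain vX.Y.Z tags only; pre-release suffixes are rejected.
VERSION_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def supported_minor_lines(tags, keep=2):
    """Return the `keep` most recent (major, minor) pairs found in `tags`."""
    minors = set()
    for tag in tags:
        match = VERSION_PATTERN.match(tag)
        if match:
            major, minor, _patch = (int(part) for part in match.groups())
            minors.add((major, minor))
    return sorted(minors, reverse=True)[:keep]


def is_supported(version, tags, keep=2):
    """Check whether `version` belongs to one of the supported minor lines."""
    match = VERSION_PATTERN.match(version)
    if not match:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) in supported_minor_lines(tags, keep)


# Hypothetical tag list, used only for illustration.
example_tags = ["v0.15.0", "v0.16.0", "v0.16.1", "v0.17.0", "v0.17.0-rc.0"]
print(supported_minor_lines(example_tags))
print(is_supported("v0.16.1", example_tags))
print(is_supported("v0.15.0", example_tags))
```

Whether a given fix is actually backported still depends on severity and feasibility, as stated above; the sketch only identifies which release lines are candidates.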
## Reporting a Vulnerability
We're extremely grateful to the security researchers and users who report vulnerabilities to the
Kubeflow Open Source Community. All reports are thoroughly investigated by the Kubeflow project owners.
You can use the following ways to report security vulnerabilities privately:
- Using the Kubeflow Katib repository [GitHub Security Advisory](https://github.com/kubeflow/katib/security/advisories/new).
- Using our private Kubeflow Steering Committee mailing list: ksc@kubeflow.org.
Please provide detailed information to help us understand and address the issue promptly.
## Disclosure Process
**Acknowledgment**: We will acknowledge receipt of your report within 10 business days.
**Assessment**: The Kubeflow project owners will investigate the reported issue to determine its
validity and severity.
**Resolution**: If the issue is confirmed, we will work on a fix and prepare a release.
**Notification**: Once a fix is available, we will notify the reporter and coordinate a public
disclosure.
**Public Disclosure**: Details of the vulnerability and the fix will be published in the project's
release notes and communicated through appropriate channels.
## Prevention Mechanisms
Kubeflow Katib employs several measures to prevent security issues:
**Code Reviews**: All code changes are reviewed by maintainers to ensure code quality and security.
**Dependency Management**: Regular updates and monitoring of dependencies (e.g. Dependabot) to
address known vulnerabilities.
**Continuous Integration**: Automated testing and security checks are integrated into the CI/CD pipeline.
**Image Scanning**: Container images are scanned for vulnerabilities.
## Communication Channels
For general questions, please use the following resources:
- Kubeflow [Slack channels](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels).
- Kubeflow discuss [mailing list](https://www.kubeflow.org/docs/about/community/#kubeflow-mailing-list).
Please **do not report** security vulnerabilities through public channels.

@ -1,25 +0,0 @@
# Build the Katib Cert Generator.
FROM golang:alpine AS build-env
ARG TARGETARCH
WORKDIR /go/src/github.com/kubeflow/katib
# Download packages.
COPY go.mod .
COPY go.sum .
RUN go mod download -x
# Copy sources.
COPY cmd/ cmd/
COPY pkg/ pkg/
# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1
# Copy the cert-generator into a thin image.
FROM gcr.io/distroless/static:nonroot
WORKDIR /app
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-cert-generator /app/
USER 65532:65532
ENTRYPOINT ["./katib-cert-generator"]

@ -1,42 +0,0 @@
/*
Copyright 2022 The Kubeflow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package main
import (
"github.com/kubeflow/katib/pkg/cert-generator/v1beta1"
"k8s.io/client-go/kubernetes/scheme"
"k8s.io/klog"
"os"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/config"
)
func main() {
kubeClient, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme.Scheme})
if err != nil {
klog.Fatalf("Failed to create kube client: %v", err)
}
cmd, err := v1beta1.NewKatibCertGeneratorCmd(kubeClient)
if err != nil {
klog.Fatalf("Failed to generate cert: %v", err)
}
if err = cmd.Execute(); err != nil {
os.Exit(1)
}
}

@ -2,7 +2,6 @@
FROM golang:alpine AS build-env
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
WORKDIR /go/src/github.com/kubeflow/katib
@ -18,13 +17,8 @@ COPY pkg/ pkg/
# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH="${TARGETARCH}" go build -a -o katib-db-manager ./cmd/db-manager/v1beta1
# Add GRPC health probe.
RUN wget -qO /bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
# Copy the db-manager into a thin image.
FROM alpine:3.15
WORKDIR /app
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-db-manager /app/
ENTRYPOINT ["./katib-db-manager"]

@ -28,14 +28,14 @@ import (
api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
db "github.com/kubeflow/katib/pkg/db/v1beta1"
"github.com/kubeflow/katib/pkg/db/v1beta1/common"
"k8s.io/klog"
"k8s.io/klog/v2"
"google.golang.org/grpc"
"google.golang.org/grpc/reflection"
)
const (
port = "0.0.0.0:6789"
defaultListenAddress = "0.0.0.0:6789"
defaultConnectTimeout = time.Second * 60
)
@ -90,7 +90,9 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
func main() {
var connectTimeout time.Duration
var listenAddress string
flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
flag.Parse()
var err error
@ -104,13 +106,13 @@ func main() {
klog.Fatalf("Failed to open db connection: %v", err)
}
dbIf.DBInit()
listener, err := net.Listen("tcp", port)
listener, err := net.Listen("tcp", listenAddress)
if err != nil {
klog.Fatalf("Failed to listen: %v", err)
}
size := 1<<31 - 1
klog.Infof("Start Katib manager: %s", port)
klog.Infof("Start Katib manager: %s", listenAddress)
s := grpc.NewServer(grpc.MaxRecvMsgSize(size), grpc.MaxSendMsgSize(size))
api_pb.RegisterDBManagerServer(s, &server{})
health_pb.RegisterHealthServer(s, &server{})

@ -20,7 +20,7 @@ import (
"context"
"testing"
"github.com/golang/mock/gomock"
"go.uber.org/mock/gomock"
health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"

@ -1,4 +1,4 @@
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib

@ -12,12 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
import logging
import time
from concurrent import futures
import grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.earlystopping.v1beta1.medianstop.service import MedianStopService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6788"

@ -1,5 +1,5 @@
grpcio==1.41.1
protobuf==3.19.5
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.6.0
kubernetes==22.6.0
cython>=0.29.24

@ -24,6 +24,7 @@ import (
"os"
"github.com/spf13/viper"
"k8s.io/apimachinery/pkg/runtime"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
"sigs.k8s.io/controller-runtime/pkg/client/config"
"sigs.k8s.io/controller-runtime/pkg/healthz"
@ -31,62 +32,67 @@ import (
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/manager/signals"
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
configv1beta1 "github.com/kubeflow/katib/pkg/apis/config/v1beta1"
apis "github.com/kubeflow/katib/pkg/apis/controller"
controller "github.com/kubeflow/katib/pkg/controller.v1beta1"
cert "github.com/kubeflow/katib/pkg/certgenerator/v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
trialutil "github.com/kubeflow/katib/pkg/controller.v1beta1/trial/util"
webhook "github.com/kubeflow/katib/pkg/webhook/v1beta1"
"github.com/kubeflow/katib/pkg/util/v1beta1/katibconfig"
webhookv1beta1 "github.com/kubeflow/katib/pkg/webhook/v1beta1"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
)
var (
scheme = runtime.NewScheme()
log = logf.Log.WithName("entrypoint")
)
func init() {
utilruntime.Must(apis.AddToScheme(scheme))
utilruntime.Must(configv1beta1.AddToScheme(scheme))
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
}
func main() {
logf.SetLogger(zap.New())
log := logf.Log.WithName("entrypoint")
var experimentSuggestionName string
var metricsAddr string
var healthzAddr string
var webhookPort int
var injectSecurityContext bool
var enableGRPCProbeInSuggestion bool
var trialResources trialutil.GvkListFlag
var enableLeaderElection bool
var leaderElectionID string
flag.StringVar(&experimentSuggestionName, "experiment-suggestion-name",
"default", "The implementation of suggestion interface in experiment controller (default)")
flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.StringVar(&healthzAddr, "healthz-addr", ":18080", "The address the healthz endpoint binds to.")
flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
// For leader election
flag.BoolVar(&enableLeaderElection, "enable-leader-election", false, "Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller.")
flag.StringVar(&leaderElectionID, "leader-election-id", "3fbc96e9.katib.kubeflow.org", "The ID for leader election.")
// TODO (andreyvelich): Currently it is not possible to set different webhook service name.
// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
// TODO (andreyvelich): Currently it is not possible to store webhook cert in the local file system.
// flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")
var katibConfigFile string
flag.StringVar(&katibConfigFile, "katib-config", "",
"The katib-controller will load its initial configuration from this file. "+
"Omit this flag to use the default configuration values. ")
flag.Parse()
initConfig, err := katibconfig.GetInitConfigData(scheme, katibConfigFile)
if err != nil {
log.Error(err, "Failed to get KatibConfig")
os.Exit(1)
}
// Set the config in viper.
viper.Set(consts.ConfigExperimentSuggestionName, experimentSuggestionName)
viper.Set(consts.ConfigInjectSecurityContext, injectSecurityContext)
viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, enableGRPCProbeInSuggestion)
viper.Set(consts.ConfigTrialResources, trialResources)
viper.Set(consts.ConfigExperimentSuggestionName, initConfig.ControllerConfig.ExperimentSuggestionName)
viper.Set(consts.ConfigInjectSecurityContext, initConfig.ControllerConfig.InjectSecurityContext)
viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, initConfig.ControllerConfig.EnableGRPCProbeInSuggestion)
trialGVKs, err := katibconfig.TrialResourcesToGVKs(initConfig.ControllerConfig.TrialResources)
if err != nil {
log.Error(err, "Failed to parse trialResources")
os.Exit(1)
}
viper.Set(consts.ConfigTrialResources, trialGVKs)
log.Info("Config:",
consts.ConfigExperimentSuggestionName,
viper.GetString(consts.ConfigExperimentSuggestionName),
"webhook-port",
webhookPort,
initConfig.ControllerConfig.WebhookPort,
"metrics-addr",
metricsAddr,
initConfig.ControllerConfig.MetricsAddr,
"healthz-addr",
healthzAddr,
initConfig.ControllerConfig.HealthzAddr,
consts.ConfigInjectSecurityContext,
viper.GetBool(consts.ConfigInjectSecurityContext),
consts.ConfigEnableGRPCProbeInSuggestion,
@ -104,10 +110,13 @@ func main() {
// Create a new katib controller to provide shared dependencies and start components
mgr, err := manager.New(cfg, manager.Options{
MetricsBindAddress: metricsAddr,
HealthProbeBindAddress: healthzAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: leaderElectionID,
Metrics: metricsserver.Options{
BindAddress: initConfig.ControllerConfig.MetricsAddr,
},
HealthProbeBindAddress: initConfig.ControllerConfig.HealthzAddr,
LeaderElection: initConfig.ControllerConfig.EnableLeaderElection,
LeaderElectionID: initConfig.ControllerConfig.LeaderElectionID,
Scheme: scheme,
})
if err != nil {
log.Error(err, "Failed to create the manager")
@ -116,11 +125,50 @@ func main() {
log.Info("Registering Components.")
// Setup Scheme for all resources
if err := apis.AddToScheme(mgr.GetScheme()); err != nil {
log.Error(err, "Unable to add APIs to scheme")
// Create a webhook server.
hookServer := webhook.NewServer(webhook.Options{
Port: *initConfig.ControllerConfig.WebhookPort,
CertDir: consts.CertDir,
})
ctx := signals.SetupSignalHandler()
certsReady := make(chan struct{})
defer close(certsReady)
// setupControllers registers controllers with the manager
// after the certs for the admission webhooks have been generated.
go setupControllers(mgr, certsReady, hookServer)
if initConfig.CertGeneratorConfig.Enable {
if err = cert.AddToManager(mgr, initConfig.CertGeneratorConfig, certsReady); err != nil {
log.Error(err, "Failed to set up cert-generator")
}
} else {
certsReady <- struct{}{}
}
log.Info("Setting up health checker.")
if err := mgr.AddReadyzCheck("readyz", hookServer.StartedChecker()); err != nil {
log.Error(err, "Unable to add readyz endpoint to the manager")
os.Exit(1)
}
if err = mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
log.Error(err, "Add webhook server health checker to the manager failed")
os.Exit(1)
}
// Start the Cmd
log.Info("Starting the manager.")
if err = mgr.Start(ctx); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
}
func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer webhook.Server) {
// certsReady blocks controller registration until the certs are generated.
<-certsReady
log.Info("Certs ready")
// Setup all Controllers
log.Info("Setting up controller.")
@ -130,26 +178,8 @@ func main() {
}
log.Info("Setting up webhooks.")
if err := webhook.AddToManager(mgr, webhookPort); err != nil {
if err := webhookv1beta1.AddToManager(mgr, hookServer); err != nil {
log.Error(err, "Unable to register webhooks to the manager")
os.Exit(1)
}
log.Info("Setting up health checker.")
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
log.Error(err, "Unable to add healthz endpoint to the manager")
os.Exit(1)
}
// TODO (@anencore94): add a more detailed check of whether it is possible to communicate with k8s-apiserver or db-manager at '/readyz'.
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
log.Error(err, "Unable to add readyz endpoint to the manager")
os.Exit(1)
}
// Start the Cmd
log.Info("Starting the Cmd.")
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
}

@ -49,11 +49,11 @@ import (
"strings"
"time"
"github.com/hpcloud/tail"
"github.com/nxadm/tail"
psutil "github.com/shirou/gopsutil/v3/process"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
"k8s.io/klog"
"k8s.io/klog/v2"
commonv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
api "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
@ -134,7 +134,11 @@ func printMetricsFile(mFile string) {
checkMetricFile(mFile)
// Print lines from metrics file.
t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
t, err := tail.TailFile(mFile, tail.Config{Follow: true, ReOpen: true})
if err != nil {
klog.Errorf("Failed to open metrics file: %v", err)
}
for line := range t.Lines {
klog.Info(line.Text)
}
@ -307,7 +311,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
}
// Create connection and client for Early Stopping service.
conn, err := grpc.Dial(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
conn, err := grpc.NewClient(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
klog.Fatalf("Could not connect to Early Stopping service, error: %v", err)
}
@ -429,7 +433,7 @@ func main() {
func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) {
conn, err := grpc.Dial(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
conn, err := grpc.NewClient(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil {
klog.Fatalf("Could not connect to DB manager service, error: %v", err)
}

@ -1,4 +1,4 @@
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import argparse
from logging import INFO, StreamHandler, getLogger
import api_pb2
from pns import WaitMainProcesses
import api_pb2_grpc
import const
import grpc
from pns import WaitMainProcesses
from tfevent_loader import MetricsCollector
from logging import getLogger, StreamHandler, INFO
timeout_in_seconds = 60
@ -55,25 +57,28 @@ if __name__ == '__main__':
wait_all_processes = opt.wait_all_processes.lower() == "true"
db_manager_server = opt.db_manager_server_addr.split(':')
if len(db_manager_server) != 2:
raise Exception("Invalid Katib DB manager service address: %s" %
opt.db_manager_server_addr)
raise Exception(
f"Invalid Katib DB manager service address: {opt.db_manager_server_addr}"
)
WaitMainProcesses(
pool_interval=opt.poll_interval,
timout=opt.timeout,
wait_all=wait_all_processes,
completed_marked_dir=opt.metrics_file_dir)
completed_marked_dir=opt.metrics_file_dir,
)
mc = MetricsCollector(opt.metric_names.split(';'))
mc = MetricsCollector(opt.metric_names.split(";"))
observation_log = mc.parse_file(opt.metrics_file_dir)
channel = grpc.beta.implementations.insecure_channel(
db_manager_server[0], int(db_manager_server[1]))
with api_pb2.beta_create_DBManager_stub(channel) as client:
logger.info("In " + opt.trial_name + " " +
str(len(observation_log.metric_logs)) + " metrics will be reported.")
client.ReportObservationLog(api_pb2.ReportObservationLogRequest(
trial_name=opt.trial_name,
observation_log=observation_log
), timeout=timeout_in_seconds)
with grpc.insecure_channel(opt.db_manager_server_addr) as channel:
stub = api_pb2_grpc.DBManagerStub(channel)
logger.info(
f"In {opt.trial_name} {len(observation_log.metric_logs)} metrics will be reported."
)
stub.ReportObservationLog(
api_pb2.ReportObservationLogRequest(
trial_name=opt.trial_name, observation_log=observation_log
),
timeout=timeout_in_seconds,
)

@ -1,5 +1,6 @@
psutil==5.9.4
rfc3339>=6.2
grpcio==1.41.1
grpcio>=1.64.1
googleapis-common-protos==1.6.0
tensorflow==2.11.0
tensorflow==2.16.1
protobuf>=4.21.12,<5

@ -1,61 +0,0 @@
# --- Clone the kubeflow/kubeflow code ---
FROM ubuntu AS fetch-kubeflow-kubeflow
RUN apt-get update && apt-get install git -y
WORKDIR /kf
COPY ./pkg/new-ui/v1beta1/frontend/COMMIT ./
RUN git clone https://github.com/kubeflow/kubeflow.git && \
COMMIT=$(cat ./COMMIT) && \
cd kubeflow && \
git checkout $COMMIT
# --- Build the frontend kubeflow library ---
FROM node:12 AS frontend-kubeflow-lib
WORKDIR /src
ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
RUN npm ci
COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
RUN npm run build
# --- Build the frontend ---
FROM node:12 AS frontend
WORKDIR /src
COPY ./pkg/new-ui/v1beta1/frontend/package*.json ./
RUN npm ci
COPY ./pkg/new-ui/v1beta1/frontend/ .
COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
RUN npm run build:prod
# --- Build the backend ---
FROM golang:alpine AS go-build
ARG TARGETARCH
WORKDIR /go/src/github.com/kubeflow/katib
# Download packages.
COPY go.mod .
COPY go.sum .
RUN go mod download -x
# Copy sources.
COPY cmd/ cmd/
COPY pkg/ pkg/
# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/new-ui/v1beta1
# --- Compose the web app ---
FROM alpine:3.15
WORKDIR /app
COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
COPY --from=frontend /src/dist/static /app/build/static/
ENTRYPOINT ["./katib-ui"]

@ -1,76 +0,0 @@
/*
Copyright 2022 The Kubeflow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package main
import (
"flag"
"fmt"
"log"
"net/http"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
common_v1beta1 "github.com/kubeflow/katib/pkg/common/v1beta1"
ui "github.com/kubeflow/katib/pkg/new-ui/v1beta1"
)
var (
port, host, buildDir, dbManagerAddr *string
)
func init() {
port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
}
func main() {
flag.Parse()
kuh := ui.NewKatibUIHandler(*dbManagerAddr)
log.Printf("Serving the frontend dir %s", *buildDir)
frontend := http.FileServer(http.Dir(*buildDir))
http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
http.HandleFunc("/katib/fetch_experiments/", kuh.FetchExperiments)
http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
http.HandleFunc("/katib/fetch_hp_job_trial_info/", kuh.FetchHPJobTrialInfo)
http.HandleFunc("/katib/fetch_nas_job_info/", kuh.FetchNASJobInfo)
http.HandleFunc("/katib/fetch_trial_templates/", kuh.FetchTrialTemplates)
http.HandleFunc("/katib/add_template/", kuh.AddTemplate)
http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
http.HandleFunc("/katib/fetch_trial_logs/", kuh.FetchTrialLogs)
log.Printf("Serving at %s:%s", *host, *port)
if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
panic(err)
}
}

@ -2,7 +2,6 @@
FROM golang:alpine AS build-env
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
WORKDIR /go/src/github.com/kubeflow/katib
@ -18,10 +17,6 @@ COPY pkg/ pkg/
# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1
# Add GRPC health probe.
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
# Copy the Goptuna suggestion into a thin image.
FROM alpine:3.15
@ -29,7 +24,6 @@ ENV TARGET_DIR /opt/katib
WORKDIR ${TARGET_DIR}
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/
RUN chgrp -R 0 ${TARGET_DIR} \

View File

@ -24,7 +24,7 @@ import (
api_v1_beta1 "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
suggestion "github.com/kubeflow/katib/pkg/suggestion/v1beta1/goptuna"
"google.golang.org/grpc"
"k8s.io/klog"
"k8s.io/klog/v2"
)
const (

View File

@ -1,12 +1,4 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,9 +1,9 @@
grpcio==1.41.1
grpcio>=1.64.1
cloudpickle==0.5.6
numpy>=1.20.0
numpy>=1.25.2
scikit-learn>=0.24.0
scipy>=1.5.4
forestci==0.3
protobuf==3.19.5
protobuf>=4.21.12,<5
googleapis-common-protos==1.6.0
cython>=0.29.24

View File

@ -1,12 +1,4 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,10 +1,10 @@
grpcio==1.41.1
grpcio>=1.64.1
cloudpickle==0.5.6
numpy>=1.20.0
numpy>=1.25.2
scikit-learn>=0.24.0
scipy>=1.5.4
forestci==0.3
protobuf==3.19.5
protobuf>=4.21.12,<5
googleapis-common-protos==1.6.0
hyperopt==0.2.5
cython>=0.29.24

View File

@ -1,12 +1,4 @@
FROM alpine:3.15 as downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
from concurrent import futures
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.nas.darts.service import DartsService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.nas.darts.service import DartsService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio==1.41.1
protobuf==3.19.5
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.6.0
cython>=0.29.24

View File

@ -1,17 +1,8 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
@ -23,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,15 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
from concurrent import futures
import time
from concurrent import futures
import grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.nas.enas.service import EnasService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,5 @@
grpcio==1.41.1
grpcio>=1.64.1
googleapis-common-protos==1.6.0
cython>=0.29.24
tensorflow==2.11.0
tensorflow==2.16.1
protobuf>=4.21.12,<5

View File

@ -1,12 +1,4 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO /bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.optuna.service import OptunaService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.optuna.service import OptunaService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio==1.41.1
protobuf==3.19.5
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.53.0
optuna>=3.0.0
optuna==3.3.0

View File

@ -1,12 +1,4 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO /bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
FROM python:3.11-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.pbt.service import PbtService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.pbt.service import PbtService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio==1.41.1
protobuf==3.19.5
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.53.0
numpy==1.22.2
numpy==1.25.2

View File

@ -1,11 +1,3 @@
FROM alpine:3.15 AS downloader
ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.15
RUN wget -qO /bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-${TARGETARCH} \
&& chmod +x /bin/grpc_health_probe
FROM python:3.10-slim
ARG TARGETARCH
@ -22,7 +14,6 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

View File

@ -12,13 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.skopt.service import SkoptService
from concurrent import futures
import grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.skopt.service import SkoptService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,13 +1,13 @@
grpcio==1.41.1
grpcio>=1.64.1
cloudpickle==0.5.6
# This is a workaround to avoid the following error.
# AttributeError: module 'numpy' has no attribute 'int'
# See more: https://github.com/numpy/numpy/pull/22607
numpy==1.23.5
scikit-learn>=0.24.0
scikit-learn>=0.24.0, <=1.3.0
scipy>=1.5.4
forestci==0.3
protobuf==3.19.5
protobuf>=4.21.12,<5
googleapis-common-protos==1.6.0
scikit-optimize>=0.9.0
cython>=0.29.24

View File

@ -1,13 +1,52 @@
# Build the Katib UI.
FROM node:12.18.1 AS npm-build
# --- Clone the kubeflow/kubeflow code ---
FROM alpine/git AS fetch-kubeflow-kubeflow
# Build frontend.
ADD /pkg/ui/v1beta1/frontend /frontend
RUN cd /frontend && npm ci
RUN cd /frontend && npm run build
RUN rm -rf /frontend/node_modules
WORKDIR /kf
COPY ./pkg/ui/v1beta1/frontend/COMMIT ./
RUN git clone https://github.com/kubeflow/kubeflow.git && \
COMMIT=$(cat ./COMMIT) && \
cd kubeflow && \
git checkout $COMMIT
# Build backend.
# --- Build the frontend kubeflow library ---
FROM node:16-alpine AS frontend-kubeflow-lib
WORKDIR /src
ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
npm config set fetch-retry-maxtimeout 1200000 && \
npm config get registry && \
npm config set registry https://registry.npmjs.org/ && \
npm config delete https-proxy && \
npm config set loglevel verbose && \
npm cache clean --force && \
npm ci --force --prefer-offline --no-audit
COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
RUN npm run build
# --- Build the frontend ---
FROM node:16-alpine AS frontend
WORKDIR /src
COPY ./pkg/ui/v1beta1/frontend/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
npm config set fetch-retry-maxtimeout 1200000 && \
npm config get registry && \
npm config set registry https://registry.npmjs.org/ && \
npm config delete https-proxy && \
npm config set loglevel verbose && \
npm cache clean --force && \
npm ci --force --prefer-offline --no-audit
COPY ./pkg/ui/v1beta1/frontend/ .
COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
RUN npm run build:prod
# --- Build the backend ---
FROM golang:alpine AS go-build
ARG TARGETARCH
@ -24,11 +63,11 @@ COPY cmd/ cmd/
COPY pkg/ pkg/
# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/ui/v1beta1
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/ui/v1beta1
# Copy the backend and frontend into a thin image.
# --- Compose the web app ---
FROM alpine:3.15
WORKDIR /app
COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
COPY --from=npm-build /frontend/build /app/build
COPY --from=frontend /src/dist/static /app/build/static/
ENTRYPOINT ["./katib-ui"]

View File

@ -33,7 +33,7 @@ var (
)
func init() {
port = flag.String("port", "80", "The port to listen to for incoming HTTP connections")
port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
@ -45,17 +45,17 @@ func main() {
log.Printf("Serving the frontend dir %s", *buildDir)
frontend := http.FileServer(http.Dir(*buildDir))
http.Handle("/katib/", http.StripPrefix("/katib/", frontend))
http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)
http.HandleFunc("/katib/fetch_experiments/", kuh.FetchExperiments)
http.HandleFunc("/katib/submit_yaml/", kuh.SubmitYamlJob)
http.HandleFunc("/katib/submit_hp_job/", kuh.SubmitParamsJob)
http.HandleFunc("/katib/submit_nas_job/", kuh.SubmitParamsJob)
http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
@ -67,6 +67,7 @@ func main() {
http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
http.HandleFunc("/katib/fetch_trial_logs/", kuh.FetchTrialLogs)
log.Printf("Serving at %s:%s", *host, *port)
if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {

13
conformance/run.sh Normal file
View File

@ -0,0 +1,13 @@
#!/bin/sh
# Run conformance test and generate test report.
python test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py --experiment-path examples/v1beta1/hp-tuning/random.yaml --namespace kf-conformance \
--trial-pod-labels '{"sidecar.istio.io/inject": "false"}' | tee /tmp/katib-conformance.log
# Create the done file.
touch /tmp/katib-conformance.done
echo "Done..."
# Keep the container running so the test logs can be downloaded.
while true; do sleep 10000; done

5
docs/README.md Normal file
View File

@ -0,0 +1,5 @@
# Katib Documentation
Welcome to Kubeflow Katib!
The Katib documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/katib/).

View File

@ -1,139 +0,0 @@
# Developer Guide
This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following user guides:
- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)
in Katib, hyperparameter tuning, and neural architecture search.
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- Detailed guide to [configuring and running a Katib
experiment](https://kubeflow.org/docs/components/katib/experiment/).
## Requirements
- [Go](https://golang.org/) (1.19 or later)
- [Docker](https://docs.docker.com/) (20.10 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.10 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
## Build from source code
Build Katib from source as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/katib-config.yaml).
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
```bash
make deploy
```
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
```bash
make undeploy
```
## Modify controller APIs
If you want to modify Katib controller APIs, you have to
generate deepcopy, clientset, listers, informers, open-api and Python SDK with the changed APIs.
You can update the necessary files as follows:
```bash
make generate
```
## Controller Flags
Below is a list of command-line flags accepted by Katib controller:
| Name | Type | Default | Description |
| ------------------------------- | ------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address that the metrics endpoint binds to |
| healthz-addr | string | ":18080" | The address that the healthz endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| enable-leader-election | bool | false | Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller. |
| leader-election-id | string | "3fbc96e9.katib.kubeflow.org" | The ID for leader election. |
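The `trial-resources` flag takes entries in the form `Kind.version.group`. The following is a minimal Python sketch of how such a string splits into its three parts; it is an illustration only, not the controller's actual Go parsing code. The names `GroupVersionKind` and `parse_trial_resource` are hypothetical, chosen for this example.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    """Illustrative container; mirrors the three parts of a trial resource."""

    group: str
    version: str
    kind: str


def parse_trial_resource(arg: str) -> GroupVersionKind:
    """Parse a `Kind.version.group` string such as `TFJob.v1.kubeflow.org`.

    Only the first two dots are separators; the group itself may contain dots.
    """
    parts = arg.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {arg!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)
```

Splitting at most twice keeps a dotted group such as `kubeflow.org` intact, which is why the kind comes first in this format.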
## DB Manager Flags
Below is a list of command-line flags accepted by Katib DB Manager:
| Name | Type | Default | Description |
| --------------- | ------------- | ------- | ------------------------------------------------------- |
| connect-timeout | time.Duration | 60s | Timeout before returning an error during database connection |
## Workflow design
Please see [workflow-design.md](./workflow-design.md).
## Katib admission webhooks
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
1. `validator.experiment.katib.kubeflow.org` -
[Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
to validate the Katib Experiment before the creation.
1. `defaulter.experiment.katib.kubeflow.org` -
[Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
in the Katib Experiment before the creation.
1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
collector sidecar container into the training pod. Learn more about Katib's
metrics collector in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).
You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the master
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib uses the custom `cert-generator` [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` Job follows these steps:
- Generate the self-signed certificate and private key.
- Create a Kubernetes Secret with the self-signed TLS certificate and private key.
The Secret is named `katib-webhook-cert` and carries the `cert-generator` Job's
`ownerReference`, so that it is cleaned up once Katib is uninstalled.
Once the Secret is created, the Katib controller Deployment spawns the Pod,
since the controller mounts the `katib-webhook-cert` Secret volume.
- Patch the webhooks with the `CABundle`.
You can find the `cert-generator` source code [here](../cmd/cert-generator/v1beta1).
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](https://github.com/kubeflow/katib/tree/master/pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).

View File

@ -5,7 +5,7 @@ Here you can find the location for images that are used in Katib.
## Katib Components Images
The following table shows images for the
[Katib components](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).
[Katib components](https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components).
<table>
<tbody>
@ -22,7 +22,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/katib-controller</code>
<code>ghcr.io/kubeflow/katib/katib-controller</code>
</td>
<td>
Katib Controller
@ -33,7 +33,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/katib-ui</code>
<code>ghcr.io/kubeflow/katib/katib-ui</code>
</td>
<td>
Katib User Interface
@ -44,7 +44,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/katib-db-manager</code>
<code>ghcr.io/kubeflow/katib/katib-db-manager</code>
</td>
<td>
Katib DB Manager
@ -64,24 +64,13 @@ The following table shows images for the
<a href="https://github.com/docker-library/mysql/blob/c506174eab8ae160f56483e8d72410f8f1e1470f/8.0/Dockerfile.debian">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/cert-generator</code>
</td>
<td>
Katib Cert Generator
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/cert-generator/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
</tbody>
</table>
## Katib Metrics Collectors Images
The following table shows images for the
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
<table>
<tbody>
@ -98,7 +87,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/file-metrics-collector</code>
<code>ghcr.io/kubeflow/katib/file-metrics-collector</code>
</td>
<td>
File Metrics Collector
@ -109,7 +98,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/tfevent-metrics-collector</code>
<code>ghcr.io/kubeflow/katib/tfevent-metrics-collector</code>
</td>
<td>
Tensorflow Event Metrics Collector
@ -124,8 +113,8 @@ The following table shows images for the
## Katib Suggestions and Early Stopping Images
The following table shows images for the
[Katib Suggestions](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/early-stopping/).
[Katib Suggestion services](https://www.kubeflow.org/docs/components/katib/reference/architecture/#suggestion)
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/user-guides/early-stopping/#early-stopping-algorithms).
<table>
<tbody>
@ -142,7 +131,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-hyperopt</code>
<code>ghcr.io/kubeflow/katib/suggestion-hyperopt</code>
</td>
<td>
<a href="https://github.com/hyperopt/hyperopt">Hyperopt</a> Suggestion
@ -153,7 +142,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-skopt</code>
<code>ghcr.io/kubeflow/katib/suggestion-skopt</code>
</td>
<td>
<a href="https://github.com/scikit-optimize/scikit-optimize">Skopt</a> Suggestion
@ -164,7 +153,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-optuna</code>
<code>ghcr.io/kubeflow/katib/suggestion-optuna</code>
</td>
<td>
<a href="https://github.com/optuna/optuna">Optuna</a> Suggestion
@ -175,7 +164,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-goptuna</code>
<code>ghcr.io/kubeflow/katib/suggestion-goptuna</code>
</td>
<td>
<a href="https://github.com/c-bata/goptuna">Goptuna</a> Suggestion
@ -186,7 +175,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-hyperband</code>
<code>ghcr.io/kubeflow/katib/suggestion-hyperband</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">Hyperband</a> Suggestion
@ -197,7 +186,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-enas</code>
<code>ghcr.io/kubeflow/katib/suggestion-enas</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#enas">ENAS</a> Suggestion
@ -208,7 +197,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-darts</code>
<code>ghcr.io/kubeflow/katib/suggestion-darts</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a> Suggestion
@ -219,7 +208,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/earlystopping-medianstop</code>
<code>ghcr.io/kubeflow/katib/earlystopping-medianstop</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stopping Rule</a>
@ -234,7 +223,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
## Training Containers Images
The following table shows images for training containers which are used in the
[Katib Trials](https://www.kubeflow.org/docs/components/katib/experiment/#packaging-your-training-code-in-a-container-image).
[Katib Trials](https://www.kubeflow.org/docs/components/katib/reference/architecture/#trial).
<table>
<tbody>
@ -251,18 +240,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/mxnet-mnist</code>
</td>
<td>
MXNet MNIST example with collecting metrics time
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/mxnet-mnist/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/pytorch-mnist-cpu</code>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-cpu</code>
</td>
<td>
PyTorch MNIST example with printing metrics to the file or StdOut with CPU support
@ -273,7 +251,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/pytorch-mnist-gpu</code>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-gpu</code>
</td>
<td>
PyTorch MNIST example with printing metrics to the file or StdOut with GPU support
@ -284,7 +262,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/tf-mnist-with-summaries</code>
<code>ghcr.io/kubeflow/katib/tf-mnist-with-summaries</code>
</td>
<td>
Tensorflow MNIST example with saving metrics in the summaries
@ -295,18 +273,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/bytepsimage/mxnet</code>
</td>
<td>
Distributed BytePS example for MXJob
</td>
<td>
<a href="https://github.com/bytedance/byteps/blob/v0.2.5/docker/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/xgboost-lightgbm</code>
<code>ghcr.io/kubeflow/katib/xgboost-lightgbm</code>
</td>
<td>
Distributed LightGBM example for XGBoostJob
@ -339,7 +306,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/enas-cnn-cifar10-gpu</code>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-gpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with GPU support
@ -350,7 +317,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/enas-cnn-cifar10-cpu</code>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-cpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with CPU support
@ -361,7 +328,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/darts-cnn-cifar10-gpu</code>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-gpu</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with GPU support
@ -372,7 +339,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/darts-cnn-cifar10-cpu</code>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-cpu</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with CPU support

Binary file not shown (before: 102 KiB).

Binary file not shown (before: 192 KiB).

View File

@ -1,189 +0,0 @@
# Document about how to add a new algorithm in Katib
## Implement a new algorithm and use it in Katib
### Implement the algorithm
The design of Katib follows the `ask-and-tell` pattern:
> They often follow a pattern a bit like this:
>
> 1. ask for a new set of parameters
> 1. walk to the Experiment and program in the new parameters
> 1. observe the outcome of running the Experiment
> 1. walk back to your laptop and tell the optimizer about the outcome
> 1. go to step 1
When an Experiment is created, one algorithm service is created. Katib then asks for new sets of parameters via the `GetSuggestions` GRPC call, creates new Trials from those sets, and observes the outcome. When the Trials are finished, Katib reports their metrics to the algorithm and asks for another set of parameters.
The new algorithm needs to implement the `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1beta1/api.proto). A sample algorithm looks like this:
```python
from pkg.apis.manager.v1beta1.python import api_pb2
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.internal.search_space import HyperParameter, HyperParameterSearchSpace
from pkg.suggestion.v1beta1.internal.trial import Trial, Assignment
from pkg.suggestion.v1beta1.hyperopt.base_service import BaseHyperoptService
from pkg.suggestion.v1beta1.internal.base_health_service import HealthServicer


# Inherit SuggestionServicer and implement GetSuggestions.
class HyperoptService(
        api_pb2_grpc.SuggestionServicer, HealthServicer):
    def ValidateAlgorithmSettings(self, request, context):
        # Optional, it is used to validate algorithm settings defined by users.
        pass

    def GetSuggestions(self, request, context):
        # Convert the Experiment in GRPC request to the search space.
        # search_space example:
        #   HyperParameterSearchSpace(
        #       goal: MAXIMIZE,
        #       params: [HyperParameter(name: param-1, type: INTEGER, min: 1, max: 5, step: 0),
        #                HyperParameter(name: param-2, type: CATEGORICAL, list: cat1, cat2, cat3),
        #                HyperParameter(name: param-3, type: DISCRETE, list: 3, 2, 6),
        #                HyperParameter(name: param-4, type: DOUBLE, min: 1, max: 5, step: )]
        #   )
        search_space = HyperParameterSearchSpace.convert(request.experiment)
        # Convert the trials in GRPC request to the trials in algorithm side.
        # trials example:
        #   [Trial(
        #       assignment: [Assignment(name=param-1, value=2),
        #                    Assignment(name=param-2, value=cat1),
        #                    Assignment(name=param-3, value=2),
        #                    Assignment(name=param-4, value=3.44)],
        #       target_metric: Metric(name="metric-2", value="5643"),
        #       additional_metrics: [Metric(name=metric-1, value=435),
        #                            Metric(name=metric-3, value=5643)]),
        #    Trial(
        #       assignment: [Assignment(name=param-1, value=3),
        #                    Assignment(name=param-2, value=cat2),
        #                    Assignment(name=param-3, value=6),
        #                    Assignment(name=param-4, value=4.44)],
        #       target_metric: Metric(name="metric-2", value="3242"),
        #       additional_metrics: [Metric(name=metric-1, value=123),
        #                            Metric(name=metric-3, value=543)])]
        trials = Trial.convert(request.trials)
        #--------------------------------------------------------------
        # Your code here
        # Implement the logic to generate new assignments for the given current request number.
        # For example, if request.current_request_number is 2, you should return:
        # [
        #   [Assignment(name=param-1, value=3),
        #    Assignment(name=param-2, value=cat2),
        #    Assignment(name=param-3, value=3),
        #    Assignment(name=param-4, value=3.22)
        #   ],
        #   [Assignment(name=param-1, value=4),
        #    Assignment(name=param-2, value=cat3),
        #    Assignment(name=param-3, value=2),
        #    Assignment(name=param-4, value=4.32)
        #   ],
        # ]
        list_of_assignments = your_logic(search_space, trials, request.current_request_number)
        #--------------------------------------------------------------
        # Convert list_of_assignments to the GetSuggestionsReply.
        return api_pb2.GetSuggestionsReply(
            trials=Assignment.generate(list_of_assignments)
        )
```
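The ask-and-tell loop described above can be sketched independently of Katib. The following is a minimal illustration, not Katib code: `RandomSearch`, `run_trial`, and `optimize` are hypothetical names, random search stands in for the suggestion algorithm, and a quadratic loss stands in for a Trial.

```python
import random


class RandomSearch:
    """Toy optimizer exposing an ask-and-tell interface (hypothetical, not Katib code)."""

    def __init__(self, low, high, seed=0):
        self.low, self.high = low, high
        self.rng = random.Random(seed)
        self.history = []  # (parameters, metric) pairs reported via tell()

    def ask(self, count):
        # "Ask": propose `count` new parameter sets.
        return [{"x": self.rng.uniform(self.low, self.high)} for _ in range(count)]

    def tell(self, parameters, metric):
        # "Tell": record the observed outcome for later decisions.
        self.history.append((parameters, metric))

    def best(self):
        # Lowest metric seen so far (this sketch minimizes).
        return min(self.history, key=lambda item: item[1])


def run_trial(parameters):
    # Stand-in for running a Trial: the metric is a simple quadratic loss.
    return (parameters["x"] - 3.0) ** 2


def optimize(rounds=5, per_round=2):
    optimizer = RandomSearch(low=0.0, high=10.0)
    for _ in range(rounds):
        for parameters in optimizer.ask(per_round):
            optimizer.tell(parameters, run_trial(parameters))
    return optimizer


if __name__ == "__main__":
    print(optimize().best())
```

In Katib terms, `ask` roughly corresponds to the `GetSuggestions` call, and the reported history corresponds to the finished Trials that Katib sends with the next request.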
### Make a GRPC server for the algorithm
Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main function and Dockerfile. The new GRPC server should listen on port 6789.
Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt).
Then build the Docker image.
### Use the algorithm in Katib
Update the [Katib config](../manifests/v1beta1/components/controller/katib-config.yaml) and [operator](../operators/katib-controller/src/suggestion.json) with the new algorithm entity:
```diff
 suggestion: |-
   {
     "tpe": {
       "image": "docker.io/kubeflowkatib/suggestion-hyperopt"
     },
     "random": {
       "image": "docker.io/kubeflowkatib/suggestion-hyperopt"
     },
+    "<new-algorithm-name>": {
+      "image": "image built in the previous stage"
+    }
   }
```
Learn more about Katib config in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/katib-config/)
### Contribute the algorithm to Katib
If you want to contribute the algorithm to Katib, add a unit test and/or
an e2e test for it in the CI and submit a PR.
#### Unit Test
Here is an example [test_hyperopt_service.py](../test/unit/v1beta1/suggestion/test_hyperopt_service.py):
```python
import grpc
import grpc_testing
import unittest

from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService


class TestHyperopt(unittest.TestCase):
    def setUp(self):
        servicers = {
            api_pb2.DESCRIPTOR.services_by_name['Suggestion']: HyperoptService()
        }

        self.test_server = grpc_testing.server_from_dictionary(
            servicers, grpc_testing.strict_real_time())


if __name__ == '__main__':
    unittest.main()
```
You can set up the GRPC server using `grpc_testing`, then define your own test cases.
#### E2E Test (Optional)
E2E tests help Katib verify that the algorithm works well.
Follow the steps below to add your algorithm (Suggestion) to the Katib CI
(replace `<name>` with your Suggestion name):
1. Submit a PR to add a new ECR private registry to the AWS
[`ECR_Private_Registry_List`](https://github.com/kubeflow/testing/blob/master/aws/IaC/CDK/test-infra/config/static_config/ECR_Resources.py#L18).
Registry name should follow the pattern: `katib/v1beta1/suggestion-<name>`
1. Create a new Experiment YAML in the [examples/v1beta1](../examples/v1beta1)
with the new algorithm.
1. Update [`setup-katib.sh`](../test/e2e/v1beta1/scripts/setup-katib.sh)
script to modify `katib-config.yaml` with the new test Suggestion image name.
For example:
```sh
sed -i -e "s@docker.io/kubeflowkatib/suggestion-<name>@${ECR_REGISTRY}/${REPO_NAME}/v1beta1/suggestion-<name>@" ${CONFIG_PATCH}
```
1. Update the following variables in [`argo_workflow.py`](../test/e2e/v1beta1/argo_workflow.py):
- [`KATIB_IMAGES`](../test/e2e/v1beta1/argo_workflow.py#L43) with your Suggestion Dockerfile location:
```diff
. . .
"suggestion-goptuna": "cmd/suggestion/goptuna/v1beta1/Dockerfile",
"suggestion-optuna": "cmd/suggestion/optuna/v1beta1/Dockerfile",
+ "suggestion-<name>": "cmd/suggestion/<name>/v1beta1/Dockerfile",
. . .
```
- [`KATIB_EXPERIMENTS`](../test/e2e/v1beta1/argo_workflow.py#L69) with your Experiment YAML location:
```diff
. . .
"multivariate-tpe": "examples/v1beta1/hp-tuning/multivariate-tpe.yaml",
"cmaes": "examples/v1beta1/hp-tuning/cma-es.yaml",
+ "<algorithm-name>": "examples/v1beta1/hp-tuning/<algorithm-name>.yaml",
. . .
```

@ -1,4 +1,4 @@
# Support custom CRD in Trial Job proposal
# KEP-1214: Support custom CRD in Trial Job proposal
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
@ -180,7 +180,7 @@ SucceededCondition: Succeeded
Previously, we had problems with Istio sidecar containers,
check [kubeflow/issue#1081](https://github.com/kubeflow/kubeflow/issues/4742).
In some cases, it is unable to properly download datasets in training pod.
It was fixed by adding annotation `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller.
It was fixed by adding label `sidecar.istio.io/inject: false` to appropriate Trial job in Katib controller.
Various CRD can have unified design and it is hard to understand where annotation must be specified
to disable Istio injection for the running pods.

@ -1,4 +1,4 @@
# Conformance Test for AutoML and Training Working Group
# KEP-2044: Conformance Test for AutoML and Training Working Group
Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
@ -61,7 +61,7 @@ the 3 category of tests:
## Design for the CRD-based tests
![conformance-crd-test](../images/conformance-crd-test.png)
![conformance-crd-test](conformance-crd-test.png)
The design is similar to the KFP conformance program for the API-based tests.

@ -0,0 +1,240 @@
# KEP-2339: HyperParameter Optimization API for LLM Fine-Tuning
- [HyperParameter Optimization API for LLM Fine-Tuning](#hyperparameter-optimization-api-for-llm-fine-tuning)
- [Links](#links)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Design for API](#design-for-api)
- [Example](#example)
- [Implementation](#implementation)
## Links
- [katib/issues#2291 (Tuning API in Katib for LLMs)](https://github.com/kubeflow/katib/issues/2291)
## Motivation
The rapid advancements and growing popularity of Large Language Models (LLMs) have driven an increased need for effective LLMOps in Kubernetes environments. To address this, we developed a [train API](https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) within the Training Python SDK, simplifying the process of fine-tuning LLMs using distributed PyTorchJob workers. However, hyperparameter optimization remains a crucial yet labor-intensive task for enhancing model performance. Automating this tuning process through a dedicated API will facilitate efficient and scalable exploration of hyperparameters, ultimately improving model performance and reducing manual effort.
## Goals
Our goal is to develop a high-level API for tuning hyperparameters of LLMs that simplifies the process of hyperparameter optimization in Kubernetes. This API will seamlessly integrate with external platforms like HuggingFace and S3 for importing pretrained models and datasets. By specifying parameters for the training objective, trial configurations, and PyTorch worker configurations, the API will automate the creation of experiments and execution of trials. This abstraction of Kubernetes infrastructure complexities will enable data scientists to optimize hyperparameters efficiently and effectively.
## Non-Goals
1. Incorporate early stopping strategy into the API to optimize training efficiency.
2. Expand support for distributed training in frameworks beyond PyTorch by leveraging their distributed training capabilities.
3. Support adding custom providers through configmap or CRD approach to enhance flexibility.
4. Enable users to deploy tuned models for inference within their applications or seamlessly integrate them into existing NLP pipelines for specialized tasks.
## Design for API
![Design for API](hp-optimization-api-design.jpg)
```python
from typing import Any, Callable, Dict, List, Optional, Union

import kubeflow.katib as katib
from kubeflow.katib import KatibClient

class KatibClient(object):
    def tune(
        self,
        name: str,
        namespace: Optional[str] = None,
        model_provider_parameters: Optional[HuggingFaceModelParams] = None,
        dataset_provider_parameters: Optional[Union[HuggingFaceDatasetParams, S3DatasetParams]] = None,
        trainer_parameters: Union[HuggingFaceTrainerParams, Dict[str, Any]] = None,
        storage_config: Dict[str, Optional[Union[str, List[str]]]] = {
            "size": constants.PVC_DEFAULT_SIZE,
            "storage_class": None,
            "access_modes": constants.PVC_DEFAULT_ACCESS_MODES,
        },
        objective: Optional[Callable] = None,
        base_image: Optional[str] = None,
        algorithm_name: str = "random",
        algorithm_settings: Union[dict, List[models.V1beta1AlgorithmSetting], None] = None,
        objective_metric_name: str = "eval_accuracy",
        additional_metric_names: List[str] = [],
        objective_type: str = "maximize",
        objective_goal: float = None,
        max_trial_count: int = None,
        parallel_trial_count: int = None,
        max_failed_trial_count: int = None,
        resources_per_trial: Union[dict, client.V1ResourceRequirements, types.TrainerResources, None] = None,
        retain_trials: bool = False,
        env_per_trial: Optional[Union[Dict[str, str], List[Union[client.V1EnvVar, client.V1EnvFromSource]]]] = None,
        packages_to_install: List[str] = None,
        pip_index_url: str = "https://pypi.org/simple",
    ):
        """
        Initiates a hyperparameter tuning experiment in Katib.

        Model, dataset and parameters can be configured using one of the following options:
        - Using the Storage Initializer: Specify `model_provider_parameters`, `dataset_provider_parameters`, and `trainer_parameters`. This option downloads models and datasets from external platforms like HuggingFace and S3, and utilizes `Trainer.train()` in HuggingFace to train the model.
        - Defining a custom objective function: Specify the `objective` parameter to define your own objective function, and use the `base_image` parameter to execute the objective function.

        Parameters:
        - name: Name for the experiment.
        - namespace: Namespace for the experiment. Defaults to the namespace of the 'KatibClient' object.
        - model_provider_parameters: Parameters for providing the model. Compatible with model providers like HuggingFace.
        - dataset_provider_parameters: Parameters for providing the dataset. Compatible with dataset providers like HuggingFace or S3.
        - trainer_parameters: Parameters for configuring the training process, including settings for hyperparameters search space.
        - storage_config: Configuration for Storage Initializer PVC to download pre-trained model and dataset.
        - objective: Objective function that Katib uses to train the model.
        - base_image: Image to use when executing the objective function.
        - algorithm_name: Tuning algorithm name (e.g., 'random', 'bayesian').
        - algorithm_settings: Settings for the tuning algorithm.
        - objective_metric_name: Primary metric to optimize.
        - additional_metric_names: List of additional metrics to collect.
        - objective_type: Optimization direction for the objective metric, "minimize" or "maximize".
        - objective_goal: Desired value of the objective metric.
        - max_trial_count: Maximum number of trials to run.
        - parallel_trial_count: Number of trials to run in parallel.
        - max_failed_trial_count: Maximum number of allowed failed trials.
        - resources_per_trial: Resources assigned per trial, which can be specified using one of the following options:
            - Non-distributed Training: Specify a kubernetes.client.V1ResourceRequirements object or a dictionary that includes one or more of the following keys: `cpu`, `memory`, or `gpu` (other keys will be ignored).
            - Distributed Training in PyTorch: Specify a types.TrainerResources, which includes the following parameters:
                - num_workers: Number of PyTorchJob workers.
                - num_procs_per_worker: Number of processes per PyTorchJob worker.
                - resources_per_worker: Resources assigned per PyTorchJob worker container, specified as either a kubernetes.client.V1ResourceRequirements object or a dictionary that includes one or more of the following keys: `cpu`, `memory`, or `gpu` (other keys will be ignored).
        - retain_trials: Whether to retain trial resources after completion.
        - env_per_trial: Environment variables for worker containers.
        - packages_to_install: Additional Python packages to install.
        - pip_index_url: URL of the PyPI index for installing packages.
        """
        pass  # Implementation logic for initiating the experiment
```
### Example
```python
import kubeflow.katib as katib
from kubeflow.katib import KatibClient
import transformers
from peft import LoraConfig
from kubeflow.storage_initializer.hugging_face import (
HuggingFaceModelParams,
HuggingFaceDatasetParams,
HuggingFaceTrainerParams,
)
# Create a Katib client.
cl = KatibClient(namespace="kubeflow")
# Run the tuning experiment.
exp_name = "llm-experiment1"
cl.tune(
name = exp_name,
# BERT model URI and type of Transformer to train it.
model_provider_parameters = HuggingFaceModelParams(
model_uri = "hf://google-bert/bert-base-cased",
transformer_type = transformers.AutoModelForSequenceClassification,
),
# Use 3000 samples from Yelp dataset.
dataset_provider_parameters = HuggingFaceDatasetParams(
repo_id = "yelp_review_full",
split = "train[:3000]",
),
# Specify HuggingFace Trainer parameters.
trainer_parameters = HuggingFaceTrainerParams(
training_parameters = transformers.TrainingArguments(
output_dir = "test_tune_api",
save_strategy = "no",
learning_rate = katib.search.double(min=1e-05, max=5e-05),
num_train_epochs=3,
),
# Set LoRA config to reduce number of trainable model parameters.
lora_config = LoraConfig(
r = katib.search.int(min=8, max=32),
lora_alpha = 8,
lora_dropout = 0.1,
bias = "none",
),
),
objective_metric_name = "train_loss",
objective_type = "minimize",
algorithm_name = "random",
max_trial_count = 10,
parallel_trial_count = 2,
resources_per_trial={
"gpu": "2",
"cpu": "4",
"memory": "10G",
},
# For distributed training, specify `resources_per_trial` using `types.TrainerResources` (to be implemented).
)
# Wait until Katib Experiment is complete
cl.wait_for_experiment_condition(name=exp_name)
# Get the best hyperparameters.
print(cl.get_optimal_hyperparameters(exp_name))
```
## Implementation
By passing the specified parameters, this API will automate hyperparameter optimization for LLMs. The implementation will focus on the following aspects:
**Model and Dataset Management**: We will leverage the [storage_initializer](https://github.com/kubeflow/training-operator/tree/master/sdk/python/kubeflow/storage_initializer) from the Training Python SDK for seamless integration of pretrained models and datasets from platforms like HuggingFace and S3. This component manages downloading and storing pretrained models and datasets via a PersistentVolumeClaim (PVC), which is shared across containers, ensuring efficient access to the pretrained model and dataset without redundant downloads.
**Hyperparameter Configuration**: Users specify training parameters and the hyperparameters to be optimized within `trainer_parameters`. The API will first traverse `trainer_parameters.training_parameters` and `trainer_parameters.lora_config` to identify tunable hyperparameters and set up their values for the Experiment and Trials. These parameters are then passed as `args` to the container spec of workers.
```python
# Traverse and set up hyperparameters
input_params = {}
experiment_params = []
trial_params = []

training_args = trainer_parameters.training_parameters
for p_name, p_value in training_args.to_dict().items():
    if not hasattr(training_args, p_name):
        logger.warning(f"Training parameter {p_name} is not supported by the current transformer.")
        continue
    if isinstance(p_value, models.V1beta1ParameterSpec):
        # Tunable hyperparameter: substitute a Trial parameter placeholder.
        value = f"${{trialParameters.{p_name}}}"
        setattr(training_args, p_name, value)
        p_value.name = p_name
        experiment_params.append(p_value)
        trial_params.append(models.V1beta1TrialParameterSpec(name=p_name, reference=p_name))
    elif p_value is not None:
        # Fixed value: cast it back to the type of the attribute it replaces.
        old_attr = getattr(training_args, p_name, None)
        if old_attr is not None:
            value = type(old_attr)(p_value)
            setattr(training_args, p_name, value)
input_params['training_args'] = training_args

# Note: Repeat similar logic for `lora_config`

# Create container spec of worker
container_spec = client.V1Container(
    ...
    args=[
        "--model_uri",
        model_provider_parameters.model_uri,
        "--transformer_type",
        model_provider_parameters.transformer_type.__name__,
        "--model_dir",
        "REPLACE_WITH_ACTUAL_MODEL_PATH",
        "--dataset_dir",
        "REPLACE_WITH_ACTUAL_DATASET_PATH",
        "--lora_config",
        json.dumps(input_params['lora_config'].__dict__, cls=utils.SetEncoder),
        "--training_parameters",
        json.dumps(input_params['training_args'].to_dict()),
    ],
    ...
)
```
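The traversal can also be shown in a self-contained form. The sketch below is an illustration under stated assumptions rather than the SDK implementation: `SearchSpec` is a hypothetical stand-in for `models.V1beta1ParameterSpec`, `split_tunable` is a hypothetical helper name, and plain dicts stand in for `V1beta1TrialParameterSpec`.

```python
from dataclasses import dataclass


@dataclass
class SearchSpec:
    """Hypothetical stand-in for a tunable-parameter spec with a min/max range."""
    min: float
    max: float
    name: str = ""


def split_tunable(params):
    """Separate tunable values from fixed ones.

    Returns (resolved_params, experiment_params, trial_params), where tunable
    entries in resolved_params are replaced by trial-parameter placeholders.
    """
    resolved, experiment_params, trial_params = {}, [], []
    for p_name, p_value in params.items():
        if isinstance(p_value, SearchSpec):
            # Tunable: the Trial template gets a placeholder,
            # the Experiment gets the search range.
            resolved[p_name] = f"${{trialParameters.{p_name}}}"
            p_value.name = p_name
            experiment_params.append(p_value)
            trial_params.append({"name": p_name, "reference": p_name})
        elif p_value is not None:
            # Fixed value: pass through unchanged.
            resolved[p_name] = p_value
    return resolved, experiment_params, trial_params


if __name__ == "__main__":
    print(split_tunable({"learning_rate": SearchSpec(1e-05, 5e-05), "num_train_epochs": 3}))
```

Keeping the placeholder in the resolved parameters means the serialized `args` can be written once into the Trial template, and each Trial only substitutes the values chosen by the algorithm.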
**Hyperparameter Optimization**: This API will create an Experiment that defines the search space for the identified tunable hyperparameters, the objective metric, the optimization algorithm, etc. The Experiment will orchestrate the hyperparameter tuning process, generating a Trial for each configuration. Each Trial will be implemented as a Kubernetes PyTorchJob, with the `trialTemplate` specifying the exact values for hyperparameters. The `trialTemplate` will also define master and worker containers, facilitating effective resource distribution and parallel execution of Trials. Trial results will then be fed back to the Experiment, which will evaluate the outcomes to identify the optimal set of hyperparameters.
**Dependencies Update**: To reuse existing assets from the Training Python SDK and integrate packages from HuggingFace, dependencies will be added to the `setup.py` of the Katib Python SDK as follows:
```python
setuptools.setup(
    ...,  # Configurations of the package
    extras_require={
        "huggingface": ["kubeflow-training[huggingface]==1.8.0rc1"],
    },
)
```

@ -0,0 +1,185 @@
# KEP-2340: Push-based Metrics Collection Proposal
## Links
- [katib/issues#577 ([Enhancement Request] Metrics Collector Push-based Implementation)](https://github.com/kubeflow/katib/issues/577)
## Motivation
[Katib](https://github.com/kubeflow/katib) is a Kubernetes-native project for automated machine learning (AutoML). It can tune hyperparameters of applications written in any language, natively supports many ML frameworks, and also supports features like early stopping and neural architecture search.
During hyperparameter tuning, the Metrics Collector, which is implemented as a sidecar container attached to each training container in the [current design](https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md), collects training logs from Trials once the training is complete. The Metrics Collector then parses the training logs to get the appropriate metrics, like accuracy or loss, and passes the evaluation results to the HyperParameter tuning algorithm.
However, the current implementation of the Metrics Collector is pull-based, which raises some [design problems](https://github.com/kubeflow/training-operator/issues/722#issuecomment-405669269): determining how frequently to scrape the metrics, performance issues such as the overhead caused by too many sidecar containers, and the restriction that development environments must support sidecar containers. Thus, we should implement a new API for the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.
![](./push-based-metrics-collection.png)
Fig.1 Architecture of the new design
### Goals
1. **A new parameter in Python SDK function `tune`**: allow users to specify the method of collecting metrics (push-based/pull-based).
2. **A new interface `report_metrics` in Python SDK**: push the metrics to Katib DB directly.
3. The final metrics of worker pods should be **pushed to Katib DB directly** in the push mode of metrics collection.
### Non-Goals
1. Implement authentication model for Katib DB to push metrics.
2. Support pushing data to different types of storage systems (Prometheus, self-defined interfaces, etc.).
## API
### New Parameter in Python SDK Function `tune`
We decided to add `metrics_collector_config` to the `tune` function in the Python SDK.
```Python
def tune(
    self,
    name: str,
    objective: Callable,
    parameters: Dict[str, Any],
    base_image: str = constants.BASE_IMAGE_TENSORFLOW,
    namespace: Optional[str] = None,
    env_per_trial: Optional[Union[Dict[str, str], List[Union[client.V1EnvVar, client.V1EnvFromSource]]]] = None,
    algorithm_name: str = "random",
    algorithm_settings: Union[dict, List[models.V1beta1AlgorithmSetting], None] = None,
    objective_metric_name: str = None,
    additional_metric_names: List[str] = [],
    objective_type: str = "maximize",
    objective_goal: float = None,
    max_trial_count: int = None,
    parallel_trial_count: int = None,
    max_failed_trial_count: int = None,
    resources_per_trial: Union[dict, client.V1ResourceRequirements, None] = None,
    retain_trials: bool = False,
    packages_to_install: List[str] = None,
    pip_index_url: str = "https://pypi.org/simple",
    # The newly added parameter metrics_collector_config.
    # It specifies the config of metrics collector, for example,
    # metrics_collector_config={"kind": "Push"},
    metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"},
)
```
### New Interface `report_metrics` in Python SDK
```Python
def report_metrics(
    metrics: Dict[str, Any],
    db_manager_address: str = constants.DEFAULT_DB_MANAGER_ADDRESS,
    timeout: int = constants.DEFAULT_TIMEOUT,
):
    """Push Metrics Directly to Katib DB

    [!!!] Trial name should always be passed into Katib Trials as env variable `KATIB_TRIAL_NAME`.

    Args:
        metrics: Dict of metrics pushed to Katib DB.
            For example, `metrics = {"loss": 0.01, "accuracy": 0.99}`.
        db_manager_address: Address for the Katib DB Manager in this format: `ip-address:port`.
        timeout: Optional, gRPC API Server timeout in seconds to report metrics.

    Raises:
        RuntimeError: Unable to push Trial metrics to Katib DB.
    """
```
### A Simple Example:
```Python
import kubeflow.katib as katib

# Step 1. Create an objective function with push-based metrics collection.
def objective(parameters):
    # Import required packages.
    import kubeflow.katib as katib
    # Calculate objective function.
    result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2
    # Push metrics to Katib DB.
    katib.report_metrics({"result": result})

# Step 2. Create HyperParameter search space.
parameters = {
    "a": katib.search.int(min=10, max=20),
    "b": katib.search.double(min=0.1, max=0.2)
}

# Step 3. Create Katib Experiment with 12 Trials and 2 GPUs per Trial.
katib_client = katib.KatibClient(namespace="kubeflow")
name = "tune-experiment"
katib_client.tune(
    name=name,
    objective=objective,
    parameters=parameters,
    objective_metric_name="result",
    max_trial_count=12,
    resources_per_trial={"gpu": "2"},
    metrics_collector_config={"kind": "Push"},
)

# Step 4. Get the best HyperParameters.
print(katib_client.get_optimal_hyperparameters(name))
```
## Implementation
### Add New Parameter in `tune`
As mentioned above, we decided to add `metrics_collector_config` to the tune function in Python SDK. Also, we have some changes to be made:
1. Configure the way of metrics collection: set the configuration `spec.metricsCollectionSpec.collector.kind`(specify the way of metrics collection) to `Push`.
2. Rename metrics collector from `None` to `Push`: It's not correct to call push-based metrics collection `None`. We should modify related code to rename it.
3. Write env variables into Trial spec: set `KATIB_TRIAL_NAME` for `report_metrics` function to dial db manager.
### New Interface `report_metrics` in Python SDK
We decided to implement this function to push metrics directly to Katib DB with the help of grpc. Trial name should always be passed into Katib Trials (and then into this function) as env variable `KATIB_TRIAL_NAME`.
Also, the function is supposed to be implemented as **global function** because it is called in the user container.
Steps:
1. Wrap metrics into `katib_api_pb2.ReportObservationLogRequest`:
Firstly, convert metrics (in dict format) into `katib_api_pb2.ReportObservationLogRequest` type for the following grpc call, referring to https://github.com/kubeflow/katib/blob/master/pkg/apis/manager/v1beta1/gen-doc/api.md#reportobservationlogrequest
2. Dial Katib DBManager Service
We'll create a DBManager Stub and make a grpc call to report metrics to Katib DB.
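Step 1 above can be sketched without the generated gRPC code. The example below is illustrative only: it builds a plain dict whose shape loosely mirrors a report-observation-log request (a Trial name plus a list of timestamped metric logs), and `build_observation_request` is a hypothetical helper, not part of the Katib SDK. The real implementation would construct `katib_api_pb2` messages instead and send them through the DBManager stub.

```python
import os
from datetime import datetime, timezone


def build_observation_request(metrics, trial_name=None, now=None):
    """Build a dict shaped like a report-observation-log request (illustrative only)."""
    # The Trial name is expected in the KATIB_TRIAL_NAME env variable.
    trial_name = trial_name or os.environ.get("KATIB_TRIAL_NAME")
    if not trial_name:
        raise ValueError("Trial name is not set; expected env variable KATIB_TRIAL_NAME")

    # All metrics reported in one call share a single timestamp.
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")

    metric_logs = []
    for name, value in metrics.items():
        float(value)  # Raises if the value cannot be interpreted as a number.
        metric_logs.append(
            {"time_stamp": timestamp, "metric": {"name": name, "value": str(value)}}
        )
    return {"trial_name": trial_name, "observation_log": {"metric_logs": metric_logs}}


if __name__ == "__main__":
    print(build_observation_request({"loss": 0.01, "accuracy": 0.99}, trial_name="demo-trial"))
```

Validating the metric values before the gRPC call lets a user's mistake surface inside the training container instead of as a failed write to the DB.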
### Compatibility Changes in Trial Controller
We need to make appropriate changes in the Trial controller to make sure we insert an unavailable value into Katib DB if the user accidentally does not report metrics. The current implementation handles unavailable metrics in:
```Golang
// If observation is empty metrics collector doesn't finish.
// For early stopping metrics collector are reported logs before Trial status is changed to EarlyStopped.
if jobStatus.Condition == trialutil.JobSucceeded && instance.Status.Observation == nil {
	logger.Info("Trial job is succeeded but metrics are not reported, reconcile requeued")
	return errMetricsNotReported
}
```
1. Distinguish pull-based and push-based metrics collection
We decided to add an if-else statement to the code above to distinguish pull-based and push-based metrics collection. In push-based collection, the Trial does not need to be requeued. Instead, we'll insert an unavailable value into Katib DB.
2. Update the status of Trial to `MetricsUnavailable`
In the current implementation of pull-based metrics collection, Trials are requeued when `.Status.Observation` is found to be empty. However, this is not compatible with push-based metrics collection, because metrics that were never reported will not show up in the next round of reconcile. So, we need to update the Trial status in the function `UpdateTrialStatusCondition`, alongside the existing pull-based handling. The following code will be inserted before [trial_controller_util.go#L69](https://github.com/kubeflow/katib/blob/7959ffd54851216dbffba791e1da13c8485d1085/pkg/controller.v1beta1/trial/trial_controller_util.go#L69)
```Golang
else if instance.Spec.MetricCollector.Collector.Kind == "Push" {
... // Update the status of this Trial to `MetricsUnavailable` and output the reason.
}
```
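The branching described in the two steps above can be summarized as a small decision function. This is a sketch in Python for illustration only (the controller itself is written in Go), and `Action` and `next_action` are hypothetical names that do not exist in the Katib code base.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"                            # Nothing special to do.
    REQUEUE = "requeue"                              # Pull-based: wait for the collector.
    MARK_METRICS_UNAVAILABLE = "metrics-unavailable"  # Push-based: metrics will never arrive.


def next_action(job_succeeded, has_observation, collector_kind):
    """Decide how to handle a Trial whose job finished, based on the collector kind."""
    if not job_succeeded or has_observation:
        return Action.CONTINUE
    if collector_kind == "Push":
        # The user code did not report metrics; requeueing would not help.
        return Action.MARK_METRICS_UNAVAILABLE
    # Pull-based collectors may still be parsing logs, so try again later.
    return Action.REQUEUE


if __name__ == "__main__":
    print(next_action(True, False, "Push"))
    print(next_action(True, False, "StdOut"))
```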
### Collection of Final Metrics
The final metrics of worker pods should be pushed to Katib DB directly in the push mode of metrics collection.

@ -0,0 +1,169 @@
# KEP-2374: Proposal for Supporting various parameter distributions in Katib
## Summary
The goal of this project is to enhance the existing Katib Experiment APIs to support various parameter distributions such as uniform, log-uniform, and qlog-uniform. Then extend the suggestion services to be able to configure distributions for search space using libraries provided in each framework.
## Motivation
Currently, [Katib](https://github.com/kubeflow/katib) is limited to supporting only uniform distribution for integer, float, and categorical hyperparameters. By introducing additional distributions, Katib will become more flexible and powerful in conducting hyperparameter optimization tasks.
A Data Scientist requires Katib to support multiple hyperparameter distributions, such as log-uniform, normal, and log-normal, in addition to the existing uniform distribution. This enhancement is crucial for more flexible and precise hyperparameter optimization. For instance, learning rates often benefit from a log-uniform distribution because small values can significantly impact performance. Similarly, normal distributions are useful for parameters that are expected to vary around a central value.
### Goals
- Add `Distribution` field to `FeasibleSpace` alongside `ParameterType`.
- Support for the log-uniform, normal, and log-normal Distributions.
- Update the Experiment and gRPC API to support `Distribution`.
- Update logic to handle the new parameter distributions for each suggestion service (e.g., Optuna, Hyperopt).
- Extend the Python SDK to support the new `Distribution` field.
### Non-Goals
- This proposal does not aim to create a new version of the CRD APIs.
- This proposal does not aim to make the necessary Katib UI changes.
- No changes will be made to the core optimization algorithms beyond supporting new distributions.
## Proposal
### Parameter Distribution Comparison Table
| Distribution Type | Hyperopt | Optuna | Ray Tune | Nevergrad |
|-------------------------------|-----------------------|-------------------------------------------------|-----------------------|---------------------------------------------|
| **Uniform Continuous** | `hp.uniform` | `FloatDistribution` | `tune.uniform` | `p.Scalar` with uniform transformation |
| **Quantized Uniform** | `hp.quniform` | `DiscreteUniformDistribution` (deprecated) | `tune.quniform` | `p.Scalar` with uniform and step specified |
| **Log Uniform** | `hp.loguniform` | `LogUniformDistribution` (deprecated) | `tune.loguniform` | `p.Log` with uniform transformation |
| **Uniform Integer** | `hp.randint` or quantized distributions with step size `q` set to 1 | `IntDistribution` | `tune.randint` | `p.Scalar` with integer transformation |
| **Categorical** | `hp.choice` | `CategoricalDistribution` | `tune.choice` | `p.Choice` |
| **Quantized Log Uniform** | `hp.qloguniform` | Custom Implementation | `tune.qloguniform` | `p.Log` with uniform and step specified |
| **Normal** | `hp.normal` | (Not directly supported) | `tune.randn` | (Not directly supported) |
| **Quantized Normal** | `hp.qnormal` | (Not directly supported) | `tune.qrandn` | (Not directly supported) |
| **Log Normal** | `hp.lognormal` | (Not directly supported) | (Use custom transformation in `tune.randn`) | (Not directly supported) |
| **Quantized Log Normal** | `hp.qlognormal` | (Not directly supported) | (Use custom transformation in `tune.qrandn`) | (Not directly supported) |
| **Quantized Integer** | `hp.uniformint` | `IntUniformDistribution` (deprecated) | | `p.Scalar` with integer and step specified |
| **Log Integer** | | `IntLogUniformDistribution` (deprecated) | `tune.lograndint` | `p.Scalar` with log-integer transformation |
- Note: In `Nevergrad`, parameter types like `p.Scalar`, `p.Log`, and `p.Choice` are mapped to corresponding `Hyperopt` search space definitions like `hp.uniform`, `hp.loguniform`, and `hp.choice` using internal functions to convert parameter bounds and distributions.
## API Design
### FeasibleSpace
Feasible space for optimization.
Int and Double type use Max/Min.
Discrete and Categorical type use List.
| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| max | [string](#string) | | Max Value |
| min | [string](#string) | | Minimum Value |
| list | [string](#string) | repeated | List of Values. |
| step | [string](#string) | | Step for double or int parameter, or `q` for quantization |
| distribution | [Distribution](#api-v1-beta1-Distribution) | | Type of the Distribution. |
<a name="api-v1-beta1-Distribution"></a>
### Distribution
- Value types for hyperparameter distributions.
- We add the `distribution` field to represent the hyperparameter search space instead of extending [`ParameterType`](https://github.com/kubeflow/katib/blob/2c575227586ff1c03cf6b5190d066e2f3061a404/pkg/apis/controller/experiments/v1beta1/experiment_types.go#L199-L207).
- The `distribution` field allows users to customize the search space at a finer granularity.
- In this enhancement, we propose the following four distributions:
| Name | Number | Description |
| ---- | ------ | ----------- |
| UNIFORM | 0 | Continuous uniform distribution. Samples values evenly between a minimum and maximum value. Use "Max/Min". Use "Step" for `q`. |
| LOGUNIFORM | 1 | Samples values such that their logarithm is uniformly distributed. Use "Max/Min". Use "Step" for `q`. |
| NORMAL | 2 | Normal (Gaussian) distribution type. Samples values according to a normal distribution characterized by a mean and standard deviation. Use "Max/Min". Use "Step" for `q`. |
| LOGNORMAL | 3 | Log-normal distribution type. Samples values such that their logarithm is normally distributed. Use "Max/Min". Use "Step" for `q`. |
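
To make the four distributions and the role of `q` concrete, here is a minimal sketch using only the Python standard library. It is not Katib code: the function names `quantize` and `sample` are hypothetical, and the way "Max/Min" is turned into a mean and standard deviation for `NORMAL` and `LOGNORMAL` (midpoint and one sixth of the range) is an assumption, since the table above does not define that mapping.

```python
import math
import random


def quantize(value, q):
    """Round value to the nearest multiple of q; q=None leaves it unchanged."""
    if q is None:
        return value
    return round(value / q) * q


def sample(distribution, low, high, q=None, rng=random):
    """Draw one value for a distribution name from the table above.

    Assumption (not from the proposal): NORMAL and LOGNORMAL use the
    midpoint of [low, high] as the mean and (high - low) / 6 as the
    standard deviation, computed in log space for LOGNORMAL.
    LOGUNIFORM and LOGNORMAL require low > 0.
    """
    if distribution == "UNIFORM":
        value = rng.uniform(low, high)
    elif distribution == "LOGUNIFORM":
        # Uniform in log space, so every order of magnitude is equally likely.
        value = math.exp(rng.uniform(math.log(low), math.log(high)))
    elif distribution == "NORMAL":
        # Unbounded: samples can fall outside [low, high].
        value = rng.gauss((low + high) / 2, (high - low) / 6)
    elif distribution == "LOGNORMAL":
        log_low, log_high = math.log(low), math.log(high)
        value = math.exp(
            rng.gauss((log_low + log_high) / 2, (log_high - log_low) / 6)
        )
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return quantize(value, q)
```

Note that quantization is applied after sampling, so a quantized value can land slightly outside the bounds; a real suggestion service would need to decide whether to clip such values.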
## Experiment API changes
Scope: `pkg/apis/controller/experiments/v1beta1/experiment_types.go`
```go
type ParameterSpec struct {
	Name          string        `json:"name,omitempty"`
	ParameterType ParameterType `json:"parameterType,omitempty"`
	FeasibleSpace FeasibleSpace `json:"feasibleSpace,omitempty"`
}
```
- Adding new field `Distribution` to `FeasibleSpace`
- The `Step` field can be used to define quantization steps for uniform or log-uniform distributions, effectively covering q-quantization requirements.
Updated `FeasibleSpace` struct
```diff
 type FeasibleSpace struct {
     Max          string       `json:"max,omitempty"`
     Min          string       `json:"min,omitempty"`
     List         []string     `json:"list,omitempty"`
     Step         string       `json:"step,omitempty"` // Step can be used to define q-quantization
+    Distribution Distribution `json:"distribution,omitempty"` // Added Distribution field
 }
```
- New Field Description: `Distribution`
- Type: `Distribution`
- Description: The Distribution field specifies the type of statistical distribution to be applied to the parameter. This allows the definition of various distributions, such as uniform, log-uniform, or other supported types.
- Defining `Distribution` type
```go
type Distribution string

const (
	DistributionUniform    Distribution = "uniform"
	DistributionLogUniform Distribution = "logUniform"
	DistributionNormal     Distribution = "normal"
	DistributionLogNormal  Distribution = "logNormal"
)
```
## gRPC API changes
Scope: `pkg/apis/manager/v1beta1/api.proto`
- Add the `Distribution` field to the `FeasibleSpace` message
```diff
 /**
  * Feasible space for optimization.
  * Int and Double type use Max/Min.
  * Discrete and Categorical type use List.
  */
 message FeasibleSpace {
     string max = 1;           /// Max Value
     string min = 2;           /// Minimum Value
     repeated string list = 3; /// List of Values.
     string step = 4;          /// Step for double or int parameter
+    Distribution distribution = 5; /// Distribution of the parameter.
 }
```
- Define the `Distribution` enum
```
/**
 * Distribution types for HyperParameter.
 */
enum Distribution {
    UNIFORM = 0;
    LOG_UNIFORM = 1;
    NORMAL = 2;
    LOG_NORMAL = 3;
}
```
## Suggestion Service Logic
- For each suggestion service (e.g., Optuna, Hyperopt), the logic will be updated to handle the new parameter distributions.
- This involves modifying the conversion functions to map Katib distributions to the corresponding framework-specific distributions.
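
As a rough illustration of such a conversion function, the sketch below maps a Katib double parameter to the name and arguments of the matching Hyperopt parameter expression. It deliberately does not import Hyperopt and only returns the expression name as a string. The function name `to_hyperopt_expression` is hypothetical, and deriving `mu` and `sigma` from `min`/`max` (midpoint and one sixth of the range) is an assumption that this proposal does not specify.

```python
import math


def to_hyperopt_expression(distribution, min_value, max_value, step=None):
    """Return (expression_name, args) for a Katib double parameter.

    min_value, max_value, and step are strings, as in FeasibleSpace.
    The returned args are what follows the label in the Hyperopt call,
    for example hp.quniform(label, low, high, q).
    """
    low, high = float(min_value), float(max_value)
    q = float(step) if step else None

    if distribution == "uniform":
        name, args = "hp.uniform", (low, high)
    elif distribution == "logUniform":
        # hp.loguniform expects its bounds in log space.
        name, args = "hp.loguniform", (math.log(low), math.log(high))
    elif distribution == "normal":
        # Assumed mapping from bounds to (mu, sigma).
        name, args = "hp.normal", ((low + high) / 2, (high - low) / 6)
    elif distribution == "logNormal":
        # Same assumed mapping, applied to the log of the bounds.
        log_low, log_high = math.log(low), math.log(high)
        name, args = "hp.lognormal", (
            (log_low + log_high) / 2,
            (log_high - log_low) / 6,
        )
    else:
        raise ValueError(f"unsupported distribution: {distribution}")

    if q is not None:
        # Step selects the quantized variant (hp.quniform, hp.qloguniform,
        # hp.qnormal, hp.qlognormal), which takes q as its last argument.
        name = name.replace("hp.", "hp.q", 1)
        args = args + (q,)
    return name, args
```

For example, `to_hyperopt_expression("uniform", "0", "1", "0.1")` selects `hp.quniform` with arguments `(0.0, 1.0, 0.1)`. Other suggestion services would need an analogous table for their own distribution classes.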
#### Optuna
ref: https://optuna.readthedocs.io/en/stable/reference/distributions.html
For example, update `_get_optuna_search_space` to handle the new distributions.
Scope: `pkg/suggestion/v1beta1/optuna/base_service.py`
#### Goptuna
ref: https://github.com/c-bata/goptuna/blob/2245ddd9e8d1edba750839893c8a618f852bc1cf/distribution.go
#### Hyperopt
ref: http://hyperopt.github.io/hyperopt/getting-started/search_spaces/#parameter-expressions
#### Ray Tune
ref: https://docs.ray.io/en/latest/tune/api/search_space.html
## Python SDK
Extend the Python SDK to support the new `Distribution` field.
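
One possible shape for this, sketched with plain dictionaries rather than the real SDK classes: a helper that builds a parameter entry whose keys follow the JSON tags of `ParameterSpec` and `FeasibleSpace` shown earlier. The helper name `double_parameter` and its signature are hypothetical and are not part of the existing Katib SDK.

```python
import json

# Values taken from the Distribution constants in the Experiment API section.
ALLOWED_DISTRIBUTIONS = {"uniform", "logUniform", "normal", "logNormal"}


def double_parameter(name, min_value, max_value, step=None, distribution="uniform"):
    """Hypothetical helper: build a ParameterSpec-shaped dict for a double."""
    if distribution not in ALLOWED_DISTRIBUTIONS:
        raise ValueError(f"unsupported distribution: {distribution}")
    # FeasibleSpace stores max, min, and step as strings.
    feasible_space = {
        "min": str(min_value),
        "max": str(max_value),
        "distribution": distribution,
    }
    if step is not None:
        feasible_space["step"] = str(step)
    return {
        "name": name,
        "parameterType": "double",
        "feasibleSpace": feasible_space,
    }


# Usage: a learning rate searched on a log-uniform scale.
spec = double_parameter("lr", 0.0001, 0.1, distribution="logUniform")
print(json.dumps(spec, indent=2))
```

Validating the distribution name on the client side is a design choice made for this sketch; the proposal does not say where validation should happen.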


@ -1,28 +1,27 @@
# Suggestion CRD Design Document
# KEP-507: Suggestion CRD Design Document
Table of Contents
=================
# Table of Contents
* [Suggestion CRD Design Document](#suggestion-crd-design-document)
* [Table of Contents](#table-of-contents)
* [Background](#background)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Design](#design)
* [Kubernetes API](#kubernetes-api)
* [GRPC API](#grpc-api)
* [Workflow](#workflow)
* [Example](#example)
* [Algorithm Supports](#algorithm-supports)
* [Random](#random)
* [Grid](#grid)
* [Bayes Optimization](#bayes-optimization)
* [HyperBand](#hyperband)
* [BOHB](#bohb)
* [TPE](#tpe)
* [SMAC](#smac)
* [CMA-ES](#cma-es)
* [Sobol](#sobol)
- [Suggestion CRD Design Document](#suggestion-crd-design-document)
- [Table of Contents](#table-of-contents)
- [Background](#background)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Design](#design)
- [Kubernetes API](#kubernetes-api)
- [GRPC API](#grpc-api)
- [Workflow](#workflow)
- [Example](#example)
- [Algorithm Supports](#algorithm-supports)
- [Random](#random)
- [Grid](#grid)
- [Bayes Optimization](#bayes-optimization)
- [HyperBand](#hyperband)
- [BOHB](#bohb)
- [TPE](#tpe)
- [SMAC](#smac)
- [CMA-ES](#cma-es)
- [Sobol](#sobol)
Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
@ -30,7 +29,7 @@ Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
Katib makes suggestions long-running in v1alpha3. And the suggestions need to communicate with Katib DB manager to get experiments and trials from Katib db driver. This design hurts high availability.
Thus we proposed a new design to implement a CRD for suggestion and remove Katib db communication from main workflow. The new design simplifies the implmentation of experiment and trial controller, and makes Katib Kubernetes native.
Thus we proposed a new design to implement a CRD for suggestion and remove Katib db communication from main workflow. The new design simplifies the implementation of experiment and trial controller, and makes Katib Kubernetes native.
This document is to illustrate the details of the new design.
@ -118,7 +117,7 @@ message ExperimentSpec {
}
message ParameterSpecs {
repeated ParameterSpec parameters = 1;
repeated ParameterSpec parameters = 1;
}
message AlgorithmSpec {
@ -228,28 +227,28 @@ spec:
algorithmName: random
trialTemplate:
goTemplate:
rawTemplate: |-
apiVersion: batch/v1
kind: Job
metadata:
name: {{.Trial}}
namespace: {{.NameSpace}}
spec:
template:
spec:
containers:
- name: {{.Trial}}
image: katib/mxnet-mnist-example
command:
- "python"
- "/mxnet/example/image-classification/train_mnist.py"
- "--batch-size=64"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
restartPolicy: Never
rawTemplate: |-
apiVersion: batch/v1
kind: Job
metadata:
name: {{.Trial}}
namespace: {{.NameSpace}}
spec:
template:
spec:
containers:
- name: {{.Trial}}
image: katib/mxnet-mnist-example
command:
- "python"
- "/mxnet/example/image-classification/train_mnist.py"
- "--batch-size=64"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
restartPolicy: Never
parameters:
- name: --lr
parameterType: double
@ -265,9 +264,9 @@ spec:
parameterType: categorical
feasibleSpace:
list:
- sgd
- adam
- ftrl
- sgd
- adam
- ftrl
```
Then, Experiment controller needs 3 parallel trials to run. It creates the Suggestions:
