* add e2e test for tune api
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* upgrade training-operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* specify the version of training operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix num_labels error and update the version of training operator controller
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the version of training operator
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* debug
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check import path of HuggingFaceModelParams
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the version of training operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the name of experiment
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add step of checking pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add check
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check reason for imagepullbackoff
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* revert timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* extend timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update training operator sdk version
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* rerun tests
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the function of getting logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add the step of describing pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check disk space
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of controller and events
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of kubelet
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of kubelet
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase cpu
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of training operator
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the use of resources
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of container 'pytorch' and 'storage_initializer'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix error of checking use of resources
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add other checks to find the error reason
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set 'storage_config'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* reduce the number of tests
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* Check container runtime logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set the driver of minikube as docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set the driver of minikube to none
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check memory usage
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase 'termination_grace_period_seconds' in podspec
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix annotations error
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* restart docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete restarting docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* use original docker data directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update installation of Katib SDK with extra requires
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* test trainer image built with cpu
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add action of free up disk space (including move docker data directory)
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete unnecessary checks and update the part of fetching pod description and logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete fetching pod logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add blank line at the end of free-up-disk-space yaml file
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update experiment name
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update test function name to be consistent with experiment name
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* move import statements inside the function
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* apply pprint for the logging output
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update experiment names
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the sequence of arguments in 'trial_template'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* test example in user guide
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix access token error
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the error of setup
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the error of setup
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* reverse back
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
---------
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* Fix Istio sidecar injection by moving from annotations to labels
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* Update Istio sidecar injection from annotations to labels across the codebase
Replace annotations with labels for Istio sidecar injection according to Istio recommendations. Update conformance tests, examples, constants, composers, and utilities to use the new label-based approach consistently.
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
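As a rough illustration of the annotation-to-label move described above, the sketch below rewrites pod metadata so that the `sidecar.istio.io/inject` setting lives under `labels` rather than `annotations`. The helper name `move_istio_inject_to_labels` and the plain-dict metadata shape are assumptions made for illustration; they are not the functions changed in this commit.

```python
from copy import deepcopy

# Key Istio reads to decide whether to inject the sidecar into a pod.
ISTIO_INJECT_KEY = "sidecar.istio.io/inject"


def move_istio_inject_to_labels(metadata: dict) -> dict:
    """Return a copy of pod metadata with the Istio inject setting as a label.

    Hypothetical helper: `metadata` is a plain dict shaped like a Kubernetes
    ObjectMeta, with optional "annotations" and "labels" maps.
    """
    result = deepcopy(metadata)
    annotations = result.get("annotations") or {}
    if ISTIO_INJECT_KEY not in annotations:
        return result  # nothing to migrate

    value = annotations.pop(ISTIO_INJECT_KEY)
    labels = result.get("labels") or {}
    # Keep an existing label value if one is already set.
    labels.setdefault(ISTIO_INJECT_KEY, value)
    result["labels"] = labels

    # Drop the annotations map entirely if the inject key was its only entry.
    if annotations:
        result["annotations"] = annotations
    else:
        result.pop("annotations", None)
    return result
```

The input is deep-copied so that callers holding the original template do not see it change underneath them.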
* fix: Update SuggestionLabels function and composer implementation for Istio label injection
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* Fix linting issues in mpi-job-horovod.py
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* update: function moved from annotations to labels
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
---------
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* Support old-style TensorFlow events (tensorboard)
Fixes: https://github.com/kubeflow/katib/issues/2466
Signed-off-by: Gary Miguel <garymm@garymm.org>
* format
Signed-off-by: Gary Miguel <garymm@garymm.org>
* test
Signed-off-by: Gary Miguel <garymm@garymm.org>
* don't continue loops
Signed-off-by: Gary Miguel <garymm@garymm.org>
* format
Signed-off-by: Gary Miguel <garymm@garymm.org>
---------
Signed-off-by: Gary Miguel <garymm@garymm.org>
* improve pvc name error message by failing early and clear message with correct name example
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* fix lint
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* fix lint
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* raise value error for wrong name format by reconciliation
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* revert created utils
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve test case name
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve value error message
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve code flow
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
---------
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
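A minimal sketch of the fail-early check described in the commits above, assuming the PVC name must be a valid RFC 1123 DNS subdomain (the rule Kubernetes applies to most object names). The function name `validate_pvc_name` and the wording of the message are illustrative, not the SDK's actual code.

```python
import re

# RFC 1123 DNS subdomain: lowercase alphanumerics, '-' or '.',
# with every dot-separated part starting and ending in an alphanumeric.
_DNS_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)
_MAX_NAME_LENGTH = 253


def validate_pvc_name(name: str) -> None:
    """Raise ValueError before any API call if `name` cannot be a PVC name."""
    if len(name) > _MAX_NAME_LENGTH or not _DNS_SUBDOMAIN.fullmatch(name):
        raise ValueError(
            f"Invalid PVC name {name!r}: it must consist of lowercase "
            "alphanumeric characters, '-' or '.', and must start and end "
            "with an alphanumeric character (for example 'tune-example')."
        )
```

Checking the name on the client side means the caller sees one clear error with a valid example instead of a rejection from the API server later in the flow.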
* Sort experiments by descending creation date by default in katib-ui
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: Update "renders every Experiment name into the table" test to not check order
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: Update "renders every Experiment name into the table" test in order of startTime
Signed-off-by: Xinmin Du <2812493086@qq.com>
---------
Signed-off-by: Xinmin Du <2812493086@qq.com>
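The default ordering introduced above can be sketched as follows. The katib-ui code itself is not Python; this is only an illustration of the sort, assuming each experiment is a dict with an RFC 3339 `startTime` string. The function names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse a Kubernetes-style timestamp such as '2023-01-02T03:04:05Z'."""
    # Older Python versions of fromisoformat do not accept a trailing "Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def newest_first(experiments: list, field: str = "startTime") -> list:
    """Return experiments sorted by descending creation time."""
    return sorted(
        experiments,
        key=lambda exp: parse_timestamp(exp[field]),
        reverse=True,
    )
```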
* Upgrade klog dependency to v2
Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: fix conflict with k8s update
Signed-off-by: Xinmin Du <2812493086@qq.com>
---------
Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <dux.m.in@sjtu.edu.cn>
Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
* Change the isort profile to black, and add pkg dir for black and flake8
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fix the formatting
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fix flake8 lint issues
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
---------
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* added validator for feasible space distribution
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
validation logic fixed
added unit test
added unit test for valid distribution
requested changes made
Update pkg/webhook/v1beta1/experiment/validator/validator.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
fmt
* fmt fix
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
---------
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
* Add black formatter and flake8 linter to pre-commit
Also adds the flake8 config file
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fixes black formatting
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fixes flake8 linting errors
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
---------
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* added logger for katib_client module
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* added API_VERSION as a constant
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* updated the KatibClient constructor to match the TrainingClient constructor
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* added test for create_experiment in katib_client
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
---------
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* DB: Add env to skip DB creation
* DB: Rename var name & Remove new function
* Migration -> Initialization
* Remove GetBoolEnvOrDefault
* DB: Rearrange dependencies
* Wait for the certs to be mounted inside the container
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Initialize fullServiceDomain when adding certgenerator to the manager
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Output logs every 15 seconds if the certs don't yet exist in the container
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
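The wait-and-log behaviour described in the commits above could look roughly like this. It is a sketch only: the controller is not written in Python, and the function name, the timeout, and the poll interval are assumptions. Only the 15-second log interval comes from the commit message.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


def wait_for_certs(cert_paths, timeout=120.0, poll_interval=1.0,
                   log_interval=15.0, clock=time.monotonic, sleep=time.sleep):
    """Block until every path in cert_paths exists; return False on timeout.

    While waiting, emit a log line at most once per `log_interval` seconds.
    """
    start = clock()
    last_log = start
    while True:
        missing = [path for path in cert_paths if not os.path.exists(path)]
        if not missing:
            return True
        now = clock()
        if now - start >= timeout:
            return False
        if now - last_log >= log_interval:
            logger.info("Waiting for certs to be mounted: %s", missing)
            last_log = now
        sleep(poll_interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised in tests without real delays.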
* Consolidate the katib-cert-generator to the katib-controller
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use deployed secret instead of creating a new secret when the cert-generator saves certs on secret
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename secretName with webhookSecretName in the .init.certGenerator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix manifests
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Remove unneeded comments
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Restore unintentionally deleted log
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename package cert-generator with certgenerator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add test cases to check if the enable is set to true when the webhookServiceName or webhookSecretName is set
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update the developer guide
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Swap liveness probe and readiness probe
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Introduce SSA to the cert-generator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use the same member names between CertGenerator and KatibConfig
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Disable leader election on the cert-generator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Drop unneeded fields from SSA patches
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
This PR removes the Charmed Operators for Katib, as well as the associated tests. In the past this repo was the source of truth for these operators, but they have since been maintained [here](https://github.com/canonical/katib-operators/) and we've done a poor job of keeping the repos in sync. This commit removes the redundancy.
* disable istio sidecar injection for example manifests
* add namespace as command line arg to python test script
* revert disable istio sidecar injection
* add option to pass trial pod annotations
* split command over multiple lines
* remove redundant config loading
* add resource limit to containers of random experiment's trial spec pod
* update code to support already present annotations
* raise NotImplementedError if trialSpec is different from Job
* add metrics-collector-injection to namespace under test if missing
* build: Update COMMIT file
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* kwa(front): Update the use of SnackBarService
Update the use of SnackBarService in order to pass required data via a
`config` object and provide MAT_SNACK_BAR_DEFAULT_OPTIONS.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
---------
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* backend: Update error message when no logs could be found
* Update the message the backend sends to not just expose that logs are
not there because 'retain' might not be set, but also because the
cluster was scaled down.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Add LOGS tab in Trial details page
In this commit:
* Create a distinct LOGS tab, which displays the trial's logs in the
Trial details page.
* Don't show the backend's error popup for logs, but show the error
message in the admonition.
Signed-off-by: Elena Zioga <elena@arrikto.com>
---------
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Update training operator image in Katib CI
* build: Update COMMIT file
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Make links actual links in experiments table
Make KWA's experiments table links actual links by using the new LinkValue
class.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Make links actual links in trials table
Make KWA's trials table links actual links by using the new LinkValue
class.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Rework the trial graph using echarts
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Remove d3 references
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* kwa(front): Install Cypress
- Install Cypress & npm scripts for UI tests
- Remove Protractor files
- Add README.md file to include UI tests instructions
- Modify .gitignore
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* kwa(front): Add UI tests with Cypress
Add UI tests with Cypress to check that:
- New Experiment form page loads template without errors.
- Index page
* has an "Experiments" title
* lists experiments without errors
* renders every experiment name into the table
* renders properly Status icon for all experiments
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions(kwa): Add UI tests in test-node action
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* More container fields for SuggestionConfig
* Inline corev1.Container into SuggestionConfig
* Set default value for suggestion container name
* Append suggestion volume and port only if not present
* Deep-Copy base suggestion container
* Check for suggestion container port number as well
* Prohibit suggestion port to be set in suggestion config
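The "append only if not present" step in the list above can be sketched with a small helper. This illustrates the intermediate behaviour only (the final commit in the list prohibits setting the port in the config at all); the helper name and the port name and number used in the usage note are illustrative assumptions.

```python
def append_if_absent(items, new_item, match_keys=("name",)):
    """Return a new list with new_item appended unless an entry already matches.

    An existing entry matches when it has the same value as new_item for
    any key in match_keys (for example the same name or the same port number).
    """
    for existing in items:
        if any(key in existing and existing[key] == new_item.get(key)
               for key in match_keys):
            return list(items)
    return list(items) + [dict(new_item)]
```

For a port, matching on both keys, as in `append_if_absent(ports, {"name": "katib-api", "containerPort": 6789}, ("name", "containerPort"))`, avoids adding a second entry when the user already declared the same port number under a different name.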
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Narrow down rbac
* Narrow down rbac
* Narrow down rbac
* Narrow down rbac
* Narrow down rbac
* Narrow down rbac
* Narrow down rbac
* Update tekton and argo docs
* Update tekton and argo docs
* Upgrade manifests to enable authorization check mechanisms for katib-UI
Changes to install-with-kubeflow manifests:
* Enable istio sidecar injection for katib-ui component
* Add AuthorizationPolicy to allow only istio-ingressgateway
to talk to katib-ui [user traffic].
* Set APP_DISABLE_AUTH ENV var to false when in kubeflow-mode
to enable authorization checks in UI's backend
* Extend the RBAC permissions of katib-ui so it can create SAR objects
when in kubeflow-mode
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* UI(back): Secure /katib/fetch_trial/ route
Introduce authn/authz checks in the backend
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Remove duplicate dependencies
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Move patch to a separate file
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Pin the h5py version with 3.7.0
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add a --prefer-binary flag to 'pip install' command
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* [SDK] Use Katib SDK for E2E tests
* Fix pvc deletion
* Add list_suggestions API
* Remove wait from edit Experiment function
* Add shell to GitHub action
* Add protobuf package to Katib SDK
* Add Experiment Timeout to 40 min
* Modify SDK Examples
* Fix example text
* Change to custom_api
* Enable verbose logging for Katib E2E
* Use expected condition arg
* Add timeout and delete options
* Modify logging to debug
* Use read API to check resource status
* frontend: Show message in case of uncompleted trial instead of the graph
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update COMMIT file
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Define the spinner text
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Backend for getting logs of a trial
* Check Write return + use PrimaryPodLabels
* Add auth + use constants for labels + cleanup
* TODO comment for using controller-runtime client for logs
* Authorization for list pods and get logs, reduce RBAC
* Use corev1 for specifying resources, edit kf install RBAC
* Check namespace and trialName from request
* Remove auth checks for listing the pods
* Use context.Background()
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Update Owners file
* add early stopped trials in converter
* error out early
* Update pkg/suggestion/v1beta1/internal/trial.py
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* add incomplete trial filter
* fix ut
* more fixes
* filter on es
* enrich existing tests
Co-authored-by: shaowei su <shaowei.su@airbnb.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* frontend: Remove TSLint
Remove TSLint since it's deprecated.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Introduce ESLint
Introduce ESLint by using the following Angular command [1]:
ng add @angular-eslint/schematics
[1] https://github.com/angular-eslint/angular-eslint#quick-start-with-angular-v12-and-later
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Fix linting errors
Fix linting errors.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* gh-actions: Add GH action to run a lint check
Introduce a Github action to run a lint check.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update the COMMIT file
* Update the COMMIT file.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Support paging/sorting/filtering in trials table (#1441)
* Make trials table support paging, sorting and filtering.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Create unit tests for trials table (#1441)
* Create unit tests for trials-table component.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Create a yaml tab for Trials (#2011)
* Create a dedicated yaml tab for Trials.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Rename components
* Rename trial-modal component to trial-details.
* Rename trial-modal-overview component to trial-overview.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* [Test] Reduce Katib GitHub Action Runs
* Add cancel-in-progress flag
* Use single job for Charmed Katib
* Add cancel-in-progress for all actions except publish
* Bump ubuntu to 20.04 for Charmed tests
* UI(back): Add authorization mechanisms in new Katib UI backend
* Introduce helper ENV vars and functions for authentication and
authorization checks. The authz checks are using SubjectAccessReview
objects.
* BACKEND_MODE={dev,prod}: skip authz when in dev mode
* APP_DISABLE_AUTH={bool}: skip authz if explicitly requested
* Introduce a client-go client to construct SubjectAccessReview objects.
* Before any request proceeds to the K8s api-server:
* check if authorization must be skipped (BACKEND_MODE, APP_DISABLE_AUTH)
* check if a username is provided in request Header
* query the K8s api-server with SAR to ensure that the user has
appropriate access privileges
* Replace the /katib/fetch_experiment/ route with /katib/fetch_namespaced_experiments.
This route expects a namespace as a query parameter from which all experiments will be fetched.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
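The request flow listed above (skip check, username check, access review) can be sketched as below. This is not the backend's code, which is written in Go: the header name `kubeflow-userid`, the defaults for both environment variables, and the `check_access` callable standing in for a SubjectAccessReview call are all assumptions made for illustration.

```python
import os
from typing import Callable, Mapping

# Assumption: header carrying the authenticated username. The commit text
# only says that a username must be present in a request header.
USER_HEADER = "kubeflow-userid"

# (username, verb, resource, namespace) -> allowed. In a real backend this
# would send a SubjectAccessReview to the Kubernetes API server.
AccessCheck = Callable[[str, str, str, str], bool]


def auth_disabled(environ: Mapping[str, str]) -> bool:
    """Return True when authorization checks should be skipped."""
    if environ.get("BACKEND_MODE", "prod") == "dev":  # assumed default
        return True
    flag = environ.get("APP_DISABLE_AUTH", "false")  # assumed default
    return flag.strip().lower() == "true"


def authorize(headers: Mapping[str, str], verb: str, resource: str,
              namespace: str, check_access: AccessCheck,
              environ: Mapping[str, str] = os.environ) -> None:
    """Raise PermissionError unless the request may proceed."""
    if auth_disabled(environ):
        return
    username = headers.get(USER_HEADER, "")
    if not username:
        raise PermissionError(f"no username in the {USER_HEADER!r} header")
    if not check_access(username, verb, resource, namespace):
        raise PermissionError(
            f"user {username!r} may not {verb} {resource} in {namespace!r}"
        )
```

Passing the access check in as a callable keeps the three-step flow testable without a cluster.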
* UI(front): Provide a namespace as a query parameter
This is needed for the new /katib/fetch_namespaced_experiments route.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Update README for running locally without auth
Update the README of the web app to expose that devs should set
APP_DISABLE_AUTH=true when running locally, since there's no authnz when
running locally.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* remove duplicated variable types
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* proper error handling.
* switch to Go's built-in errors package.
* set appropriate verbs when constructing SAR objects.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Use controller-runtime client to create SAR objects
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* fix backend routes.
* '/katib/fetch_namespaces' to fetch experiments in a namespace
* 'FetchExperiments' handler
* hit the appropriate route from frontend and provide namespace as a
query parameter to fetch experiments
* remove BACKEND_MODE env var in
favour of the more specific APP_DISABLE_AUTH
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* Add constants for CRUD actions
* Add plural for experiments and suggestions as constants
* Add GetUsername logic under IsAuthorized and handle errors properly
* Have APP_DISABLE_AUTH by default as true, since currently Katib
doesn't support this feature in standalone mode.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* [bugfix] Fix value passing bug in New Experiment form
Add missing logic in New Experiment form in order to pass the value
of the editor content in Metrics Collector tab, when Kind is set to
Custom.
* Adjust unit tests for custom yaml metrics collector
* kwa(front): Add new Editor component
Import new Editor component from Kubeflow Common Library and replace
all instances of previous Ace Editor.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* Update COMMIT file to a more recent one in Kubeflow
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Introduce COMMIT file that contains the commit where Katib needs to
checkout inside Kubeflow's common code in order to be built. This file
was integrated in the following places as well, thus a developer may
only update one file each time we need to checkout to a newer commit.
- Dockerfile
- GH actions
- README.md
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Fix 500 error when refreshing KWA's detail page by also adding the
namespace variable as a query param to the route.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Introduce the kfp-run component as a distinct component.
* Make the pipeline button a link.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* gh-actions: Extend action to run Frontend Unit tests
Extend Frontend Test action to run also KWA frontend unit tests.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions: Exclude actions when there are only UI changes
Prevent the following workflows when a PR contains changes that affect
only the frontend:
- Charmed Katib
- E2E Test with darts-cnn-cifar10
- E2E Test with enas-cnn-cifar10
- E2E Test with mxnet-mnist
- E2E Test with pytorch-mnist
- E2E Test with simple-pbt
- E2E Test with tf-mnist-with-summaries
- Go Test
- Publish AutoML Algorithm Images
- Publish Katib Core Images
- Publish Trial Images
- Python Test
- Shellcheck
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions: Add action to build Katib UI image.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* Rename the Age header to Created at and right align it.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Support for k8s v1.25 in CI
* Support for k8s v1.25 in CI
* Support for k8s v1.25 in CI
* Add Readme changes
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Move the status column to the first position of the trials table as
it is in the other tables.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Make KWA's main table responsive and add toolbar
* Add a top row toolbar with the title of the app and the button to
create a new Experiment.
* Replace the card with a responsive table that shows the items. The
component also has a paginator.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update Dockerfile and README file
Update Dockerfile and README file to check out to the commit in master
branch from the Kubeflow repository that includes the corresponding
changes.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Import echarts and ngx-echarts (#1879)
* Import echarts module and ngx-echarts directive for Echarts.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Remove trials graph component (#1879)
* Remove trials graph component.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Introduce graph's component (#1879)
* Create a new component that uses Echarts Parallel Graph.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Modify graph's wrapper component (#1879)
* Make the wrapper component use the new graph.
* Show the graph when at least one trial is completed.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Parallel Graph unit test (#1879)
* Create unit test for Parallel Graph.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Create Tune API in the Katib SDK
* Add Final to consts
Modify packages_to_install doc
Create validate objective function
* Add GPU TF Image
Change k8s version package
* Create search module
* Fix link in README
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix licence date
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* implement postgres for katib db
* fix yaml lint
* apply go mod tidy
* Update manifests/v1beta1/components/postgres/postgres.yaml
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* refactoring by reviews
- split openconnection to common packages
- add unit test for postgres db
* change to install only mysql by default
* remove useless import
* add postgres kustomization and e2e test for it
* change mysql installation files to be variable
* fix shell scripts
* fix lint
* fix image name
* set default value on github action workflow
* make postgres deployment to use pvc
* temporarily comments
* uncomment invalid experiments
* test with for loop
* sleep until controller created well
* add some comments
* Update pkg/db/v1beta1/postgres/postgres.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update pkg/db/v1beta1/postgres/postgres_test.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* refactor by reviews
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* docs: update new algorithm service details
* feat: trial augmentation strategy
* feat: pbt suggestion service
* feat: PbtTemplate and associated test image
* feat: introduce annotation field to trial specifications
* feat: trial assignment changes to support annotations from suggestion
- Add new Annotation types to suggestion_types.go
- Add Annotation object and update Trial parser in trial.py
* feat: update pbt suggestion to use new Annotation api
- Suggestion uses exact match to track spawned trials
- Trials that get transmitted, but not created (or added to experiment) are added back to the respawn pool (population_size consistency)
* chore: gofmt and black run across PBT changes
* feedback: remove tf summary export, change default print unit, reduce range to be percentage compatible.
* feedback: move PBT template to example.
* feedback: changes to inject_webhook and utils.
- Rename mutateVolume to mutateMetricsCollectorVolume
- Add addContainerVolumeMount
- Add getPrimaryContainerIndex
* feedback: change suggestion mutation mount variable name and add to consts
* feedback: Add trial_names to GetSuggestionsReply and change suggestion path to <experiment>/<trial>
* feedback: removed unnecessary checks and moved to async pbt implementation
* feedback: update trial name override location and change annotations override to labels.
* feedback: add pbt to github workflow
* feedback: move labels to ParameterAssignments in GetSuggestionsReply and cleanup pbt.yaml.
* feedback: remove operator changes
* feedback: GHA updates
* feedback: new formatting changes
* feedback: add suggestion-pbt to gh-actions build-load.sh.
* fix: missing pbt->simple-pbt name changes, add simple-pbt to update-images.sh update yaml function (causing failing gha).
* feedback: add pointer to website from main readme for pbt
* include MetricsUnavailable condition to Complete in Trial
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format,
since the trial-controller sets the Succeeded condition to False on the Trial if metrics are unavailable in the Katib DB, as described in https://github.com/kubeflow/katib/issues/1343.
So we also include the MetricsUnavailable condition in Complete for Trial.
* add gh-actions tasks to verify generated codes
* fix gh-actions workflow
* when the number of Failed Trials reaches maxTrialCount, experiment-controller sets Failed to Experiment status
* fix e2e test
* To avoid being set Failed in Experiment status when and is equal to 0, we need to add condition,
* migrate test-infra to GitHub Actions
* change python base image to python:3.9-slim
* move from minikube to kind
* separate darts container images by device type
* run e2e test with multi kubernetes version
* disable deploying katib-ui by default
* change kind kubernetes cluster version
* fix update-images.sh
* fix shellcheck
* fix script to setup katib
* split enas, darts and tf-mnist-with-summaries with trial images
* specify experiments in pytorch-mnist-e2e-test
* reduce storage size for mysql
* fix trial image name for enas and darts
* fix trial image name for file-metrics-collector-with-json-format
* change kubernetes versions
* do not run e2e test on push master branch
* remove backoffLimit field in examples
* Deprecate Katib presubmit on optional-test-infra
This PR serves as sub-PR to deprecate katib presubmit on optional-test-infra.
* Update prow_config.yaml
Update config file
* upgrade the tensorflow version to address some security issues
* fix enas example codes
* upgrade tensorflow to v2.9.1 and tensorflow-aarch64 to v2.9.0
* install protobuf (>= 3.9.2, < 3.20) for tensorflow-aarch64
* upgrade the grpc_health_probe version to v0.4.11 to resolve security vulnerability CVE-2022-27191
* increase batch size of tfjob-mnist-with-summaries
* add primaryPodLabels to tfjob's example
* upgrade kubebuilder version from v2.3.0 to v3.2.0
* fix envtest for experiment-controller
* fix suite test
To avoid the `timeout waiting for process kube-apiserver to stop` error, we must use the `context.WithCancel`.
Ref: https://github.com/kubernetes-sigs/controller-runtime/issues/1571#issuecomment-945535598
* update Go version to v1.17 in kubeflow-katib-presubmit
To avoid the `../../../../pkg/mod/k8s.io/client-go@v0.22.2/plugin/pkg/client/auth/exec/metrics.go:21:2: package io/fs is not in GOROOT (/usr/local/go/src/io/fs)` error,
we must use Go v1.16 or later, but as described in https://github.com/kubeflow/training-operator/issues/1581,
we do not have permission to update `public.ecr.aws/j1r0q0g6/kubeflow-testing:latest` so we need to update it in this.
Due to pypa/setuptools_scm#713, we are experiencing errors when building
charms both locally and in the CI. This change will prevent the error
from happening until the issue is fixed.
* do not check trial parameter in experiment parameters if it's trial's metadata
* revert unnecessary change
* add handle Labels[label] and Annotations[annotation]
* fix test description
* Add prometheus scraping and grafana support to charmed operator
* Upgrade black version to 22.3.0 to fix issue with click dependency
* fix: unpin `black`, fix formatting errors
* fix: minor refactor of prometheus integration
Revert to defaults for relation names and paths, where appropriate.
* fix: apply operator linting checks only to source code
* chore: point katib-db to charm in charmhub
* fix: remove unneeded handling of prometheus relation event
* feat: Add template dashboard and alert rules
These are not working properly. When connecting to grafana, the dashboard shows up but does not populate properly with data. The data source appears wrong
* fix: handle leader-elected events
Without this, upgrade-charm does not work.
* fix: correctly template the sample grafana dashboard
* fix: remove placeholder grafana/prometheus files
* fix: bump wait time to avoid flaky test failure
Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>
* Set `kube-system` as the suggested namespace
Signed-off-by: Elias Koromilas <elias.koromilas@gmail.com>
* Replace broken link
Signed-off-by: Elias Koromilas <elias.koromilas@gmail.com>
* support JSON format logs in file-metrics-collector
* review: convert fileFormat to type FileSystemFileFormat
* Update cmd/metricscollector/v1beta1/file-metricscollector/main.go
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: remove func (f FileSystemFileFormat) String()
* review: get metricRegList only when the format is TEXT
* review: change var name in a script for e2e
* review: explicitly specify cloudml-hypertune in the Dockerfile
* review: use reflect.DeepEqual instead of go-cmp.Diff
* review: stop using 'JSON' directly in error statements
* review: install specific version cloudml-hypertune
* review: get objType in the updateStopRules function
* review: save optimalObjValue across multiple stopRules
* review: add warning messages to parseTimestamp func
* review: generate test files with go test command
* review: change api for new feature
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Upgrade black version to 22.3.0 to fix issue with click dependency
* fix: unpin `black`, fix formatting errors
Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>
* bump Python to 3.9
* modify script to build container image
* fix example for enas
* update scripts to modify image name in ci
* review: change docker build command
* review: use new tf-mnist-with-example in Ci for tfjob
* review: refactor tf-mnist-with-summaries
* review: remove Dockerfile.ppc64le for new-ui
* review: update docs related tf-mnist-with-summaries
* TFEventMetricsCollector supports TF>=2.0 and stop supporting TF <=1.x
* review: add help command to scripts/v1beta1/build.sh
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* fix unit test for tfevent-metricscollector
* review: generate tf event files on CI
* add test command to Makefile
* update publish-trial-images
* update update-images.sh
* reduce batch size
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* implement validation for early stopping
* fix some documents
* fix error messages
* implement gRPC API to verify parameters for early stopping
* review: use early_stopping as gRPC API
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: fix error description
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: remove t.Run
* review: remove condition to verify algorithmName for early stopping
* remove description about updating gRPC API docs in kubeflow website
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* fix: check if parameter references exist in experiment parameters
* Fix validator test
* Update some comments and test descriptions
* Check trial parameter reference only when experiment parameters are not empty
* Add a test for the case 'spec.parameters' is empty
* fix: close mysql statement and rows resources when sql exec end
* style: move code to other place
* style: correct the typo(Prepare)
Co-authored-by: 陈文军 <chenwenjun01@corp.netease.com>
* Init commit with e2e example
* Add Early Stopping and MPI Examples
* Add MPI to README
* Modify SDK for MPI example
* Modify doc
* Update Early Stopping example
* Finish e2e example
* Modify links for KFP guide
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com>
Co-authored-by: Seongjin Kim <seongjinkim1123@gmail.com>
* add cert-generator command
* go mod tidy
* fix gofmt lint check
* fix unittest for katib-cert-generator
* remove unnecessary test code
* fix comment
* review: fix kubeClient
* review: stop to use k8s.io/utils
* review: delete containers[].securityContext
* review: change directory name for cert-generator
* review: fix const
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: stop to use k8s.io/utils
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: delete containers[].securityContext
* review: change directory name for cert-generator
* review: fix const
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: take webhook domain as consts
* review: keep the name testDescription and err
* review: do not try to patch webhook configuration in many times
* review: fix some functions to generate cert
* review: add comments
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: remove v1beta1 from admissionReviewVersions in ValidatingWebhookConfiguration and MutatingWebhookConfiguration
* fix comments
* review: remove the securityContext field
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* Use YAML input if TrialParams are missing
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Separate TrialTemplates in two words
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* SDK: change list apis to return objects as default
- change list_trials, list_experiments to return list of objects as
a default
- also, give 'in_short' parameter for who wants only name and status
as before
* [enh]: change return type from List[dict] to List[V1beta1Experiment]
* [enh]: deserialize dict to katib's custom class
* [docs]: refactor KatibClient docs
* change deserialize method location to utils
* remove useless import
* Add objects necessary for deserialization in swagger
* use fakeresponse rather than duplicating codes
* Update sdk/python/v1beta1/kubeflow/katib/utils/utils.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update sdk/python/v1beta1/kubeflow/katib/utils/utils.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
| [Ant Group](https://www.antgroup.com/) | [@ohmystack](https://github.com/ohmystack) | Automatic training in Ant Group internal AI Platform |
| [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
| [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
| [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
| [CERN](https://home.cern/) | [@d-gol](https://github.com/d-gol) | Hyperparameter tuning within the ML platform on private cloud |
| [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
| [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |
to build multi-arch images. Check source code as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
To use your custom images for the Katib components, modify
| katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
## DB Manager Flags
Below is a list of command-line flags accepted by Katib DB Manager:
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying the firewall rule and you have to update the master
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib Controller has the internal `cert-generator` to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.
Once the `cert-generator` has finished, the Katib controller starts registering controllers such as `experiment-controller` with the manager.
You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
NOTE: Katib also supports [cert-manager](https://cert-manager.io/) for generating certificates for the admission webhooks instead of the `cert-generator`.
You can find the installation with the cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](../pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).
## Code Style
### pre-commit
Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
pre-commit`) and run `pre-commit install` from the root of the repository at
least once before creating git commits.
The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all`, in which case,
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
Katib consists of several components as shown below. Each component is running
on Kubernetes as a deployment. Each component communicates with others via GRPC
and the API is defined at `pkg/apis/manager/v1beta1/api.proto`.
- Katib main components:
- `katib-db-manager` - the GRPC API server of Katib which is the DB Interface.
- `katib-mysql` - the data storage backend of Katib using mysql.
- `katib-ui` - the user interface of Katib.
- `katib-controller` - the controller for the Katib CRDs in Kubernetes.
## Web UI
Katib provides a Web UI.
During 1.3 we've worked on a new iteration of the UI, which is rewritten in
Angular and is utilizing the common code of the other Kubeflow [dashboards](https://github.com/kubeflow/kubeflow/tree/master/components/crud-web-apps).
The users are currently able to list, delete and create Experiments in their
cluster via this new UI as well as inspect the owned Trials. Two important
missing functionalities are the ability to edit the Trial template ConfigMaps
and the ability to view Neural Architecture Search models. Check [this Project](https://github.com/kubeflow/katib/projects/1)
to monitor the current progress.

To use the old Katib UI you can update the Katib image `newName` with the previous
image tag `docker.io/kubeflowkatib/katib-ui:v0.11.1` in the [Kustomize](./manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L29)
manifests.
## GRPC API documentation
Check the [Katib v1beta1 API reference docs](https://www.kubeflow.org/docs/reference/katib/v1beta1/katib/).
## Installation
For standard installation of Katib with support for all job operators,
@@ -87,25 +89,30 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
}
}
func main() {
var connectTimeout time.Duration
var listenAddress string
flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
"default", "The implementation of suggestion interface in experiment controller (default)")
flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
// TODO (andreyvelich): Currently it is not possible to set different webhook service name.
// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
// TODO (andreyvelich): Currently it is not possible to store webhook cert in the local file system.
// flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
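The `trial-resources` flag takes values in the form `Kind.version.group`. The following is a minimal Python sketch of how such a value could be split into its three parts; it is illustrative only and is not the controller's actual parser (which is written in Go). The names `GroupVersionKind` and `parse_trial_resource` are hypothetical and introduced here just for the example.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    """Hypothetical container mirroring the Kind.version.group triple."""

    group: str
    version: str
    kind: str


def parse_trial_resource(value: str) -> GroupVersionKind:
    # Split on the first two dots only, because the group itself
    # (e.g. "kubeflow.org") may contain dots.
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {value!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)


gvk = parse_trial_resource("TFJob.v1.kubeflow.org")
```

Splitting with a limit of two keeps a dotted group such as `kubeflow.org` intact, while a value with fewer than three parts is rejected.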
## Workflow design
Please see [workflow-design.md](./workflow-design.md).
## Katib admission webhooks
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
# Document about how to add a new algorithm in Katib
## Implement a new algorithm and use it in Katib
### Implement the algorithm
The design of Katib follows the `ask-and-tell` pattern:
> They often follow a pattern a bit like this:
>
> 1. ask for a new set of parameters
> 1. walk to the Experiment and program in the new parameters
> 1. observe the outcome of running the Experiment
> 1. walk back to your laptop and tell the optimizer about the outcome
> 1. go to step 1
When an Experiment is created, one algorithm service is created for it. Katib then asks for new sets of parameters via the `GetSuggestions` GRPC call. After that, Katib creates new trials according to those sets and observes the outcome. When the trials are finished, Katib tells the algorithm the metrics of the finished trials and asks for another batch of new sets.
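The loop described above can be sketched in plain Python. This is a toy illustration of the ask-and-tell pattern under stated assumptions, not Katib code: `RandomSearchOptimizer`, `run_trial`, and `run_experiment` are hypothetical names, the search space is a single `lr` parameter, and the objective is a made-up quadratic loss. In Katib the "ask" step corresponds to the `GetSuggestions` call and the trials run on Kubernetes.

```python
import random


class RandomSearchOptimizer:
    """Toy ask-and-tell optimizer (hypothetical, not a Katib class)."""

    def __init__(self, low, high, seed=0):
        self.low = low
        self.high = high
        self.rng = random.Random(seed)
        self.history = []  # (parameters, metric) pairs reported via tell()

    def ask(self, count):
        # Propose `count` new parameter sets.
        return [{"lr": self.rng.uniform(self.low, self.high)} for _ in range(count)]

    def tell(self, parameters, metric):
        # Record the observed outcome of a finished trial.
        self.history.append((parameters, metric))

    def best(self):
        # Return the (parameters, metric) pair with the lowest metric.
        return min(self.history, key=lambda item: item[1])


def run_trial(parameters):
    # Stand-in objective: a quadratic loss minimized at lr = 0.1.
    return (parameters["lr"] - 0.1) ** 2


def run_experiment(optimizer, rounds, parallel):
    for _ in range(rounds):
        # Ask for a batch of parameter sets, run each trial, then tell
        # the optimizer what was observed before asking again.
        for parameters in optimizer.ask(parallel):
            optimizer.tell(parameters, run_trial(parameters))
    return optimizer.best()


optimizer = RandomSearchOptimizer(0.01, 0.5)
best_parameters, best_metric = run_experiment(optimizer, rounds=5, parallel=3)
```

A random search ignores the reported metrics when proposing new sets; a model-based algorithm would use `history` inside `ask` to choose more promising parameters.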
The new algorithm needs to implement `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1beta1/api.proto). One sample algorithm looks like:
```python
from pkg.apis.manager.v1beta1.python import api_pb2
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.internal.search_space import HyperParameter, HyperParameterSearchSpace
from pkg.suggestion.v1beta1.internal.trial import Trial, Assignment
from pkg.suggestion.v1beta1.hyperopt.base_service import BaseHyperoptService
from pkg.suggestion.v1beta1.base_health_service import HealthServicer
# Inherit SuggestionServicer and implement GetSuggestions.
```
Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main function and Dockerfile. The new GRPC server should serve in port 6789.
Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt).
Then build the Docker image.
### Use the algorithm in Katib
Update the [Katib config](../manifests/v1beta1/components/controller/katib-config.yaml)
and [Katib config patch](../manifests/v1beta1/installs/katib-standalone/katib-config-patch.yaml)
| [Bridging into Python Ecosystem with Cloud-Native Distributed Machine Learning Pipelines](https://github.com/terrytangyuan/public-talks/tree/main/talks/bridging-into-python-ecosystem-with-cloud-native-distributed-machine-learning-pipelines-argocon-2021) | [Yuan Tang](https://terrytangyuan.github.io/about/) | ArgoCon | 2021-12-08 |
| [AutoML and Training WG Summit July 2021](https://youtube.com/playlist?list=PL2gwy7BdKoGd9HQBCz1iC7vyFVN7Wa9N2) | Kubeflow Community | Kubeflow Summit | 2021-07-16 |
| [MLOps and AutoML in Cloud-Native Way with Kubeflow and Katib](https://youtu.be/33VJ6KNBBvU) | Andrey Velichkevich | MLREPA | 2021-04-25 |
| [A Tour of Katib's new UI for Kubeflow 1.3](https://youtu.be/1DtjB_boWcQ) | Kimonas Sotirchos | Kubeflow Community Meeting | 2021-03-30 |
| [New UI for Kubeflow components](https://youtu.be/OKqx3IS2_G4) | Stefano Fioravanzo | Kubeflow Community Meeting | 2020-12-08 |
| [Using Pipelines in Katib](https://youtu.be/BszcHMkGLgc) | Andrey Velichkevich | Kubeflow Community Meeting | 2020-11-10 |
| [From Notebook to Kubeflow Pipelines with HP Tuning](https://youtu.be/QK0NxhyADpM) | Stefano Fioravanzo, Ilias Katsakioris | KubeCon | 2020-09-04 |
| [Distributed Training and HPO Deep Dive](https://youtu.be/KJFOlhD3L1E) | Andrew Butler, Qianyang Yu, Tommy Li, Animesh Singh | Kubeflow Dojo | 2020-07-17 |
| [Hyperparameter Tuning with Katib](https://youtu.be/nIKVlosDvrc) | Stephanie Wong | Kubeflow 101 | 2020-06-21 |
| [Hyperparameter Tuning Using Kubeflow](https://youtu.be/OkAoiA6A2Ac) | Richard Liu, Johnu George | KubeCon | 2019-07-05 |
| [Kubeflow Katib & Hyperparameter Tuning](https://youtu.be/1PKH_D6zjoM) | Richard Liu | Kubeflow Community Meeting | 2019-03-29 |
| [Neural Architecture Search System on Kubeflow](https://youtu.be/WAK37UW7spo) | Andrey Velichkevich, Kirill Prosvirov, Jinan Zhou, Anubhav Garg | Kubeflow Community Meeting | 2019-03-26 |