* add e2e test for tune api
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* upgrade training-operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* specify the version of training operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix num_labels error and update the version of training operator controller
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the version of training operator
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* debug
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check import path of HuggingFaceModelParams
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the version of training operator sdk
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the name of experiment
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add step of checking pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add check
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check reason for imagepullbackoff
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* revert timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* extend timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update training operator sdk version
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* rerun tests
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update the function of getting logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add the step of describing pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check disk space
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase timeout limit
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of controller and events
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* change work directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of kubelet
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of kubelet
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase cpu
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of training operator
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the use of resources
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check the logs of container 'pytorch' and 'storage_initializer'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix error of checking use of resources
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add other checks to find the error reason
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set 'storage_config'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* reduce the number of tests
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* Check container runtime logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set the driver of minikube as docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* set the driver of minikube to none
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check logs of pod
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* check memory usage
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* increase 'termination_grace_period_seconds' in podspec
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix annotations error
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* restart docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete restarting docker
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* use original docker data directory
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update installation of Katib SDK with extra requires
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* test trainer image built with cpu
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add action of free up disk space (including move docker data directory)
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete unnecessary checks and update the part of fetching pod description and logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* delete fetching pod logs
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* add blank line at the end of free-up-disk-space yaml file
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update experiment name
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update test function name to be consistent with experiment name
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* move import statements inside the function
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* apply pprint for the logging output
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* update experiment names
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the sequence of arguments in 'trial_template'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* test example in user guide
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix access token error
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the error of setup
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix the error of setup
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* reverse back
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* fix format
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
---------
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
* Fix Istio sidecar injection by moving from annotations to labels
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* Update Istio sidecar injection from annotations to labels across the codebase
Replace annotations with labels for Istio sidecar injection according to Istio recommendations. Update conformance tests, examples, constants, composers, and utilities to use the new label-based approach consistently.
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* fix: Update SuggestionLabels function and composer implementation for Istio label injection
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* Fix linting issues in mpi-job-horovod.py
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
* update: function moved from annotations to labels
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
---------
Signed-off-by: madmecodes <ayushguptadev1@gmail.com>
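As a rough illustration of the annotation-to-label move described above, the sketch below rewrites pod metadata so that sidecar injection is disabled through a label. The `sidecar.istio.io/inject` key is the one Istio documents; the helper name `with_istio_injection_disabled` and the plain-dict metadata shape are hypothetical and not taken from the Katib codebase.

```python
import copy

ISTIO_INJECT_KEY = "sidecar.istio.io/inject"


def with_istio_injection_disabled(metadata):
    """Return a copy of pod metadata that disables Istio injection via a label.

    Hypothetical helper for illustration only.
    """
    result = copy.deepcopy(metadata)

    # Drop the legacy annotation form if it is present.
    annotations = result.get("annotations") or {}
    annotations.pop(ISTIO_INJECT_KEY, None)
    if annotations:
        result["annotations"] = annotations
    else:
        result.pop("annotations", None)

    # Set the label form, keeping any labels that already exist.
    labels = dict(result.get("labels") or {})
    labels[ISTIO_INJECT_KEY] = "false"
    result["labels"] = labels
    return result
```

The input is copied rather than mutated, so existing callers that still read the original metadata are unaffected.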
* Support old-style TensorFlow events (tensorboard)
Fixes: https://github.com/kubeflow/katib/issues/2466
Signed-off-by: Gary Miguel <garymm@garymm.org>
* format
Signed-off-by: Gary Miguel <garymm@garymm.org>
* test
Signed-off-by: Gary Miguel <garymm@garymm.org>
* don't continue loops
Signed-off-by: Gary Miguel <garymm@garymm.org>
* format
Signed-off-by: Gary Miguel <garymm@garymm.org>
---------
Signed-off-by: Gary Miguel <garymm@garymm.org>
* improve pvc name error message by failing early with a clear message and a correct name example
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* fix lint
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* fix lint
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* raise value error for wrong name format by reconciliation
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* revert created utils
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve test case name
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve value error message
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
* improve code flow
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
---------
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
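A minimal sketch of the fail-early idea from the commits above, assuming the name must satisfy the Kubernetes DNS subdomain rules (RFC 1123). The function name `validate_pvc_name` and the exact error wording are hypothetical; the real SDK check may differ.

```python
import re

# RFC 1123 DNS subdomain: lowercase alphanumerics, '-' or '.', starting and
# ending with an alphanumeric character, at most 253 characters in total.
_DNS_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def validate_pvc_name(name):
    """Raise ValueError before any API call if the name cannot be a PVC name."""
    if len(name) > 253 or not _DNS_SUBDOMAIN.fullmatch(name):
        raise ValueError(
            f"Invalid PVC name '{name}'. It must consist of lowercase "
            "alphanumeric characters, '-' or '.', and must start and end with "
            "an alphanumeric character, for example 'my-experiment-1'."
        )
```

Validating on the client side means the user sees the naming rule and a valid example immediately, instead of a later failure surfaced during reconciliation.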
* Sort experiments by descending creation date by default in katib-ui
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: Update "renders every Experiment name into the table" test to not check order
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: Update "renders every Experiment name into the table" test in order of startTime
Signed-off-by: Xinmin Du <2812493086@qq.com>
---------
Signed-off-by: Xinmin Du <2812493086@qq.com>
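The default ordering described above can be sketched as follows. The UI itself is not written in Python; this is only an illustration, and the record shape (a dict with a `startTime` field holding an RFC 3339 UTC timestamp) is an assumption.

```python
from datetime import datetime


def sort_newest_first(experiments):
    """Return experiments ordered by start time, newest first."""

    def started_at(experiment):
        # Assumes UTC timestamps such as '2024-01-31T12:00:00Z'.
        return datetime.strptime(experiment["startTime"], "%Y-%m-%dT%H:%M:%SZ")

    # sorted() returns a new list, so the caller's list keeps its order.
    return sorted(experiments, key=started_at, reverse=True)
```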
* Upgrade klog dependency to v2
Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
* fix: fix conflict with k8s update
Signed-off-by: Xinmin Du <2812493086@qq.com>
---------
Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <dux.m.in@sjtu.edu.cn>
Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
* Change the isort profile to black, and add pkg dir for black and flake8
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fix the formatting
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fix flake8 lint issues
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
---------
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* added validator for feasible space distribution
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
validation logic fixed
added unit test
added unit test for valid distribution
requested changes made
Update pkg/webhook/v1beta1/experiment/validator/validator.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
fmt
* fmt fix
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
---------
Signed-off-by: Shashank Mittal <shashank.mittal.mec22@itbhu.ac.in>
* Add black formatter and flake8 linter to pre-commit
Also adds the flake8 config file
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fixes black formatting
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* Fixes flake8 linting errors
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
---------
Signed-off-by: Ignas Baranauskas <ibaranau@redhat.com>
* added logger for katib_client module
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* added API_VERSION as a constant
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* updated the KatibClient constructor to match the TrainingClient constructor
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* added test for create_experiment in katib_client
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
---------
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
* DB: Add env to skip DB creation
* DB: Rename var name & Remove new function
* Migration -> Initialization
* Remove GetBoolEnvOrDefault
* DB: Rearrange dependencies
* Wait for the certs to be mounted inside the container
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Initialize fullServiceDomain when adding certgenerator to the manager
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Output logs every 15 seconds if the certs don't yet exist in the container
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
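The wait-and-log behaviour from the commits above can be sketched like this. The controller itself is written in Go; this Python version only illustrates the polling pattern, and the function name, parameters, and log text are hypothetical.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


def wait_for_certs(paths, log_interval=15.0, poll_interval=1.0, timeout=None,
                   sleep=time.sleep, clock=time.monotonic):
    """Block until every path exists, logging periodically while waiting.

    Returns True once all files exist, or False if `timeout` seconds pass first.
    """
    start = clock()
    last_log = start
    while not all(os.path.exists(p) for p in paths):
        now = clock()
        if timeout is not None and now - start >= timeout:
            return False
        # Emit a progress line at most once per log_interval seconds.
        if now - last_log >= log_interval:
            logger.info("Waiting for certs to be mounted: %s", ", ".join(paths))
            last_log = now
        sleep(poll_interval)
    return True
```

Polling more often than logging keeps startup latency low once the files appear, without flooding the logs while they are still missing.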
* Consolidate the katib-cert-generator to the katib-controller
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use deployed secret instead of creating a new secret when the cert-generator saves certs on secret
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename secretName with webhookSecretName in the .init.certGenerator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix manifests
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Remove unneeded comments
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Restore unintentionally deleted log
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Rename package cert-generator with certgenerator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add test cases to check if the enable is set to true when the webhookServiceName or webhookSecretName is set
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update the developer guide
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Swap liveness probe and readiness probe
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Introduce SSA to the cert-generator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Use the same member names between CertGenerator and KatibConfig
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Disable leader election on the cert-generator
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Drop unneeded fields from SSA patches
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
---------
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
This PR removes the Charmed Operators for Katib, as well as the associated tests. In the past this repo was the source of truth for these operators, but they have since been maintained [here](https://github.com/canonical/katib-operators/) and we've done a poor job of keeping the repos in sync. This commit removes the redundancy.
* disable istio sidecar injection for example manifests
* add namespace as command line arg to python test script
* revert disable istio sidecar injection
* add option to pass trial pod annotations
* split command over multiple lines
* remove redundant config loading
* add resource limit to containers of random experiment's trial spec pod
* update code to support already present annotations
* raise NotImplementedError if trialSpec is different from Job
* add metrics-collector-injection to namespace under test if missing
* build: Update COMMIT file
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* kwa(front): Update the use of SnackBarService
Update the use of SnackBarService in order to pass required data via a
`config` object and provide MAT_SNACK_BAR_DEFAULT_OPTIONS.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
---------
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* backend: Update error message when no logs could be found
* Update the message the backend sends to not just expose that logs are
not there because 'retain' might not be set, but also because the
cluster was scaled down.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Add LOGS tab in Trial details page
In this commit:
* Create a distinct LOGS tab, which displays the trial's logs in the
Trial details page.
* Don't show the backend's error popup for logs, but show the error
message in the admonition.
Signed-off-by: Elena Zioga <elena@arrikto.com>
---------
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Update training operator image in Katib CI
* build: Update COMMIT file
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Make links actual links in experiments table
Make KWA's experiments table links actual links by using the new LinkValue
class.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Make links actual links in trials table
Make KWA's trials table links actual links by using the new LinkValue
class.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Rework the trial graph using echarts
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Remove d3 references
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* kwa(front): Install Cypress
- Install Cypress & npm scripts for UI tests
- Remove Protractor files
- Add README.md file to include UI tests instructions
- Modify .gitignore
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* kwa(front): Add UI tests with Cypress
Add UI tests with Cypress to check that:
- New Experiment form page loads template without errors.
- Index page
* has an "Experiments" title
* lists experiments without errors
* renders every experiment name into the table
* properly renders the Status icon for all experiments
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions(kwa): Add UI tests in test-node action
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* More container fields for SuggestionConfig
* Inline corev1.Container into SuggestionConfig
* Set default value for suggestion container name
* Append suggestion volume and port only if not present
* Deep-Copy base suggestion container
* Check for suggestion container port number as well
* Prohibit suggestion port to be set in suggestion config
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Narrow down rbac
* Update tekton and argo docs
* Upgrade manifests to enable authorization check mechanisms for katib-UI
Changes to install-with-kubeflow manifests:
* Enable istio sidecar injection for katib-ui component
* Add AuthorizationPolicy to allow only istio-ingressgateway
to talk to katib-ui [user traffic].
* Set APP_DISABLE_AUTH ENV var to false when in kubeflow-mode
to enable authorization checks in UI's backend
* Extend the RBAC permissions of katib-ui so it can create SAR objects
when in kubeflow-mode
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* UI(back): Secure /katib/fetch_trial/ route
Introduce authn/authz checks in the backend
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Remove duplicate dependencies
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Move patch to a separate file
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Pin the h5py version with 3.7.0
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Add a --prefer-binary flag to 'pip install' command
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* [SDK] Use Katib SDK for E2E tests
* Fix pvc deletion
* Add list_suggestions API
* Remove wait from edit Experiment function
* Add shell to GitHub action
* Add protobuf package to Katib SDK
* Add Experiment Timeout to 40 min
* Modify SDK Examples
* Fix example text
* Change to custom_api
* Enable verbose logging for Katib E2E
* Use expected condition arg
* Add timeout and delete options
* Modify logging to debug
* Use read API to check resource status
* frontend: Show message in case of uncompleted trial instead of the graph
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update COMMIT file
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Define the spinner text
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Backend for getting logs of a trial
* Check Write return + use PrimaryPodLabels
* Add auth + use constants for labels + cleanup
* TODO comment for using controller-runtime client for logs
* Authorization for list pods and get logs, reduce RBAC
* Use corev1 for specifying resources, edit kf install RBAC
* Check namespace and trialName from request
* Remove auth checks for listing the pods
* Use context.Background()
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Update Owners file
* add early stopped trials in converter
* error out early
* Update pkg/suggestion/v1beta1/internal/trial.py
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* add incomplete trial filter
* fix ut
* more fixes
* filter on es
* enrich existing tests
Co-authored-by: shaowei su <shaowei.su@airbnb.com>
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* frontend: Remove TSLint
Remove TSLint since it's deprecated.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Introduce ESLint
Introduce ESLint by using the following Angular command [1]:
ng add @angular-eslint/schematics
[1] https://github.com/angular-eslint/angular-eslint#quick-start-with-angular-v12-and-later
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Fix linting errors
Fix linting errors.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* gh-actions: Add GH action to run a lint check
Introduce a Github action to run a lint check.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update the COMMIT file
* Update the COMMIT file.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Support paging/sorting/filtering in trials table (#1441)
* Make trials table support paging, sorting and filtering.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Create unit tests for trials table (#1441)
* Create unit tests for trials-table component.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Create a yaml tab for Trials (#2011)
* Create a dedicated yaml tab for Trials.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* frontend: Rename components
* Rename trial-modal component to trial-details.
* Rename trial-modal-overview component to trial-overview.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* [Test] Reduce Katib GitHub Action Runs
* Add cancel-in-progress flag
* Use single job for Charmed Katib
* Add cancel-in-progress for all actions except publish
* Bump ubuntu to 20.04 for Charmed tests
* UI(back): Add authorization mechanisms in new Katib UI backend
* Introduce helper ENV vars and functions for authentication and
authorization checks. The authz checks use SubjectAccessReview
objects.
* BACKEND_MODE={dev,prod}: skip authz when in dev mode
* APP_DISABLE_AUTH={bool}: skip authz if explicitly requested
* Introduce a client-go client to construct SubjectAccessReview objects.
* Before any request proceed to K8s api-server:
* check if authorization must be skipped (BACKEND_MODE, APP_DISABLE_AUTH)
* check if a username is provided in the request header
* query the K8s api-server with SAR to ensure that the user has
appropriate access privileges
* Replace the /katib/fetch_experiment/ route with /katib/fetch_namespaced_experiments.
This route expects a namespace as a query parameter from which all experiments will be fetched.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* UI(front): Provide a namespace as a query parameter
This is needed for the new /katib/fetch_namespaced_experiments route.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Update README for running locally without auth
Update the README of the web app to expose that devs should set
APP_DISABLE_AUTH=true when running locally, since there's no authnz when
running locally.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* remove duplicated variable types
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* proper error handling.
* switch to Go's built-in errors package.
* set appropriate verbs when constructing SAR objects.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Use controller-runtime client to create SAR objects
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* fix backend routes.
* '/katib/fetch_namespaces' to fetch experiments in a namespace
* 'FetchExperiments' handler
* hit the appropriate route from frontend and provide namespace as a
query parameter to fetch experiments
* remove the BACKEND_MODE env var in
favour of the more specific APP_DISABLE_AUTH
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Review fixes
* Add constants for CRUD actions
* Add plural for experiments and suggestions as constants
* Add GetUsername logic under IsAuthorized and handle errors properly
* Have APP_DISABLE_AUTH by default as true, since currently Katib
doesn't support this feature in standalone mode.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
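The decision flow that the commits above arrive at (skip when APP_DISABLE_AUTH is set, otherwise require a username header and ask the API server) can be sketched as below. The backend is written in Go; this Python version is only an illustration. The `can_access` callback stands in for a SubjectAccessReview query, and the header name, the boolean parsing of the env var, and the function signature are all assumptions.

```python
import os


def is_authorized(headers, user_header, can_access, environ=os.environ):
    """Decide whether a request may proceed to the Kubernetes API server.

    headers:     mapping of request header names to values
    user_header: name of the header carrying the username (assumed, configurable)
    can_access:  callable(username) -> bool, a stand-in for a SAR query
    """
    # APP_DISABLE_AUTH defaults to true, as stated in the last review commit.
    disabled = environ.get("APP_DISABLE_AUTH", "true").strip().lower() == "true"
    if disabled:
        return True

    user = headers.get(user_header)
    if not user:
        raise PermissionError(f"missing or empty '{user_header}' request header")

    # Delegate the actual permission decision to the API server.
    return bool(can_access(user))
```

Passing the SAR lookup in as a callback keeps the flow testable without a cluster.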
* [bugfix] Fix value passing bug in New Experiment form
Add missing logic in New Experiment form in order to pass the value
of the editor content in Metrics Collector tab, when Kind is set to
Custom.
* Adjust unit tests for custom yaml metrics collector
* kwa(front): Add new Editor component
Import new Editor component from Kubeflow Common Library and replace
all instances of previous Ace Editor.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* Update COMMIT file to a more recent one in Kubeflow
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Introduce COMMIT file that contains the commit where Katib needs to
checkout inside Kubeflow's common code in order to be built. This file
was integrated in the following places as well, thus a developer may
only update one file each time we need to checkout to a newer commit.
- Dockerfile
- GH actions
- README.md
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Fix 500 error when refreshing KWA's detail page by also adding the
namespace variable as a query param to the route.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Introduce the kfp-run component as a distinct component.
* Make the pipeline button a link.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* gh-actions: Extend action to run Frontend Unit tests
Extend Frontend Test action to run also KWA frontend unit tests.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions: Exclude actions when there are only UI changes
Prevent the following workflows when a PR contains changes that affect
only the frontend:
- Charmed Katib
- E2E Test with darts-cnn-cifar10
- E2E Test with enas-cnn-cifar10
- E2E Test with mxnet-mnist
- E2E Test with pytorch-mnist
- E2E Test with simple-pbt
- E2E Test with tf-mnist-with-summaries
- Go Test
- Publish AutoML Algorithm Images
- Publish Katib Core Images
- Publish Trial Images
- Python Test
- Shellcheck
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* gh-actions: Add action to build Katib UI image.
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
Signed-off-by: Orfeas Kourkakis <orfeas@arrikto.com>
* Rename the Age header to Created at and right align it.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Support for k8s v1.25 in CI
* Support for k8s v1.25 in CI
* Support for k8s v1.25 in CI
* Add Readme changes
* Update training operator image in CI
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Remove deprecated GRPC var
* Support for k8s v1.25 in CI
* Revert "Support for k8s v1.25 in CI"
This reverts commit 16e6fe4b16.
* Move the status column to the first position of the trials table as
it is in the other tables.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Make KWA's main table responsive and add toolbar
* Add a top row toolbar with the title of the app and the button to
create a new Experiment.
* Replace the card with a responsive table that shows the items. The
component also has a paginator.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* build: Update Dockerfile and README file
Update Dockerfile and README file to check out to the commit in master
branch from the Kubeflow repository that includes the corresponding
changes.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Import echarts and ngx-echarts (#1879)
* Import echarts module and ngx-echarts directive for Echarts.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Remove trials graph component (#1879)
* Remove trials graph component.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Introduce graph's component (#1879)
* Create a new component that uses Echarts Parallel Graph.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Modify graph's wrapper component (#1879)
* Make the wrapper component use the new graph.
* Show the graph when at least one trial is completed.
Signed-off-by: Elena Zioga <elena@arrikto.com>
* UI: Parallel Graph unit test (#1879)
* Create unit test for Parallel Graph.
Signed-off-by: Elena Zioga <elena@arrikto.com>
Signed-off-by: Elena Zioga <elena@arrikto.com>
* Create Tune API in the Katib SDK
* Add Final to consts
Modify packages_to_install doc
Create validate objective function
* Add GPU TF Image
Change k8s version package
* Create search module
* Fix link in README
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Fix licence date
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* implement postgres for katib db
* fix yaml lint
* apply go mod tidy
* Update manifests/v1beta1/components/postgres/postgres.yaml
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* refactoring by reviews
- split openconnection to common packages
- add unit test for postgres db
* change to install only mysql by default
* remove useless import
* add postgres kustomization and e2e test for it
* change mysql installation files to be variable
* fix shell scripts
* fix lint
* fix image name
* set default value on github action workflow
* make postgres deployment to use pvc
* temporarily comments
* uncomment invalid experiments
* test with for loop
* sleep until controller created well
* add some comments
* Update pkg/db/v1beta1/postgres/postgres.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* Update pkg/db/v1beta1/postgres/postgres_test.go
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* refactor by reviews
Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* docs: update new algorithm service details
* feat: trial augmentation strategy
* feat: pbt suggestion service
* feat: PbtTemplate and associated test image
* feat: introduce annotation field to trial specifications
* feat: trial assignment changes to support annotations from suggestion
- Add new Annotation types to suggestion_types.go
- Add Annotation object and update Trial parser in trial.py
* feat: update pbt suggestion to use new Annotation api
- Suggestion uses exact match to track spawned trials
- Trials that get transmitted, but not created (or added to experiment) are added back to the respawn pool (population_size consistency)
* chore: gofmt and black run across PBT changes
* feedback: remove tf summary export, change default print unit, reduce range to be percentage compatible.
* feedback: move PBT template to example.
* feedback: changes to inject_webhook and utils.
- Rename mutateVolume to mutateMetricsCollectorVolume
- Add addContainerVolumeMount
- Add getPrimaryContainerIndex
* feedback: change suggestion mutation mount variable name and add to consts
* feedback: Add trial_names to GetSuggestionsReply and change suggestion path to <experiment>/<trial>
* feedback: removed unnecessary checks and moved to async pbt implementation
* feedback: update trial name override location and change annotations override to labels.
* feedback: add pbt to github workflow
* feedback: move labels to ParameterAssignments in GetSuggestionsReply and cleanup pbt.yaml.
* feedback: remove operator changes
* feedback: GHA updates
* feedback: new formatting changes
* feedback: add suggestion-pbt to gh-actions build-load.sh.
* fix: missing pbt->simple-pbt name changes, add simple-pbt to update-images.sh update yaml function (causing failing gha).
* feedback: add pointer to website from main readme for pbt
* include MetricsUnavailable condition to Complete in Trial
It is not easy for users to find out why a Trial failed when the training code outputs logs in an incorrect format,
since the trial-controller sets the Succeeded condition to False on the Trial if there are unavailable metrics in Katib DB, as described in https://github.com/kubeflow/katib/issues/1343.
So we also include the MetricsUnavailable condition in Complete for Trial.
* add gh-actions tasks to verify generated codes
* fix gh-actions workflow
* when the number of Failed Trials reaches maxTrialCount, experiment-controller sets Failed to Experiment status
* fix e2e test
* To avoid being set Failed in Experiment status when and is equal to 0, we need to add condition,
* migrate test-infra to GitHub Actions
* change python base image to python:3.9-slim
* move from minikube to kind
* separate darts container images by device type
* run e2e test with multi kubernetes version
* disable deploying katib-ui by default
* change kind kubernetes cluster version
* fix update-images.sh
* fix shellcheck
* fix script to setup katib
* split enas, darts and tf-mnist-with-summaries with trial images
* specify experiments in pytorch-mnist-e2e-test
* reduce storage size for mysql
* fix trial image name for enas and darts
* fix trial image name for file-metrics-collector-with-json-format
* change kubernetes versions
* do not run e2e test on push master branch
* remove backoffLimit field in examples
* Deprecate Katib presubmit on optional-test-infra
This PR serves as sub-PR to deprecate katib presubmit on optional-test-infra.
* Update prow_config.yaml
Update config file
* upgrade the tensorflow version to address some security issues
* fix enas example codes
* upgrade tensorflow to v2.9.1 and tensorflow-aarch64 to v2.9.0
* install protobuf (>= 3.9.2, < 3.20) for tensorflow-aarch64
* upgrade the grpc_health_probe version to v0.4.11 to resolve security vulnerability CVE-2022-27191
* increase batch size of tfjob-mnist-with-summaries
* add primaryPodLabels to tfjob's example
* upgrade kubebuilder version from v2.3.0 to v3.2.0
* fix envtest for experiment-controller
* fix suite test
To avoid the `timeout waiting for process kube-apiserver to stop` error, we must use the `context.WithCancel`.
Ref: https://github.com/kubernetes-sigs/controller-runtime/issues/1571#issuecomment-945535598
* update Go version to v1.17 in kubeflow-katib-presubmit
To avoid the `../../../../pkg/mod/k8s.io/client-go@v0.22.2/plugin/pkg/client/auth/exec/metrics.go:21:2: package io/fs is not in GOROOT (/usr/local/go/src/io/fs)` error,
we must use Go v1.16 or later, but as described in https://github.com/kubeflow/training-operator/issues/1581,
we do not have permission to update `public.ecr.aws/j1r0q0g6/kubeflow-testing:latest` so we need to update it in this.
Due to pypa/setuptools_scm#713, we are experiencing errors when building
charms both locally and in the CI. This change will prevent the error
from happening until the issue is fixed.
* do not check trial parameter in experiment parameters if it's trial's metadata
* revert unnecessary change
* add handle Labels[label] and Annotations[annotation]
* fix test description
* Add prometheus scraping and grafana support to charmed operator
* Upgrade black version to 22.3.0 to fix issue with click dependency
* fix: unpin `black`, fix formatting errors
* fix: minor refactor of prometheus integration
Revert to defaults for relation names and paths, where appropriate.
* fix: apply operator linting checks only to source code
* chore: point katib-db to charm in charmhub
* fix: remove unneeded handling of prometheus relation event
* feat: Add template dashboard and alert rules
These are not working properly. When connecting to grafana, the dashboard shows up but does not populate properly with data. The data source appears wrong
* fix: handle leader-elected events
Without this, upgrade-charm does not work.
* fix: correctly template the sample grafana dashboard
* fix: remove placeholder grafana/prometheus files
* fix: bump wait time to avoid flaky test failure
Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>
* Set `kube-system` as the suggested namespace
Signed-off-by: Elias Koromilas <elias.koromilas@gmail.com>
* Replace broken link
Signed-off-by: Elias Koromilas <elias.koromilas@gmail.com>
* support JSON format logs in file-metrics-collector
* review: convert fileFormat to type FileSystemFileFormat
* Update cmd/metricscollector/v1beta1/file-metricscollector/main.go
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: remove func (f FileSystemFileFormat) String()
* review: get metricRegList only when the format is TEXT
* review: change var name in a script for e2e
* review: explicitly specify cloudml-hypertune in the Dockerfile
* review: use reflect.DeepEqual instead of go-cmp.Diff
* review: stop using 'JSON' directly in error statements
* review: install specific version cloudml-hypertune
* review: get objType in the updateStopRules function
* review: save optimalObjValue across multiple stopRules
* review: add warning messages to parseTimestamp func
* review: generate test files with go test command
* review: change api for new feature
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Upgrade black version to 22.3.0 to fix issue with click dependency
* fix: unpin `black`, fix formatting errors
Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>
* bump Python to 3.9
* modify script to build container image
* fix example for enas
* update scripts to modify image name in ci
* review: change docker build command
* review: use new tf-mnist-with-example in Ci for tfjob
* review: refactor tf-mnist-with-summaries
* review: remove Dockerfile.ppc64le for new-ui
* review: update docs related tf-mnist-with-summaries
* TFEventMetricsCollector supports TF>=2.0 and stop supporting TF <=1.x
* review: add help command to scripts/v1beta1/build.sh
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* fix unit test for tfevent-metricscollector
* review: generate tf event files on CI
* add test command to Makefile
* update publish-trial-images
* update update-images.sh
* reduce batch size
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* implement validation for early stopping
* fix some documents
* fix error messages
* implement gRPC API to verify parameters for early stopping
* review: use early_stopping as gRPC API
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: fix error description
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: remove t.Run
* review: remove condition to verify algorithmName for early stopping
* remove description about updating gRPC API docs in kubeflow website
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* fix: check if parameter references exist in experiment parameters
* Fix validator test
* Update some comments and test descriptions
* Check trial parameter reference only when experiment parameters are not empty
* Add a test for the case 'spec.parameters' is empty
* fix: close mysql statement and rows resources when sql exec end
* style: move code to other place
* style: correct the typo(Prepare)
Co-authored-by: 陈文军 <chenwenjun01@corp.netease.com>
* Init commit with e2e example
* Add Early Stopping and MPI Examples
* Add MPI to README
* Modify SDK for MPI example
* Modify doc
* Update Early Stopping example
* Finish e2e example
* Modify links for KFP guide
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com>
Co-authored-by: Seongjin Kim <seongjinkim1123@gmail.com>
Co-authored-by: Seongjin Kim <seongjinkim1123@gmail.com>
* add cert-generator command
* go mod tidy
* fix gofmt lint check
* fix unittest for katib-cert-generator
* remove unnecessary test code
* fix comment
* review: fix kubeClient
* review: stop to use k8s.io/utils
* review: delete containers[].securityContext
* review: change directory name for cert-generator
* review: fix const
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: stop to use k8s.io/utils
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: delete containers[].securityContext
* review: change directory name for cert-generator
* review: fix const
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: take webhook domain as consts
* review: keep the name testDescription and err
* review: do not try to patch webhook configuration many times
* review: fix some functions to generate cert
* review: add comments
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* review: remove v1beta1 from admissionReviewVersions in ValidatingWebhookConfiguration and MutatingWebhookConfiguration
* fix comments
* review: remove the securityContext field
Co-authored-by: andreyvelich <andrey.velichkevich@gmail.com>
* Use YAML input if TrialParams are missing
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Separate TrialTemplates in two words
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* SDK: change list apis to return objects as default
- change list_trials, list_experiments to return list of objects as
a default
- also, give 'in_short' parameter for who wants only name and status
as before
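The behaviour described in this change can be sketched roughly as follows. This is a minimal illustration only: the `Experiment` dataclass and this `list_experiments` function are hypothetical stand-ins written for the example, not the SDK's `KatibClient` methods or its `V1beta1Experiment` model, and the real signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Experiment:
    """Hypothetical stand-in for the SDK's experiment object."""
    name: str
    status: str
    spec: Dict[str, str] = field(default_factory=dict)


def list_experiments(
    items: List[Experiment], in_short: bool = False
) -> Union[List[Experiment], List[Dict[str, str]]]:
    """Return full objects by default, or only name and status when in_short is set."""
    if in_short:
        # Previous behaviour: a short summary per experiment.
        return [{"name": e.name, "status": e.status} for e in items]
    # New default: the objects themselves.
    return list(items)


experiments = [Experiment("random-search", "Succeeded", {"algorithm": "random"})]
print(list_experiments(experiments))
print(list_experiments(experiments, in_short=True))
```

Keeping the short form behind an opt-in flag means callers that only need name and status do not have to unpack full objects.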
* [enh]: change return type from List[dict] to List[V1beta1Experiment]
* [enh]: deserialize dict to katib's custom class
* [docs]: refactor KatibClient docs
* change deserialize method location to utils
* remove useless import
* Add objects necessary to deserialization in swagger
* use fakeresponse rather than duplicating codes
* Update sdk/python/v1beta1/kubeflow/katib/utils/utils.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update sdk/python/v1beta1/kubeflow/katib/utils/utils.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement Optuna service and cmd
* Update pkg/suggestion/v1beta1/optuna/service.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update pkg/suggestion/v1beta1/optuna/service.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update pkg/suggestion/v1beta1/optuna/service.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update pkg/suggestion/v1beta1/optuna/service.py
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Merge the blocks of self.lock in OptunaService
* Remove Cython installation
* Update Python version for the Optuna suggestion service
* Add the example yaml of multivariate-tpe
* Fix the logic of handling unknown trials
* Use name and value instead of the string representation of assignment
* Turn on constant liar by default
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Create common function to test validate algorithm settings
Validate db exhausted for chocolate
* remove parentheses
* Use common util to test Suggestions
* Fix API name
* Fix indexing
* Change default image for Katib UI
* Change title for UI
* Fix image name
* Use katib-ui image for the new UI
* Remove build from CI workflow
* Add cache to Kaniko
* Add cache repo
* Add other cache repo
* Remove cache
* Add Support for Argo Workflows
* Few changes in README
* Add Argo to README
* Remove Argo access from Katib manifests
* Remove Tekton access from Katib manifests
* Few changes in README
* Change to Pipelines
* [enh]: validate for skopt algorithm settings
* [style]: refactor with reviews
- use staticmethod rather than classmethod
- change convertAlgorithmSpec method name to a snake_case
- use .format() rather than f-string
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com>
* Update operator dependencies
Updates requirements.txt for each operator to use latest version from
pypi.org.
* Switch katib-ui operator to SDI interface
Switches katib-ui operator to the serialized-data-interface library,
which provides a way to declaratively define relationships.
* Error messages corrected
#1516
* ucfirst
The first letter of both error messages is changed to uppercase.
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* fix kustomize manifests for kubeflow
* fix standalone and external-db manifests
* remove old namespace file
* remove PV from kubeflow manifest
* fix katib-external-db reference outside of root
* fix katib-with-kubeflow-cert-manager
* Move image tags to katib-config.yaml and remove patches
* use common namespace kustomization
* Make kubeflow-cert use kubeflow as a base
* Remove katib-cert-generator job from kubeflow-cert-generator manifests
* Move pv-patch to patches folder
* Create katib-cert-manager and make kubeflow use this as base
* Fix release and CI scripts for new layout
* Remove unnecessary cert-generator images from kustomization.yaml
* Remove unnecessary SA, CR and CRB from katib-cert-manager
* Remove commonLabel from katib-with-kubeflow
* Separate cert-generator from webhook kustomization
* Create workflow for Go
* Add GOPATH env
* Move check up
* Add env
* Add go mod download
* Add ls command
* Add path
* Change path for run
* Change GOPATH
* Add kubebuilder
* Download coveralls
* Add node test
* Remove Travis
* Add coveralls step
* Change coveralls use
* Add working dir
* Remove run
* Update boilerplate for clients
* Modify not format boilerplate
* Modify boilerplate script for go files
* Update boilerplate manually for go files
* Modify script
* Generate boilerplate for Go files
* Add script for Python files
* Generate boilerplate for the Py files
* Include shell
* Generate boilerplate for shell scripts
* Add to makefile
* Fix comments
* Change comments
* Add small section for the new UI in README
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Run update-readme-toc
Adds a top level table-of-contents entry for the new UI as well.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Rephrase how to launch the new UI
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Add the lost newline before the TOC
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(base): Use new commit of the latest common code
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(standalone): Expose namespace dropdown
If the Central Dashboard is not present, then the app will show the
namespace selector and will try to fetch the namespaces. This will allow
the app to work in standalone mode, since the users will be able to
navigate between namespaces even without the central dashboard.
The next steps would be to add authorization checks to the backend to
perform SubjectAccessReviews.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Add Trial Parameters
The form will be showing a dynamic list of trial parameters that the
users will need to configure. This list is derived from the YAML
content.
The JS parses the yaml contents to find the trial parameters.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Refactor the algorithm component
Create a common component for the algorithm settings. We will need this
for Early Stopping, which also has algorithm settings.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Add support for early stopping
Add a distinct section for Early Stopping, when the Search Algorithm is
for Hyper Parameter tuning.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Add resume policy to form
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Small fixes to algo's settings
* The algorithm settings got applied in the form only after the user
selected a different algorithm. The preselected value would not have
the list of settings assigned once the form loads.
* Use null everywhere for the `random_state` parameter
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Use correct form group
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Use LongRunning for resumepolicy default
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* ui(form): Use FormArray variable directly
Using "formGroup.get('trialParameters').controls" in the html results in
an error, during the build process. This is because Angular can't deduce that
the control returned from the get() method is a form array.
We will define a FormArray variable and use it directly instead of
get()ing it from the form group.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Refactor kustomize manifests
* Remove file
* Modify README
* Disable actions
* Update Trial images tag to v1beta1-c6c9172
* Fix few comments
* Remove test print
* Remove image pull policy
* Remove TODOs
* Exclude PV from Kubeflow install
Add image versions to katib-external-db install
* Rename Katib IBM install
* Create patch file for Katib config
* Fix path for MC
* Fix var name
* Change tag in patch file
* Change download mnist for mxnet example
* Change MNIST to FashionMNIST
* Remove comment from actions
* Fix image for MXNET mnist
Rename example folder for PyTorch mnist
* Add build process for Katib Trial template in the CI
Fix problems with current image build
* Change release registry to kubeflowkatib
* Few changes
* Simplify sed for mnist
* Add script to change trial images
* Modify script
* Add if for macos
* Enable PyTorch examples
* Remove test print
* manifests: Move manifests development upstream
As part of the work of wg-manifests for 1.3
(https://github.com/kubeflow/manifests/issues/1735), we are moving manifests
development in upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.
This commit copies the manifests for application `Katib`
from path `apps/katib/upstream` of kubeflow/manifests to path
`manifests/v1beta1` of the upstream repo (https://github.com/kubeflow/katib).
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
* manifests: Fold base, overlays into components
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
* Update all dependencies to the latest versions
Add cert generator for the webhooks
Add manifests for the webhooks
* Modify Dockerfile for manager
* Remove comments
* Update Dockerfiles for Go images
* Add signerName: kubernetes.io/kube-apiserver-client to csr
Update roles for controller RBAC
Changes after review
* Fix not installed CRD error
* Update scripts
* Revert operator changes
* Describe controller pod in test
* Add log line to test
* Move kubectl version
* Change csr version to v1beta1
* Remove log
* Change signerName to kubernetes.io/kubelet-serving
* Modify common name
Co-authored-by: Yuki Iwai <68272500+tenzen-y@users.noreply.github.com>
* Add env variable to init container
Co-authored-by: Yuki Iwai <68272500+tenzen-y@users.noreply.github.com>
* Get namespace from env
Co-authored-by: Yuki Iwai <68272500+tenzen-y@users.noreply.github.com>
* Remove quotes
* Remove spaces
* Run cert generator script from the Job
* Modify new ui Dockerfile
* Disable Actions on PR
* Modify setup Katib script
* Fix PODNUM
* Remove imagePullPolicy from PyTorch and TFJob examples
* Disable Pytorch examples in e2e
* Add sleep to e2e test
* Activate Actions
* Disable actions
Co-authored-by: Yuki Iwai <68272500+tenzen-y@users.noreply.github.com>
* Create a folder for the new-ui
We will create a `new-ui` folder under the `pkg` dir to add the new UI.
This will ensure that we won't break any existing functionality.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Initial code for the frontend
This PR introduces the new UI. We hope that this will be the last big PR
in this repo and all of the subsequent ones will be smaller bite-sized
PRs.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* backend: Expose the entire status of an experiment
We want the table in the main UI page to show more information for each
experiment. This information lives in the status of each Experiment CR,
so we expand the API to also return the entire status for each
Experiment.
In the future we will probably need to just send the entire CR to the
frontend and not parse it at all in the backend.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* backend: Return the KFP run uid
We want to return the Pipeline UID for a Trial, if such exists.
When combined with Kale, a Trial initiates a KFP run. In this case,
there is an annotation with the KFP run ID, which we can use to navigate
the user to the KFP UI for the specific run.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* backend: serve an Angular SPA
To serve an SPA the backend must return the index.html for any non-API
route. The index.html must be sent for any request to the app's page.
Then, once the javascript loads, the app will show to the user the
correct view.
In this commit we also completely remove any caching of the index.html,
for the browser to always request the latest version. This eliminates
the need to hard reload the page to view changes to the frontend code.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
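As a rough sketch of the routing rule described in this commit (the actual backend is not written in Python), the decision can be modelled as a pure function. The `/api/` prefix, the function name, and the exact header values are assumptions made for illustration, and serving of static assets such as JavaScript bundles is left out.

```python
from typing import Dict, Tuple

# Hypothetical API prefix; the real backend's routes may differ.
API_PREFIX = "/api/"

# Headers that ask the browser never to reuse a cached index.html.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache, no-store, must-revalidate",
    "Pragma": "no-cache",
    "Expires": "0",
}


def resolve_request(path: str) -> Tuple[str, Dict[str, str]]:
    """Decide what to serve for a request path, plus any extra response headers."""
    if path.startswith(API_PREFIX):
        # API routes are handled by the backend handlers.
        return "api", {}
    # Any other route gets index.html; the SPA router picks the view client-side.
    return "index.html", dict(NO_CACHE_HEADERS)


print(resolve_request("/api/experiments"))
print(resolve_request("/experiment/random-search"))
```

Returning `index.html` for unknown routes is what lets a deep link be opened directly, and the no-cache headers are what remove the need for a hard reload after a frontend change.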
* Dockerfile for the new Katib web app
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Extend the dockerignore for the new UI
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Update the README with build commands
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: use port 8080 instead of 80 in backend
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: use lowercase fields when fetching exps
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Add seconds to the x-axis of Trial info
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Unify the npm run build commands
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Move TypeMeta values to a common place
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Remove section for max_old_space_size in README
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: add katib prefix to docs link
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Correct link for new UI in README
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Remove unused 'format' npm script
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Ensure format checks work with Travis
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Remove unused space
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Use create_experiment route
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: fix travis govet test
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Rename the Bayesian settings
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Rename the ParametersSpec
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Remove setting TypeMeta and ObjectMeta
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Update README for build:watch
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Fix a typo
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Remove unused css
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Use types from k8s.models file
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Add kfp-run column if UID is present in trials
With Kale a Trial can launch a distinct KF Pipeline. The UID of this
pipeline will be also set as an annotation to the Trial owning the
Pipeline.
In this case the UI should have one extra column for this UID.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Properly expose the NAS fields
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Move MetricCollector enums to global enums file
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Remove unused volume enum
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* review: Don't send empty settings
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: Add parameters for TPE
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Update ADOPTERS.md
add adopter
* keep the list in alphabetical order
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
This will change the katib-controller and katib-ui
roles to clusterroles.
Additionally Dominik Fleischmann is being added to
the owners of the katib operators.
* Migrate katib to new test-infra
* Roll back test worker image
* Roll back all the changes
* Wrap integer
* Remove DDL and TTL configuration and use default
* Migrate to new aws account's configuration
* Clean up comment
Builds images locally with `latest` tag, and includes new bundle for
testing purposes that sets operator docker images to `latest` as well
instead of a particular revision.
* Add Katib Bundle for Juju
Adds Python operators for Katib, corresponding to the latest Katib manifests.
Adds an `operators` folder with an OWNERS file to hold the operators.
* Fixing code review items
* Update README.md
Update README.md
* Dedent OWNERS file
* Update README.md
Co-authored-by: Rui Vasconcelos <rui.vasconcelos.mail@gmail.com>
<!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing
2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-->
**What this PR does / why we need it**:
**Which issue(s) this PR fixes** _(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)_:
Fixes #
**Special notes for your reviewer**:
**Checklist:**
1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
**Release note**:
<!-- Write your release note:
1. Enter your extended release note in the below block. If the PR requires additional action from users switching to the new release, include the string "action required".
2. If no release note is required, just write "NONE".
-->
```release-note
```
- [ ] [Docs](https://www.kubeflow.org/docs/components/katib/) included if any changes are user facing
| [Ant Group](https://www.antgroup.com/) | [@ohmystack](https://github.com/ohmystack) | Automatic training in Ant Group internal AI Platform |
| [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
| [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
| [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
| [CERN](https://home.cern/) | [@d-gol](https://github.com/d-gol) | Hyperparameter tuning within the ML platform on private cloud |
| [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
| [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |
to build multi-arch images. Check source code as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
To use your custom images for the Katib components, modify
| katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
## DB Manager Flags
Below is a list of command-line flags accepted by Katib DB Manager:
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the master
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib Controller has the internal `cert-generator` to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.
Once the `cert-generator` has finished, the Katib controller starts registering controllers such as `experiment-controller` with the manager.
You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
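
As a rough illustration of the last step only, the sketch below builds a JSON Patch that sets `caBundle` on each webhook entry. The real `cert-generator` is implemented in the linked source and is not reproduced here; the function name, the number of webhooks, and the placeholder certificate bytes are assumptions made for this example.

```python
import base64
import json


def build_ca_bundle_patch(ca_cert_pem: bytes, webhook_count: int) -> str:
    """Return a JSON Patch that sets clientConfig.caBundle on each webhook entry."""
    # The Kubernetes API carries caBundle as a base64-encoded PEM bundle.
    encoded = base64.b64encode(ca_cert_pem).decode("ascii")
    patch = [
        {
            # "add" on an existing object member replaces its value (RFC 6902).
            "op": "add",
            "path": f"/webhooks/{index}/clientConfig/caBundle",
            "value": encoded,
        }
        for index in range(webhook_count)
    ]
    return json.dumps(patch)


# Placeholder bytes standing in for a real self-signed certificate.
fake_pem = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
print(build_ca_bundle_patch(fake_pem, webhook_count=2))
```

The same encoded value is written to every webhook entry because all of them are served with the one self-signed certificate.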
NOTE: Katib also supports [cert-manager](https://cert-manager.io/) to generate certs for the admission webhooks instead of using the `cert-generator`.
You can find the installation with the cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](../pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).
## Code Style
### pre-commit
Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
pre-commit`) and run `pre-commit install` from the root of the repository at
least once before creating git commits.
The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all-files`, in which case,
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
Katib is a Kubernetes-based system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. Katib supports a number of ML frameworks, including TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.
Katib is agnostic to machine learning (ML) frameworks.
It can tune hyperparameters of applications written in any language of the
users’ choice and natively supports many ML frameworks, such as
[TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
Table of Contents
=================
* [Getting Started](#getting-started)
* [Name](#name)
* [Concepts in Katib](#concepts-in-katib)
    * [Experiment](#experiment)
    * [Suggestion](#suggestion)
    * [Trial](#trial)
    * [Worker Job](#worker-job)
* [Components in Katib](#components-in-katib)
* [Web UI](#web-ui)
* [API documentation](#api-documentation)
* [Installation](#installation)
    * [TF operator](#tf-operator)
    * [PyTorch operator](#pytorch-operator)
    * [Katib](#katib)
    * [Running examples](#running-examples)
    * [Cleanups](#cleanups)
* [Katib SDK](#katib-sdk)
* [Quick Start](#quick-start)
* [Who are using Katib?](#who-are-using-katib)
* [Citation](#citation)
* [CONTRIBUTING](#contributing)
Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)
A `Suggestion` is a proposed solution to the optimization problem: one set of hyperparameter values, or a list of parameter assignments. A `Trial` is then created to evaluate the parameter assignments.
`Suggestion` is defined as a CRD.
### Trial
A `Trial` is one iteration of the optimization process: one `worker job` instance with a list of parameter assignments (corresponding to a suggestion).
`Trial` is defined as a CRD.
### Worker Job
A `Worker Job` refers to a process responsible for evaluating a `Trial` and calculating its objective value.
The worker kind can be a [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/), which is a non-distributed execution, or a [Kubeflow TFJob](https://www.kubeflow.org/docs/guides/components/tftraining/) or [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/guides/components/pytorch/), which are distributed executions.
Thus, Katib supports multiple frameworks with the help of different job kinds.
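The relationship between these concepts can be sketched in a few lines of Python. This is a simplified, in-process illustration and not Katib's implementation: the class and function names are hypothetical, random search stands in for the suggestion algorithm, and a plain function call stands in for the worker job.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Assignments = Dict[str, float]


@dataclass
class Suggestion:
    """One proposed set of parameter assignments."""
    assignments: Assignments


@dataclass
class Trial:
    """One iteration: the assignments plus the objective value once evaluated."""
    name: str
    assignments: Assignments
    objective: Optional[float] = None


def suggest(space: Dict[str, Tuple[float, float]], count: int, rng: random.Random) -> List[Suggestion]:
    """Random search: draw each parameter uniformly from its (min, max) range."""
    return [
        Suggestion({name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()})
        for _ in range(count)
    ]


def run_worker_job(trial: Trial, objective_fn: Callable[[Assignments], float]) -> Trial:
    """The worker job evaluates a trial and records its objective value."""
    trial.objective = objective_fn(trial.assignments)
    return trial


def run_experiment(space, objective_fn, max_trials: int, seed: int = 0) -> Trial:
    """Create one trial per suggestion and return the trial with the lowest objective."""
    rng = random.Random(seed)
    trials = [
        run_worker_job(Trial(f"trial-{i}", s.assignments), objective_fn)
        for i, s in enumerate(suggest(space, max_trials, rng))
    ]
    return min(trials, key=lambda t: t.objective)


if __name__ == "__main__":
    # Toy objective: squared distance of the learning rate from 0.05.
    best = run_experiment({"lr": (0.01, 0.1)}, lambda a: (a["lr"] - 0.05) ** 2, max_trials=5)
    print(best)
```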
Currently Katib supports the following exploration algorithms:
- Check out [who is using Katib](ADOPTERS.md) and [presentations about the Katib project](docs/presentations.md).
For the v1beta1 version, run the following:
## Contributing
```
git clone git@github.com:kubeflow/katib.git
bash katib/scripts/v1beta1/deploy.sh
```
For the v1alpha3 version, run the following:
```
cd "${MANIFESTS_DIR}/katib/katib-crds/base"
kustomize build . | kubectl apply -f -
cd "${MANIFESTS_DIR}/katib/katib-controller/base"
kustomize build . | kubectl apply -f -
```
If you install Katib from the Kubeflow manifests repository and want to use it in a cluster that doesn't have a StorageClass for dynamic volume provisioning, you have to create a persistent volume manually to bind your persistent volume claim.
This is a sample YAML file for creating a persistent volume with local storage:
@@ -71,25 +89,30 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
}
func main() {
	var connectTimeout time.Duration
	var listenAddress string
	flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
	flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
		"default", "The implementation of suggestion interface in experiment controller (default)")
	flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
	flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
	flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")
	flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
	flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
	flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
	flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
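
The `Kind.version.group` form accepted by `trial-resources` can be illustrated with a small parser. This is a Python sketch for illustration only, not the controller's Go implementation; `parse_trial_resource` and `GroupVersionKind` are hypothetical names, and the sketch assumes that all three parts are present.

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    group: str
    version: str
    kind: str


def parse_trial_resource(value: str) -> GroupVersionKind:
    """Split "Kind.version.group" into its parts.

    Only the first two dots are separators, because the group itself
    may contain dots (for example "kubeflow.org").
    """
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {value!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)


if __name__ == "__main__":
    print(parse_trial_resource("TFJob.v1.kubeflow.org"))
```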