katib/test
Hezhi (Helen) Xie 73b8c5c029
[GSoC] Add e2e test for `tune` api with LLM hyperparameter optimization (#2420)
* add e2e test for tune api
* upgrade training-operator sdk
* specify the version of training operator sdk
* fix num_labels error and update the version of training operator controller
* check the version of training operator
* debug
* check import path of HuggingFaceModelParams
* update the version of training operator sdk
* update the name of experiment
* add step of checking pod
* check the logs of pod
* add check
* check reason for imagepullbackoff
* revert timeout limit
* fix format
* extend timeout limit
* update training operator sdk version
* check the logs of pod
* rerun tests
* update the function of getting logs
* add the step of describing pod
* check disk space
* change work directory
* change work directory
* increase timeout limit
* check the logs of controller and events
* change work directory
* change work directory
* change work directory
* check the logs of kubelet
* check the logs of kubelet
* increase cpu
* check the logs of training operator
* check the use of resources
* check the logs of container 'pytorch' and 'storage_initializer'
* fix error of checking use of resources
* add other checks to find the error reason
* set 'storage_config'
* reduce the number of tests
* Check container runtime logs
* set the driver of minikube as docker
* set the driver of minikube to none
* check logs of pod
* check memory usage
* increase 'termination_grace_period_seconds' in podspec
* fix annotations error
* restart docker
* delete restarting docker
* use original docker data directory
* update installation of Katib SDK with extra requires
* test trainer image built with cpu
* add action of free up disk space (including move docker data directory)
* delete unnecessary checks and update the part of fetching pod description and logs
* delete fetching pod logs
* add blank line at the end of free-up-disk-space yaml file
* update experiment name
* update test function name to be consistent with experiment name
* move import statements inside the function
* apply pprint for the logging output
* update experiment names
* fix format
* fix format
* fix the sequence of arguments in 'trial_template'
* test example in user guide
* fix access token error
* fix the error of setup
* fix the error of setup
* reverse back
* fix format
* fix format

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
2025-06-26 14:13:16 +00:00
e2e/v1beta1
unit/v1beta1
__init__.py
conftest.py