Compare commits

4 Commits

Author SHA1 Message Date
Johnu George 0a7453d212 Katib official release v0.14.0 2022-08-18 17:04:37 +05:30
Yuki Iwai 12a4896ae0 [cherry-pick] Add the pytorch-mnist with GPU support container image (#1917) 2022-07-17 14:26:07 +00:00
Johnu George 8dcc7d3398 Fix push script to include new images (#1912) 2022-06-30 13:24:17 +00:00
Johnu George 73177dc229 Katib official release v0.14.0-rc.0 2022-06-30 12:24:20 +05:30
853 changed files with 67663 additions and 60317 deletions

View File

@ -4,3 +4,5 @@ docs
manifests manifests
pkg/ui/*/frontend/node_modules pkg/ui/*/frontend/node_modules
pkg/ui/*/frontend/build pkg/ui/*/frontend/build
pkg/new-ui/*/frontend/node_modules
pkg/new-ui/*/frontend/build

View File

@ -1,4 +0,0 @@
[flake8]
max-line-length = 100
# E203 is ignored to avoid conflicts with Black's formatting, as it's not PEP 8 compliant
extend-ignore = W503, E203

26
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View File

@ -0,0 +1,26 @@
---
name: Bug report
about: Tell us about a problem you are experiencing
---
/kind bug
**What steps did you take and what happened:**
[A clear and concise description of what the bug is.]
**What did you expect to happen:**
**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]
**Environment:**
- Katib version (check the Katib controller image version):
- Kubernetes version: (`kubectl version`):
- OS (`uname -a`):
---
<!-- Don't delete this message to encourage users to support your issue! -->
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

View File

@ -1,50 +0,0 @@
name: Bug Report
description: Tell us about a problem you are experiencing with Katib
labels: ["kind/bug", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Katib bug report!
- type: textarea
id: problem
attributes:
label: What happened?
description: |
Please provide as much info as possible. Not doing so may result in your bug not being
addressed in a timely manner.
validations:
required: true
- type: textarea
id: expected
attributes:
label: What did you expect to happen?
validations:
required: true
- type: textarea
id: environment
attributes:
label: Environment
value: |
Kubernetes version:
```bash
$ kubectl version
```
Katib controller version:
```bash
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
```
Katib Python SDK version:
```bash
$ pip show kubeflow-katib
```
validations:
required: true
- type: input
id: votes
attributes:
label: Impacted by this bug?
value: Give it a 👍 We prioritize the issues with most 👍

View File

@ -1,12 +1,9 @@
blank_issues_enabled: true blank_issues_enabled: false
contact_links: contact_links:
- name: Katib Documentation - name: Katib Documentation
url: https://www.kubeflow.org/docs/components/katib/ url: https://www.kubeflow.org/docs/components/katib/
about: Much help can be found in the docs about: Much help can be found in the docs
- name: Kubeflow Katib Slack Channel - name: AutoML Slack Channel
url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels url: https://kubeflow.slack.com/archives/C018PMV53NW
about: Ask the Katib community on CNCF Slack about: Ask the Katib community on Slack
- name: Kubeflow Katib Community Meeting
url: https://bit.ly/2PWVCkV
about: Join the Kubeflow AutoML working group meeting

View File

@ -0,0 +1,18 @@
---
name: Feature enhancement request
about: Suggest an idea for this project
---
/kind feature
**Describe the solution you'd like**
[A clear and concise description of what you want to happen.]
**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]
---
<!-- Don't delete this message to encourage users to support your issue! -->
Love this feature? Give it a 👍 We prioritize the features with the most 👍

View File

@ -1,28 +0,0 @@
name: Feature Request
description: Suggest an idea for Katib
labels: ["kind/feature", "lifecycle/needs-triage"]
body:
- type: markdown
attributes:
value: |
Thanks for taking the time to fill out this Katib feature request!
- type: textarea
id: feature
attributes:
label: What you would like to be added?
description: |
A clear and concise description of what you want to add to Katib.
Please consider to write Katib enhancement proposal if it is a large feature request.
validations:
required: true
- type: textarea
id: rationale
attributes:
label: Why is this needed?
validations:
required: true
- type: input
id: votes
attributes:
label: Love this feature?
value: Give it a 👍 We prioritize the features with most 👍

View File

@ -1,6 +1,6 @@
<!-- Thanks for sending a pull request! Here are some tips for you: <!-- Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing 1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing
2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md 2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/docs/developer-guide.md
3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews 3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
--> -->

20
.github/stale.yml vendored Normal file
View File

@ -0,0 +1,20 @@
# Configuration for stale probot https://probot.github.io/apps/stale/
# Number of days of inactivity before an issue becomes stale
daysUntilStale: 90
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 20
# Issues with these labels will never be considered stale
exemptLabels:
- lifecycle/frozen
# Label to use when marking an issue as stale
staleLabel: lifecycle/stale
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
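
As a rough illustration of the timeline this configuration implies, the sketch below classifies an issue by its days of inactivity. The helper name `stale_state` and the exact boundary handling are assumptions made for illustration, not the bot's implementation; it also assumes the close countdown starts once the stale label has been applied.

```bash
# Values copied from the stale.yml shown above.
DAYS_UNTIL_STALE=90
DAYS_UNTIL_CLOSE=20

# Hypothetical helper: classify an issue by days without activity.
# Assumes the close countdown begins once the issue is marked stale.
stale_state() {
  days="$1"
  if [ "$days" -lt "$DAYS_UNTIL_STALE" ]; then
    echo "active"
  elif [ "$days" -lt $((DAYS_UNTIL_STALE + DAYS_UNTIL_CLOSE)) ]; then
    echo "stale"
  else
    echo "closed"
  fi
}
```

Under these assumptions an untouched issue would be closed after roughly 110 days, unless it carries the `lifecycle/frozen` label, which this sketch does not model.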

View File

@ -1,81 +0,0 @@
# Reusable workflows for publishing Katib images.
name: Build and Publish Images
on:
workflow_call:
inputs:
component-name:
required: true
type: string
platforms:
required: true
type: string
dockerfile:
required: true
type: string
secrets:
DOCKERHUB_USERNAME:
required: false
DOCKERHUB_TOKEN:
required: false
jobs:
build-and-publish:
name: Build and Publish Images
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Set Publish Condition
id: publish-condition
shell: bash
run: |
if [[ "${{ github.repository }}" == 'kubeflow/katib' && \
( "${{ github.ref }}" == 'refs/heads/master' || \
"${{ github.ref }}" =~ ^refs/heads/release- || \
"${{ github.ref }}" =~ ^refs/tags/v ) ]]; then
echo "should_publish=true" >> $GITHUB_OUTPUT
else
echo "should_publish=false" >> $GITHUB_OUTPUT
fi
- name: GHCR Login
if: steps.publish-condition.outputs.should_publish == 'true'
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: DockerHub Login
if: steps.publish-condition.outputs.should_publish == 'true'
uses: docker/login-action@v3
with:
registry: docker.io
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Publish Component ${{ inputs.component-name }}
if: steps.publish-condition.outputs.should_publish == 'true'
id: publish
uses: ./.github/workflows/template-publish-image
with:
image: |
ghcr.io/kubeflow/katib/${{ inputs.component-name }}
docker.io/kubeflowkatib/${{ inputs.component-name }}
dockerfile: ${{ inputs.dockerfile }}
platforms: ${{ inputs.platforms }}
push: true
- name: Test Build For Component ${{ inputs.component-name }}
if: steps.publish.outcome == 'skipped'
uses: ./.github/workflows/template-publish-image
with:
image: |
ghcr.io/kubeflow/katib/${{ inputs.component-name }}
docker.io/kubeflowkatib/${{ inputs.component-name }}
dockerfile: ${{ inputs.dockerfile }}
platforms: ${{ inputs.platforms }}
push: false
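
The "Set Publish Condition" step in the removed workflow can be tried locally with a small sketch. The function `should_publish` is a hypothetical name, and the prefix regexes of the original (`^refs/heads/release-`, `^refs/tags/v`) are rewritten here as `case` glob patterns so the sketch runs in any POSIX shell; it is an equivalent reformulation, not the workflow's own code.

```bash
# Hypothetical helper mirroring the publish condition:
# prints "true" only for kubeflow/katib on master, release-* branches, or v* tags.
should_publish() {
  repo="$1"
  ref="$2"
  if [ "$repo" != "kubeflow/katib" ]; then
    echo "false"
    return
  fi
  case "$ref" in
    refs/heads/master | refs/heads/release-* | refs/tags/v*)
      echo "true"
      ;;
    *)
      echo "false"
      ;;
  esac
}

should_publish kubeflow/katib refs/heads/master    # publish
should_publish kubeflow/katib refs/pull/1/merge    # build only
```

When the result is "false", the workflow above still runs the build through the "Test Build" step with `push: false`, so pull requests and forks exercise the Dockerfiles without pushing images.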

View File

@ -1,27 +1,22 @@
name: E2E Test with darts-cnn-cifar10 name: E2E Test with darts-cnn-cifar10
on: on:
pull_request: - pull_request
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency: env:
group: ${{ github.workflow }}-${{ github.ref }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
cancel-in-progress: true
jobs: jobs:
e2e: e2e:
runs-on: ubuntu-22.04 runs-on: ubuntu-20.04
timeout-minutes: 120 timeout-minutes: 120
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Test Env - name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test uses: ./.github/workflows/template-setup-e2e-test
with: with:
kubernetes-version: ${{ matrix.kubernetes-version }} kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: "3.11"
- name: Run e2e test with ${{ matrix.experiments }} experiments - name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test uses: ./.github/workflows/template-e2e-test
@ -33,6 +28,8 @@ jobs:
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"] # TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]
# Comma Delimited # Comma Delimited
experiments: ["darts-cpu"] experiments: ["darts-cpu"]

View File

@ -1,40 +0,0 @@
name: E2E Test with tune API
on:
pull_request:
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
- name: Install Katib SDK with extra requires
shell: bash
run: |
pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'
- name: Run e2e test with tune API
uses: ./.github/workflows/template-e2e-test
with:
tune-api: true
training-operator: true
strategy:
fail-fast: false
matrix:
# Detail: https://hub.docker.com/r/kindest/node
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

View File

@ -1,35 +0,0 @@
name: E2E Test with Katib UI, random search, and postgres
on:
- pull_request
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
e2e:
runs-on: ubuntu-22.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
- name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test
with:
experiments: random
# Comma Delimited
trial-images: pytorch-mnist-cpu
katib-ui: true
database-type: postgres
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

View File

@ -1,27 +1,22 @@
name: E2E Test with enas-cnn-cifar10 name: E2E Test with enas-cnn-cifar10
on: on:
pull_request: - pull_request
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency: env:
group: ${{ github.workflow }}-${{ github.ref }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
cancel-in-progress: true
jobs: jobs:
e2e: e2e:
runs-on: ubuntu-22.04 runs-on: ubuntu-20.04
timeout-minutes: 120 timeout-minutes: 120
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Test Env - name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test uses: ./.github/workflows/template-setup-e2e-test
with: with:
kubernetes-version: ${{ matrix.kubernetes-version }} kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: "3.8"
- name: Run e2e test with ${{ matrix.experiments }} experiments - name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test uses: ./.github/workflows/template-e2e-test
@ -33,6 +28,8 @@ jobs:
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"] # TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]
# Comma Delimited # Comma Delimited
experiments: ["enas-cpu"] experiments: ["enas-cpu"]

View File

@ -1,49 +0,0 @@
name: Free-Up Disk Space
description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker
runs:
using: composite
steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
echo "Disk usage before cleanup:"
df -hT
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf /usr/local/share/boost
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift
echo "Disk usage after cleanup:"
df -hT
- name: Prune docker images
shell: bash
run: |
docker image prune -a -f
docker system df
df -hT
- name: Move docker data directory
shell: bash
run: |
echo "Stopping docker service ..."
sudo systemctl stop docker
DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
DOCKER_ROOT_DIR=/mnt/docker
echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
echo "Starting docker service ..."
sudo systemctl daemon-reload
sudo systemctl start docker
echo "Docker service status:"
sudo systemctl --no-pager -l -o short status docker

View File

@ -0,0 +1,32 @@
name: E2E Test for katib-ui
on:
- pull_request
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
- name: Set Up Minikube Cluster
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh true
- name: Start Katib
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh true false
strategy:
fail-fast: false
matrix:
# TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]

23
.github/workflows/lint.yaml vendored Executable file
View File

@ -0,0 +1,23 @@
name: Lint YAML files
on:
- push
- pull_request
jobs:
lint:
name: Lint
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Check YAML
run: make yamllint

View File

@ -0,0 +1,41 @@
name: E2E Test with mxnet-mnist
on:
- pull_request
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
jobs:
e2e:
runs-on: ubuntu-20.04
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
- name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test
with:
experiments: ${{ matrix.experiments }}
# Comma Delimited
trial-images: mxnet-mnist
strategy:
fail-fast: false
matrix:
# TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]
# Comma Delimited
experiments:
# suggestion-hyperopt
- "random,tpe,never-resume"
- "median-stop,from-volume-resume"
# others
- "grid,bayesian-optimization,tpe"
- "multivariate-tpe,cma-es,hyperband"
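
To see how many jobs this matrix produces, the sketch below enumerates every combination of Kubernetes version and experiment group. The function name `count_matrix_jobs` is hypothetical; the values are copied from the workflow above, and each comma-delimited experiment group is treated as a single matrix entry.

```bash
# Enumerate the version x experiment-group combinations of the matrix above.
count_matrix_jobs() {
  versions="v1.21.13 v1.22.10 v1.23.7"
  # Each space-separated item is one comma-delimited matrix entry.
  exp_groups="random,tpe,never-resume median-stop,from-volume-resume grid,bayesian-optimization,tpe multivariate-tpe,cma-es,hyperband"
  count=0
  for v in $versions; do
    for g in $exp_groups; do
      echo "$v $g"
      count=$((count + 1))
    done
  done
  echo "total: $count"
}

count_matrix_jobs
```

Three versions times four groups gives twelve jobs, and with `fail-fast: false` a failure in one of them does not cancel the others.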

View File

@ -2,21 +2,28 @@ name: Publish AutoML Algorithm Images
on: on:
push: push:
pull_request: branches:
paths-ignore: - master
- "pkg/ui/v1beta1/frontend/**"
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
jobs: jobs:
algorithm: algorithm:
name: Publish Image name: Publish Image
uses: ./.github/workflows/build-and-publish-images.yaml # Trigger workflow only for kubeflow/katib repository.
with: if: github.repository == 'kubeflow/katib'
component-name: ${{ matrix.component-name }} runs-on: ubuntu-latest
platforms: linux/amd64,linux/arm64 steps:
dockerfile: ${{ matrix.dockerfile }} - name: Checkout
secrets: uses: actions/checkout@v2
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }} - name: Publish Component ${{ matrix.component-name }}
uses: ./.github/workflows/template-publish-image
with:
image: docker.io/kubeflowkatib/${{ matrix.component-name }}
dockerfile: ${{ matrix.dockerfile }}
strategy: strategy:
fail-fast: false fail-fast: false
@ -24,6 +31,8 @@ jobs:
include: include:
- component-name: suggestion-hyperopt - component-name: suggestion-hyperopt
dockerfile: cmd/suggestion/hyperopt/v1beta1/Dockerfile dockerfile: cmd/suggestion/hyperopt/v1beta1/Dockerfile
- component-name: suggestion-chocolate
dockerfile: cmd/suggestion/chocolate/v1beta1/Dockerfile
- component-name: suggestion-hyperband - component-name: suggestion-hyperband
dockerfile: cmd/suggestion/hyperband/v1beta1/Dockerfile dockerfile: cmd/suggestion/hyperband/v1beta1/Dockerfile
- component-name: suggestion-skopt - component-name: suggestion-skopt

View File

@ -1,24 +0,0 @@
name: Publish Katib Conformance Test Images
on:
- push
- pull_request
jobs:
core:
name: Publish Image
uses: ./.github/workflows/build-and-publish-images.yaml
with:
component-name: ${{ matrix.component-name }}
platforms: linux/amd64,linux/arm64
dockerfile: ${{ matrix.dockerfile }}
secrets:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
strategy:
fail-fast: false
matrix:
include:
- component-name: katib-conformance
dockerfile: Dockerfile.conformance

View File

@ -1,20 +1,29 @@
name: Publish Katib Core Images name: Publish Katib Core Images
on: on:
- push push:
- pull_request branches:
- master
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
jobs: jobs:
core: core:
name: Publish Image name: Publish Image
uses: ./.github/workflows/build-and-publish-images.yaml # Trigger workflow only for kubeflow/katib repository.
with: if: github.repository == 'kubeflow/katib'
component-name: ${{ matrix.component-name }} runs-on: ubuntu-latest
platforms: linux/amd64,linux/arm64 steps:
dockerfile: ${{ matrix.dockerfile }} - name: Checkout
secrets: uses: actions/checkout@v2
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }} - name: Publish Component ${{ matrix.component-name }}
uses: ./.github/workflows/template-publish-image
with:
image: docker.io/kubeflowkatib/${{ matrix.component-name }}
dockerfile: ${{ matrix.dockerfile }}
strategy: strategy:
fail-fast: false fail-fast: false
@ -25,7 +34,9 @@ jobs:
- component-name: katib-db-manager - component-name: katib-db-manager
dockerfile: cmd/db-manager/v1beta1/Dockerfile dockerfile: cmd/db-manager/v1beta1/Dockerfile
- component-name: katib-ui - component-name: katib-ui
dockerfile: cmd/ui/v1beta1/Dockerfile dockerfile: cmd/new-ui/v1beta1/Dockerfile
- component-name: cert-generator
dockerfile: cmd/cert-generator/v1beta1/Dockerfile
- component-name: file-metrics-collector - component-name: file-metrics-collector
dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
- component-name: tfevent-metrics-collector - component-name: tfevent-metrics-collector

View File

@ -2,47 +2,48 @@ name: Publish Trial Images
on: on:
push: push:
pull_request: branches:
paths-ignore: - master
- "pkg/ui/v1beta1/frontend/**"
env:
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
jobs: jobs:
trial: trial:
name: Publish Image name: Publish Image
uses: ./.github/workflows/build-and-publish-images.yaml # Trigger workflow only for kubeflow/katib repository.
with: if: github.repository == 'kubeflow/katib'
component-name: ${{ matrix.trial-name }} runs-on: ubuntu-latest
platforms: ${{ matrix.platforms }} steps:
dockerfile: ${{ matrix.dockerfile }} - name: Checkout
secrets: uses: actions/checkout@v2
DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }} - name: Publish Trial ${{ matrix.trial-name }}
uses: ./.github/workflows/template-publish-image
with:
image: docker.io/kubeflowkatib/${{ matrix.trial-name }}
dockerfile: ${{ matrix.dockerfile }}
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
include: include:
- trial-name: mxnet-mnist
dockerfile: examples/v1beta1/trial-images/mxnet-mnist/Dockerfile
- trial-name: pytorch-mnist-cpu - trial-name: pytorch-mnist-cpu
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu
- trial-name: pytorch-mnist-gpu - trial-name: pytorch-mnist-gpu
platforms: linux/amd64
dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu
- trial-name: tf-mnist-with-summaries - trial-name: tf-mnist-with-summaries
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile dockerfile: examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile
- trial-name: enas-cnn-cifar10-gpu - trial-name: enas-cnn-cifar10-gpu
platforms: linux/amd64
dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.gpu dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.gpu
- trial-name: enas-cnn-cifar10-cpu - trial-name: enas-cnn-cifar10-cpu
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu
- trial-name: darts-cnn-cifar10-cpu - trial-name: darts-cnn-cifar10-cpu
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu
- trial-name: darts-cnn-cifar10-gpu - trial-name: darts-cnn-cifar10-gpu
platforms: linux/amd64
dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu
- trial-name: simple-pbt - trial-name: simple-pbt
platforms: linux/amd64,linux/arm64
dockerfile: examples/v1beta1/trial-images/simple-pbt/Dockerfile dockerfile: examples/v1beta1/trial-images/simple-pbt/Dockerfile

View File

@ -1,27 +1,22 @@
name: E2E Test with pytorch-mnist name: E2E Test with pytorch-mnist
on: on:
pull_request: - pull_request
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency: env:
group: ${{ github.workflow }}-${{ github.ref }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
cancel-in-progress: true
jobs: jobs:
e2e: e2e:
runs-on: ubuntu-22.04 runs-on: ubuntu-20.04
timeout-minutes: 120 timeout-minutes: 120
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Test Env - name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test uses: ./.github/workflows/template-setup-e2e-test
with: with:
kubernetes-version: ${{ matrix.kubernetes-version }} kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: "3.10"
- name: Run e2e test with ${{ matrix.experiments }} experiments - name: Run e2e test with ${{ matrix.experiments }} experiments
uses: ./.github/workflows/template-e2e-test uses: ./.github/workflows/template-e2e-test
@ -34,13 +29,10 @@ jobs:
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"] # TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]
# Comma Delimited # Comma Delimited
experiments: experiments:
# suggestion-hyperopt
- "long-running-resume,from-volume-resume,median-stop"
# others
- "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
- "hyperopt-distribution,optuna-distribution"
- "file-metrics-collector,pytorchjob-mnist" - "file-metrics-collector,pytorchjob-mnist"
- "median-stop-with-json-format,file-metrics-collector-with-json-format" - "median-stop-with-json-format,file-metrics-collector-with-json-format"

View File

@ -1,21 +1,17 @@
name: E2E Test with simple-pbt name: E2E Test with simple-pbt
on: on:
pull_request: - pull_request
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency: env:
group: ${{ github.workflow }}-${{ github.ref }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
cancel-in-progress: true
jobs: jobs:
e2e: e2e:
runs-on: ubuntu-22.04 runs-on: ubuntu-20.04
timeout-minutes: 120 timeout-minutes: 120
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Test Env - name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test uses: ./.github/workflows/template-setup-e2e-test
@ -33,6 +29,8 @@ jobs:
fail-fast: false fail-fast: false
matrix: matrix:
# Detail: https://hub.docker.com/r/kindest/node # Detail: https://hub.docker.com/r/kindest/node
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"] # TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.12", "v1.22.9", "v1.23.6", "v1.24.1"]
kubernetes-version: ["v1.21.12", "v1.22.9", "v1.23.6"]
# Comma Delimited # Comma Delimited
experiments: ["simple-pbt"] experiments: ["simple-pbt"]

View File

@ -1,42 +0,0 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests
on:
schedule:
- cron: "0 */5 * * *"
jobs:
stale:
runs-on: ubuntu-22.04
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@v5
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
days-before-stale: 90
days-before-close: 20
stale-issue-message: >
This issue has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-issue-message: >
This issue has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-issue-label: lifecycle/stale
exempt-issue-labels: lifecycle/frozen
stale-pr-message: >
This pull request has been automatically marked as stale because it has not had
recent activity. It will be closed if no further activity occurs. Thank you
for your contributions.
close-pr-message: >
This pull request has been automatically closed because it has not had recent
activity. Please comment "/reopen" to reopen it.
stale-pr-label: lifecycle/stale
exempt-pr-labels: lifecycle/frozen

View File

@ -1,49 +1,31 @@
# Composite action for e2e tests. # Template for e2e tests.
name: Run E2E Test
description: Run e2e test using the minikube cluster
inputs: inputs:
experiments: experiments:
required: false required: true
description: comma delimited experiment name type: string
default: ""
training-operator: training-operator:
required: false required: false
description: whether to deploy training-operator or not type: boolean
default: false
trial-images: trial-images:
required: false required: true
description: comma delimited trial image name type: string
default: ""
katib-ui: katib-ui:
required: true required: true
description: whether to deploy katib-ui or not type: boolean
default: false
database-type:
required: false
description: mysql or postgres
default: mysql
tune-api:
required: true
description: whether to execute tune-api test or not
default: false default: false
runs: runs:
using: composite using: composite
steps: steps:
- name: Setup Minikube Cluster - name: Set Up Minikube Cluster
shell: bash shell: bash
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.tune-api }} ${{ inputs.trial-images }} ${{ inputs.experiments }} run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
- name: Setup Katib - name: Set Up Katib
shell: bash shell: bash
run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }} run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }}
- name: Run E2E Experiment - name: Run E2E Experiment
shell: bash shell: bash
run: | run: ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
if "${{ inputs.tune-api }}"; then
./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
else
./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
fi
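
The `run` lines above pass inputs as unquoted positional arguments. Workflow expressions are substituted into the script text before the shell runs, so an input that resolves to an empty string contributes no argument at all and later arguments move one position to the left. The sketch below demonstrates only that shell behavior; `count_args` is a hypothetical stand-in, not one of the repository's scripts, and how the real scripts handle their arguments is not shown in this diff.

```bash
# Hypothetical stand-in for a script that receives positional arguments.
count_args() {
  echo "$#"
}

trial_images=""        # mirrors an input whose default is ""
experiments="random"

# Unquoted: the empty value disappears, so only two arguments arrive.
count_args true $trial_images $experiments

# Quoted: the empty value is preserved as its own argument.
count_args true "$trial_images" "$experiments"
```

If positional order matters to the receiving script, quoting each expression is one way to keep empty values in place.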

View File

@ -1,49 +1,28 @@
# Composite action for publishing Katib images. # Template run for publishing Katib images.
name: Build And Publish Container Images
description: Build MultiPlatform Supporting Container Images
inputs: inputs:
image: image:
required: true required: true
description: image tag type: string
dockerfile: dockerfile:
required: true required: true
description: path for dockerfile type: string
platforms:
required: true
description: linux/amd64 or linux/amd64,linux/arm64
push:
required: true
description: whether to push container images or not
runs: runs:
using: composite using: composite
steps: steps:
# This step is a Workaround to avoid the "No space left on device" error.
# ref: https://github.com/actions/runner-images/issues/2840
- name: Remove unnecessary files
shell: bash
run: |
sudo rm -rf /usr/share/dotnet
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/local/lib/android
sudo rm -rf /usr/local/share/powershell
sudo rm -rf /usr/share/swift
echo "Disk usage after cleanup:"
df -h
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set Up Docker Buildx - name: Set Up Docker Buildx
uses: docker/setup-buildx-action@v3 uses: docker/setup-buildx-action@v1
- name: Docker Login
uses: docker/login-action@v1
with:
username: ${{ env.DOCKERHUB_USERNAME }}
password: ${{ env.DOCKERHUB_TOKEN }}
- name: Add Docker Tags - name: Add Docker Tags
id: meta id: meta
uses: docker/metadata-action@v5 uses: docker/metadata-action@v3
with: with:
images: ${{ inputs.image }} images: ${{ inputs.image }}
tags: | tags: |
@ -51,12 +30,11 @@ runs:
type=sha,prefix=v1beta1- type=sha,prefix=v1beta1-
- name: Build and Push - name: Build and Push
uses: docker/build-push-action@v5 uses: docker/build-push-action@v2
with: with:
context: . context: .
file: ${{ inputs.dockerfile }} file: ${{ inputs.dockerfile }}
push: ${{ inputs.push }} push: true
tags: ${{ steps.meta.outputs.tags }} tags: ${{ steps.meta.outputs.tags }}
cache-from: type=gha cache-from: type=gha
cache-to: type=gha,mode=max,ignore-error=true cache-to: type=gha,mode=max
platforms: ${{ inputs.platforms }}


@ -1,48 +1,25 @@
# Composite action to setup e2e tests. # Template for e2e tests.
name: Setup E2E Test
description: setup env for e2e test using the minikube cluster
inputs: inputs:
kubernetes-version: kubernetes-version:
required: true required: true
description: kubernetes version type: string
python-version:
required: false
description: Python version
# Latest supported version
default: "3.10"
runs: runs:
using: composite using: composite
steps: steps:
# This step is a Workaround to avoid the "No space left on device" error. - name: Set Up Minikube Cluster
# ref: https://github.com/actions/runner-images/issues/2840 uses: manusa/actions-setup-minikube@v2.6.0
- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space
- name: Setup kubectl
uses: azure/setup-kubectl@v4
with: with:
version: ${{ inputs.kubernetes-version }} minikube version: "v1.25.2"
kubernetes version: ${{ inputs.kubernetes-version }}
start args: --driver none --wait-timeout=60s
github token: ${{ env.GITHUB_TOKEN }}
- name: Setup Minikube Cluster - name: Set Up Docker Buildx
uses: medyagh/setup-minikube@v0.0.18 uses: docker/setup-buildx-action@v1
- name: Set Up Go env
uses: actions/setup-go@v2
with: with:
network-plugin: cni go-version: 1.17.10
cni: flannel
driver: none
kubernetes-version: ${{ inputs.kubernetes-version }}
minikube-version: 1.34.0
start-args: --wait-timeout=120s
- name: Setup Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
- name: Install Katib SDK
shell: bash
run: pip install --prefer-binary -e sdk/python/v1beta1


@ -0,0 +1,118 @@
name: Charmed Katib
on:
- push
- pull_request
jobs:
lint:
name: Lint
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v2
- name: Install dependencies
run: |
set -eux
sudo apt update
sudo apt install python3-setuptools
sudo pip3 install black flake8
- name: Check black
run: black --check operators/*/src
- name: Check flake8
run: cd operators && flake8 ./katib*/src
build:
name: Test
runs-on: ubuntu-latest
steps:
- name: Check out repo
uses: actions/checkout@v2
- uses: balchua/microk8s-actions@v0.2.2
with:
channel: "1.21/stable"
addons: '["dns", "storage", "rbac"]'
- name: Install dependencies
run: |
set -eux
sudo apt update
sudo apt install -y python3-pip
sudo snap install juju --classic
sudo snap install juju-bundle --classic
sudo snap install juju-wait --classic
sudo pip3 install charmcraft==1.3.1
- name: Build Docker images
run: |
set -eux
images=("katib-controller" "katib-ui" "katib-db-manager")
folders=("katib-controller" "ui" "db-manager")
for idx in {0..2}; do
docker build . \
-t docker.io/kubeflowkatib/${images[$idx]}:latest \
-f cmd/${folders[$idx]}/v1beta1/Dockerfile
docker save docker.io/kubeflowkatib/${images[$idx]} > ${images[$idx]}.tar
microk8s ctr image import ${images[$idx]}.tar
done
- name: Deploy Katib
env:
CHARMCRAFT_DEVELOPER: "1"
run: |
set -eux
cd operators/
git clone git://git.launchpad.net/canonical-osm
cp -r canonical-osm/charms/interfaces/juju-relation-mysql mysql
sg microk8s -c 'juju bootstrap microk8s uk8s'
juju add-model kubeflow
juju bundle deploy --build --destructive-mode --serial
juju wait -wvt 600
- name: Test Katib
run: kubectl apply -f examples/v1beta1/hp-tuning/random.yaml
- name: Get pod statuses
run: kubectl get all -A
if: failure()
- name: Get juju status
run: juju status
if: failure()
- name: Get katib-controller workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-controller
if: failure()
- name: Get katib-controller operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-controller
if: failure()
- name: Get katib-ui workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-ui
if: failure()
- name: Get katib-ui operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-ui
if: failure()
- name: Get katib-db-manager workload logs
run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-db-manager
if: failure()
- name: Get katib-db-manager operator logs
run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-db-manager
if: failure()
- name: Upload charmcraft logs
uses: actions/upload-artifact@v2
with:
name: charmcraft-logs
path: /tmp/charmcraft-log-*
if: failure()
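
Note on the "Build Docker images" step in this workflow: it pairs two Bash arrays (`images` and `folders`) by index to find each image's Dockerfile. The sketch below shows the same name-to-path mapping as a dry run that only prints the `docker build` commands. It is written with a POSIX `case` statement instead of arrays so that it also runs under `sh`. The function name `dockerfile_for` is hypothetical, and the mapping is copied from the two arrays in the step.

```shell
# Hypothetical helper: map an image name to its Dockerfile path.
dockerfile_for() {
  case "$1" in
    katib-controller) folder="katib-controller" ;;
    katib-ui)         folder="ui" ;;
    katib-db-manager) folder="db-manager" ;;
    *) echo "unknown image: $1" >&2; return 1 ;;
  esac
  echo "cmd/$folder/v1beta1/Dockerfile"
}

# Dry run: print the build commands instead of executing them.
for image in katib-controller katib-ui katib-db-manager; do
  echo "docker build . -t docker.io/kubeflowkatib/$image:latest -f $(dockerfile_for "$image")"
done
```

Keeping the name and folder in one place avoids the two arrays drifting out of step when an image is added, which is the main reason to prefer this shape over index-based pairing.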


@ -1,18 +1,13 @@
name: Go Test name: Go Test
on: on:
pull_request: - push
paths-ignore: - pull_request
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs: jobs:
generatetests: generatetests:
name: Generate And Format Test name: Generate And Format Test
runs-on: ubuntu-22.04 runs-on: ubuntu-latest
env: env:
GOPATH: ${{ github.workspace }}/go GOPATH: ${{ github.workspace }}/go
defaults: defaults:
@ -20,22 +15,32 @@ jobs:
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
steps: steps:
- name: Check out code - name: Check out code
uses: actions/checkout@v4 uses: actions/checkout@v2
with: with:
path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
- name: Setup Go - name: Setup Go
uses: actions/setup-go@v5 uses: actions/setup-go@v2
with: with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod go-version: 1.17.10
cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
- name: Check Go Modules, Generated Go/Python codes, and Format # Verify that go.mod and go.sum are synchronized
run: make check - name: Check Go modules
run: |
go mod tidy &&
git add go.* &&
git diff --cached --exit-code || (echo 'Please run "go mod tidy" to sync Go modules' && exit 1)
- name: Run Generate And Go Format Test
run: |
go mod download &&
make check &&
git add pkg/apis hack/gen-python-sdk &&
git diff --cached --exit-code || (echo 'Please run "make check" to generate codes and to format Go codes' && exit 1)
unittests: unittests:
name: Unit Test name: Unit Test
runs-on: ubuntu-22.04 runs-on: ubuntu-latest
env: env:
GOPATH: ${{ github.workspace }}/go GOPATH: ${{ github.workspace }}/go
defaults: defaults:
@ -43,15 +48,14 @@ jobs:
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
steps: steps:
- name: Check out code - name: Check out code
uses: actions/checkout@v4 uses: actions/checkout@v2
with: with:
path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
- name: Setup Go - name: Setup Go
uses: actions/setup-go@v5 uses: actions/setup-go@v2
with: with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod go-version: 1.17.10
cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
- name: Run Go test - name: Run Go test
run: go mod download && make test ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }} run: go mod download && make test ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}
@ -61,19 +65,9 @@ jobs:
with: with:
path-to-profile: coverage.out path-to-profile: coverage.out
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
parallel: true
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
# Detail: `setup-envtest list` # Detail: `setup-envtest list --arch amd64`
kubernetes-version: ["1.29.3", "1.30.0", "1.31.0"] kubernetes-version: ["1.21.4", "1.22.1", "1.23.5"]
# notifies that all test jobs are finished.
finish:
needs: unittests
runs-on: ubuntu-22.04
steps:
- uses: shogo82148/actions-goveralls@v1
with:
parallel-finished: true


@ -1,30 +0,0 @@
name: Lint Files
on:
pull_request:
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint:
name: Lint
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: 3.9
- name: Check shell scripts
run: make shellcheck
- name: Run pre-commit
uses: pre-commit/action@v3.0.1


@ -1,101 +1,24 @@
name: Frontend Test name: Frontend Test
on: on:
pull_request: - push
paths: - pull_request
- pkg/ui/v1beta1/frontend/**
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs: jobs:
test: test:
name: Code format and lint name: Test
runs-on: ubuntu-22.04 runs-on: ubuntu-latest
steps: steps:
- name: Check out code - name: Check out code
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Node - name: Setup Node
uses: actions/setup-node@v4 uses: actions/setup-node@v2
with: with:
node-version: 16.20.2 node-version: 12.18.1
- name: Format katib code - name: Run Node test
run: | run: |
npm install prettier --prefix ./pkg/ui/v1beta1/frontend npm install prettier --prefix ./pkg/new-ui/v1beta1/frontend
make prettier-check make prettier-check
- name: Lint katib code
run: |
cd pkg/ui/v1beta1/frontend
npm run lint-check
frontend-unit-tests:
name: Frontend Unit Tests
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: 16.20.2
- name: Fetch Kubeflow and install common code dependencies
run: |
COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
cd kubeflow
git checkout $COMMIT
cd components/crud-web-apps/common/frontend/kubeflow-common-lib
npm i
npm run build
npm link ./dist/kubeflow
- name: Install KWA dependencies
run: |
cd pkg/ui/v1beta1/frontend
npm i
npm link kubeflow
- name: Run unit tests
run: |
cd pkg/ui/v1beta1/frontend
npm run test:prod
frontend-ui-tests:
name: UI tests with Cypress
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup node version to 16
uses: actions/setup-node@v4
with:
node-version: 16
- name: Fetch Kubeflow and install common code dependencies
run: |
COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
cd kubeflow
git checkout $COMMIT
cd components/crud-web-apps/common/frontend/kubeflow-common-lib
npm i
npm run build
npm link ./dist/kubeflow
- name: Install KWA dependencies
run: |
cd pkg/ui/v1beta1/frontend
npm i
npm link kubeflow
- name: Serve UI & run Cypress tests in Chrome and Firefox
run: |
cd pkg/ui/v1beta1/frontend
npm run start & npx wait-on http://localhost:4200
npm run ui-test-ci-all


@ -1,47 +1,22 @@
name: Python Test name: Python Test
on: on:
pull_request: - push
paths-ignore: - pull_request
- "pkg/ui/v1beta1/frontend/**"
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs: jobs:
test: test:
name: Test name: Test
runs-on: ubuntu-22.04 runs-on: ubuntu-latest
steps: steps:
- name: Check out code - name: Check out code
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Python - name: Setup Python
uses: actions/setup-python@v5 uses: actions/setup-python@v2
with:
python-version: 3.11
- name: Run Python test
run: make pytest
# The skopt service doesn't work appropriately with Python 3.11.
# So, we need to run the test with Python 3.9.
# TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
# REF: https://github.com/kubeflow/katib/issues/2280
test-skopt:
name: Test Skopt
runs-on: ubuntu-22.04
steps:
- name: Check out code
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with: with:
python-version: 3.9 python-version: 3.9
- name: Run Python test - name: Run Python test
run: make pytest-skopt run: make pytest


@ -0,0 +1,17 @@
name: Shellcheck
on:
- push
- pull_request
jobs:
test:
name: Test
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v2
- name: Run shellcheck
run: make shellcheck


@ -1,21 +1,17 @@
name: E2E Test with tf-mnist-with-summaries name: E2E Test with tf-mnist-with-summaries
on: on:
pull_request: - pull_request
paths-ignore:
- "pkg/ui/v1beta1/frontend/**"
concurrency: env:
group: ${{ github.workflow }}-${{ github.ref }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
cancel-in-progress: true
jobs: jobs:
e2e: e2e:
runs-on: ubuntu-22.04 runs-on: ubuntu-20.04
timeout-minutes: 120 timeout-minutes: 120
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@v2
- name: Setup Test Env - name: Setup Test Env
uses: ./.github/workflows/template-setup-e2e-test uses: ./.github/workflows/template-setup-e2e-test
@ -33,6 +29,8 @@ jobs:
strategy: strategy:
fail-fast: false fail-fast: false
matrix: matrix:
kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"] # TODO (tenzen-y): We need to consider running tests on more kubernetes versions.
# kubernetes-version: ["v1.20.15", "v1.21.13", "v1.22.10", "v1.23.7", "v1.24.1"]
kubernetes-version: ["v1.21.13", "v1.22.10", "v1.23.7"]
# Comma Delimited # Comma Delimited
experiments: ["tfjob-mnist-with-summaries"] experiments: ["tfjob-mnist-with-summaries"]

3
.gitignore vendored

@ -78,6 +78,3 @@ $RECYCLE.BIN/
## Vendor dir ## Vendor dir
vendor vendor
# Jupyter Notebooks.
**/.ipynb_checkpoints


@ -1,38 +0,0 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
args: [--allow-multiple-documents]
- id: check-json
- repo: https://github.com/pycqa/isort
rev: 5.11.5
hooks:
- id: isort
name: isort
entry: isort --profile black
- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
files: (sdk|examples|pkg)/.*
- repo: https://github.com/pycqa/flake8
rev: 7.1.1
hooks:
- id: flake8
files: (sdk|examples|pkg)/.*
exclude: |
(?x)^(
.*zz_generated.deepcopy.*|
.*pb.go|
pkg/apis/manager/.*pb2(?:_grpc)?.py(?:i)?|
pkg/apis/v1beta1/openapi_generated.go|
pkg/mock/.*|
pkg/client/controller/.*|
sdk/python/v1beta1/kubeflow/katib/configuration.py|
sdk/python/v1beta1/kubeflow/katib/rest.py|
sdk/python/v1beta1/kubeflow/katib/__init__.py|
sdk/python/v1beta1/kubeflow/katib/exceptions.py|
sdk/python/v1beta1/kubeflow/katib/api_client.py|
sdk/python/v1beta1/kubeflow/katib/models/.*
)$


@ -11,10 +11,8 @@ Please keep the list in alphabetical order.
| [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform | | [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
| [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform | | [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
| [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech | | [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
| [CERN](https://home.cern/) | [@d-gol](https://github.com/d-gol) | Hyperparameter tuning within the ML platform on private cloud |
| [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa | | [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
| [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform | | [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud | | [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform | | [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform | | [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |


@ -1,821 +1,6 @@
# Changelog # Changelog
# [v0.18.0](https://github.com/kubeflow/katib/tree/v0.18.0) (2025-03-25) ## [v0.13.0](https://github.com/kubeflow/katib/tree/v0.13.0) (2022-03-04)
## Breaking Changes
- Move Katib manifest image references to ghcr ([#2535](https://github.com/kubeflow/katib/pull/2535) by [@saileshd1402](https://github.com/saileshd1402))
- Migrate docker images to ghcr ([#2531](https://github.com/kubeflow/katib/pull/2531) by [@mahdikhashan](https://github.com/mahdikhashan))
- Upgrade Kubernetes to v1.31.3 ([#2478](https://github.com/kubeflow/katib/pull/2478) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade Kubernetes to v1.30.7 ([#2463](https://github.com/kubeflow/katib/pull/2463) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Drop Python 3.7 and Support Python 3.11 in the SDK ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Hyperparameter Optimization for LLMs
- [DOCS] move llm hyperparameter optimisation design image to the proposal directory and rename it ([#2472](https://github.com/kubeflow/katib/pull/2472) by [@mahdikhashan](https://github.com/mahdikhashan))
- [GSoC] Update `tune` API for LLM hyperparameters optimization ([#2393](https://github.com/kubeflow/katib/pull/2393) by [@helenxie-bit](https://github.com/helenxie-bit))
- [GSoC] Create LLM Hyperparameters Optimization API Proposal ([#2333](https://github.com/kubeflow/katib/pull/2333) by [@helenxie-bit](https://github.com/helenxie-bit))
### Support for Advanced Distributions for HPO
- [GSOC] `optuna` suggestion service logic update ([#2446](https://github.com/kubeflow/katib/pull/2446) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] `hyperopt` suggestion service logic update ([#2412](https://github.com/kubeflow/katib/pull/2412) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Add validator for feasible space distribution ([#2404](https://github.com/kubeflow/katib/pull/2404) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] added Unknown distribution and convertDistribution in suggestion client ([#2403](https://github.com/kubeflow/katib/pull/2403) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Support for various Parameter distributions in Katib ([#2334](https://github.com/kubeflow/katib/pull/2334) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Added `DistributionType` to Experiment API ([#2377](https://github.com/kubeflow/katib/pull/2377) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
### Push-based Metrics Collector
- [GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection ([#2437](https://github.com/kubeflow/katib/pull/2437) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Compatibility Changes in Trial Controller ([#2394](https://github.com/kubeflow/katib/pull/2394) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] New Interface `report_metrics` in Python SDK ([#2371](https://github.com/kubeflow/katib/pull/2371) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] KEP for Project 6: Push-based Metrics Collection for Katib ([#2328](https://github.com/kubeflow/katib/pull/2328) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Add New Parameter in `tune` ([#2369](https://github.com/kubeflow/katib/pull/2369) by [@Electronic-Waste](https://github.com/Electronic-Waste))
### SDK Updates
- [SDK] Support PyTorchJob as a Trial Worker ([#2512](https://github.com/kubeflow/katib/pull/2512) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] test: Add e2e test for tune function. ([#2399](https://github.com/kubeflow/katib/pull/2399) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] improve PVC creation name error ([#2496](https://github.com/kubeflow/katib/pull/2496) by [@mahdikhashan](https://github.com/mahdikhashan))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Explain Python version support cycle ([#2354](https://github.com/kubeflow/katib/pull/2354) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- fix(webhook): fix validation message in experiment webhook ([#2507](https://github.com/kubeflow/katib/pull/2507) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Install typing-extensions v4.10.0 to fix Python test error ([#2504](https://github.com/kubeflow/katib/pull/2504) by [@helenxie-bit](https://github.com/helenxie-bit))
- [SDK] Update `tune` API ([#2497](https://github.com/kubeflow/katib/pull/2497) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix(api): resolve all api voilation exceptions in katib api ([#2482](https://github.com/kubeflow/katib/pull/2482) by [@truc0](https://github.com/truc0))
- fix(trial): use propagated gomega to improve debuggability. ([#2432](https://github.com/kubeflow/katib/pull/2432) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix(ui): update None Collector with Push Collector. ([#2418](https://github.com/kubeflow/katib/pull/2418) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: Resolve errors in e2e tests for cypress in Katib UI ([#2384](https://github.com/kubeflow/katib/pull/2384) by [@tariq-hasan](https://github.com/tariq-hasan))
- doc(example): fix the broken link. ([#2433](https://github.com/kubeflow/katib/pull/2433) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: remove remaining MXNet dependency. ([#2456](https://github.com/kubeflow/katib/pull/2456) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Remove Dropout layer from ENAS Trial container to fix E2E tests ([#2455](https://github.com/kubeflow/katib/pull/2455) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] fix grpc related bugs in Python SDK ([#2398](https://github.com/kubeflow/katib/pull/2398) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] Fix types error ([#2424](https://github.com/kubeflow/katib/pull/2424) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix: remove the dependency of `protocmp` in `google.golang.org/protobuf/testing/protocmp`. ([#2391](https://github.com/kubeflow/katib/pull/2391) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix apple silicon rosetta error when building images from the source code ([#2342](https://github.com/kubeflow/katib/pull/2342) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix katib use crds token pipeline trail template guide ([#2330](https://github.com/kubeflow/katib/pull/2330) by [@Jerry-yz](https://github.com/Jerry-yz))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Support old-style TensorFlow events (tensorboard) ([#2517](https://github.com/kubeflow/katib/pull/2517) by [@garymm](https://github.com/garymm))
- Set experiment names at a max of 40 characters. ([#2468](https://github.com/kubeflow/katib/pull/2468) by [@AydanPirani](https://github.com/AydanPirani))
- [CI] optimize katib ui dockerfile ([#2505](https://github.com/kubeflow/katib/pull/2505) by [@mahdikhashan](https://github.com/mahdikhashan))
- Sort experiments by descending creation date by default in katib-ui ([#2498](https://github.com/kubeflow/katib/pull/2498) by [@Doris-xm](https://github.com/Doris-xm))
- [GSoC] Add unit tests for `tune` API ([#2423](https://github.com/kubeflow/katib/pull/2423) by [@helenxie-bit](https://github.com/helenxie-bit))
- Update MutatingWebhookConfiguration: Switch from objectSelector to AdmissionWebhookMatchConditions ([#2241](https://github.com/kubeflow/katib/pull/2241) by [@lianghao208](https://github.com/lianghao208))
- chore: supporting the listen-address parameter on db-manager ([#2465](https://github.com/kubeflow/katib/pull/2465) by [@caiofralmeida](https://github.com/caiofralmeida))
- Upgrade klog to v2 ([#2470](https://github.com/kubeflow/katib/pull/2470) by [@Doris-xm](https://github.com/Doris-xm))
- Ignore cache exporting errors in the image building workflows ([#2487](https://github.com/kubeflow/katib/pull/2487) by [@Doris-xm](https://github.com/Doris-xm))
- Upgrade grpcio version to v1.64.1 ([#2483](https://github.com/kubeflow/katib/pull/2483) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- docs: remove katib workflow ([#2443](https://github.com/kubeflow/katib/pull/2443) by [@gonmmarques](https://github.com/gonmmarques))
- Migrate KatibCertGenerator to OPA CertController ([#2345](https://github.com/kubeflow/katib/pull/2345) by [@forsaken628](https://github.com/forsaken628))
- Promote @Electronic-Waste and @helenxie-bit as Katib reviewers ([#2439](https://github.com/kubeflow/katib/pull/2439) by [@andreyvelich](https://github.com/andreyvelich))
- Update README and out-of-date docs ([#2438](https://github.com/kubeflow/katib/pull/2438) by [@andreyvelich](https://github.com/andreyvelich))
- Changes isort profile to black, to be fully compatible and adds 'pkg' dir for black and flake8 ([#2413](https://github.com/kubeflow/katib/pull/2413) by [@Ygnas](https://github.com/Ygnas))
- Introduced error constants and replaced reflect with cmp ([#2289](https://github.com/kubeflow/katib/pull/2289) by [@tariq-hasan](https://github.com/tariq-hasan))
- [Test] Refactor `inject_webhook_test.go` according to the Developer Guide ([#2401](https://github.com/kubeflow/katib/pull/2401) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Enhance pre-commit hooks with flake8 and black ([#2407](https://github.com/kubeflow/katib/pull/2407) by [@Ygnas](https://github.com/Ygnas))
- added `Distribution` field to feasibleSpace in `api.proto` ([#2397](https://github.com/kubeflow/katib/pull/2397) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- Begin enabling pre-commit hooks ([#2242](https://github.com/kubeflow/katib/pull/2242) by [@droctothorpe](https://github.com/droctothorpe))
- Update Instructions for Argo Workflows ([#2382](https://github.com/kubeflow/katib/pull/2382) by [@jaffe-fly](https://github.com/jaffe-fly))
- docs: update suggestion.md ([#2387](https://github.com/kubeflow/katib/pull/2387) by [@eltociear](https://github.com/eltociear))
- Add command to re-run GitHub Actions tests ([#2385](https://github.com/kubeflow/katib/pull/2385) by [@andreyvelich](https://github.com/andreyvelich))
- Bump Katib Python SDK to 0.17.0 version ([#2379](https://github.com/kubeflow/katib/pull/2379) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0 ([#2380](https://github.com/kubeflow/katib/pull/2380) by [@andreyvelich](https://github.com/andreyvelich))
- Replaced hpcloud with nxadm for tail package in Go ([#2375](https://github.com/kubeflow/katib/pull/2375) by [@tariq-hasan](https://github.com/tariq-hasan))
- Use ErrorList for experiment validator ([#2329](https://github.com/kubeflow/katib/pull/2329) by [@ckcd](https://github.com/ckcd))
- Add Changelog for Katib v0.17.0-rc.1 ([#2370](https://github.com/kubeflow/katib/pull/2370) by [@andreyvelich](https://github.com/andreyvelich))
- Remove default caBundle value ([#2368](https://github.com/kubeflow/katib/pull/2368) by [@vihangm](https://github.com/vihangm))
- Bump Katib Python SDK to 0.17.0rc1 version ([#2365](https://github.com/kubeflow/katib/pull/2365) by [@andreyvelich](https://github.com/andreyvelich))
- Add unit test for `create_experiment` in the `katib_client` module ([#2325](https://github.com/kubeflow/katib/pull/2325) by [@tariq-hasan](https://github.com/tariq-hasan))
- Remove code generation from release script ([#2363](https://github.com/kubeflow/katib/pull/2363) by [@andreyvelich](https://github.com/andreyvelich))
- Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Update Slack Invitation ([#2349](https://github.com/kubeflow/katib/pull/2349) by [@andreyvelich](https://github.com/andreyvelich))
- Update GitHub template to better triage Issues ([#2335](https://github.com/kubeflow/katib/pull/2335) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0-rc.0 ([#2319](https://github.com/kubeflow/katib/pull/2319) by [@andreyvelich](https://github.com/andreyvelich))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Make test fields private in Go unit tests ([#2316](https://github.com/kubeflow/katib/pull/2316) by [@tariq-hasan](https://github.com/tariq-hasan))
- Bump Katib Python SDK to 0.17.0rc0 Version ([#2318](https://github.com/kubeflow/katib/pull/2318) by [@andreyvelich](https://github.com/andreyvelich))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0...v0.18.0)
# [v0.18.0-rc.0](https://github.com/kubeflow/katib/tree/v0.18.0-rc.0) (2025-02-13)
## Breaking Changes
- Upgrade Kubernetes to v1.31.3 ([#2478](https://github.com/kubeflow/katib/pull/2478) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade Kubernetes to v1.30.7 ([#2463](https://github.com/kubeflow/katib/pull/2463) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Drop Python 3.7 and Support Python 3.11 in the SDK ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Hyperparameter Optimization for LLMs
- [DOCS] move llm hyperparameter optimisation design image to the proposal directory and rename it ([#2472](https://github.com/kubeflow/katib/pull/2472) by [@mahdikhashan](https://github.com/mahdikhashan))
- [GSoC] Update `tune` API for LLM hyperparameters optimization ([#2393](https://github.com/kubeflow/katib/pull/2393) by [@helenxie-bit](https://github.com/helenxie-bit))
- [GSoC] Create LLM Hyperparameters Optimization API Proposal ([#2333](https://github.com/kubeflow/katib/pull/2333) by [@helenxie-bit](https://github.com/helenxie-bit))
### Support for Advanced Distributions for HPO
- [GSOC] `optuna` suggestion service logic update ([#2446](https://github.com/kubeflow/katib/pull/2446) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] `hyperopt` suggestion service logic update ([#2412](https://github.com/kubeflow/katib/pull/2412) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Add validator for feasible space distribution ([#2404](https://github.com/kubeflow/katib/pull/2404) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] added Unknown distribution and convertDistribution in suggestion client ([#2403](https://github.com/kubeflow/katib/pull/2403) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSOC] Support for various Parameter distributions in Katib ([#2334](https://github.com/kubeflow/katib/pull/2334) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- [GSoC] Added `DistributionType` to Experiment API ([#2377](https://github.com/kubeflow/katib/pull/2377) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
### Push-based Metrics Collector
- [GSoC] Provide a PyTorch MNIST Example for Push-based Metrics Collection ([#2437](https://github.com/kubeflow/katib/pull/2437) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Compatibility Changes in Trial Controller ([#2394](https://github.com/kubeflow/katib/pull/2394) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] New Interface `report_metrics` in Python SDK ([#2371](https://github.com/kubeflow/katib/pull/2371) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] KEP for Project 6: Push-based Metrics Collection for Katib ([#2328](https://github.com/kubeflow/katib/pull/2328) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [GSoC] Add New Parameter in `tune` ([#2369](https://github.com/kubeflow/katib/pull/2369) by [@Electronic-Waste](https://github.com/Electronic-Waste))
### SDK Updates
- [SDK] Support PyTorchJob as a Trial Worker ([#2512](https://github.com/kubeflow/katib/pull/2512) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] test: Add e2e test for tune function. ([#2399](https://github.com/kubeflow/katib/pull/2399) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] improve PVC creation name error ([#2496](https://github.com/kubeflow/katib/pull/2496) by [@mahdikhashan](https://github.com/mahdikhashan))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Explain Python version support cycle ([#2354](https://github.com/kubeflow/katib/pull/2354) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- fix(webhook): fix validation message in experiment webhook ([#2507](https://github.com/kubeflow/katib/pull/2507) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Install typing-extensions v4.10.0 to fix Python test error ([#2504](https://github.com/kubeflow/katib/pull/2504) by [@helenxie-bit](https://github.com/helenxie-bit))
- [SDK] Update `tune` API ([#2497](https://github.com/kubeflow/katib/pull/2497) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix(api): resolve all api violation exceptions in katib api ([#2482](https://github.com/kubeflow/katib/pull/2482) by [@truc0](https://github.com/truc0))
- fix(trial): use propagated gomega to improve debuggability. ([#2432](https://github.com/kubeflow/katib/pull/2432) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix(ui): update None Collector with Push Collector. ([#2418](https://github.com/kubeflow/katib/pull/2418) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: Resolve errors in e2e tests for cypress in Katib UI ([#2384](https://github.com/kubeflow/katib/pull/2384) by [@tariq-hasan](https://github.com/tariq-hasan))
- doc(example): fix the broken link. ([#2433](https://github.com/kubeflow/katib/pull/2433) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- fix: remove remaining MXNet dependency. ([#2456](https://github.com/kubeflow/katib/pull/2456) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Remove Dropout layer from ENAS Trial container to fix E2E tests ([#2455](https://github.com/kubeflow/katib/pull/2455) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] fix grpc related bugs in Python SDK ([#2398](https://github.com/kubeflow/katib/pull/2398) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- [SDK] Fix types error ([#2424](https://github.com/kubeflow/katib/pull/2424) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix: remove the dependency of `protocmp` in `google.golang.org/protobuf/testing/protocmp`. ([#2391](https://github.com/kubeflow/katib/pull/2391) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix apple silicon rosetta error when building images from the source code ([#2342](https://github.com/kubeflow/katib/pull/2342) by [@helenxie-bit](https://github.com/helenxie-bit))
- fix katib use crds token pipeline trial template guide ([#2330](https://github.com/kubeflow/katib/pull/2330) by [@Jerry-yz](https://github.com/Jerry-yz))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Set experiment names at a max of 40 characters. ([#2468](https://github.com/kubeflow/katib/pull/2468) by [@AydanPirani](https://github.com/AydanPirani))
- [CI] optimize katib ui dockerfile ([#2505](https://github.com/kubeflow/katib/pull/2505) by [@mahdikhashan](https://github.com/mahdikhashan))
- Sort experiments by descending creation date by default in katib-ui ([#2498](https://github.com/kubeflow/katib/pull/2498) by [@Doris-xm](https://github.com/Doris-xm))
- [GSoC] Add unit tests for `tune` API ([#2423](https://github.com/kubeflow/katib/pull/2423) by [@helenxie-bit](https://github.com/helenxie-bit))
- Update MutatingWebhookConfiguration: Switch from objectSelector to AdmissionWebhookMatchConditions ([#2241](https://github.com/kubeflow/katib/pull/2241) by [@lianghao208](https://github.com/lianghao208))
- chore: supporting the listen-address parameter on db-manager ([#2465](https://github.com/kubeflow/katib/pull/2465) by [@caiofralmeida](https://github.com/caiofralmeida))
- Upgrade klog to v2 ([#2470](https://github.com/kubeflow/katib/pull/2470) by [@Doris-xm](https://github.com/Doris-xm))
- Ignore cache exporting errors in the image building workflows ([#2487](https://github.com/kubeflow/katib/pull/2487) by [@Doris-xm](https://github.com/Doris-xm))
- Upgrade grpcio version to v1.64.1 ([#2483](https://github.com/kubeflow/katib/pull/2483) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- docs: remove katib workflow ([#2443](https://github.com/kubeflow/katib/pull/2443) by [@gonmmarques](https://github.com/gonmmarques))
- Migrate KatibCertGenerator to OPA CertController ([#2345](https://github.com/kubeflow/katib/pull/2345) by [@forsaken628](https://github.com/forsaken628))
- Promote @Electronic-Waste and @helenxie-bit as Katib reviewers ([#2439](https://github.com/kubeflow/katib/pull/2439) by [@andreyvelich](https://github.com/andreyvelich))
- Update README and out-of-date docs ([#2438](https://github.com/kubeflow/katib/pull/2438) by [@andreyvelich](https://github.com/andreyvelich))
- Changes isort profile to black, to be fully compatible and adds 'pkg' dir for black and flake8 ([#2413](https://github.com/kubeflow/katib/pull/2413) by [@Ygnas](https://github.com/Ygnas))
- Introduced error constants and replaced reflect with cmp ([#2289](https://github.com/kubeflow/katib/pull/2289) by [@tariq-hasan](https://github.com/tariq-hasan))
- [Test] Refactor `inject_webhook_test.go` according to the Developer Guide ([#2401](https://github.com/kubeflow/katib/pull/2401) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Enhance pre-commit hooks with flake8 and black ([#2407](https://github.com/kubeflow/katib/pull/2407) by [@Ygnas](https://github.com/Ygnas))
- added `Distribution` field to feasibleSpace in `api.proto` ([#2397](https://github.com/kubeflow/katib/pull/2397) by [@shashank-iitbhu](https://github.com/shashank-iitbhu))
- Begin enabling pre-commit hooks ([#2242](https://github.com/kubeflow/katib/pull/2242) by [@droctothorpe](https://github.com/droctothorpe))
- Update Instructions for Argo Workflows ([#2382](https://github.com/kubeflow/katib/pull/2382) by [@jaffe-fly](https://github.com/jaffe-fly))
- docs: update suggestion.md ([#2387](https://github.com/kubeflow/katib/pull/2387) by [@eltociear](https://github.com/eltociear))
- Add command to re-run GitHub Actions tests ([#2385](https://github.com/kubeflow/katib/pull/2385) by [@andreyvelich](https://github.com/andreyvelich))
- Bump Katib Python SDK to 0.17.0 version ([#2379](https://github.com/kubeflow/katib/pull/2379) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0 ([#2380](https://github.com/kubeflow/katib/pull/2380) by [@andreyvelich](https://github.com/andreyvelich))
- Replaced hpcloud with nxadm for tail package in Go ([#2375](https://github.com/kubeflow/katib/pull/2375) by [@tariq-hasan](https://github.com/tariq-hasan))
- Use ErrorList for experiment validator ([#2329](https://github.com/kubeflow/katib/pull/2329) by [@ckcd](https://github.com/ckcd))
- Add Changelog for Katib v0.17.0-rc.1 ([#2370](https://github.com/kubeflow/katib/pull/2370) by [@andreyvelich](https://github.com/andreyvelich))
- Remove default caBundle value ([#2368](https://github.com/kubeflow/katib/pull/2368) by [@vihangm](https://github.com/vihangm))
- Bump Katib Python SDK to 0.17.0rc1 version ([#2365](https://github.com/kubeflow/katib/pull/2365) by [@andreyvelich](https://github.com/andreyvelich))
- Add unit test for `create_experiment` in the `katib_client` module ([#2325](https://github.com/kubeflow/katib/pull/2325) by [@tariq-hasan](https://github.com/tariq-hasan))
- Remove code generation from release script ([#2363](https://github.com/kubeflow/katib/pull/2363) by [@andreyvelich](https://github.com/andreyvelich))
- Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Update Slack Invitation ([#2349](https://github.com/kubeflow/katib/pull/2349) by [@andreyvelich](https://github.com/andreyvelich))
- Update GitHub template to better triage Issues ([#2335](https://github.com/kubeflow/katib/pull/2335) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.17.0-rc.0 ([#2319](https://github.com/kubeflow/katib/pull/2319) by [@andreyvelich](https://github.com/andreyvelich))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Make test fields private in Go unit tests ([#2316](https://github.com/kubeflow/katib/pull/2316) by [@tariq-hasan](https://github.com/tariq-hasan))
- Bump Katib Python SDK to 0.17.0rc0 Version ([#2318](https://github.com/kubeflow/katib/pull/2318) by [@andreyvelich](https://github.com/andreyvelich))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0...v0.18.0-rc.0)
# [v0.17.0](https://github.com/kubeflow/katib/tree/v0.17.0) (2024-07-12)
## Breaking Changes
- [SDK] Drop Python 3.7 and Support Python 3.11 ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
- [SDK] Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.26, and support Kubernetes v1.29 ([#2308](https://github.com/kubeflow/katib/pull/2308) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.25, and Support Kubernetes v1.28 ([#2303](https://github.com/kubeflow/katib/pull/2303) by [@tenzen-y](https://github.com/tenzen-y))
- Remove MXNet examples ([#2267](https://github.com/kubeflow/katib/pull/2267) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Core Features
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
- Support ARM64 arch for release images ([#2315](https://github.com/kubeflow/katib/pull/2315) by [@andreyvelich](https://github.com/andreyvelich))
- DB: Add environment variable option to skip DB table creation ([#2245](https://github.com/kubeflow/katib/pull/2245) by [@lkaybob](https://github.com/lkaybob))
- Add environment variable option to set postgres ssl mode ([#2266](https://github.com/kubeflow/katib/pull/2266) by [@ckcd](https://github.com/ckcd))
- Upgrade TensorFlow version to v2.16.1 ([#2282](https://github.com/kubeflow/katib/pull/2282) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v2.2.1 ([#2279](https://github.com/kubeflow/katib/pull/2279) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Features
- [SDK] Generate Name functionality for creating experiments. ([#2272](https://github.com/kubeflow/katib/pull/2272) by [@bharathk005](https://github.com/bharathk005))
- [SDK] Add `env` & `env_from` in client tune ([#2235](https://github.com/kubeflow/katib/pull/2235) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Add 'algorithm_settings' in client tune ([#2227](https://github.com/kubeflow/katib/pull/2227) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Raise more human-readable name conflict exception ([#2199](https://github.com/kubeflow/katib/pull/2199) by [@droctothorpe](https://github.com/droctothorpe))
## Bug Fixes
- Remove code generation from release script ([#2364](https://github.com/kubeflow/katib/pull/2364) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix env per Trial parameter in tune API ([#2304](https://github.com/kubeflow/katib/pull/2304) by [@andreyvelich](https://github.com/andreyvelich))
- Fix: clean up UTs for file metrics collector ([#2285](https://github.com/kubeflow/katib/pull/2285) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix tensor devices for DARTS Trial ([#2273](https://github.com/kubeflow/katib/pull/2273) by [@sifa1024](https://github.com/sifa1024))
- Typo fix stale.yaml ([#2257](https://github.com/kubeflow/katib/pull/2257) by [@tarilabs](https://github.com/tarilabs))
- Fix Optuna Validation for CMA-ES ([#2240](https://github.com/kubeflow/katib/pull/2240) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
- Upgrade Go version to v1.22 ([#2309](https://github.com/kubeflow/katib/pull/2309) by [@tenzen-y](https://github.com/tenzen-y))
- CI: Enable parallel mode for the coveralls ([#2297](https://github.com/kubeflow/katib/pull/2297) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.11 ([#2278](https://github.com/kubeflow/katib/pull/2278) by [@tenzen-y](https://github.com/tenzen-y))
- chore: add unit testcases for files in Text format. ([#2274](https://github.com/kubeflow/katib/pull/2274) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade google/go-containerregistry/pkg/authn/k8schain ([#2252](https://github.com/kubeflow/katib/pull/2252) by [@tenzen-y](https://github.com/tenzen-y))
- Add Technical and style guide to the contribution guide ([#2250](https://github.com/kubeflow/katib/pull/2250) by [@tenzen-y](https://github.com/tenzen-y))
- Install typing-extensions v4.6.3 for Optuna ([#2251](https://github.com/kubeflow/katib/pull/2251) by [@tenzen-y](https://github.com/tenzen-y))
- Remove legacy BO code ([#2246](https://github.com/kubeflow/katib/pull/2246) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0 ([#2239](https://github.com/kubeflow/katib/pull/2239) by [@andreyvelich](https://github.com/andreyvelich))
- Add Katib ROADMAP 2022/2023 ([#2153](https://github.com/kubeflow/katib/pull/2153) by [@andreyvelich](https://github.com/andreyvelich))
- Update Ubuntu to 22.04 for E2E Tests ([#2222](https://github.com/kubeflow/katib/pull/2222) by [@andreyvelich](https://github.com/andreyvelich))
- Run Stale Action Every 5th Hour ([#2221](https://github.com/kubeflow/katib/pull/2221) by [@andreyvelich](https://github.com/andreyvelich))
- Add Stale GitHub Action ([#2220](https://github.com/kubeflow/katib/pull/2220) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.1 ([#2218](https://github.com/kubeflow/katib/pull/2218) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.0 ([#2204](https://github.com/kubeflow/katib/pull/2204) by [@andreyvelich](https://github.com/andreyvelich))
- Use the controller-runtime logger in the cert-generator ([#2219](https://github.com/kubeflow/katib/pull/2219) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0...v0.17.0)
# [v0.17.0-rc.1](https://github.com/kubeflow/katib/tree/v0.17.0-rc.1) (2024-06-20)
## Breaking Changes
- [SDK] Drop Python 3.7 and Support Python 3.11 ([#2337](https://github.com/kubeflow/katib/pull/2337) by [@tenzen-y](https://github.com/tenzen-y))
- [SDK] Upgrade the protobuf version to >=4.21.12,<5 ([#2358](https://github.com/kubeflow/katib/pull/2358) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
- Replace gRPC code generation tool from Znly/protoc to Buf ([#2344](https://github.com/kubeflow/katib/pull/2344) by [@forsaken628](https://github.com/forsaken628))
## Bug Fixes
- Remove code generation from release script ([#2364](https://github.com/kubeflow/katib/pull/2364) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix empty list for env variables and numpy version ([#2360](https://github.com/kubeflow/katib/pull/2360) by [@andreyvelich](https://github.com/andreyvelich))
- Use cache-dependency-path in actions/setup-go for CI workflow ([#2355](https://github.com/kubeflow/katib/pull/2355) by [@forsaken628](https://github.com/forsaken628))
- Fix TestReconcileBatchJob ([#2350](https://github.com/kubeflow/katib/pull/2350) by [@forsaken628](https://github.com/forsaken628))
- Fix Scikit-Learn Version for Skopt Tests ([#2336](https://github.com/kubeflow/katib/pull/2336) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Replace already closed github.com/golang/mock with go.uber.org/mock ([#2357](https://github.com/kubeflow/katib/pull/2357) by [@forsaken628](https://github.com/forsaken628))
- Update outdated actions ([#2324](https://github.com/kubeflow/katib/pull/2324) by [@Mersho](https://github.com/Mersho))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.17.0-rc.0...v0.17.0-rc.1)
# [v0.17.0-rc.0](https://github.com/kubeflow/katib/tree/v0.17.0-rc.0) (2024-04-29)
## Breaking Changes
- Drop Kubernetes v1.26, and support Kubernetes v1.29 ([#2308](https://github.com/kubeflow/katib/pull/2308) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.25, and Support Kubernetes v1.28 ([#2303](https://github.com/kubeflow/katib/pull/2303) by [@tenzen-y](https://github.com/tenzen-y))
## New Features
### Core Features
- Support ARM64 arch for release images ([#2315](https://github.com/kubeflow/katib/pull/2315) by [@andreyvelich](https://github.com/andreyvelich))
- DB: Add environment variable option to skip DB table creation ([#2245](https://github.com/kubeflow/katib/pull/2245) by [@lkaybob](https://github.com/lkaybob))
- Add environment variable option to set postgres ssl mode ([#2266](https://github.com/kubeflow/katib/pull/2266) by [@ckcd](https://github.com/ckcd))
- Upgrade TensorFlow version to v2.16.1 ([#2282](https://github.com/kubeflow/katib/pull/2282) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v2.2.1 ([#2279](https://github.com/kubeflow/katib/pull/2279) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Features
- [SDK] Generate Name functionality for creating experiments. ([#2272](https://github.com/kubeflow/katib/pull/2272) by [@bharathk005](https://github.com/bharathk005))
- [SDK] Add `env` & `env_from` in client tune ([#2235](https://github.com/kubeflow/katib/pull/2235) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Add 'algorithm_settings' in client tune ([#2227](https://github.com/kubeflow/katib/pull/2227) by [@shipengcheng1230](https://github.com/shipengcheng1230))
- [SDK] Raise more human-readable name conflict exception ([#2199](https://github.com/kubeflow/katib/pull/2199) by [@droctothorpe](https://github.com/droctothorpe))
## Bug Fixes
- [SDK] Fix env per Trial parameter in tune API ([#2304](https://github.com/kubeflow/katib/pull/2304) by [@andreyvelich](https://github.com/andreyvelich))
- Fix: clean up UTs for file metrics collector ([#2285](https://github.com/kubeflow/katib/pull/2285) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Fix tensor devices for DARTS Trial ([#2273](https://github.com/kubeflow/katib/pull/2273) by [@sifa1024](https://github.com/sifa1024))
- Typo fix stale.yaml ([#2257](https://github.com/kubeflow/katib/pull/2257) by [@tarilabs](https://github.com/tarilabs))
- Fix Optuna Validation for CMA-ES ([#2240](https://github.com/kubeflow/katib/pull/2240) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade Go version to v1.22 ([#2309](https://github.com/kubeflow/katib/pull/2309) by [@tenzen-y](https://github.com/tenzen-y))
- CI: Enable parallel mode for the coveralls ([#2297](https://github.com/kubeflow/katib/pull/2297) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.11 ([#2278](https://github.com/kubeflow/katib/pull/2278) by [@tenzen-y](https://github.com/tenzen-y))
- chore: add unit testcases for files in Text format. ([#2274](https://github.com/kubeflow/katib/pull/2274) by [@Electronic-Waste](https://github.com/Electronic-Waste))
- Upgrade google/go-containerregistry/pkg/authn/k8schain ([#2252](https://github.com/kubeflow/katib/pull/2252) by [@tenzen-y](https://github.com/tenzen-y))
- Remove MXNet examples ([#2267](https://github.com/kubeflow/katib/pull/2267) by [@tenzen-y](https://github.com/tenzen-y))
- Add Technical and style guide to the contribution guide ([#2250](https://github.com/kubeflow/katib/pull/2250) by [@tenzen-y](https://github.com/tenzen-y))
- Install typing-extensions v4.6.3 for Optuna ([#2251](https://github.com/kubeflow/katib/pull/2251) by [@tenzen-y](https://github.com/tenzen-y))
- Remove legacy BO code ([#2246](https://github.com/kubeflow/katib/pull/2246) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0 ([#2239](https://github.com/kubeflow/katib/pull/2239) by [@andreyvelich](https://github.com/andreyvelich))
- Add Katib ROADMAP 2022/2023 ([#2153](https://github.com/kubeflow/katib/pull/2153) by [@andreyvelich](https://github.com/andreyvelich))
- Update Ubuntu to 22.04 for E2E Tests ([#2222](https://github.com/kubeflow/katib/pull/2222) by [@andreyvelich](https://github.com/andreyvelich))
- Run Stale Action Every 5th Hour ([#2221](https://github.com/kubeflow/katib/pull/2221) by [@andreyvelich](https://github.com/andreyvelich))
- Add Stale GitHub Action ([#2220](https://github.com/kubeflow/katib/pull/2220) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.1 ([#2218](https://github.com/kubeflow/katib/pull/2218) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.16.0-rc.0 ([#2204](https://github.com/kubeflow/katib/pull/2204) by [@andreyvelich](https://github.com/andreyvelich))
- Use the controller-runtime logger in the cert-generator ([#2219](https://github.com/kubeflow/katib/pull/2219) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0...v0.17.0-rc.0)
# [v0.16.0](https://github.com/kubeflow/katib/tree/v0.16.0) (2023-10-31)
## Breaking Changes
- Implement KatibConfig API ([#2176](https://github.com/kubeflow/katib/pull/2176) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.24 and support Kubernetes v1.27 ([#2182](https://github.com/kubeflow/katib/pull/2182) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.23 and support Kubernetes v1.26 ([#2177](https://github.com/kubeflow/katib/pull/2177) by [@tenzen-y](https://github.com/tenzen-y))
- Change failurePolicy to Fail for Katib Webhooks ([#2018](https://github.com/kubeflow/katib/pull/2018) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Core Features
- Consolidate the Katib Cert Generator to the Katib Controller ([#2185](https://github.com/kubeflow/katib/pull/2185) by [@tenzen-y](https://github.com/tenzen-y))
- Containerize tests for Katib Conformance ([#2146](https://github.com/kubeflow/katib/pull/2146) by [@nagar-ajay](https://github.com/nagar-ajay))
### UI Improvements
- [UI] Default Resume Policy to never from UI ([#2195](https://github.com/kubeflow/katib/pull/2195) by [@mChowdhury-91](https://github.com/mChowdhury-91))
- [UI] Remove Deprecated Katib UI ([#2179](https://github.com/kubeflow/katib/pull/2179) by [@andreyvelich](https://github.com/andreyvelich))
- [UI] Fix Trial Logs when Kubernetes Job Fails ([#2164](https://github.com/kubeflow/katib/pull/2164) by [@andreyvelich](https://github.com/andreyvelich))
- kwa(front): Support all namespaces ([#2119](https://github.com/kubeflow/katib/pull/2119) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Update the use of SnackBarService ([#2113](https://github.com/kubeflow/katib/pull/2113) by [@orfeas-k](https://github.com/orfeas-k))
- UI: Remove an unused import, EventV1beta1Api ([#2116](https://github.com/kubeflow/katib/pull/2116) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Improvements
- [SDK] Enable resource specification for trial containers ([#2192](https://github.com/kubeflow/katib/pull/2192) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Add namespace parameter to KatibClient ([#2183](https://github.com/kubeflow/katib/pull/2183) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Import all Kubernetes Models ([#2148](https://github.com/kubeflow/katib/pull/2148) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- Bug: Wait for the certs to be mounted inside the container ([#2213](https://github.com/kubeflow/katib/pull/2213) by [@tenzen-y](https://github.com/tenzen-y))
- Start waiting for certs to be ready before sending data to the channel ([#2215](https://github.com/kubeflow/katib/pull/2215) by [@tenzen-y](https://github.com/tenzen-y))
- E2E: Add additional checks to verify if the components are ready ([#2212](https://github.com/kubeflow/katib/pull/2212) by [@tenzen-y](https://github.com/tenzen-y))
- Remove a katib-webhook-cert Secret from components ([#2214](https://github.com/kubeflow/katib/pull/2214) by [@tenzen-y](https://github.com/tenzen-y))
- Skip to inject the metrics-collector pods to the Katib controller ([#2211](https://github.com/kubeflow/katib/pull/2211) by [@tenzen-y](https://github.com/tenzen-y))
- Sending an empty data to the certsReady channel ([#2196](https://github.com/kubeflow/katib/pull/2196) by [@tenzen-y](https://github.com/tenzen-y))
- Fix conformance docker image ([#2147](https://github.com/kubeflow/katib/pull/2147) by [@nagar-ajay](https://github.com/nagar-ajay))
## Documentation
- Add PITS Global Data Recovery Services to the adopters list ([#2160](https://github.com/kubeflow/katib/pull/2160) by [@ghost](https://github.com/ghost))
- Add SDK Breaking Change to Changelog ([#2133](https://github.com/kubeflow/katib/pull/2133) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0 ([#2129](https://github.com/kubeflow/katib/pull/2129) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.1 ([#2123](https://github.com/kubeflow/katib/pull/2123) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.0 ([#2106](https://github.com/kubeflow/katib/pull/2106) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade Tensorflow version to v2.13.0 ([#2216](https://github.com/kubeflow/katib/pull/2216) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Go version to v1.20 ([#2190](https://github.com/kubeflow/katib/pull/2190) by [@tenzen-y](https://github.com/tenzen-y))
- Replace grpc_health_probe with the built-in gRPC container probe feature ([#2189](https://github.com/kubeflow/katib/pull/2189) by [@tenzen-y](https://github.com/tenzen-y))
- Allow install binaries for the arm64 in the envtest ([#2188](https://github.com/kubeflow/katib/pull/2188) by [@tenzen-y](https://github.com/tenzen-y))
- Replace action to setup minikube with medyagh/setup-minikube ([#2178](https://github.com/kubeflow/katib/pull/2178) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Charmed Operators for Katib ([#2161](https://github.com/kubeflow/katib/pull/2161) by [@ca-scribner](https://github.com/ca-scribner))
- Namespace and trial pod annotations as CLI argument ([#2138](https://github.com/kubeflow/katib/pull/2138) by [@nagar-ajay](https://github.com/nagar-ajay))
- Relax dependencies restriction for the gRPC libraries ([#2140](https://github.com/kubeflow/katib/pull/2140) by [@tenzen-y](https://github.com/tenzen-y))
- Increase the free spaces in CI ([#2131](https://github.com/kubeflow/katib/pull/2131) by [@tenzen-y](https://github.com/tenzen-y))
- Reformat katib-operators ([#2114](https://github.com/kubeflow/katib/pull/2114) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0...v0.16.0)
# [v0.16.0-rc.1](https://github.com/kubeflow/katib/tree/v0.16.0-rc.1) (2023-08-16)
## New Features
- Upgrade Tensorflow version to v2.13.0 ([#2216](https://github.com/kubeflow/katib/pull/2216) by [@tenzen-y](https://github.com/tenzen-y))
## Bug Fixes
- Bug: Wait for the certs to be mounted inside the container ([#2213](https://github.com/kubeflow/katib/pull/2213) by [@tenzen-y](https://github.com/tenzen-y))
- Start waiting for certs to be ready before sending data to the channel ([#2215](https://github.com/kubeflow/katib/pull/2215) by [@tenzen-y](https://github.com/tenzen-y))
- E2E: Add additional checks to verify if the components are ready ([#2212](https://github.com/kubeflow/katib/pull/2212) by [@tenzen-y](https://github.com/tenzen-y))
- Remove a katib-webhook-cert Secret from components ([#2214](https://github.com/kubeflow/katib/pull/2214) by [@tenzen-y](https://github.com/tenzen-y))
- Skip to inject the metrics-collector pods to the Katib controller ([#2211](https://github.com/kubeflow/katib/pull/2211) by [@tenzen-y](https://github.com/tenzen-y))

[Full Changelog](https://github.com/kubeflow/katib/compare/v0.16.0-rc.0...v0.16.0-rc.1)

# [v0.16.0-rc.0](https://github.com/kubeflow/katib/tree/v0.16.0-rc.0) (2023-08-05)
## Breaking Changes
- Implement KatibConfig API ([#2176](https://github.com/kubeflow/katib/pull/2176) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.24 and support Kubernetes v1.27 ([#2182](https://github.com/kubeflow/katib/pull/2182) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.23 and support Kubernetes v1.26 ([#2177](https://github.com/kubeflow/katib/pull/2177) by [@tenzen-y](https://github.com/tenzen-y))
- Change failurePolicy to Fail for Katib Webhooks ([#2018](https://github.com/kubeflow/katib/pull/2018) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Core Features
- Consolidate the Katib Cert Generator to the Katib Controller ([#2185](https://github.com/kubeflow/katib/pull/2185) by [@tenzen-y](https://github.com/tenzen-y))
- Containerize tests for Katib Conformance ([#2146](https://github.com/kubeflow/katib/pull/2146) by [@nagar-ajay](https://github.com/nagar-ajay))
### UI Improvements
- [UI] Default Resume Policy to never from UI ([#2195](https://github.com/kubeflow/katib/pull/2195) by [@mChowdhury-91](https://github.com/mChowdhury-91))
- [UI] Remove Deprecated Katib UI ([#2179](https://github.com/kubeflow/katib/pull/2179) by [@andreyvelich](https://github.com/andreyvelich))
- [UI] Fix Trial Logs when Kubernetes Job Fails ([#2164](https://github.com/kubeflow/katib/pull/2164) by [@andreyvelich](https://github.com/andreyvelich))
- kwa(front): Support all namespaces ([#2119](https://github.com/kubeflow/katib/pull/2119) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Update the use of SnackBarService ([#2113](https://github.com/kubeflow/katib/pull/2113) by [@orfeas-k](https://github.com/orfeas-k))
- UI: Remove an unused import, EventV1beta1Api ([#2116](https://github.com/kubeflow/katib/pull/2116) by [@tenzen-y](https://github.com/tenzen-y))
### SDK Improvements
- [SDK] Enable resource specification for trial containers ([#2192](https://github.com/kubeflow/katib/pull/2192) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Add namespace parameter to KatibClient ([#2183](https://github.com/kubeflow/katib/pull/2183) by [@droctothorpe](https://github.com/droctothorpe))
- [SDK] Import all Kubernetes Models ([#2148](https://github.com/kubeflow/katib/pull/2148) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- Sending an empty data to the certsReady channel ([#2196](https://github.com/kubeflow/katib/pull/2196) by [@tenzen-y](https://github.com/tenzen-y))
- Fix conformance docker image ([#2147](https://github.com/kubeflow/katib/pull/2147) by [@nagar-ajay](https://github.com/nagar-ajay))
## Documentation
- Add PITS Global Data Recovery Services to the adopters list ([#2160](https://github.com/kubeflow/katib/pull/2160) by [@ghost](https://github.com/ghost))
- Add SDK Breaking Change to Changelog ([#2133](https://github.com/kubeflow/katib/pull/2133) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0 ([#2129](https://github.com/kubeflow/katib/pull/2129) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.1 ([#2123](https://github.com/kubeflow/katib/pull/2123) by [@andreyvelich](https://github.com/andreyvelich))
- Add Changelog for Katib v0.15.0-rc.0 ([#2106](https://github.com/kubeflow/katib/pull/2106) by [@andreyvelich](https://github.com/andreyvelich))
## Misc
- Upgrade Go version to v1.20 ([#2190](https://github.com/kubeflow/katib/pull/2190) by [@tenzen-y](https://github.com/tenzen-y))
- Replace grpc_health_probe with the built-in gRPC container probe feature ([#2189](https://github.com/kubeflow/katib/pull/2189) by [@tenzen-y](https://github.com/tenzen-y))
- Allow install binaries for the arm64 in the envtest ([#2188](https://github.com/kubeflow/katib/pull/2188) by [@tenzen-y](https://github.com/tenzen-y))
- Replace action to setup minikube with medyagh/setup-minikube ([#2178](https://github.com/kubeflow/katib/pull/2178) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Charmed Operators for Katib ([#2161](https://github.com/kubeflow/katib/pull/2161) by [@ca-scribner](https://github.com/ca-scribner))
- Namespace and trial pod annotations as CLI argument ([#2138](https://github.com/kubeflow/katib/pull/2138) by [@nagar-ajay](https://github.com/nagar-ajay))
- Relax dependencies restriction for the gRPC libraries ([#2140](https://github.com/kubeflow/katib/pull/2140) by [@tenzen-y](https://github.com/tenzen-y))
- Increase the free spaces in CI ([#2131](https://github.com/kubeflow/katib/pull/2131) by [@tenzen-y](https://github.com/tenzen-y))
- Reformat katib-operators ([#2114](https://github.com/kubeflow/katib/pull/2114) by [@tenzen-y](https://github.com/tenzen-y))

[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0...v0.16.0-rc.0)

# [v0.15.0](https://github.com/kubeflow/katib/tree/v0.15.0) (2023-03-22)
## Breaking Changes
- Use **Never** Resume Policy as Default ([#2102](https://github.com/kubeflow/katib/pull/2102) by [@andreyvelich](https://github.com/andreyvelich))
- Chocolate Suggestion Service is removed ([#2071](https://github.com/kubeflow/katib/pull/2071) by [@tenzen-y](https://github.com/tenzen-y))
- `request_number` is removed from the gRPC APIs ([#1994](https://github.com/kubeflow/katib/pull/1994) by [@johnugeorge](https://github.com/johnugeorge))
- Enabling Authorization in Katib UI ([#1983](https://github.com/kubeflow/katib/pull/1983) and [#2041](https://github.com/kubeflow/katib/pull/2041) by [@apo-ger](https://github.com/apo-ger))
- The new improved and refactored Katib SDK is not backward compatible ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Major Features
- Narrow down Katib RBAC rules ([#2091](https://github.com/kubeflow/katib/pull/2091) by [@johnugeorge](https://github.com/johnugeorge))
- Support Postgres as a Katib DB ([#1921](https://github.com/kubeflow/katib/pull/1921) by [@anencore94](https://github.com/anencore94))
- More Suggestion container fields in Katib Config ([#2000](https://github.com/kubeflow/katib/pull/2000) by [@fischor](https://github.com/fischor))
- Katib UI: Create the LOGS tab of Trial's details page ([#2117](https://github.com/kubeflow/katib/pull/2117) by [@elenzio9](https://github.com/elenzio9))
- Katib UI: Enable pagination/sorting/filtering ([#2017](https://github.com/kubeflow/katib/pull/2017) and [#2040](https://github.com/kubeflow/katib/pull/2040) by [@elenzio9](https://github.com/elenzio9))
- [SDK] Create Tune API in the Katib SDK ([#1951](https://github.com/kubeflow/katib/pull/1951) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Get Trial Metrics from Katib DB ([#2050](https://github.com/kubeflow/katib/pull/2050) by [@andreyvelich](https://github.com/andreyvelich))
### Core Features
- Add Conformance Program Doc for AutoML and Training WG ([#2048](https://github.com/kubeflow/katib/pull/2048) by [@andreyvelich](https://github.com/andreyvelich))
- Support for grid search algorithm in Optuna Suggestion Service ([#2060](https://github.com/kubeflow/katib/pull/2060) by [@tenzen-y](https://github.com/tenzen-y))
- Add Trial Labels During Pod Mutation ([#2047](https://github.com/kubeflow/katib/pull/2047) by [@andreyvelich](https://github.com/andreyvelich))
- Support for k8s v1.25 in CI ([#1997](https://github.com/kubeflow/katib/pull/1997) by [@johnugeorge](https://github.com/johnugeorge))
- Add the CI to build multi-platform container images ([#1956](https://github.com/kubeflow/katib/pull/1956) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.21 and introduce Kubernetes v1.24 ([#1953](https://github.com/kubeflow/katib/pull/1953) by [@tenzen-y](https://github.com/tenzen-y))
- Add --connect-timeout flag to katib-db-manager ([#1937](https://github.com/kubeflow/katib/pull/1937) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validations for DARTS suggestion service ([#1926](https://github.com/kubeflow/katib/pull/1926) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validation for Optuna suggestion service ([#1924](https://github.com/kubeflow/katib/pull/1924) by [@tenzen-y](https://github.com/tenzen-y))
### UI Improvements
- Make links in KWA's tables actual links ([#2090](https://github.com/kubeflow/katib/pull/2090) by [@elenzio9](https://github.com/elenzio9))
- frontend: Rework the trial graph using ECharts in KWA ([#2089](https://github.com/kubeflow/katib/pull/2089) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Add UI tests with Cypress ([#2088](https://github.com/kubeflow/katib/pull/2088) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Enable actions in experiment graph ([#2065](https://github.com/kubeflow/katib/pull/2065) by [@elenzio9](https://github.com/elenzio9))
- frontend: Show message in case of uncompleted trial instead of the graph ([#2063](https://github.com/kubeflow/katib/pull/2063) by [@elenzio9](https://github.com/elenzio9))
- frontend: Add source maps in the browser ([#2043](https://github.com/kubeflow/katib/pull/2043) by [@elenzio9](https://github.com/elenzio9))
- Backend for getting logs of a trial ([#2039](https://github.com/kubeflow/katib/pull/2039) by [@d-gol](https://github.com/d-gol))
- frontend: Show the successful trials in the experiment graph (#2013) ([#2033](https://github.com/kubeflow/katib/pull/2033) by [@elenzio9](https://github.com/elenzio9))
- frontend: Migrate from tslint to eslint in KWA ([#2042](https://github.com/kubeflow/katib/pull/2042) by [@elenzio9](https://github.com/elenzio9))
- Dedicated yaml tab for Trials ([#2034](https://github.com/kubeflow/katib/pull/2034) by [@elenzio9](https://github.com/elenzio9))
- KWA: Use new Editor component (Monaco) ([#2023](https://github.com/kubeflow/katib/pull/2023) by [@orfeas-k](https://github.com/orfeas-k))
- kwa(build): Introduce COMMIT file for building KWA ([#2014](https://github.com/kubeflow/katib/pull/2014) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Fix 500 error after detail page refresh (#1967) ([#2001](https://github.com/kubeflow/katib/pull/2001) by [@elenzio9](https://github.com/elenzio9))
- Introduce KWA's frontend component for kfp links ([#1991](https://github.com/kubeflow/katib/pull/1991) by [@elenzio9](https://github.com/elenzio9))
- UI: Rename and right align the age column ([#1989](https://github.com/kubeflow/katib/pull/1989) by [@elenzio9](https://github.com/elenzio9))
- Show the trials table's status column first ([#1990](https://github.com/kubeflow/katib/pull/1990) by [@elenzio9](https://github.com/elenzio9))
- UI: Make KWA's main table responsive and add toolbar ([#1982](https://github.com/kubeflow/katib/pull/1982) by [@elenzio9](https://github.com/elenzio9))
- UI: Fix unit tests ([#1977](https://github.com/kubeflow/katib/pull/1977) by [@elenzio9](https://github.com/elenzio9))
- UI: Format code ([#1979](https://github.com/kubeflow/katib/pull/1979) by [@orfeas-k](https://github.com/orfeas-k))
- Recreate the Experiments Parallel Coordinates Graph ([#1974](https://github.com/kubeflow/katib/pull/1974) by [@elenzio9](https://github.com/elenzio9))
- Improve UI API/controller logging to ease troubleshooting ([#1966](https://github.com/kubeflow/katib/pull/1966) by [@lukeogg](https://github.com/lukeogg))
### SDK Improvements
- [SDK] Use Katib SDK for E2E Tests ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Use Katib Client without Kube Config ([#2098](https://github.com/kubeflow/katib/pull/2098) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix namespace parameter in tune API ([#1981](https://github.com/kubeflow/katib/pull/1981) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Remove Final Keyword from constants ([#1980](https://github.com/kubeflow/katib/pull/1980) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- Fix Release Script for Updating SDK Version ([#2104](https://github.com/kubeflow/katib/pull/2104) by [@andreyvelich](https://github.com/andreyvelich))
- [Fix] add early stopped trials in converter ([#2004](https://github.com/kubeflow/katib/pull/2004) by [@shaowei-su](https://github.com/shaowei-su))
- [bugfix] Fix value passing bug in New Experiment form ([#2027](https://github.com/kubeflow/katib/pull/2027) by [@orfeas-k](https://github.com/orfeas-k))
- Fix main process retrieve logic for early stopping ([#1988](https://github.com/kubeflow/katib/pull/1988) by [@shaowei-su](https://github.com/shaowei-su))
- [hotfix]: filter by name of experiment ([#1920](https://github.com/kubeflow/katib/pull/1920) by [@anencore94](https://github.com/anencore94))
- Fix push script to include new images ([#1911](https://github.com/kubeflow/katib/pull/1911) by [@johnugeorge](https://github.com/johnugeorge))
- fix: only validate Kubernetes Job ([#2025](https://github.com/kubeflow/katib/pull/2025) by [@zhixian82](https://github.com/zhixian82))
- Upgrade grpc-health-probe version to fix some security issues ([#2093](https://github.com/kubeflow/katib/pull/2093) by [@tenzen-y](https://github.com/tenzen-y))
- Format Katib Charm Operator ([#2115](https://github.com/kubeflow/katib/pull/2115) by [@tenzen-y](https://github.com/tenzen-y))
## Documentation
- Add CERN to adopters ([#2010](https://github.com/kubeflow/katib/pull/2010) by [@d-gol](https://github.com/d-gol))
- Add More Katib Presentations 2022 ([#2009](https://github.com/kubeflow/katib/pull/2009) by [@andreyvelich](https://github.com/andreyvelich))
- Add the documentation for simple-pbt ([#1978](https://github.com/kubeflow/katib/pull/1978) by [@tenzen-y](https://github.com/tenzen-y))
- Add the license to pbt ([#1958](https://github.com/kubeflow/katib/pull/1958) by [@tenzen-y](https://github.com/tenzen-y))
- Update the Katib version in docs ([#1950](https://github.com/kubeflow/katib/pull/1950) by [@tenzen-y](https://github.com/tenzen-y))
- Update CHANGELOG for v0.14.0 release ([#1932](https://github.com/kubeflow/katib/pull/1932) by [@johnugeorge](https://github.com/johnugeorge))
## Misc
- Update Training operator Image in CI ([#2103](https://github.com/kubeflow/katib/pull/2103) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Go libraries to resolve security issues ([#2094](https://github.com/kubeflow/katib/pull/2094) by [@tenzen-y](https://github.com/tenzen-y))
- Run e2e with various Python versions to verify Python SDK ([#2092](https://github.com/kubeflow/katib/pull/2092) by [@tenzen-y](https://github.com/tenzen-y))
- Add a --prefer-binary flag to 'pip install' command ([#2096](https://github.com/kubeflow/katib/pull/2096) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v1.13.0 ([#2082](https://github.com/kubeflow/katib/pull/2082) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade TensorFlow version ([#2079](https://github.com/kubeflow/katib/pull/2079) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.10 ([#2057](https://github.com/kubeflow/katib/pull/2057) by [@tenzen-y](https://github.com/tenzen-y))
- Pin the NumPy version with v1.23.5 in some images ([#2070](https://github.com/kubeflow/katib/pull/2070) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the actions-setup-minikube version to v2.7.2 ([#2064](https://github.com/kubeflow/katib/pull/2064) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Certificate Chain from Cert Generator ([#2045](https://github.com/kubeflow/katib/pull/2045) by [@andreyvelich](https://github.com/andreyvelich))
- Add resources to earlystopping container ([#2038](https://github.com/kubeflow/katib/pull/2038) by [@zhixian82](https://github.com/zhixian82))
- Add scripts to verify generated codes and Go Modules ([#1999](https://github.com/kubeflow/katib/pull/1999) by [@tenzen-y](https://github.com/tenzen-y))
- [Test] Reduce Katib GitHub Action Runs ([#2036](https://github.com/kubeflow/katib/pull/2036) by [@andreyvelich](https://github.com/andreyvelich))
- gh-actions: Extend action to run Frontend Unit tests ([#1998](https://github.com/kubeflow/katib/pull/1998) by [@orfeas-k](https://github.com/orfeas-k))
- [chore] Upgrade docker/metadata-action, actions/checkout, and actions/setup-python version ([#1996](https://github.com/kubeflow/katib/pull/1996) by [@tenzen-y](https://github.com/tenzen-y))
- [chore] Upgrade Go version to v1.19 ([#1995](https://github.com/kubeflow/katib/pull/1995) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in simple-pbt image ([#1948](https://github.com/kubeflow/katib/pull/1948) by [@tenzen-y](https://github.com/tenzen-y))
- Support arm64 in darts-cnn-cifar10 image ([#1947](https://github.com/kubeflow/katib/pull/1947) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in enas-cnn-cifar10 image ([#1944](https://github.com/kubeflow/katib/pull/1944) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in pytorch-mnist image ([#1943](https://github.com/kubeflow/katib/pull/1943) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in mxnet-mnist image ([#1940](https://github.com/kubeflow/katib/pull/1940) by [@tenzen-y](https://github.com/tenzen-y))
- Use the katib-new-ui for Charmed gh-actions ([#1987](https://github.com/kubeflow/katib/pull/1987) by [@tenzen-y](https://github.com/tenzen-y))
- [feat] health check for katib-controller ([#1934](https://github.com/kubeflow/katib/pull/1934) by [@anencore94](https://github.com/anencore94))
- Upgrade Optuna from v2.x.x to v3.0.0 ([#1942](https://github.com/kubeflow/katib/pull/1942) by [@keisuke-umezawa](https://github.com/keisuke-umezawa))
- Add validation webhooks for maxFailedTrialCount and parallelTrialCount ([#1936](https://github.com/kubeflow/katib/pull/1936) by [@tenzen-y](https://github.com/tenzen-y))
- Introduce Automatic platform ARGs ([#1935](https://github.com/kubeflow/katib/pull/1935) by [@tenzen-y](https://github.com/tenzen-y))
- Update training operator image in CI ([#1933](https://github.com/kubeflow/katib/pull/1933) by [@johnugeorge](https://github.com/johnugeorge))
- Update Katib SDK version ([#1931](https://github.com/kubeflow/katib/pull/1931) by [@johnugeorge](https://github.com/johnugeorge))
- [chore] Upgrade Go version to v1.18 ([#1925](https://github.com/kubeflow/katib/pull/1925) by [@tenzen-y](https://github.com/tenzen-y))
- Add the pytorch-mnist with GPU support container image ([#1916](https://github.com/kubeflow/katib/pull/1916) by [@tenzen-y](https://github.com/tenzen-y))

[Full Changelog](https://github.com/kubeflow/katib/compare/v0.14.0...v0.15.0)

# [v0.15.0-rc.1](https://github.com/kubeflow/katib/tree/v0.15.0-rc.1) (2023-02-15)
## New Features
- UI: Create the LOGS tab of Trial's details page ([#2117](https://github.com/kubeflow/katib/pull/2117) by [@elenzio9](https://github.com/elenzio9))
## Bug Fixes
- Format Katib Charm Operator ([#2115](https://github.com/kubeflow/katib/pull/2115) by [@tenzen-y](https://github.com/tenzen-y))

[Full Changelog](https://github.com/kubeflow/katib/compare/v0.15.0-rc.0...v0.15.0-rc.1)

# [v0.15.0-rc.0](https://github.com/kubeflow/katib/tree/v0.15.0-rc.0) (2023-01-27)
## Breaking Changes
- Use **Never** Resume Policy as Default ([#2102](https://github.com/kubeflow/katib/pull/2102) by [@andreyvelich](https://github.com/andreyvelich))
- Chocolate Suggestion Service is removed ([#2071](https://github.com/kubeflow/katib/pull/2071) by [@tenzen-y](https://github.com/tenzen-y))
- `request_number` is removed from the gRPC APIs ([#1994](https://github.com/kubeflow/katib/pull/1994) by [@johnugeorge](https://github.com/johnugeorge))
- The new improved and refactored Katib SDK is not backward compatible ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
## New Features
### Major Features
- Narrow down Katib RBAC rules ([#2091](https://github.com/kubeflow/katib/pull/2091) by [@johnugeorge](https://github.com/johnugeorge))
- Support Postgres as a Katib DB ([#1921](https://github.com/kubeflow/katib/pull/1921) by [@anencore94](https://github.com/anencore94))
- More Suggestion container fields in Katib Config ([#2000](https://github.com/kubeflow/katib/pull/2000) by [@fischor](https://github.com/fischor))
- Katib UI: Enable pagination/sorting/filtering ([#2017](https://github.com/kubeflow/katib/pull/2017) and [#2040](https://github.com/kubeflow/katib/pull/2040) by [@elenzio9](https://github.com/elenzio9))
- Katib UI: Add authorization mechanisms ([#1983](https://github.com/kubeflow/katib/pull/1983) by [@apo-ger](https://github.com/apo-ger))
- [SDK] Create Tune API in the Katib SDK ([#1951](https://github.com/kubeflow/katib/pull/1951) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Get Trial Metrics from Katib DB ([#2050](https://github.com/kubeflow/katib/pull/2050) by [@andreyvelich](https://github.com/andreyvelich))
### Core Features
- Add Conformance Program Doc for AutoML and Training WG ([#2048](https://github.com/kubeflow/katib/pull/2048) by [@andreyvelich](https://github.com/andreyvelich))
- Support for grid search algorithm in Optuna Suggestion Service ([#2060](https://github.com/kubeflow/katib/pull/2060) by [@tenzen-y](https://github.com/tenzen-y))
- Add Trial Labels During Pod Mutation ([#2047](https://github.com/kubeflow/katib/pull/2047) by [@andreyvelich](https://github.com/andreyvelich))
- Support for k8s v1.25 in CI ([#1997](https://github.com/kubeflow/katib/pull/1997) by [@johnugeorge](https://github.com/johnugeorge))
- Add the CI to build multi-platform container images ([#1956](https://github.com/kubeflow/katib/pull/1956) by [@tenzen-y](https://github.com/tenzen-y))
- Drop Kubernetes v1.21 and introduce Kubernetes v1.24 ([#1953](https://github.com/kubeflow/katib/pull/1953) by [@tenzen-y](https://github.com/tenzen-y))
- Add --connect-timeout flag to katib-db-manager ([#1937](https://github.com/kubeflow/katib/pull/1937) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validations for DARTS suggestion service ([#1926](https://github.com/kubeflow/katib/pull/1926) by [@tenzen-y](https://github.com/tenzen-y))
- Implement validation for Optuna suggestion service ([#1924](https://github.com/kubeflow/katib/pull/1924) by [@tenzen-y](https://github.com/tenzen-y))
### UI Improvements
- Make links in KWA's tables actual links ([#2090](https://github.com/kubeflow/katib/pull/2090) by [@elenzio9](https://github.com/elenzio9))
- frontend: Rework the trial graph using ECharts in KWA ([#2089](https://github.com/kubeflow/katib/pull/2089) by [@elenzio9](https://github.com/elenzio9))
- kwa(front): Add UI tests with Cypress ([#2088](https://github.com/kubeflow/katib/pull/2088) by [@orfeas-k](https://github.com/orfeas-k))
- Update manifests to enable authorization check mechanisms for Katib UI in Kubeflow mode ([#2041](https://github.com/kubeflow/katib/pull/2041) by [@apo-ger](https://github.com/apo-ger))
- frontend: Enable actions in experiment graph ([#2065](https://github.com/kubeflow/katib/pull/2065) by [@elenzio9](https://github.com/elenzio9))
- frontend: Show message in case of uncompleted trial instead of the graph ([#2063](https://github.com/kubeflow/katib/pull/2063) by [@elenzio9](https://github.com/elenzio9))
- frontend: Add source maps in the browser ([#2043](https://github.com/kubeflow/katib/pull/2043) by [@elenzio9](https://github.com/elenzio9))
- Backend for getting logs of a trial ([#2039](https://github.com/kubeflow/katib/pull/2039) by [@d-gol](https://github.com/d-gol))
- frontend: Show the successful trials in the experiment graph (#2013) ([#2033](https://github.com/kubeflow/katib/pull/2033) by [@elenzio9](https://github.com/elenzio9))
- frontend: Migrate from tslint to eslint in KWA ([#2042](https://github.com/kubeflow/katib/pull/2042) by [@elenzio9](https://github.com/elenzio9))
- Dedicated yaml tab for Trials ([#2034](https://github.com/kubeflow/katib/pull/2034) by [@elenzio9](https://github.com/elenzio9))
- KWA: Use new Editor component (Monaco) ([#2023](https://github.com/kubeflow/katib/pull/2023) by [@orfeas-k](https://github.com/orfeas-k))
- kwa(build): Introduce COMMIT file for building KWA ([#2014](https://github.com/kubeflow/katib/pull/2014) by [@orfeas-k](https://github.com/orfeas-k))
- frontend: Fix 500 error after detail page refresh (#1967) ([#2001](https://github.com/kubeflow/katib/pull/2001) by [@elenzio9](https://github.com/elenzio9))
- Introduce KWA's frontend component for kfp links ([#1991](https://github.com/kubeflow/katib/pull/1991) by [@elenzio9](https://github.com/elenzio9))
- UI: Rename and right align the age column ([#1989](https://github.com/kubeflow/katib/pull/1989) by [@elenzio9](https://github.com/elenzio9))
- Show the trials table's status column first ([#1990](https://github.com/kubeflow/katib/pull/1990) by [@elenzio9](https://github.com/elenzio9))
- UI: Make KWA's main table responsive and add toolbar ([#1982](https://github.com/kubeflow/katib/pull/1982) by [@elenzio9](https://github.com/elenzio9))
- UI: Fix unit tests ([#1977](https://github.com/kubeflow/katib/pull/1977) by [@elenzio9](https://github.com/elenzio9))
- UI: Format code ([#1979](https://github.com/kubeflow/katib/pull/1979) by [@orfeas-k](https://github.com/orfeas-k))
- Recreate the Experiments Parallel Coordinates Graph ([#1974](https://github.com/kubeflow/katib/pull/1974) by [@elenzio9](https://github.com/elenzio9))
- Improve UI API/controller logging to ease troubleshooting ([#1966](https://github.com/kubeflow/katib/pull/1966) by [@lukeogg](https://github.com/lukeogg))
### SDK Improvements
- [SDK] Use Katib SDK for E2E Tests ([#2075](https://github.com/kubeflow/katib/pull/2075) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Use Katib Client without Kube Config ([#2098](https://github.com/kubeflow/katib/pull/2098) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Fix namespace parameter in tune API ([#1981](https://github.com/kubeflow/katib/pull/1981) by [@andreyvelich](https://github.com/andreyvelich))
- [SDK] Remove Final Keyword from constants ([#1980](https://github.com/kubeflow/katib/pull/1980) by [@andreyvelich](https://github.com/andreyvelich))
## Bug Fixes
- Fix Release Script for Updating SDK Version ([#2104](https://github.com/kubeflow/katib/pull/2104) by [@andreyvelich](https://github.com/andreyvelich))
- [Fix] add early stopped trials in converter ([#2004](https://github.com/kubeflow/katib/pull/2004) by [@shaowei-su](https://github.com/shaowei-su))
- [bugfix] Fix value passing bug in New Experiment form ([#2027](https://github.com/kubeflow/katib/pull/2027) by [@orfeas-k](https://github.com/orfeas-k))
- Fix main process retrieve logic for early stopping ([#1988](https://github.com/kubeflow/katib/pull/1988) by [@shaowei-su](https://github.com/shaowei-su))
- [hotfix]: filter by name of experiment ([#1920](https://github.com/kubeflow/katib/pull/1920) by [@anencore94](https://github.com/anencore94))
- Fix push script to include new images ([#1911](https://github.com/kubeflow/katib/pull/1911) by [@johnugeorge](https://github.com/johnugeorge))
- fix: only validate Kubernetes Job ([#2025](https://github.com/kubeflow/katib/pull/2025) by [@zhixian82](https://github.com/zhixian82))
- Upgrade grpc-health-probe version to fix some security issues ([#2093](https://github.com/kubeflow/katib/pull/2093) by [@tenzen-y](https://github.com/tenzen-y))
## Documentation
- Add CERN to adopters ([#2010](https://github.com/kubeflow/katib/pull/2010) by [@d-gol](https://github.com/d-gol))
- Add More Katib Presentations 2022 ([#2009](https://github.com/kubeflow/katib/pull/2009) by [@andreyvelich](https://github.com/andreyvelich))
- Add the documentation for simple-pbt ([#1978](https://github.com/kubeflow/katib/pull/1978) by [@tenzen-y](https://github.com/tenzen-y))
- Add the license to pbt ([#1958](https://github.com/kubeflow/katib/pull/1958) by [@tenzen-y](https://github.com/tenzen-y))
- Update the Katib version in docs ([#1950](https://github.com/kubeflow/katib/pull/1950) by [@tenzen-y](https://github.com/tenzen-y))
- Update CHANGELOG for v0.14.0 release ([#1932](https://github.com/kubeflow/katib/pull/1932) by [@johnugeorge](https://github.com/johnugeorge))
## Misc
- Update Training operator Image in CI ([#2103](https://github.com/kubeflow/katib/pull/2103) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Go libraries to resolve security issues ([#2094](https://github.com/kubeflow/katib/pull/2094) by [@tenzen-y](https://github.com/tenzen-y))
- Run e2e with various Python versions to verify Python SDK ([#2092](https://github.com/kubeflow/katib/pull/2092) by [@tenzen-y](https://github.com/tenzen-y))
- Add a --prefer-binary flag to 'pip install' command ([#2096](https://github.com/kubeflow/katib/pull/2096) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade PyTorch version to v1.13.0 ([#2082](https://github.com/kubeflow/katib/pull/2082) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade TensorFlow version ([#2079](https://github.com/kubeflow/katib/pull/2079) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade Python version to 3.10 ([#2057](https://github.com/kubeflow/katib/pull/2057) by [@tenzen-y](https://github.com/tenzen-y))
- Pin the NumPy version with v1.23.5 in some images ([#2070](https://github.com/kubeflow/katib/pull/2070) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the actions-setup-minikube version to v2.7.2 ([#2064](https://github.com/kubeflow/katib/pull/2064) by [@tenzen-y](https://github.com/tenzen-y))
- Remove Certificate Chain from Cert Generator ([#2045](https://github.com/kubeflow/katib/pull/2045) by [@andreyvelich](https://github.com/andreyvelich))
- Add resources to earlystopping container ([#2038](https://github.com/kubeflow/katib/pull/2038) by [@zhixian82](https://github.com/zhixian82))
- Add scripts to verify generated codes and Go Modules ([#1999](https://github.com/kubeflow/katib/pull/1999) by [@tenzen-y](https://github.com/tenzen-y))
- [Test] Reduce Katib GitHub Action Runs ([#2036](https://github.com/kubeflow/katib/pull/2036) by [@andreyvelich](https://github.com/andreyvelich))
- gh-actions: Extend action to run Frontend Unit tests ([#1998](https://github.com/kubeflow/katib/pull/1998) by [@orfeas-k](https://github.com/orfeas-k))
- [chore] Upgrade docker/metadata-action, actions/checkout, and actions/setup-python version ([#1996](https://github.com/kubeflow/katib/pull/1996) by [@tenzen-y](https://github.com/tenzen-y))
- [chore] Upgrade Go version to v1.19 ([#1995](https://github.com/kubeflow/katib/pull/1995) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in simple-pbt image ([#1948](https://github.com/kubeflow/katib/pull/1948) by [@tenzen-y](https://github.com/tenzen-y))
- Support arm64 in darts-cnn-cifar10 image ([#1947](https://github.com/kubeflow/katib/pull/1947) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in enas-cnn-cifar10 image ([#1944](https://github.com/kubeflow/katib/pull/1944) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in pytorch-mnist image ([#1943](https://github.com/kubeflow/katib/pull/1943) by [@tenzen-y](https://github.com/tenzen-y))
- Support for arm64 in mxnet-mnist image ([#1940](https://github.com/kubeflow/katib/pull/1940) by [@tenzen-y](https://github.com/tenzen-y))
- Use the katib-new-ui for Charmed gh-actions ([#1987](https://github.com/kubeflow/katib/pull/1987) by [@tenzen-y](https://github.com/tenzen-y))
- [feat] health check for katib-controller ([#1934](https://github.com/kubeflow/katib/pull/1934) by [@anencore94](https://github.com/anencore94))
- Upgrade Optuna from v2.x.x to v3.0.0 ([#1942](https://github.com/kubeflow/katib/pull/1942) by [@keisuke-umezawa](https://github.com/keisuke-umezawa))
- Add validation webhooks for maxFailedTrialCount and parallelTrialCount ([#1936](https://github.com/kubeflow/katib/pull/1936) by [@tenzen-y](https://github.com/tenzen-y))
- Introduce Automatic platform ARGs ([#1935](https://github.com/kubeflow/katib/pull/1935) by [@tenzen-y](https://github.com/tenzen-y))
- Update training operator image in CI ([#1933](https://github.com/kubeflow/katib/pull/1933) by [@johnugeorge](https://github.com/johnugeorge))
- Update Katib SDK version ([#1931](https://github.com/kubeflow/katib/pull/1931) by [@johnugeorge](https://github.com/johnugeorge))
- [chore] Upgrade Go version to v1.18 ([#1925](https://github.com/kubeflow/katib/pull/1925) by [@tenzen-y](https://github.com/tenzen-y))
- Add the pytorch-mnist with GPU support container image ([#1916](https://github.com/kubeflow/katib/pull/1916) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.14.0...v0.15.0-rc.0)
# [v0.14.0](https://github.com/kubeflow/katib/tree/v0.14.0) (2022-08-18)
## New Features
### Core Features
- Population based training ([#1833](https://github.com/kubeflow/katib/pull/1833) by [@a9p](https://github.com/a9p))
- Support JSON format logs in `file-metrics-collector` ([#1765](https://github.com/kubeflow/katib/pull/1765) by [@tenzen-y](https://github.com/tenzen-y))
- Include MetricsUnavailable condition to Complete in Trial ([#1877](https://github.com/kubeflow/katib/pull/1877) by [@tenzen-y](https://github.com/tenzen-y))
- Allow running examples on Apple Silicon M1 and fix image build errors for arm64 ([#1898](https://github.com/kubeflow/katib/pull/1898) by [@tenzen-y](https://github.com/tenzen-y))
- Configurable job name and service name for cert generator ([#1889](https://github.com/kubeflow/katib/pull/1889) by [@shaowei-su](https://github.com/shaowei-su))
### UI Features and Enhancements
- Add PBT to experiment creation form ([#1909](https://github.com/kubeflow/katib/pull/1909) by [@a9p](https://github.com/a9p))
- Distinct page for each Trial in the UI ([#1783](https://github.com/kubeflow/katib/pull/1783) by [@d-gol](https://github.com/d-gol))
## Bug fixes
- Add the pytorch-mnist with GPU support container image ([#1917](https://github.com/kubeflow/katib/pull/1917) by [@tenzen-y](https://github.com/tenzen-y))
- Fix push script to include new images ([#1912](https://github.com/kubeflow/katib/pull/1912) by [@johnugeorge](https://github.com/johnugeorge))
- Fixes lint warnings in YAML files ([#1902](https://github.com/kubeflow/katib/pull/1902) by [@Rishit-dagli](https://github.com/Rishit-dagli))
- Fix errors when running the test on Apple Silicon M1 ([#1886](https://github.com/kubeflow/katib/pull/1886) by [@tenzen-y](https://github.com/tenzen-y))
- Reconcile trial assignments by comparing suggestion and trials being executed ([#1831](https://github.com/kubeflow/katib/pull/1831) by [@henrysecond1](https://github.com/henrysecond1))
- Increate the probes seconds in manifests ([#1845](https://github.com/kubeflow/katib/pull/1845) by [@haoxins](https://github.com/haoxins))
- Set upper constraint for Optuna ([#1852](https://github.com/kubeflow/katib/pull/1852) by [@himkt](https://github.com/himkt))
- Don't check if trial's metadata is in spec.parameters ([#1848](https://github.com/kubeflow/katib/pull/1848) by [@alexeygorobets](https://github.com/alexeygorobets))
## Documentation
- Fix the FPGA examples documentation ([#1841](https://github.com/kubeflow/katib/pull/1841) by [@eliaskoromilas](https://github.com/eliaskoromilas))
- Add CyberAgent to adopters ([#1894](https://github.com/kubeflow/katib/pull/1894) by [@tenzen-y](https://github.com/tenzen-y))
## Misc
- Updating the training operator image in CI ([#1910](https://github.com/kubeflow/katib/pull/1910) by [@johnugeorge](https://github.com/johnugeorge))
- Upgrade Python and Pytorch versions for some examples ([#1906](https://github.com/kubeflow/katib/pull/1906) by [@tenzen-y](https://github.com/tenzen-y))
- Linting for K8s YAML files ([#1901](https://github.com/kubeflow/katib/pull/1901) by [@Rishit-dagli](https://github.com/Rishit-dagli))
- Change integration test sysytem from KinD Cluster to Minikube Cluster ([#1899](https://github.com/kubeflow/katib/pull/1899) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade mysql version to v8.0.29 ([#1897](https://github.com/kubeflow/katib/pull/1897) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade tensorflow-aarch64 version to v2.9.1 ([#1891](https://github.com/kubeflow/katib/pull/1891) by [@tenzen-y](https://github.com/tenzen-y))
- chore: Upgrade Go libraries to resolve some security issues in the katib-controller ([#1888](https://github.com/kubeflow/katib/pull/1888) by [@tenzen-y](https://github.com/tenzen-y))
- Migrate kubeflow-katib-presubmit to GitHub Actions ([#1882](https://github.com/kubeflow/katib/pull/1882) by [@tenzen-y](https://github.com/tenzen-y))
- Add semicolon when using `command` command in Makefile ([#1885](https://github.com/kubeflow/katib/pull/1885) by [@tenzen-y](https://github.com/tenzen-y))
- Fix `HAS_SHELLCHECK` and `HAS_SETUP_ENVTEST` in Makefile ([#1884](https://github.com/kubeflow/katib/pull/1884) by [@tenzen-y](https://github.com/tenzen-y))
- Remove presubmit tests depending on optional-test-infra ([#1871](https://github.com/kubeflow/katib/pull/1871) by [@aws-kf-ci-bot](https://github.com/aws-kf-ci-bot))
- Upgrade the Tensorflow version to address some security issues ([#1870](https://github.com/kubeflow/katib/pull/1870) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the grpc_health_probe version to v0.4.11 to resolve security vulnerability CVE-2022-27191 ([#1875](https://github.com/kubeflow/katib/pull/1875) by [@tenzen-y](https://github.com/tenzen-y))
- additional metric names should not include objective metric name ([#1874](https://github.com/kubeflow/katib/pull/1874) by [@henrysecond1](https://github.com/henrysecond1))
- Upgrade the Kubernetes Python client to 22.6.0 ([#1869](https://github.com/kubeflow/katib/pull/1869) by [@tenzen-y](https://github.com/tenzen-y))
- Upgrade the kubebuilder to v3.2.0 and Kubernetes Go libraries to v1.22.2 ([#1861](https://github.com/kubeflow/katib/pull/1861) by [@tenzen-y](https://github.com/tenzen-y))
- Update FPGA XGBoost example ([#1865](https://github.com/kubeflow/katib/pull/1865) by [@eliaskoromilas](https://github.com/eliaskoromilas))
- Fix kubeflowkatib/mxnet-mnist image ([#1866](https://github.com/kubeflow/katib/pull/1866) by [@tenzen-y](https://github.com/tenzen-y))
- pins pip and setuptools versions operators to avoid installation issues ([#1867](https://github.com/kubeflow/katib/pull/1867) by [@DnPlas](https://github.com/DnPlas))
- Add shellcheck ([#1857](https://github.com/kubeflow/katib/pull/1857) by [@tenzen-y](https://github.com/tenzen-y))
- Bump kubeflow-katib and kfp version in notebook examples ([#1849](https://github.com/kubeflow/katib/pull/1849) by [@tenzen-y](https://github.com/tenzen-y))
- Add prometheus scraping and grafana support to charmed katib-controller operator ([#1839](https://github.com/kubeflow/katib/pull/1839) by [@jardon](https://github.com/jardon))
- Upgrade Black to fix linting ([#1842](https://github.com/kubeflow/katib/pull/1842) by [@jardon](https://github.com/jardon))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.13.0...v0.14.0).
# [v0.13.0](https://github.com/kubeflow/katib/tree/v0.13.0) (2022-03-04)
## New Features ## New Features
@ -874,6 +59,7 @@
- Fix default label for Training Operators ([#1813](https://github.com/kubeflow/katib/pull/1813) by [@andreyvelich](https://github.com/andreyvelich)) - Fix default label for Training Operators ([#1813](https://github.com/kubeflow/katib/pull/1813) by [@andreyvelich](https://github.com/andreyvelich))
- Update supported Python version for Katib SDK ([#1798](https://github.com/kubeflow/katib/pull/1798) by [@tenzen-y](https://github.com/tenzen-y)) - Update supported Python version for Katib SDK ([#1798](https://github.com/kubeflow/katib/pull/1798) by [@tenzen-y](https://github.com/tenzen-y))
## Misc ## Misc
- Use release tags for Trial images ([#1757](https://github.com/kubeflow/katib/pull/1757) by [@andreyvelich](https://github.com/andreyvelich)) - Use release tags for Trial images ([#1757](https://github.com/kubeflow/katib/pull/1757) by [@andreyvelich](https://github.com/andreyvelich))
@ -888,9 +74,10 @@
- Add envtest to check `reconcileRBAC` ([#1678](https://github.com/kubeflow/katib/pull/1678) by [@tenzen-y](https://github.com/tenzen-y)) - Add envtest to check `reconcileRBAC` ([#1678](https://github.com/kubeflow/katib/pull/1678) by [@tenzen-y](https://github.com/tenzen-y))
- Use golangci-lint as linter for Go ([#1671](https://github.com/kubeflow/katib/pull/1671) by [@tenzen-y](https://github.com/tenzen-y)) - Use golangci-lint as linter for Go ([#1671](https://github.com/kubeflow/katib/pull/1671) by [@tenzen-y](https://github.com/tenzen-y))
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0)
# [v0.13.0-rc.1](https://github.com/kubeflow/katib/tree/v0.13.0-rc.1) (2022-02-15) ## [v0.13.0-rc.1](https://github.com/kubeflow/katib/tree/v0.13.0-rc.1) (2022-02-15)
## Bug fixes ## Bug fixes
@ -899,7 +86,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.13.0-rc.0...v0.13.0-rc.1) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.13.0-rc.0...v0.13.0-rc.1)
# [v0.13.0-rc.0](https://github.com/kubeflow/katib/tree/v0.13.0-rc.0) (2022-01-25) ## [v0.13.0-rc.0](https://github.com/kubeflow/katib/tree/v0.13.0-rc.0) (2022-01-25)
## New Features ## New Features
@ -972,7 +159,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0-rc.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0...v0.13.0-rc.0)
# [v0.12.0](https://github.com/kubeflow/katib/tree/v0.12.0) (2021-10-05) ## [v0.12.0](https://github.com/kubeflow/katib/tree/v0.12.0) (2021-10-05)
## New Features ## New Features
@ -1028,7 +215,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0)
# [v0.12.0-rc.1](https://github.com/kubeflow/katib/tree/v0.12.0-rc.1) (2021-09-07) ## [v0.12.0-rc.1](https://github.com/kubeflow/katib/tree/v0.12.0-rc.1) (2021-09-07)
## Bug Fixes ## Bug Fixes
@ -1037,7 +224,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0-rc.0...v0.12.0-rc.1) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.12.0-rc.0...v0.12.0-rc.1)
# [v0.12.0-rc.0](https://github.com/kubeflow/katib/tree/v0.12.0-rc.0) (2021-08-19) ## [v0.12.0-rc.0](https://github.com/kubeflow/katib/tree/v0.12.0-rc.0) (2021-08-19)
## New Features ## New Features
@ -1091,7 +278,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0-rc.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.1...v0.12.0-rc.0)
# [v0.11.1](https://github.com/kubeflow/katib/tree/v0.11.1) (2021-06-09) ## [v0.11.1](https://github.com/kubeflow/katib/tree/v0.11.1) (2021-06-09)
## Bug fixes ## Bug fixes
@ -1105,7 +292,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.0...v0.11.1) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.11.0...v0.11.1)
# [v0.11.0](https://github.com/kubeflow/katib/tree/v0.11.0) (2021-03-22) ## [v0.11.0](https://github.com/kubeflow/katib/tree/v0.11.0) (2021-03-22)
## New Features ## New Features
@ -1162,7 +349,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.1...v0.11.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.1...v0.11.0)
# [v0.10.1](https://github.com/kubeflow/katib/tree/v0.10.1) (2021-03-02) ## [v0.10.1](https://github.com/kubeflow/katib/tree/v0.10.1) (2021-03-02)
## Features and Bug Fixes ## Features and Bug Fixes
@ -1196,7 +383,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.0...v0.10.1) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.10.0...v0.10.1)
# [v0.10.0](https://github.com/kubeflow/katib/tree/v0.10.0) (2020-11-07) ## [v0.10.0](https://github.com/kubeflow/katib/tree/v0.10.0) (2020-11-07)
## New Features ## New Features
@ -1240,7 +427,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.9.0...v0.10.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.9.0...v0.10.0)
# [v0.9.0](https://github.com/kubeflow/katib/tree/v0.9.0) (2020-06-10) ## [v0.9.0](https://github.com/kubeflow/katib/tree/v0.9.0) (2020-06-10)
## Features and Bug Fixes ## Features and Bug Fixes
@ -1497,7 +684,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.6.0-rc.0...v0.9.0) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.6.0-rc.0...v0.9.0)
# [v0.6.0-rc.0](https://github.com/kubeflow/katib/tree/v0.6.0-rc.0) (2019-06-28) ## [v0.6.0-rc.0](https://github.com/kubeflow/katib/tree/v0.6.0-rc.0) (2019-06-28)
## Features and Bug Fixes ## Features and Bug Fixes
@ -1752,7 +939,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/826657c14602a3f36263f3d6769451af0a75d18a...v0.6.0-rc.0) [Full Changelog](https://github.com/kubeflow/katib/compare/826657c14602a3f36263f3d6769451af0a75d18a...v0.6.0-rc.0)
# [0.2](https://github.com/kubeflow/katib/tree/0.2) (2018-08-20) ## [0.2](https://github.com/kubeflow/katib/tree/0.2) (2018-08-20)
## Features ## Features
@ -1779,7 +966,7 @@
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.2-alpha...826657c14602a3f36263f3d6769451af0a75d18a) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.2-alpha...826657c14602a3f36263f3d6769451af0a75d18a)
# [v0.1.2-alpha](https://github.com/kubeflow/katib/tree/v0.1.2-alpha) (2018-06-05) ## [v0.1.2-alpha](https://github.com/kubeflow/katib/tree/v0.1.2-alpha) (2018-06-05)
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.1-alpha...v0.1.2-alpha) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.1-alpha...v0.1.2-alpha)
@ -1810,7 +997,7 @@
- Refine API [\#74](https://github.com/kubeflow/katib/pull/74) ([YujiOshima](https://github.com/YujiOshima)) - Refine API [\#74](https://github.com/kubeflow/katib/pull/74) ([YujiOshima](https://github.com/YujiOshima))
- worker: Rename worker_interface to worker [\#70](https://github.com/kubeflow/katib/pull/70) ([gaocegege](https://github.com/gaocegege)) - worker: Rename worker_interface to worker [\#70](https://github.com/kubeflow/katib/pull/70) ([gaocegege](https://github.com/gaocegege))
# [v0.1.1-alpha](https://github.com/kubeflow/katib/tree/v0.1.1-alpha) (2018-04-24) ## [v0.1.1-alpha](https://github.com/kubeflow/katib/tree/v0.1.1-alpha) (2018-04-24)
[Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.0-alpha...v0.1.1-alpha) [Full Changelog](https://github.com/kubeflow/katib/compare/v0.1.0-alpha...v0.1.1-alpha)
@ -1848,7 +1035,7 @@
- New db log schema [\#35](https://github.com/kubeflow/katib/pull/35) ([YujiOshima](https://github.com/YujiOshima)) - New db log schema [\#35](https://github.com/kubeflow/katib/pull/35) ([YujiOshima](https://github.com/YujiOshima))
- Fix CI failures [\#27](https://github.com/kubeflow/katib/pull/27) ([gaocegege](https://github.com/gaocegege)) - Fix CI failures [\#27](https://github.com/kubeflow/katib/pull/27) ([gaocegege](https://github.com/gaocegege))
# [v0.1.0-alpha](https://github.com/kubeflow/katib/tree/v0.1.0-alpha) (2018-04-10) ## [v0.1.0-alpha](https://github.com/kubeflow/katib/tree/v0.1.0-alpha) (2018-04-10)
**Closed issues:** **Closed issues:**


@ -1,43 +0,0 @@
cff-version: 1.2.0
message: "If you use Katib in your scientific publication, please cite it as below."
authors:
- family-names: "George"
given-names: "Johnu"
- family-names: "Gao"
given-names: "Ce"
- family-names: "Liu"
given-names: "Richard"
- family-names: "Liu"
given-names: "Hou Gang"
- family-names: "Tang"
given-names: "Yuan"
- family-names: "Pydipaty"
given-names: "Ramdoot"
- family-names: "Saha"
given-names: "Amit Kumar"
title: "Katib"
type: software
repository-code: "https://github.com/kubeflow/katib"
preferred-citation:
type: misc
title: "A Scalable and Cloud-Native Hyperparameter Tuning System"
authors:
- family-names: "George"
given-names: "Johnu"
- family-names: "Gao"
given-names: "Ce"
- family-names: "Liu"
given-names: "Richard"
- family-names: "Liu"
given-names: "Hou Gang"
- family-names: "Tang"
given-names: "Yuan"
- family-names: "Pydipaty"
given-names: "Ramdoot"
- family-names: "Saha"
given-names: "Amit Kumar"
year: 2020
url: "https://arxiv.org/abs/2006.02085"
identifiers:
- type: "other"
value: "arXiv:2006.02085"


@ -1,167 +0,0 @@
# Developer Guide
This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following guides:
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- [How to configure Katib Experiment](https://kubeflow.org/docs/components/katib/experiment/).
- [Katib architecture and concepts](https://www.kubeflow.org/docs/components/katib/reference/architecture/)
for hyperparameter tuning and neural architecture search.
## Requirements
- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/) (24.0 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.11 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
- [pre-commit](https://pre-commit.com/)
## Build from source code
**Note:** To build multi-arch images, your Docker Desktop should
[enable containerd image store](https://docs.docker.com/desktop/containerd/#enable-the-containerd-image-store).
Build the images from the source code as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/katib-config.yaml).
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
```bash
make deploy
```
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
```bash
make undeploy
```
## Technical and style guide
The following guidelines apply primarily to Katib,
but other projects like [Training Operator](https://github.com/kubeflow/training-operator) might also adhere to them.
## Go Development
When coding:
- Follow [effective go](https://go.dev/doc/effective_go) guidelines.
- Run [`make check`](https://github.com/kubeflow/katib/blob/46173463027e4fd2e604e25d7075b2b31a702049/Makefile#L31) locally
to verify that your changes follow best practices before submitting PRs.
Testing:
- Use [`cmp.Diff`](https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff) instead of `reflect.DeepEqual` to provide useful comparisons.
- Define test cases as maps instead of slices to avoid dependencies on the running order.
Each map key should equal the test case name.
## Modify controller APIs
If you want to modify Katib controller APIs, you have to
regenerate the deepcopy, clientset, listers, informers, open-api and Python SDK code for the changed APIs.
You can update the necessary files as follows:
```bash
make generate
```
## Controller Flags
Below is a list of command-line flags accepted by Katib controller:
| Name | Type | Default | Description |
| ------------ | ------ | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
| katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
## DB Manager Flags
Below is a list of command-line flags accepted by Katib DB Manager:
| Name | Type | Default | Description |
| --------------- | ------------- | -------------| ------------------------------------------------------------------- |
| connect-timeout | time.Duration | 60s | Timeout before calling error during database connection |
| listen-address | string | 0.0.0.0:6789 | The network interface or IP address to receive incoming connections |
## Katib admission webhooks
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
1. `validator.experiment.katib.kubeflow.org` -
[Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
to validate the Katib Experiment before the creation.
1. `defaulter.experiment.katib.kubeflow.org` -
[Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
in the Katib Experiment before the creation.
1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
collector sidecar container into the training pod. Learn more about Katib's
metrics collector in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the master
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib Controller has the internal `cert-generator` to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.
Once the `cert-generator` has finished, the Katib controller starts registering controllers such as `experiment-controller` with the manager.
You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
**Note:** Katib also supports [cert-manager](https://cert-manager.io/) for generating certificates for the admission webhooks instead of the `cert-generator`.
You can find the installation with the cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](../pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).
## Code Style
### pre-commit
Make sure to install [pre-commit](https://pre-commit.com/)
(`pip install pre-commit`) and run `pre-commit install` from the root of the
repository at least once before creating git commits.
The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all`, in which case,
they'll be executed against every file in the repository.
Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
from the hooks.


@ -1,32 +0,0 @@
# Copyright 2023 The Kubeflow Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Dockerfile for building the source code of conformance tests
FROM python:3.10-slim
WORKDIR /kubeflow/katib
COPY sdk/ /kubeflow/katib/sdk/
COPY examples/ /kubeflow/katib/examples/
COPY test/ /kubeflow/katib/test/
COPY pkg/ /kubeflow/katib/pkg/
COPY conformance/run.sh .
# Add test script.
RUN chmod +x run.sh
RUN pip install --prefer-binary -e sdk/python/v1beta1
ENTRYPOINT [ "./run.sh" ]

124
Makefile

@ -2,46 +2,45 @@ HAS_LINT := $(shell command -v golangci-lint;)
HAS_YAMLLINT := $(shell command -v yamllint;) HAS_YAMLLINT := $(shell command -v yamllint;)
HAS_SHELLCHECK := $(shell command -v shellcheck;) HAS_SHELLCHECK := $(shell command -v shellcheck;)
HAS_SETUP_ENVTEST := $(shell command -v setup-envtest;) HAS_SETUP_ENVTEST := $(shell command -v setup-envtest;)
HAS_MOCKGEN := $(shell command -v mockgen;)
COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD) COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD)
KATIB_REGISTRY := ghcr.io/kubeflow/katib KATIB_REGISTRY := docker.io/kubeflowkatib
CPU_ARCH ?= linux/amd64,linux/arm64 CPU_ARCH ?= amd64
ENVTEST_K8S_VERSION ?= 1.31 ENVTEST_K8S_VERSION ?= 1.23
MOCKGEN_VERSION ?= $(shell grep 'go.uber.org/mock' go.mod | cut -d ' ' -f 2)
GO_VERSION=$(shell grep '^go' go.mod | cut -d ' ' -f 2)
GOPATH ?= $(shell go env GOPATH)
# for pytest
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/apis/manager/v1beta1/python:$(CURDIR)/pkg/apis/manager/health/python
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/metricscollector/v1beta1/common:$(CURDIR)/pkg/metricscollector/v1beta1/tfevent-metricscollector
TEST_TENSORFLOW_EVENT_FILE_PATH ?= $(CURDIR)/test/unit/v1beta1/metricscollector/testdata/tfevent-metricscollector/logs TEST_TENSORFLOW_EVENT_FILE_PATH ?= $(CURDIR)/test/unit/v1beta1/metricscollector/testdata/tfevent-metricscollector/logs
# Run tests # Run tests
.PHONY: test .PHONY: test
test: envtest test: envtest
KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out KUBEBUILDER_ASSETS="$(shell setup-envtest --arch=amd64 use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out
envtest: envtest:
ifndef HAS_SETUP_ENVTEST ifndef HAS_SETUP_ENVTEST
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@release-0.19 go install sigs.k8s.io/controller-runtime/tools/setup-envtest@bf71fc56485f6bf03e95ef6b0233ff36c695d4c9 # v0.11.2
$(info "setup-envtest has been installed") @echo "setup-envtest has been installed"
endif endif
$(info "setup-envtest has already installed") @echo "setup-envtest has already installed"
check: generated-codes go-mod fmt vet lint check: generate fmt vet lint
fmt: fmt:
hack/verify-gofmt.sh hack/verify-gofmt.sh
lint: lint:
ifndef HAS_LINT ifndef HAS_LINT
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.64.7 go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.42.1
$(info "golangci-lint has been installed") @echo "golangci-lint has been installed"
endif endif
hack/verify-golangci-lint.sh hack/verify-golangci-lint.sh
yamllint: yamllint:
ifndef HAS_YAMLLINT ifndef HAS_YAMLLINT
pip install --prefer-binary yamllint pip install yamllint
$(info "yamllint has been installed") @echo "yamllint has been installed"
endif endif
hack/verify-yamllint.sh hack/verify-yamllint.sh
@ -51,7 +50,7 @@ vet:
shellcheck: shellcheck:
ifndef HAS_SHELLCHECK ifndef HAS_SHELLCHECK
bash hack/install-shellcheck.sh bash hack/install-shellcheck.sh
$(info "shellcheck has been installed") @echo "shellcheck has been installed"
endif endif
hack/verify-shellcheck.sh hack/verify-shellcheck.sh
@ -60,49 +59,25 @@ update:
# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster. # Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
deploy: deploy:
bash scripts/v1beta1/deploy.sh $(WITH_DATABASE_TYPE) bash scripts/v1beta1/deploy.sh
# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster # Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
undeploy: undeploy:
bash scripts/v1beta1/undeploy.sh bash scripts/v1beta1/undeploy.sh
generated-codes: generate
ifneq ($(shell bash hack/verify-generated-codes.sh '.'; echo $$?),0)
$(error 'Please run "make generate" to generate codes')
endif
go-mod: sync-go-mod
ifneq ($(shell bash hack/verify-generated-codes.sh 'go.*'; echo $$?),0)
$(error 'Please run "go mod tidy -go $(GO_VERSION)" to sync Go modules')
endif
sync-go-mod:
go mod tidy -go $(GO_VERSION)
.PHONY: go-mod-download
go-mod-download:
go mod download
CONTROLLER_GEN = $(shell pwd)/bin/controller-gen
.PHONY: controller-gen
controller-gen:
@GOBIN=$(shell pwd)/bin GO111MODULE=on go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.16.5
# Run this if you update any existing controller APIs. # Run this if you update any existing controller APIs.
# 1. Generate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh) # 1. Genereate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh)
# 2. Generate open-api for the APIs (hack/update-openapigen) # 2. Generate open-api for the APIs (hack/update-openapigen)
# 3. Generate Python SDK for Katib (hack/gen-python-sdk/gen-sdk.sh) # 3. Generate Python SDK for Katib (hack/gen-python-sdk/gen-sdk.sh)
# 4. Generate gRPC manager APIs (pkg/apis/manager/v1beta1/build.sh and pkg/apis/manager/health/build.sh) # 4. Generate gRPC manager APIs (pkg/apis/manager/v1beta1/build.sh and pkg/apis/manager/health/build.sh)
# 5. Generate Go mock codes generate:
generate: go-mod-download controller-gen ifndef GOPATH
ifndef HAS_MOCKGEN $(error GOPATH not defined, please define GOPATH. Run "go help gopath" to learn more about GOPATH)
go install go.uber.org/mock/mockgen@$(MOCKGEN_VERSION)
$(info "mockgen has been installed")
endif endif
go generate ./pkg/... ./cmd/... go generate ./pkg/... ./cmd/...
hack/gen-python-sdk/gen-sdk.sh hack/gen-python-sdk/gen-sdk.sh
hack/update-proto.sh pkg/apis/manager/v1beta1/build.sh
hack/update-mockgen.sh pkg/apis/manager/health/build.sh
# Build images for the Katib v1beta1 components. # Build images for the Katib v1beta1 components.
build: generate build: generate
@ -119,12 +94,14 @@ push-latest: generate
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT) bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
# Build and push Katib images for the given tag. # Build and push Katib images for the given tag.
push-tag: push-tag: generate
ifeq ($(TAG),) ifeq ($(TAG),)
$(error TAG must be set. Usage: make push-tag TAG=<release-tag>) $(error TAG must be set. Usage: make push-tag TAG=<release-tag>)
endif endif
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG) $(CPU_ARCH) bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG) $(CPU_ARCH)
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT) $(CPU_ARCH)
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG) bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG)
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
# Release a new version of Katib. # Release a new version of Katib.
release: release:
@ -144,50 +121,31 @@ endif
# Prettier UI format check for Katib v1beta1. # Prettier UI format check for Katib v1beta1.
prettier-check: prettier-check:
npm run format:check --prefix pkg/ui/v1beta1/frontend npm run format:check --prefix pkg/new-ui/v1beta1/frontend
# Update boilerplate for the source code. # Update boilerplate for the source code.
update-boilerplate: update-boilerplate:
./hack/boilerplate/update-boilerplate.sh ./hack/boilerplate/update-boilerplate.sh
prepare-pytest: prepare-pytest:
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt pip install -r test/unit/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/hyperopt/v1beta1/requirements.txt pip install -r cmd/suggestion/chocolate/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/optuna/v1beta1/requirements.txt pip install -r cmd/suggestion/hyperopt/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/hyperband/v1beta1/requirements.txt pip install -r cmd/suggestion/skopt/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/nas/enas/v1beta1/requirements.txt pip install -r cmd/suggestion/optuna/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/nas/darts/v1beta1/requirements.txt pip install -r cmd/suggestion/hyperband/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/pbt/v1beta1/requirements.txt pip install -r cmd/suggestion/nas/enas/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/earlystopping/medianstop/v1beta1/requirements.txt pip install -r cmd/suggestion/nas/darts/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt pip install -r cmd/suggestion/pbt/v1beta1/requirements.txt
# `TypeIs` was introduced in typing-extensions 4.10.0, and torch 2.6.0 requires typing-extensions>=4.10.0. pip install -r cmd/earlystopping/medianstop/v1beta1/requirements.txt
# REF: https://github.com/kubeflow/katib/pull/2504 pip install -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt
# TODO (tenzen-y): Once we upgrade the libraries that depend on typing-extensions==4.5.0, we can remove this line.
pip install typing-extensions==4.10.0
prepare-pytest-testdata: prepare-pytest-testdata:
ifeq ("$(wildcard $(TEST_TENSORFLOW_EVENT_FILE_PATH))", "") ifeq ("$(wildcard $(TEST_TENSORFLOW_EVENT_FILE_PATH))", "")
python examples/v1beta1/trial-images/tf-mnist-with-summaries/mnist.py --epochs 5 --batch-size 200 --log-path $(TEST_TENSORFLOW_EVENT_FILE_PATH) python examples/v1beta1/trial-images/tf-mnist-with-summaries/mnist.py --epochs 5 --batch-size 200 --log-path $(TEST_TENSORFLOW_EVENT_FILE_PATH)
endif endif
# TODO(Electronic-Waste): Remove the import rewrite when protobuf supports `python_package` option.
# REF: https://github.com/protocolbuffers/protobuf/issues/7061
pytest: prepare-pytest prepare-pytest-testdata pytest: prepare-pytest prepare-pytest-testdata
pytest ./test/unit/v1beta1/suggestion --ignore=./test/unit/v1beta1/suggestion/test_skopt_service.py PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/suggestion
pytest ./test/unit/v1beta1/earlystopping PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/earlystopping
pytest ./test/unit/v1beta1/metricscollector PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/metricscollector
cp ./pkg/apis/manager/v1beta1/python/api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py
cp ./pkg/apis/manager/v1beta1/python/api_pb2_grpc.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
sed -i "s/api_pb2/kubeflow\.katib\.katib_api_pb2/g" ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
pytest ./sdk/python/v1beta1/kubeflow/katib
rm ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
# The skopt service doesn't work appropriately with Python 3.11.
# So, we need to run the test with Python 3.9.
# TODO (tenzen-y): Once we stop supporting skopt, we can remove this test.
# REF: https://github.com/kubeflow/katib/issues/2280
pytest-skopt:
pip install six
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
pip install --prefer-binary -r cmd/suggestion/skopt/v1beta1/requirements.txt
pytest ./test/unit/v1beta1/suggestion/test_skopt_service.py
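
The `sed` step in the `pytest` target above rewrites the copied gRPC stub so that it imports `katib_api_pb2` by its package-qualified name. A minimal Python sketch of the same substitution follows. It assumes the generated stub contains a line like `import api_pb2 as api__pb2`; the sample text is illustrative and not copied from the repository, and `rewrite_imports` is a hypothetical helper name.

```python
# Sketch of: sed -i "s/api_pb2/kubeflow\.katib\.katib_api_pb2/g" katib_api_pb2_grpc.py
# The sample stub text below is an assumption about the generated file.
OLD_MODULE = "api_pb2"
NEW_MODULE = "kubeflow.katib.katib_api_pb2"


def rewrite_imports(source: str) -> str:
    """Replace every occurrence of the bare module name, like sed's `g` flag."""
    return source.replace(OLD_MODULE, NEW_MODULE)


sample_stub = "import api_pb2 as api__pb2\n"
rewritten = rewrite_imports(sample_stub)
print(rewritten, end="")
```

Because this is a plain substring match, the alias `api__pb2` (double underscore) is left untouched.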

4
OWNERS

@ -1,10 +1,10 @@
approvers: approvers:
- andreyvelich - andreyvelich
- gaocegege - gaocegege
- hougangliu
- johnugeorge - johnugeorge
reviewers: reviewers:
- anencore94 - anencore94
- c-bata - c-bata
- Electronic-Waste - sperlingxx
emeritus_approvers:
- tenzen-y - tenzen-y

132
README.md

@ -1,18 +1,15 @@
# Kubeflow Katib
[![Build Status](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml/badge.svg?branch=master)](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib)
[![Releases](https://img.shields.io/github/release-pre/kubeflow/katib.svg?sort=semver)](https://github.com/kubeflow/katib/releases)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/9941/badge)](https://www.bestpractices.dev/projects/9941)
<h1 align="center"> <h1 align="center">
<img src="./docs/images/logo-title.png" alt="logo" width="200"> <img src="./docs/images/logo-title.png" alt="logo" width="200">
<br> <br>
</h1> </h1>
Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML). [![Build Status](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml/badge.svg?branch=master)](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib)
[![Releases](https://img.shields.io/github/release-pre/kubeflow/katib.svg?sort=semver)](https://github.com/kubeflow/katib/releases)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://kubeflow.slack.com/archives/C018PMV53NW)
Katib is a Kubernetes-native project for automated machine learning (AutoML).
Katib supports Katib supports
[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization), [Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization),
[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and [Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and
@ -21,7 +18,8 @@ Katib supports
Katib is a project that is agnostic to machine learning (ML) frameworks. Katib is a project that is agnostic to machine learning (ML) frameworks.
It can tune hyperparameters of applications written in any language of the It can tune hyperparameters of applications written in any language of the
user's choice and natively supports many ML frameworks, such as user's choice and natively supports many ML frameworks, such as
[TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others. [TensorFlow](https://www.tensorflow.org/), [Apache MXNet](https://mxnet.apache.org/),
[PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
Katib can perform training jobs using any Kubernetes Katib can perform training jobs using any Kubernetes
[Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/) [Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/)
@ -31,13 +29,13 @@ and many more.
Katib stands for `secretary` in Arabic. Katib stands for `secretary` in Arabic.
## Search Algorithms # Search Algorithms
Katib supports several search algorithms. Follow the Katib supports several search algorithms. Follow the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms) [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
to know more about each algorithm and check the to know more about each algorithm and check the
[this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib) [Suggestion service guide](/docs/new-algorithm-service.md) to implement your
to implement your custom algorithm. custom algorithm.
<table> <table>
<tbody> <tbody>
@ -139,68 +137,102 @@ to implement your custom algorithm.
</tbody> </tbody>
</table> </table>
To perform the above algorithms Katib supports the following frameworks: To perform above algorithms Katib supports the following frameworks:
- [Chocolate](https://github.com/AIworx-Labs/chocolate)
- [Goptuna](https://github.com/c-bata/goptuna) - [Goptuna](https://github.com/c-bata/goptuna)
- [Hyperopt](https://github.com/hyperopt/hyperopt) - [Hyperopt](https://github.com/hyperopt/hyperopt)
- [Optuna](https://github.com/optuna/optuna) - [Optuna](https://github.com/optuna/optuna)
- [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize) - [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize)
# Installation
For the various Katib installs check the
[Kubeflow guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-setup).
Follow the next steps to install Katib standalone.
## Prerequisites ## Prerequisites
Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites) These are the minimal requirements to install Katib:
for prerequisites to install Katib.
## Installation - Kubernetes >= 1.21
- `kubectl` >= 1.21
Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib) ## Latest Version
for the detailed instructions on how to install Katib.
### Installing the Control Plane For the latest Katib version run this command:
Run the following command to install the latest stable release of Katib control plane:
```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"
```
Run the following command to install the latest changes of Katib control plane:
``` ```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master" kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
``` ```
For the Katib Experiments check the [complete examples list](./examples/v1beta1). ## Release Version
### Installing the Python SDK For the specific Katib release (for example `v0.13.0`) run this command:
Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify creation of ```
hyperparameter tuning jobs for Data Scientists. kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
Run the following command to install the latest stable release of Katib SDK:
```sh
pip install -U kubeflow-katib
``` ```
## Getting Started Make sure that all Katib components are running:
Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk) ```
to quickly create your first hyperparameter tuning Experiment using the Python SDK. $ kubectl get pods -n kubeflow
## Community NAME READY STATUS RESTARTS AGE
katib-cert-generator-rw95w 0/1 Completed 0 35s
katib-controller-566595bdd8-hbxgf 1/1 Running 0 36s
katib-db-manager-57cd769cdb-4g99m 1/1 Running 0 36s
katib-mysql-7894994f88-5d4s5 1/1 Running 0 36s
katib-ui-5767cfccdc-pwg2x 1/1 Running 0 36s
```
The following links provide information on how to get involved in the community: For the Katib Experiments check the [complete examples list](./examples/v1beta1).
- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) # Documentation
community meeting.
- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) - Run your first Katib Experiment in the
Slack channel. [getting started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm).
- Check out [who is using Katib](ADOPTERS.md) and [presentations about Katib project](docs/presentations.md).
- Learn about Katib **Concepts** in this
[guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-concepts).
- Learn about Katib **Interfaces** in this
[guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-interfaces).
- Learn about Katib **Components** in this
[guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).
- Know more about Katib in the [presentations and demos list](./docs/presentations.md).
# Community
We are always growing our community and invite new users and AutoML enthusiasts
to contribute to the Katib project. The following links provide information
about getting involved in the community:
- Subscribe to the
[AutoML calendar](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)
to attend Working Group bi-weekly community meetings.
- Check the
[AutoML and Training Working Group meeting notes](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit).
- If you use Katib, please update [the adopters list](ADOPTERS.md).
## Contributing ## Contributing
Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md). Please feel free to test the system! [Developer guide](./docs/developer-guide.md)
is a good starting point for our developers.
## Blog posts
- [Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML](https://blog.kubeflow.org/katib/)
(by Andrey Velichkevich)
## Events
- [AutoML and Training WG Summit. 16th of July 2021](https://docs.google.com/document/d/1vGluSPHmAqEr8k9Dmm82RcQ-MVnqbYYSfnjMGB-aPuo/edit?usp=sharing)
## Citation ## Citation


@ -1,45 +1,3 @@
# Katib 2022/2023 Roadmap
## AutoML Features
- Support advanced HyperParameter tuning algorithms:
- Population Based Training (PBT) - [#1382](https://github.com/kubeflow/katib/issues/1382)
- Tree of Parzen Estimators (TPE)
- Multivariate TPE
- Sobol's Quasirandom Sequence
- Asynchronous Successive Halving - [ASHA](https://arxiv.org/pdf/1810.05934.pdf)
- Support multi-objective optimization - [#1549](https://github.com/kubeflow/katib/issues/1549)
- Support various HP distributions (log-uniform, uniform, normal) - [#1207](https://github.com/kubeflow/katib/issues/1207)
- Support Auto Model Compression - [#460](https://github.com/kubeflow/katib/issues/460)
- Support Auto Feature Engineering - [#475](https://github.com/kubeflow/katib/issues/475)
- Improve Neural Architecture Search design
## Backend and API Enhancements
- Conformance tests for Katib - [#2044](https://github.com/kubeflow/katib/issues/2044)
- Support push-based metrics collection in Katib - [#577](https://github.com/kubeflow/katib/issues/577)
- Support PostgreSQL as a Katib DB - [#915](https://github.com/kubeflow/katib/issues/915)
- Improve Katib scalability - [#1847](https://github.com/kubeflow/katib/issues/1847)
- Promote Katib APIs to the `v1` version
- Support multiple CRD versions (`v1beta1`, `v1`) with conversion webhook
## Improve Katib User Experience
- Simplify Katib Experiment creation with Katib SDK - [#1951](https://github.com/kubeflow/katib/pull/1951)
- Fully migrate to a new Katib UI - [Project 1](https://github.com/kubeflow/katib/projects/1)
- Expose Trial logs in Katib UI - [#971](https://github.com/kubeflow/katib/issues/971)
- Enhance Katib UI visualization metrics for AutoML Experiments
- Improve Katib Config UX - [#2150](https://github.com/kubeflow/katib/issues/2150)
## Integration with Kubeflow Components
- Kubeflow Pipeline as a Katib Trial target - [#1914](https://github.com/kubeflow/katib/issues/1914)
- Improve data passing when Katib Experiment is part of Kubeflow Pipeline - [#1846](https://github.com/kubeflow/katib/issues/1846)
# History
# Katib 2021 Roadmap # Katib 2021 Roadmap
## New Features ## New Features
@ -66,6 +24,8 @@
- Support multiple CRD version with conversion webhook - Support multiple CRD version with conversion webhook
- MLMD integration with Katib Experiments - MLMD integration with Katib Experiments
# History
# Katib 2020 Roadmap # Katib 2020 Roadmap
## New Features ## New Features


@ -1,64 +0,0 @@
# Security Policy
## Supported Versions
Kubeflow Katib versions are expressed as `vX.Y.Z`, where X is the major version,
Y is the minor version, and Z is the patch version, following the
[Semantic Versioning](https://semver.org/) terminology.
The Kubeflow Katib project maintains release branches for the most recent two minor releases.
Applicable fixes, including security fixes, may be backported to those two release branches,
depending on severity and feasibility.
Users are encouraged to stay updated with the latest releases to benefit from security patches and
improvements.
## Reporting a Vulnerability
We're extremely grateful for security researchers and users that report vulnerabilities to the
Kubeflow Open Source Community. All reports are thoroughly investigated by Kubeflow project owners.
You can use the following ways to report security vulnerabilities privately:
- Using the Kubeflow Katib repository [GitHub Security Advisory](https://github.com/kubeflow/katib/security/advisories/new).
- Using our private Kubeflow Steering Committee mailing list: ksc@kubeflow.org.
Please provide detailed information to help us understand and address the issue promptly.
## Disclosure Process
**Acknowledgment**: We will acknowledge receipt of your report within 10 business days.
**Assessment**: The Kubeflow project owners will investigate the reported issue to determine its
validity and severity.
**Resolution**: If the issue is confirmed, we will work on a fix and prepare a release.
**Notification**: Once a fix is available, we will notify the reporter and coordinate a public
disclosure.
**Public Disclosure**: Details of the vulnerability and the fix will be published in the project's
release notes and communicated through appropriate channels.
## Prevention Mechanisms
Kubeflow Katib employs several measures to prevent security issues:
**Code Reviews**: All code changes are reviewed by maintainers to ensure code quality and security.
**Dependency Management**: Regular updates and monitoring of dependencies (e.g. Dependabot) to
address known vulnerabilities.
**Continuous Integration**: Automated testing and security checks are integrated into the CI/CD pipeline.
**Image Scanning**: Container images are scanned for vulnerabilities.
## Communication Channels
For general questions, please use the following resources:
- Kubeflow [Slack channels](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels).
- Kubeflow discuss [mailing list](https://www.kubeflow.org/docs/about/community/#kubeflow-mailing-list).
Please **do not report** security vulnerabilities through public channels.


@ -0,0 +1,29 @@
# Build the Katib Cert Generator.
FROM golang:alpine AS build-env
WORKDIR /go/src/github.com/kubeflow/katib
# Download packages.
COPY go.mod .
COPY go.sum .
RUN go mod download -x
# Copy sources.
COPY cmd/ cmd/
COPY pkg/ pkg/
# Build the binary.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
fi
# Copy the cert-generator into a thin image.
FROM gcr.io/distroless/static:nonroot
WORKDIR /app
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-cert-generator /app/
USER 65532:65532
ENTRYPOINT ["./katib-cert-generator"]


@ -0,0 +1,42 @@
/*
Copyright 2022 The Kubeflow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package main
import (
"github.com/kubeflow/katib/pkg/cert-generator/v1beta1"
"k8s.io/client-go/kubernetes/scheme"
"k8s.io/klog"
"os"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/client/config"
)
func main() {
kubeClient, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme.Scheme})
if err != nil {
klog.Fatalf("Failed to create kube client: %v", err)
}
cmd, err := v1beta1.NewKatibCertGeneratorCmd(kubeClient)
if err != nil {
klog.Fatalf("Failed to generate cert: %v", err)
}
if err = cmd.Execute(); err != nil {
os.Exit(1)
}
}


@ -1,7 +1,7 @@
# Build the Katib DB manager. # Build the Katib DB manager.
FROM golang:alpine AS build-env FROM golang:alpine AS build-env
ARG TARGETARCH ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
WORKDIR /go/src/github.com/kubeflow/katib WORKDIR /go/src/github.com/kubeflow/katib
@ -15,10 +15,28 @@ COPY cmd/ cmd/
COPY pkg/ pkg/ COPY pkg/ pkg/
# Build the binary. # Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH="${TARGETARCH}" go build -a -o katib-db-manager ./cmd/db-manager/v1beta1 RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
fi
# Add GRPC health probe.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
# Copy the db-manager into a thin image. # Copy the db-manager into a thin image.
FROM alpine:3.15 FROM alpine:3.15
WORKDIR /app WORKDIR /app
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-db-manager /app/ COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-db-manager /app/
ENTRYPOINT ["./katib-db-manager"] ENTRYPOINT ["./katib-db-manager"]
CMD ["-w", "kubernetes"]
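
The two build variants in this Dockerfile choose the Go target architecture differently: one reads the `TARGETARCH` build argument, the other branches on `uname -m` inside the build container. A minimal Python sketch of the `uname -m` branch logic follows. The table and the `goarch_for` helper are hypothetical names used only for illustration, and the fall-through to `amd64` mirrors the shell `else` branch.

```python
import platform

# Mirrors the if/elif/else in the Dockerfile: machine name -> GOARCH value.
# Any machine name not listed falls through to amd64, as in the shell branch.
UNAME_TO_GOARCH = {"ppc64le": "ppc64le", "aarch64": "arm64"}


def goarch_for(machine: str) -> str:
    """Return the GOARCH value the Dockerfile would pick for `uname -m` output."""
    return UNAME_TO_GOARCH.get(machine, "amd64")


# platform.machine() reports the same kind of string as `uname -m`.
print(goarch_for(platform.machine()))
```

Note that the `else` branch also catches architectures the Dockerfile never names, so an unlisted machine type would still be built as `amd64`.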


@ -22,21 +22,19 @@ import (
"fmt" "fmt"
"net" "net"
"os" "os"
"time"
health_pb "github.com/kubeflow/katib/pkg/apis/manager/health" health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1" api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
db "github.com/kubeflow/katib/pkg/db/v1beta1" db "github.com/kubeflow/katib/pkg/db/v1beta1"
"github.com/kubeflow/katib/pkg/db/v1beta1/common" "github.com/kubeflow/katib/pkg/db/v1beta1/common"
"k8s.io/klog/v2" "k8s.io/klog"
"google.golang.org/grpc" "google.golang.org/grpc"
"google.golang.org/grpc/reflection" "google.golang.org/grpc/reflection"
) )
const ( const (
defaultListenAddress = "0.0.0.0:6789" port = "0.0.0.0:6789"
defaultConnectTimeout = time.Second * 60
) )
var dbIf common.KatibDBInterface var dbIf common.KatibDBInterface
@ -89,30 +87,25 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
} }
func main() { func main() {
var connectTimeout time.Duration
var listenAddress string
flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
flag.Parse() flag.Parse()
var err error var err error
dbNameEnvName := common.DBNameEnvName dbNameEnvName := common.DBNameEnvName
dbName := os.Getenv(dbNameEnvName) dbName := os.Getenv(dbNameEnvName)
if dbName == "" { if dbName == "" {
klog.Fatal("DB_NAME env is not set. Exiting") klog.Fatal("DB_NAME env is not set. Exiting")
} }
dbIf, err = db.NewKatibDBInterface(dbName, connectTimeout) dbIf, err = db.NewKatibDBInterface(dbName)
if err != nil { if err != nil {
klog.Fatalf("Failed to open db connection: %v", err) klog.Fatalf("Failed to open db connection: %v", err)
} }
dbIf.DBInit() dbIf.DBInit()
listener, err := net.Listen("tcp", listenAddress) listener, err := net.Listen("tcp", port)
if err != nil { if err != nil {
klog.Fatalf("Failed to listen: %v", err) klog.Fatalf("Failed to listen: %v", err)
} }
size := 1<<31 - 1 size := 1<<31 - 1
klog.Infof("Start Katib manager: %s", listenAddress) klog.Infof("Start Katib manager: %s", port)
s := grpc.NewServer(grpc.MaxRecvMsgSize(size), grpc.MaxSendMsgSize(size)) s := grpc.NewServer(grpc.MaxRecvMsgSize(size), grpc.MaxSendMsgSize(size))
api_pb.RegisterDBManagerServer(s, &server{}) api_pb.RegisterDBManagerServer(s, &server{})
health_pb.RegisterHealthServer(s, &server{}) health_pb.RegisterHealthServer(s, &server{})
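
The `size := 1<<31 - 1` line above sets the gRPC send and receive message limits. In Go, `<<` binds tighter than `-`, so the value is 2^31 - 1. A minimal Python sketch of the same arithmetic follows. Python gives `-` higher precedence than `<<`, so the parentheses are required there; the variable names are illustrative only.

```python
# Go evaluates 1<<31 - 1 as (1 << 31) - 1.
MAX_MSG_SIZE = (1 << 31) - 1  # largest signed 32-bit integer

# Python parses the unparenthesized form as 1 << (31 - 1).
unparenthesized = 1 << 31 - 1

print(MAX_MSG_SIZE)
print(unparenthesized)
```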


@ -20,7 +20,7 @@ import (
"context" "context"
"testing" "testing"
"go.uber.org/mock/gomock" "github.com/golang/mock/gomock"
health_pb "github.com/kubeflow/katib/pkg/apis/manager/health" health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1" api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"


@ -1,11 +1,9 @@
FROM python:3.11-slim FROM python:3.9-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV EARLY_STOPPING_DIR cmd/earlystopping/medianstop/v1beta1 ENV EARLY_STOPPING_DIR cmd/earlystopping/medianstop/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python
RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
@ -14,11 +12,12 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${EARLY_STOPPING_DIR}/ ${TARGET_DIR}/${EARLY_STOPPING_DIR}/ ADD ./${EARLY_STOPPING_DIR}/ ${TARGET_DIR}/${EARLY_STOPPING_DIR}/
WORKDIR ${TARGET_DIR}/${EARLY_STOPPING_DIR} WORKDIR ${TARGET_DIR}/${EARLY_STOPPING_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]


@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import logging
import time
from concurrent import futures
import grpc import grpc
import time
import logging
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.earlystopping.v1beta1.medianstop.service import MedianStopService from pkg.earlystopping.v1beta1.medianstop.service import MedianStopService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6788" DEFAULT_PORT = "0.0.0.0:6788"


@ -1,5 +1,5 @@
grpcio>=1.64.1 grpcio==1.41.1
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
kubernetes==22.6.0 kubernetes==22.6.0
cython>=0.29.24 cython>=0.29.24
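
The requirements change above contrasts exact pins (`grpcio==1.41.1`, `protobuf==3.19.1`) with ranges (`grpcio>=1.64.1`, `protobuf>=4.21.12,<5`). A simplified Python sketch of how such specifiers accept or reject a version follows. It assumes purely numeric dotted versions, and the `satisfies` helper is hypothetical and far less complete than pip's real version handling.

```python
import operator

# Comparison operators supported by this simplified sketch.
_OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def _parse(version: str) -> tuple:
    """Turn '4.21.12' into (4, 21, 12); numeric dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def _pad(a: tuple, b: tuple):
    """Right-pad with zeros so that 5 compares like 5.0.0."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as '>=4.21.12,<5'."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        for symbol in ("==", ">=", "<=", ">", "<"):
            if clause.startswith(symbol):
                left, right = _pad(_parse(version), _parse(clause[len(symbol):]))
                if not _OPS[symbol](left, right):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


print(satisfies("4.25.3", ">=4.21.12,<5"))
print(satisfies("3.19.1", ">=4.21.12,<5"))
print(satisfies("3.19.1", "==3.19.1"))
```

An exact pin accepts a single version, while the range form accepts any release in the 4.x line from 4.21.12 onward.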


@ -1,8 +1,6 @@
# Build the Katib controller. # Build the Katib controller.
FROM golang:alpine AS build-env FROM golang:alpine AS build-env
ARG TARGETARCH
WORKDIR /go/src/github.com/kubeflow/katib WORKDIR /go/src/github.com/kubeflow/katib
# Download packages. # Download packages.
@ -15,7 +13,13 @@ COPY cmd/ cmd/
COPY pkg/ pkg/ COPY pkg/ pkg/
# Build the binary. # Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-controller ./cmd/katib-controller/v1beta1 RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
fi
# Copy the controller-manager into a thin image. # Copy the controller-manager into a thin image.
FROM alpine:3.15 FROM alpine:3.15


@ -15,7 +15,7 @@ limitations under the License.
*/ */
/* /*
Katib-controller is a controller (operator) for Experiments and Trials Katib-controller is a controller (operator) for Experiments and Trials
*/ */
package main package main
@ -24,75 +24,64 @@ import (
"os" "os"
"github.com/spf13/viper" "github.com/spf13/viper"
"k8s.io/apimachinery/pkg/runtime"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp" _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
"sigs.k8s.io/controller-runtime/pkg/client/config" "sigs.k8s.io/controller-runtime/pkg/client/config"
"sigs.k8s.io/controller-runtime/pkg/healthz"
logf "sigs.k8s.io/controller-runtime/pkg/log" logf "sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/log/zap" "sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/manager" "sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/manager/signals" "sigs.k8s.io/controller-runtime/pkg/manager/signals"
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
configv1beta1 "github.com/kubeflow/katib/pkg/apis/config/v1beta1"
apis "github.com/kubeflow/katib/pkg/apis/controller" apis "github.com/kubeflow/katib/pkg/apis/controller"
cert "github.com/kubeflow/katib/pkg/certgenerator/v1beta1" controller "github.com/kubeflow/katib/pkg/controller.v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1/consts" "github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
"github.com/kubeflow/katib/pkg/util/v1beta1/katibconfig" trialutil "github.com/kubeflow/katib/pkg/controller.v1beta1/trial/util"
webhookv1beta1 "github.com/kubeflow/katib/pkg/webhook/v1beta1" webhook "github.com/kubeflow/katib/pkg/webhook/v1beta1"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
) )
var (
scheme = runtime.NewScheme()
log = logf.Log.WithName("entrypoint")
)
func init() {
utilruntime.Must(apis.AddToScheme(scheme))
utilruntime.Must(configv1beta1.AddToScheme(scheme))
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
}
func main() { func main() {
logf.SetLogger(zap.New()) logf.SetLogger(zap.New())
log := logf.Log.WithName("entrypoint")
var experimentSuggestionName string
var metricsAddr string
var webhookPort int
var injectSecurityContext bool
var enableGRPCProbeInSuggestion bool
var trialResources trialutil.GvkListFlag
var enableLeaderElection bool
var leaderElectionID string
flag.StringVar(&experimentSuggestionName, "experiment-suggestion-name",
"default", "The implementation of suggestion interface in experiment controller (default)")
flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
// For leader election
flag.BoolVar(&enableLeaderElection, "enable-leader-election", false, "Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller.")
flag.StringVar(&leaderElectionID, "leader-election-id", "3fbc96e9.katib.kubeflow.org", "The ID for leader election.")
// TODO (andreyvelich): Currently it is not possible to set a different webhook service name.
// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
// TODO (andreyvelich): Currently it is not possible to store the webhook cert in the local file system.
// flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")
var katibConfigFile string
flag.StringVar(&katibConfigFile, "katib-config", "",
"The katib-controller will load its initial configuration from this file. "+
"Omit this flag to use the default configuration values. ")
flag.Parse() flag.Parse()
initConfig, err := katibconfig.GetInitConfigData(scheme, katibConfigFile)
if err != nil {
log.Error(err, "Failed to get KatibConfig")
os.Exit(1)
}
// Set the config in viper. // Set the config in viper.
viper.Set(consts.ConfigExperimentSuggestionName, initConfig.ControllerConfig.ExperimentSuggestionName) viper.Set(consts.ConfigExperimentSuggestionName, experimentSuggestionName)
viper.Set(consts.ConfigInjectSecurityContext, initConfig.ControllerConfig.InjectSecurityContext) viper.Set(consts.ConfigInjectSecurityContext, injectSecurityContext)
viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, initConfig.ControllerConfig.EnableGRPCProbeInSuggestion) viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, enableGRPCProbeInSuggestion)
viper.Set(consts.ConfigTrialResources, trialResources)
trialGVKs, err := katibconfig.TrialResourcesToGVKs(initConfig.ControllerConfig.TrialResources)
if err != nil {
log.Error(err, "Failed to parse trialResources")
os.Exit(1)
}
viper.Set(consts.ConfigTrialResources, trialGVKs)
log.Info("Config:", log.Info("Config:",
consts.ConfigExperimentSuggestionName, consts.ConfigExperimentSuggestionName,
viper.GetString(consts.ConfigExperimentSuggestionName), viper.GetString(consts.ConfigExperimentSuggestionName),
"webhook-port", "webhook-port",
initConfig.ControllerConfig.WebhookPort, webhookPort,
"metrics-addr", "metrics-addr",
initConfig.ControllerConfig.MetricsAddr, metricsAddr,
"healthz-addr",
initConfig.ControllerConfig.HealthzAddr,
consts.ConfigInjectSecurityContext, consts.ConfigInjectSecurityContext,
viper.GetBool(consts.ConfigInjectSecurityContext), viper.GetBool(consts.ConfigInjectSecurityContext),
consts.ConfigEnableGRPCProbeInSuggestion, consts.ConfigEnableGRPCProbeInSuggestion,
@ -110,13 +99,9 @@ func main() {
// Create a new katib controller to provide shared dependencies and start components // Create a new katib controller to provide shared dependencies and start components
mgr, err := manager.New(cfg, manager.Options{ mgr, err := manager.New(cfg, manager.Options{
Metrics: metricsserver.Options{ MetricsBindAddress: metricsAddr,
BindAddress: initConfig.ControllerConfig.MetricsAddr, LeaderElection: enableLeaderElection,
}, LeaderElectionID: leaderElectionID,
HealthProbeBindAddress: initConfig.ControllerConfig.HealthzAddr,
LeaderElection: initConfig.ControllerConfig.EnableLeaderElection,
LeaderElectionID: initConfig.ControllerConfig.LeaderElectionID,
Scheme: scheme,
}) })
if err != nil { if err != nil {
log.Error(err, "Failed to create the manager") log.Error(err, "Failed to create the manager")
@ -125,50 +110,11 @@ func main() {
log.Info("Registering Components.") log.Info("Registering Components.")
// Create a webhook server. // Setup Scheme for all resources
hookServer := webhook.NewServer(webhook.Options{ if err := apis.AddToScheme(mgr.GetScheme()); err != nil {
Port: *initConfig.ControllerConfig.WebhookPort, log.Error(err, "Unable to add APIs to scheme")
CertDir: consts.CertDir,
})
ctx := signals.SetupSignalHandler()
certsReady := make(chan struct{})
defer close(certsReady)
// The setupControllers will register controllers to the manager
// after generated certs for the admission webhooks.
go setupControllers(mgr, certsReady, hookServer)
if initConfig.CertGeneratorConfig.Enable {
if err = cert.AddToManager(mgr, initConfig.CertGeneratorConfig, certsReady); err != nil {
log.Error(err, "Failed to set up cert-generator")
}
} else {
certsReady <- struct{}{}
}
log.Info("Setting up health checker.")
if err := mgr.AddReadyzCheck("readyz", hookServer.StartedChecker()); err != nil {
log.Error(err, "Unable to add readyz endpoint to the manager")
os.Exit(1) os.Exit(1)
} }
if err = mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
log.Error(err, "Add webhook server health checker to the manager failed")
os.Exit(1)
}
// Start the Cmd
log.Info("Starting the manager.")
if err = mgr.Start(ctx); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
}
func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer webhook.Server) {
// The certsReady blocks to register controllers until generated certs.
<-certsReady
log.Info("Certs ready")
// Setup all Controllers // Setup all Controllers
log.Info("Setting up controller.") log.Info("Setting up controller.")
@ -178,8 +124,15 @@ func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer
} }
log.Info("Setting up webhooks.") log.Info("Setting up webhooks.")
if err := webhookv1beta1.AddToManager(mgr, hookServer); err != nil { if err := webhook.AddToManager(mgr, webhookPort); err != nil {
log.Error(err, "Unable to register webhooks to the manager") log.Error(err, "Unable to register webhooks to the manager")
os.Exit(1) os.Exit(1)
} }
// Start the Cmd
log.Info("Starting the Cmd.")
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
} }
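
Note on the hunks above: the left-hand side starts `setupControllers` in a goroutine and makes it block on the `certsReady` channel, so controllers and webhooks are registered only after the certificates are available. The right-hand side registers everything inline in `main`. The following is a minimal sketch of that gating pattern only. It assumes hypothetical names (`waitThenSetup`, `runGated`) and uses a slice of strings in place of the manager; it is not the Katib or controller-runtime code, and it signals readiness with `close` where the diff sends a value.

```go
package main

import (
	"fmt"
	"sync"
)

// waitThenSetup blocks until ready is signalled, then runs setup.
// It stands in for setupControllers; the name is hypothetical.
func waitThenSetup(ready <-chan struct{}, setup func()) {
	<-ready
	setup()
}

// runGated returns the order in which the steps ran, so the gating is observable.
func runGated(certGeneratorEnabled bool) []string {
	var (
		mu    sync.Mutex
		steps []string
		wg    sync.WaitGroup
	)
	record := func(s string) {
		mu.Lock()
		defer mu.Unlock()
		steps = append(steps, s)
	}

	ready := make(chan struct{})
	wg.Add(1)
	go func() {
		defer wg.Done()
		waitThenSetup(ready, func() { record("controllers registered") })
	}()

	if certGeneratorEnabled {
		record("certs generated")
	} else {
		record("certs assumed present")
	}
	// Closing the channel releases the waiting goroutine.
	close(ready)
	wg.Wait()
	return steps
}

func main() {
	fmt.Println(runGated(true))
	fmt.Println(runGated(false))
}
```

Because the receive in `waitThenSetup` cannot complete before `close(ready)`, the registration step is always recorded after the certificate step, whichever branch is taken.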

View File

@ -1,8 +1,6 @@
# Build the Katib file metrics collector. # Build the Katib file metrics collector.
FROM golang:alpine AS build-env FROM golang:alpine AS build-env
ARG TARGETARCH
WORKDIR /go/src/github.com/kubeflow/katib WORKDIR /go/src/github.com/kubeflow/katib
# Download packages. # Download packages.
@ -15,7 +13,13 @@ COPY cmd/ cmd/
COPY pkg/ pkg/ COPY pkg/ pkg/
# Build the binary. # Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
fi
# Copy the file metrics collector into a thin image. # Copy the file metrics collector into a thin image.
FROM alpine:3.15 FROM alpine:3.15
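
Note on the hunk above: the left-hand side passes `${TARGETARCH}` straight to `GOARCH`, while the right-hand side derives `GOARCH` from `uname -m` with an if/elif/else chain. The two differ in naming: `uname -m` reports `aarch64` where Go (and the BuildKit `TARGETARCH` argument) use `arm64`. A small sketch of the mapping that the shell chain encodes, with a hypothetical function name:

```go
package main

import "fmt"

// goarchForMachine mirrors the if/elif/else chain in the Dockerfile:
// the input is what `uname -m` prints, the result is the GOARCH value used.
// Anything that is not ppc64le or aarch64 falls through to amd64, as in the shell code.
func goarchForMachine(machine string) string {
	switch machine {
	case "ppc64le":
		return "ppc64le"
	case "aarch64":
		return "arm64"
	default:
		return "amd64"
	}
}

func main() {
	for _, m := range []string{"x86_64", "aarch64", "ppc64le"} {
		fmt.Printf("%s -> %s\n", m, goarchForMachine(m))
	}
}
```

Note that the `uname -m` variant reflects the machine the build stage runs on, so it selects the build host's architecture rather than a requested target platform.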

View File

@ -42,6 +42,7 @@ import (
"encoding/json" "encoding/json"
"flag" "flag"
"fmt" "fmt"
"io/ioutil"
"os" "os"
"path/filepath" "path/filepath"
"regexp" "regexp"
@ -49,11 +50,11 @@ import (
"strings" "strings"
"time" "time"
"github.com/nxadm/tail" "github.com/hpcloud/tail"
psutil "github.com/shirou/gopsutil/v3/process" psutil "github.com/shirou/gopsutil/v3/process"
"google.golang.org/grpc" "google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure" "google.golang.org/grpc/credentials/insecure"
"k8s.io/klog/v2" "k8s.io/klog"
commonv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1" commonv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
api "github.com/kubeflow/katib/pkg/apis/manager/v1beta1" api "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
@ -134,11 +135,7 @@ func printMetricsFile(mFile string) {
checkMetricFile(mFile) checkMetricFile(mFile)
// Print lines from metrics file. // Print lines from metrics file.
t, err := tail.TailFile(mFile, tail.Config{Follow: true, ReOpen: true}) t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
if err != nil {
klog.Errorf("Failed to open metrics file: %v", err)
}
for line := range t.Lines { for line := range t.Lines {
klog.Info(line.Text) klog.Info(line.Text)
} }
@ -164,9 +161,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
checkMetricFile(mFile) checkMetricFile(mFile)
// Get Main process. // Get Main process.
// Extract the metric file dir path based on the file name. _, mainProcPid, err := common.GetMainProcesses(mFile)
mDirPath, _ := filepath.Split(mFile)
_, mainProcPid, err := common.GetMainProcesses(mDirPath)
if err != nil { if err != nil {
klog.Fatalf("GetMainProcesses failed: %v", err) klog.Fatalf("GetMainProcesses failed: %v", err)
} }
@ -273,7 +268,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
klog.Fatalf("Create mark file %v error: %v", markFile, err) klog.Fatalf("Create mark file %v error: %v", markFile, err)
} }
err = os.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0) err = ioutil.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0)
if err != nil { if err != nil {
klog.Fatalf("Write to file %v error: %v", markFile, err) klog.Fatalf("Write to file %v error: %v", markFile, err)
} }
@ -311,7 +306,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
} }
// Create connection and client for Early Stopping service. // Create connection and client for Early Stopping service.
conn, err := grpc.NewClient(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials())) conn, err := grpc.Dial(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil { if err != nil {
klog.Fatalf("Could not connect to Early Stopping service, error: %v", err) klog.Fatalf("Could not connect to Early Stopping service, error: %v", err)
} }
@ -433,7 +428,7 @@ func main() {
func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) { func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) {
conn, err := grpc.NewClient(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials())) conn, err := grpc.Dial(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
if err != nil { if err != nil {
klog.Fatalf("Could not connect to DB manager service, error: %v", err) klog.Fatalf("Could not connect to DB manager service, error: %v", err)
} }
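
Note on the `watchMetricsFile` hunk above: the left-hand side splits the metrics file path with `filepath.Split` and passes the directory part to `common.GetMainProcesses`, whereas the right-hand side passes the file path itself. A minimal sketch of what that split yields, using a hypothetical helper name and a made-up path:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// metricsDir returns the directory part of a metrics file path.
// filepath.Split keeps the trailing separator on the directory and
// returns an empty directory when the path has no separator at all.
func metricsDir(metricsFile string) string {
	dir, _ := filepath.Split(metricsFile)
	return dir
}

func main() {
	// Hypothetical path, chosen only for illustration.
	fmt.Println(metricsDir("/var/log/katib/metrics.log"))
}
```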

View File

@ -1,24 +1,24 @@
FROM python:3.11-slim FROM python:3.9-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV METRICS_COLLECTOR_DIR cmd/metricscollector/v1beta1/tfevent-metricscollector ENV METRICS_COLLECTOR_DIR cmd/metricscollector/v1beta1/tfevent-metricscollector
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/tfevent-metricscollector/::${TARGET_DIR}/pkg/metricscollector/v1beta1/common/
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/ ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/
WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR} WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}
RUN if [ "${TARGETARCH}" = "arm64" ]; then \ RUN if [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libpcre3 libpcre3-dev && \ apt-get -y install gfortran libpcre3 libpcre3-dev && \
apt-get clean && \ apt-get clean && \
rm -rf /var/lib/apt/lists/*; \ rm -rf /var/lib/apt/lists/*; \
fi fi
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt RUN pip install --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/tfevent-metricscollector/::${TARGET_DIR}/pkg/metricscollector/v1beta1/common/
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -1,6 +1,6 @@
FROM ibmcom/tensorflow-ppc64le:2.2.0-py3 FROM ibmcom/tensorflow-ppc64le:2.2.0-py3
ADD . /usr/src/app/github.com/kubeflow/katib ADD . /usr/src/app/github.com/kubeflow/katib
WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/ WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt RUN pip install --no-cache-dir -r requirements.txt
ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/ ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,15 +12,13 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import argparse
from logging import INFO, StreamHandler, getLogger
import api_pb2
import api_pb2_grpc
import const
import grpc import grpc
import argparse
import api_pb2
from pns import WaitMainProcesses from pns import WaitMainProcesses
import const
from tfevent_loader import MetricsCollector from tfevent_loader import MetricsCollector
from logging import getLogger, StreamHandler, INFO
timeout_in_seconds = 60 timeout_in_seconds = 60
@ -57,28 +55,25 @@ if __name__ == '__main__':
wait_all_processes = opt.wait_all_processes.lower() == "true" wait_all_processes = opt.wait_all_processes.lower() == "true"
db_manager_server = opt.db_manager_server_addr.split(':') db_manager_server = opt.db_manager_server_addr.split(':')
if len(db_manager_server) != 2: if len(db_manager_server) != 2:
raise Exception( raise Exception("Invalid Katib DB manager service address: %s" %
f"Invalid Katib DB manager service address: {opt.db_manager_server_addr}" opt.db_manager_server_addr)
)
WaitMainProcesses( WaitMainProcesses(
pool_interval=opt.poll_interval, pool_interval=opt.poll_interval,
timout=opt.timeout, timout=opt.timeout,
wait_all=wait_all_processes, wait_all=wait_all_processes,
completed_marked_dir=opt.metrics_file_dir, completed_marked_dir=opt.metrics_file_dir)
)
mc = MetricsCollector(opt.metric_names.split(";")) mc = MetricsCollector(opt.metric_names.split(';'))
observation_log = mc.parse_file(opt.metrics_file_dir) observation_log = mc.parse_file(opt.metrics_file_dir)
with grpc.insecure_channel(opt.db_manager_server_addr) as channel: channel = grpc.beta.implementations.insecure_channel(
stub = api_pb2_grpc.DBManagerStub(channel) db_manager_server[0], int(db_manager_server[1]))
logger.info(
f"In {opt.trial_name} {str(len(observation_log.metric_logs))} metrics will be reported." with api_pb2.beta_create_DBManager_stub(channel) as client:
) logger.info("In " + opt.trial_name + " " +
stub.ReportObservationLog( str(len(observation_log.metric_logs)) + " metrics will be reported.")
api_pb2.ReportObservationLogRequest( client.ReportObservationLog(api_pb2.ReportObservationLogRequest(
trial_name=opt.trial_name, observation_log=observation_log trial_name=opt.trial_name,
), observation_log=observation_log
timeout=timeout_in_seconds, ), timeout=timeout_in_seconds)
)

View File

@ -1,6 +1,6 @@
psutil==5.9.4 psutil==5.8.0
rfc3339>=6.2 rfc3339>=6.2
grpcio>=1.64.1 grpcio==1.41.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
tensorflow==2.16.1 tensorflow==2.9.1; platform_machine=="x86_64"
protobuf>=4.21.12,<5 tensorflow-aarch64==2.9.1; platform_machine=="aarch64"

View File

@ -0,0 +1,63 @@
# --- Clone the kubeflow/kubeflow code ---
FROM ubuntu AS fetch-kubeflow-kubeflow
RUN apt-get update && apt-get install git -y
WORKDIR /kf
RUN git clone https://github.com/kubeflow/kubeflow.git && \
cd kubeflow && \
git checkout ecb72c2
# --- Build the frontend kubeflow library ---
FROM node:12 AS frontend-kubeflow-lib
WORKDIR /src
ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
RUN npm ci
COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
RUN npm run build
# --- Build the frontend ---
FROM node:12 AS frontend
WORKDIR /src
COPY ./pkg/new-ui/v1beta1/frontend/package*.json ./
RUN npm ci
COPY ./pkg/new-ui/v1beta1/frontend/ .
COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
RUN npm run build:prod
# --- Build the backend ---
FROM golang:alpine AS go-build
WORKDIR /go/src/github.com/kubeflow/katib
# Download packages.
COPY go.mod .
COPY go.sum .
RUN go mod download -x
# Copy sources.
COPY cmd/ cmd/
COPY pkg/ pkg/
# Build the binary.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
fi
# --- Compose the web app ---
FROM alpine:3.15
WORKDIR /app
COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
COPY --from=frontend /src/dist/static /app/build/static/
ENTRYPOINT ["./katib-ui"]

View File

@ -0,0 +1,75 @@
/*
Copyright 2022 The Kubeflow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package main
import (
"flag"
"fmt"
"log"
"net/http"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
common_v1beta1 "github.com/kubeflow/katib/pkg/common/v1beta1"
ui "github.com/kubeflow/katib/pkg/new-ui/v1beta1"
)
var (
port, host, buildDir, dbManagerAddr *string
)
func init() {
port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
}
func main() {
flag.Parse()
kuh := ui.NewKatibUIHandler(*dbManagerAddr)
log.Printf("Serving the frontend dir %s", *buildDir)
frontend := http.FileServer(http.Dir(*buildDir))
http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)
http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
http.HandleFunc("/katib/fetch_hp_job_trial_info/", kuh.FetchHPJobTrialInfo)
http.HandleFunc("/katib/fetch_nas_job_info/", kuh.FetchNASJobInfo)
http.HandleFunc("/katib/fetch_trial_templates/", kuh.FetchTrialTemplates)
http.HandleFunc("/katib/add_template/", kuh.AddTemplate)
http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
log.Printf("Serving at %s:%s", *host, *port)
if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
panic(err)
}
}
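
Note on the static route above: `http.StripPrefix("/katib/", frontend)` removes the `/katib/` prefix before the file server looks up the path, so a request for `/katib/static/<file>` is served from `<build-dir>/static/<file>`. The sketch below reproduces only that one route against an in-memory file system so that it runs without a real build directory. The helper names (`newMux`, `fetch`) and the file `static/app.js` are assumptions for illustration, not part of the Katib UI.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"testing/fstest"
)

// newMux wires the static route the same way as the file above.
func newMux(buildDir http.FileSystem) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/katib/static/", http.StripPrefix("/katib/", http.FileServer(buildDir)))
	return mux
}

// fetch issues an in-memory GET against a mux backed by a hypothetical
// build directory and returns the status code and the response body.
func fetch(target string) (int, string) {
	build := fstest.MapFS{
		"static/app.js": {Data: []byte("console.log('katib')")},
	}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	newMux(http.FS(build)).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := fetch("/katib/static/app.js")
	fmt.Println(code, body)
}
```

Using a dedicated `ServeMux` instead of the package-level default keeps the sketch free of global state; the file above registers its handlers on the default mux.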

View File

@ -0,0 +1,36 @@
FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
FROM python:3.9-slim
ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/chocolate/v1beta1
RUN apt-get -y update && \
apt-get -y install git && \
if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y install gfortran libopenblas-dev liblapack-dev g++; \
fi && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"]

View File

@ -0,0 +1,42 @@
# Copyright 2022 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.chocolate.service import ChocolateService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"
def serve():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
service = ChocolateService()
api_pb2_grpc.add_SuggestionServicer_to_server(service, server)
health_pb2_grpc.add_HealthServicer_to_server(service, server)
server.add_insecure_port(DEFAULT_PORT)
print("Listening...")
server.start()
try:
while True:
time.sleep(_ONE_DAY_IN_SECONDS)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
serve()

View File

@ -0,0 +1,13 @@
grpcio==1.41.1
cloudpickle==0.5.6
numpy>=1.20.0
scikit-learn>=0.24.0
scipy>=1.5.4
forestci==0.3
protobuf==3.19.1
googleapis-common-protos==1.6.0
SQLAlchemy==1.4.26
git+https://github.com/AIworx-Labs/chocolate@master
ghalton>=0.6.2; platform_machine=="x86_64"
git+https://github.com/fmder/ghalton@master; platform_machine=="aarch64"
cython>=0.29.24

View File

@ -1,7 +1,7 @@
# Build the Goptuna Suggestion. # Build the Goptuna Suggestion.
FROM golang:alpine AS build-env FROM golang:alpine AS build-env
ARG TARGETARCH ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
WORKDIR /go/src/github.com/kubeflow/katib WORKDIR /go/src/github.com/kubeflow/katib
@ -15,7 +15,23 @@ COPY cmd/ cmd/
COPY pkg/ pkg/ COPY pkg/ pkg/
# Build the binary. # Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1 RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
fi
# Add GRPC health probe.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
# Copy the Goptuna suggestion into a thin image. # Copy the Goptuna suggestion into a thin image.
FROM alpine:3.15 FROM alpine:3.15
@ -23,7 +39,7 @@ FROM alpine:3.15
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
WORKDIR ${TARGET_DIR} WORKDIR ${TARGET_DIR}
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/ COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \

View File

@ -24,7 +24,7 @@ import (
api_v1_beta1 "github.com/kubeflow/katib/pkg/apis/manager/v1beta1" api_v1_beta1 "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
suggestion "github.com/kubeflow/katib/pkg/suggestion/v1beta1/goptuna" suggestion "github.com/kubeflow/katib/pkg/suggestion/v1beta1/goptuna"
"google.golang.org/grpc" "google.golang.org/grpc"
"k8s.io/klog/v2" "k8s.io/klog"
) )
const ( const (

View File

@ -1,11 +1,19 @@
FROM python:3.11-slim FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/hyperband/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/hyperband/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
@ -14,11 +22,14 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.hyperband.service import HyperbandService from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,9 +1,9 @@
grpcio>=1.64.1 grpcio==1.41.1
cloudpickle==0.5.6 cloudpickle==0.5.6
numpy>=1.25.2 numpy>=1.20.0
scikit-learn>=0.24.0 scikit-learn>=0.24.0
scipy>=1.5.4 scipy>=1.5.4
forestci==0.3 forestci==0.3
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
cython>=0.29.24 cython>=0.29.24

View File

@ -1,11 +1,19 @@
FROM python:3.11-slim FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/hyperopt/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/hyperopt/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
@ -14,11 +22,14 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,10 +1,10 @@
grpcio>=1.64.1 grpcio==1.41.1
cloudpickle==0.5.6 cloudpickle==0.5.6
numpy>=1.25.2 numpy>=1.20.0
scikit-learn>=0.24.0 scikit-learn>=0.24.0
scipy>=1.5.4 scipy>=1.5.4
forestci==0.3 forestci==0.3
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
hyperopt==0.2.5 hyperopt==0.2.5
cython>=0.29.24 cython>=0.29.24

View File

@ -1,11 +1,19 @@
FROM python:3.11-slim FROM alpine:3.15 as downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/nas/darts/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/nas/darts/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
@ -14,11 +22,14 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,15 +12,14 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
from concurrent import futures
from pkg.apis.manager.health.python import health_pb2_grpc import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.nas.darts.service import DartsService from pkg.suggestion.v1beta1.nas.darts.service import DartsService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio>=1.64.1 grpcio==1.41.1
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
cython>=0.29.24 cython>=0.29.24

View File

@ -1,11 +1,20 @@
FROM python:3.11-slim FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
@ -14,11 +23,14 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,15 +12,15 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
from concurrent import futures
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.nas.enas.service import EnasService from pkg.suggestion.v1beta1.nas.enas.service import EnasService
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,5 +1,5 @@
grpcio>=1.64.1 grpcio==1.41.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
cython>=0.29.24 cython>=0.29.24
tensorflow==2.16.1 tensorflow==2.9.1; platform_machine=="x86_64"
protobuf>=4.21.12,<5 tensorflow-aarch64==2.9.1; platform_machine=="aarch64"

View File

@ -1,24 +1,34 @@
FROM python:3.11-slim FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/optuna/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/optuna/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
rm -rf /var/lib/apt/lists/*; \ rm -rf /var/lib/apt/lists/*; \
fi fi
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.optuna.service import OptunaService from pkg.suggestion.v1beta1.optuna.service import OptunaService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio>=1.64.1 grpcio==1.41.1
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.53.0 googleapis-common-protos==1.53.0
optuna==3.3.0 optuna<3.0.0

View File

@ -1,24 +1,37 @@
FROM python:3.11-slim FROM python:3.9-slim
ARG TARGETARCH
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/pbt/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/pbt/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python ENV GRPC_HEALTH_PROBE_VERSION v0.4.6
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*; \
else \
apt-get -y update && \
apt-get -y install wget && \
apt-get clean && \ apt-get clean && \
rm -rf /var/lib/apt/lists/*; \ rm -rf /var/lib/apt/lists/*; \
fi fi
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.pbt.service import PbtService from pkg.suggestion.v1beta1.pbt.service import PbtService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,4 +1,4 @@
grpcio>=1.64.1 grpcio==1.41.1
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.53.0 googleapis-common-protos==1.53.0
numpy==1.25.2 numpy==1.22.2

View File

@ -1,24 +1,34 @@
FROM python:3.10-slim FROM alpine:3.15 AS downloader
ENV GRPC_HEALTH_PROBE_VERSION v0.4.11
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
elif [ "$(uname -m)" = "aarch64" ]; then \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
else \
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
fi && \
chmod +x /bin/grpc_health_probe
ARG TARGETARCH FROM python:3.9-slim
ENV TARGET_DIR /opt/katib ENV TARGET_DIR /opt/katib
ENV SUGGESTION_DIR cmd/suggestion/skopt/v1beta1 ENV SUGGESTION_DIR cmd/suggestion/skopt/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \ RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apt-get -y update && \ apt-get -y update && \
apt-get -y install gfortran libopenblas-dev liblapack-dev && \ apt-get -y install gfortran libopenblas-dev liblapack-dev && \
apt-get clean && \ apt-get clean && \
rm -rf /var/lib/apt/lists/*; \ rm -rf /var/lib/apt/lists/*; \
fi fi
ADD ./pkg/ ${TARGET_DIR}/pkg/ ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/ ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
COPY --from=downloader /bin/grpc_health_probe /bin/grpc_health_probe
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR} WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \ RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR} && chmod -R g+rwX ${TARGET_DIR}
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
ENTRYPOINT ["python", "main.py"] ENTRYPOINT ["python", "main.py"]

View File

@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import time
from concurrent import futures
import grpc import grpc
import time
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2_grpc from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.skopt.service import SkoptService from pkg.suggestion.v1beta1.skopt.service import SkoptService
from concurrent import futures
_ONE_DAY_IN_SECONDS = 60 * 60 * 24 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789" DEFAULT_PORT = "0.0.0.0:6789"

View File

@ -1,13 +1,10 @@
grpcio>=1.64.1 grpcio==1.41.1
cloudpickle==0.5.6 cloudpickle==0.5.6
# This is a workaround to avoid the following error. numpy>=1.20.0
# AttributeError: module 'numpy' has no attribute 'int' scikit-learn>=0.24.0
# See more: https://github.com/numpy/numpy/pull/22607
numpy==1.23.5
scikit-learn>=0.24.0, <=1.3.0
scipy>=1.5.4 scipy>=1.5.4
forestci==0.3 forestci==0.3
protobuf>=4.21.12,<5 protobuf==3.19.1
googleapis-common-protos==1.6.0 googleapis-common-protos==1.6.0
scikit-optimize>=0.9.0 scikit-optimize>=0.9.0
cython>=0.29.24 cython>=0.29.24

View File

@ -1,56 +1,15 @@
# --- Clone the kubeflow/kubeflow code --- # Build the Katib UI.
FROM alpine/git AS fetch-kubeflow-kubeflow FROM node:12.18.1 AS npm-build
WORKDIR /kf # Build frontend.
COPY ./pkg/ui/v1beta1/frontend/COMMIT ./ ADD /pkg/ui/v1beta1/frontend /frontend
RUN git clone https://github.com/kubeflow/kubeflow.git && \ RUN cd /frontend && npm ci
COMMIT=$(cat ./COMMIT) && \ RUN cd /frontend && npm run build
cd kubeflow && \ RUN rm -rf /frontend/node_modules
git checkout $COMMIT
# --- Build the frontend kubeflow library --- # Build backend.
FROM node:16-alpine AS frontend-kubeflow-lib
WORKDIR /src
ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
npm config set fetch-retry-maxtimeout 1200000 && \
npm config get registry && \
npm config set registry https://registry.npmjs.org/ && \
npm config delete https-proxy && \
npm config set loglevel verbose && \
npm cache clean --force && \
npm ci --force --prefer-offline --no-audit
COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
RUN npm run build
# --- Build the frontend ---
FROM node:16-alpine AS frontend
WORKDIR /src
COPY ./pkg/ui/v1beta1/frontend/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
npm config set fetch-retry-maxtimeout 1200000 && \
npm config get registry && \
npm config set registry https://registry.npmjs.org/ && \
npm config delete https-proxy && \
npm config set loglevel verbose && \
npm cache clean --force && \
npm ci --force --prefer-offline --no-audit
COPY ./pkg/ui/v1beta1/frontend/ .
COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
RUN npm run build:prod
# --- Build the backend ---
FROM golang:alpine AS go-build FROM golang:alpine AS go-build
ARG TARGETARCH
WORKDIR /go/src/github.com/kubeflow/katib WORKDIR /go/src/github.com/kubeflow/katib
# Download packages. # Download packages.
@ -63,11 +22,17 @@ COPY cmd/ cmd/
COPY pkg/ pkg/ COPY pkg/ pkg/
# Build the binary. # Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/ui/v1beta1 RUN if [ "$(uname -m)" = "ppc64le" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/ui/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
else \
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
fi
# --- Compose the web app --- # Copy the backend and frontend into a thin image.
FROM alpine:3.15 FROM alpine:3.15
WORKDIR /app WORKDIR /app
COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/ COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
COPY --from=frontend /src/dist/static /app/build/static/ COPY --from=npm-build /frontend/build /app/build
ENTRYPOINT ["./katib-ui"] ENTRYPOINT ["./katib-ui"]

View File

@ -33,7 +33,7 @@ var (
) )
func init() { func init() {
port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections") port = flag.String("port", "80", "The port to listen to for incoming HTTP connections")
host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections") host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
buildDir = flag.String("build-dir", "/app/build", "The dir of frontend") buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager") dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
@ -45,17 +45,17 @@ func main() {
log.Printf("Serving the frontend dir %s", *buildDir) log.Printf("Serving the frontend dir %s", *buildDir)
frontend := http.FileServer(http.Dir(*buildDir)) frontend := http.FileServer(http.Dir(*buildDir))
http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir)) http.Handle("/katib/", http.StripPrefix("/katib/", frontend))
http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
http.HandleFunc("/katib/fetch_experiments/", kuh.FetchExperiments) http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)
http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment) http.HandleFunc("/katib/submit_yaml/", kuh.SubmitYamlJob)
http.HandleFunc("/katib/submit_hp_job/", kuh.SubmitParamsJob)
http.HandleFunc("/katib/submit_nas_job/", kuh.SubmitParamsJob)
http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment) http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment) http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion) http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo) http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
@ -67,7 +67,6 @@ func main() {
http.HandleFunc("/katib/edit_template/", kuh.EditTemplate) http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate) http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces) http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
http.HandleFunc("/katib/fetch_trial_logs/", kuh.FetchTrialLogs)
log.Printf("Serving at %s:%s", *host, *port) log.Printf("Serving at %s:%s", *host, *port)
if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil { if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {

View File

@ -1,13 +0,0 @@
#!/bin/sh
# Run conformance test and generate test report.
python test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py --experiment-path examples/v1beta1/hp-tuning/random.yaml --namespace kf-conformance \
--trial-pod-labels '{"sidecar.istio.io/inject": "false"}' | tee /tmp/katib-conformance.log
# Create the done file.
touch /tmp/katib-conformance.done
echo "Done..."
# Keep the container running so the test logs can be downloaded.
while true; do sleep 10000; done

View File

@ -1,5 +0,0 @@
# Katib Documentation
Welcome to Kubeflow Katib!
The Katib documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/katib/).

131
docs/developer-guide.md Normal file
View File

@ -0,0 +1,131 @@
# Developer Guide
This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following user guides:
- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)
in Katib, hyperparameter tuning, and neural architecture search.
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- Detailed guide to [configuring and running a Katib
experiment](https://kubeflow.org/docs/components/katib/experiment/).
## Requirements
- [Go](https://golang.org/) (1.17 or later)
- [Docker](https://docs.docker.com/) (20.10 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.9 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
## Build from source code
Build the Katib images from the source code as follows:
```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```
To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/katib-config.yaml).
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
```bash
make deploy
```
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
```bash
make undeploy
```
## Modify controller APIs
If you modify the Katib controller APIs, you have to regenerate the
deepcopy, clientset, listers, informers, OpenAPI, and Python SDK code for the changed APIs.
You can update the necessary files as follows:
```bash
make generate
```
## Controller Flags
Below is a list of command-line flags accepted by the Katib controller:
| Name | Type | Default | Description |
| ------------------------------- | ------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| enable-grpc-probe-in-suggestion | bool | true | Enable grpc probe in suggestions |
| experiment-suggestion-name | string | "default" | The implementation of suggestion interface in experiment controller |
| metrics-addr | string | ":8080" | The address the metric endpoint binds to |
| trial-resources | []schema.GroupVersionKind | null | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext | bool | false | Inject the securityContext of container[0] in the sidecar |
| webhook-port | int | 8443 | The port number to be used for admission webhook server |
| enable-leader-election | bool | false | Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller. |
| leader-election-id | string | "3fbc96e9.katib.kubeflow.org" | The ID for leader election. |
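
The `trial-resources` flag in the table above takes values of the form `Kind.version.group`. As a minimal sketch of how such a value breaks down (this is an illustration in Python, not the controller's actual parsing code; `GroupVersionKind` and `parse_trial_resource` are hypothetical names), the string can be split at its first two dots, since the group part may itself contain dots:

```python
from typing import NamedTuple


class GroupVersionKind(NamedTuple):
    """Illustrative stand-in for a Kubernetes group/version/kind triple."""
    group: str
    version: str
    kind: str


def parse_trial_resource(value: str) -> GroupVersionKind:
    """Split a 'Kind.version.group' string into its three parts."""
    # Split at the first two dots only, because the group itself contains dots.
    parts = value.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Kind.version.group, got {value!r}")
    kind, version, group = parts
    return GroupVersionKind(group=group, version=version, kind=kind)


# Example value taken from the flag description above.
gvk = parse_trial_resource("TFJob.v1.kubeflow.org")
print(gvk.kind, gvk.version, gvk.group)
```

Splitting with a limit of two keeps `kubeflow.org` intact as the group, which a plain split on every dot would not.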
## Workflow design
Please see [workflow-design.md](./workflow-design.md).
## Katib admission webhooks
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
1. `validator.experiment.katib.kubeflow.org` -
[Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
to validate the Katib Experiment before it is created.
1. `defaulter.experiment.katib.kubeflow.org` -
[Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
in the Katib Experiment before it is created.
1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
collector sidecar container into the training pod. Learn more about Katib's
metrics collector in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).
You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the master
plane CIDR source range to use the Katib webhooks.
### Katib cert generator
Katib uses the custom `cert-generator` [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
to generate certificates for the webhooks.
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` Job follows these steps:
- Generate the self-signed CA certificate and private key.
- Generate a public certificate and private key, with the certificate signed by the CA key generated in the previous step.
- Create a Kubernetes Secret with the signed certificate. The Secret is named
`katib-webhook-cert` and has an `ownerReference` to the `cert-generator` Job, so that
the resources are cleaned up once Katib is uninstalled.
Once the Secret is created, the Katib controller Deployment spawns the Pod,
since the controller has the `katib-webhook-cert` Secret volume.
- Patch the webhooks with the `CABundle`.
You can find the `cert-generator` source code [here](../cmd/cert-generator/v1beta1).
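
To make the Secret step above more concrete, here is a rough sketch of the kind of manifest it describes, built as a plain Python dictionary. It is not taken from the `cert-generator` source: the helper name `build_webhook_secret`, the data keys `tls.crt` and `tls.key`, the `kubeflow` namespace, and the placeholder UID are all assumptions for illustration. Only the Secret name and the owner reference to the `cert-generator` Job come from the description above.

```python
import base64
import json


def build_webhook_secret(cert_pem: bytes, key_pem: bytes,
                         job_name: str, job_uid: str, namespace: str) -> dict:
    """Build a Secret manifest owned by the given Job (illustrative only)."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": "katib-webhook-cert",
            "namespace": namespace,
            # The owner reference lets Kubernetes garbage-collect the Secret
            # when the owning Job is deleted.
            "ownerReferences": [{
                "apiVersion": "batch/v1",
                "kind": "Job",
                "name": job_name,
                "uid": job_uid,
            }],
        },
        # Secret data values must be base64-encoded strings.
        # The key names below are assumed, not confirmed by the guide.
        "data": {
            "tls.crt": base64.b64encode(cert_pem).decode("ascii"),
            "tls.key": base64.b64encode(key_pem).decode("ascii"),
        },
    }


# Placeholder inputs; a real Job UID is assigned by the API server.
secret = build_webhook_secret(b"<signed cert PEM>", b"<private key PEM>",
                              job_name="cert-generator",
                              job_uid="00000000-0000-0000-0000-000000000000",
                              namespace="kubeflow")
print(json.dumps(secret["metadata"], indent=2))
```

Because the Secret is owned by the Job, deleting the Job during uninstall is enough for Kubernetes to remove the Secret as well.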
## Implement a new algorithm and use it in Katib
Please see [new-algorithm-service.md](./new-algorithm-service.md).
## Katib UI documentation
Please see [Katib UI README](https://github.com/kubeflow/katib/tree/master/pkg/ui/v1beta1).
## Design proposals
Please see [proposals](./proposals).

View File

@ -5,7 +5,7 @@ Here you can find the location for images that are used in Katib.
## Katib Components Images ## Katib Components Images
The following table shows images for the The following table shows images for the
[Katib components](https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components). [Katib components](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).
<table> <table>
<tbody> <tbody>
@ -22,7 +22,7 @@ The following table shows images for the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/katib-controller</code> <code>docker.io/kubeflowkatib/katib-controller</code>
</td> </td>
<td> <td>
Katib Controller Katib Controller
@ -33,7 +33,7 @@ The following table shows images for the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/katib-ui</code> <code>docker.io/kubeflowkatib/katib-ui</code>
</td> </td>
<td> <td>
Katib User Interface Katib User Interface
@ -44,7 +44,7 @@ The following table shows images for the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/katib-db-manager</code> <code>docker.io/kubeflowkatib/katib-db-manager</code>
</td> </td>
<td> <td>
Katib DB Manager Katib DB Manager
@ -64,13 +64,24 @@ The following table shows images for the
<a href="https://github.com/docker-library/mysql/blob/c506174eab8ae160f56483e8d72410f8f1e1470f/8.0/Dockerfile.debian">Dockerfile</a> <a href="https://github.com/docker-library/mysql/blob/c506174eab8ae160f56483e8d72410f8f1e1470f/8.0/Dockerfile.debian">Dockerfile</a>
</td> </td>
</tr> </tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/cert-generator</code>
</td>
<td>
Katib Cert Generator
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/cert-generator/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
</tbody> </tbody>
</table> </table>
## Katib Metrics Collectors Images ## Katib Metrics Collectors Images
The following table shows images for the The following table shows images for the
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/). [Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).
<table> <table>
<tbody> <tbody>
@ -87,7 +98,7 @@ The following table shows images for the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/file-metrics-collector</code> <code>docker.io/kubeflowkatib/file-metrics-collector</code>
</td> </td>
<td> <td>
File Metrics Collector File Metrics Collector
@ -98,7 +109,7 @@ The following table shows images for the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/tfevent-metrics-collector</code> <code>docker.io/kubeflowkatib/tfevent-metrics-collector</code>
</td> </td>
<td> <td>
Tensorflow Event Metrics Collector Tensorflow Event Metrics Collector
@ -113,8 +124,8 @@ The following table shows images for the
## Katib Suggestions and Early Stopping Images ## Katib Suggestions and Early Stopping Images
The following table shows images for the The following table shows images for the
[Katib Suggestion services](https://www.kubeflow.org/docs/components/katib/reference/architecture/#suggestion) [Katib Suggestions](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/user-guides/early-stopping/#early-stopping-algorithms). and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/early-stopping/).
<table> <table>
<tbody> <tbody>
@ -131,7 +142,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperopt</code> <code>docker.io/kubeflowkatib/suggestion-hyperopt</code>
</td> </td>
<td> <td>
<a href="https://github.com/hyperopt/hyperopt">Hyperopt</a> Suggestion <a href="https://github.com/hyperopt/hyperopt">Hyperopt</a> Suggestion
@ -142,7 +153,18 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-skopt</code> <code>docker.io/kubeflowkatib/suggestion-chocolate</code>
</td>
<td>
<a href="https://github.com/AIworx-Labs/chocolate">Chocolate</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/chocolate/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-skopt</code>
</td> </td>
<td> <td>
<a href="https://github.com/scikit-optimize/scikit-optimize">Skopt</a> Suggestion <a href="https://github.com/scikit-optimize/scikit-optimize">Skopt</a> Suggestion
@ -153,7 +175,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-optuna</code> <code>docker.io/kubeflowkatib/suggestion-optuna</code>
</td> </td>
<td> <td>
<a href="https://github.com/optuna/optuna">Optuna</a> Suggestion <a href="https://github.com/optuna/optuna">Optuna</a> Suggestion
@ -164,7 +186,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-goptuna</code> <code>docker.io/kubeflowkatib/suggestion-goptuna</code>
</td> </td>
<td> <td>
<a href="https://github.com/c-bata/goptuna">Goptuna</a> Suggestion <a href="https://github.com/c-bata/goptuna">Goptuna</a> Suggestion
@ -175,7 +197,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperband</code> <code>docker.io/kubeflowkatib/suggestion-hyperband</code>
</td> </td>
<td> <td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">Hyperband</a> Suggestion <a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">Hyperband</a> Suggestion
@ -186,7 +208,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-enas</code> <code>docker.io/kubeflowkatib/suggestion-enas</code>
</td> </td>
<td> <td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#enas">ENAS</a> Suggestion <a href="https://www.kubeflow.org/docs/components/katib/experiment/#enas">ENAS</a> Suggestion
@ -197,7 +219,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/suggestion-darts</code> <code>docker.io/kubeflowkatib/suggestion-darts</code>
</td> </td>
<td> <td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a> Suggestion <a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a> Suggestion
@ -208,7 +230,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/earlystopping-medianstop</code> <code>docker.io/kubeflowkatib/earlystopping-medianstop</code>
</td> </td>
<td> <td>
<a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stopping Rule</a> <a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stopping Rule</a>
@ -223,7 +245,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
## Training Containers Images ## Training Containers Images
The following table shows images for training containers which are used in the The following table shows images for training containers which are used in the
[Katib Trials](https://www.kubeflow.org/docs/components/katib/reference/architecture/#trial). [Katib Trials](https://www.kubeflow.org/docs/components/katib/experiment/#packaging-your-training-code-in-a-container-image).
<table> <table>
<tbody> <tbody>
@ -240,7 +262,18 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-cpu</code> <code>docker.io/kubeflowkatib/mxnet-mnist</code>
</td>
<td>
MXNet MNIST example with collecting metrics time
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/mxnet-mnist/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/pytorch-mnist-cpu</code>
</td> </td>
<td> <td>
PyTorch MNIST example with printing metrics to the file or StdOut with CPU support PyTorch MNIST example with printing metrics to the file or StdOut with CPU support
@ -251,7 +284,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-gpu</code> <code>docker.io/kubeflowkatib/pytorch-mnist-gpu</code>
</td> </td>
<td> <td>
PyTorch MNIST example with printing metrics to the file or StdOut with GPU support PyTorch MNIST example with printing metrics to the file or StdOut with GPU support
@ -262,7 +295,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/tf-mnist-with-summaries</code> <code>docker.io/kubeflowkatib/tf-mnist-with-summaries</code>
</td> </td>
<td> <td>
Tensorflow MNIST example with saving metrics in the summaries Tensorflow MNIST example with saving metrics in the summaries
@ -273,7 +306,18 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/xgboost-lightgbm</code> <code>docker.io/bytepsimage/mxnet</code>
</td>
<td>
Distributed BytePS example for MXJob
</td>
<td>
<a href="https://github.com/bytedance/byteps/blob/v0.2.5/docker/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/xgboost-lightgbm</code>
</td> </td>
<td> <td>
Distributed LightGBM example for XGBoostJob Distributed LightGBM example for XGBoostJob
@ -306,7 +350,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-gpu</code> <code>docker.io/kubeflowkatib/enas-cnn-cifar10-gpu</code>
</td> </td>
<td> <td>
Keras CIFAR-10 CNN example for ENAS with GPU support Keras CIFAR-10 CNN example for ENAS with GPU support
@ -317,7 +361,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-cpu</code> <code>docker.io/kubeflowkatib/enas-cnn-cifar10-cpu</code>
</td> </td>
<td> <td>
Keras CIFAR-10 CNN example for ENAS with CPU support Keras CIFAR-10 CNN example for ENAS with CPU support
@ -328,7 +372,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-gpu</code> <code>docker.io/kubeflowkatib/darts-cnn-cifar10-gpu</code>
</td> </td>
<td> <td>
PyTorch CIFAR-10 CNN example for DARTS with GPU support PyTorch CIFAR-10 CNN example for DARTS with GPU support
@ -339,7 +383,7 @@ The following table shows images for training containers which are used in the
</tr> </tr>
<tr align="center"> <tr align="center">
<td> <td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-cpu</code> <code>docker.io/kubeflowkatib/darts-cnn-cifar10-cpu</code>
</td> </td>
<td> <td>
PyTorch CIFAR-10 CNN example for DARTS with CPU support PyTorch CIFAR-10 CNN example for DARTS with CPU support

BIN
docs/images/SystemFlow.png Normal file

Binary file not shown. After: 102 KiB

Binary file not shown. After: 192 KiB

View File

Before: 166 KiB. After: 166 KiB

Some files were not shown because too many files have changed in this diff.