mirror of https://github.com/kubeflow/katib.git
Compare commits: master ... v0.13.0-rc (4 commits)

Commits: 1f9dff0307, c00cf67074, 4458e7bdcd, 6329f48685
@@ -4,3 +4,5 @@ docs
 manifests
 pkg/ui/*/frontend/node_modules
 pkg/ui/*/frontend/build
+pkg/new-ui/*/frontend/node_modules
+pkg/new-ui/*/frontend/build
.flake8 (4 lines changed)

@@ -1,4 +0,0 @@
-[flake8]
-max-line-length = 100
-# E203 is ignored to avoid conflicts with Black's formatting, as it's not PEP 8 compliant
-extend-ignore = W503, E203
@@ -0,0 +1,26 @@
+---
+name: Bug report
+about: Tell us about a problem you are experiencing
+---
+
+/kind bug
+
+**What steps did you take and what happened:**
+[A clear and concise description of what the bug is.]
+
+**What did you expect to happen:**
+
+**Anything else you would like to add:**
+[Miscellaneous information that will assist in solving the issue.]
+
+**Environment:**
+
+- Katib version (check the Katib controller image version):
+- Kubernetes version: (`kubectl version`):
+- OS (`uname -a`):
+
+---
+
+<!-- Don't delete this message to encourage users to support your issue! -->
+
+Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
@@ -1,50 +0,0 @@
-name: Bug Report
-description: Tell us about a problem you are experiencing with Katib
-labels: ["kind/bug", "lifecycle/needs-triage"]
-body:
-  - type: markdown
-    attributes:
-      value: |
-        Thanks for taking the time to fill out this Katib bug report!
-  - type: textarea
-    id: problem
-    attributes:
-      label: What happened?
-      description: |
-        Please provide as much info as possible. Not doing so may result in your bug not being
-        addressed in a timely manner.
-    validations:
-      required: true
-  - type: textarea
-    id: expected
-    attributes:
-      label: What did you expect to happen?
-    validations:
-      required: true
-  - type: textarea
-    id: environment
-    attributes:
-      label: Environment
-      value: |
-        Kubernetes version:
-        ```bash
-        $ kubectl version
-
-        ```
-        Katib controller version:
-        ```bash
-        $ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
-
-        ```
-        Katib Python SDK version:
-        ```bash
-        $ pip show kubeflow-katib
-
-        ```
-    validations:
-      required: true
-  - type: input
-    id: votes
-    attributes:
-      label: Impacted by this bug?
-      value: Give it a 👍 We prioritize the issues with most 👍
@@ -1,12 +1,9 @@
-blank_issues_enabled: true
+blank_issues_enabled: false

contact_links:
  - name: Katib Documentation
    url: https://www.kubeflow.org/docs/components/katib/
    about: Much help can be found in the docs
  - name: Kubeflow Katib Slack Channel
    url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels
    about: Ask the Katib community on CNCF Slack
  - name: Kubeflow Katib Community Meeting
    url: https://bit.ly/2PWVCkV
    about: Join the Kubeflow AutoML working group meeting
  - name: AutoML Slack Channel
    url: https://kubeflow.slack.com/archives/C018PMV53NW
    about: Ask the Katib community on Slack
@@ -0,0 +1,18 @@
+---
+name: Feature enhancement request
+about: Suggest an idea for this project
+---
+
+/kind feature
+
+**Describe the solution you'd like**
+[A clear and concise description of what you want to happen.]
+
+**Anything else you would like to add:**
+[Miscellaneous information that will assist in solving the issue.]
+
+---
+
+<!-- Don't delete this message to encourage users to support your issue! -->
+
+Love this feature? Give it a 👍 We prioritize the features with the most 👍
@@ -1,28 +0,0 @@
-name: Feature Request
-description: Suggest an idea for Katib
-labels: ["kind/feature", "lifecycle/needs-triage"]
-body:
-  - type: markdown
-    attributes:
-      value: |
-        Thanks for taking the time to fill out this Katib feature request!
-  - type: textarea
-    id: feature
-    attributes:
-      label: What you would like to be added?
-      description: |
-        A clear and concise description of what you want to add to Katib.
-        Please consider to write Katib enhancement proposal if it is a large feature request.
-    validations:
-      required: true
-  - type: textarea
-    id: rationale
-    attributes:
-      label: Why is this needed?
-    validations:
-      required: true
-  - type: input
-    id: votes
-    attributes:
-      label: Love this feature?
-      value: Give it a 👍 We prioritize the features with most 👍
@@ -1,6 +1,6 @@
 <!-- Thanks for sending a pull request! Here are some tips for you:
 1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing
-2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md
+2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/docs/developer-guide.md
 3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
 -->
@@ -0,0 +1,20 @@
+# Configuration for stale probot https://probot.github.io/apps/stale/
+
+# Number of days of inactivity before an issue becomes stale
+daysUntilStale: 90
+# Number of days of inactivity before a stale issue is closed
+daysUntilClose: 20
+# Issues with these labels will never be considered stale
+exemptLabels:
+  - lifecycle/frozen
+# Label to use when marking an issue as stale
+staleLabel: lifecycle/stale
+# Comment to post when marking an issue as stale. Set to `false` to disable
+markComment: >
+  This issue has been automatically marked as stale because it has not had
+  recent activity. It will be closed if no further activity occurs. Thank you
+  for your contributions.
+# Comment to post when closing a stale issue. Set to `false` to disable
+closeComment: >
+  This issue has been automatically closed because it has not had recent
+  activity. Please comment "/reopen" to reopen it.
@@ -1,81 +0,0 @@
-# Reusable workflows for publishing Katib images.
-name: Build and Publish Images
-
-on:
-  workflow_call:
-    inputs:
-      component-name:
-        required: true
-        type: string
-      platforms:
-        required: true
-        type: string
-      dockerfile:
-        required: true
-        type: string
-    secrets:
-      DOCKERHUB_USERNAME:
-        required: false
-      DOCKERHUB_TOKEN:
-        required: false
-
-jobs:
-  build-and-publish:
-    name: Build and Publish Images
-    runs-on: ubuntu-22.04
-
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Set Publish Condition
-        id: publish-condition
-        shell: bash
-        run: |
-          if [[ "${{ github.repository }}" == 'kubeflow/katib' && \
-            ( "${{ github.ref }}" == 'refs/heads/master' || \
-            "${{ github.ref }}" =~ ^refs/heads/release- || \
-            "${{ github.ref }}" =~ ^refs/tags/v ) ]]; then
-            echo "should_publish=true" >> $GITHUB_OUTPUT
-          else
-            echo "should_publish=false" >> $GITHUB_OUTPUT
-          fi
-
-      - name: GHCR Login
-        if: steps.publish-condition.outputs.should_publish == 'true'
-        uses: docker/login-action@v3
-        with:
-          registry: ghcr.io
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-
-      - name: DockerHub Login
-        if: steps.publish-condition.outputs.should_publish == 'true'
-        uses: docker/login-action@v3
-        with:
-          registry: docker.io
-          username: ${{ secrets.DOCKERHUB_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_TOKEN }}
-
-      - name: Publish Component ${{ inputs.component-name }}
-        if: steps.publish-condition.outputs.should_publish == 'true'
-        id: publish
-        uses: ./.github/workflows/template-publish-image
-        with:
-          image: |
-            ghcr.io/kubeflow/katib/${{ inputs.component-name }}
-            docker.io/kubeflowkatib/${{ inputs.component-name }}
-          dockerfile: ${{ inputs.dockerfile }}
-          platforms: ${{ inputs.platforms }}
-          push: true
-
-      - name: Test Build For Component ${{ inputs.component-name }}
-        if: steps.publish.outcome == 'skipped'
-        uses: ./.github/workflows/template-publish-image
-        with:
-          image: |
-            ghcr.io/kubeflow/katib/${{ inputs.component-name }}
-            docker.io/kubeflowkatib/${{ inputs.component-name }}
-          dockerfile: ${{ inputs.dockerfile }}
-          platforms: ${{ inputs.platforms }}
-          push: false
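The Set Publish Condition step above decides whether images are pushed by matching the repository and git ref. A minimal local sketch of the same bash test, wrapped in a made-up function name so the matching rules can be tried outside of Actions:

```shell
# Hypothetical helper mirroring the workflow's publish condition:
# publish only for kubeflow/katib on master, release-* branches, or v* tags.
should_publish() {
  local repo="$1" ref="$2"
  if [[ "${repo}" == 'kubeflow/katib' && \
        ( "${ref}" == 'refs/heads/master' || \
          "${ref}" =~ ^refs/heads/release- || \
          "${ref}" =~ ^refs/tags/v ) ]]; then
    echo "true"
  else
    echo "false"
  fi
}

should_publish kubeflow/katib refs/heads/master     # true
should_publish kubeflow/katib refs/pull/42/merge    # false
```

Pull-request refs (`refs/pull/...`) match none of the branches, so PR builds fall through to the test-build step instead of publishing.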
@@ -1,38 +0,0 @@
-name: E2E Test with darts-cnn-cifar10
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-          python-version: "3.11"
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: ${{ matrix.experiments }}
-          # Comma Delimited
-          trial-images: darts-cnn-cifar10-cpu
-
-    strategy:
-      fail-fast: false
-      matrix:
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
-        # Comma Delimited
-        experiments: ["darts-cpu"]
@@ -1,38 +0,0 @@
-name: E2E Test with enas-cnn-cifar10
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-          python-version: "3.8"
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: ${{ matrix.experiments }}
-          # Comma Delimited
-          trial-images: enas-cnn-cifar10-cpu
-
-    strategy:
-      fail-fast: false
-      matrix:
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
-        # Comma Delimited
-        experiments: ["enas-cpu"]
@@ -1,46 +0,0 @@
-name: E2E Test with pytorch-mnist
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-          python-version: "3.10"
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: ${{ matrix.experiments }}
-          training-operator: true
-          # Comma Delimited
-          trial-images: pytorch-mnist-cpu
-
-    strategy:
-      fail-fast: false
-      matrix:
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
-        # Comma Delimited
-        experiments:
-          # suggestion-hyperopt
-          - "long-running-resume,from-volume-resume,median-stop"
-          # others
-          - "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
-          - "hyperopt-distribution,optuna-distribution"
-          - "file-metrics-collector,pytorchjob-mnist"
-          - "median-stop-with-json-format,file-metrics-collector-with-json-format"
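The `# Comma Delimited` comments above mean that one matrix entry carries several experiment names packed into a single string. A sketch (not the actual test script) of how such a value can be split into individual names in bash:

```shell
# Split a comma-delimited experiments value into individual names.
# The value is taken from the matrix above; the variable names are mine.
EXPERIMENTS="grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
IFS=',' read -r -a NAMES <<< "$EXPERIMENTS"
for name in "${NAMES[@]}"; do
  echo "running experiment: $name"
done
```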
@@ -1,38 +0,0 @@
-name: E2E Test with simple-pbt
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: ${{ matrix.experiments }}
-          # Comma Delimited
-          trial-images: simple-pbt
-
-    strategy:
-      fail-fast: false
-      matrix:
-        # Detail: https://hub.docker.com/r/kindest/node
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
-        # Comma Delimited
-        experiments: ["simple-pbt"]
@@ -1,38 +0,0 @@
-name: E2E Test with tf-mnist-with-summaries
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: ${{ matrix.experiments }}
-          training-operator: true
-          # Comma Delimited
-          trial-images: tf-mnist-with-summaries
-
-    strategy:
-      fail-fast: false
-      matrix:
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
-        # Comma Delimited
-        experiments: ["tfjob-mnist-with-summaries"]
@@ -1,40 +0,0 @@
-name: E2E Test with tune API
-
-on:
-  pull_request:
-    paths-ignore:
-      - "pkg/ui/v1beta1/frontend/**"
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-
-      - name: Install Katib SDK with extra requires
-        shell: bash
-        run: |
-          pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'
-
-      - name: Run e2e test with tune API
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          tune-api: true
-          training-operator: true
-
-    strategy:
-      fail-fast: false
-      matrix:
-        # Detail: https://hub.docker.com/r/kindest/node
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
@@ -1,35 +0,0 @@
-name: E2E Test with Katib UI, random search, and postgres
-
-on:
-  - pull_request
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
-
-jobs:
-  e2e:
-    runs-on: ubuntu-22.04
-    timeout-minutes: 120
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Setup Test Env
-        uses: ./.github/workflows/template-setup-e2e-test
-        with:
-          kubernetes-version: ${{ matrix.kubernetes-version }}
-
-      - name: Run e2e test with ${{ matrix.experiments }} experiments
-        uses: ./.github/workflows/template-e2e-test
-        with:
-          experiments: random
-          # Comma Delimited
-          trial-images: pytorch-mnist-cpu
-          katib-ui: true
-          database-type: postgres
-
-    strategy:
-      fail-fast: false
-      matrix:
-        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
@@ -1,49 +0,0 @@
-name: Free-Up Disk Space
-description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker
-
-runs:
-  using: composite
-  steps:
-    # This step is a Workaround to avoid the "No space left on device" error.
-    # ref: https://github.com/actions/runner-images/issues/2840
-    - name: Remove unnecessary files
-      shell: bash
-      run: |
-        echo "Disk usage before cleanup:"
-        df -hT
-
-        sudo rm -rf /usr/share/dotnet
-        sudo rm -rf /opt/ghc
-        sudo rm -rf /usr/local/share/boost
-        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
-        sudo rm -rf /usr/local/lib/android
-        sudo rm -rf /usr/local/share/powershell
-        sudo rm -rf /usr/share/swift
-
-        echo "Disk usage after cleanup:"
-        df -hT
-
-    - name: Prune docker images
-      shell: bash
-      run: |
-        docker image prune -a -f
-        docker system df
-        df -hT
-
-    - name: Move docker data directory
-      shell: bash
-      run: |
-        echo "Stopping docker service ..."
-        sudo systemctl stop docker
-        DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
-        DOCKER_ROOT_DIR=/mnt/docker
-        echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
-        sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
-        echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
-        sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
-        echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
-        echo "Starting docker service ..."
-        sudo systemctl daemon-reload
-        sudo systemctl start docker
-        echo "Docker service status:"
-        sudo systemctl --no-pager -l -o short status docker
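The last step above relocates Docker's data directory to the larger /mnt partition and leaves a symlink behind so the default path keeps working. The same move-and-symlink pattern, demonstrated on throwaway temp directories rather than /var/lib/docker (all paths here are made up):

```shell
# Demonstrate the move-and-symlink pattern on scratch paths.
OLD_DIR="$(mktemp -d)/data"
NEW_DIR="$(mktemp -d)/moved-data"
mkdir -p "$OLD_DIR"
echo "payload" > "$OLD_DIR/file"

mv "$OLD_DIR" "$NEW_DIR"     # relocate the directory
ln -s "$NEW_DIR" "$OLD_DIR"  # old path now resolves through the symlink

cat "$OLD_DIR/file"          # payload
```

Reads through the old path keep working because the kernel follows the symlink, which is why the action can restart Docker without reconfiguring its data-root.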
@@ -2,28 +2,36 @@ name: Publish AutoML Algorithm Images

on:
  push:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"
    branches:
      - master

env:
  DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
  DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

jobs:
  algorithm:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.component-name }}
      platforms: linux/amd64,linux/arm64
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
    # Trigger workflow only for kubeflow/katib repository.
    if: github.repository == 'kubeflow/katib'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Publish Component ${{ matrix.component-name }}
        uses: ./.github/workflows/template-publish-image
        with:
          image: docker.io/kubeflowkatib/${{ matrix.component-name }}
          dockerfile: ${{ matrix.dockerfile }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - component-name: suggestion-hyperopt
            dockerfile: cmd/suggestion/hyperopt/v1beta1/Dockerfile
          - component-name: suggestion-chocolate
            dockerfile: cmd/suggestion/chocolate/v1beta1/Dockerfile
          - component-name: suggestion-hyperband
            dockerfile: cmd/suggestion/hyperband/v1beta1/Dockerfile
          - component-name: suggestion-skopt
@@ -32,8 +40,6 @@ jobs:
             dockerfile: cmd/suggestion/goptuna/v1beta1/Dockerfile
           - component-name: suggestion-optuna
             dockerfile: cmd/suggestion/optuna/v1beta1/Dockerfile
-          - component-name: suggestion-pbt
-            dockerfile: cmd/suggestion/pbt/v1beta1/Dockerfile
           - component-name: suggestion-enas
             dockerfile: cmd/suggestion/nas/enas/v1beta1/Dockerfile
           - component-name: suggestion-darts
@@ -1,24 +0,0 @@
-name: Publish Katib Conformance Test Images
-
-on:
-  - push
-  - pull_request
-
-jobs:
-  core:
-    name: Publish Image
-    uses: ./.github/workflows/build-and-publish-images.yaml
-    with:
-      component-name: ${{ matrix.component-name }}
-      platforms: linux/amd64,linux/arm64
-      dockerfile: ${{ matrix.dockerfile }}
-    secrets:
-      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
-      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
-
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - component-name: katib-conformance
-            dockerfile: Dockerfile.conformance
@@ -1,23 +1,31 @@
name: Publish Katib Core Images

on:
  - push
  - pull_request
  push:
    branches:
      - master

env:
  DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
  DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

jobs:
  core:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.component-name }}
      platforms: linux/amd64,linux/arm64
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
    # Trigger workflow only for kubeflow/katib repository.
    if: github.repository == 'kubeflow/katib'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Publish Component ${{ matrix.component-name }}
        uses: ./.github/workflows/template-publish-image
        with:
          image: docker.io/kubeflowkatib/${{ matrix.component-name }}
          dockerfile: ${{ matrix.dockerfile }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - component-name: katib-controller
@@ -25,7 +33,9 @@ jobs:
           - component-name: katib-db-manager
             dockerfile: cmd/db-manager/v1beta1/Dockerfile
           - component-name: katib-ui
-            dockerfile: cmd/ui/v1beta1/Dockerfile
+            dockerfile: cmd/new-ui/v1beta1/Dockerfile
+          - component-name: cert-generator
+            dockerfile: cmd/cert-generator/v1beta1/Dockerfile
           - component-name: file-metrics-collector
             dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
           - component-name: tfevent-metrics-collector
@@ -2,47 +2,41 @@ name: Publish Trial Images

on:
  push:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"
    branches:
      - master

env:
  DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
  DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

jobs:
  trial:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.trial-name }}
      platforms: ${{ matrix.platforms }}
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
    # Trigger workflow only for kubeflow/katib repository.
    if: github.repository == 'kubeflow/katib'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Publish Trial ${{ matrix.trial-name }}
        uses: ./.github/workflows/template-publish-image
        with:
          image: docker.io/kubeflowkatib/${{ matrix.trial-name }}
          dockerfile: ${{ matrix.dockerfile }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - trial-name: pytorch-mnist-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu
          - trial-name: pytorch-mnist-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu
          - trial-name: mxnet-mnist
            dockerfile: examples/v1beta1/trial-images/mxnet-mnist/Dockerfile
          - trial-name: pytorch-mnist
            dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile
          - trial-name: tf-mnist-with-summaries
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile
          - trial-name: enas-cnn-cifar10-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.gpu
          - trial-name: enas-cnn-cifar10-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu
          - trial-name: darts-cnn-cifar10-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu
          - trial-name: darts-cnn-cifar10-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu
          - trial-name: simple-pbt
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/simple-pbt/Dockerfile
          - trial-name: darts-cnn-cifar10
            dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile
@@ -1,42 +0,0 @@
-# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
-#
-# You can adjust the behavior by modifying this file.
-# For more information, see:
-# https://github.com/actions/stale
-name: Mark stale issues and pull requests
-
-on:
-  schedule:
-    - cron: "0 */5 * * *"
-
-jobs:
-  stale:
-    runs-on: ubuntu-22.04
-    permissions:
-      issues: write
-      pull-requests: write
-
-    steps:
-      - uses: actions/stale@v5
-        with:
-          repo-token: ${{ secrets.GITHUB_TOKEN }}
-          days-before-stale: 90
-          days-before-close: 20
-          stale-issue-message: >
-            This issue has been automatically marked as stale because it has not had
-            recent activity. It will be closed if no further activity occurs. Thank you
-            for your contributions.
-          close-issue-message: >
-            This issue has been automatically closed because it has not had recent
-            activity. Please comment "/reopen" to reopen it.
-          stale-issue-label: lifecycle/stale
-          exempt-issue-labels: lifecycle/frozen
-          stale-pr-message: >
-            This pull request has been automatically marked as stale because it has not had
-            recent activity. It will be closed if no further activity occurs. Thank you
-            for your contributions.
-          close-pr-message: >
-            This pull request has been automatically closed because it has not had recent
-            activity. Please comment "/reopen" to reopen it.
-          stale-pr-label: lifecycle/stale
-          exempt-pr-labels: lifecycle/frozen
@@ -1,49 +0,0 @@
-# Composite action for e2e tests.
-name: Run E2E Test
-description: Run e2e test using the minikube cluster
-
-inputs:
-  experiments:
-    required: false
-    description: comma delimited experiment name
-    default: ""
-  training-operator:
-    required: false
-    description: whether to deploy training-operator or not
-    default: false
-  trial-images:
-    required: false
-    description: comma delimited trial image name
-    default: ""
-  katib-ui:
-    required: true
-    description: whether to deploy katib-ui or not
-    default: false
-  database-type:
-    required: false
-    description: mysql or postgres
-    default: mysql
-  tune-api:
-    required: true
-    description: whether to execute tune-api test or not
-    default: false
-
-runs:
-  using: composite
-  steps:
-    - name: Setup Minikube Cluster
-      shell: bash
-      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.tune-api }} ${{ inputs.trial-images }} ${{ inputs.experiments }}
-
-    - name: Setup Katib
-      shell: bash
-      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }}
-
-    - name: Run E2E Experiment
-      shell: bash
-      run: |
-        if "${{ inputs.tune-api }}"; then
-          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
-        else
-          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
-        fi
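The final step above relies on a bash idiom: the `tune-api` input is interpolated directly into `if "…"; then`, so the strings `true` and `false` are executed as the shell builtins of the same name, and their exit status selects the branch. A small sketch of that branching, with a made-up wrapper function and the script names echoed instead of executed:

```shell
# run_e2e is a hypothetical wrapper around the same `if "$flag"; then` idiom.
run_e2e() {
  local tune_api="$1"
  if "$tune_api"; then            # "true"/"false" run as the builtins true/false
    echo "run-e2e-tune-api.sh"
  else
    echo "run-e2e-experiment.sh"
  fi
}

run_e2e true    # run-e2e-tune-api.sh
run_e2e false   # run-e2e-experiment.sh
```

This works only because the input is constrained to the literals `true`/`false`; any other string would be executed as a command.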
@@ -1,49 +1,28 @@
# Composite action for publishing Katib images.
name: Build And Publish Container Images
description: Build MultiPlatform Supporting Container Images
# Template run for publishing Katib images.

inputs:
  image:
    required: true
    description: image tag
    type: string
  dockerfile:
    required: true
    description: path for dockerfile
  platforms:
    required: true
    description: linux/amd64 or linux/amd64,linux/arm64
  push:
    required: true
    description: whether to push container images or not
    type: string

runs:
  using: composite
  steps:
    # This step is a Workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Remove unnecessary files
      shell: bash
      run: |
        sudo rm -rf /usr/share/dotnet
        sudo rm -rf /opt/ghc
        sudo rm -rf "/usr/local/share/boost"
        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
        sudo rm -rf /usr/local/lib/android
        sudo rm -rf /usr/local/share/powershell
        sudo rm -rf /usr/share/swift

        echo "Disk usage after cleanup:"
        df -h

    - name: Set up QEMU
      uses: docker/setup-qemu-action@v3

    - name: Set Up Docker Buildx
      uses: docker/setup-buildx-action@v3
      uses: docker/setup-buildx-action@v1

    - name: Docker Login
      uses: docker/login-action@v1
      with:
        username: ${{ env.DOCKERHUB_USERNAME }}
        password: ${{ env.DOCKERHUB_TOKEN }}

    - name: Add Docker Tags
      id: meta
      uses: docker/metadata-action@v5
      uses: docker/metadata-action@v3
      with:
        images: ${{ inputs.image }}
        tags: |

@@ -51,12 +30,11 @@ runs:
          type=sha,prefix=v1beta1-

    - name: Build and Push
      uses: docker/build-push-action@v5
      uses: docker/build-push-action@v2
      with:
        context: .
        file: ${{ inputs.dockerfile }}
        push: ${{ inputs.push }}
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        cache-from: type=gha
        cache-to: type=gha,mode=max,ignore-error=true
        platforms: ${{ inputs.platforms }}
        cache-to: type=gha,mode=max
@@ -1,48 +0,0 @@
# Composite action to setup e2e tests.
name: Setup E2E Test
description: setup env for e2e test using the minikube cluster

inputs:
  kubernetes-version:
    required: true
    description: kubernetes version
  python-version:
    required: false
    description: Python version
    # Most latest supporting version
    default: "3.10"

runs:
  using: composite
  steps:
    # This step is a workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Free-Up Disk Space
      uses: ./.github/workflows/free-up-disk-space

    - name: Setup kubectl
      uses: azure/setup-kubectl@v4
      with:
        version: ${{ inputs.kubernetes-version }}

    - name: Setup Minikube Cluster
      uses: medyagh/setup-minikube@v0.0.18
      with:
        network-plugin: cni
        cni: flannel
        driver: none
        kubernetes-version: ${{ inputs.kubernetes-version }}
        minikube-version: 1.34.0
        start-args: --wait-timeout=120s

    - name: Setup Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Setup Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.python-version }}

    - name: Install Katib SDK
      shell: bash
      run: pip install --prefer-binary -e sdk/python/v1beta1
@@ -0,0 +1,118 @@
name: Charmed Katib

on:
  - push
  - pull_request

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest

    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Install dependencies
        run: |
          set -eux
          sudo apt update
          sudo apt install python3-setuptools
          sudo pip3 install black==20.8b1 flake8

      - name: Check black
        run: black --check operators

      - name: Check flake8
        run: cd operators && flake8

  build:
    name: Test
    runs-on: ubuntu-latest

    steps:
      - name: Check out repo
        uses: actions/checkout@v2

      - uses: balchua/microk8s-actions@v0.2.2
        with:
          channel: "1.21/stable"
          addons: '["dns", "storage", "rbac"]'

      - name: Install dependencies
        run: |
          set -eux
          sudo apt update
          sudo apt install -y python3-pip
          sudo snap install juju --classic
          sudo snap install juju-bundle --classic
          sudo snap install juju-wait --classic
          sudo pip3 install charmcraft==1.3.1

      - name: Build Docker images
        run: |
          set -eux
          images=("katib-controller" "katib-ui" "katib-db-manager")
          folders=("katib-controller" "ui" "db-manager")
          for idx in {0..2}; do
            docker build . \
              -t docker.io/kubeflowkatib/${images[$idx]}:latest \
              -f cmd/${folders[$idx]}/v1beta1/Dockerfile
            docker save docker.io/kubeflowkatib/${images[$idx]} > ${images[$idx]}.tar
            microk8s ctr image import ${images[$idx]}.tar
          done

      - name: Deploy Katib
        env:
          CHARMCRAFT_DEVELOPER: "1"
        run: |
          set -eux
          cd operators/
          git clone git://git.launchpad.net/canonical-osm
          cp -r canonical-osm/charms/interfaces/juju-relation-mysql mysql
          sg microk8s -c 'juju bootstrap microk8s uk8s'
          juju add-model kubeflow
          juju bundle deploy --build --destructive-mode --serial
          juju wait -wvt 300

      - name: Test Katib
        run: kubectl apply -f examples/v1beta1/hp-tuning/random.yaml

      - name: Get pod statuses
        run: kubectl get all -A
        if: failure()

      - name: Get juju status
        run: juju status
        if: failure()

      - name: Get katib-controller workload logs
        run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-controller
        if: failure()

      - name: Get katib-controller operator logs
        run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-controller
        if: failure()

      - name: Get katib-ui workload logs
        run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-ui
        if: failure()

      - name: Get katib-ui operator logs
        run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-ui
        if: failure()

      - name: Get katib-db-manager workload logs
        run: kubectl logs --tail 100 -nkubeflow -lapp.kubernetes.io/name=katib-db-manager
        if: failure()

      - name: Get katib-db-manager operator logs
        run: kubectl logs --tail 100 -nkubeflow -loperator.juju.is/name=katib-db-manager
        if: failure()

      - name: Upload charmcraft logs
        uses: actions/upload-artifact@v2
        with:
          name: charmcraft-logs
          path: /tmp/charmcraft-log-*
        if: failure()
@@ -1,79 +1,52 @@
name: Go Test

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
  - push
  - pull_request

jobs:
  generatetests:
    name: Generate And Format Test
    runs-on: ubuntu-22.04
  test:
    name: Test
    runs-on: ubuntu-latest
    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib

    steps:
      - name: Check out code
        uses: actions/checkout@v4
        uses: actions/checkout@v2
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib

      - name: Setup Go
        uses: actions/setup-go@v5
        uses: actions/setup-go@v2
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
          cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
          go-version: 1.17.1

      - name: Check Go Modules, Generated Go/Python codes, and Format
        run: make check

  unittests:
    name: Unit Test
    runs-on: ubuntu-22.04
    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
    steps:
      - name: Check out code
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
          cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum
      # Verify that go.mod and go.sum is synchronized
      - name: Check Go modules
        run: |
          if [[ ! -z $(go mod tidy && git diff --exit-code) ]]; then
            echo "Please run "go mod tidy" to sync Go modules"
            exit 1
          fi

      - name: Run Go test
        run: go mod download && make test ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}
        run: |
          go mod download

          curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v2.3.0/kubebuilder_2.3.0_linux_amd64.tar.gz"
          tar -zxvf kubebuilder_2.3.0_linux_amd64.tar.gz
          sudo mv kubebuilder_2.3.0_linux_amd64 /usr/local/kubebuilder
          export PATH=$PATH:/usr/local/kubebuilder/bin

          make check
          make test

      - name: Coveralls report
        uses: shogo82148/actions-goveralls@v1
        with:
          path-to-profile: coverage.out
          working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
          parallel: true

    strategy:
      fail-fast: false
      matrix:
        # Detail: `setup-envtest list`
        kubernetes-version: ["1.29.3", "1.30.0", "1.31.0"]

  # notifies that all test jobs are finished.
  finish:
    needs: unittests
    runs-on: ubuntu-22.04
    steps:
      - uses: shogo82148/actions-goveralls@v1
        with:
          parallel-finished: true
@@ -1,30 +0,0 @@
name: Lint Files

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.9

      - name: Check shell scripts
        run: make shellcheck

      - name: Run pre-commit
        uses: pre-commit/action@v3.0.1
@@ -1,101 +1,24 @@
name: Frontend Test

on:
  pull_request:
    paths:
      - pkg/ui/v1beta1/frontend/**

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
  - push
  - pull_request

jobs:
  test:
    name: Code format and lint
    runs-on: ubuntu-22.04
    name: Test
    runs-on: ubuntu-latest

    steps:
      - name: Check out code
        uses: actions/checkout@v4
        uses: actions/checkout@v2

      - name: Setup Node
        uses: actions/setup-node@v4
        uses: actions/setup-node@v2
        with:
          node-version: 16.20.2
          node-version: 12.18.1

      - name: Format katib code
      - name: Run Node test
        run: |
          npm install prettier --prefix ./pkg/ui/v1beta1/frontend
          npm install prettier --prefix ./pkg/new-ui/v1beta1/frontend
          make prettier-check

      - name: Lint katib code
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run lint-check

  frontend-unit-tests:
    name: Frontend Unit Tests
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 16.20.2

      - name: Fetch Kubeflow and install common code dependencies
        run: |
          COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
          cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
          cd kubeflow
          git checkout $COMMIT
          cd components/crud-web-apps/common/frontend/kubeflow-common-lib
          npm i
          npm run build
          npm link ./dist/kubeflow

      - name: Install KWA dependencies
        run: |
          cd pkg/ui/v1beta1/frontend
          npm i
          npm link kubeflow

      - name: Run unit tests
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run test:prod

  frontend-ui-tests:
    name: UI tests with Cypress
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup node version to 16
        uses: actions/setup-node@v4
        with:
          node-version: 16

      - name: Fetch Kubeflow and install common code dependencies
        run: |
          COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
          cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
          cd kubeflow
          git checkout $COMMIT
          cd components/crud-web-apps/common/frontend/kubeflow-common-lib
          npm i
          npm run build
          npm link ./dist/kubeflow
      - name: Install KWA dependencies
        run: |
          cd pkg/ui/v1beta1/frontend
          npm i
          npm link kubeflow
      - name: Serve UI & run Cypress tests in Chrome and Firefox
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run start & npx wait-on http://localhost:4200
          npm run ui-test-ci-all
@@ -1,47 +1,22 @@
name: Python Test

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
  - push
  - pull_request

jobs:
  test:
    name: Test
    runs-on: ubuntu-22.04
    runs-on: ubuntu-latest

    steps:
      - name: Check out code
        uses: actions/checkout@v4
        uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11

      - name: Run Python test
        run: make pytest

  # The skopt service doesn't work appropriately with Python 3.11.
  # So, we need to run the test with Python 3.9.
  # TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
  # REF: https://github.com/kubeflow/katib/issues/2280
  test-skopt:
    name: Test Skopt
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Run Python test
        run: make pytest-skopt
        run: make pytest
@@ -22,7 +22,6 @@ bin
*.dll
*.so
*.dylib
pkg/metricscollector/v1beta1/file-metricscollector/testdata

## Test binary, build with `go test -c`
*.test

@@ -78,6 +77,3 @@ $RECYCLE.BIN/

## Vendor dir
vendor

# Jupyter Notebooks.
**/.ipynb_checkpoints
@@ -1,38 +0,0 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: check-json
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        name: isort
        entry: isort --profile black
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black
        files: (sdk|examples|pkg)/.*
  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:
      - id: flake8
        files: (sdk|examples|pkg)/.*
        exclude: |
          (?x)^(
            .*zz_generated.deepcopy.*|
            .*pb.go|
            pkg/apis/manager/.*pb2(?:_grpc)?.py(?:i)?|
            pkg/apis/v1beta1/openapi_generated.go|
            pkg/mock/.*|
            pkg/client/controller/.*|
            sdk/python/v1beta1/kubeflow/katib/configuration.py|
            sdk/python/v1beta1/kubeflow/katib/rest.py|
            sdk/python/v1beta1/kubeflow/katib/__init__.py|
            sdk/python/v1beta1/kubeflow/katib/exceptions.py|
            sdk/python/v1beta1/kubeflow/katib/api_client.py|
            sdk/python/v1beta1/kubeflow/katib/models/.*
          )$
ADOPTERS.md (25 changed lines)

@@ -4,17 +4,14 @@ Below are the adopters of project Katib. If you are using Katib
please add yourself into the following list by a pull request.
Please keep the list in alphabetical order.

| Organization | Contact | Description of Use |
| ------------ | ------- | ------------------ |
| [Akuity](https://akuity.io/) | [@terrytangyuan](https://github.com/terrytangyuan) | |
| [Ant Group](https://www.antgroup.com/) | [@ohmystack](https://github.com/ohmystack) | Automatic training in Ant Group internal AI Platform |
| [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
| [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
| [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
| [CERN](https://home.cern/) | [@d-gol](https://github.com/d-gol) | Hyperparameter tuning within the ML platform on private cloud |
| [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
| [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |

| Organization | Contact | Description of Use |
| ------------ | ------- | ------------------ |
| [Akuity](https://akuity.io/) | [@terrytangyuan](https://github.com/terrytangyuan) | |
| [Ant Group](https://www.antgroup.com/) | [@ohmystack](https://github.com/ohmystack) | Automatic training in Ant Group internal AI Platform |
| [babylon health](https://www.babylonhealth.com/) | [@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
| [caicloud](https://caicloud.io/) | [@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
| [canonical](https://ubuntu.com/) | [@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
| [cisco](https://cisco.com/) | [@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
| [cubonacci](https://www.cubonacci.com) | [@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
| [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
| [karrot](https://uk.karrotmarket.com/) | [@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
CHANGELOG.md (998 changed lines)

File diff suppressed because it is too large.
CITATION.cff (43 changed lines)

@@ -1,43 +0,0 @@
cff-version: 1.2.0
message: "If you use Katib in your scientific publication, please cite it as below."
authors:
  - family-names: "George"
    given-names: "Johnu"
  - family-names: "Gao"
    given-names: "Ce"
  - family-names: "Liu"
    given-names: "Richard"
  - family-names: "Liu"
    given-names: "Hou Gang"
  - family-names: "Tang"
    given-names: "Yuan"
  - family-names: "Pydipaty"
    given-names: "Ramdoot"
  - family-names: "Saha"
    given-names: "Amit Kumar"
title: "Katib"
type: software
repository-code: "https://github.com/kubeflow/katib"
preferred-citation:
  type: misc
  title: "A Scalable and Cloud-Native Hyperparameter Tuning System"
  authors:
    - family-names: "George"
      given-names: "Johnu"
    - family-names: "Gao"
      given-names: "Ce"
    - family-names: "Liu"
      given-names: "Richard"
    - family-names: "Liu"
      given-names: "Hou Gang"
    - family-names: "Tang"
      given-names: "Yuan"
    - family-names: "Pydipaty"
      given-names: "Ramdoot"
    - family-names: "Saha"
      given-names: "Amit Kumar"
  year: 2020
  url: "https://arxiv.org/abs/2006.02085"
  identifiers:
    - type: "other"
      value: "arXiv:2006.02085"
CONTRIBUTING.md (167 changed lines)

@@ -1,167 +0,0 @@
# Developer Guide

This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following guides:

- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- [How to configure Katib Experiment](https://kubeflow.org/docs/components/katib/experiment/).
- [Katib architecture and concepts](https://www.kubeflow.org/docs/components/katib/reference/architecture/)
  for hyperparameter tuning and neural architecture search.

## Requirements

- [Go](https://golang.org/) (1.22 or later)
- [Docker](https://docs.docker.com/) (24.0 or later)
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.11 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)
- [pre-commit](https://pre-commit.com/)

## Build from source code

**Note** that your Docker Desktop should
[enable containerd image store](https://docs.docker.com/desktop/containerd/#enable-the-containerd-image-store)
to build multi-arch images. Build the images from source as follows:

```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```

If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow", go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."

To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/katib-config.yaml).

You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:

```bash
make deploy
```

You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:

```bash
make undeploy
```

## Technical and style guide

The following guidelines apply primarily to Katib,
but other projects like [Training Operator](https://github.com/kubeflow/training-operator) might also adhere to them.

## Go Development

When coding:

- Follow [effective go](https://go.dev/doc/effective_go) guidelines.
- Run locally [`make check`](https://github.com/kubeflow/katib/blob/46173463027e4fd2e604e25d7075b2b31a702049/Makefile#L31)
  to verify that changes follow best practices before submitting PRs.

Testing:

- Use [`cmp.Diff`](https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff) instead of `reflect.Equal`, to provide useful comparisons.
- Define test cases as maps instead of slices to avoid dependencies on the running order.
  The map key should be equal to the test case name.

## Modify controller APIs

If you want to modify the Katib controller APIs, you have to
generate deepcopy, clientset, listers, informers, open-api and the Python SDK with the changed APIs.
You can update the necessary files as follows:

```bash
make generate
```

## Controller Flags

Below is a list of command-line flags accepted by the Katib controller:

| Name         | Type   | Default | Description                                                                                                                      |
| ------------ | ------ | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
| katib-config | string | ""      | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |

## DB Manager Flags

Below is a list of command-line flags accepted by the Katib DB Manager:

| Name            | Type          | Default      | Description                                                         |
| --------------- | ------------- | ------------ | ------------------------------------------------------------------- |
| connect-timeout | time.Duration | 60s          | Timeout before calling error during database connection             |
| listen-address  | string        | 0.0.0.0:6789 | The network interface or IP address to receive incoming connections |

## Katib admission webhooks

Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).

1. `validator.experiment.katib.kubeflow.org` -
   [Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
   to validate the Katib Experiment before creation.

1. `defaulter.experiment.katib.kubeflow.org` -
   [Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
   to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
   in the Katib Experiment before creation.

1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
   collector sidecar container into the training pod. Learn more about Katib's
   metrics collector in the
   [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).

You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).

**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying a firewall rule, and you have to update the control
plane CIDR source range to use the Katib webhooks.

### Katib cert generator

The Katib Controller has an internal `cert-generator` to generate certificates for the webhooks.

Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:

- Generate the self-signed certificate and private key.
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
- Patch the webhooks with the `CABundle`.

Once the `cert-generator` has finished, the Katib controller starts to register controllers such as the `experiment-controller` with the manager.

You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
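The first and last of the steps above can be illustrated with a plain `openssl` sketch; the commands, file paths, and the CN below are illustrative assumptions of this guide, not what the Go `cert-generator` literally runs:

```bash
# Illustrative only: create a self-signed certificate and private key,
# as the cert-generator's first step does (the real generator does this in Go).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/katib-webhook.key \
  -out /tmp/katib-webhook.crt \
  -days 365 \
  -subj "/CN=katib-controller.kubeflow.svc"

# The CABundle patched into the webhook configurations is the
# base64-encoded certificate.
base64 -w0 < /tmp/katib-webhook.crt > /tmp/katib-cabundle.txt
```

In the real deployment the key and certificate land in a Kubernetes Secret rather than on disk, and the `CABundle` is patched into the webhook configurations by the controller.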
NOTE: Katib also supports [cert-manager](https://cert-manager.io/) to generate certs for the admission webhooks instead of using the cert-generator.
You can find the installation with cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).

## Implement a new algorithm and use it in Katib

Please see [new-algorithm-service.md](./new-algorithm-service.md).

## Katib UI documentation

Please see [Katib UI README](../pkg/ui/v1beta1).

## Design proposals

Please see [proposals](./proposals).

## Code Style

### pre-commit

Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
pre-commit`) and run `pre-commit install` from the root of the repository at
least once before creating git commits.

The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
consistency. They are executed in CI. PRs that fail to comply with the hooks
will not be able to pass the corresponding CI gate. The hooks are only executed
against staged files unless you run `pre-commit run --all`, in which case
they'll be executed against every file in the repository.

Specific programmatically generated files listed in the `exclude` field in
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
from the hooks.
@@ -1,32 +0,0 @@
# Copyright 2023 The Kubeflow Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dockerfile for building the source code of conformance tests
FROM python:3.10-slim

WORKDIR /kubeflow/katib

COPY sdk/ /kubeflow/katib/sdk/
COPY examples/ /kubeflow/katib/examples/
COPY test/ /kubeflow/katib/test/
COPY pkg/ /kubeflow/katib/pkg/

COPY conformance/run.sh .

# Add test script.
RUN chmod +x run.sh

RUN pip install --prefer-binary -e sdk/python/v1beta1

ENTRYPOINT [ "./run.sh" ]

@@ -1,108 +1,57 @@
HAS_LINT := $(shell command -v golangci-lint;)
HAS_YAMLLINT := $(shell command -v yamllint;)
HAS_SHELLCHECK := $(shell command -v shellcheck;)
HAS_SETUP_ENVTEST := $(shell command -v setup-envtest;)
HAS_MOCKGEN := $(shell command -v mockgen;)

COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD)
KATIB_REGISTRY := ghcr.io/kubeflow/katib
CPU_ARCH ?= linux/amd64,linux/arm64
ENVTEST_K8S_VERSION ?= 1.31
MOCKGEN_VERSION ?= $(shell grep 'go.uber.org/mock' go.mod | cut -d ' ' -f 2)
GO_VERSION=$(shell grep '^go' go.mod | cut -d ' ' -f 2)
GOPATH ?= $(shell go env GOPATH)
KATIB_REGISTRY := docker.io/kubeflowkatib
CPU_ARCH ?= amd64

# for pytest
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/apis/manager/v1beta1/python:$(CURDIR)/pkg/apis/manager/health/python
PYTHONPATH := $(PYTHONPATH):$(CURDIR)/pkg/metricscollector/v1beta1/common:$(CURDIR)/pkg/metricscollector/v1beta1/tfevent-metricscollector
TEST_TENSORFLOW_EVENT_FILE_PATH ?= $(CURDIR)/test/unit/v1beta1/metricscollector/testdata/tfevent-metricscollector/logs

# Run tests
.PHONY: test
test: envtest
	KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out
test:
	go test ./pkg/... ./cmd/... -coverprofile coverage.out

envtest:
ifndef HAS_SETUP_ENVTEST
	go install sigs.k8s.io/controller-runtime/tools/setup-envtest@release-0.19
	$(info "setup-envtest has been installed")
endif
	$(info "setup-envtest has already installed")

check: generated-codes go-mod fmt vet lint
check: generate fmt vet lint

fmt:
	hack/verify-gofmt.sh

lint:
ifndef HAS_LINT
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.64.7
	$(info "golangci-lint has been installed")
	go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.42.1
	echo "golangci-lint has been installed"
endif
	hack/verify-golangci-lint.sh

yamllint:
ifndef HAS_YAMLLINT
	pip install --prefer-binary yamllint
	$(info "yamllint has been installed")
endif
	hack/verify-yamllint.sh

vet:
	go vet ./pkg/... ./cmd/...

shellcheck:
ifndef HAS_SHELLCHECK
	bash hack/install-shellcheck.sh
	$(info "shellcheck has been installed")
endif
	hack/verify-shellcheck.sh

update:
	hack/update-gofmt.sh

# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
deploy:
	bash scripts/v1beta1/deploy.sh $(WITH_DATABASE_TYPE)
	bash scripts/v1beta1/deploy.sh

# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
undeploy:
	bash scripts/v1beta1/undeploy.sh

generated-codes: generate
ifneq ($(shell bash hack/verify-generated-codes.sh '.'; echo $$?),0)
	$(error 'Please run "make generate" to generate codes')
endif

go-mod: sync-go-mod
ifneq ($(shell bash hack/verify-generated-codes.sh 'go.*'; echo $$?),0)
	$(error 'Please run "go mod tidy -go $(GO_VERSION)" to sync Go modules')
endif

sync-go-mod:
	go mod tidy -go $(GO_VERSION)

.PHONY: go-mod-download
go-mod-download:
	go mod download

CONTROLLER_GEN = $(shell pwd)/bin/controller-gen
.PHONY: controller-gen
controller-gen:
	@GOBIN=$(shell pwd)/bin GO111MODULE=on go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.16.5

# Run this if you update any existing controller APIs.
# 1. Generate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh)
# 1. Genereate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh)
# 2. Generate open-api for the APIs (hack/update-openapigen)
# 3. Generate Python SDK for Katib (hack/gen-python-sdk/gen-sdk.sh)
# 4. Generate gRPC manager APIs (pkg/apis/manager/v1beta1/build.sh and pkg/apis/manager/health/build.sh)
|
||||
# 5. Generate Go mock codes
|
||||
generate: go-mod-download controller-gen
|
||||
ifndef HAS_MOCKGEN
|
||||
go install go.uber.org/mock/mockgen@$(MOCKGEN_VERSION)
|
||||
$(info "mockgen has been installed")
|
||||
generate:
|
||||
ifndef GOPATH
|
||||
$(error GOPATH not defined, please define GOPATH. Run "go help gopath" to learn more about GOPATH)
|
||||
endif
|
||||
go generate ./pkg/... ./cmd/...
|
||||
hack/gen-python-sdk/gen-sdk.sh
|
||||
hack/update-proto.sh
|
||||
hack/update-mockgen.sh
|
||||
cd ./pkg/apis/manager/v1beta1 && ./build.sh
|
||||
cd ./pkg/apis/manager/health && ./build.sh
|
||||
|
||||
# Build images for the Katib v1beta1 components.
|
||||
build: generate
|
||||
|
@ -119,12 +68,14 @@ push-latest: generate
|
|||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||
|
||||
# Build and push Katib images for the given tag.
|
||||
push-tag:
|
||||
push-tag: generate
|
||||
ifeq ($(TAG),)
|
||||
$(error TAG must be set. Usage: make push-tag TAG=<release-tag>)
|
||||
endif
|
||||
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG) $(CPU_ARCH)
|
||||
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT) $(CPU_ARCH)
|
||||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG)
|
||||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||
|
||||
# Release a new version of Katib.
|
||||
release:
|
||||
|
@ -144,50 +95,30 @@ endif
|
|||
|
||||
# Prettier UI format check for Katib v1beta1.
|
||||
prettier-check:
|
||||
npm run format:check --prefix pkg/ui/v1beta1/frontend
|
||||
npm run format:check --prefix pkg/new-ui/v1beta1/frontend
|
||||
|
||||
# Update boilerplate for the source code.
|
||||
update-boilerplate:
|
||||
./hack/boilerplate/update-boilerplate.sh
|
||||
|
||||
prepare-pytest:
|
||||
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/hyperopt/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/optuna/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/hyperband/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/nas/enas/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/nas/darts/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/suggestion/pbt/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/earlystopping/medianstop/v1beta1/requirements.txt
|
||||
pip install --prefer-binary -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt
|
||||
# `TypeIs` was introduced in typing-extensions 4.10.0, and torch 2.6.0 requires typing-extensions>=4.10.0.
|
||||
# REF: https://github.com/kubeflow/katib/pull/2504
|
||||
# TODO (tenzen-y): Once we upgrade libraries depended on typing-extensions==4.5.0, we can remove this line.
|
||||
pip install typing-extensions==4.10.0
|
||||
pip install -r test/unit/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/chocolate/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/hyperopt/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/skopt/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/optuna/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/hyperband/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/nas/enas/v1beta1/requirements.txt
|
||||
pip install -r cmd/suggestion/nas/darts/v1beta1/requirements.txt
|
||||
pip install -r cmd/earlystopping/medianstop/v1beta1/requirements.txt
|
||||
pip install -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt
|
||||
|
||||
prepare-pytest-testdata:
|
||||
ifeq ("$(wildcard $(TEST_TENSORFLOW_EVENT_FILE_PATH))", "")
|
||||
python examples/v1beta1/trial-images/tf-mnist-with-summaries/mnist.py --epochs 5 --batch-size 200 --log-path $(TEST_TENSORFLOW_EVENT_FILE_PATH)
|
||||
endif
|
||||
|
||||
# TODO(Electronic-Waste): Remove the import rewrite when protobuf supports `python_package` option.
|
||||
# REF: https://github.com/protocolbuffers/protobuf/issues/7061
|
||||
pytest: prepare-pytest prepare-pytest-testdata
|
||||
pytest ./test/unit/v1beta1/suggestion --ignore=./test/unit/v1beta1/suggestion/test_skopt_service.py
|
||||
pytest ./test/unit/v1beta1/earlystopping
|
||||
pytest ./test/unit/v1beta1/metricscollector
|
||||
cp ./pkg/apis/manager/v1beta1/python/api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py
|
||||
cp ./pkg/apis/manager/v1beta1/python/api_pb2_grpc.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
|
||||
sed -i "s/api_pb2/kubeflow\.katib\.katib_api_pb2/g" ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
|
||||
pytest ./sdk/python/v1beta1/kubeflow/katib
|
||||
rm ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
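The `cp`/`sed` steps in the `pytest` target above rewrite the imports of the generated gRPC stubs so they resolve inside the SDK package. The substitution can be sketched in Python (the sample stub line is only illustrative of typical grpc-generated code):

```python
import re


def rewrite_stub_imports(source: str) -> str:
    """Apply the same substitution as the Makefile's sed call:
    s/api_pb2/kubeflow.katib.katib_api_pb2/g."""
    return re.sub(r"api_pb2", "kubeflow.katib.katib_api_pb2", source)


stub_line = "import api_pb2 as api__pb2"
print(rewrite_stub_imports(stub_line))
# import kubeflow.katib.katib_api_pb2 as api__pb2
```

Note that the pattern also matches inside longer names such as `api_pb2_grpc`, which is exactly what the Makefile relies on.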

# The skopt service doesn't work appropriately with Python 3.11.
# So, we need to run the test with Python 3.9.
# TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
# REF: https://github.com/kubeflow/katib/issues/2280
pytest-skopt:
	pip install six
	pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
	pip install --prefer-binary -r cmd/suggestion/skopt/v1beta1/requirements.txt
	pytest ./test/unit/v1beta1/suggestion/test_skopt_service.py
	PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/suggestion
	PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/earlystopping
	PYTHONPATH=$(PYTHONPATH) pytest ./test/unit/v1beta1/metricscollector

OWNERS (6 changed lines)
@@ -1,10 +1,8 @@
approvers:
- andreyvelich
- gaocegege
- hougangliu
- johnugeorge
reviewers:
- anencore94
- c-bata
- Electronic-Waste
emeritus_approvers:
- tenzen-y
- sperlingxx

PROJECT (2 changed lines)
@@ -1,3 +1,3 @@
version: "3"
version: "1"
domain: kubeflow.org
repo: github.com/kubeflow/katib

README.md (141 changed lines)
@@ -1,18 +1,15 @@
# Kubeflow Katib

[](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[](https://coveralls.io/github/kubeflow/katib?branch=master)
[](https://goreportcard.com/report/github.com/kubeflow/katib)
[](https://github.com/kubeflow/katib/releases)
[](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[](https://www.bestpractices.dev/projects/9941)

<h1 align="center">
  <img src="./docs/images/logo-title.png" alt="logo" width="200">
  <br>
</h1>

Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML).
[](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[](https://coveralls.io/github/kubeflow/katib?branch=master)
[](https://goreportcard.com/report/github.com/kubeflow/katib)
[](https://github.com/kubeflow/katib/releases)
[](https://kubeflow.slack.com/archives/C018PMV53NW)

Katib is a Kubernetes-native project for automated machine learning (AutoML).
Katib supports
[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization),
[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and

@@ -21,7 +18,8 @@ Katib supports
Katib is the project which is agnostic to machine learning (ML) frameworks.
It can tune hyperparameters of applications written in any language of the
users’ choice and natively supports many ML frameworks, such as
[TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
[TensorFlow](https://www.tensorflow.org/), [Apache MXNet](https://mxnet.apache.org/),
[PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.

Katib can perform training jobs using any Kubernetes
[Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/)

@@ -31,13 +29,13 @@ and many more.

Katib stands for `secretary` in Arabic.

## Search Algorithms
# Search Algorithms

Katib supports several search algorithms. Follow the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms)
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
to know more about each algorithm and check the
[this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib)
to implement your custom algorithm.
[Suggestion service guide](/docs/new-algorithm-service.md) to implement your
custom algorithm.

<table>
  <tbody>

@@ -127,80 +125,105 @@ to implement your custom algorithm.
      <td>
      </td>
    </tr>
    <tr align="center">
      <td>
        <a href="https://www.kubeflow.org/docs/components/katib/experiment/#pbt">Population Based Training</a>
      </td>
      <td>
      </td>
      <td>
      </td>
    </tr>
  </tbody>
</table>

To perform the above algorithms Katib supports the following frameworks:
To perform above algorithms Katib supports the following frameworks:

- [Chocolate](https://github.com/AIworx-Labs/chocolate)
- [Goptuna](https://github.com/c-bata/goptuna)
- [Hyperopt](https://github.com/hyperopt/hyperopt)
- [Optuna](https://github.com/optuna/optuna)
- [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize)

# Installation

For the various Katib installs check the
[Kubeflow guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-setup).
Follow the next steps to install Katib standalone.

## Prerequisites

Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites)
for prerequisites to install Katib.
This is the minimal requirements to install Katib:

## Installation
- Kubernetes >= 1.17
- `kubectl` >= 1.21

Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib)
for the detailed instructions on how to install Katib.
## Latest Version

### Installing the Control Plane

Run the following command to install the latest stable release of Katib control plane:

```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"
```

Run the following command to install the latest changes of Katib control plane:
For the latest Katib version run this command:

```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
```

For the Katib Experiments check the [complete examples list](./examples/v1beta1).
## Release Version

### Installing the Python SDK
For the specific Katib release (for example `v0.12.0`) run this command:

Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify creation of
hyperparameter tuning jobs for Data Scientists.

Run the following command to install the latest stable release of Katib SDK:

```sh
pip install -U kubeflow-katib
```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.12.0"
```
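Whichever install path is used, tuning jobs are described by the same `Experiment` resource that both the SDK and `kubectl` submit. As a rough sketch (field names follow the v1beta1 API; the experiment name, metric, and parameter range are made-up values), a minimal spec built as a plain dict looks like:

```python
# Minimal Katib Experiment manifest as a plain dict -- the same shape
# you would write in YAML and `kubectl apply`. Values are illustrative.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-example", "namespace": "kubeflow"},
    "spec": {
        "objective": {
            "type": "maximize",
            "objectiveMetricName": "Validation-accuracy",
        },
        "algorithm": {"algorithmName": "random"},
        "parameters": [
            {
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.01", "max": "0.03"},
            }
        ],
    },
}
```

Dumped to YAML, this is the same shape of manifest provided ready-made in the repository's examples directory.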
|
||||
|
||||
## Getting Started
|
||||
Make sure that all Katib components are running:
|
||||
|
||||
Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk)
|
||||
to quickly create your first hyperparameter tuning Experiment using the Python SDK.
|
||||
```
|
||||
$ kubectl get pods -n kubeflow
|
||||
|
||||
## Community
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
katib-cert-generator-rw95w 0/1 Completed 0 35s
|
||||
katib-controller-566595bdd8-hbxgf 1/1 Running 0 36s
|
||||
katib-db-manager-57cd769cdb-4g99m 1/1 Running 0 36s
|
||||
katib-mysql-7894994f88-5d4s5 1/1 Running 0 36s
|
||||
katib-ui-5767cfccdc-pwg2x 1/1 Running 0 36s
|
||||
```
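A small illustrative helper (not part of the repo) can check such `kubectl` output programmatically; the sample listing is the one shown above:

```python
# Sample `kubectl get pods -n kubeflow` output from the README.
SAMPLE = """\
NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-rw95w          0/1     Completed   0          35s
katib-controller-566595bdd8-hbxgf   1/1     Running     0          36s
katib-db-manager-57cd769cdb-4g99m   1/1     Running     0          36s
katib-mysql-7894994f88-5d4s5        1/1     Running     0          36s
katib-ui-5767cfccdc-pwg2x           1/1     Running     0          36s
"""


def unhealthy(kubectl_output: str) -> list:
    """Return names of pods whose STATUS column is neither Running nor Completed."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if fields[2] not in ("Running", "Completed"):
            bad.append(fields[0])
    return bad


print(unhealthy(SAMPLE))  # []
```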

The following links provide information on how to get involved in the community:
For the Katib Experiments check the [complete examples list](./examples/v1beta1).

- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV)
  community meeting.
- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
  Slack channel.
- Check out [who is using Katib](ADOPTERS.md) and [presentations about Katib project](docs/presentations.md).
# Documentation

- Run your first Katib Experiment in the
  [getting started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm).

- Learn about Katib **Concepts** in this
  [guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-concepts).

- Learn about Katib **Interfaces** in this
  [guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-interfaces).

- Learn about Katib **Components** in this
  [guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).

- Know more about Katib in the [presentations and demos list](./docs/presentations.md).

# Community

We are always growing our community and invite new users and AutoML enthusiasts
to contribute to the Katib project. The following links provide information
about getting involved in the community:

- Subscribe to the
  [AutoML calendar](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)
  to attend Working Group bi-weekly community meetings.

- Check the
  [AutoML and Training Working Group meeting notes](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit).

- If you use Katib, please update [the adopters list](ADOPTERS.md).

## Contributing

Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
Please feel free to test the system! [Developer guide](./docs/developer-guide.md)
is a good starting point for our developers.

## Blog posts

- [Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML](https://blog.kubeflow.org/katib/)
  (by Andrey Velichkevich)

## Events

- [AutoML and Training WG Summit. 16th of July 2021](https://docs.google.com/document/d/1vGluSPHmAqEr8k9Dmm82RcQ-MVnqbYYSfnjMGB-aPuo/edit?usp=sharing)

## Citation

ROADMAP.md (44 changed lines)
@@ -1,45 +1,3 @@
# Katib 2022/2023 Roadmap

## AutoML Features

- Support advance HyperParameter tuning algorithms:

  - Population Based Training (PBT) - [#1382](https://github.com/kubeflow/katib/issues/1382)
  - Tree of Parzen Estimators (TPE)
  - Multivariate TPE
  - Sobol’s Quasirandom Sequence
  - Asynchronous Successive Halving - [ASHA](https://arxiv.org/pdf/1810.05934.pdf)

- Support multi-objective optimization - [#1549](https://github.com/kubeflow/katib/issues/1549)
- Support various HP distributions (log-uniform, uniform, normal) - [#1207](https://github.com/kubeflow/katib/issues/1207)
- Support Auto Model Compression - [#460](https://github.com/kubeflow/katib/issues/460)
- Support Auto Feature Engineering - [#475](https://github.com/kubeflow/katib/issues/475)
- Improve Neural Architecture Search design

## Backend and API Enhancements

- Conformance tests for Katib - [#2044](https://github.com/kubeflow/katib/issues/2044)
- Support push-based metrics collection in Katib - [#577](https://github.com/kubeflow/katib/issues/577)
- Support PostgreSQL as a Katib DB - [#915](https://github.com/kubeflow/katib/issues/915)
- Improve Katib scalability - [#1847](https://github.com/kubeflow/katib/issues/1847)
- Promote Katib APIs to the `v1` version
- Support multiple CRD versions (`v1beta1`, `v1`) with conversion webhook

## Improve Katib User Experience

- Simplify Katib Experiment creation with Katib SDK - [#1951](https://github.com/kubeflow/katib/pull/1951)
- Fully migrate to a new Katib UI - [Project 1](https://github.com/kubeflow/katib/projects/1)
- Expose Trial logs in Katib UI - [#971](https://github.com/kubeflow/katib/issues/971)
- Enhance Katib UI visualization metrics for AutoML Experiments
- Improve Katib Config UX - [#2150](https://github.com/kubeflow/katib/issues/2150)

## Integration with Kubeflow Components

- Kubeflow Pipeline as a Katib Trial target - [#1914](https://github.com/kubeflow/katib/issues/1914)
- Improve data passing when Katib Experiment is part of Kubeflow Pipeline - [#1846](https://github.com/kubeflow/katib/issues/1846)

# History

# Katib 2021 Roadmap

## New Features

@@ -66,6 +24,8 @@
- Support multiple CRD version with conversion webhook
- MLMD integration with Katib Experiments

# History

# Katib 2020 Roadmap

## New Features

SECURITY.md (64 changed lines)
@@ -1,64 +0,0 @@
# Security Policy

## Supported Versions

Kubeflow Katib versions are expressed as `vX.Y.Z`, where X is the major version,
Y is the minor version, and Z is the patch version, following the
[Semantic Versioning](https://semver.org/) terminology.

The Kubeflow Katib project maintains release branches for the most recent two minor releases.
Applicable fixes, including security fixes, may be backported to those two release branches,
depending on severity and feasibility.

Users are encouraged to stay updated with the latest releases to benefit from security patches and
improvements.

## Reporting a Vulnerability

We're extremely grateful for security researchers and users that report vulnerabilities to the
Kubeflow Open Source Community. All reports are thoroughly investigated by Kubeflow projects owners.

You can use the following ways to report security vulnerabilities privately:

- Using the Kubeflow Katib repository [GitHub Security Advisory](https://github.com/kubeflow/katib/security/advisories/new).
- Using our private Kubeflow Steering Committee mailing list: ksc@kubeflow.org.

Please provide detailed information to help us understand and address the issue promptly.

## Disclosure Process

**Acknowledgment**: We will acknowledge receipt of your report within 10 business days.

**Assessment**: The Kubeflow projects owners will investigate the reported issue to determine its
validity and severity.

**Resolution**: If the issue is confirmed, we will work on a fix and prepare a release.

**Notification**: Once a fix is available, we will notify the reporter and coordinate a public
disclosure.

**Public Disclosure**: Details of the vulnerability and the fix will be published in the project's
release notes and communicated through appropriate channels.

## Prevention Mechanisms

Kubeflow Katib employs several measures to prevent security issues:

**Code Reviews**: All code changes are reviewed by maintainers to ensure code quality and security.

**Dependency Management**: Regular updates and monitoring of dependencies (e.g. Dependabot) to
address known vulnerabilities.

**Continuous Integration**: Automated testing and security checks are integrated into the CI/CD pipeline.

**Image Scanning**: Container images are scanned for vulnerabilities.

## Communication Channels

For the general questions please join the following resources:

- Kubeflow [Slack channels](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels).
- Kubeflow discuss [mailing list](https://www.kubeflow.org/docs/about/community/#kubeflow-mailing-list).

Please **do not report** security vulnerabilities through public channels.
@@ -0,0 +1,29 @@
# Build the Katib Cert Generator.
FROM golang:alpine AS build-env

WORKDIR /go/src/github.com/kubeflow/katib

# Download packages.
COPY go.mod .
COPY go.sum .
RUN go mod download -x

# Copy sources.
COPY cmd/ cmd/
COPY pkg/ pkg/

# Build the binary.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
        CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
    elif [ "$(uname -m)" = "aarch64" ]; then \
        CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
    else \
        CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-cert-generator ./cmd/cert-generator/v1beta1; \
    fi
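The `uname -m` branches above all encode the same fixed mapping from machine name to `GOARCH`. As an illustrative sketch (not part of the repository), the mapping is:

```python
def go_arch(machine: str) -> str:
    """Map `uname -m` output to the GOARCH value used for the build,
    mirroring the if/elif/else branches in the Dockerfile."""
    return {"ppc64le": "ppc64le", "aarch64": "arm64"}.get(machine, "amd64")


print(go_arch("aarch64"))  # arm64
```

Newer Dockerfiles in this diff replace the branches entirely with Docker BuildKit's `ARG TARGETARCH`, which supplies the Go-style architecture name directly.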

# Copy the cert-generator into a thin image.
FROM gcr.io/distroless/static:nonroot
WORKDIR /app
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-cert-generator /app/
USER 65532:65532
ENTRYPOINT ["./katib-cert-generator"]
@@ -0,0 +1,42 @@
/*
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import (
	"github.com/kubeflow/katib/pkg/cert-generator/v1beta1"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/klog"
	"os"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	kubeClient, err := client.New(config.GetConfigOrDie(), client.Options{Scheme: scheme.Scheme})
	if err != nil {
		klog.Fatalf("Failed to create kube client.")
	}

	cmd, err := v1beta1.NewKatibCertGeneratorCmd(kubeClient)
	if err != nil {
		klog.Fatalf("Failed to generate cert: %v", err)
	}

	if err = cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
@@ -1,7 +1,7 @@
# Build the Katib DB manager.
FROM golang:alpine AS build-env

ARG TARGETARCH
ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

WORKDIR /go/src/github.com/kubeflow/katib

@@ -15,10 +15,28 @@ COPY cmd/ cmd/
COPY pkg/ pkg/

# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH="${TARGETARCH}" go build -a -o katib-db-manager ./cmd/db-manager/v1beta1
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
        CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
    elif [ "$(uname -m)" = "aarch64" ]; then \
        CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
    else \
        CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
    fi

# Add GRPC health probe.
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
        wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
    elif [ "$(uname -m)" = "aarch64" ]; then \
        wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
    else \
        wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
    fi && \
    chmod +x /bin/grpc_health_probe

# Copy the db-manager into a thin image.
FROM alpine:3.15
WORKDIR /app
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-db-manager /app/
ENTRYPOINT ["./katib-db-manager"]
CMD ["-w", "kubernetes"]
@@ -1,5 +1,5 @@
/*
Copyright 2022 The Kubeflow Authors.
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@@ -22,21 +22,19 @@ import (
	"fmt"
	"net"
	"os"
	"time"

	health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
	api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
	db "github.com/kubeflow/katib/pkg/db/v1beta1"
	"github.com/kubeflow/katib/pkg/db/v1beta1/common"
	"k8s.io/klog/v2"
	"k8s.io/klog"

	"google.golang.org/grpc"
	"google.golang.org/grpc/reflection"
)

const (
	defaultListenAddress  = "0.0.0.0:6789"
	defaultConnectTimeout = time.Second * 60
	port = "0.0.0.0:6789"
)

var dbIf common.KatibDBInterface

@@ -89,30 +87,25 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
}

func main() {
	var connectTimeout time.Duration
	var listenAddress string
	flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
	flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
	flag.Parse()

	var err error
	dbNameEnvName := common.DBNameEnvName
	dbName := os.Getenv(dbNameEnvName)
	if dbName == "" {
		klog.Fatal("DB_NAME env is not set. Exiting")
	}
	dbIf, err = db.NewKatibDBInterface(dbName, connectTimeout)
	dbIf, err = db.NewKatibDBInterface(dbName)
	if err != nil {
		klog.Fatalf("Failed to open db connection: %v", err)
	}
	dbIf.DBInit()
	listener, err := net.Listen("tcp", listenAddress)
	listener, err := net.Listen("tcp", port)
	if err != nil {
		klog.Fatalf("Failed to listen: %v", err)
	}

	size := 1<<31 - 1
	klog.Infof("Start Katib manager: %s", listenAddress)
	klog.Infof("Start Katib manager: %s", port)
	s := grpc.NewServer(grpc.MaxRecvMsgSize(size), grpc.MaxSendMsgSize(size))
	api_pb.RegisterDBManagerServer(s, &server{})
	health_pb.RegisterHealthServer(s, &server{})
@@ -1,5 +1,5 @@
/*
Copyright 2022 The Kubeflow Authors.
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@@ -20,7 +20,7 @@ import (
	"context"
	"testing"

	"go.uber.org/mock/gomock"
	"github.com/golang/mock/gomock"

	health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
	api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
@@ -1,11 +1,9 @@
FROM python:3.11-slim
FROM python:3.9

ARG TARGETARCH
ENV TARGET_DIR /opt/katib
ENV EARLY_STOPPING_DIR cmd/earlystopping/medianstop/v1beta1
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python

RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
        apt-get -y update && \
        apt-get -y install gfortran libopenblas-dev liblapack-dev && \
        apt-get clean && \

@@ -14,11 +12,12 @@ RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \

ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${EARLY_STOPPING_DIR}/ ${TARGET_DIR}/${EARLY_STOPPING_DIR}/

WORKDIR ${TARGET_DIR}/${EARLY_STOPPING_DIR}
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \
    && chmod -R g+rwX ${TARGET_DIR}

ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python

ENTRYPOINT ["python", "main.py"]
@@ -1,4 +1,4 @@
# Copyright 2022 The Kubeflow Authors.
# Copyright 2021 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.

@@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import time
from concurrent import futures

import grpc

import time
import logging
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.earlystopping.v1beta1.medianstop.service import MedianStopService
from concurrent import futures

_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6788"
@@ -1,5 +1,5 @@
grpcio>=1.64.1
protobuf>=4.21.12,<5
grpcio==1.41.1
protobuf==3.19.1
googleapis-common-protos==1.6.0
kubernetes==22.6.0
kubernetes==11.0.0
cython>=0.29.24
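The requirements above show the release branch pinning exact versions (`grpcio==1.41.1`) where master uses floors (`grpcio>=1.64.1`). Resolvers compare dotted versions component-wise as integers, not as strings; the sketch below is a hypothetical helper (not part of Katib or pip) illustrating that rule.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares two dotted version strings numerically,
// returning -1, 0, or 1. Missing trailing components count as zero,
// so "2.0" equals "2.0.0". This is an illustrative sketch of how a
// specifier like grpcio>=1.64.1 is evaluated, not a full PEP 440
// implementation (no pre-release or local-version handling).
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		av, bv := 0, 0
		if i < len(as) {
			av, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bv, _ = strconv.Atoi(bs[i])
		}
		if av != bv {
			if av < bv {
				return -1
			}
			return 1
		}
	}
	return 0
}

func main() {
	// The pinned 1.41.1 sits below master's 1.64.1 floor.
	fmt.Println(compareVersions("1.41.1", "1.64.1")) // prints -1
	// Numeric comparison: 1.9.0 < 1.41.1, unlike a string compare.
	fmt.Println(compareVersions("1.9.0", "1.41.1")) // prints -1
}
```

Component-wise comparison is why `==` pins are reproducible while `>=` floors pick up whatever newest release the resolver finds at build time.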
@@ -1,8 +1,6 @@
# Build the Katib controller.
FROM golang:alpine AS build-env

ARG TARGETARCH

WORKDIR /go/src/github.com/kubeflow/katib

# Download packages.

@@ -15,7 +13,13 @@ COPY cmd/ cmd/
COPY pkg/ pkg/

# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-controller ./cmd/katib-controller/v1beta1
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
elif [ "$(uname -m)" = "aarch64" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
else \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
fi

# Copy the controller-manager into a thin image.
FROM alpine:3.15
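The Dockerfile change above swaps buildx's `TARGETARCH` build argument (which is already a GOARCH value) for a runtime `uname -m` check, so the machine name has to be translated back to a GOARCH. A minimal sketch of that mapping, mirroring the if/elif chain:

```go
package main

import "fmt"

// goArchFor maps `uname -m` machine names to Go's GOARCH values,
// mirroring the Dockerfile's if/elif chain above. Anything the
// chain does not name falls through to amd64, as the else branch does.
func goArchFor(unameM string) string {
	switch unameM {
	case "ppc64le":
		return "ppc64le"
	case "aarch64":
		return "arm64"
	default:
		return "amd64"
	}
}

func main() {
	for _, m := range []string{"ppc64le", "aarch64", "x86_64"} {
		fmt.Printf("%s -> %s\n", m, goArchFor(m))
	}
}
```

With `TARGETARCH` no such mapping is needed, which is one reason master's single `GOARCH=${TARGETARCH}` line is simpler than the per-architecture branches.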
@@ -1,5 +1,5 @@
/*
Copyright 2022 The Kubeflow Authors.
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@@ -15,7 +15,7 @@ limitations under the License.
*/

/*
Katib-controller is a controller (operator) for Experiments and Trials
Katib-controller is a controller (operator) for Experiments and Trials
*/
package main
@@ -24,75 +24,64 @@ import (
"os"

"github.com/spf13/viper"
"k8s.io/apimachinery/pkg/runtime"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
"sigs.k8s.io/controller-runtime/pkg/client/config"
"sigs.k8s.io/controller-runtime/pkg/healthz"
logf "sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/manager/signals"
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"

configv1beta1 "github.com/kubeflow/katib/pkg/apis/config/v1beta1"
apis "github.com/kubeflow/katib/pkg/apis/controller"
cert "github.com/kubeflow/katib/pkg/certgenerator/v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1"
controller "github.com/kubeflow/katib/pkg/controller.v1beta1"
"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
"github.com/kubeflow/katib/pkg/util/v1beta1/katibconfig"
webhookv1beta1 "github.com/kubeflow/katib/pkg/webhook/v1beta1"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
trialutil "github.com/kubeflow/katib/pkg/controller.v1beta1/trial/util"
webhook "github.com/kubeflow/katib/pkg/webhook/v1beta1"
)

var (
scheme = runtime.NewScheme()
log = logf.Log.WithName("entrypoint")
)

func init() {
utilruntime.Must(apis.AddToScheme(scheme))
utilruntime.Must(configv1beta1.AddToScheme(scheme))
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
}

func main() {
logf.SetLogger(zap.New())
log := logf.Log.WithName("entrypoint")

var experimentSuggestionName string
var metricsAddr string
var webhookPort int
var injectSecurityContext bool
var enableGRPCProbeInSuggestion bool
var trialResources trialutil.GvkListFlag
var enableLeaderElection bool
var leaderElectionID string

flag.StringVar(&experimentSuggestionName, "experiment-suggestion-name",
"default", "The implementation of suggestion interface in experiment controller (default)")
flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
// For leader election
flag.BoolVar(&enableLeaderElection, "enable-leader-election", false, "Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller.")
flag.StringVar(&leaderElectionID, "leader-election-id", "3fbc96e9.katib.kubeflow.org", "The ID for leader election.")

// TODO (andreyvelich): Currently it is not possible to set different webhook service name.
// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
// TODO (andreyvelich): Currently it is not possible to store webhook cert in the local file system.
// flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")

var katibConfigFile string
flag.StringVar(&katibConfigFile, "katib-config", "",
"The katib-controller will load its initial configuration from this file. "+
"Omit this flag to use the default configuration values. ")
flag.Parse()

initConfig, err := katibconfig.GetInitConfigData(scheme, katibConfigFile)
if err != nil {
log.Error(err, "Failed to get KatibConfig")
os.Exit(1)
}

// Set the config in viper.
viper.Set(consts.ConfigExperimentSuggestionName, initConfig.ControllerConfig.ExperimentSuggestionName)
viper.Set(consts.ConfigInjectSecurityContext, initConfig.ControllerConfig.InjectSecurityContext)
viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, initConfig.ControllerConfig.EnableGRPCProbeInSuggestion)

trialGVKs, err := katibconfig.TrialResourcesToGVKs(initConfig.ControllerConfig.TrialResources)
if err != nil {
log.Error(err, "Failed to parse trialResources")
os.Exit(1)
}
viper.Set(consts.ConfigTrialResources, trialGVKs)
viper.Set(consts.ConfigExperimentSuggestionName, experimentSuggestionName)
viper.Set(consts.ConfigInjectSecurityContext, injectSecurityContext)
viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, enableGRPCProbeInSuggestion)
viper.Set(consts.ConfigTrialResources, trialResources)

log.Info("Config:",
consts.ConfigExperimentSuggestionName,
viper.GetString(consts.ConfigExperimentSuggestionName),
"webhook-port",
initConfig.ControllerConfig.WebhookPort,
webhookPort,
"metrics-addr",
initConfig.ControllerConfig.MetricsAddr,
"healthz-addr",
initConfig.ControllerConfig.HealthzAddr,
metricsAddr,
consts.ConfigInjectSecurityContext,
viper.GetBool(consts.ConfigInjectSecurityContext),
consts.ConfigEnableGRPCProbeInSuggestion,
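The hunk above shows the v0.13 entrypoint configuring everything through individual CLI flags, while master moves most of it into a `--katib-config` file. A self-contained sketch of the flag-based style, using a `flag.FlagSet` for isolation (the struct and function names here are illustrative, not Katib's):

```go
package main

import (
	"flag"
	"fmt"
)

// controllerFlags bundles a few of the flags from the diff above.
// This is an illustrative sketch, not the actual Katib entrypoint:
// it parses a supplied argument list instead of os.Args.
type controllerFlags struct {
	webhookPort          int
	metricsAddr          string
	enableLeaderElection bool
}

func parseControllerFlags(args []string) (*controllerFlags, error) {
	cf := &controllerFlags{}
	fs := flag.NewFlagSet("katib-controller", flag.ContinueOnError)
	fs.IntVar(&cf.webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
	fs.StringVar(&cf.metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
	fs.BoolVar(&cf.enableLeaderElection, "enable-leader-election", false, "Enable leader election for katib-controller.")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return cf, nil
}

func main() {
	cf, err := parseControllerFlags([]string{"--webhook-port=9443", "--enable-leader-election"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cf.webhookPort, cf.metricsAddr, cf.enableLeaderElection) // prints: 9443 :8080 true
}
```

Flags left unset keep their defaults, which is why the diff's defaults (`8443`, `:8080`, leader election off) matter: they are the effective configuration when the deployment passes nothing.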
@@ -110,13 +99,9 @@ func main() {

// Create a new katib controller to provide shared dependencies and start components
mgr, err := manager.New(cfg, manager.Options{
Metrics: metricsserver.Options{
BindAddress: initConfig.ControllerConfig.MetricsAddr,
},
HealthProbeBindAddress: initConfig.ControllerConfig.HealthzAddr,
LeaderElection: initConfig.ControllerConfig.EnableLeaderElection,
LeaderElectionID: initConfig.ControllerConfig.LeaderElectionID,
Scheme: scheme,
MetricsBindAddress: metricsAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: leaderElectionID,
})
if err != nil {
log.Error(err, "Failed to create the manager")
@@ -125,50 +110,11 @@ func main() {

log.Info("Registering Components.")

// Create a webhook server.
hookServer := webhook.NewServer(webhook.Options{
Port: *initConfig.ControllerConfig.WebhookPort,
CertDir: consts.CertDir,
})

ctx := signals.SetupSignalHandler()
certsReady := make(chan struct{})
defer close(certsReady)

// The setupControllers will register controllers to the manager
// after generated certs for the admission webhooks.
go setupControllers(mgr, certsReady, hookServer)

if initConfig.CertGeneratorConfig.Enable {
if err = cert.AddToManager(mgr, initConfig.CertGeneratorConfig, certsReady); err != nil {
log.Error(err, "Failed to set up cert-generator")
}
} else {
certsReady <- struct{}{}
}

log.Info("Setting up health checker.")
if err := mgr.AddReadyzCheck("readyz", hookServer.StartedChecker()); err != nil {
log.Error(err, "Unable to add readyz endpoint to the manager")
// Setup Scheme for all resources
if err := apis.AddToScheme(mgr.GetScheme()); err != nil {
log.Error(err, "Unable to add APIs to scheme")
os.Exit(1)
}
if err = mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
log.Error(err, "Add webhook server health checker to the manager failed")
os.Exit(1)
}

// Start the Cmd
log.Info("Starting the manager.")
if err = mgr.Start(ctx); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
}

func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer webhook.Server) {
// The certsReady blocks to register controllers until generated certs.
<-certsReady
log.Info("Certs ready")

// Setup all Controllers
log.Info("Setting up controller.")
@@ -178,8 +124,15 @@ func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer
}

log.Info("Setting up webhooks.")
if err := webhookv1beta1.AddToManager(mgr, hookServer); err != nil {
if err := webhook.AddToManager(mgr, webhookPort); err != nil {
log.Error(err, "Unable to register webhooks to the manager")
os.Exit(1)
}

// Start the Cmd
log.Info("Starting the Cmd.")
if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
log.Error(err, "Unable to run the manager")
os.Exit(1)
}
}
@@ -1,8 +1,6 @@
# Build the Katib file metrics collector.
FROM golang:alpine AS build-env

ARG TARGETARCH

WORKDIR /go/src/github.com/kubeflow/katib

# Download packages.

@@ -15,7 +13,13 @@ COPY cmd/ cmd/
COPY pkg/ pkg/

# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
elif [ "$(uname -m)" = "aarch64" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
else \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
fi

# Copy the file metrics collector into a thin image.
FROM alpine:3.15
@@ -1,5 +1,5 @@
/*
Copyright 2022 The Kubeflow Authors.
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@@ -39,21 +39,19 @@ package main

import (
"context"
"encoding/json"
"flag"
"fmt"
"io/ioutil"
"os"
"path/filepath"
"regexp"
"strconv"
"strings"
"time"

"github.com/nxadm/tail"
psutil "github.com/shirou/gopsutil/v3/process"
"github.com/hpcloud/tail"
psutil "github.com/shirou/gopsutil/process"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
"k8s.io/klog/v2"
"k8s.io/klog"

commonv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
api "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
@@ -104,7 +102,6 @@ var (
earlyStopServiceAddr = flag.String("s-earlystop", "", "Katib Early Stopping service endpoint")
trialName = flag.String("t", "", "Trial Name")
metricsFilePath = flag.String("path", "", "Metrics File Path")
metricsFileFormat = flag.String("format", "", "Metrics File Format")
metricNames = flag.String("m", "", "Metric names")
objectiveType = flag.String("o-type", "", "Objective type")
metricFilters = flag.String("f", "", "Metric filters")

@@ -134,17 +131,13 @@ func printMetricsFile(mFile string) {
checkMetricFile(mFile)

// Print lines from metrics file.
t, err := tail.TailFile(mFile, tail.Config{Follow: true, ReOpen: true})
if err != nil {
klog.Errorf("Failed to open metrics file: %v", err)
}

t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
for line := range t.Lines {
klog.Info(line.Text)
}
}

func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, fileFormat commonv1beta1.FileFormat) {
func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string) {

// metricStartStep is the dict where key = metric name, value = start step.
// We should apply early stopping rule only if metric is reported at least "start_step" times.
@@ -155,6 +148,9 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
}
}

// First metric is objective in metricNames array.
objMetric := strings.Split(*metricNames, ";")[0]
objType := commonv1beta1.ObjectiveType(*objectiveType)
// For objective metric we calculate best optimal value from the recorded metrics.
// This is workaround for Median Stop algorithm.
// TODO (andreyvelich): Think about it, maybe define latest, max or min strategy type in stop-rule as well ?
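For the objective metric, the collector substitutes the best value seen so far (maximum for a maximize objective, minimum for minimize) before applying a stop rule, as the hunk above notes. A minimal sketch of that running-optimum update, with a boolean standing in for Katib's `ObjectiveType`:

```go
package main

import "fmt"

// bestSoFar folds a new objective value into the running optimum,
// mirroring the optimalObjValue logic: nil means "first value seen",
// after that keep the max or min depending on the objective type.
// The signature is illustrative, not Katib's.
func bestSoFar(current *float64, value float64, maximize bool) *float64 {
	if current == nil {
		v := value
		return &v
	}
	if (maximize && value > *current) || (!maximize && value < *current) {
		v := value
		return &v
	}
	return current
}

func main() {
	var best *float64
	for _, acc := range []float64{0.81, 0.87, 0.84} {
		best = bestSoFar(best, acc, true) // maximize accuracy
	}
	fmt.Println(*best) // prints 0.87
}
```

Feeding the best-so-far value to the rule (rather than the latest reading) keeps a single noisy dip from triggering early stopping, which is the Median Stop workaround the TODO comment refers to.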
@@ -164,9 +160,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
checkMetricFile(mFile)

// Get Main process.
// Extract the metric file dir path based on the file name.
mDirPath, _ := filepath.Split(mFile)
_, mainProcPid, err := common.GetMainProcesses(mDirPath)
_, mainProcPid, err := common.GetMainProcesses(mFile)
if err != nil {
klog.Fatalf("GetMainProcesses failed: %v", err)
}
@@ -175,6 +169,9 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
klog.Fatalf("Failed to create new Process from pid %v, error: %v", mainProcPid, err)
}

// Get list of regular expressions from filters.
metricRegList := filemc.GetFilterRegexpList(filters)

// Start watch log lines.
t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
for line := range t.Lines {
@@ -182,82 +179,78 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
// Print log line
klog.Info(logText)

switch fileFormat {
case commonv1beta1.TextFormat:
// Get list of regular expressions from filters.
var metricRegList []*regexp.Regexp
metricRegList = filemc.GetFilterRegexpList(filters)
// Check if log line contains metric from stop rules.
isRuleLine := false
for _, rule := range stopRules {
if strings.Contains(logText, rule.Name) {
isRuleLine = true
break
}
}
// If log line doesn't contain appropriate metric, continue track file.
if !isRuleLine {
continue
}

// Check if log line contains metric from stop rules.
isRuleLine := false
for _, rule := range stopRules {
if strings.Contains(logText, rule.Name) {
isRuleLine = true
break
}
}
// If log line doesn't contain appropriate metric, continue track file.
if !isRuleLine {
continue
}

// If log line contains appropriate metric, find all submatches from metric filters.
for _, metricReg := range metricRegList {
matchStrings := metricReg.FindAllStringSubmatch(logText, -1)
for _, subMatchList := range matchStrings {
if len(subMatchList) < 3 {
continue
}
// Submatch must have metric name and float value
metricName := strings.TrimSpace(subMatchList[1])
metricValue, err := strconv.ParseFloat(strings.TrimSpace(subMatchList[2]), 64)
if err != nil {
klog.Fatalf("Unable to parse value %v to float for metric %v", metricValue, metricName)
}

// stopRules contains array of EarlyStoppingRules that has not been reached yet.
// After rule is reached we delete appropriate element from the array.
for idx, rule := range stopRules {
if metricName != rule.Name {
continue
}
stopRules, optimalObjValue = updateStopRules(stopRules, optimalObjValue, metricValue, metricStartStep, rule, idx)
}
}
}
case commonv1beta1.JsonFormat:
var logJsonObj map[string]interface{}
if err = json.Unmarshal([]byte(logText), &logJsonObj); err != nil {
klog.Fatalf("Failed to unmarshal logs in %v format, log: %s, error: %v", commonv1beta1.JsonFormat, logText, err)
}
// Check if log line contains metric from stop rules.
isRuleLine := false
for _, rule := range stopRules {
if _, exist := logJsonObj[rule.Name]; exist {
isRuleLine = true
break
}
}
// If log line doesn't contain appropriate metric, continue track file.
if !isRuleLine {
continue
}

// stopRules contains array of EarlyStoppingRules that has not been reached yet.
// After rule is reached we delete appropriate element from the array.
for idx, rule := range stopRules {
value, exist := logJsonObj[rule.Name].(string)
if !exist {
// If log line contains appropriate metric, find all submatches from metric filters.
for _, metricReg := range metricRegList {
matchStrings := metricReg.FindAllStringSubmatch(logText, -1)
for _, subMatchList := range matchStrings {
if len(subMatchList) < 3 {
continue
}
metricValue, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
// Submatch must have metric name and float value
metricName := strings.TrimSpace(subMatchList[1])
metricValue, err := strconv.ParseFloat(strings.TrimSpace(subMatchList[2]), 64)
if err != nil {
klog.Fatalf("Unable to parse value %v to float for metric %v", metricValue, rule.Name)
klog.Fatalf("Unable to parse value %v to float for metric %v", metricValue, metricName)
}

// stopRules contains array of EarlyStoppingRules that has not been reached yet.
// After rule is reached we delete appropriate element from the array.
for idx, rule := range stopRules {
if metricName != rule.Name {
continue
}

// Calculate optimalObjValue.
if metricName == objMetric {
if optimalObjValue == nil {
optimalObjValue = &metricValue
} else if objType == commonv1beta1.ObjectiveTypeMaximize && metricValue > *optimalObjValue {
optimalObjValue = &metricValue
} else if objType == commonv1beta1.ObjectiveTypeMinimize && metricValue < *optimalObjValue {
optimalObjValue = &metricValue
}
// Assign best optimal value to metric value.
metricValue = *optimalObjValue
}

// Reduce steps if appropriate metric is reported.
// Once rest steps are empty we apply early stopping rule.
if _, ok := metricStartStep[metricName]; ok {
metricStartStep[metricName]--
if metricStartStep[metricName] != 0 {
continue
}
}

ruleValue, err := strconv.ParseFloat(rule.Value, 64)
if err != nil {
klog.Fatalf("Unable to parse value %v to float for rule metric %v", rule.Value, rule.Name)
}

// Metric value can be equal, less or greater than stop rule.
// Deleting suitable stop rule from the array.
if rule.Comparison == commonv1beta1.ComparisonTypeEqual && metricValue == ruleValue {
stopRules = deleteStopRule(stopRules, idx)
} else if rule.Comparison == commonv1beta1.ComparisonTypeLess && metricValue < ruleValue {
stopRules = deleteStopRule(stopRules, idx)
} else if rule.Comparison == commonv1beta1.ComparisonTypeGreater && metricValue > ruleValue {
stopRules = deleteStopRule(stopRules, idx)
}
}
stopRules, optimalObjValue = updateStopRules(stopRules, optimalObjValue, metricValue, metricStartStep, rule, idx)
}
default:
klog.Fatalf("Format must be set to %v or %v", commonv1beta1.TextFormat, commonv1beta1.JsonFormat)
}

// If stopRules array is empty, Trial is early stopped.
@@ -273,7 +266,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
klog.Fatalf("Create mark file %v error: %v", markFile, err)
}

err = os.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0)
err = ioutil.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0)
if err != nil {
klog.Fatalf("Write to file %v error: %v", markFile, err)
}

@@ -296,7 +289,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
}

// Report metrics to DB.
reportMetrics(filters, fileFormat)
reportMetrics(filters)

// Wait until main process is completed.
timeout := 60 * time.Second
@@ -311,7 +304,7 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
}

// Create connection and client for Early Stopping service.
conn, err := grpc.NewClient(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
conn, err := grpc.Dial(*earlyStopServiceAddr, grpc.WithInsecure())
if err != nil {
klog.Fatalf("Could not connect to Early Stopping service, error: %v", err)
}
@@ -333,58 +326,6 @@ func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, f
}
}

func updateStopRules(
stopRules []commonv1beta1.EarlyStoppingRule,
optimalObjValue *float64,
metricValue float64,
metricStartStep map[string]int,
rule commonv1beta1.EarlyStoppingRule,
ruleIdx int,
) ([]commonv1beta1.EarlyStoppingRule, *float64) {

// First metric is objective in metricNames array.
objMetric := strings.Split(*metricNames, ";")[0]
objType := commonv1beta1.ObjectiveType(*objectiveType)

// Calculate optimalObjValue.
if rule.Name == objMetric {
if optimalObjValue == nil {
optimalObjValue = &metricValue
} else if objType == commonv1beta1.ObjectiveTypeMaximize && metricValue > *optimalObjValue {
optimalObjValue = &metricValue
} else if objType == commonv1beta1.ObjectiveTypeMinimize && metricValue < *optimalObjValue {
optimalObjValue = &metricValue
}
// Assign best optimal value to metric value.
metricValue = *optimalObjValue
}

// Reduce steps if appropriate metric is reported.
// Once rest steps are empty we apply early stopping rule.
if _, ok := metricStartStep[rule.Name]; ok {
metricStartStep[rule.Name]--
if metricStartStep[rule.Name] != 0 {
return stopRules, optimalObjValue
}
}

ruleValue, err := strconv.ParseFloat(rule.Value, 64)
if err != nil {
klog.Fatalf("Unable to parse value %v to float for rule metric %v", rule.Value, rule.Name)
}

// Metric value can be equal, less or greater than stop rule.
// Deleting suitable stop rule from the array.
if rule.Comparison == commonv1beta1.ComparisonTypeEqual && metricValue == ruleValue {
return deleteStopRule(stopRules, ruleIdx), optimalObjValue
} else if rule.Comparison == commonv1beta1.ComparisonTypeLess && metricValue < ruleValue {
return deleteStopRule(stopRules, ruleIdx), optimalObjValue
} else if rule.Comparison == commonv1beta1.ComparisonTypeGreater && metricValue > ruleValue {
return deleteStopRule(stopRules, ruleIdx), optimalObjValue
}
return stopRules, optimalObjValue
}

func deleteStopRule(stopRules []commonv1beta1.EarlyStoppingRule, idx int) []commonv1beta1.EarlyStoppingRule {
if idx >= len(stopRules) {
klog.Fatalf("Index %v out of range stopRules: %v", idx, stopRules)
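The `updateStopRules`/`deleteStopRule` pair above compares a metric value against a rule with an Equal/Less/Greater comparison and removes satisfied rules from the slice. A reduced sketch of that core logic with simplified stand-in types (not Katib's real `EarlyStoppingRule` API):

```go
package main

import "fmt"

// rule is a simplified stand-in for Katib's EarlyStoppingRule:
// a metric name, a threshold, and a comparison direction.
type rule struct {
	Name       string
	Value      float64
	Comparison string // "equal", "less", or "greater"
}

// reached reports whether a metric value satisfies the rule,
// mirroring the three-way comparison in the diff.
func reached(r rule, metricValue float64) bool {
	switch r.Comparison {
	case "equal":
		return metricValue == r.Value
	case "less":
		return metricValue < r.Value
	case "greater":
		return metricValue > r.Value
	}
	return false
}

// deleteRule removes index idx from the slice, as deleteStopRule does.
func deleteRule(rules []rule, idx int) []rule {
	return append(rules[:idx], rules[idx+1:]...)
}

func main() {
	rules := []rule{
		{Name: "accuracy", Value: 0.9, Comparison: "greater"},
		{Name: "loss", Value: 0.05, Comparison: "less"},
	}
	// accuracy=0.93 satisfies the first rule, so it is removed;
	// the trial is early stopped once the slice is empty.
	if reached(rules[0], 0.93) {
		rules = deleteRule(rules, 0)
	}
	fmt.Println(len(rules), rules[0].Name) // prints: 1 loss
}
```

Removing satisfied rules one by one and stopping when the slice is empty means every configured rule must fire before the trial is early stopped.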
@@ -404,11 +345,9 @@ func main() {
filters = strings.Split(*metricFilters, ";")
}

fileFormat := commonv1beta1.FileFormat(*metricsFileFormat)

// If stop rule is set we need to parse metrics during run.
if len(stopRules) != 0 {
go watchMetricsFile(*metricsFilePath, stopRules, filters, fileFormat)
go watchMetricsFile(*metricsFilePath, stopRules, filters)
} else {
go printMetricsFile(*metricsFilePath)
}
@@ -427,13 +366,13 @@ func main() {

// If training was not early stopped, report the metrics.
if !isEarlyStopped {
reportMetrics(filters, fileFormat)
reportMetrics(filters)
}
}

func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) {
func reportMetrics(filters []string) {

conn, err := grpc.NewClient(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
conn, err := grpc.Dial(*dbManagerServiceAddr, grpc.WithInsecure())
if err != nil {
klog.Fatalf("Could not connect to DB manager service, error: %v", err)
}

@@ -444,7 +383,7 @@ func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) {
if len(*metricNames) != 0 {
metricList = strings.Split(*metricNames, ";")
}
olog, err := filemc.CollectObservationLog(*metricsFilePath, metricList, filters, fileFormat)
olog, err := filemc.CollectObservationLog(*metricsFilePath, metricList, filters)
if err != nil {
klog.Fatalf("Failed to collect logs: %v", err)
}
@@ -1,24 +1,25 @@
FROM python:3.11-slim
FROM python:3.9

ARG TARGETARCH
ENV TARGET_DIR /opt/katib
ENV METRICS_COLLECTOR_DIR cmd/metricscollector/v1beta1/tfevent-metricscollector
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/tfevent-metricscollector/::${TARGET_DIR}/pkg/metricscollector/v1beta1/common/
# tensorflow community build for aarch64
# https://github.com/tensorflow/build#tensorflow-builds
ENV PIP_EXTRA_INDEX_URL https://snapshots.linaro.org/ldcg/python-cache/

ADD ./pkg/ ${TARGET_DIR}/pkg/
ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/

WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}

RUN if [ "${TARGETARCH}" = "arm64" ]; then \
    apt-get -y update && \
    apt-get -y install gfortran libpcre3 libpcre3-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*; \
fi
RUN if [ "$(uname -m)" = "aarch64" ]; then \
    pip install tensorflow-aarch64==2.7.0; \
else \
    pip install tensorflow==2.7.0; \
fi;
RUN pip install --no-cache-dir -r requirements.txt

RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN chgrp -R 0 ${TARGET_DIR} \
&& chmod -R g+rwX ${TARGET_DIR}

ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/tfevent-metricscollector/::${TARGET_DIR}/pkg/metricscollector/v1beta1/common/

ENTRYPOINT ["python", "main.py"]
@@ -1,6 +1,6 @@
FROM ibmcom/tensorflow-ppc64le:2.2.0-py3
ADD . /usr/src/app/github.com/kubeflow/katib
WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/
ENTRYPOINT ["python", "main.py"]
@ -1,4 +1,4 @@
|
|||
# Copyright 2022 The Kubeflow Authors.
|
||||
# Copyright 2021 The Kubeflow Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
|
@ -12,15 +12,13 @@
|
|||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
from logging import INFO, StreamHandler, getLogger
|
||||
|
||||
import api_pb2
|
||||
import api_pb2_grpc
|
||||
import const
|
||||
import grpc
|
||||
import argparse
|
||||
import api_pb2
|
||||
from pns import WaitMainProcesses
|
||||
import const
|
||||
from tfevent_loader import MetricsCollector
|
||||
from logging import getLogger, StreamHandler, INFO
|
||||
|
||||
timeout_in_seconds = 60
|
||||
|
||||
|
@@ -57,28 +55,25 @@ if __name__ == '__main__':
     wait_all_processes = opt.wait_all_processes.lower() == "true"
     db_manager_server = opt.db_manager_server_addr.split(':')
     if len(db_manager_server) != 2:
-        raise Exception(
-            f"Invalid Katib DB manager service address: {opt.db_manager_server_addr}"
-        )
+        raise Exception("Invalid Katib DB manager service address: %s" %
+                        opt.db_manager_server_addr)

     WaitMainProcesses(
         pool_interval=opt.poll_interval,
         timout=opt.timeout,
         wait_all=wait_all_processes,
-        completed_marked_dir=opt.metrics_file_dir,
-    )
+        completed_marked_dir=opt.metrics_file_dir)

-    mc = MetricsCollector(opt.metric_names.split(";"))
+    mc = MetricsCollector(opt.metric_names.split(';'))
     observation_log = mc.parse_file(opt.metrics_file_dir)

-    with grpc.insecure_channel(opt.db_manager_server_addr) as channel:
-        stub = api_pb2_grpc.DBManagerStub(channel)
-        logger.info(
-            f"In {opt.trial_name} {str(len(observation_log.metric_logs))} metrics will be reported."
-        )
-        stub.ReportObservationLog(
-            api_pb2.ReportObservationLogRequest(
-                trial_name=opt.trial_name, observation_log=observation_log
-            ),
-            timeout=timeout_in_seconds,
-        )
+    channel = grpc.beta.implementations.insecure_channel(
+        db_manager_server[0], int(db_manager_server[1]))
+
+    with api_pb2.beta_create_DBManager_stub(channel) as client:
+        logger.info("In " + opt.trial_name + " " +
+                    str(len(observation_log.metric_logs)) + " metrics will be reported.")
+        client.ReportObservationLog(api_pb2.ReportObservationLogRequest(
+            trial_name=opt.trial_name,
+            observation_log=observation_log
+        ), timeout=timeout_in_seconds)
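The hunk above swaps the modern `grpc.insecure_channel` call for the deprecated `grpc.beta` API, but both sides first validate the `"host:port"` DB manager address the same way. A minimal sketch of that validation step (the helper name is illustrative, not from the Katib sources):

```python
def parse_db_manager_addr(addr: str):
    """Validate a Katib DB manager "host:port" address, as both versions of main.py do."""
    parts = addr.split(":")
    if len(parts) != 2:
        raise ValueError(f"Invalid Katib DB manager service address: {addr}")
    host, port = parts
    return host, int(port)
```

The beta API needed host and port separately (`db_manager_server[0]`, `int(db_manager_server[1])`), while `grpc.insecure_channel` takes the joined string, which is why the newer code keeps the split only for validation.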
@@ -1,6 +1,4 @@
-psutil==5.9.4
+psutil==5.8.0
 rfc3339>=6.2
-grpcio>=1.64.1
-tensorflow==2.16.1
-protobuf>=4.21.12,<5
+grpcio==1.41.1
+googleapis-common-protos==1.6.0
@@ -0,0 +1,63 @@
+# --- Clone the kubeflow/kubeflow code ---
+FROM ubuntu AS fetch-kubeflow-kubeflow
+
+RUN apt-get update && apt-get install git -y
+
+WORKDIR /kf
+RUN git clone https://github.com/kubeflow/kubeflow.git && \
+    cd kubeflow && \
+    git checkout ecb72c2
+
+# --- Build the frontend kubeflow library ---
+FROM node:12 AS frontend-kubeflow-lib
+
+WORKDIR /src
+
+ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
+COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
+RUN npm ci
+
+COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
+RUN npm run build
+
+# --- Build the frontend ---
+FROM node:12 AS frontend
+
+WORKDIR /src
+COPY ./pkg/new-ui/v1beta1/frontend/package*.json ./
+RUN npm ci
+
+COPY ./pkg/new-ui/v1beta1/frontend/ .
+COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
+
+RUN npm run build:prod
+
+# --- Build the backend ---
+FROM golang:alpine AS go-build
+
+WORKDIR /go/src/github.com/kubeflow/katib
+
+# Download packages.
+COPY go.mod .
+COPY go.sum .
+RUN go mod download -x
+
+# Copy sources.
+COPY cmd/ cmd/
+COPY pkg/ pkg/
+
+# Build the binary.
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    fi
+
+# --- Compose the web app ---
+FROM alpine:3.15
+WORKDIR /app
+COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
+COPY --from=frontend /src/dist/static /app/build/static/
+ENTRYPOINT ["./katib-ui"]
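The `uname -m` if/elif/else in the build stage above selects a GOARCH value per machine. A hedged Python sketch of that mapping (the function is illustrative, not part of the repo):

```python
def goarch_for(machine: str) -> str:
    """Map `uname -m` output to the GOARCH value the Dockerfile's if/elif/else selects."""
    return {"ppc64le": "ppc64le", "aarch64": "arm64"}.get(machine, "amd64")
```

With Docker buildx, the same selection is usually expressed via the `TARGETARCH` build argument instead, which is what the master side of the later suggestion-image hunks uses.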
@@ -0,0 +1,74 @@
+/*
+Copyright 2021 The Kubeflow Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package main
+
+import (
+    "flag"
+    "fmt"
+    "log"
+    "net/http"
+
+    _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
+
+    common_v1beta1 "github.com/kubeflow/katib/pkg/common/v1beta1"
+    ui "github.com/kubeflow/katib/pkg/new-ui/v1beta1"
+)
+
+var (
+    port, host, buildDir, dbManagerAddr *string
+)
+
+func init() {
+    port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
+    host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
+    buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
+    dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
+}
+
+func main() {
+    flag.Parse()
+    kuh := ui.NewKatibUIHandler(*dbManagerAddr)
+
+    log.Printf("Serving the frontend dir %s", *buildDir)
+    frontend := http.FileServer(http.Dir(*buildDir))
+    http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
+    http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
+
+    http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)
+
+    http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
+
+    http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
+
+    http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
+    http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
+
+    http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
+    http.HandleFunc("/katib/fetch_hp_job_trial_info/", kuh.FetchHPJobTrialInfo)
+    http.HandleFunc("/katib/fetch_nas_job_info/", kuh.FetchNASJobInfo)
+
+    http.HandleFunc("/katib/fetch_trial_templates/", kuh.FetchTrialTemplates)
+    http.HandleFunc("/katib/add_template/", kuh.AddTemplate)
+    http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
+    http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
+    http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
+
+    log.Printf("Serving at %s:%s", *host, *port)
+    if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
+        panic(err)
+    }
+}
@@ -0,0 +1,35 @@
+FROM python:3.9
+
+ENV TARGET_DIR /opt/katib
+ENV SUGGESTION_DIR cmd/suggestion/chocolate/v1beta1
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6
+
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
+    apt-get -y update && \
+    apt-get -y install gfortran libopenblas-dev liblapack-dev && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*; \
+    fi
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
+
+ADD ./pkg/ ${TARGET_DIR}/pkg/
+ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
+WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN if [ "$(uname -m)" = "aarch64" ]; then \
+    sed -i -e '$a git+https://github.com/fmder/ghalton@master' -e '/^ghalton/d' requirements.txt; \
+    fi;
+RUN pip install --no-cache-dir -r requirements.txt
+
+RUN chgrp -R 0 ${TARGET_DIR} \
+  && chmod -R g+rwX ${TARGET_DIR}
+
+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
+ENTRYPOINT ["python", "main.py"]
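The aarch64-only `sed` in the Dockerfile above edits requirements.txt in place: `-e '$a …'` appends the git source for ghalton and `-e '/^ghalton/d'` drops the pinned release, which has no aarch64 wheel. A near-equivalent sketch in Python (illustrative helper, not in the repo):

```python
def adapt_requirements_for_aarch64(lines):
    """Drop the pinned ghalton requirement and append the git source, mirroring the sed."""
    kept = [line for line in lines if not line.startswith("ghalton")]
    kept.append("git+https://github.com/fmder/ghalton@master")
    return kept
```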
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
-
-from pkg.apis.manager.health.python import health_pb2_grpc
+import time
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
-from pkg.suggestion.v1beta1.pbt.service import PbtService
+from pkg.apis.manager.health.python import health_pb2_grpc
+from pkg.suggestion.v1beta1.chocolate.service import ChocolateService
+from concurrent import futures

 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -27,10 +25,9 @@ DEFAULT_PORT = "0.0.0.0:6789"

 def serve():
     server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
-    service = PbtService()
+    service = ChocolateService()
     api_pb2_grpc.add_SuggestionServicer_to_server(service, server)
     health_pb2_grpc.add_HealthServicer_to_server(service, server)

     server.add_insecure_port(DEFAULT_PORT)
     print("Listening...")
     server.start()
@@ -0,0 +1,12 @@
+grpcio==1.41.1
+cloudpickle==0.5.6
+numpy>=1.20.0
+scikit-learn>=0.24.0
+scipy>=1.5.4
+forestci==0.3
+protobuf==3.19.1
+googleapis-common-protos==1.6.0
+SQLAlchemy==1.4.26
+git+https://github.com/AIworx-Labs/chocolate@master
+ghalton>=0.6.2
+cython>=0.29.24
@@ -1,7 +1,7 @@
 # Build the Goptuna Suggestion.
 FROM golang:alpine AS build-env

-ARG TARGETARCH
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

 WORKDIR /go/src/github.com/kubeflow/katib
@@ -15,7 +15,23 @@ COPY cmd/ cmd/
 COPY pkg/ pkg/

 # Build the binary.
-RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
+    fi
+
+# Add GRPC health probe.
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe

 # Copy the Goptuna suggestion into a thin image.
 FROM alpine:3.15
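Each wget branch above downloads the grpc_health_probe release asset matching the build machine. A sketch of how those URLs are composed (the helper name is illustrative, not from the repo):

```python
def probe_url(machine: str, version: str = "v0.4.6") -> str:
    """Compose the grpc_health_probe release URL the Dockerfile wgets for a given `uname -m`."""
    arch = {"ppc64le": "ppc64le", "aarch64": "arm64"}.get(machine, "amd64")
    return ("https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/"
            f"{version}/grpc_health_probe-linux-{arch}")
```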
@@ -23,7 +39,7 @@ FROM alpine:3.15
 ENV TARGET_DIR /opt/katib

 WORKDIR ${TARGET_DIR}

 COPY --from=build-env /bin/grpc_health_probe /bin/
 COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/

 RUN chgrp -R 0 ${TARGET_DIR} \
@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

@@ -24,7 +24,7 @@ import (
     api_v1_beta1 "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
     suggestion "github.com/kubeflow/katib/pkg/suggestion/v1beta1/goptuna"
     "google.golang.org/grpc"
-    "k8s.io/klog/v2"
+    "k8s.io/klog"
 )

 const (
@@ -1,24 +1,33 @@
-FROM python:3.11-slim
+FROM python:3.9

-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV SUGGESTION_DIR cmd/suggestion/hyperband/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

-RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi

+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
+
 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/

 WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN pip install --no-cache-dir -r requirements.txt

-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
 RUN chgrp -R 0 ${TARGET_DIR} \
   && chmod -R g+rwX ${TARGET_DIR}

+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
 ENTRYPOINT ["python", "main.py"]
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
-
-from pkg.apis.manager.health.python import health_pb2_grpc
+import time
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
+from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
+from concurrent import futures

 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -1,9 +1,9 @@
-grpcio>=1.64.1
+grpcio==1.41.1
 cloudpickle==0.5.6
-numpy>=1.25.2
+numpy>=1.20.0
 scikit-learn>=0.24.0
 scipy>=1.5.4
 forestci==0.3
-protobuf>=4.21.12,<5
+protobuf==3.19.1
 googleapis-common-protos==1.6.0
 cython>=0.29.24
@ -1,24 +1,34 @@
|
|||
FROM python:3.11-slim
|
||||
FROM python:3.9
|
||||
|
||||
ARG TARGETARCH
|
||||
ENV TARGET_DIR /opt/katib
|
||||
ENV SUGGESTION_DIR cmd/suggestion/hyperopt/v1beta1
|
||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||
ENV GRPC_HEALTH_PROBE_VERSION v0.4.6
|
||||
|
||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
||||
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||
apt-get -y update && \
|
||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*; \
|
||||
fi
|
||||
|
||||
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||
else \
|
||||
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||
fi && \
|
||||
chmod +x /bin/grpc_health_probe
|
||||
|
||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||
|
||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
|
||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||
&& chmod -R g+rwX ${TARGET_DIR}
|
||||
|
||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||
|
||||
ENTRYPOINT ["python", "main.py"]
|
||||
|
||||
|
|
|
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
-
-from pkg.apis.manager.health.python import health_pb2_grpc
+import time
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
+from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
+from concurrent import futures

 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -1,10 +1,10 @@
-grpcio>=1.64.1
+grpcio==1.41.1
 cloudpickle==0.5.6
-numpy>=1.25.2
+numpy>=1.20.0
 scikit-learn>=0.24.0
 scipy>=1.5.4
 forestci==0.3
-protobuf>=4.21.12,<5
+protobuf==3.19.1
 googleapis-common-protos==1.6.0
 hyperopt==0.2.5
 cython>=0.29.24
@@ -1,24 +1,33 @@
-FROM python:3.11-slim
+FROM python:3.9

-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV SUGGESTION_DIR cmd/suggestion/nas/darts/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

-RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi

+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
+
 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/

 WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN pip install --no-cache-dir -r requirements.txt

-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
 RUN chgrp -R 0 ${TARGET_DIR} \
   && chmod -R g+rwX ${TARGET_DIR}

+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
 ENTRYPOINT ["python", "main.py"]
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,15 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
-
-from pkg.apis.manager.health.python import health_pb2_grpc
+from concurrent import futures
+import time
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
+from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.suggestion.v1beta1.nas.darts.service import DartsService


 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -1,4 +1,4 @@
-grpcio>=1.64.1
-protobuf>=4.21.12,<5
+grpcio==1.41.1
+protobuf==3.19.1
 googleapis-common-protos==1.6.0
 cython>=0.29.24
@@ -1,24 +1,40 @@
-FROM python:3.11-slim
+FROM python:3.9

-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
-# tensorflow community build for aarch64
-# https://github.com/tensorflow/build#tensorflow-builds
-ENV PIP_EXTRA_INDEX_URL https://snapshots.linaro.org/ldcg/python-cache/
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

-RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi

+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
+
 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/

 WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}

-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
+RUN if [ "$(uname -m)" = "aarch64" ]; then \
+    sed -i 's/tensorflow==/tensorflow-aarch64==/' requirements.txt; \
+    fi;
+RUN pip install --no-cache-dir -r requirements.txt

 RUN chgrp -R 0 ${TARGET_DIR} \
   && chmod -R g+rwX ${TARGET_DIR}

+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
 ENTRYPOINT ["python", "main.py"]
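On aarch64 the Dockerfile above rewrites the TensorFlow pin to the community `tensorflow-aarch64` wheel before installing. The `sed 's/tensorflow==/tensorflow-aarch64==/'` is roughly equivalent to this sketch (illustrative function, not in the repo; the regex is anchored to line start as a simplification):

```python
import re

def retarget_tensorflow(requirements_text: str) -> str:
    """Rewrite `tensorflow==` pins to `tensorflow-aarch64==`, mirroring the sed above."""
    return re.sub(r"^tensorflow==", "tensorflow-aarch64==", requirements_text, flags=re.MULTILINE)
```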
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,15 +12,15 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
+from concurrent import futures
+import time

-from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
+from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.suggestion.v1beta1.nas.enas.service import EnasService


 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -1,5 +1,5 @@
-grpcio>=1.64.1
+grpcio==1.41.1
+protobuf==3.19.1
 googleapis-common-protos==1.6.0
-tensorflow==2.16.1
-protobuf>=4.21.12,<5
+tensorflow==2.7.0
 cython>=0.29.24
@@ -1,24 +1,32 @@
-FROM python:3.11-slim
+FROM python:3.9

-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV SUGGESTION_DIR cmd/suggestion/optuna/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

-RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe

 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/

 WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN pip install --no-cache-dir -r requirements.txt

-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
 RUN chgrp -R 0 ${TARGET_DIR} \
   && chmod -R g+rwX ${TARGET_DIR}

+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
 ENTRYPOINT ["python", "main.py"]
@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.

@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import time
-from concurrent import futures
-
 import grpc
-
-from pkg.apis.manager.health.python import health_pb2_grpc
+import time
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
+from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.suggestion.v1beta1.optuna.service import OptunaService
+from concurrent import futures

 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -1,4 +1,4 @@
-grpcio>=1.64.1
-protobuf>=4.21.12,<5
+grpcio==1.41.1
+protobuf==3.19.1
 googleapis-common-protos==1.53.0
-optuna==3.3.0
+optuna>=2.8.0
@@ -1,24 +0,0 @@
-FROM python:3.11-slim
-
-ARG TARGETARCH
-ENV TARGET_DIR /opt/katib
-ENV SUGGESTION_DIR cmd/suggestion/pbt/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
-
-RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
-    apt-get -y update && \
-    apt-get -y install gfortran libopenblas-dev liblapack-dev && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*; \
-    fi
-
-ADD ./pkg/ ${TARGET_DIR}/pkg/
-ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
-
-WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
-
-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
-RUN chgrp -R 0 ${TARGET_DIR} \
-  && chmod -R g+rwX ${TARGET_DIR}
-
-ENTRYPOINT ["python", "main.py"]
@@ -1,4 +0,0 @@
-grpcio>=1.64.1
-protobuf>=4.21.12,<5
-googleapis-common-protos==1.53.0
-numpy==1.25.2
@@ -1,24 +1,32 @@
-FROM python:3.10-slim
+FROM python:3.9

-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV SUGGESTION_DIR cmd/suggestion/skopt/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+ENV GRPC_HEALTH_PROBE_VERSION v0.4.6

-RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*; \
     fi
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe

 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/

 WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN pip install --no-cache-dir -r requirements.txt

-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
 RUN chgrp -R 0 ${TARGET_DIR} \
   && chmod -R g+rwX ${TARGET_DIR}

+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
 ENTRYPOINT ["python", "main.py"]
@ -1,4 +1,4 @@
# Copyright 2022 The Kubeflow Authors.
# Copyright 2021 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import time
from concurrent import futures

import grpc

from pkg.apis.manager.health.python import health_pb2_grpc
import time
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.health.python import health_pb2_grpc
from pkg.suggestion.v1beta1.skopt.service import SkoptService
from concurrent import futures

_ONE_DAY_IN_SECONDS = 60 * 60 * 24
DEFAULT_PORT = "0.0.0.0:6789"
@ -1,13 +1,10 @@
grpcio>=1.64.1
grpcio==1.41.1
cloudpickle==0.5.6
# This is a workaround to avoid the following error.
# AttributeError: module 'numpy' has no attribute 'int'
# See more: https://github.com/numpy/numpy/pull/22607
numpy==1.23.5
scikit-learn>=0.24.0, <=1.3.0
numpy>=1.20.0
scikit-learn>=0.24.0
scipy>=1.5.4
forestci==0.3
protobuf>=4.21.12,<5
protobuf==3.19.1
googleapis-common-protos==1.6.0
scikit-optimize>=0.9.0
cython>=0.29.24
@ -1,56 +1,15 @@
# --- Clone the kubeflow/kubeflow code ---
FROM alpine/git AS fetch-kubeflow-kubeflow
# Build the Katib UI.
FROM node:12.18.1 AS npm-build

WORKDIR /kf
COPY ./pkg/ui/v1beta1/frontend/COMMIT ./
RUN git clone https://github.com/kubeflow/kubeflow.git && \
    COMMIT=$(cat ./COMMIT) && \
    cd kubeflow && \
    git checkout $COMMIT
# Build frontend.
ADD /pkg/ui/v1beta1/frontend /frontend
RUN cd /frontend && npm ci
RUN cd /frontend && npm run build
RUN rm -rf /frontend/node_modules

# --- Build the frontend kubeflow library ---
FROM node:16-alpine AS frontend-kubeflow-lib

WORKDIR /src

ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
    npm config set fetch-retry-maxtimeout 1200000 && \
    npm config get registry && \
    npm config set registry https://registry.npmjs.org/ && \
    npm config delete https-proxy && \
    npm config set loglevel verbose && \
    npm cache clean --force && \
    npm ci --force --prefer-offline --no-audit

COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
RUN npm run build

# --- Build the frontend ---
FROM node:16-alpine AS frontend

WORKDIR /src
COPY ./pkg/ui/v1beta1/frontend/package*.json ./
RUN npm config set fetch-retry-mintimeout 200000 && \
    npm config set fetch-retry-maxtimeout 1200000 && \
    npm config get registry && \
    npm config set registry https://registry.npmjs.org/ && \
    npm config delete https-proxy && \
    npm config set loglevel verbose && \
    npm cache clean --force && \
    npm ci --force --prefer-offline --no-audit

COPY ./pkg/ui/v1beta1/frontend/ .
COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/

RUN npm run build:prod

# --- Build the backend ---
# Build backend.
FROM golang:alpine AS go-build

ARG TARGETARCH

WORKDIR /go/src/github.com/kubeflow/katib

# Download packages.

@ -63,11 +22,17 @@ COPY cmd/ cmd/
COPY pkg/ pkg/

# Build the binary.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/ui/v1beta1
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/ui/v1beta1; \
    elif [ "$(uname -m)" = "aarch64" ]; then \
    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
    else \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
    fi

# --- Compose the web app ---
# Copy the backend and frontend into a thin image.
FROM alpine:3.15
WORKDIR /app
COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
COPY --from=frontend /src/dist/static /app/build/static/
COPY --from=npm-build /frontend/build /app/build
ENTRYPOINT ["./katib-ui"]
@ -1,5 +1,5 @@
/*
Copyright 2022 The Kubeflow Authors.
Copyright 2021 The Kubeflow Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -33,7 +33,7 @@ var (
)

func init() {
	port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
	port = flag.String("port", "80", "The port to listen to for incoming HTTP connections")
	host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
	buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
	dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")

@ -45,17 +45,17 @@ func main() {

	log.Printf("Serving the frontend dir %s", *buildDir)
	frontend := http.FileServer(http.Dir(*buildDir))
	http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
	http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
	http.Handle("/katib/", http.StripPrefix("/katib/", frontend))

	http.HandleFunc("/katib/fetch_experiments/", kuh.FetchExperiments)
	http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)

	http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
	http.HandleFunc("/katib/submit_yaml/", kuh.SubmitYamlJob)
	http.HandleFunc("/katib/submit_hp_job/", kuh.SubmitParamsJob)
	http.HandleFunc("/katib/submit_nas_job/", kuh.SubmitParamsJob)

	http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)

	http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
	http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
	http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)

	http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)

@ -67,7 +67,6 @@ func main() {
	http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
	http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
	http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
	http.HandleFunc("/katib/fetch_trial_logs/", kuh.FetchTrialLogs)

	log.Printf("Serving at %s:%s", *host, *port)
	if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
@ -1,13 +0,0 @@
#!/bin/sh

# Run conformance test and generate test report.
python test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py --experiment-path examples/v1beta1/hp-tuning/random.yaml --namespace kf-conformance \
    --trial-pod-labels '{"sidecar.istio.io/inject": "false"}' | tee /tmp/katib-conformance.log

# Create the done file.
touch /tmp/katib-conformance.done
echo "Done..."

# Keep the container running so the test logs can be downloaded.
while true; do sleep 10000; done
@ -1,5 +0,0 @@
# Katib Documentation

Welcome to Kubeflow Katib!

The Katib documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/katib/).
@ -0,0 +1,131 @@
# Developer Guide

This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following user guides:

- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)
  in Katib, hyperparameter tuning, and neural architecture search.
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- Detailed guide to [configuring and running a Katib
  experiment](https://kubeflow.org/docs/components/katib/experiment/).

## Requirements

- [Go](https://golang.org/) (1.17 or later)
- [Docker](https://docs.docker.com/) (20.10 or later)
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
- [Python](https://www.python.org/) (3.9 or later)
- [kustomize](https://kustomize.io/) (4.0.5 or later)

## Build from source code

You can build all Katib component images from the source code as follows:

```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```

To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/katib-config.yaml).

You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:

```bash
make deploy
```

You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:

```bash
make undeploy
```

## Modify controller APIs

If you want to modify the Katib controller APIs, you have to
generate deepcopy, clientset, listers, informers, open-api and the Python SDK with the changed APIs.
You can update the necessary files as follows:

```bash
make generate
```

## Controller Flags

Below is a list of command-line flags accepted by the Katib controller:

| Name                            | Type                      | Default                       | Description                                                                                                            |
| ------------------------------- | ------------------------- | ----------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| enable-grpc-probe-in-suggestion | bool                      | true                          | Enable grpc probe in suggestions                                                                                       |
| experiment-suggestion-name      | string                    | "default"                     | The implementation of suggestion interface in experiment controller                                                    |
| metrics-addr                    | string                    | ":8080"                       | The address the metric endpoint binds to                                                                               |
| trial-resources                 | []schema.GroupVersionKind | null                          | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org) |
| webhook-inject-securitycontext  | bool                      | false                         | Inject the securityContext of container[0] in the sidecar                                                              |
| webhook-port                    | int                       | 8443                          | The port number to be used for admission webhook server                                                                |
| enable-leader-election          | bool                      | false                         | Enable leader election for katib-controller. Enabling this will ensure there is only one active katib-controller.      |
| leader-election-id              | string                    | "3fbc96e9.katib.kubeflow.org" | The ID for leader election.                                                                                            |

## Workflow design

Please see [workflow-design.md](./workflow-design.md).

## Katib admission webhooks

Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).

1. `validator.experiment.katib.kubeflow.org` -
   [Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
   to validate the Katib Experiment before the creation.

1. `defaulter.experiment.katib.kubeflow.org` -
   [Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
   to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
   in the Katib Experiment before the creation.

1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
   collector sidecar container into the training pod. Learn more about Katib's
   metrics collector in the
   [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).

You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).

**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying the firewall rule and you have to update the master
plane CIDR source range to use the Katib webhooks.

### Katib cert generator

Katib uses the custom `cert-generator` [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
to generate certificates for the webhooks.

Once Katib is deployed in the Kubernetes cluster, the `cert-generator` Job follows these steps:

- Generate the self-signed CA certificate and private key.

- Generate the public certificate and private key signed with the key generated in the previous step.

- Create a Kubernetes Secret with the signed certificate. The Secret has
  the `katib-webhook-cert` name and the `cert-generator` Job's `ownerReference` to
  clean up resources once Katib is uninstalled.

  Once the Secret is created, the Katib controller Deployment spawns the Pod,
  since the controller has the `katib-webhook-cert` Secret volume.

- Patch the webhooks with the `CABundle`.

You can find the `cert-generator` source code [here](../cmd/cert-generator/v1beta1).

## Implement a new algorithm and use it in Katib

Please see [new-algorithm-service.md](./new-algorithm-service.md).

## Katib UI documentation

Please see [Katib UI README](https://github.com/kubeflow/katib/tree/master/pkg/ui/v1beta1).

## Design proposals

Please see [proposals](./proposals).
@ -5,7 +5,7 @@ Here you can find the location for images that are used in Katib.
## Katib Components Images

The following table shows images for the
[Katib components](https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components).
[Katib components](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components).

<table>
<tbody>

@ -22,7 +22,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/katib-controller</code>
<code>docker.io/kubeflowkatib/katib-controller</code>
</td>
<td>
Katib Controller

@ -33,7 +33,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/katib-ui</code>
<code>docker.io/kubeflowkatib/katib-ui</code>
</td>
<td>
Katib User Interface

@ -44,7 +44,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/katib-db-manager</code>
<code>docker.io/kubeflowkatib/katib-db-manager</code>
</td>
<td>
Katib DB Manager

@ -64,13 +64,24 @@ The following table shows images for the
<a href="https://github.com/docker-library/mysql/blob/c506174eab8ae160f56483e8d72410f8f1e1470f/8.0/Dockerfile.debian">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/cert-generator</code>
</td>
<td>
Katib Cert Generator
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/cert-generator/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
</tbody>
</table>

## Katib Metrics Collectors Images

The following table shows images for the
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).

<table>
<tbody>

@ -87,7 +98,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/file-metrics-collector</code>
<code>docker.io/kubeflowkatib/file-metrics-collector</code>
</td>
<td>
File Metrics Collector

@ -98,7 +109,7 @@ The following table shows images for the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/tfevent-metrics-collector</code>
<code>docker.io/kubeflowkatib/tfevent-metrics-collector</code>
</td>
<td>
Tensorflow Event Metrics Collector

@ -113,8 +124,8 @@ The following table shows images for the
## Katib Suggestions and Early Stopping Images

The following table shows images for the
[Katib Suggestion services](https://www.kubeflow.org/docs/components/katib/reference/architecture/#suggestion)
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/user-guides/early-stopping/#early-stopping-algorithms).
[Katib Suggestions](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/early-stopping/).

<table>
<tbody>

@ -131,7 +142,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperopt</code>
<code>docker.io/kubeflowkatib/suggestion-hyperopt</code>
</td>
<td>
<a href="https://github.com/hyperopt/hyperopt">Hyperopt</a> Suggestion

@ -142,7 +153,18 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-skopt</code>
<code>docker.io/kubeflowkatib/suggestion-chocolate</code>
</td>
<td>
<a href="https://github.com/AIworx-Labs/chocolate">Chocolate</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/chocolate/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/suggestion-skopt</code>
</td>
<td>
<a href="https://github.com/scikit-optimize/scikit-optimize">Skopt</a> Suggestion

@ -153,7 +175,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-optuna</code>
<code>docker.io/kubeflowkatib/suggestion-optuna</code>
</td>
<td>
<a href="https://github.com/optuna/optuna">Optuna</a> Suggestion

@ -164,7 +186,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-goptuna</code>
<code>docker.io/kubeflowkatib/suggestion-goptuna</code>
</td>
<td>
<a href="https://github.com/c-bata/goptuna">Goptuna</a> Suggestion

@ -175,7 +197,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperband</code>
<code>docker.io/kubeflowkatib/suggestion-hyperband</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">Hyperband</a> Suggestion

@ -186,7 +208,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-enas</code>
<code>docker.io/kubeflowkatib/suggestion-enas</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#enas">ENAS</a> Suggestion

@ -197,7 +219,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-darts</code>
<code>docker.io/kubeflowkatib/suggestion-darts</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a> Suggestion

@ -208,7 +230,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/earlystopping-medianstop</code>
<code>docker.io/kubeflowkatib/earlystopping-medianstop</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stopping Rule</a>

@ -223,7 +245,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen
## Training Containers Images

The following table shows images for training containers which are used in the
[Katib Trials](https://www.kubeflow.org/docs/components/katib/reference/architecture/#trial).
[Katib Trials](https://www.kubeflow.org/docs/components/katib/experiment/#packaging-your-training-code-in-a-container-image).

<table>
<tbody>

@ -240,29 +262,29 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-cpu</code>
<code>docker.io/kubeflowkatib/mxnet-mnist</code>
</td>
<td>
PyTorch MNIST example with printing metrics to the file or StdOut with CPU support
MXNet MNIST example with collecting metrics time
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu">Dockerfile</a>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/mxnet-mnist/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-gpu</code>
<code>docker.io/kubeflowkatib/pytorch-mnist</code>
</td>
<td>
PyTorch MNIST example with printing metrics to the file or StdOut with GPU support
PyTorch MNIST example with printing metrics to the file or StdOut
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu">Dockerfile</a>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/tf-mnist-with-summaries</code>
<code>docker.io/kubeflowkatib/tf-mnist-with-summaries</code>
</td>
<td>
Tensorflow MNIST example with saving metrics in the summaries

@ -273,7 +295,18 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/xgboost-lightgbm</code>
<code>docker.io/bytepsimage/mxnet</code>
</td>
<td>
Distributed BytePS example for MXJob
</td>
<td>
<a href="https://github.com/bytedance/byteps/blob/v0.2.5/docker/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflowkatib/xgboost-lightgbm</code>
</td>
<td>
Distributed LightGBM example for XGBoostJob

@ -306,7 +339,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-gpu</code>
<code>docker.io/kubeflowkatib/enas-cnn-cifar10-gpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with GPU support

@ -317,7 +350,7 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-cpu</code>
<code>docker.io/kubeflowkatib/enas-cnn-cifar10-cpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with CPU support

@ -328,24 +361,13 @@ The following table shows images for training containers which are used in the
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-gpu</code>
<code>docker.io/kubeflowkatib/darts-cnn-cifar10</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with GPU support
PyTorch CIFAR-10 CNN example for DARTS
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-cpu</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with CPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu">Dockerfile</a>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile">Dockerfile</a>
</td>
</tr>
</table>
Binary file not shown. After: 102 KiB
Binary file not shown. After: 192 KiB
Before: 166 KiB, After: 166 KiB
@ -0,0 +1,191 @@
# Document about how to add a new algorithm in Katib

## Implement a new algorithm and use it in Katib

### Implement the algorithm

The design of Katib follows the `ask-and-tell` pattern:

> They often follow a pattern a bit like this: 1. ask for a new set of parameters 1. walk to the Experiment and program in the new parameters 1. observe the outcome of running the Experiment 1. walk back to your laptop and tell the optimizer about the outcome 1. go to step 1

When an Experiment is created, one algorithm service is created. Katib then asks for new sets of parameters via the `GetSuggestions` GRPC call, creates new trials according to those sets, and observes the outcome. When the trials are finished, Katib tells the algorithm the metrics of the finished trials and asks for more new sets.
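The ask-and-tell loop described above can be sketched in plain Python. Everything in this sketch (`Optimizer`, `run_trial`, the single `lr` parameter and its range) is a hypothetical stand-in for illustration, not a Katib API:

```python
import random


class Optimizer:
    """Toy ask-and-tell optimizer: random search over one parameter."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.history = []  # (params, outcome) pairs told back to us

    def ask(self):
        # 1. Propose a new set of parameters.
        return {"lr": random.uniform(self.low, self.high)}

    def tell(self, params, outcome):
        # 4. Record the observed outcome for a proposed set.
        self.history.append((params, outcome))


def run_trial(params):
    # Stand-in for running an Experiment trial; returns a metric to maximize.
    return -(params["lr"] - 0.1) ** 2


opt = Optimizer(0.0, 1.0)
for _ in range(5):
    params = opt.ask()            # ask for new parameters
    outcome = run_trial(params)   # run the trial, observe the outcome
    opt.tell(params, outcome)     # tell the optimizer, then repeat

best_params, best_outcome = max(opt.history, key=lambda h: h[1])
```

A real suggestion service plays the `Optimizer` role, while Katib itself drives the loop and runs the trials.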
The new algorithm needs to implement the `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1beta1/api.proto). One sample algorithm looks like:

```python
from pkg.apis.manager.v1beta1.python import api_pb2
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.suggestion.v1beta1.internal.search_space import HyperParameter, HyperParameterSearchSpace
from pkg.suggestion.v1beta1.internal.trial import Trial, Assignment
from pkg.suggestion.v1beta1.hyperopt.base_service import BaseHyperoptService
from pkg.suggestion.v1beta1.base_health_service import HealthServicer


# Inherit SuggestionServicer and implement GetSuggestions.
class HyperoptService(
        api_pb2_grpc.SuggestionServicer, HealthServicer):

    def ValidateAlgorithmSettings(self, request, context):
        # Optional, it is used to validate algorithm settings defined by users.
        pass

    def GetSuggestions(self, request, context):
        # Convert the Experiment in GRPC request to the search space.
        # search_space example:
        #   HyperParameterSearchSpace(
        #       goal: MAXIMIZE,
        #       params: [HyperParameter(name: param-1, type: INTEGER, min: 1, max: 5, step: 0),
        #                HyperParameter(name: param-2, type: CATEGORICAL, list: cat1, cat2, cat3),
        #                HyperParameter(name: param-3, type: DISCRETE, list: 3, 2, 6),
        #                HyperParameter(name: param-4, type: DOUBLE, min: 1, max: 5, step: )]
        #   )
        search_space = HyperParameterSearchSpace.convert(request.experiment)
        # Convert the trials in GRPC request to the trials in algorithm side.
        # trials example:
        #   [Trial(
        #       assignment: [Assignment(name=param-1, value=2),
        #                    Assignment(name=param-2, value=cat1),
        #                    Assignment(name=param-3, value=2),
        #                    Assignment(name=param-4, value=3.44)],
        #       target_metric: Metric(name="metric-2" value="5643"),
        #       additional_metrics: [Metric(name=metric-1, value=435),
        #                            Metric(name=metric-3, value=5643)]),
        #    Trial(
        #       assignment: [Assignment(name=param-1, value=3),
        #                    Assignment(name=param-2, value=cat2),
        #                    Assignment(name=param-3, value=6),
        #                    Assignment(name=param-4, value=4.44)],
        #       target_metric: Metric(name="metric-2" value="3242"),
        #       additional_metrics: [Metric(name=metric-1, value=123),
        #                            Metric(name=metric-3, value=543)])]
        trials = Trial.convert(request.trials)
        # --------------------------------------------------------------
        # Your code here
        # Implement the logic to generate new assignments for the given current request number.
        # For example, if request.current_request_number is 2, you should return:
        # [
        #   [Assignment(name=param-1, value=3),
        #    Assignment(name=param-2, value=cat2),
        #    Assignment(name=param-3, value=3),
        #    Assignment(name=param-4, value=3.22)],
        #   [Assignment(name=param-1, value=4),
        #    Assignment(name=param-2, value=cat4),
        #    Assignment(name=param-3, value=2),
        #    Assignment(name=param-4, value=4.32)],
        # ]
        list_of_assignments = your_logic(search_space, trials, request.current_request_number)
        # --------------------------------------------------------------
        # Convert list_of_assignments to GetSuggestionsReply.
        return api_pb2.GetSuggestionsReply(
            trials=Assignment.generate(list_of_assignments)
        )
```
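For illustration, the `your_logic` step above could be a plain random sampler. The `HyperParameter` and `Assignment` classes below are simplified stand-ins for Katib's internal types, not the real ones:

```python
import random
from dataclasses import dataclass, field


@dataclass
class HyperParameter:
    name: str
    type: str                 # "INTEGER", "DOUBLE", "CATEGORICAL" or "DISCRETE"
    min: float = 0
    max: float = 0
    list: tuple = ()          # choices for CATEGORICAL / DISCRETE parameters


@dataclass
class Assignment:
    name: str
    value: object


def your_logic(params, trials, current_request_number):
    """Return `current_request_number` lists of assignments, sampled at random."""
    list_of_assignments = []
    for _ in range(current_request_number):
        assignments = []
        for p in params:
            if p.type == "INTEGER":
                value = random.randint(int(p.min), int(p.max))
            elif p.type == "DOUBLE":
                value = random.uniform(p.min, p.max)
            else:  # CATEGORICAL or DISCRETE: pick one of the listed values
                value = random.choice(p.list)
            assignments.append(Assignment(p.name, value))
        list_of_assignments.append(assignments)
    return list_of_assignments


space = [
    HyperParameter("param-1", "INTEGER", 1, 5),
    HyperParameter("param-2", "CATEGORICAL", list=("cat1", "cat2", "cat3")),
    HyperParameter("param-4", "DOUBLE", 1, 5),
]
suggestions = your_logic(space, trials=[], current_request_number=2)
```

A real suggestion algorithm would use the finished `trials` to steer the next suggestions instead of ignoring them.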
|
||||
### Make a GRPC server for the algorithm
|
||||
|
||||
Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main function and Dockerfile. The new GRPC server should serve in port 6789.
|
||||
|
||||
Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt).
|
||||
Then build the Docker image.
|
||||
|
||||
### Use the algorithm in Katib

Update the [Katib config](../manifests/v1beta1/components/controller/katib-config.yaml)
and [Katib config patch](../manifests/v1beta1/installs/katib-standalone/katib-config-patch.yaml)
with the new algorithm entity:

```diff
  suggestion: |-
    {
      "tpe": {
        "image": "docker.io/kubeflowkatib/suggestion-hyperopt"
      },
      "random": {
        "image": "docker.io/kubeflowkatib/suggestion-hyperopt"
      },
+     "<new-algorithm-name>": {
+       "image": "image built in the previous stage"
+     }
    }
```

Learn more about Katib config in the
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/katib-config/).
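For orientation, the `suggestion` field above is part of the `katib-config` ConfigMap. A trimmed sketch of how the new entry sits in context is shown below; the ConfigMap name and namespace follow the standard Katib install, and the image reference is a placeholder you substitute with the image built in the previous step:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  suggestion: |-
    {
      "<new-algorithm-name>": {
        "image": "docker.io/<your-repo>/suggestion-<new-algorithm-name>:latest"
      }
    }
```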

### Contribute the algorithm to Katib

If you want to contribute the algorithm to Katib, you can add unit tests and/or
e2e tests for it in the CI and submit a PR.

#### Unit Test

Here is an example [test_hyperopt_service.py](../test/unit/v1beta1/suggestion/test_hyperopt_service.py):

```python
import grpc
import grpc_testing
import unittest

from pkg.apis.manager.v1beta1.python import api_pb2_grpc
from pkg.apis.manager.v1beta1.python import api_pb2

from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService


class TestHyperopt(unittest.TestCase):
    def setUp(self):
        servicers = {
            api_pb2.DESCRIPTOR.services_by_name['Suggestion']: HyperoptService()
        }

        self.test_server = grpc_testing.server_from_dictionary(
            servicers, grpc_testing.strict_real_time())


if __name__ == '__main__':
    unittest.main()
```

You can set up the GRPC server using `grpc_testing`, then define your own test cases.

#### E2E Test (Optional)

E2E tests help Katib verify that the algorithm works well.
Follow the steps below to add your algorithm (Suggestion) to the Katib CI
(replace `<name>` with your Suggestion name):

1. Submit a PR to add a new ECR private registry to the AWS
   [`ECR_Private_Registry_List`](https://github.com/kubeflow/testing/blob/master/aws/IaC/CDK/test-infra/config/static_config/ECR_Resources.py#L18).
   The registry name should follow the pattern: `katib/v1beta1/suggestion-<name>`.

1. Create a new Experiment YAML in [examples/v1beta1](../examples/v1beta1)
   with the new algorithm.

1. Update the [`setup-katib.sh`](../test/e2e/v1beta1/scripts/setup-katib.sh)
   script to modify `katib-config.yaml` with the new test Suggestion image name.
   For example:

   ```sh
   sed -i -e "s@docker.io/kubeflowkatib/suggestion-<name>@${ECR_REGISTRY}/${REPO_NAME}/v1beta1/suggestion-<name>@" ${CONFIG_PATCH}
   ```

1. Update the following variables in [`argo_workflow.py`](../test/e2e/v1beta1/argo_workflow.py):

   - [`KATIB_IMAGES`](../test/e2e/v1beta1/argo_workflow.py#L43) with your Suggestion Dockerfile location:

     ```diff
       . . .
       "suggestion-goptuna": "cmd/suggestion/goptuna/v1beta1/Dockerfile",
       "suggestion-optuna": "cmd/suggestion/optuna/v1beta1/Dockerfile",
     + "suggestion-<name>": "cmd/suggestion/<name>/v1beta1/Dockerfile",
       . . .
     ```

   - [`KATIB_EXPERIMENTS`](../test/e2e/v1beta1/argo_workflow.py#L69) with your Experiment YAML location:

     ```diff
       . . .
       "multivariate-tpe": "examples/v1beta1/hp-tuning/multivariate-tpe.yaml",
       "cmaes": "examples/v1beta1/hp-tuning/cma-es.yaml",
     + "<algorithm-name>": "examples/v1beta1/hp-tuning/<algorithm-name>.yaml",
       . . .
     ```

Below is the list of Katib presentations and demos. If you want to add your
presentation or demo to this list, please send a pull request. Please keep the
list in reverse chronological order.

| Title | Presenters | Event | Date |
| --- | --- | --- | --- |
| [Hiding Kubernetes Complexity for ML Engineers Using Kubeflow](https://docs.google.com/presentation/d/1Fepo9TUgbsO7YpxenCq17Y9KKQU_VgqYjAVBFWAFIU4/edit?usp=sharing) | Andrey Velichkevich | RE-WORK MLOps Summit | 2022-11-10 |
| [Managing Thousands of Automatic Machine Learning Experiments with Argo and Katib](https://youtu.be/0jBNXZjQ01I) | Andrey Velichkevich, [Yuan Tang](https://terrytangyuan.github.io/about/) | ArgoCon | 2022-09-21 |
| [Cloud Native AutoML with Argo Workflows and Katib](https://youtu.be/KjHqmS4gIxM?t=181) | Andrey Velichkevich, Johnu George | Argo Community Meeting | 2022-02-16 |
| [When Machine Learning Toolkit for Kubernetes Meets PaddlePaddle](https://github.com/terrytangyuan/public-talks/tree/main/talks/when-machine-learning-toolkit-for-kubernetes-meets-paddlepaddle-wave-summit-2021) | [Yuan Tang](https://terrytangyuan.github.io/about/) | Wave Summit | 2021-12-12 |
| [Bridging into Python Ecosystem with Cloud-Native Distributed Machine Learning Pipelines](https://github.com/terrytangyuan/public-talks/tree/main/talks/bridging-into-python-ecosystem-with-cloud-native-distributed-machine-learning-pipelines-argocon-2021) | [Yuan Tang](https://terrytangyuan.github.io/about/) | ArgoCon | 2021-12-08 |
| [Towards Cloud-Native Distributed Machine Learning Pipelines at Scale](https://github.com/terrytangyuan/public-talks/tree/main/talks/towards-cloud-native-distributed-machine-learning-pipelines-at-scale-pydata-global-2021) | [Yuan Tang](https://terrytangyuan.github.io/about/) | PyData | 2021-10-29 |
| [AutoML and Training WG Summit July 2021](https://youtube.com/playlist?list=PL2gwy7BdKoGd9HQBCz1iC7vyFVN7Wa9N2) | Kubeflow Community | Kubeflow Summit | 2021-07-16 |
| [MLOps and AutoML in Cloud-Native Way with Kubeflow and Katib](https://youtu.be/33VJ6KNBBvU) | Andrey Velichkevich | MLREPA | 2021-04-25 |
| [A Tour of Katib's new UI for Kubeflow 1.3](https://youtu.be/1DtjB_boWcQ) | Kimonas Sotirchos | Kubeflow Community Meeting | 2021-03-30 |
| [New UI for Kubeflow components](https://youtu.be/OKqx3IS2_G4) | Stefano Fioravanzo | Kubeflow Community Meeting | 2020-12-08 |
| [Using Pipelines in Katib](https://youtu.be/BszcHMkGLgc) | Andrey Velichkevich | Kubeflow Community Meeting | 2020-11-10 |
| [From Notebook to Kubeflow Pipelines with HP Tuning](https://youtu.be/QK0NxhyADpM) | Stefano Fioravanzo, Ilias Katsakioris | KubeCon | 2020-09-04 |
| [Distributed Training and HPO Deep Dive](https://youtu.be/KJFOlhD3L1E) | Andrew Butler, Qianyang Yu, Tommy Li, Animesh Singh | Kubeflow Dojo | 2020-07-17 |
| [Hyperparameter Tuning with Katib](https://youtu.be/nIKVlosDvrc) | Stephanie Wong | Kubeflow 101 | 2020-06-21 |
| [Hyperparameter Tuning Using Kubeflow](https://youtu.be/OkAoiA6A2Ac) | Richard Liu, Johnu George | KubeCon | 2019-07-05 |
| [Kubeflow Katib & Hyperparameter Tuning](https://youtu.be/1PKH_D6zjoM) | Richard Liu | Kubeflow Community Meeting | 2019-03-29 |
| [Neural Architecture Search System on Kubeflow](https://youtu.be/WAK37UW7spo) | Andrey Velichkevich, Kirill Prosvirov, Jinan Zhou, Anubhav Garg | Kubeflow Community Meeting | 2019-03-26 |
# KEP-2044: Conformance Test for AutoML and Training Working Group

Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
2022-11-21
[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).

## Motivation

The Kubeflow community needs to design a conformance program so that distributions can become
[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
We should design the same program for the AutoML and Training WG.

This document is based on the original proposal for
[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).

## Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

- The tests should be fully automated and executable by anyone who has public
  access to the Kubeflow repository.
- The test results should be easy to verify by the Kubeflow Conformance Committee.
- The tests should not depend on a cloud provider (e.g. AWS or GCP).
- The tests should cover basic functionality of Katib and the Training Operator.
  They will not cover all features.
- The tests are expected to evolve in future versions.
- The tests should have a well-documented and short list of set-up requirements.
- The tests should install and complete in a relatively short period of time
  with the suggested minimum infrastructure requirements
  (e.g. 3 nodes, 24 vCPU, 64 GB RAM, 500 GB disk).

## Kubeflow Conformance

Initially, the Kubeflow conformance will include the CRD-based tests.
In the future, API-based and UI-based tests may be added. Kubeflow conformance consists
of three categories of tests:

- CRD-based tests

  Most of the Katib and Training Operator functionality is based on Kubernetes CRDs.

  **This document will define a design for CRD-based tests for Katib and the Training Operator.**

- API-based tests

  Currently, neither Katib nor the Training Operator has an API server that receives
  requests from users. However, Katib has the DB Manager component, which is
  responsible for writing and reading ML training metrics.

  In the following versions, we should design a conformance program for the
  Katib API-based tests.

- UI-based tests

  UI tests are valuable but complex to design, document, and execute. In the following
  versions, we should design a conformance program for the Katib UI-based tests.

## Design for the CRD-based tests



The design is similar to the KFP conformance program for the API-based tests.

For Katib, tests will be based on
[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.

For the Training Operator, tests will be based on [the SDK e2e tests](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e).

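To make the CRD-based idea concrete, here is a hedged Python sketch of the success check such a test ultimately performs on an Experiment's status. The condition layout follows the common Kubernetes convention; this is illustrative, not the actual `run-e2e-experiment.go` logic:

```python
# A sketch of the pass/fail check a CRD-based conformance test can apply to an
# Experiment's status subresource. Condition fields follow the usual Kubernetes
# convention (type/status/message); this is NOT the actual test script.
def experiment_succeeded(status: dict) -> bool:
    """Return True if the most recent condition marks the Experiment Succeeded."""
    conditions = status.get("conditions", [])
    if not conditions:
        return False
    last = conditions[-1]
    return last.get("type") == "Succeeded" and last.get("status") == "True"


# Example status, shaped like the success message Katib reports.
status = {
    "conditions": [
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True",
         "message": "Experiment has succeeded because max trial count has reached"},
    ]
}
print(experiment_succeeded(status))  # True
```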
### Test Workflow

All tests will be run in the _kf-conformance_ namespace inside a separate container.
That will help avoid environment variance and improve fault tolerance. A driver is required to trigger the deployment and download the results.

- We are going to use
  [the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
  for all Kubeflow conformance tests. Distributions (the _driver_ on the diagram)
  need to run the following Makefile commands:

  ```makefile
  # Run the conformance program.
  run: setup run-katib run-training-operator

  # Set up the Kubernetes resources (Kubeflow Profile, RBAC) needed to run the tests.
  # Create a temporary folder for the conformance report.
  setup:
  	kubectl apply -f ./setup.yaml
  	mkdir -p /tmp/kf-conformance

  # Create deployments and run the e2e tests for Katib and the Training Operator.
  run-katib:
  	kubectl apply -f ./katib-conformance.yaml

  run-training-operator:
  	kubectl apply -f ./training-operator-conformance.yaml

  # Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
  report:
  	./report-conformance.sh

  # Clean up created resources and directories.
  cleanup:
  	kubectl delete -f ./setup.yaml
  	kubectl delete -f ./katib-conformance.yaml
  	kubectl delete -f ./training-operator-conformance.yaml
  	rm -rf /tmp/kf-conformance
  ```

- The Katib and Training Operator conformance deployments will have the appropriate
  RBAC to create, read, and delete Katib Experiments and Training Operator jobs in the
  _kf-conformance_ namespace.

- The distribution should have internet access to download the training datasets
  (e.g. MNIST) while running the tests.

- When the job is finished, the script generates output.

  For a Katib Experiment, the output should be as follows:

  ```
  Test 1 - passed.
  Experiment name: random-search
  Experiment status: Experiment has succeeded because max trial count has reached
  ```

  For the Training Operator, the output should be as follows:

  ```
  Test 1 - passed.
  TFJob name: tfjob-mnist
  TFJob status: TFJob tfjob-mnist is successfully completed.
  ```

- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distributions will create a PR
  to publish the reports and to update the appropriate [Kubeflow documentation](https://www.kubeflow.org/)
  on conformant Kubeflow distributions. The Kubeflow Conformance Committee will
  verify it and make the distribution
  [Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
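The two report formats above differ only in the resource kind, so a single small formatter could produce both. A sketch follows; the function and field names are illustrative, not part of the actual conformance tooling:

```python
# Format one conformance report entry in the shape shown above.
# `kind` is e.g. "Experiment" or "TFJob"; all names here are illustrative.
def format_report(test_number, kind, name, status, passed):
    result = "passed" if passed else "failed"
    return (f"Test {test_number} - {result}.\n"
            f"{kind} name: {name}\n"
            f"{kind} status: {status}")


print(format_report(
    1, "Experiment", "random-search",
    "Experiment has succeeded because max trial count has reached", True))
```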