mirror of https://github.com/kubeflow/katib.git
Compare commits
1 Commit
| Author | SHA1 | Date |
| --- | --- | --- |
|  | f009546d9d |  |
@@ -4,3 +4,5 @@ docs
 manifests
 pkg/ui/*/frontend/node_modules
 pkg/ui/*/frontend/build
+pkg/new-ui/*/frontend/node_modules
+pkg/new-ui/*/frontend/build

4  .flake8

@@ -1,4 +0,0 @@
[flake8]
max-line-length = 100
# E203 is ignored to avoid conflicts with Black's formatting, as it's not PEP 8 compliant
extend-ignore = W503, E203

@@ -0,0 +1,25 @@
---
name: Bug report
about: Tell us about a problem you are experiencing

---

/kind bug

**What steps did you take and what happened:**
[A clear and concise description of what the bug is.]


**What did you expect to happen:**


**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]


**Environment:**

- Kubeflow version (`kfctl version`):
- Minikube version (`minikube version`):
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

@@ -1,50 +0,0 @@
name: Bug Report
description: Tell us about a problem you are experiencing with Katib
labels: ["kind/bug", "lifecycle/needs-triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this Katib bug report!
  - type: textarea
    id: problem
    attributes:
      label: What happened?
      description: |
        Please provide as much info as possible. Not doing so may result in your bug not being
        addressed in a timely manner.
    validations:
      required: true
  - type: textarea
    id: expected
    attributes:
      label: What did you expect to happen?
    validations:
      required: true
  - type: textarea
    id: environment
    attributes:
      label: Environment
      value: |
        Kubernetes version:
        ```bash
        $ kubectl version

        ```
        Katib controller version:
        ```bash
        $ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"

        ```
        Katib Python SDK version:
        ```bash
        $ pip show kubeflow-katib

        ```
    validations:
      required: true
  - type: input
    id: votes
    attributes:
      label: Impacted by this bug?
      value: Give it a 👍 We prioritize the issues with most 👍

@@ -1,12 +0,0 @@
blank_issues_enabled: true

contact_links:
  - name: Katib Documentation
    url: https://www.kubeflow.org/docs/components/katib/
    about: Much help can be found in the docs
  - name: Kubeflow Katib Slack Channel
    url: https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels
    about: Ask the Katib community on CNCF Slack
  - name: Kubeflow Katib Community Meeting
    url: https://bit.ly/2PWVCkV
    about: Join the Kubeflow AutoML working group meeting

@@ -0,0 +1,14 @@
---
name: Feature enhancement request
about: Suggest an idea for this project

---

/kind feature

**Describe the solution you'd like**
[A clear and concise description of what you want to happen.]


**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]

@@ -1,28 +0,0 @@
name: Feature Request
description: Suggest an idea for Katib
labels: ["kind/feature", "lifecycle/needs-triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this Katib feature request!
  - type: textarea
    id: feature
    attributes:
      label: What you would like to be added?
      description: |
        A clear and concise description of what you want to add to Katib.
        Please consider to write Katib enhancement proposal if it is a large feature request.
    validations:
      required: true
  - type: textarea
    id: rationale
    attributes:
      label: Why is this needed?
    validations:
      required: true
  - type: input
    id: votes
    attributes:
      label: Love this feature?
      value: Give it a 👍 We prioritize the features with most 👍

@@ -1,14 +1,25 @@
 <!-- Thanks for sending a pull request! Here are some tips for you:
-1. If this is your first time, check our contributor guidelines https://www.kubeflow.org/docs/about/contributing
+1. If this is your first time, read our contributor guidelines https://git.k8s.io/community/contributors/guide/pull-requests.md#the-pull-request-submit-process and developer guide https://git.k8s.io/community/contributors/devel/development.md#development-guide
-2. To know more about Katib components, check developer guide https://github.com/kubeflow/katib/blob/master/CONTRIBUTING.md
+2. If you want *faster* PR reviews, read how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
-3. If you want *faster* PR reviews, check how: https://git.k8s.io/community/contributors/guide/pull-requests.md#best-practices-for-faster-reviews
+3. Follow the instructions for writing a release note: https://git.k8s.io/community/contributors/guide/release-notes.md
+4. If the PR is unfinished, see how to mark it: https://git.k8s.io/community/contributors/guide/pull-requests.md#marking-unfinished-pull-requests
+5. If this PR changes image versions, please title this PR "Bump <image name> from x.x.x to y.y.y."
 -->

 **What this PR does / why we need it**:

-**Which issue(s) this PR fixes** _(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)_:
+**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
 Fixes #

-**Checklist:**
+**Special notes for your reviewer**:

-- [ ] [Docs](https://www.kubeflow.org/docs/components/katib/) included if any changes are user facing
+1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

+**Release note**:
+<!-- Write your release note:
+1. Enter your extended release note in the below block. If the PR requires additional action from users switching to the new release, include the string "action required".
+2. If no release note is required, just write "NONE".
+-->
+```release-note

+```

@@ -0,0 +1,20 @@
# Configuration for stale probot https://probot.github.io/apps/stale/

# Number of days of inactivity before an issue becomes stale
daysUntilStale: 90
# Number of days of inactivity before a stale issue is closed
daysUntilClose: 20
# Issues with these labels will never be considered stale
exemptLabels:
  - lifecycle/frozen
# Label to use when marking an issue as stale
staleLabel: lifecycle/stale
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
  recent activity. It will be closed if no further activity occurs. Thank you
  for your contributions.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: >
  This issue has been automatically closed because it has not had recent
  activity. Please comment "/reopen" to reopen it.

@@ -1,81 +0,0 @@
# Reusable workflows for publishing Katib images.
name: Build and Publish Images

on:
  workflow_call:
    inputs:
      component-name:
        required: true
        type: string
      platforms:
        required: true
        type: string
      dockerfile:
        required: true
        type: string
    secrets:
      DOCKERHUB_USERNAME:
        required: false
      DOCKERHUB_TOKEN:
        required: false

jobs:
  build-and-publish:
    name: Build and Publish Images
    runs-on: ubuntu-22.04

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set Publish Condition
        id: publish-condition
        shell: bash
        run: |
          if [[ "${{ github.repository }}" == 'kubeflow/katib' && \
            ( "${{ github.ref }}" == 'refs/heads/master' || \
            "${{ github.ref }}" =~ ^refs/heads/release- || \
            "${{ github.ref }}" =~ ^refs/tags/v ) ]]; then
            echo "should_publish=true" >> $GITHUB_OUTPUT
          else
            echo "should_publish=false" >> $GITHUB_OUTPUT
          fi

      - name: GHCR Login
        if: steps.publish-condition.outputs.should_publish == 'true'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: DockerHub Login
        if: steps.publish-condition.outputs.should_publish == 'true'
        uses: docker/login-action@v3
        with:
          registry: docker.io
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Publish Component ${{ inputs.component-name }}
        if: steps.publish-condition.outputs.should_publish == 'true'
        id: publish
        uses: ./.github/workflows/template-publish-image
        with:
          image: |
            ghcr.io/kubeflow/katib/${{ inputs.component-name }}
            docker.io/kubeflowkatib/${{ inputs.component-name }}
          dockerfile: ${{ inputs.dockerfile }}
          platforms: ${{ inputs.platforms }}
          push: true

      - name: Test Build For Component ${{ inputs.component-name }}
        if: steps.publish.outcome == 'skipped'
        uses: ./.github/workflows/template-publish-image
        with:
          image: |
            ghcr.io/kubeflow/katib/${{ inputs.component-name }}
            docker.io/kubeflowkatib/${{ inputs.component-name }}
          dockerfile: ${{ inputs.dockerfile }}
          platforms: ${{ inputs.platforms }}
          push: false

@@ -1,38 +0,0 @@
name: E2E Test with darts-cnn-cifar10

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: "3.11"

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          # Comma Delimited
          trial-images: darts-cnn-cifar10-cpu

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
        # Comma Delimited
        experiments: ["darts-cpu"]

@@ -1,38 +0,0 @@
name: E2E Test with enas-cnn-cifar10

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: "3.8"

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          # Comma Delimited
          trial-images: enas-cnn-cifar10-cpu

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
        # Comma Delimited
        experiments: ["enas-cpu"]

@@ -1,46 +0,0 @@
name: E2E Test with pytorch-mnist

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: "3.10"

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          training-operator: true
          # Comma Delimited
          trial-images: pytorch-mnist-cpu

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
        # Comma Delimited
        experiments:
          # suggestion-hyperopt
          - "long-running-resume,from-volume-resume,median-stop"
          # others
          - "grid,bayesian-optimization,tpe,multivariate-tpe,cma-es,hyperband"
          - "hyperopt-distribution,optuna-distribution"
          - "file-metrics-collector,pytorchjob-mnist"
          - "median-stop-with-json-format,file-metrics-collector-with-json-format"

@@ -1,38 +0,0 @@
name: E2E Test with simple-pbt

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          # Comma Delimited
          trial-images: simple-pbt

    strategy:
      fail-fast: false
      matrix:
        # Detail: https://hub.docker.com/r/kindest/node
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
        # Comma Delimited
        experiments: ["simple-pbt"]

@@ -1,38 +0,0 @@
name: E2E Test with tf-mnist-with-summaries

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: ${{ matrix.experiments }}
          training-operator: true
          # Comma Delimited
          trial-images: tf-mnist-with-summaries

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]
        # Comma Delimited
        experiments: ["tfjob-mnist-with-summaries"]

@@ -1,40 +0,0 @@
name: E2E Test with tune API

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Install Katib SDK with extra requires
        shell: bash
        run: |
          pip install --prefer-binary -e 'sdk/python/v1beta1[huggingface]'

      - name: Run e2e test with tune API
        uses: ./.github/workflows/template-e2e-test
        with:
          tune-api: true
          training-operator: true

    strategy:
      fail-fast: false
      matrix:
        # Detail: https://hub.docker.com/r/kindest/node
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

@@ -1,35 +0,0 @@
name: E2E Test with Katib UI, random search, and postgres

on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-22.04
    timeout-minutes: 120
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Test Env
        uses: ./.github/workflows/template-setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Run e2e test with ${{ matrix.experiments }} experiments
        uses: ./.github/workflows/template-e2e-test
        with:
          experiments: random
          # Comma Delimited
          trial-images: pytorch-mnist-cpu
          katib-ui: true
          database-type: postgres

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.29.2", "v1.30.7", "v1.31.3"]

@@ -1,49 +0,0 @@
name: Free-Up Disk Space
description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker

runs:
  using: composite
  steps:
    # This step is a Workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Remove unnecessary files
      shell: bash
      run: |
        echo "Disk usage before cleanup:"
        df -hT

        sudo rm -rf /usr/share/dotnet
        sudo rm -rf /opt/ghc
        sudo rm -rf /usr/local/share/boost
        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
        sudo rm -rf /usr/local/lib/android
        sudo rm -rf /usr/local/share/powershell
        sudo rm -rf /usr/share/swift

        echo "Disk usage after cleanup:"
        df -hT

    - name: Prune docker images
      shell: bash
      run: |
        docker image prune -a -f
        docker system df
        df -hT

    - name: Move docker data directory
      shell: bash
      run: |
        echo "Stopping docker service ..."
        sudo systemctl stop docker
        DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
        DOCKER_ROOT_DIR=/mnt/docker
        echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
        sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
        echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
        sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
        echo "$(sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR})"
        echo "Starting docker service ..."
        sudo systemctl daemon-reload
        sudo systemctl start docker
        echo "Docker service status:"
        sudo systemctl --no-pager -l -o short status docker

@@ -1,42 +0,0 @@
name: Publish AutoML Algorithm Images

on:
  push:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

jobs:
  algorithm:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.component-name }}
      platforms: linux/amd64,linux/arm64
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - component-name: suggestion-hyperopt
            dockerfile: cmd/suggestion/hyperopt/v1beta1/Dockerfile
          - component-name: suggestion-hyperband
            dockerfile: cmd/suggestion/hyperband/v1beta1/Dockerfile
          - component-name: suggestion-skopt
            dockerfile: cmd/suggestion/skopt/v1beta1/Dockerfile
          - component-name: suggestion-goptuna
            dockerfile: cmd/suggestion/goptuna/v1beta1/Dockerfile
          - component-name: suggestion-optuna
            dockerfile: cmd/suggestion/optuna/v1beta1/Dockerfile
          - component-name: suggestion-pbt
            dockerfile: cmd/suggestion/pbt/v1beta1/Dockerfile
          - component-name: suggestion-enas
            dockerfile: cmd/suggestion/nas/enas/v1beta1/Dockerfile
          - component-name: suggestion-darts
            dockerfile: cmd/suggestion/nas/darts/v1beta1/Dockerfile
          - component-name: earlystopping-medianstop
            dockerfile: cmd/earlystopping/medianstop/v1beta1/Dockerfile

@@ -1,24 +0,0 @@
name: Publish Katib Conformance Test Images

on:
  - push
  - pull_request

jobs:
  core:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.component-name }}
      platforms: linux/amd64,linux/arm64
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - component-name: katib-conformance
            dockerfile: Dockerfile.conformance

@@ -1,32 +0,0 @@
name: Publish Katib Core Images

on:
  - push
  - pull_request

jobs:
  core:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.component-name }}
      platforms: linux/amd64,linux/arm64
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - component-name: katib-controller
            dockerfile: cmd/katib-controller/v1beta1/Dockerfile
          - component-name: katib-db-manager
            dockerfile: cmd/db-manager/v1beta1/Dockerfile
          - component-name: katib-ui
            dockerfile: cmd/ui/v1beta1/Dockerfile
          - component-name: file-metrics-collector
            dockerfile: cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile
          - component-name: tfevent-metrics-collector
            dockerfile: cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile

@@ -1,48 +0,0 @@
name: Publish Trial Images

on:
  push:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

jobs:
  trial:
    name: Publish Image
    uses: ./.github/workflows/build-and-publish-images.yaml
    with:
      component-name: ${{ matrix.trial-name }}
      platforms: ${{ matrix.platforms }}
      dockerfile: ${{ matrix.dockerfile }}
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

    strategy:
      fail-fast: false
      matrix:
        include:
          - trial-name: pytorch-mnist-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu
          - trial-name: pytorch-mnist-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu
          - trial-name: tf-mnist-with-summaries
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile
          - trial-name: enas-cnn-cifar10-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.gpu
          - trial-name: enas-cnn-cifar10-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu
          - trial-name: darts-cnn-cifar10-cpu
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu
          - trial-name: darts-cnn-cifar10-gpu
            platforms: linux/amd64
            dockerfile: examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu
          - trial-name: simple-pbt
            platforms: linux/amd64,linux/arm64
            dockerfile: examples/v1beta1/trial-images/simple-pbt/Dockerfile

@@ -1,42 +0,0 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
    - cron: "0 */5 * * *"

jobs:
  stale:
    runs-on: ubuntu-22.04
    permissions:
      issues: write
      pull-requests: write

    steps:
      - uses: actions/stale@v5
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          days-before-stale: 90
          days-before-close: 20
          stale-issue-message: >
            This issue has been automatically marked as stale because it has not had
            recent activity. It will be closed if no further activity occurs. Thank you
            for your contributions.
          close-issue-message: >
            This issue has been automatically closed because it has not had recent
            activity. Please comment "/reopen" to reopen it.
          stale-issue-label: lifecycle/stale
          exempt-issue-labels: lifecycle/frozen
          stale-pr-message: >
            This pull request has been automatically marked as stale because it has not had
            recent activity. It will be closed if no further activity occurs. Thank you
            for your contributions.
          close-pr-message: >
            This pull request has been automatically closed because it has not had recent
            activity. Please comment "/reopen" to reopen it.
          stale-pr-label: lifecycle/stale
          exempt-pr-labels: lifecycle/frozen

@@ -1,49 +0,0 @@
# Composite action for e2e tests.
name: Run E2E Test
description: Run e2e test using the minikube cluster

inputs:
  experiments:
    required: false
    description: comma delimited experiment name
    default: ""
  training-operator:
    required: false
    description: whether to deploy training-operator or not
    default: false
  trial-images:
    required: false
    description: comma delimited trial image name
    default: ""
  katib-ui:
    required: true
    description: whether to deploy katib-ui or not
    default: false
  database-type:
    required: false
    description: mysql or postgres
    default: mysql
  tune-api:
    required: true
    description: whether to execute tune-api test or not
    default: false

runs:
  using: composite
  steps:
    - name: Setup Minikube Cluster
      shell: bash
      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh ${{ inputs.katib-ui }} ${{ inputs.tune-api }} ${{ inputs.trial-images }} ${{ inputs.experiments }}

    - name: Setup Katib
      shell: bash
      run: ./test/e2e/v1beta1/scripts/gh-actions/setup-katib.sh ${{ inputs.katib-ui }} ${{ inputs.training-operator }} ${{ inputs.database-type }}

    - name: Run E2E Experiment
      shell: bash
      run: |
        if "${{ inputs.tune-api }}"; then
          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.sh
        else
          ./test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh ${{ inputs.experiments }}
        fi

@@ -1,62 +0,0 @@
# Composite action for publishing Katib images.
name: Build And Publish Container Images
description: Build MultiPlatform Supporting Container Images

inputs:
  image:
    required: true
    description: image tag
  dockerfile:
    required: true
    description: path for dockerfile
  platforms:
    required: true
    description: linux/amd64 or linux/amd64,linux/arm64
  push:
    required: true
    description: whether to push container images or not

runs:
  using: composite
  steps:
    # This step is a Workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Remove unnecessary files
      shell: bash
      run: |
        sudo rm -rf /usr/share/dotnet
        sudo rm -rf /opt/ghc
        sudo rm -rf "/usr/local/share/boost"
        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
        sudo rm -rf /usr/local/lib/android
        sudo rm -rf /usr/local/share/powershell
        sudo rm -rf /usr/share/swift

        echo "Disk usage after cleanup:"
        df -h

    - name: Set up QEMU
      uses: docker/setup-qemu-action@v3

    - name: Set Up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Add Docker Tags
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ inputs.image }}
        tags: |
          type=raw,latest
          type=sha,prefix=v1beta1-

    - name: Build and Push
      uses: docker/build-push-action@v5
      with:
        context: .
        file: ${{ inputs.dockerfile }}
        push: ${{ inputs.push }}
        tags: ${{ steps.meta.outputs.tags }}
        cache-from: type=gha
        cache-to: type=gha,mode=max,ignore-error=true
        platforms: ${{ inputs.platforms }}

@@ -1,48 +0,0 @@
# Composite action to setup e2e tests.
name: Setup E2E Test
description: setup env for e2e test using the minikube cluster

inputs:
  kubernetes-version:
    required: true
    description: kubernetes version
  python-version:
    required: false
    description: Python version
    # Most latest supporting version
    default: "3.10"

runs:
  using: composite
  steps:
    # This step is a Workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Free-Up Disk Space
      uses: ./.github/workflows/free-up-disk-space

    - name: Setup kubectl
      uses: azure/setup-kubectl@v4
      with:
        version: ${{ inputs.kubernetes-version }}

    - name: Setup Minikube Cluster
      uses: medyagh/setup-minikube@v0.0.18
      with:
        network-plugin: cni
        cni: flannel
        driver: none
        kubernetes-version: ${{ inputs.kubernetes-version }}
        minikube-version: 1.34.0
        start-args: --wait-timeout=120s

    - name: Setup Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Setup Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.python-version }}

    - name: Install Katib SDK
      shell: bash
      run: pip install --prefer-binary -e sdk/python/v1beta1

@@ -0,0 +1,108 @@
name: Charmed Katib

on:
  - push
  - pull_request

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-latest

    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Install dependencies
        run: |
          sudo apt-get install python3-setuptools
          sudo pip3 install black flake8

      - name: Check black
        run: black --check operators

      - name: Check flake8
        run: cd operators && flake8

  build:
    name: Test
    runs-on: ubuntu-latest

    steps:
      - name: Check out repo
        uses: actions/checkout@v2

      - uses: balchua/microk8s-actions@v0.2.2
        with:
          channel: "1.20/stable"
          addons: '["dns", "storage", "rbac"]'

      - name: Install dependencies
        run: |
          set -eux
          sudo snap install charm --classic
          sudo snap install juju --classic
          sudo snap install juju-helpers --classic
          sudo snap install juju-wait --classic
          sudo pip3 install charmcraft

      - name: Build Docker images
        run: |
          set -eux
          images=("katib-controller" "katib-ui" "katib-db-manager")
          folders=("katib-controller" "ui" "db-manager")
          for idx in {0..2}; do
            docker build . \
              -t docker.io/kubeflowkatib/${images[$idx]}:latest \
              -f cmd/${folders[$idx]}/v1beta1/Dockerfile
            docker save docker.io/kubeflowkatib/${images[$idx]} > ${images[$idx]}.tar
            microk8s ctr image import ${images[$idx]}.tar
          done

      - name: Deploy Katib
        run: |
          set -eux
          cd operators/
          git clone git://git.launchpad.net/canonical-osm
          cp -r canonical-osm/charms/interfaces/juju-relation-mysql mysql
          sg microk8s -c 'juju bootstrap microk8s uk8s'
          juju add-model kubeflow
          juju bundle deploy -b bundle-edge.yaml --build
          juju wait -wvt 300

      - name: Test Katib
        run: |
          set -eux
          kubectl apply -f examples/v1beta1/random-example.yaml

      - name: Get pod statuses
        run: kubectl get all -A
        if: failure()

      - name: Get juju status
        run: juju status
        if: failure()

      - name: Get katib-controller workload logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-app=katib-controller
        if: failure()

      - name: Get katib-controller operator logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-operator=katib-controller
        if: failure()

      - name: Get katib-ui workload logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-app=katib-ui
        if: failure()

      - name: Get katib-ui operator logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-operator=katib-ui
        if: failure()

      - name: Get katib-db-manager workload logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-app=katib-db-manager
        if: failure()

      - name: Get katib-db-manager operator logs
        run: kubectl logs --tail 100 -nkubeflow -ljuju-operator=katib-db-manager
        if: failure()

@@ -1,79 +0,0 @@
name: Go Test

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  generatetests:
    name: Generate And Format Test
    runs-on: ubuntu-22.04
    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
    steps:
      - name: Check out code
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
          cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum

      - name: Check Go Modules, Generated Go/Python codes, and Format
        run: make check

  unittests:
    name: Unit Test
    runs-on: ubuntu-22.04
    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
    steps:
      - name: Check out code
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.mod
          cache-dependency-path: ${{ env.GOPATH }}/src/github.com/kubeflow/katib/go.sum

      - name: Run Go test
        run: go mod download && make test ENVTEST_K8S_VERSION=${{ matrix.kubernetes-version }}

      - name: Coveralls report
        uses: shogo82148/actions-goveralls@v1
        with:
          path-to-profile: coverage.out
          working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/katib
          parallel: true

    strategy:
      fail-fast: false
      matrix:
        # Detail: `setup-envtest list`
        kubernetes-version: ["1.29.3", "1.30.0", "1.31.0"]

  # notifies that all test jobs are finished.
  finish:
    needs: unittests
    runs-on: ubuntu-22.04
    steps:
      - uses: shogo82148/actions-goveralls@v1
        with:
          parallel-finished: true

@@ -1,30 +0,0 @@
name: Lint Files

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  lint:
    name: Lint
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.9

      - name: Check shell scripts
        run: make shellcheck

      - name: Run pre-commit
        uses: pre-commit/action@v3.0.1

@@ -1,101 +0,0 @@
name: Frontend Test

on:
  pull_request:
    paths:
      - pkg/ui/v1beta1/frontend/**

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    name: Code format and lint
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 16.20.2

      - name: Format katib code
        run: |
          npm install prettier --prefix ./pkg/ui/v1beta1/frontend
          make prettier-check

      - name: Lint katib code
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run lint-check

  frontend-unit-tests:
    name: Frontend Unit Tests
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 16.20.2

      - name: Fetch Kubeflow and install common code dependencies
        run: |
          COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
          cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
          cd kubeflow
          git checkout $COMMIT
          cd components/crud-web-apps/common/frontend/kubeflow-common-lib
          npm i
          npm run build
          npm link ./dist/kubeflow

      - name: Install KWA dependencies
        run: |
          cd pkg/ui/v1beta1/frontend
          npm i
          npm link kubeflow

      - name: Run unit tests
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run test:prod

  frontend-ui-tests:
    name: UI tests with Cypress
    runs-on: ubuntu-22.04
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup node version to 16
        uses: actions/setup-node@v4
        with:
          node-version: 16

      - name: Fetch Kubeflow and install common code dependencies
        run: |
          COMMIT=$(cat pkg/ui/v1beta1/frontend/COMMIT)
          cd /tmp && git clone https://github.com/kubeflow/kubeflow.git
          cd kubeflow
          git checkout $COMMIT
          cd components/crud-web-apps/common/frontend/kubeflow-common-lib
          npm i
          npm run build
          npm link ./dist/kubeflow
      - name: Install KWA dependencies
        run: |
          cd pkg/ui/v1beta1/frontend
          npm i
          npm link kubeflow
      - name: Serve UI & run Cypress tests in Chrome and Firefox
        run: |
          cd pkg/ui/v1beta1/frontend
          npm run start & npx wait-on http://localhost:4200
          npm run ui-test-ci-all

@@ -1,47 +0,0 @@
name: Python Test

on:
  pull_request:
    paths-ignore:
      - "pkg/ui/v1beta1/frontend/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    name: Test
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.11

      - name: Run Python test
        run: make pytest

  # The skopt service doesn't work appropriately with Python 3.11.
  # So, we need to run the test with Python 3.9.
  # TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
  # REF: https://github.com/kubeflow/katib/issues/2280
  test-skopt:
    name: Test Skopt
    runs-on: ubuntu-22.04

    steps:
      - name: Check out code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.9

      - name: Run Python test
        run: make pytest-skopt

@@ -6,10 +6,6 @@ __pycache__/
 *.egg-info
 build/
 *.charm
-test/unit/v1beta1/metricscollector/testdata
-
-# SDK generator JAR file
-hack/gen-python-sdk/openapi-generator-cli.jar

 # Project specific ignore files
 *.swp

@@ -22,7 +18,6 @@ bin
 *.dll
 *.so
 *.dylib
-pkg/metricscollector/v1beta1/file-metricscollector/testdata

 ## Test binary, build with `go test -c`
 *.test

@@ -78,6 +73,3 @@ $RECYCLE.BIN/

 ## Vendor dir
 vendor
-
-# Jupyter Notebooks.
-**/.ipynb_checkpoints

@@ -1,38 +0,0 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: check-json
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        name: isort
        entry: isort --profile black
  - repo: https://github.com/psf/black
    rev: 24.2.0
    hooks:
      - id: black
        files: (sdk|examples|pkg)/.*
  - repo: https://github.com/pycqa/flake8
    rev: 7.1.1
    hooks:
      - id: flake8
        files: (sdk|examples|pkg)/.*
        exclude: |
          (?x)^(
            .*zz_generated.deepcopy.*|
            .*pb.go|
            pkg/apis/manager/.*pb2(?:_grpc)?.py(?:i)?|
            pkg/apis/v1beta1/openapi_generated.go|
            pkg/mock/.*|
            pkg/client/controller/.*|
            sdk/python/v1beta1/kubeflow/katib/configuration.py|
            sdk/python/v1beta1/kubeflow/katib/rest.py|
            sdk/python/v1beta1/kubeflow/katib/__init__.py|
            sdk/python/v1beta1/kubeflow/katib/exceptions.py|
            sdk/python/v1beta1/kubeflow/katib/api_client.py|
            sdk/python/v1beta1/kubeflow/katib/models/.*
          )$

@@ -0,0 +1,26 @@
jobs:
  include:
    - name: "Go unit tests, gofmt, golint and coveralls"
      language: go
      go: "1.15.8"
      go_import_path: github.com/kubeflow/katib
      install:
        - curl -L -O "https://github.com/kubernetes-sigs/kubebuilder/releases/download/v1.0.7/kubebuilder_1.0.7_linux_amd64.tar.gz"
        # extract the archive
        - tar -zxvf kubebuilder_1.0.7_linux_amd64.tar.gz
        - sudo mv kubebuilder_1.0.7_linux_amd64 /usr/local/kubebuilder
        - export PATH=$PATH:/usr/local/kubebuilder/bin
        # get coveralls.io support
        - go get github.com/mattn/goveralls
      script:
        - make check
        - make test
      after_success:
        - goveralls -coverprofile=coverage.out
    - name: "Prettier frontend check"
      language: node_js
      node_js: "12.18.1"
      install:
        - npm install --global prettier@2.2.0
      script:
        - make prettier-check

@@ -5,16 +5,12 @@ please add yourself into the following list by a pull request.
 Please keep the list in alphabetical order.

 | Organization | Contact | Description of Use |
-|--------------------------------------------------|------------------------------------------------------|----------------------------------------------------------------------|
+| ------------ | ------- | ------------------ |
-| [Akuity](https://akuity.io/) | [@terrytangyuan](https://github.com/terrytangyuan) | |
 | [Ant Group](https://www.antgroup.com/) |[@ohmystack](https://github.com/ohmystack) | Automatic training in Ant Group internal AI Platform |
 | [babylon health](https://www.babylonhealth.com/) |[@jeremievallee](https://github.com/jeremievallee) | Hyperparameter tuning for AIR internal AI Platform |
 | [caicloud](https://caicloud.io/) |[@gaocegege](https://github.com/gaocegege) | Hyperparameter tuning in Caicloud Cloud-Native AI Platform |
 | [canonical](https://ubuntu.com/) |[@RFMVasconcelos](https://github.com/rfmvasconcelos) | Hyperparameter tuning for customer projects in Defense and Fintech |
-| [CERN](https://home.cern/) | [@d-gol](https://github.com/d-gol) | Hyperparameter tuning within the ML platform on private cloud |
 | [cisco](https://cisco.com/) |[@ramdootp](https://github.com/ramdootp) | Hyperparameter tuning for conversational AI interface using Rasa |
 | [cubonacci](https://www.cubonacci.com) |[@janvdvegt](https://github.com/janvdvegt) | Hyperparameter tuning within the Cubonacci machine learning platform |
-| [CyberAgent](https://www.cyberagent.co.jp/en/) | [@tenzen-y](https://github.com/tenzen-y) | Experiment in CyberAgent internal ML Platform on Private Cloud |
 | [fuzhi](http://www.fuzhi.ai/) | [@planck0591](https://github.com/planck0591) | Experiment and Trial in autoML Platform |
 | [karrot](https://uk.karrotmarket.com/) |[@muik](https://github.com/muik) | Hyperparameter tuning in Karrot ML Platform |
-| [PITS Global Data Recovery Services](https://www.pitsdatarecovery.net/) | [@pheianox](https://github.com/pheianox) | CyberAgent and ML Platform |

1798  CHANGELOG.md (file diff suppressed because it is too large)

43  CITATION.cff

@@ -1,43 +0,0 @@
cff-version: 1.2.0
message: "If you use Katib in your scientific publication, please cite it as below."
authors:
  - family-names: "George"
    given-names: "Johnu"
  - family-names: "Gao"
    given-names: "Ce"
  - family-names: "Liu"
    given-names: "Richard"
  - family-names: "Liu"
    given-names: "Hou Gang"
  - family-names: "Tang"
    given-names: "Yuan"
  - family-names: "Pydipaty"
    given-names: "Ramdoot"
  - family-names: "Saha"
    given-names: "Amit Kumar"
title: "Katib"
type: software
repository-code: "https://github.com/kubeflow/katib"
preferred-citation:
  type: misc
  title: "A Scalable and Cloud-Native Hyperparameter Tuning System"
  authors:
    - family-names: "George"
      given-names: "Johnu"
    - family-names: "Gao"
      given-names: "Ce"
    - family-names: "Liu"
      given-names: "Richard"
    - family-names: "Liu"
      given-names: "Hou Gang"
    - family-names: "Tang"
      given-names: "Yuan"
    - family-names: "Pydipaty"
      given-names: "Ramdoot"
    - family-names: "Saha"
      given-names: "Amit Kumar"
  year: 2020
  url: "https://arxiv.org/abs/2006.02085"
  identifiers:
    - type: "other"
      value: "arXiv:2006.02085"

167
CONTRIBUTING.md
167
CONTRIBUTING.md
|
@ -1,167 +0,0 @@
|
||||||
# Developer Guide
|
|
||||||
|
|
||||||
This developer guide is for people who want to contribute to the Katib project.
|
|
||||||
If you're interesting in using Katib in your machine learning project,
|
|
||||||
see the following guides:
|
|
||||||
|
|
||||||
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
|
|
||||||
- [How to configure Katib Experiment](https://kubeflow.org/docs/components/katib/experiment/).
|
|
||||||
- [Katib architecture and concepts](https://www.kubeflow.org/docs/components/katib/reference/architecture/)
|
|
||||||
for hyperparameter tuning and neural architecture search.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
- [Go](https://golang.org/) (1.22 or later)
|
|
||||||
- [Docker](https://docs.docker.com/) (24.0 or later)
|
|
||||||
- [Docker Buildx](https://docs.docker.com/build/buildx/) (0.8.0 or later)
|
|
||||||
- [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/install/install_overview.html) (8 or later)
|
|
||||||
- [Python](https://www.python.org/) (3.11 or later)
|
|
||||||
- [kustomize](https://kustomize.io/) (4.0.5 or later)
|
|
||||||
- [pre-commit](https://pre-commit.com/)
|
|
||||||
|
|
||||||
## Build from source code
|
|
||||||
|
|
||||||
**Note** that your Docker Desktop should
|
|
||||||
[enable containerd image store](https://docs.docker.com/desktop/containerd/#enable-the-containerd-image-store)
|
|
||||||
to build multi-arch images. You can build the Katib images from the source code as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
make build REGISTRY=<image-registry> TAG=<image-tag>
|
|
||||||
```
|
|
||||||
|
|
||||||
If you are using an Apple Silicon machine and encounter the "rosetta error: bss_size overflow," go to Docker Desktop -> General and uncheck "Use Rosetta for x86_64/amd64 emulation on Apple Silicon."
|
|
||||||
|
|
||||||
To use your custom images for the Katib components, modify
|
|
||||||
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
|
|
||||||
and [Katib Config](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/katib-config.yaml)
|
|
||||||
|
|
||||||
You can deploy Katib v1beta1 manifests into a Kubernetes cluster as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
make deploy
|
|
||||||
```
|
|
||||||
|
|
||||||
You can undeploy Katib v1beta1 manifests from a Kubernetes cluster as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
make undeploy
|
|
||||||
```
|
|
||||||
|
|
||||||
## Technical and style guide
|
|
||||||
|
|
||||||
The following guidelines apply primarily to Katib,
|
|
||||||
but other projects like [Training Operator](https://github.com/kubeflow/training-operator) might also adhere to them.
|
|
||||||
|
|
||||||
## Go Development
|
|
||||||
|
|
||||||
When coding:
|
|
||||||
|
|
||||||
- Follow [effective go](https://go.dev/doc/effective_go) guidelines.
|
|
||||||
- Run locally [`make check`](https://github.com/kubeflow/katib/blob/46173463027e4fd2e604e25d7075b2b31a702049/Makefile#L31)
|
|
||||||
to verify if changes follow best practices before submitting PRs.
|
|
||||||
|
|
||||||
Testing:
|
|
||||||
|
|
||||||
- Use [`cmp.Diff`](https://pkg.go.dev/github.com/google/go-cmp/cmp#Diff) instead of `reflect.Equal`, to provide useful comparisons.
|
|
||||||
- Define test cases as maps instead of slices to avoid dependencies on the running order.
|
|
||||||
The map key should be equal to the test case name.
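
To make this convention concrete, below is a minimal, self-contained sketch of a map-keyed table test that uses `cmp.Diff`; the `sum` function and the test values are hypothetical and only illustrate the pattern:

```go
package example

import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

// sum is a hypothetical function under test.
func sum(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}

func TestSum(t *testing.T) {
	// Test cases are keyed by name, so the execution order carries no meaning.
	cases := map[string]struct {
		input []int
		want  int
	}{
		"empty slice returns zero":   {input: nil, want: 0},
		"positive values are summed": {input: []int{1, 2, 3}, want: 6},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			got := sum(tc.input)
			// cmp.Diff prints a readable diff instead of a bare boolean.
			if diff := cmp.Diff(tc.want, got); diff != "" {
				t.Errorf("unexpected sum (-want,+got):\n%s", diff)
			}
		})
	}
}
```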
|
|
||||||
|
|
||||||
## Modify controller APIs
|
|
||||||
|
|
||||||
If you want to modify Katib controller APIs, you have to
|
|
||||||
generate deepcopy, clientset, listers, informers, open-api and Python SDK with the changed APIs.
|
|
||||||
You can update the necessary files as follows:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
make generate
|
|
||||||
```
|
|
||||||
|
|
||||||
## Controller Flags
|
|
||||||
|
|
||||||
Below is a list of command-line flags accepted by Katib controller:
|
|
||||||
|
|
||||||
| Name | Type | Default | Description |
|
|
||||||
| ------------ | ------ | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
|
|
||||||
| katib-config | string | "" | The katib-controller will load its initial configuration from this file. Omit this flag to use the default configuration values. |
|
|
||||||
|
|
||||||
## DB Manager Flags
|
|
||||||
|
|
||||||
Below is a list of command-line flags accepted by Katib DB Manager:
|
|
||||||
|
|
||||||
| Name | Type | Default | Description |
|
|
||||||
| --------------- | ------------- | -------------| ------------------------------------------------------------------- |
|
|
||||||
| connect-timeout | time.Duration | 60s | Timeout before calling error during database connection |
|
|
||||||
| listen-address | string | 0.0.0.0:6789 | The network interface or IP address to receive incoming connections |
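
For reference, here is a minimal sketch of how these two flags are typically declared with Go's standard `flag` package; the defaults mirror the table above and the variable names are illustrative:

```go
package main

import (
	"flag"
	"time"
)

func main() {
	// Defaults mirror the DB Manager flags table above.
	connectTimeout := flag.Duration("connect-timeout", 60*time.Second,
		"Timeout before calling error during database connection")
	listenAddress := flag.String("listen-address", "0.0.0.0:6789",
		"The network interface or IP address to receive incoming connections")
	flag.Parse()

	_ = connectTimeout
	_ = listenAddress
}
```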
|
|
||||||
|
|
||||||
## Katib admission webhooks
|
|
||||||
|
|
||||||
Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).
|
|
||||||
|
|
||||||
1. `validator.experiment.katib.kubeflow.org` -
|
|
||||||
[Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
|
|
||||||
to validate the Katib Experiment before the creation.
|
|
||||||
|
|
||||||
1. `defaulter.experiment.katib.kubeflow.org` -
|
|
||||||
[Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
|
|
||||||
to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
|
|
||||||
in the Katib Experiment before the creation.
|
|
||||||
|
|
||||||
1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
|
|
||||||
collector sidecar container into the training pod. Learn more about Katib's
|
|
||||||
metrics collector in the
|
|
||||||
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
|
|
||||||
|
|
||||||
You can find the YAMLs for the Katib webhooks
|
|
||||||
[here](../manifests/v1beta1/components/webhook/webhooks.yaml).
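
As a rough illustration of what such a webhook does (this is not Katib's actual implementation), a validating webhook is an HTTPS endpoint that receives an `AdmissionReview` and answers with an allow/deny verdict. The handler below is a minimal, hypothetical sketch; the path, port, and certificate locations are assumptions:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

// validate is a hypothetical handler that allows every request.
// A real webhook would decode the embedded object and run validation rules.
func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: true,
	}
	if err := json.NewEncoder(w).Encode(&review); err != nil {
		log.Printf("failed to write admission response: %v", err)
	}
}

func main() {
	http.HandleFunc("/validate", validate)
	// The serving certificate must match the CABundle registered in the webhook configuration.
	log.Fatal(http.ListenAndServeTLS(":8443", "/tmp/tls.crt", "/tmp/tls.key", nil))
}
```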
|
|
||||||
|
|
||||||
**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
|
|
||||||
via `TCP:8443` by specifying the firewall rule and you have to update the master
|
|
||||||
plane CIDR source range to use the Katib webhooks.
|
|
||||||
|
|
||||||
### Katib cert generator
|
|
||||||
|
|
||||||
Katib Controller has the internal `cert-generator` to generate certificates for the webhooks.
|
|
||||||
|
|
||||||
Once Katib is deployed in the Kubernetes cluster, the `cert-generator` follows these steps:
|
|
||||||
|
|
||||||
- Generate the self-signed certificate and private key.
|
|
||||||
|
|
||||||
- Update a Kubernetes Secret with the self-signed TLS certificate and private key.
|
|
||||||
- Patch the webhooks with the `CABundle`.
|
|
||||||
|
|
||||||
Once the `cert-generator` has finished, the Katib controller starts registering controllers such as `experiment-controller` with the manager.
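
For illustration only, the first step (generating a self-signed certificate and private key) can be sketched with Go's standard library as below. This is not the cert-generator's actual code, and the service DNS name is an assumed example:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	// Generate a private key and a self-signed certificate for the webhook service DNS name.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "katib-controller.kubeflow.svc"},
		DNSNames:              []string{"katib-controller.kubeflow.svc"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		IsCA:                  true,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	// The PEM-encoded pair is what would be stored in the webhook Secret and used as the CABundle.
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	pem.Encode(os.Stdout, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
}
```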
|
|
||||||
|
|
||||||
You can find the `cert-generator` source code [here](../pkg/certgenerator/v1beta1).
|
|
||||||
|
|
||||||
NOTE: Katib also supports [cert-manager](https://cert-manager.io/) for generating certificates for the admission webhooks instead of the internal cert-generator.
|
|
||||||
You can find the installation with the cert-manager [here](../manifests/v1beta1/installs/katib-cert-manager).
|
|
||||||
|
|
||||||
## Implement a new algorithm and use it in Katib
|
|
||||||
|
|
||||||
Please see [new-algorithm-service.md](./new-algorithm-service.md).
|
|
||||||
|
|
||||||
## Katib UI documentation
|
|
||||||
|
|
||||||
Please see [Katib UI README](../pkg/ui/v1beta1).
|
|
||||||
|
|
||||||
## Design proposals
|
|
||||||
|
|
||||||
Please see [proposals](./proposals).
|
|
||||||
|
|
||||||
## Code Style
|
|
||||||
|
|
||||||
### pre-commit
|
|
||||||
|
|
||||||
Make sure to install [pre-commit](https://pre-commit.com/) (`pip install
|
|
||||||
pre-commit`) and run `pre-commit install` from the root of the repository at
|
|
||||||
least once before creating git commits.
|
|
||||||
|
|
||||||
The pre-commit [hooks](../.pre-commit-config.yaml) ensure code quality and
|
|
||||||
consistency. They are executed in CI. PRs that fail to comply with the hooks
|
|
||||||
will not be able to pass the corresponding CI gate. The hooks are only executed
|
|
||||||
against staged files unless you run `pre-commit run --all-files`, in which case,
|
|
||||||
they'll be executed against every file in the repository.
|
|
||||||
|
|
||||||
Specific programmatically generated files listed in the `exclude` field in
|
|
||||||
[.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded
|
|
||||||
from the hooks.
|
|
|
@ -1,32 +0,0 @@
|
||||||
# Copyright 2023 The Kubeflow Authors
|
|
||||||
#
|
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
# you may not use this file except in compliance with the License.
|
|
||||||
# You may obtain a copy of the License at
|
|
||||||
#
|
|
||||||
# http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
#
|
|
||||||
# Unless required by applicable law or agreed to in writing, software
|
|
||||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
# See the License for the specific language governing permissions and
|
|
||||||
# limitations under the License.
|
|
||||||
|
|
||||||
# Dockerfile for building the source code of conformance tests
|
|
||||||
FROM python:3.10-slim
|
|
||||||
|
|
||||||
WORKDIR /kubeflow/katib
|
|
||||||
|
|
||||||
COPY sdk/ /kubeflow/katib/sdk/
|
|
||||||
COPY examples/ /kubeflow/katib/examples/
|
|
||||||
COPY test/ /kubeflow/katib/test/
|
|
||||||
COPY pkg/ /kubeflow/katib/pkg/
|
|
||||||
|
|
||||||
COPY conformance/run.sh .
|
|
||||||
|
|
||||||
# Add test script.
|
|
||||||
RUN chmod +x run.sh
|
|
||||||
|
|
||||||
RUN pip install --prefer-binary -e sdk/python/v1beta1
|
|
||||||
|
|
||||||
ENTRYPOINT [ "./run.sh" ]
|
|
|
@ -1,130 +1,70 @@
|
||||||
HAS_LINT := $(shell command -v golangci-lint;)
|
HAS_LINT := $(shell command -v golint;)
|
||||||
HAS_YAMLLINT := $(shell command -v yamllint;)
|
|
||||||
HAS_SHELLCHECK := $(shell command -v shellcheck;)
|
|
||||||
HAS_SETUP_ENVTEST := $(shell command -v setup-envtest;)
|
|
||||||
HAS_MOCKGEN := $(shell command -v mockgen;)
|
|
||||||
|
|
||||||
COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD)
|
COMMIT := v1beta1-$(shell git rev-parse --short=7 HEAD)
|
||||||
KATIB_REGISTRY := ghcr.io/kubeflow/katib
|
KATIB_REGISTRY := docker.io/kubeflowkatib
|
||||||
CPU_ARCH ?= linux/amd64,linux/arm64
|
|
||||||
ENVTEST_K8S_VERSION ?= 1.31
|
|
||||||
MOCKGEN_VERSION ?= $(shell grep 'go.uber.org/mock' go.mod | cut -d ' ' -f 2)
|
|
||||||
GO_VERSION=$(shell grep '^go' go.mod | cut -d ' ' -f 2)
|
|
||||||
GOPATH ?= $(shell go env GOPATH)
|
|
||||||
|
|
||||||
TEST_TENSORFLOW_EVENT_FILE_PATH ?= $(CURDIR)/test/unit/v1beta1/metricscollector/testdata/tfevent-metricscollector/logs
|
|
||||||
|
|
||||||
# Run tests
|
# Run tests
|
||||||
.PHONY: test
|
.PHONY: test
|
||||||
test: envtest
|
test:
|
||||||
KUBEBUILDER_ASSETS="$(shell setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" go test ./pkg/... ./cmd/... -coverprofile coverage.out
|
go test ./pkg/... ./cmd/... -coverprofile coverage.out
|
||||||
|
|
||||||
envtest:
|
check: generate fmt vet lint
|
||||||
ifndef HAS_SETUP_ENVTEST
|
|
||||||
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@release-0.19
|
|
||||||
$(info "setup-envtest has been installed")
|
|
||||||
endif
|
|
||||||
$(info "setup-envtest has already installed")
|
|
||||||
|
|
||||||
check: generated-codes go-mod fmt vet lint
|
|
||||||
|
|
||||||
fmt:
|
fmt:
|
||||||
hack/verify-gofmt.sh
|
hack/verify-gofmt.sh
|
||||||
|
|
||||||
lint:
|
lint:
|
||||||
ifndef HAS_LINT
|
ifndef HAS_LINT
|
||||||
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.64.7
|
go get -u golang.org/x/lint/golint
|
||||||
$(info "golangci-lint has been installed")
|
echo "installing golint"
|
||||||
endif
|
endif
|
||||||
hack/verify-golangci-lint.sh
|
hack/verify-golint.sh
|
||||||
|
|
||||||
yamllint:
|
|
||||||
ifndef HAS_YAMLLINT
|
|
||||||
pip install --prefer-binary yamllint
|
|
||||||
$(info "yamllint has been installed")
|
|
||||||
endif
|
|
||||||
hack/verify-yamllint.sh
|
|
||||||
|
|
||||||
vet:
|
vet:
|
||||||
go vet ./pkg/... ./cmd/...
|
go vet ./pkg/... ./cmd/...
|
||||||
|
|
||||||
shellcheck:
|
|
||||||
ifndef HAS_SHELLCHECK
|
|
||||||
bash hack/install-shellcheck.sh
|
|
||||||
$(info "shellcheck has been installed")
|
|
||||||
endif
|
|
||||||
hack/verify-shellcheck.sh
|
|
||||||
|
|
||||||
update:
|
update:
|
||||||
hack/update-gofmt.sh
|
hack/update-gofmt.sh
|
||||||
|
|
||||||
# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
|
# Deploy Katib v1beta1 manifests using Kustomize into a k8s cluster.
|
||||||
deploy:
|
deploy:
|
||||||
bash scripts/v1beta1/deploy.sh $(WITH_DATABASE_TYPE)
|
bash scripts/v1beta1/deploy.sh
|
||||||
|
|
||||||
# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
|
# Undeploy Katib v1beta1 manifests using Kustomize from a k8s cluster
|
||||||
undeploy:
|
undeploy:
|
||||||
bash scripts/v1beta1/undeploy.sh
|
bash scripts/v1beta1/undeploy.sh
|
||||||
|
|
||||||
generated-codes: generate
|
# Generate deepcopy, clientset, listers, informers, open-api and python SDK for APIs.
|
||||||
ifneq ($(shell bash hack/verify-generated-codes.sh '.'; echo $$?),0)
|
|
||||||
$(error 'Please run "make generate" to generate codes')
|
|
||||||
endif
|
|
||||||
|
|
||||||
go-mod: sync-go-mod
|
|
||||||
ifneq ($(shell bash hack/verify-generated-codes.sh 'go.*'; echo $$?),0)
|
|
||||||
$(error 'Please run "go mod tidy -go $(GO_VERSION)" to sync Go modules')
|
|
||||||
endif
|
|
||||||
|
|
||||||
sync-go-mod:
|
|
||||||
go mod tidy -go $(GO_VERSION)
|
|
||||||
|
|
||||||
.PHONY: go-mod-download
|
|
||||||
go-mod-download:
|
|
||||||
go mod download
|
|
||||||
|
|
||||||
CONTROLLER_GEN = $(shell pwd)/bin/controller-gen
|
|
||||||
.PHONY: controller-gen
|
|
||||||
controller-gen:
|
|
||||||
@GOBIN=$(shell pwd)/bin GO111MODULE=on go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.16.5
|
|
||||||
|
|
||||||
# Run this if you update any existing controller APIs.
|
# Run this if you update any existing controller APIs.
|
||||||
# 1. Generate deepcopy, clientset, listers, informers for the APIs (hack/update-codegen.sh)
|
generate:
|
||||||
# 2. Generate open-api for the APIs (hack/update-openapigen)
|
ifndef GOPATH
|
||||||
# 3. Generate Python SDK for Katib (hack/gen-python-sdk/gen-sdk.sh)
|
$(error GOPATH not defined, please define GOPATH. Run "go help gopath" to learn more about GOPATH)
|
||||||
# 4. Generate gRPC manager APIs (pkg/apis/manager/v1beta1/build.sh and pkg/apis/manager/health/build.sh)
|
|
||||||
# 5. Generate Go mock codes
|
|
||||||
generate: go-mod-download controller-gen
|
|
||||||
ifndef HAS_MOCKGEN
|
|
||||||
go install go.uber.org/mock/mockgen@$(MOCKGEN_VERSION)
|
|
||||||
$(info "mockgen has been installed")
|
|
||||||
endif
|
endif
|
||||||
go generate ./pkg/... ./cmd/...
|
go generate ./pkg/... ./cmd/...
|
||||||
hack/gen-python-sdk/gen-sdk.sh
|
hack/gen-python-sdk/gen-sdk.sh
|
||||||
hack/update-proto.sh
|
|
||||||
hack/update-mockgen.sh
|
|
||||||
|
|
||||||
# Build images for the Katib v1beta1 components.
|
# Build images for the Katib v1beta1 components.
|
||||||
build: generate
|
build: generate
|
||||||
ifeq ($(and $(REGISTRY),$(TAG),$(CPU_ARCH)),)
|
ifeq ($(and $(REGISTRY),$(TAG)),)
|
||||||
$(error REGISTRY and TAG must be set. Usage: make build REGISTRY=<registry> TAG=<tag> CPU_ARCH=<cpu-architecture>)
|
$(error REGISTRY and TAG must be set. Usage: make build REGISTRY=<registry> TAG=<tag>)
|
||||||
endif
|
endif
|
||||||
bash scripts/v1beta1/build.sh $(REGISTRY) $(TAG) $(CPU_ARCH)
|
bash scripts/v1beta1/build.sh $(REGISTRY) $(TAG)
|
||||||
|
|
||||||
# Build and push Katib images from the latest master commit.
|
# Build and push Katib images from the latest master commit.
|
||||||
push-latest: generate
|
push-latest: generate
|
||||||
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) latest $(CPU_ARCH)
|
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) latest
|
||||||
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT) $(CPU_ARCH)
|
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) latest
|
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) latest
|
||||||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
|
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||||
|
|
||||||
# Build and push Katib images for the given tag.
|
# Build and push Katib images for the given tag.
|
||||||
push-tag:
|
push-tag: generate
|
||||||
ifeq ($(TAG),)
|
ifeq ($(TAG),)
|
||||||
$(error TAG must be set. Usage: make push-tag TAG=<release-tag>)
|
$(error TAG must be set. Usage: make push-tag TAG=<release-tag>)
|
||||||
endif
|
endif
|
||||||
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG) $(CPU_ARCH)
|
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(TAG)
|
||||||
|
bash scripts/v1beta1/build.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||||
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG)
|
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(TAG)
|
||||||
|
bash scripts/v1beta1/push.sh $(KATIB_REGISTRY) $(COMMIT)
|
||||||
|
|
||||||
# Release a new version of Katib.
|
# Release a new version of Katib.
|
||||||
release:
|
release:
|
||||||
|
@ -133,15 +73,6 @@ ifeq ($(and $(BRANCH),$(TAG)),)
|
||||||
endif
|
endif
|
||||||
bash scripts/v1beta1/release.sh $(BRANCH) $(TAG)
|
bash scripts/v1beta1/release.sh $(BRANCH) $(TAG)
|
||||||
|
|
||||||
# Update all Katib images.
|
|
||||||
update-images:
|
|
||||||
ifeq ($(and $(OLD_PREFIX),$(NEW_PREFIX),$(TAG)),)
|
|
||||||
$(error OLD_PREFIX, NEW_PREFIX, and TAG must be set. \
|
|
||||||
Usage: make update-images OLD_PREFIX=<old-prefix> NEW_PREFIX=<new-prefix> TAG=<tag> \
|
|
||||||
For more information, check this file: scripts/v1beta1/update-images.sh)
|
|
||||||
endif
|
|
||||||
bash scripts/v1beta1/update-images.sh $(OLD_PREFIX) $(NEW_PREFIX) $(TAG)
|
|
||||||
|
|
||||||
# Prettier UI format check for Katib v1beta1.
|
# Prettier UI format check for Katib v1beta1.
|
||||||
prettier-check:
|
prettier-check:
|
||||||
npm run format:check --prefix pkg/ui/v1beta1/frontend
|
npm run format:check --prefix pkg/ui/v1beta1/frontend
|
||||||
|
@ -149,45 +80,3 @@ prettier-check:
|
||||||
# Update boilerplate for the source code.
|
# Update boilerplate for the source code.
|
||||||
update-boilerplate:
|
update-boilerplate:
|
||||||
./hack/boilerplate/update-boilerplate.sh
|
./hack/boilerplate/update-boilerplate.sh
|
||||||
|
|
||||||
prepare-pytest:
|
|
||||||
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/hyperopt/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/optuna/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/hyperband/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/nas/enas/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/nas/darts/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/pbt/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/earlystopping/medianstop/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/metricscollector/v1beta1/tfevent-metricscollector/requirements.txt
|
|
||||||
# `TypeIs` was introduced in typing-extensions 4.10.0, and torch 2.6.0 requires typing-extensions>=4.10.0.
|
|
||||||
# REF: https://github.com/kubeflow/katib/pull/2504
|
|
||||||
# TODO (tenzen-y): Once we upgrade libraries depended on typing-extensions==4.5.0, we can remove this line.
|
|
||||||
pip install typing-extensions==4.10.0
|
|
||||||
|
|
||||||
prepare-pytest-testdata:
|
|
||||||
ifeq ("$(wildcard $(TEST_TENSORFLOW_EVENT_FILE_PATH))", "")
|
|
||||||
python examples/v1beta1/trial-images/tf-mnist-with-summaries/mnist.py --epochs 5 --batch-size 200 --log-path $(TEST_TENSORFLOW_EVENT_FILE_PATH)
|
|
||||||
endif
|
|
||||||
|
|
||||||
# TODO(Electronic-Waste): Remove the import rewrite when protobuf supports `python_package` option.
|
|
||||||
# REF: https://github.com/protocolbuffers/protobuf/issues/7061
|
|
||||||
pytest: prepare-pytest prepare-pytest-testdata
|
|
||||||
pytest ./test/unit/v1beta1/suggestion --ignore=./test/unit/v1beta1/suggestion/test_skopt_service.py
|
|
||||||
pytest ./test/unit/v1beta1/earlystopping
|
|
||||||
pytest ./test/unit/v1beta1/metricscollector
|
|
||||||
cp ./pkg/apis/manager/v1beta1/python/api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py
|
|
||||||
cp ./pkg/apis/manager/v1beta1/python/api_pb2_grpc.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
|
|
||||||
sed -i "s/api_pb2/kubeflow\.katib\.katib_api_pb2/g" ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
|
|
||||||
pytest ./sdk/python/v1beta1/kubeflow/katib
|
|
||||||
rm ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2.py ./sdk/python/v1beta1/kubeflow/katib/katib_api_pb2_grpc.py
|
|
||||||
|
|
||||||
# The skopt service doesn't work appropriately with Python 3.11.
|
|
||||||
# So, we need to run the test with Python 3.9.
|
|
||||||
# TODO (tenzen-y): Once we stop to support skopt, we can remove this test.
|
|
||||||
# REF: https://github.com/kubeflow/katib/issues/2280
|
|
||||||
pytest-skopt:
|
|
||||||
pip install six
|
|
||||||
pip install --prefer-binary -r test/unit/v1beta1/requirements.txt
|
|
||||||
pip install --prefer-binary -r cmd/suggestion/skopt/v1beta1/requirements.txt
|
|
||||||
pytest ./test/unit/v1beta1/suggestion/test_skopt_service.py
|
|
||||||
|
|
6
OWNERS
6
OWNERS
|
@ -1,10 +1,8 @@
|
||||||
approvers:
|
approvers:
|
||||||
- andreyvelich
|
- andreyvelich
|
||||||
- gaocegege
|
- gaocegege
|
||||||
|
- hougangliu
|
||||||
- johnugeorge
|
- johnugeorge
|
||||||
reviewers:
|
reviewers:
|
||||||
- anencore94
|
|
||||||
- c-bata
|
- c-bata
|
||||||
- Electronic-Waste
|
- sperlingxx
|
||||||
emeritus_approvers:
|
|
||||||
- tenzen-y
|
|
||||||
|
|
2
PROJECT
2
PROJECT
|
@ -1,3 +1,3 @@
|
||||||
version: "3"
|
version: "1"
|
||||||
domain: kubeflow.org
|
domain: kubeflow.org
|
||||||
repo: github.com/kubeflow/katib
|
repo: github.com/kubeflow/katib
|
||||||
|
|
511
README.md
511
README.md
|
@ -1,18 +1,15 @@
|
||||||
# Kubeflow Katib
|
|
||||||
|
|
||||||
[](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
|
|
||||||
[](https://coveralls.io/github/kubeflow/katib?branch=master)
|
|
||||||
[](https://goreportcard.com/report/github.com/kubeflow/katib)
|
|
||||||
[](https://github.com/kubeflow/katib/releases)
|
|
||||||
[](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
|
|
||||||
[](https://www.bestpractices.dev/projects/9941)
|
|
||||||
|
|
||||||
<h1 align="center">
|
<h1 align="center">
|
||||||
<img src="./docs/images/logo-title.png" alt="logo" width="200">
|
<img src="./docs/images/logo-title.png" alt="logo" width="200">
|
||||||
<br>
|
<br>
|
||||||
</h1>
|
</h1>
|
||||||
|
|
||||||
Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML).
|
[](https://travis-ci.com/kubeflow/katib)
|
||||||
|
[](https://coveralls.io/github/kubeflow/katib?branch=master)
|
||||||
|
[](https://goreportcard.com/report/github.com/kubeflow/katib)
|
||||||
|
[](https://github.com/kubeflow/katib/releases)
|
||||||
|
[](https://kubeflow.slack.com/archives/C018PMV53NW)
|
||||||
|
|
||||||
|
Katib is a Kubernetes-native project for automated machine learning (AutoML).
|
||||||
Katib supports
|
Katib supports
|
||||||
[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization),
|
[Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization),
|
||||||
[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and
|
[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and
|
||||||
|
@ -20,187 +17,349 @@ Katib supports
|
||||||
|
|
||||||
Katib is a project that is agnostic to machine learning (ML) frameworks.
|
Katib is the project which is agnostic to machine learning (ML) frameworks.
|
||||||
It can tune hyperparameters of applications written in any language of the
|
It can tune hyperparameters of applications written in any language of the
|
||||||
users’ choice and natively supports many ML frameworks, such as
|
users’ choice and natively supports many ML frameworks, such as TensorFlow,
|
||||||
[TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.
|
MXNet, PyTorch, XGBoost, and others.
|
||||||
|
|
||||||
Katib can perform training jobs using any Kubernetes
|
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
|
||||||
[Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/)
|
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
|
||||||
with out of the box support for [Kubeflow Training Operator](https://github.com/kubeflow/training-operator),
|
|
||||||
[Argo Workflows](https://github.com/argoproj/argo-workflows), [Tekton Pipelines](https://github.com/tektoncd/pipeline)
|
|
||||||
and many more.
|
|
||||||
|
|
||||||
Katib stands for `secretary` in Arabic.
|
# Table of Contents
|
||||||
|
|
||||||
## Search Algorithms
|
- [Getting Started](#getting-started)
|
||||||
|
- [Name](#name)
|
||||||
|
- [Concepts in Katib](#concepts-in-katib)
|
||||||
|
- [Experiment](#experiment)
|
||||||
|
- [Suggestion](#suggestion)
|
||||||
|
- [Trial](#trial)
|
||||||
|
- [Worker Job](#worker-job)
|
||||||
|
- [Search Algorithms](#search-algorithms)
|
||||||
|
- [Hyperparameter Tuning](#hyperparameter-tuning)
|
||||||
|
- [Neural Architecture Search](#neural-architecture-search)
|
||||||
|
- [Components in Katib](#components-in-katib)
|
||||||
|
- [Web UI](#web-ui)
|
||||||
|
- [New UI](#new-ui)
|
||||||
|
- [GRPC API documentation](#grpc-api-documentation)
|
||||||
|
- [Installation](#installation)
|
||||||
|
- [TF operator](#tf-operator)
|
||||||
|
- [PyTorch operator](#pytorch-operator)
|
||||||
|
- [Katib](#katib)
|
||||||
|
- [Running examples](#running-examples)
|
||||||
|
- [Katib SDK](#katib-sdk)
|
||||||
|
- [Cleanups](#cleanups)
|
||||||
|
- [Quick Start](#quick-start)
|
||||||
|
- [Community](#community)
|
||||||
|
- [Blog posts](#blog-posts)
|
||||||
|
- [Contributing](#contributing)
|
||||||
|
- [Citation](#citation)
|
||||||
|
|
||||||
Katib supports several search algorithms. Follow the
|
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
|
||||||
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms)
|
|
||||||
to know more about each algorithm and check the
|
|
||||||
[this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib)
|
|
||||||
to implement your custom algorithm.
|
|
||||||
|
|
||||||
<table>
|
Created by [doctoc](https://github.com/thlorenz/doctoc).
|
||||||
<tbody>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<b>Hyperparameter Tuning</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Neural Architecture Search</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Early Stopping</b>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#random-search">Random Search</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#neural-architecture-search-based-on-enas">ENAS</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stop</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#grid-search">Grid Search</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#bayesian-optimization">Bayesian Optimization</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#tree-of-parzen-estimators-tpe">TPE</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#multivariate-tpe">Multivariate TPE</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#covariance-matrix-adaptation-evolution-strategy-cma-es">CMA-ES</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#sobols-quasirandom-sequence">Sobol's Quasirandom Sequence</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">HyperBand</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#pbt">Population Based Training</a>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
To perform the above algorithms Katib supports the following frameworks:
|
|
||||||
|
|
||||||
- [Goptuna](https://github.com/c-bata/goptuna)
|
|
||||||
- [Hyperopt](https://github.com/hyperopt/hyperopt)
|
|
||||||
- [Optuna](https://github.com/optuna/optuna)
|
|
||||||
- [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize)
|
|
||||||
|
|
||||||
## Prerequisites
|
|
||||||
|
|
||||||
Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites)
|
|
||||||
for prerequisites to install Katib.
|
|
||||||
|
|
||||||
## Installation
|
|
||||||
|
|
||||||
Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib)
|
|
||||||
for the detailed instructions on how to install Katib.
|
|
||||||
|
|
||||||
### Installing the Control Plane
|
|
||||||
|
|
||||||
Run the following command to install the latest stable release of Katib control plane:
|
|
||||||
|
|
||||||
```
|
|
||||||
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"
|
|
||||||
```
|
|
||||||
|
|
||||||
Run the following command to install the latest changes of Katib control plane:
|
|
||||||
|
|
||||||
```
|
|
||||||
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
|
|
||||||
```
|
|
||||||
|
|
||||||
For the Katib Experiments check the [complete examples list](./examples/v1beta1).
|
|
||||||
|
|
||||||
### Installing the Python SDK
|
|
||||||
|
|
||||||
Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify creation of
|
|
||||||
hyperparameter tuning jobs for Data Scientists.
|
|
||||||
|
|
||||||
Run the following command to install the latest stable release of Katib SDK:
|
|
||||||
|
|
||||||
```sh
|
|
||||||
pip install -U kubeflow-katib
|
|
||||||
```
|
|
||||||
|
|
||||||
## Getting Started
|
## Getting Started
|
||||||
|
|
||||||
Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk)
|
Follow the
|
||||||
to quickly create your first hyperparameter tuning Experiment using the Python SDK.
|
[getting-started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/)
|
||||||
|
on the Kubeflow website.
|
||||||
|
|
||||||
|
## Name
|
||||||
|
|
||||||
|
Katib stands for `secretary` in Arabic.
|
||||||
|
|
||||||
|
## Concepts in Katib
|
||||||
|
|
||||||
|
For a detailed description of the concepts in Katib and AutoML, check the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/overview/).
|
||||||
|
|
||||||
|
Katib has the concepts of `Experiment`, `Suggestion`, `Trial` and `Worker Job`.
|
||||||
|
|
||||||
|
### Experiment
|
||||||
|
|
||||||
|
An `Experiment` represents a single optimization run over a feasible space.
|
||||||
|
Each `Experiment` contains a configuration:
|
||||||
|
|
||||||
|
1. **Objective**: What you want to optimize.
|
||||||
|
2. **Search Space**: Constraints for configurations describing the feasible space.
|
||||||
|
3. **Search Algorithm**: How to find the optimal configurations.
|
||||||
|
|
||||||
|
Katib `Experiment` is defined as a CRD. Check the detailed guide to
|
||||||
|
[configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/)
|
||||||
|
in the Kubeflow docs.
|
||||||
|
|
||||||
|
### Suggestion
|
||||||
|
|
||||||
|
A `Suggestion` is a set of hyperparameter values that the hyperparameter tuning
|
||||||
|
process has proposed. Katib creates a `Trial` to evaluate
|
||||||
|
the suggested set of values.
|
||||||
|
|
||||||
|
Katib `Suggestion` is defined as a CRD.
|
||||||
|
|
||||||
|
### Trial
|
||||||
|
|
||||||
|
A `Trial` is one iteration of the hyperparameter tuning process.
|
||||||
|
A `Trial` corresponds to one worker job instance with a list of parameter
|
||||||
|
assignments. The list of parameter assignments corresponds to a `Suggestion`.
|
||||||
|
|
||||||
|
Each `Experiment` runs several `Trials`. The `Experiment` runs the `Trials` until
|
||||||
|
it reaches either the objective or the configured maximum number of `Trials`.
|
||||||
|
|
||||||
|
Katib `Trial` is defined as a CRD.
|
||||||
|
|
||||||
|
### Worker Job
|
||||||
|
|
||||||
|
The `Worker Job` is the process that runs to evaluate a `Trial` and calculate
|
||||||
|
its objective value.
|
||||||
|
|
||||||
|
The `Worker Job` can be any type of Kubernetes resource or
|
||||||
|
[Kubernetes CRD](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/).
|
||||||
|
Follow the [`Trial` template guide](https://www.kubeflow.org/docs/components/katib/trial-template/#custom-resource)
|
||||||
|
to support your own Kubernetes resource in Katib.
|
||||||
|
|
||||||
|
Katib has these CRD examples in upstream:
|
||||||
|
|
||||||
|
- [Kubernetes `Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
|
||||||
|
|
||||||
|
- [Kubeflow `TFJob`](https://www.kubeflow.org/docs/components/training/tftraining/)
|
||||||
|
|
||||||
|
- [Kubeflow `PyTorchJob`](https://www.kubeflow.org/docs/components/training/pytorch/)
|
||||||
|
|
||||||
|
- [Kubeflow `MPIJob`](https://www.kubeflow.org/docs/components/training/mpi/)
|
||||||
|
|
||||||
|
- [Tekton `Pipeline`](https://github.com/tektoncd/pipeline)
|
||||||
|
|
||||||
|
Thus, Katib supports multiple frameworks with the help of different job kinds.
|
||||||
|
|
||||||
|
### Search Algorithms
|
||||||
|
|
||||||
|
Katib currently supports several search algorithms. Follow the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail)
|
||||||
|
to know more about each algorithm.
|
||||||
|
|
||||||
|
#### Hyperparameter Tuning
|
||||||
|
|
||||||
|
- [Random Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Random_search)
|
||||||
|
- [Tree of Parzen Estimators (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)
|
||||||
|
- [Grid Search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search)
|
||||||
|
- [Hyperband](https://arxiv.org/pdf/1603.06560.pdf)
|
||||||
|
- [Bayesian Optimization](https://arxiv.org/pdf/1012.2599.pdf)
|
||||||
|
- [Covariance Matrix Adaptation Evolution Strategy (CMA-ES)](https://arxiv.org/abs/1604.00772)
|
||||||
|
|
||||||
|
#### Neural Architecture Search
|
||||||
|
|
||||||
|
- [Efficient Neural Architecture Search (ENAS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/enas)
|
||||||
|
- [Differentiable Architecture Search (DARTS)](https://github.com/kubeflow/katib/tree/master/pkg/suggestion/v1beta1/nas/darts)
|
||||||
|
|
||||||
|
## Components in Katib
|
||||||
|
|
||||||
|
Katib consists of several components as shown below. Each component is running
|
||||||
|
on Kubernetes as a deployment. Each component communicates with others via GRPC
|
||||||
|
and the API is defined at `pkg/apis/manager/v1beta1/api.proto`.
|
||||||
|
|
||||||
|
- Katib main components:
|
||||||
|
- `katib-db-manager` - the GRPC API server of Katib which is the DB Interface.
|
||||||
|
- `katib-mysql` - the data storage backend of Katib using mysql.
|
||||||
|
- `katib-ui` - the user interface of Katib.
|
||||||
|
- `katib-controller` - the controller for the Katib CRDs in Kubernetes.
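
Since the components talk to `katib-db-manager` over gRPC on port 6789, a client can be sketched roughly as follows. The service address, method name, and request fields below are assumptions based on the v1beta1 API; check `pkg/apis/manager/v1beta1/api.proto` for the exact definitions:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
)

func main() {
	// Assumed in-cluster address of the Katib DB Manager service.
	conn, err := grpc.Dial("katib-db-manager.kubeflow:6789",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// GetObservationLog and its request fields are assumptions; see api.proto for the real schema.
	client := api_pb.NewDBManagerClient(conn)
	reply, err := client.GetObservationLog(ctx, &api_pb.GetObservationLogRequest{TrialName: "my-trial"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("observation log: %v", reply)
}
```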
|
||||||
|
|
||||||
|
## Web UI
|
||||||
|
|
||||||
|
Katib provides a Web UI.
|
||||||
|
You can visualize general trend of Hyper parameter space and
|
||||||
|
each training history. You can use
|
||||||
|
[random-example](https://github.com/kubeflow/katib/blob/master/examples/v1beta1/random-example.yaml)
|
||||||
|
or
|
||||||
|
[other examples](https://github.com/kubeflow/katib/blob/master/examples/v1beta1)
|
||||||
|
to generate a similar UI. Follow the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-ui)
|
||||||
|
to access the Katib UI.
|
||||||
|

|
||||||
|
|
||||||
|
### New UI
|
||||||
|
|
||||||
|
During 1.3 we've worked on a new iteration of the UI, which is rewritten in
|
||||||
|
Angular and is utilizing the common code of the other Kubeflow [dashboards](https://github.com/kubeflow/kubeflow/tree/master/components/crud-web-apps).
|
||||||
|
While this UI is not yet on par with the current default one, we are actively
|
||||||
|
working to get it up to speed and provide all the existing functionalities.
|
||||||
|
|
||||||
|
The users are currently able to list, delete and create Experiments in their
|
||||||
|
cluster via this new UI as well as inspect the owned Trials. One important
|
||||||
|
missing functionality is the ability to edit the TrialTemplate ConfigMaps.
|
||||||
|
|
||||||
|
While this UI is not ready to replace the current one we would like to
|
||||||
|
encourage users to also give it a try and provide us with feedback. To try it
|
||||||
|
out the user has to update the Katib UI image `newName` with the new registry
|
||||||
|
`docker.io/kubeflowkatib/katib-new-ui` in the [Kustomize](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L43)
|
||||||
|
manifests.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## GRPC API documentation
|
||||||
|
|
||||||
|
Check the [Katib v1beta1 API reference docs](https://www.kubeflow.org/docs/reference/katib/v1beta1/katib/).
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
For standard installation of Katib with support for all job operators,
|
||||||
|
install Kubeflow.
|
||||||
|
Follow the documentation:
|
||||||
|
|
||||||
|
- [Kubeflow installation guide](https://www.kubeflow.org/docs/started/getting-started/)
|
||||||
|
- [Kubeflow Katib guides](https://www.kubeflow.org/docs/components/katib/).
|
||||||
|
|
||||||
|
If you install Katib with other Kubeflow components,
|
||||||
|
you can't submit Katib jobs in Kubeflow namespace. Check the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm)
|
||||||
|
to know more about it.
|
||||||
|
|
||||||
|
Alternatively, if you want to install Katib manually with TF and PyTorch
|
||||||
|
operators support, follow these steps:
|
||||||
|
|
||||||
|
Create Kubeflow namespace:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl create namespace kubeflow
|
||||||
|
```
|
||||||
|
|
||||||
|
Clone Kubeflow manifest repository:
|
||||||
|
|
||||||
|
```
|
||||||
|
git clone -b v1.2-branch git@github.com:kubeflow/manifests.git
|
||||||
|
Set `MANIFESTS_DIR` to the cloned folder.
|
||||||
|
export MANIFESTS_DIR=<cloned-folder>
|
||||||
|
```
|
||||||
|
|
||||||
|
### TF operator
|
||||||
|
|
||||||
|
For installing TF operator, run the following:
|
||||||
|
|
||||||
|
```
|
||||||
|
cd "${MANIFESTS_DIR}/tf-training/tf-job-crds/base"
|
||||||
|
kustomize build . | kubectl apply -f -
|
||||||
|
cd "${MANIFESTS_DIR}/tf-training/tf-job-operator/base"
|
||||||
|
kustomize build . | kubectl apply -f -
|
||||||
|
```
|
||||||
|
|
||||||
|
### PyTorch operator
|
||||||
|
|
||||||
|
For installing PyTorch operator, run the following:
|
||||||
|
|
||||||
|
```
|
||||||
|
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-job-crds/base"
|
||||||
|
kustomize build . | kubectl apply -f -
|
||||||
|
cd "${MANIFESTS_DIR}/pytorch-job/pytorch-operator/base/"
|
||||||
|
kustomize build . | kubectl apply -f -
|
||||||
|
```
|
||||||
|
|
||||||
|
### Katib
|
||||||
|
|
||||||
|
Note that your [kustomize](https://kustomize.io/) version should be >= 3.2.
|
||||||
|
To install Katib run:
|
||||||
|
|
||||||
|
```
|
||||||
|
git clone git@github.com:kubeflow/katib.git
|
||||||
|
make deploy
|
||||||
|
```
|
||||||
|
|
||||||
|
Check if all components are running successfully:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl get pods -n kubeflow
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected output:
|
||||||
|
|
||||||
|
```
|
||||||
|
NAME READY STATUS RESTARTS AGE
|
||||||
|
katib-controller-858d6cc48c-df9jc 1/1 Running 1 20m
|
||||||
|
katib-db-manager-7966fbdf9b-w2tn8 1/1 Running 0 20m
|
||||||
|
katib-mysql-7f8bc6956f-898f9 1/1 Running 0 20m
|
||||||
|
katib-ui-7cf9f967bf-nm72p 1/1 Running 0 20m
|
||||||
|
pytorch-operator-55f966b548-9gq9v 1/1 Running 0 20m
|
||||||
|
tf-job-operator-796b4747d8-4fh82 1/1 Running 0 21m
|
||||||
|
```
|
||||||
|
|
||||||
|
### Running examples
|
||||||
|
|
||||||
|
After deploying everything, you can run examples to verify the installation.
|
||||||
|
|
||||||
|
This is an example for TF operator:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
This is an example for PyTorch operator:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
Check the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-algorithm)
|
||||||
|
to learn how to monitor your `Experiment` status.
|
||||||
|
|
||||||
|
You can view your results in Katib UI.
|
||||||
|
If you used standard installation, access the Katib UI via Kubeflow dashboard.
|
||||||
|
Otherwise, port-forward the `katib-ui`:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl -n kubeflow port-forward svc/katib-ui 8080:80
|
||||||
|
```
|
||||||
|
|
||||||
|
You can access the Katib UI using this URL: `http://localhost:8080/katib/`.
|
||||||
|
|
||||||
|
### Katib SDK
|
||||||
|
|
||||||
|
Katib supports Python SDK:
|
||||||
|
|
||||||
|
- Check the [Katib v1beta1 SDK documentation](https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1).
|
||||||
|
|
||||||
|
Run `make generate` to update Katib SDK.
|
||||||
|
|
||||||
|
### Cleanups
|
||||||
|
|
||||||
|
To delete installed TF and PyTorch operator run `kubectl delete -f`
|
||||||
|
on the respective folders.
|
||||||
|
|
||||||
|
To delete Katib run `make undeploy`.
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
Please follow the
|
||||||
|
[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/hyperparameter/#examples)
|
||||||
|
to submit your first Katib experiment.
|
||||||
|
|
||||||
## Community
|
## Community
|
||||||
|
|
||||||
The following links provide information on how to get involved in the community:
|
We are always growing our community and invite new users and AutoML enthusiasts
|
||||||
|
to contribute to the Katib project. The following links provide information
|
||||||
|
about getting involved in the community:
|
||||||
|
|
||||||
- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV)
|
- If you use Katib, please update [the adopters list](ADOPTERS.md).
|
||||||
community meeting.
|
|
||||||
- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
|
- Subscribe
|
||||||
Slack channel.
|
[to the calendar](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)
|
||||||
- Check out [who is using Katib](ADOPTERS.md) and [presentations about Katib project](docs/presentations.md).
|
to attend the AutoML WG community meeting.
|
||||||
|
|
||||||
|
- Check
|
||||||
|
[the AutoML WG meeting notes](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit).
|
||||||
|
|
||||||
|
- Join
|
||||||
|
[the AutoML WG Slack channel](https://kubeflow.slack.com/archives/C018PMV53NW).
|
||||||
|
|
||||||
|
- Learn more about Katib in
|
||||||
|
[the presentations and demos list](./docs/presentations.md).
|
||||||
|
|
||||||
|
### Blog posts
|
||||||
|
|
||||||
|
- [Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML](https://blog.kubeflow.org/katib/)
|
||||||
|
(by Andrey Velichkevich)
|
||||||
|
|
||||||
## Contributing
|
## Contributing
|
||||||
|
|
||||||
Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).
|
Please feel free to test the system!
|
||||||
|
[developer-guide.md](./docs/developer-guide.md) is a good starting point
|
||||||
|
for developers.
|
||||||
|
|
||||||
## Citation
|
## Citation
|
||||||
|
|
||||||
|
|
68
ROADMAP.md
68
ROADMAP.md
|
@ -1,71 +1,3 @@
|
||||||
# Katib 2022/2023 Roadmap
|
|
||||||
|
|
||||||
## AutoML Features
|
|
||||||
|
|
||||||
- Support advance HyperParameter tuning algorithms:
|
|
||||||
|
|
||||||
- Population Based Training (PBT) - [#1382](https://github.com/kubeflow/katib/issues/1382)
|
|
||||||
- Tree of Parzen Estimators (TPE)
|
|
||||||
- Multivariate TPE
|
|
||||||
- Sobol’s Quasirandom Sequence
|
|
||||||
- Asynchronous Successive Halving - [ASHA](https://arxiv.org/pdf/1810.05934.pdf)
|
|
||||||
|
|
||||||
- Support multi-objective optimization - [#1549](https://github.com/kubeflow/katib/issues/1549)
|
|
||||||
- Support various HP distributions (log-uniform, uniform, normal) - [#1207](https://github.com/kubeflow/katib/issues/1207)
|
|
||||||
- Support Auto Model Compression - [#460](https://github.com/kubeflow/katib/issues/460)
|
|
||||||
- Support Auto Feature Engineering - [#475](https://github.com/kubeflow/katib/issues/475)
|
|
||||||
- Improve Neural Architecture Search design
|
|
||||||
|
|
||||||
## Backend and API Enhancements
|
|
||||||
|
|
||||||
- Conformance tests for Katib - [#2044](https://github.com/kubeflow/katib/issues/2044)
|
|
||||||
- Support push-based metrics collection in Katib - [#577](https://github.com/kubeflow/katib/issues/577)
|
|
||||||
- Support PostgreSQL as a Katib DB - [#915](https://github.com/kubeflow/katib/issues/915)
|
|
||||||
- Improve Katib scalability - [#1847](https://github.com/kubeflow/katib/issues/1847)
|
|
||||||
- Promote Katib APIs to the `v1` version
|
|
||||||
- Support multiple CRD versions (`v1beta1`, `v1`) with conversion webhook
|
|
||||||
|
|
||||||
## Improve Katib User Experience
|
|
||||||
|
|
||||||
- Simplify Katib Experiment creation with Katib SDK - [#1951](https://github.com/kubeflow/katib/pull/1951)
|
|
||||||
- Fully migrate to a new Katib UI - [Project 1](https://github.com/kubeflow/katib/projects/1)
|
|
||||||
- Expose Trial logs in Katib UI - [#971](https://github.com/kubeflow/katib/issues/971)
|
|
||||||
- Enhance Katib UI visualization metrics for AutoML Experiments
|
|
||||||
- Improve Katib Config UX - [#2150](https://github.com/kubeflow/katib/issues/2150)
|
|
||||||
|
|
||||||
## Integration with Kubeflow Components
|
|
||||||
|
|
||||||
- Kubeflow Pipeline as a Katib Trial target - [#1914](https://github.com/kubeflow/katib/issues/1914)
|
|
||||||
- Improve data passing when Katib Experiment is part of Kubeflow Pipeline - [#1846](https://github.com/kubeflow/katib/issues/1846)
|
|
||||||
|
|
||||||
# History
|
|
||||||
|
|
||||||
# Katib 2021 Roadmap
|
|
||||||
|
|
||||||
## New Features
|
|
||||||
|
|
||||||
### AutoML
|
|
||||||
|
|
||||||
- Support Population Based Training [#1382](https://github.com/kubeflow/katib/issues/1382)
|
|
||||||
- Support [ASHA](https://arxiv.org/pdf/1810.05934.pdf)
|
|
||||||
- Support Auto Model Compression [#460](https://github.com/kubeflow/katib/issues/460)
|
|
||||||
- Support Auto Feature Engineering [#475](https://github.com/kubeflow/katib/issues/475)
|
|
||||||
- Various CRDs for HP, NAS and other AutoML techniques.
|
|
||||||
|
|
||||||
### UI
|
|
||||||
|
|
||||||
- Migrate to the new Katib UI [Project 1](https://github.com/kubeflow/katib/projects/1)
|
|
||||||
- Hyperparameter importances visualization with fANOVA algorithm
|
|
||||||
|
|
||||||
## Enhancements
|
|
||||||
|
|
||||||
- Finish AWS CI/CD migration
|
|
||||||
- Support various parameter distribution [#1207](https://github.com/kubeflow/katib/issues/1207)
|
|
||||||
- Finish validation for Algorithms [#1126](https://github.com/kubeflow/katib/issues/1126)
|
|
||||||
- Refactor Hyperband [#1389](https://github.com/kubeflow/katib/issues/1389)
|
|
||||||
- Support multiple CRD version with conversion webhook
|
|
||||||
- MLMD integration with Katib Experiments
|
|
||||||
|
|
||||||
# Katib 2020 Roadmap
|
# Katib 2020 Roadmap
|
||||||
|
|
||||||
## New Features
|
## New Features
|
||||||
|
|
64
SECURITY.md
64
SECURITY.md
|
@ -1,64 +0,0 @@
|
||||||
# Security Policy
|
|
||||||
|
|
||||||
## Supported Versions
|
|
||||||
|
|
||||||
Kubeflow Katib versions are expressed as `vX.Y.Z`, where X is the major version,
|
|
||||||
Y is the minor version, and Z is the patch version, following the
|
|
||||||
[Semantic Versioning](https://semver.org/) terminology.
|
|
||||||
|
|
||||||
The Kubeflow Katib project maintains release branches for the most recent two minor releases.
|
|
||||||
Applicable fixes, including security fixes, may be backported to those two release branches,
|
|
||||||
depending on severity and feasibility.
|
|
||||||
|
|
||||||
Users are encouraged to stay updated with the latest releases to benefit from security patches and
|
|
||||||
improvements.
|
|
||||||
|
|
||||||
## Reporting a Vulnerability
|
|
||||||
|
|
||||||
We're extremely grateful for security researchers and users that report vulnerabilities to the
|
|
||||||
Kubeflow Open Source Community. All reports are thoroughly investigated by the Kubeflow project owners.
|
|
||||||
|
|
||||||
You can use the following ways to report security vulnerabilities privately:
|
|
||||||
|
|
||||||
- Using the Kubeflow Katib repository [GitHub Security Advisory](https://github.com/kubeflow/katib/security/advisories/new).
|
|
||||||
- Using our private Kubeflow Steering Committee mailing list: ksc@kubeflow.org.
|
|
||||||
|
|
||||||
Please provide detailed information to help us understand and address the issue promptly.
|
|
||||||
|
|
||||||
## Disclosure Process
|
|
||||||
|
|
||||||
**Acknowledgment**: We will acknowledge receipt of your report within 10 business days.
|
|
||||||
|
|
||||||
**Assessment**: The Kubeflow project owners will investigate the reported issue to determine its
|
|
||||||
validity and severity.
|
|
||||||
|
|
||||||
**Resolution**: If the issue is confirmed, we will work on a fix and prepare a release.
|
|
||||||
|
|
||||||
**Notification**: Once a fix is available, we will notify the reporter and coordinate a public
|
|
||||||
disclosure.
|
|
||||||
|
|
||||||
**Public Disclosure**: Details of the vulnerability and the fix will be published in the project's
|
|
||||||
release notes and communicated through appropriate channels.
|
|
||||||
|
|
||||||
## Prevention Mechanisms
|
|
||||||
|
|
||||||
Kubeflow Katib employs several measures to prevent security issues:
|
|
||||||
|
|
||||||
**Code Reviews**: All code changes are reviewed by maintainers to ensure code quality and security.
|
|
||||||
|
|
||||||
**Dependency Management**: Regular updates and monitoring of dependencies (e.g. Dependabot) to
|
|
||||||
address known vulnerabilities.
|
|
||||||
|
|
||||||
**Continuous Integration**: Automated testing and security checks are integrated into the CI/CD pipeline.
|
|
||||||
|
|
||||||
**Image Scanning**: Container images are scanned for vulnerabilities.
|
|
||||||
|
|
||||||
## Communication Channels
|
|
||||||
|
|
||||||
For the general questions please join the following resources:
|
|
||||||
|
|
||||||
- Kubeflow [Slack channels](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels).
|
|
||||||
|
|
||||||
- Kubeflow discuss [mailing list](https://www.kubeflow.org/docs/about/community/#kubeflow-mailing-list).
|
|
||||||
|
|
||||||
Please **do not report** security vulnerabilities through public channels.
|
|
|
@ -0,0 +1,13 @@
|
||||||
|
FROM alpine:3.12.4
|
||||||
|
|
||||||
|
ARG KUBECTL_VERSION="v1.19.3"
|
||||||
|
|
||||||
|
RUN apk add --update openssl
|
||||||
|
RUN wget https://storage.googleapis.com/kubernetes-release/release/$KUBECTL_VERSION/bin/linux/amd64/kubectl \
|
||||||
|
&& chmod +x ./kubectl && mv ./kubectl /usr/local/bin/kubectl
|
||||||
|
|
||||||
|
COPY ./hack/cert-generator.sh /app/cert-generator.sh
|
||||||
|
RUN chmod +x /app/cert-generator.sh
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
ENTRYPOINT ["sh", "./cert-generator.sh"]
|
|
@@ -1,8 +1,6 @@
 # Build the Katib DB manager.
 FROM golang:alpine AS build-env
 
-ARG TARGETARCH
-
 WORKDIR /go/src/github.com/kubeflow/katib
 
 # Download packages.
@@ -15,10 +13,29 @@ COPY cmd/ cmd/
 COPY pkg/ pkg/
 
 # Build the binary.
-RUN CGO_ENABLED=0 GOOS=linux GOARCH="${TARGETARCH}" go build -a -o katib-db-manager ./cmd/db-manager/v1beta1
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-db-manager ./cmd/db-manager/v1beta1; \
+    fi
+
+# Add GRPC health probe.
+RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
+    if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
 
 # Copy the db-manager into a thin image.
-FROM alpine:3.15
+FROM alpine:3.7
 WORKDIR /app
+COPY --from=build-env /bin/grpc_health_probe /bin/
 COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-db-manager /app/
 ENTRYPOINT ["./katib-db-manager"]
+CMD ["-w", "kubernetes"]

@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -22,21 +22,19 @@ import (
 	"fmt"
 	"net"
 	"os"
-	"time"
 
 	health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
 	api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
 	db "github.com/kubeflow/katib/pkg/db/v1beta1"
 	"github.com/kubeflow/katib/pkg/db/v1beta1/common"
-	"k8s.io/klog/v2"
+	"k8s.io/klog"
 
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/reflection"
 )
 
 const (
-	defaultListenAddress  = "0.0.0.0:6789"
-	defaultConnectTimeout = time.Second * 60
+	port = "0.0.0.0:6789"
 )
 
 var dbIf common.KatibDBInterface
@@ -89,30 +87,25 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
 }
 
 func main() {
-	var connectTimeout time.Duration
-	var listenAddress string
-	flag.DurationVar(&connectTimeout, "connect-timeout", defaultConnectTimeout, "Timeout before calling error during database connection. (e.g. 120s)")
-	flag.StringVar(&listenAddress, "listen-address", defaultListenAddress, "The network interface or IP address to receive incoming connections. (e.g. 0.0.0.0:6789)")
 	flag.Parse()
 
 	var err error
 	dbNameEnvName := common.DBNameEnvName
 	dbName := os.Getenv(dbNameEnvName)
 	if dbName == "" {
 		klog.Fatal("DB_NAME env is not set. Exiting")
 	}
-	dbIf, err = db.NewKatibDBInterface(dbName, connectTimeout)
+	dbIf, err = db.NewKatibDBInterface(dbName)
 	if err != nil {
 		klog.Fatalf("Failed to open db connection: %v", err)
 	}
 	dbIf.DBInit()
-	listener, err := net.Listen("tcp", listenAddress)
+	listener, err := net.Listen("tcp", port)
 	if err != nil {
 		klog.Fatalf("Failed to listen: %v", err)
 	}
 
 	size := 1<<31 - 1
-	klog.Infof("Start Katib manager: %s", listenAddress)
+	klog.Infof("Start Katib manager: %s", port)
 	s := grpc.NewServer(grpc.MaxRecvMsgSize(size), grpc.MaxSendMsgSize(size))
 	api_pb.RegisterDBManagerServer(s, &server{})
 	health_pb.RegisterHealthServer(s, &server{})

@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -20,7 +20,7 @@ import (
 	"context"
 	"testing"
 
-	"go.uber.org/mock/gomock"
+	"github.com/golang/mock/gomock"
 
 	health_pb "github.com/kubeflow/katib/pkg/apis/manager/health"
 	api_pb "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"

@@ -1,24 +1,22 @@
-FROM python:3.11-slim
+FROM python:3.6
 
-ARG TARGETARCH
 ENV TARGET_DIR /opt/katib
 ENV EARLY_STOPPING_DIR cmd/earlystopping/medianstop/v1beta1
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python
 
-RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
     apt-get -y update && \
     apt-get -y install gfortran libopenblas-dev liblapack-dev && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*; \
+    pip install cython; \
     fi
 
 ADD ./pkg/ ${TARGET_DIR}/pkg/
 ADD ./${EARLY_STOPPING_DIR}/ ${TARGET_DIR}/${EARLY_STOPPING_DIR}/
 
 WORKDIR ${TARGET_DIR}/${EARLY_STOPPING_DIR}
+RUN pip install --no-cache-dir -r requirements.txt
 
-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
 RUN chgrp -R 0 ${TARGET_DIR} \
     && chmod -R g+rwX ${TARGET_DIR}
 
+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python
+
 ENTRYPOINT ["python", "main.py"]

@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import logging
-import time
-from concurrent import futures
-
 import grpc
+import time
+import logging
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
 from pkg.earlystopping.v1beta1.medianstop.service import MedianStopService
+from concurrent import futures
 
 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6788"

@@ -1,5 +1,4 @@
-grpcio>=1.64.1
-protobuf>=4.21.12,<5
+grpcio==1.23.0
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-kubernetes==22.6.0
-cython>=0.29.24
+kubernetes==11.0.0

@@ -1,8 +1,6 @@
 # Build the Katib controller.
 FROM golang:alpine AS build-env
 
-ARG TARGETARCH
-
 WORKDIR /go/src/github.com/kubeflow/katib
 
 # Download packages.
@@ -15,10 +13,16 @@ COPY cmd/ cmd/
 COPY pkg/ pkg/
 
 # Build the binary.
-RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-controller ./cmd/katib-controller/v1beta1
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-controller ./cmd/katib-controller/v1beta1; \
+    fi
 
 # Copy the controller-manager into a thin image.
-FROM alpine:3.15
+FROM alpine:3.7
 WORKDIR /app
 COPY --from=build-env /go/src/github.com/kubeflow/katib/katib-controller .
 ENTRYPOINT ["./katib-controller"]

@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -24,75 +24,59 @@ import (
 	"os"
 
 	"github.com/spf13/viper"
-	"k8s.io/apimachinery/pkg/runtime"
 	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
 	"sigs.k8s.io/controller-runtime/pkg/client/config"
-	"sigs.k8s.io/controller-runtime/pkg/healthz"
 	logf "sigs.k8s.io/controller-runtime/pkg/log"
 	"sigs.k8s.io/controller-runtime/pkg/log/zap"
 	"sigs.k8s.io/controller-runtime/pkg/manager"
 	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
-	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
-	"sigs.k8s.io/controller-runtime/pkg/webhook"
 
-	configv1beta1 "github.com/kubeflow/katib/pkg/apis/config/v1beta1"
 	apis "github.com/kubeflow/katib/pkg/apis/controller"
-	cert "github.com/kubeflow/katib/pkg/certgenerator/v1beta1"
-	"github.com/kubeflow/katib/pkg/controller.v1beta1"
+	controller "github.com/kubeflow/katib/pkg/controller.v1beta1"
 	"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
-	"github.com/kubeflow/katib/pkg/util/v1beta1/katibconfig"
-	webhookv1beta1 "github.com/kubeflow/katib/pkg/webhook/v1beta1"
-	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
-	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
+	trialutil "github.com/kubeflow/katib/pkg/controller.v1beta1/trial/util"
+	webhook "github.com/kubeflow/katib/pkg/webhook/v1beta1"
 )
 
-var (
-	scheme = runtime.NewScheme()
-	log    = logf.Log.WithName("entrypoint")
-)
-
-func init() {
-	utilruntime.Must(apis.AddToScheme(scheme))
-	utilruntime.Must(configv1beta1.AddToScheme(scheme))
-	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
-}
-
 func main() {
 	logf.SetLogger(zap.New())
+	log := logf.Log.WithName("entrypoint")
+
+	var experimentSuggestionName string
+	var metricsAddr string
+	var webhookPort int
+	var injectSecurityContext bool
+	var enableGRPCProbeInSuggestion bool
+	var trialResources trialutil.GvkListFlag
+
+	flag.StringVar(&experimentSuggestionName, "experiment-suggestion-name",
+		"default", "The implementation of suggestion interface in experiment controller (default)")
+	flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
+	flag.BoolVar(&injectSecurityContext, "webhook-inject-securitycontext", false, "Inject the securityContext of container[0] in the sidecar")
+	flag.BoolVar(&enableGRPCProbeInSuggestion, "enable-grpc-probe-in-suggestion", true, "enable grpc probe in suggestions")
+	flag.Var(&trialResources, "trial-resources", "The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)")
+	flag.IntVar(&webhookPort, "webhook-port", 8443, "The port number to be used for admission webhook server.")
+
+	// TODO (andreyvelich): Currently it is not possible to set different webhook service name.
+	// flag.StringVar(&serviceName, "webhook-service-name", "katib-controller", "The service name which will be used in webhook")
+	// TODO (andreyvelich): Currently is is not possible to store webhook cert in the local file system.
+	// flag.BoolVar(&certLocalFS, "cert-localfs", false, "Store the webhook cert in local file system")
+
-	var katibConfigFile string
-	flag.StringVar(&katibConfigFile, "katib-config", "",
-		"The katib-controller will load its initial configuration from this file. "+
-			"Omit this flag to use the default configuration values. ")
 	flag.Parse()
 
-	initConfig, err := katibconfig.GetInitConfigData(scheme, katibConfigFile)
-	if err != nil {
-		log.Error(err, "Failed to get KatibConfig")
-		os.Exit(1)
-	}
-
 	// Set the config in viper.
-	viper.Set(consts.ConfigExperimentSuggestionName, initConfig.ControllerConfig.ExperimentSuggestionName)
-	viper.Set(consts.ConfigInjectSecurityContext, initConfig.ControllerConfig.InjectSecurityContext)
-	viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, initConfig.ControllerConfig.EnableGRPCProbeInSuggestion)
-	trialGVKs, err := katibconfig.TrialResourcesToGVKs(initConfig.ControllerConfig.TrialResources)
-	if err != nil {
-		log.Error(err, "Failed to parse trialResources")
-		os.Exit(1)
-	}
-	viper.Set(consts.ConfigTrialResources, trialGVKs)
+	viper.Set(consts.ConfigExperimentSuggestionName, experimentSuggestionName)
+	viper.Set(consts.ConfigInjectSecurityContext, injectSecurityContext)
+	viper.Set(consts.ConfigEnableGRPCProbeInSuggestion, enableGRPCProbeInSuggestion)
+	viper.Set(consts.ConfigTrialResources, trialResources)
 
 	log.Info("Config:",
 		consts.ConfigExperimentSuggestionName,
 		viper.GetString(consts.ConfigExperimentSuggestionName),
 		"webhook-port",
-		initConfig.ControllerConfig.WebhookPort,
+		webhookPort,
 		"metrics-addr",
-		initConfig.ControllerConfig.MetricsAddr,
-		"healthz-addr",
-		initConfig.ControllerConfig.HealthzAddr,
+		metricsAddr,
 		consts.ConfigInjectSecurityContext,
 		viper.GetBool(consts.ConfigInjectSecurityContext),
 		consts.ConfigEnableGRPCProbeInSuggestion,
@@ -110,65 +94,20 @@ func main() {
 
 	// Create a new katib controller to provide shared dependencies and start components
 	mgr, err := manager.New(cfg, manager.Options{
-		Metrics: metricsserver.Options{
-			BindAddress: initConfig.ControllerConfig.MetricsAddr,
-		},
-		HealthProbeBindAddress: initConfig.ControllerConfig.HealthzAddr,
-		LeaderElection:         initConfig.ControllerConfig.EnableLeaderElection,
-		LeaderElectionID:       initConfig.ControllerConfig.LeaderElectionID,
-		Scheme:                 scheme,
+		MetricsBindAddress: metricsAddr,
 	})
 	if err != nil {
-		log.Error(err, "Failed to create the manager")
+		log.Error(err, "unable add APIs to scheme")
 		os.Exit(1)
 	}
 
 	log.Info("Registering Components.")
 
-	// Create a webhook server.
-	hookServer := webhook.NewServer(webhook.Options{
-		Port:    *initConfig.ControllerConfig.WebhookPort,
-		CertDir: consts.CertDir,
-	})
-
-	ctx := signals.SetupSignalHandler()
-	certsReady := make(chan struct{})
-	defer close(certsReady)
-
-	// The setupControllers will register controllers to the manager
-	// after generated certs for the admission webhooks.
-	go setupControllers(mgr, certsReady, hookServer)
-
-	if initConfig.CertGeneratorConfig.Enable {
-		if err = cert.AddToManager(mgr, initConfig.CertGeneratorConfig, certsReady); err != nil {
-			log.Error(err, "Failed to set up cert-generator")
-		}
-	} else {
-		certsReady <- struct{}{}
-	}
-
-	log.Info("Setting up health checker.")
-	if err := mgr.AddReadyzCheck("readyz", hookServer.StartedChecker()); err != nil {
-		log.Error(err, "Unable to add readyz endpoint to the manager")
+	// Setup Scheme for all resources
+	if err := apis.AddToScheme(mgr.GetScheme()); err != nil {
+		log.Error(err, "Fail to create the manager")
 		os.Exit(1)
 	}
-	if err = mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
-		log.Error(err, "Add webhook server health checker to the manager failed")
-		os.Exit(1)
-	}
-
-	// Start the Cmd
-	log.Info("Starting the manager.")
-	if err = mgr.Start(ctx); err != nil {
-		log.Error(err, "Unable to run the manager")
-		os.Exit(1)
-	}
-}
-
-func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer webhook.Server) {
-	// The certsReady blocks to register controllers until generated certs.
-	<-certsReady
-	log.Info("Certs ready")
 
 	// Setup all Controllers
 	log.Info("Setting up controller.")
@@ -178,8 +117,15 @@ func setupControllers(mgr manager.Manager, certsReady chan struct{}, hookServer
 	}
 
 	log.Info("Setting up webhooks.")
-	if err := webhookv1beta1.AddToManager(mgr, hookServer); err != nil {
+	if err := webhook.AddToManager(mgr, webhookPort); err != nil {
 		log.Error(err, "Unable to register webhooks to the manager")
 		os.Exit(1)
 	}
+
+	// Start the Cmd
+	log.Info("Starting the Cmd.")
+	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
+		log.Error(err, "Unable to run the manager")
+		os.Exit(1)
+	}
 }

@@ -1,8 +1,6 @@
 # Build the Katib file metrics collector.
 FROM golang:alpine AS build-env
 
-ARG TARGETARCH
-
 WORKDIR /go/src/github.com/kubeflow/katib
 
 # Download packages.
@@ -15,10 +13,16 @@ COPY cmd/ cmd/
 COPY pkg/ pkg/
 
 # Build the binary.
-RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o file-metricscollector ./cmd/metricscollector/v1beta1/file-metricscollector; \
+    fi
 
 # Copy the file metrics collector into a thin image.
-FROM alpine:3.15
+FROM alpine:3.7
 WORKDIR /app
 COPY --from=build-env /go/src/github.com/kubeflow/katib/file-metricscollector .
 ENTRYPOINT ["./file-metricscollector"]

@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -39,21 +39,19 @@ package main
 
 import (
 	"context"
-	"encoding/json"
 	"flag"
 	"fmt"
+	"io/ioutil"
 	"os"
 	"path/filepath"
-	"regexp"
 	"strconv"
 	"strings"
 	"time"
 
-	"github.com/nxadm/tail"
-	psutil "github.com/shirou/gopsutil/v3/process"
+	"github.com/hpcloud/tail"
+	psutil "github.com/shirou/gopsutil/process"
 	"google.golang.org/grpc"
-	"google.golang.org/grpc/credentials/insecure"
-	"k8s.io/klog/v2"
+	"k8s.io/klog"
 
 	commonv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
 	api "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
@@ -104,7 +102,6 @@ var (
 	earlyStopServiceAddr = flag.String("s-earlystop", "", "Katib Early Stopping service endpoint")
 	trialName            = flag.String("t", "", "Trial Name")
 	metricsFilePath      = flag.String("path", "", "Metrics File Path")
-	metricsFileFormat    = flag.String("format", "", "Metrics File Format")
 	metricNames          = flag.String("m", "", "Metric names")
 	objectiveType        = flag.String("o-type", "", "Objective type")
 	metricFilters        = flag.String("f", "", "Metric filters")
@@ -134,17 +131,13 @@ func printMetricsFile(mFile string) {
 	checkMetricFile(mFile)
 
 	// Print lines from metrics file.
-	t, err := tail.TailFile(mFile, tail.Config{Follow: true, ReOpen: true})
-	if err != nil {
-		klog.Errorf("Failed to open metrics file: %v", err)
-	}
-
+	t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
 	for line := range t.Lines {
 		klog.Info(line.Text)
 	}
 }
 
-func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string, fileFormat commonv1beta1.FileFormat) {
+func watchMetricsFile(mFile string, stopRules stopRulesFlag, filters []string) {
 
 	// metricStartStep is the dict where key = metric name, value = start step.
 	// We should apply early stopping rule only if metric is reported at least "start_step" times.
@@ -155,6 +148,9 @@
 		}
 	}
 
+	// First metric is objective in metricNames array.
+	objMetric := strings.Split(*metricNames, ";")[0]
+	objType := commonv1beta1.ObjectiveType(*objectiveType)
 	// For objective metric we calculate best optimal value from the recorded metrics.
 	// This is workaround for Median Stop algorithm.
 	// TODO (andreyvelich): Think about it, maybe define latest, max or min strategy type in stop-rule as well ?
@@ -163,10 +159,8 @@
 	// Check that metric file exists.
 	checkMetricFile(mFile)
 
-	// Get Main process.
-	// Extract the metric file dir path based on the file name.
-	mDirPath, _ := filepath.Split(mFile)
-	_, mainProcPid, err := common.GetMainProcesses(mDirPath)
+	// Get Main proccess.
+	_, mainProcPid, err := common.GetMainProcesses(mFile)
 	if err != nil {
 		klog.Fatalf("GetMainProcesses failed: %v", err)
 	}
@@ -175,6 +169,9 @@
 		klog.Fatalf("Failed to create new Process from pid %v, error: %v", mainProcPid, err)
 	}
 
+	// Get list of regural expressions from filters.
+	metricRegList := filemc.GetFilterRegexpList(filters)
+
 	// Start watch log lines.
 	t, _ := tail.TailFile(mFile, tail.Config{Follow: true})
 	for line := range t.Lines {
@@ -182,12 +179,6 @@
 		// Print log line
 		klog.Info(logText)
 
-		switch fileFormat {
-		case commonv1beta1.TextFormat:
-			// Get list of regural expressions from filters.
-			var metricRegList []*regexp.Regexp
-			metricRegList = filemc.GetFilterRegexpList(filters)
-
 		// Check if log line contains metric from stop rules.
 		isRuleLine := false
 		for _, rule := range stopRules {
@@ -221,43 +212,45 @@
 			if metricName != rule.Name {
 				continue
 			}
-			stopRules, optimalObjValue = updateStopRules(stopRules, optimalObjValue, metricValue, metricStartStep, rule, idx)
-		}
-	}
-		case commonv1beta1.JsonFormat:
-			var logJsonObj map[string]interface{}
-			if err = json.Unmarshal([]byte(logText), &logJsonObj); err != nil {
-				klog.Fatalf("Failed to unmarshal logs in %v format, log: %s, error: %v", commonv1beta1.JsonFormat, logText, err)
-			}
-			// Check if log line contains metric from stop rules.
-			isRuleLine := false
-			for _, rule := range stopRules {
-				if _, exist := logJsonObj[rule.Name]; exist {
-					isRuleLine = true
-					break
-				}
-			}
-			// If log line doesn't contain appropriate metric, continue track file.
-			if !isRuleLine {
-				continue
-			}
-
-			// stopRules contains array of EarlyStoppingRules that has not been reached yet.
-			// After rule is reached we delete appropriate element from the array.
-			for idx, rule := range stopRules {
-				value, exist := logJsonObj[rule.Name].(string)
-				if !exist {
-					continue
-				}
-				metricValue, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
-				if err != nil {
-					klog.Fatalf("Unable to parse value %v to float for metric %v", metricValue, rule.Name)
-				}
-				stopRules, optimalObjValue = updateStopRules(stopRules, optimalObjValue, metricValue, metricStartStep, rule, idx)
-			}
-		default:
-			klog.Fatalf("Format must be set to %v or %v", commonv1beta1.TextFormat, commonv1beta1.JsonFormat)
+			// Calculate optimalObjValue.
+			if metricName == objMetric {
+				if optimalObjValue == nil {
+					optimalObjValue = &metricValue
+				} else if objType == commonv1beta1.ObjectiveTypeMaximize && metricValue > *optimalObjValue {
+					optimalObjValue = &metricValue
+				} else if objType == commonv1beta1.ObjectiveTypeMinimize && metricValue < *optimalObjValue {
+					optimalObjValue = &metricValue
+				}
+				// Assign best optimal value to metric value.
+				metricValue = *optimalObjValue
+			}
+
+			// Reduce steps if appropriate metric is reported.
+			// Once rest steps are empty we apply early stopping rule.
+			if _, ok := metricStartStep[metricName]; ok {
+				metricStartStep[metricName]--
+				if metricStartStep[metricName] != 0 {
+					continue
+				}
+			}
+
+			ruleValue, err := strconv.ParseFloat(rule.Value, 64)
+			if err != nil {
+				klog.Fatalf("Unable to parse value %v to float for rule metric %v", rule.Value, rule.Name)
+			}
+
+			// Metric value can be equal, less or greater than stop rule.
+			// Deleting suitable stop rule from the array.
+			if rule.Comparison == commonv1beta1.ComparisonTypeEqual && metricValue == ruleValue {
+				stopRules = deleteStopRule(stopRules, idx)
+			} else if rule.Comparison == commonv1beta1.ComparisonTypeLess && metricValue < ruleValue {
+				stopRules = deleteStopRule(stopRules, idx)
+			} else if rule.Comparison == commonv1beta1.ComparisonTypeGreater && metricValue > ruleValue {
+				stopRules = deleteStopRule(stopRules, idx)
+			}
 		}
 
 		// If stopRules array is empty, Trial is early stopped.
@@ -273,12 +266,12 @@
 				klog.Fatalf("Create mark file %v error: %v", markFile, err)
 			}
 
-			err = os.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0)
+			err = ioutil.WriteFile(markFile, []byte(common.TrainingEarlyStopped), 0)
 			if err != nil {
 				klog.Fatalf("Write to file %v error: %v", markFile, err)
 			}
 
-			// Get child process from main PID.
+			// Get child proccess from main PID.
 			childProc, err := mainProc.Children()
 			if err != nil {
 				klog.Fatalf("Get children proceses for main PID: %v failed: %v", mainProcPid, err)
@@ -296,9 +289,9 @@
 			}
 
 			// Report metrics to DB.
-			reportMetrics(filters, fileFormat)
+			reportMetrics(filters)
 
-			// Wait until main process is completed.
+			// Wait until main proccess is completed.
 			timeout := 60 * time.Second
 			endTime := time.Now().Add(timeout)
 			isProcRunning := true
@@ -311,10 +304,11 @@
 			}
 
 			// Create connection and client for Early Stopping service.
-			conn, err := grpc.NewClient(*earlyStopServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
+			conn, err := grpc.Dial(*earlyStopServiceAddr, grpc.WithInsecure())
 			if err != nil {
 				klog.Fatalf("Could not connect to Early Stopping service, error: %v", err)
 			}
+			defer conn.Close()
 			c := api.NewEarlyStoppingClient(conn)
 
 			setTrialStatusReq := &api.SetTrialStatusRequest{
@@ -326,63 +320,11 @@
 			if err != nil {
 				klog.Fatalf("Set Trial status error: %v", err)
 			}
-			conn.Close()
 
 			klog.Infof("Trial status is successfully updated")
 		}
 	}
 }
 
-func updateStopRules(
-	stopRules []commonv1beta1.EarlyStoppingRule,
-	optimalObjValue *float64,
-	metricValue float64,
-	metricStartStep map[string]int,
-	rule commonv1beta1.EarlyStoppingRule,
-	ruleIdx int,
-) ([]commonv1beta1.EarlyStoppingRule, *float64) {
-
-	// First metric is objective in metricNames array.
-	objMetric := strings.Split(*metricNames, ";")[0]
-	objType := commonv1beta1.ObjectiveType(*objectiveType)
-
-	// Calculate optimalObjValue.
-	if rule.Name == objMetric {
-		if optimalObjValue == nil {
-			optimalObjValue = &metricValue
-		} else if objType == commonv1beta1.ObjectiveTypeMaximize && metricValue > *optimalObjValue {
-			optimalObjValue = &metricValue
-		} else if objType == commonv1beta1.ObjectiveTypeMinimize && metricValue < *optimalObjValue {
-			optimalObjValue = &metricValue
-		}
-		// Assign best optimal value to metric value.
-		metricValue = *optimalObjValue
-	}
-
-	// Reduce steps if appropriate metric is reported.
-	// Once rest steps are empty we apply early stopping rule.
-	if _, ok := metricStartStep[rule.Name]; ok {
-		metricStartStep[rule.Name]--
-		if metricStartStep[rule.Name] != 0 {
-			return stopRules, optimalObjValue
-		}
-	}
-
-	ruleValue, err := strconv.ParseFloat(rule.Value, 64)
-	if err != nil {
-		klog.Fatalf("Unable to parse value %v to float for rule metric %v", rule.Value, rule.Name)
-	}
-
-	// Metric value can be equal, less or greater than stop rule.
-	// Deleting suitable stop rule from the array.
-	if rule.Comparison == commonv1beta1.ComparisonTypeEqual && metricValue == ruleValue {
-		return deleteStopRule(stopRules, ruleIdx), optimalObjValue
-	} else if rule.Comparison == commonv1beta1.ComparisonTypeLess && metricValue < ruleValue {
-		return deleteStopRule(stopRules, ruleIdx), optimalObjValue
-	} else if rule.Comparison == commonv1beta1.ComparisonTypeGreater && metricValue > ruleValue {
-		return deleteStopRule(stopRules, ruleIdx), optimalObjValue
-	}
-	return stopRules, optimalObjValue
-}
-
 func deleteStopRule(stopRules []commonv1beta1.EarlyStoppingRule, idx int) []commonv1beta1.EarlyStoppingRule {
@@ -404,11 +346,9 @@ func main() {
 		filters = strings.Split(*metricFilters, ";")
 	}
 
-	fileFormat := commonv1beta1.FileFormat(*metricsFileFormat)
-
 	// If stop rule is set we need to parse metrics during run.
 	if len(stopRules) != 0 {
-		go watchMetricsFile(*metricsFilePath, stopRules, filters, fileFormat)
+		go watchMetricsFile(*metricsFilePath, stopRules, filters)
 	} else {
 		go printMetricsFile(*metricsFilePath)
 	}
@@ -427,13 +367,13 @@
 
 	// If training was not early stopped, report the metrics.
 	if !isEarlyStopped {
-		reportMetrics(filters, fileFormat)
+		reportMetrics(filters)
 	}
 }
 
-func reportMetrics(filters []string, fileFormat commonv1beta1.FileFormat) {
+func reportMetrics(filters []string) {
 
-	conn, err := grpc.NewClient(*dbManagerServiceAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
+	conn, err := grpc.Dial(*dbManagerServiceAddr, grpc.WithInsecure())
 	if err != nil {
 		klog.Fatalf("Could not connect to DB manager service, error: %v", err)
 	}
@@ -444,7 +384,7 @@
 	if len(*metricNames) != 0 {
 		metricList = strings.Split(*metricNames, ";")
 	}
-	olog, err := filemc.CollectObservationLog(*metricsFilePath, metricList, filters, fileFormat)
+	olog, err := filemc.CollectObservationLog(*metricsFilePath, metricList, filters)
 	if err != nil {
 		klog.Fatalf("Failed to collect logs: %v", err)
 	}

@@ -1,24 +1,7 @@
-FROM python:3.11-slim
-
-ARG TARGETARCH
-ENV TARGET_DIR /opt/katib
-ENV METRICS_COLLECTOR_DIR cmd/metricscollector/v1beta1/tfevent-metricscollector
-ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/metricscollector/v1beta1/tfevent-metricscollector/::${TARGET_DIR}/pkg/metricscollector/v1beta1/common/
-
-ADD ./pkg/ ${TARGET_DIR}/pkg/
-ADD ./${METRICS_COLLECTOR_DIR}/ ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}/
-
-WORKDIR ${TARGET_DIR}/${METRICS_COLLECTOR_DIR}
-
-RUN if [ "${TARGETARCH}" = "arm64" ]; then \
-    apt-get -y update && \
-    apt-get -y install gfortran libpcre3 libpcre3-dev && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*; \
-    fi
-
-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
-RUN chgrp -R 0 ${TARGET_DIR} \
-    && chmod -R g+rwX ${TARGET_DIR}
-
+FROM tensorflow/tensorflow:1.11.0
+RUN pip install rfc3339 grpcio googleapis-common-protos
+ADD . /usr/src/app/github.com/kubeflow/katib
+WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/
+RUN pip install --no-cache-dir -r requirements.txt
+ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/
 ENTRYPOINT ["python", "main.py"]

@@ -0,0 +1,28 @@
+FROM ubuntu:18.04
+
+RUN apt-get update \
+    && apt-get -y install software-properties-common \
+    autoconf \
+    automake \
+    build-essential \
+    cmake \
+    pkg-config \
+    wget \
+    python-pip \
+    libhdf5-dev \
+    libhdf5-serial-dev \
+    hdf5-tools\
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN wget https://github.com/lhelontra/tensorflow-on-arm/releases/download/v1.11.0/tensorflow-1.11.0-cp27-none-linux_aarch64.whl \
+    && pip install tensorflow-1.11.0-cp27-none-linux_aarch64.whl \
+    && rm tensorflow-1.11.0-cp27-none-linux_aarch64.whl \
+    && rm -rf .cache
+
+RUN pip install rfc3339 grpcio googleapis-common-protos jupyter
+ADD . /usr/src/app/github.com/kubeflow/katib
+WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/
+RUN pip install --no-cache-dir -r requirements.txt
+ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/
+ENTRYPOINT ["python", "main.py"]

@@ -1,6 +1,7 @@
-FROM ibmcom/tensorflow-ppc64le:2.2.0-py3
+FROM ibmcom/tensorflow-ppc64le:1.14.0-py3
+RUN pip install rfc3339 grpcio googleapis-common-protos
 ADD . /usr/src/app/github.com/kubeflow/katib
 WORKDIR /usr/src/app/github.com/kubeflow/katib/cmd/metricscollector/v1beta1/tfevent-metricscollector/
-RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
 ENV PYTHONPATH /usr/src/app/github.com/kubeflow/katib:/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1beta1/python:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/tfevent-metricscollector/:/usr/src/app/github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common/
 ENTRYPOINT ["python", "main.py"]

@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,15 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import argparse
-from logging import INFO, StreamHandler, getLogger
-
-import api_pb2
-import api_pb2_grpc
-import const
 import grpc
+import argparse
+import api_pb2
 from pns import WaitMainProcesses
+import const
 from tfevent_loader import MetricsCollector
+from logging import getLogger, StreamHandler, INFO
 
 timeout_in_seconds = 60
@@ -57,28 +55,25 @@ if __name__ == '__main__':
     wait_all_processes = opt.wait_all_processes.lower() == "true"
     db_manager_server = opt.db_manager_server_addr.split(':')
     if len(db_manager_server) != 2:
-        raise Exception(
-            f"Invalid Katib DB manager service address: {opt.db_manager_server_addr}"
-        )
+        raise Exception("Invalid Katib DB manager service address: %s" %
+                        opt.db_manager_server_addr)
 
     WaitMainProcesses(
         pool_interval=opt.poll_interval,
         timout=opt.timeout,
         wait_all=wait_all_processes,
-        completed_marked_dir=opt.metrics_file_dir,
-    )
+        completed_marked_dir=opt.metrics_file_dir)
 
-    mc = MetricsCollector(opt.metric_names.split(";"))
+    mc = MetricsCollector(opt.metric_names.split(';'))
     observation_log = mc.parse_file(opt.metrics_file_dir)
 
-    with grpc.insecure_channel(opt.db_manager_server_addr) as channel:
-        stub = api_pb2_grpc.DBManagerStub(channel)
-        logger.info(
-            f"In {opt.trial_name} {str(len(observation_log.metric_logs))} metrics will be reported."
-        )
-        stub.ReportObservationLog(
-            api_pb2.ReportObservationLogRequest(
-                trial_name=opt.trial_name, observation_log=observation_log
-            ),
-            timeout=timeout_in_seconds,
-        )
+    channel = grpc.beta.implementations.insecure_channel(
+        db_manager_server[0], int(db_manager_server[1]))
+
+    with api_pb2.beta_create_DBManager_stub(channel) as client:
+        logger.info("In " + opt.trial_name + " " +
+                    str(len(observation_log.metric_logs)) + " metrics will be reported.")
+        client.ReportObservationLog(api_pb2.ReportObservationLogRequest(
+            trial_name=opt.trial_name,
+            observation_log=observation_log
+        ), timeout=timeout_in_seconds)

@@ -1,6 +1 @@
-psutil==5.9.4
-rfc3339>=6.2
-grpcio>=1.64.1
-googleapis-common-protos==1.6.0
-tensorflow==2.16.1
-protobuf>=4.21.12,<5
+psutil==5.6.6

@@ -0,0 +1,63 @@
+# --- Clone the kubeflow/kubeflow code ---
+FROM ubuntu AS fetch-kubeflow-kubeflow
+
+RUN apt-get update && apt-get install git -y
+
+WORKDIR /kf
+RUN git clone https://github.com/kubeflow/kubeflow.git && \
+    cd kubeflow && \
+    git checkout 24bcb8e
+
+# --- Build the frontend kubeflow library ---
+FROM node:12 AS frontend-kubeflow-lib
+
+WORKDIR /src
+
+ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
+COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
+RUN npm ci
+
+COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
+RUN npm run build
+
+# --- Build the frontend ---
+FROM node:12 AS frontend
+
+WORKDIR /src
+COPY ./pkg/new-ui/v1beta1/frontend/package*.json ./
+RUN npm ci
+
+COPY ./pkg/new-ui/v1beta1/frontend/ .
+COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
+
+RUN npm run build:prod
+
+# --- Build the backend ---
+FROM golang:alpine AS go-build
+
+WORKDIR /go/src/github.com/kubeflow/katib
+
+# Download packages.
+COPY go.mod .
+COPY go.sum .
+RUN go mod download -x
+
+# Copy sources.
+COPY cmd/ cmd/
+COPY pkg/ pkg/
+
+# Build the binary.
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/new-ui/v1beta1; \
+    fi
+
+# --- Compose the web app ---
+FROM alpine:3.7
+WORKDIR /app
+COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
+COPY --from=frontend /src/dist/static /app/build/static/
+ENTRYPOINT ["./katib-ui"]

@@ -0,0 +1,74 @@
+/*
+Copyright 2021 The Kubeflow Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package main
+
+import (
+	"flag"
+	"fmt"
+	"log"
+	"net/http"
+
+	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
+
+	common_v1beta1 "github.com/kubeflow/katib/pkg/common/v1beta1"
+	ui "github.com/kubeflow/katib/pkg/new-ui/v1beta1"
+)
+
+var (
+	port, host, buildDir, dbManagerAddr *string
+)
+
+func init() {
+	port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
+	host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
+	buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
+	dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
+}
+
+func main() {
+	flag.Parse()
+	kuh := ui.NewKatibUIHandler(*dbManagerAddr)
+
+	log.Printf("Serving the frontend dir %s", *buildDir)
+	frontend := http.FileServer(http.Dir(*buildDir))
+	http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
+	http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
+
+	http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)
+
+	http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
+
+	http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)
+
+	http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
+	http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)
+
+	http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
+	http.HandleFunc("/katib/fetch_hp_job_trial_info/", kuh.FetchHPJobTrialInfo)
+	http.HandleFunc("/katib/fetch_nas_job_info/", kuh.FetchNASJobInfo)
+
+	http.HandleFunc("/katib/fetch_trial_templates/", kuh.FetchTrialTemplates)
+	http.HandleFunc("/katib/add_template/", kuh.AddTemplate)
+	http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
+	http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
+	http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
+
+	log.Printf("Serving at %s:%s", *host, *port)
+	if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
+		panic(err)
+	}
+}

@@ -0,0 +1,31 @@
+FROM python:3.6
+
+ENV TARGET_DIR /opt/katib
+ENV SUGGESTION_DIR cmd/suggestion/chocolate/v1beta1
+
+RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
+    apt-get -y update && \
+    apt-get -y install gfortran libopenblas-dev liblapack-dev && \
+    pip install cython 'numpy>=1.13.3'; \
+    fi
+RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
+    if [ "$(uname -m)" = "ppc64le" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
+    else \
+    wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
+    fi && \
+    chmod +x /bin/grpc_health_probe
+
+ADD ./pkg/ ${TARGET_DIR}/pkg/
+ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
+WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
+RUN pip install --no-cache-dir -r requirements.txt
+
+RUN chgrp -R 0 ${TARGET_DIR} \
+    && chmod -R g+rwX ${TARGET_DIR}
+
+ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
+
+ENTRYPOINT ["python", "main.py"]

@@ -1,4 +1,4 @@
-# Copyright 2022 The Kubeflow Authors.
+# Copyright 2021 The Kubeflow Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,14 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import time
-from concurrent import futures
-
 import grpc
+import time
-from pkg.apis.manager.health.python import health_pb2_grpc
 from pkg.apis.manager.v1beta1.python import api_pb2_grpc
-from pkg.suggestion.v1beta1.optuna.service import OptunaService
+from pkg.apis.manager.health.python import health_pb2_grpc
+from pkg.suggestion.v1beta1.chocolate.service import ChocolateService
+from concurrent import futures
 
 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 DEFAULT_PORT = "0.0.0.0:6789"
@@ -27,7 +25,7 @@ DEFAULT_PORT = "0.0.0.0:6789"
 
 def serve():
     server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
-    service = OptunaService()
+    service = ChocolateService()
    api_pb2_grpc.add_SuggestionServicer_to_server(service, server)
    health_pb2_grpc.add_HealthServicer_to_server(service, server)
    server.add_insecure_port(DEFAULT_PORT)

@@ -0,0 +1,11 @@
grpcio==1.23.0
cloudpickle==0.5.6
numpy>=1.13.3
scikit-learn>=0.19.0
scipy>=0.19.1
forestci==0.3
protobuf==3.9.1
googleapis-common-protos==1.6.0
SQLAlchemy==1.3.8
git+https://github.com/AIworx-Labs/chocolate@master
ghalton>=0.6
@ -1,8 +1,6 @@
|
||||||
# Build the Goptuna Suggestion.
|
# Build the Goptuna Suggestion.
|
||||||
FROM golang:alpine AS build-env
|
FROM golang:alpine AS build-env
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
|
|
||||||
WORKDIR /go/src/github.com/kubeflow/katib
|
WORKDIR /go/src/github.com/kubeflow/katib
|
||||||
|
|
||||||
# Download packages.
|
# Download packages.
|
||||||
|
@ -15,15 +13,32 @@ COPY cmd/ cmd/
|
||||||
COPY pkg/ pkg/
|
COPY pkg/ pkg/
|
||||||
|
|
||||||
# Build the binary.
|
# Build the binary.
|
||||||
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1
|
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
|
||||||
|
else \
|
||||||
|
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o goptuna-suggestion ./cmd/suggestion/goptuna/v1beta1; \
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Add GRPC health probe.
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
# Copy the Goptuna suggestion into a thin image.
|
# Copy the Goptuna suggestion into a thin image.
|
||||||
FROM alpine:3.15
|
FROM alpine:3.7
|
||||||
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}
|
WORKDIR ${TARGET_DIR}
|
||||||
|
COPY --from=build-env /bin/grpc_health_probe /bin/
|
||||||
COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/
|
COPY --from=build-env /go/src/github.com/kubeflow/katib/goptuna-suggestion ${TARGET_DIR}/
|
||||||
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
|
|
|
@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -24,7 +24,7 @@ import (
 	api_v1_beta1 "github.com/kubeflow/katib/pkg/apis/manager/v1beta1"
 	suggestion "github.com/kubeflow/katib/pkg/suggestion/v1beta1/goptuna"
 	"google.golang.org/grpc"
-	"k8s.io/klog/v2"
+	"k8s.io/klog"
 )

 const (
@ -1,24 +1,32 @@
|
||||||
FROM python:3.11-slim
|
FROM python:3.6
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/hyperband/v1beta1
|
ENV SUGGESTION_DIR cmd/suggestion/hyperband/v1beta1
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ] || [ "${TARGETARCH}" = "arm64" ]; then \
|
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
apt-get -y update && \
|
apt-get -y update && \
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||||
apt-get clean && \
|
pip install cython; \
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
ENTRYPOINT ["python", "main.py"]
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
# Copyright 2021 The Kubeflow Authors.
|
||||||
#
|
#
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
# you may not use this file except in compliance with the License.
|
# you may not use this file except in compliance with the License.
|
||||||
|
@ -12,14 +12,12 @@
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
import grpc
|
||||||
|
import time
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
||||||
|
from pkg.apis.manager.health.python import health_pb2_grpc
|
||||||
from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
|
from pkg.suggestion.v1beta1.hyperband.service import HyperbandService
|
||||||
|
from concurrent import futures
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
DEFAULT_PORT = "0.0.0.0:6789"
|
||||||
|
|
|
@@ -1,9 +1,8 @@
-grpcio>=1.64.1
+grpcio==1.23.0
 cloudpickle==0.5.6
-numpy>=1.25.2
+numpy>=1.13.3
-scikit-learn>=0.24.0
+scikit-learn>=0.19.0
-scipy>=1.5.4
+scipy>=0.19.1
 forestci==0.3
-protobuf>=4.21.12,<5
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-cython>=0.29.24
@ -1,24 +1,33 @@
|
||||||
FROM python:3.11-slim
|
FROM python:3.6
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/hyperopt/v1beta1
|
ENV SUGGESTION_DIR cmd/suggestion/hyperopt/v1beta1
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
apt-get -y update && \
|
apt-get -y update && \
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||||
apt-get clean && \
|
pip install cython; \
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
ENTRYPOINT ["python", "main.py"]
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
# Copyright 2021 The Kubeflow Authors.
|
||||||
#
|
#
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
# you may not use this file except in compliance with the License.
|
# you may not use this file except in compliance with the License.
|
||||||
|
@ -12,14 +12,12 @@
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
import grpc
|
||||||
|
import time
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
||||||
|
from pkg.apis.manager.health.python import health_pb2_grpc
|
||||||
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
|
from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService
|
||||||
|
from concurrent import futures
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
DEFAULT_PORT = "0.0.0.0:6789"
|
||||||
|
|
|
@@ -1,10 +1,9 @@
-grpcio>=1.64.1
+grpcio==1.23.0
 cloudpickle==0.5.6
-numpy>=1.25.2
+numpy>=1.13.3
-scikit-learn>=0.24.0
+scikit-learn>=0.19.0
-scipy>=1.5.4
+scipy>=0.19.1
 forestci==0.3
-protobuf>=4.21.12,<5
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-hyperopt==0.2.5
+hyperopt==0.2.3
-cython>=0.29.24
@ -1,24 +1,33 @@
|
||||||
FROM python:3.11-slim
|
FROM python:3.6
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/nas/darts/v1beta1
|
ENV SUGGESTION_DIR cmd/suggestion/nas/darts/v1beta1
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
apt-get -y update && \
|
apt-get -y update && \
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||||
apt-get clean && \
|
pip install cython; \
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
ENTRYPOINT ["python", "main.py"]
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
# Copyright 2021 The Kubeflow Authors.
|
||||||
#
|
#
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
# you may not use this file except in compliance with the License.
|
# you may not use this file except in compliance with the License.
|
||||||
|
@ -12,15 +12,14 @@
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
import grpc
|
||||||
|
from concurrent import futures
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
import time
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
||||||
|
from pkg.apis.manager.health.python import health_pb2_grpc
|
||||||
from pkg.suggestion.v1beta1.nas.darts.service import DartsService
|
from pkg.suggestion.v1beta1.nas.darts.service import DartsService
|
||||||
|
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
DEFAULT_PORT = "0.0.0.0:6789"
|
||||||
|
|
||||||
|
|
|
@@ -1,4 +1,3 @@
-grpcio>=1.64.1
+grpcio==1.23.0
-protobuf>=4.21.12,<5
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-cython>=0.29.24
@ -1,24 +1,29 @@
|
||||||
FROM python:3.11-slim
|
FROM python:3.6
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
|
ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
RUN if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
apt-get -y update && \
|
apt-get -y update && \
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||||
apt-get clean && \
|
pip install cython; \
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
fi
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
ENTRYPOINT ["python", "main.py"]
|
||||||
|
|
|
@ -0,0 +1,58 @@
|
||||||
|
FROM golang:alpine AS build-env
|
||||||
|
# The GOPATH in the image is /go.
|
||||||
|
ADD . /go/src/github.com/kubeflow/katib
|
||||||
|
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
apk --update add git gcc musl-dev && \
|
||||||
|
go get github.com/grpc-ecosystem/grpc-health-probe && \
|
||||||
|
mv $GOPATH/bin/grpc-health-probe /bin/grpc_health_probe && \
|
||||||
|
chmod +x /bin/grpc_health_probe; \
|
||||||
|
else \
|
||||||
|
GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64 && \
|
||||||
|
chmod +x /bin/grpc_health_probe; \
|
||||||
|
fi
|
||||||
|
|
||||||
|
FROM python:3.7-slim-buster
|
||||||
|
|
||||||
|
ENV TARGET_DIR /opt/katib
|
||||||
|
ENV SUGGESTION_DIR cmd/suggestion/nas/enas/v1beta1
|
||||||
|
|
||||||
|
RUN apt-get update \
|
||||||
|
&& apt-get -y install software-properties-common \
|
||||||
|
autoconf \
|
||||||
|
automake \
|
||||||
|
build-essential \
|
||||||
|
cmake \
|
||||||
|
libtool \
|
||||||
|
pkg-config \
|
||||||
|
wget \
|
||||||
|
gfortran \
|
||||||
|
libopenblas-dev \
|
||||||
|
liblapack-dev \
|
||||||
|
libhdf5-dev \
|
||||||
|
libhdf5-serial-dev \
|
||||||
|
hdf5-tools \
|
||||||
|
&& apt-get clean \
|
||||||
|
&& rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
RUN pip install cython numpy
|
||||||
|
|
||||||
|
RUN wget https://github.com/lhelontra/tensorflow-on-arm/releases/download/v1.14.0-buster/tensorflow-1.14.0-cp37-none-linux_aarch64.whl \
|
||||||
|
&& pip install tensorflow-1.14.0-cp37-none-linux_aarch64.whl \
|
||||||
|
&& rm tensorflow-1.14.0-cp37-none-linux_aarch64.whl \
|
||||||
|
&& rm -rf .cache
|
||||||
|
|
||||||
|
RUN pip install 'grpcio==1.23.0' 'protobuf==3.9.1' 'googleapis-common-protos==1.6.0'
|
||||||
|
|
||||||
|
COPY --from=build-env /bin/grpc_health_probe /bin/
|
||||||
|
|
||||||
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
|
||||||
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
|
ENTRYPOINT ["python", "main.py"]
|
|
@ -1,4 +1,4 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
# Copyright 2021 The Kubeflow Authors.
|
||||||
#
|
#
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
# you may not use this file except in compliance with the License.
|
# you may not use this file except in compliance with the License.
|
||||||
|
@ -12,15 +12,15 @@
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
import grpc
|
||||||
|
from concurrent import futures
|
||||||
|
import time
|
||||||
|
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
||||||
|
from pkg.apis.manager.health.python import health_pb2_grpc
|
||||||
from pkg.suggestion.v1beta1.nas.enas.service import EnasService
|
from pkg.suggestion.v1beta1.nas.enas.service import EnasService
|
||||||
|
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
DEFAULT_PORT = "0.0.0.0:6789"
|
||||||
|
|
||||||
|
|
|
@@ -1,5 +1,4 @@
-grpcio>=1.64.1
+grpcio==1.23.0
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-cython>=0.29.24
+tensorflow==1.15.4
-tensorflow==2.16.1
-protobuf>=4.21.12,<5
@ -1,24 +0,0 @@
|
||||||
FROM python:3.11-slim
|
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/optuna/v1beta1
|
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
|
||||||
apt-get -y update && \
|
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
|
||||||
apt-get clean && \
|
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
|
|
@@ -1,4 +0,0 @@
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.53.0
optuna==3.3.0
@ -1,24 +0,0 @@
|
||||||
FROM python:3.11-slim
|
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/pbt/v1beta1
|
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
|
||||||
apt-get -y update && \
|
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
|
||||||
apt-get clean && \
|
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
|
|
@ -1,45 +0,0 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
|
||||||
#
|
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
# you may not use this file except in compliance with the License.
|
|
||||||
# You may obtain a copy of the License at
|
|
||||||
#
|
|
||||||
# http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
#
|
|
||||||
# Unless required by applicable law or agreed to in writing, software
|
|
||||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
# See the License for the specific language governing permissions and
|
|
||||||
# limitations under the License.
|
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
|
||||||
|
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
|
||||||
from pkg.suggestion.v1beta1.pbt.service import PbtService
|
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
|
||||||
|
|
||||||
|
|
||||||
def serve():
|
|
||||||
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
|
|
||||||
service = PbtService()
|
|
||||||
api_pb2_grpc.add_SuggestionServicer_to_server(service, server)
|
|
||||||
health_pb2_grpc.add_HealthServicer_to_server(service, server)
|
|
||||||
|
|
||||||
server.add_insecure_port(DEFAULT_PORT)
|
|
||||||
print("Listening...")
|
|
||||||
server.start()
|
|
||||||
try:
|
|
||||||
while True:
|
|
||||||
time.sleep(_ONE_DAY_IN_SECONDS)
|
|
||||||
except KeyboardInterrupt:
|
|
||||||
server.stop(0)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
serve()
|
|
|
@@ -1,4 +0,0 @@
grpcio>=1.64.1
protobuf>=4.21.12,<5
googleapis-common-protos==1.53.0
numpy==1.25.2
@ -1,24 +1,32 @@
|
||||||
FROM python:3.10-slim
|
FROM python:3.6
|
||||||
|
|
||||||
ARG TARGETARCH
|
|
||||||
ENV TARGET_DIR /opt/katib
|
ENV TARGET_DIR /opt/katib
|
||||||
ENV SUGGESTION_DIR cmd/suggestion/skopt/v1beta1
|
ENV SUGGESTION_DIR cmd/suggestion/skopt/v1beta1
|
||||||
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
|
||||||
|
|
||||||
RUN if [ "${TARGETARCH}" = "ppc64le" ]; then \
|
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
apt-get -y update && \
|
apt-get -y update && \
|
||||||
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
apt-get -y install gfortran libopenblas-dev liblapack-dev && \
|
||||||
apt-get clean && \
|
pip install cython; \
|
||||||
rm -rf /var/lib/apt/lists/*; \
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
|
||||||
|
if [ "$(uname -m)" = "ppc64le" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-ppc64le; \
|
||||||
|
elif [ "$(uname -m)" = "aarch64" ]; then \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-arm64; \
|
||||||
|
else \
|
||||||
|
wget -qO/bin/grpc_health_probe https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64; \
|
||||||
|
fi && \
|
||||||
|
chmod +x /bin/grpc_health_probe
|
||||||
|
|
||||||
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
ADD ./pkg/ ${TARGET_DIR}/pkg/
|
||||||
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
ADD ./${SUGGESTION_DIR}/ ${TARGET_DIR}/${SUGGESTION_DIR}/
|
||||||
|
|
||||||
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
WORKDIR ${TARGET_DIR}/${SUGGESTION_DIR}
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
RUN pip install --prefer-binary --no-cache-dir -r requirements.txt
|
|
||||||
RUN chgrp -R 0 ${TARGET_DIR} \
|
RUN chgrp -R 0 ${TARGET_DIR} \
|
||||||
&& chmod -R g+rwX ${TARGET_DIR}
|
&& chmod -R g+rwX ${TARGET_DIR}
|
||||||
|
|
||||||
|
ENV PYTHONPATH ${TARGET_DIR}:${TARGET_DIR}/pkg/apis/manager/v1beta1/python:${TARGET_DIR}/pkg/apis/manager/health/python
|
||||||
|
|
||||||
ENTRYPOINT ["python", "main.py"]
|
ENTRYPOINT ["python", "main.py"]
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Copyright 2022 The Kubeflow Authors.
|
# Copyright 2021 The Kubeflow Authors.
|
||||||
#
|
#
|
||||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
# you may not use this file except in compliance with the License.
|
# you may not use this file except in compliance with the License.
|
||||||
|
@ -12,14 +12,12 @@
|
||||||
# See the License for the specific language governing permissions and
|
# See the License for the specific language governing permissions and
|
||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
|
|
||||||
import time
|
|
||||||
from concurrent import futures
|
|
||||||
|
|
||||||
import grpc
|
import grpc
|
||||||
|
import time
|
||||||
from pkg.apis.manager.health.python import health_pb2_grpc
|
|
||||||
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
from pkg.apis.manager.v1beta1.python import api_pb2_grpc
|
||||||
|
from pkg.apis.manager.health.python import health_pb2_grpc
|
||||||
from pkg.suggestion.v1beta1.skopt.service import SkoptService
|
from pkg.suggestion.v1beta1.skopt.service import SkoptService
|
||||||
|
from concurrent import futures
|
||||||
|
|
||||||
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
|
||||||
DEFAULT_PORT = "0.0.0.0:6789"
|
DEFAULT_PORT = "0.0.0.0:6789"
|
||||||
|
|
|
@@ -1,13 +1,9 @@
-grpcio>=1.64.1
+grpcio==1.23.0
 cloudpickle==0.5.6
-# This is a workaround to avoid the following error.
-# AttributeError: module 'numpy' has no attribute 'int'
-# See more: https://github.com/numpy/numpy/pull/22607
-numpy==1.23.5
-scikit-learn>=0.24.0, <=1.3.0
-scipy>=1.5.4
+numpy>=1.13.3
+scikit-learn==0.22.0
+scipy>=0.19.1
 forestci==0.3
-protobuf>=4.21.12,<5
+protobuf==3.9.1
 googleapis-common-protos==1.6.0
-scikit-optimize>=0.9.0
+scikit-optimize==0.5.2
-cython>=0.29.24
@@ -1,56 +1,15 @@
-# --- Clone the kubeflow/kubeflow code ---
-FROM alpine/git AS fetch-kubeflow-kubeflow
+# Build the Katib UI.
+FROM node:12.18.1 AS npm-build

-WORKDIR /kf
-COPY ./pkg/ui/v1beta1/frontend/COMMIT ./
-RUN git clone https://github.com/kubeflow/kubeflow.git && \
-    COMMIT=$(cat ./COMMIT) && \
-    cd kubeflow && \
-    git checkout $COMMIT
+# Build frontend.
+ADD /pkg/ui/v1beta1/frontend /frontend
+RUN cd /frontend && npm ci
+RUN cd /frontend && npm run build
+RUN rm -rf /frontend/node_modules

-# --- Build the frontend kubeflow library ---
-FROM node:16-alpine AS frontend-kubeflow-lib
-
-WORKDIR /src
-
-ARG LIB=/kf/kubeflow/components/crud-web-apps/common/frontend/kubeflow-common-lib
-COPY --from=fetch-kubeflow-kubeflow $LIB/package*.json ./
-RUN npm config set fetch-retry-mintimeout 200000 && \
-    npm config set fetch-retry-maxtimeout 1200000 && \
-    npm config get registry && \
-    npm config set registry https://registry.npmjs.org/ && \
-    npm config delete https-proxy && \
-    npm config set loglevel verbose && \
-    npm cache clean --force && \
-    npm ci --force --prefer-offline --no-audit
-
-COPY --from=fetch-kubeflow-kubeflow $LIB/ ./
-RUN npm run build
-
-# --- Build the frontend ---
-FROM node:16-alpine AS frontend
-
-WORKDIR /src
-COPY ./pkg/ui/v1beta1/frontend/package*.json ./
-RUN npm config set fetch-retry-mintimeout 200000 && \
-    npm config set fetch-retry-maxtimeout 1200000 && \
-    npm config get registry && \
-    npm config set registry https://registry.npmjs.org/ && \
-    npm config delete https-proxy && \
-    npm config set loglevel verbose && \
-    npm cache clean --force && \
-    npm ci --force --prefer-offline --no-audit
-
-COPY ./pkg/ui/v1beta1/frontend/ .
-COPY --from=frontend-kubeflow-lib /src/dist/kubeflow/ ./node_modules/kubeflow/
-
-RUN npm run build:prod
-
-# --- Build the backend ---
+# Build backend.
 FROM golang:alpine AS go-build

-ARG TARGETARCH
-
 WORKDIR /go/src/github.com/kubeflow/katib

 # Download packages.
@@ -63,11 +22,17 @@ COPY cmd/ cmd/
 COPY pkg/ pkg/

 # Build the binary.
-RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build -a -o katib-ui ./cmd/ui/v1beta1
+RUN if [ "$(uname -m)" = "ppc64le" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=ppc64le go build -a -o katib-ui ./cmd/ui/v1beta1; \
+    elif [ "$(uname -m)" = "aarch64" ]; then \
+    CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
+    else \
+    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o katib-ui ./cmd/ui/v1beta1; \
+    fi

-# --- Compose the web app ---
+# Copy the backend and frontend into a thin image.
-FROM alpine:3.15
+FROM alpine:3.7
 WORKDIR /app
 COPY --from=go-build /go/src/github.com/kubeflow/katib/katib-ui /app/
-COPY --from=frontend /src/dist/static /app/build/static/
+COPY --from=npm-build /frontend/build /app/build
 ENTRYPOINT ["./katib-ui"]
@@ -1,5 +1,5 @@
 /*
-Copyright 2022 The Kubeflow Authors.
+Copyright 2021 The Kubeflow Authors.

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -33,7 +33,7 @@ var (
 )

 func init() {
-	port = flag.String("port", "8080", "The port to listen to for incoming HTTP connections")
+	port = flag.String("port", "80", "The port to listen to for incoming HTTP connections")
 	host = flag.String("host", "0.0.0.0", "The host to listen to for incoming HTTP connections")
 	buildDir = flag.String("build-dir", "/app/build", "The dir of frontend")
 	dbManagerAddr = flag.String("db-manager-address", common_v1beta1.GetDBManagerAddr(), "The address of Katib DB manager")
@@ -45,17 +45,17 @@ func main() {

 	log.Printf("Serving the frontend dir %s", *buildDir)
 	frontend := http.FileServer(http.Dir(*buildDir))
-	http.HandleFunc("/katib/", kuh.ServeIndex(*buildDir))
-	http.Handle("/katib/static/", http.StripPrefix("/katib/", frontend))
+	http.Handle("/katib/", http.StripPrefix("/katib/", frontend))

-	http.HandleFunc("/katib/fetch_experiments/", kuh.FetchExperiments)
+	http.HandleFunc("/katib/fetch_experiments/", kuh.FetchAllExperiments)

-	http.HandleFunc("/katib/create_experiment/", kuh.CreateExperiment)
+	http.HandleFunc("/katib/submit_yaml/", kuh.SubmitYamlJob)
+	http.HandleFunc("/katib/submit_hp_job/", kuh.SubmitParamsJob)
+	http.HandleFunc("/katib/submit_nas_job/", kuh.SubmitParamsJob)

 	http.HandleFunc("/katib/delete_experiment/", kuh.DeleteExperiment)

 	http.HandleFunc("/katib/fetch_experiment/", kuh.FetchExperiment)
-	http.HandleFunc("/katib/fetch_trial/", kuh.FetchTrial)
 	http.HandleFunc("/katib/fetch_suggestion/", kuh.FetchSuggestion)

 	http.HandleFunc("/katib/fetch_hp_job_info/", kuh.FetchHPJobInfo)
@@ -67,7 +67,6 @@ func main() {
 	http.HandleFunc("/katib/edit_template/", kuh.EditTemplate)
 	http.HandleFunc("/katib/delete_template/", kuh.DeleteTemplate)
 	http.HandleFunc("/katib/fetch_namespaces", kuh.FetchNamespaces)
-	http.HandleFunc("/katib/fetch_trial_logs/", kuh.FetchTrialLogs)

 	log.Printf("Serving at %s:%s", *host, *port)
 	if err := http.ListenAndServe(fmt.Sprintf("%s:%s", *host, *port), nil); err != nil {
@@ -1,13 +0,0 @@
#!/bin/sh

# Run conformance test and generate test report.
python test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py --experiment-path examples/v1beta1/hp-tuning/random.yaml --namespace kf-conformance \
  --trial-pod-labels '{"sidecar.istio.io/inject": "false"}' | tee /tmp/katib-conformance.log

# Create the done file.
touch /tmp/katib-conformance.done
echo "Done..."

# Keep the container running so the test logs can be downloaded.
while true; do sleep 10000; done
@@ -1,5 +0,0 @@
# Katib Documentation

Welcome to Kubeflow Katib!

The Katib documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/katib/).
@@ -0,0 +1,150 @@
# Table of Contents

- [Table of Contents](#table-of-contents)
- [Developer Guide](#developer-guide)
  - [Requirements](#requirements)
  - [Build from source code](#build-from-source-code)
  - [Modify controller APIs](#modify-controller-apis)
  - [Controller Flags](#controller-flags)
  - [Workflow design](#workflow-design)
  - [Katib admission webhooks](#katib-admission-webhooks)
    - [Katib cert generator](#katib-cert-generator)
  - [Implement a new algorithm and use it in Katib](#implement-a-new-algorithm-and-use-it-in-katib)
  - [Algorithm settings documentation](#algorithm-settings-documentation)
  - [Katib UI documentation](#katib-ui-documentation)
  - [Design proposals](#design-proposals)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

# Developer Guide

This developer guide is for people who want to contribute to the Katib project.
If you're interested in using Katib in your machine learning project,
see the following user guides:

- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)
  in Katib, hyperparameter tuning, and neural architecture search.
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- Detailed guide to [configuring and running a Katib
  experiment](https://kubeflow.org/docs/components/katib/experiment/).

## Requirements

- [Go](https://golang.org/) (1.13 or later)
- [Docker](https://docs.docker.com/) (17.05 or later)
- [kustomize](https://kustomize.io/) (3.2 or later)
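As a quick sanity check, you can verify the toolchain versions locally before building. This is a minimal sketch, assuming the tools are already on your `PATH`:

```bash
# Print the installed versions and compare them against the requirements above.
go version         # expect go1.13 or newer
docker version     # expect 17.05 or newer
kustomize version  # expect 3.2 or newer
```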
## Build from source code

You can build the Katib images from the source code as follows:

```bash
make build REGISTRY=<image-registry> TAG=<image-tag>
```

To use your custom images for the Katib components, modify the
[Kustomization file](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml)
and the [Katib config patch](https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/katib-config-patch.yaml),
for example as sketched below.
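One possible way to point the standalone install at your custom images is `kustomize edit set image`. This is only a sketch: it assumes the kustomization references the `ghcr.io/kubeflow/katib/*` image names, and the `<image-registry>`/`<image-tag>` placeholders should match the values passed to `make build`:

```bash
# Rewrite the image references in the standalone kustomization to the custom build.
cd manifests/v1beta1/installs/katib-standalone
kustomize edit set image ghcr.io/kubeflow/katib/katib-controller=<image-registry>/katib-controller:<image-tag>
kustomize edit set image ghcr.io/kubeflow/katib/katib-ui=<image-registry>/katib-ui:<image-tag>
kustomize edit set image ghcr.io/kubeflow/katib/katib-db-manager=<image-registry>/katib-db-manager:<image-tag>
```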
You can deploy Katib v1beta1 manifests into a k8s cluster as follows:

```bash
make deploy
```

You can undeploy Katib v1beta1 manifests from a k8s cluster as follows:

```bash
make undeploy
```

## Modify controller APIs

If you want to modify Katib controller APIs, you have to
generate deepcopy, clientset, listers, informers, open-api and Python SDK with the changed APIs.
You can update the necessary files as follows:

```bash
make generate
```

## Controller Flags

Below is a list of command-line flags accepted by Katib controller:

| Name                            | Type                      | Default   | Description                                                                                                             |
| ------------------------------- | ------------------------- | --------- | ----------------------------------------------------------------------------------------------------------------------- |
| enable-grpc-probe-in-suggestion | bool                      | true      | Enable grpc probe in suggestions                                                                                         |
| experiment-suggestion-name      | string                    | "default" | The implementation of suggestion interface in experiment controller                                                     |
| metrics-addr                    | string                    | ":8080"   | The address the metric endpoint binds to                                                                                 |
| trial-resources                 | []schema.GroupVersionKind | null      | The list of resources that can be used as trial template, in the form: Kind.version.group (e.g. TFJob.v1.kubeflow.org)  |
| webhook-inject-securitycontext  | bool                      | false     | Inject the securityContext of container[0] in the sidecar                                                                |
| webhook-port                    | int                       | 8443      | The port number to be used for admission webhook server                                                                  |
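As an illustration of overriding one of these flags, you could append it to the controller container arguments, for example by patching the Deployment. This is a sketch only: it assumes the default `katib-controller` Deployment in the `kubeflow` namespace and that the first container already defines an `args` list, so verify the object names in your installation:

```bash
# Append --webhook-port to the first container's args in the katib-controller Deployment.
kubectl -n kubeflow patch deployment katib-controller --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--webhook-port=8443"}]'
```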
## Workflow design

Please see [workflow-design.md](./workflow-design.md).

## Katib admission webhooks

Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/).

1. `validator.experiment.katib.kubeflow.org` -
   [Validating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook)
   to validate the Katib Experiment before the creation.

1. `defaulter.experiment.katib.kubeflow.org` -
   [Mutating admission webhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook)
   to set the [default values](../pkg/apis/controller/experiments/v1beta1/experiment_defaults.go)
   in the Katib Experiment before the creation.

1. `mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics
   collector sidecar container to the training pod. Learn more about the Katib's
   metrics collector in the
   [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector).

You can find the YAMLs for the Katib webhooks
[here](../manifests/v1beta1/components/webhook/webhooks.yaml), and you can inspect
the deployed configurations as shown below.

**Note:** If you are using a private Kubernetes cluster, you have to allow traffic
via `TCP:8443` by specifying the firewall rule and you have to update the master
plane CIDR source range to use the Katib webhooks.
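To see what is actually registered in a running cluster, you can list the admission webhook configurations. A minimal check; the exact object names may differ between Katib releases:

```bash
# The Katib validating and mutating webhook configurations should appear here.
kubectl get validatingwebhookconfigurations | grep -i katib
kubectl get mutatingwebhookconfigurations | grep -i katib
```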
### Katib cert generator

Katib uses the custom `cert-generator` [Kubernetes Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/)
to generate certificates for the webhooks.

Once Katib is deployed in the Kubernetes cluster, the `cert-generator` Job follows these steps:

- Generate a certificate using [`openssl`](https://www.openssl.org/).

- Create a Kubernetes [Certificate Signing Request](https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/)
  to approve and sign the certificate.

- Create a Kubernetes Secret with the signed certificate. The Secret has
  the `katib-webhook-cert` name and the `cert-generator` Job's `ownerReference` to
  clean up resources once Katib is uninstalled.

  Once the Secret is created, the Katib controller Deployment spawns the Pod,
  since the controller has the `katib-webhook-cert` Secret volume.

- Patch the webhooks with the `CABundle`.

You can find the `cert-generator` source code [here](../hack/cert-generator.sh).
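If you want to confirm that the cert generation succeeded, you can look for the resulting objects. This is a sketch assuming Katib is installed in the default `kubeflow` namespace:

```bash
# The signed certificate Secret created by the cert-generator Job.
kubectl -n kubeflow get secret katib-webhook-cert

# The CSR that was approved and signed for the webhooks (name may vary).
kubectl get certificatesigningrequests | grep -i katib
```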
## Implement a new algorithm and use it in Katib

Please see [new-algorithm-service.md](./new-algorithm-service.md).

## Algorithm settings documentation

Please see [algorithm-settings.md](./algorithm-settings.md).

## Katib UI documentation

Please see [Katib UI README](https://github.com/kubeflow/katib/tree/master/pkg/ui/v1beta1).

## Design proposals

Please see [proposals](./proposals).
@ -1,351 +0,0 @@
|
||||||
# Katib Images Location
|
|
||||||
|
|
||||||
Here you can find the location for images that are used in Katib.
|
|
||||||
|
|
||||||
## Katib Components Images
|
|
||||||
|
|
||||||
The following table shows images for the
|
|
||||||
[Katib components](https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components).
|
|
||||||
|
|
||||||
<table>
|
|
||||||
<tbody>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<b>Image Name</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Description</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Location</b>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>ghcr.io/kubeflow/katib/katib-controller</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
Katib Controller
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/kubeflow/katib/tree/master/cmd/katib-controller/v1beta1/Dockerfile">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>ghcr.io/kubeflow/katib/katib-ui</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
Katib User Interface
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/kubeflow/katib/tree/master/cmd/ui/v1beta1/Dockerfile">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>ghcr.io/kubeflow/katib/katib-db-manager</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
Katib DB Manager
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/kubeflow/katib/tree/master/cmd/db-manager/v1beta1/Dockerfile">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>docker.io/mysql</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
Katib MySQL DB
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/docker-library/mysql/blob/c506174eab8ae160f56483e8d72410f8f1e1470f/8.0/Dockerfile.debian">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
## Katib Metrics Collectors Images
|
|
||||||
|
|
||||||
The following table shows images for the
|
|
||||||
[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
|
|
||||||
|
|
||||||
<table>
|
|
||||||
<tbody>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<b>Image Name</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Description</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Location</b>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>ghcr.io/kubeflow/katib/file-metrics-collector</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
File Metrics Collector
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1beta1/file-metricscollector/Dockerfile">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<code>ghcr.io/kubeflow/katib/tfevent-metrics-collector</code>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
Tensorflow Event Metrics Collector
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<a href="https://github.com/kubeflow/katib/blob/master/cmd/metricscollector/v1beta1/tfevent-metricscollector/Dockerfile">Dockerfile</a>
|
|
||||||
</td>
|
|
||||||
</tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
## Katib Suggestions and Early Stopping Images
|
|
||||||
|
|
||||||
The following table shows images for the
|
|
||||||
[Katib Suggestion services](https://www.kubeflow.org/docs/components/katib/reference/architecture/#suggestion)
|
|
||||||
and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/user-guides/early-stopping/#early-stopping-algorithms).
|
|
||||||
|
|
||||||
<table>
|
|
||||||
<tbody>
|
|
||||||
<tr align="center">
|
|
||||||
<td>
|
|
||||||
<b>Image Name</b>
|
|
||||||
</td>
|
|
||||||
<td>
|
|
||||||
<b>Description</b>
|
|
||||||
</td>
|
|
||||||
<td>
<b>Location</b>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperopt</code>
</td>
<td>
<a href="https://github.com/hyperopt/hyperopt">Hyperopt</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/hyperopt/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-skopt</code>
</td>
<td>
<a href="https://github.com/scikit-optimize/scikit-optimize">Skopt</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/skopt/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-optuna</code>
</td>
<td>
<a href="https://github.com/optuna/optuna">Optuna</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/optuna/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-goptuna</code>
</td>
<td>
<a href="https://github.com/c-bata/goptuna">Goptuna</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/goptuna/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-hyperband</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">Hyperband</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/hyperband/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-enas</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#enas">ENAS</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/nas/enas/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/suggestion-darts</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a> Suggestion
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/suggestion/nas/darts/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/earlystopping-medianstop</code>
</td>
<td>
<a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stopping Rule</a>
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/cmd/earlystopping/medianstop/v1beta1/Dockerfile">Dockerfile</a>
</td>
</tr>
</tbody>
</table>
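
These suggestion and early-stopping images are not referenced directly in an Experiment; the Katib controller deploys the matching service based on the algorithm name in the Experiment spec. The fragment below is a minimal sketch of that mapping, assuming the `tpe` algorithm (served by the Hyperopt suggestion image above); the `random_state` setting name is included only as an illustration and should be checked against the algorithm documentation.

```yaml
# Illustrative fragment of an Experiment spec -- not a complete, runnable manifest.
# The algorithmName decides which suggestion image from the table above
# the Katib controller deploys for this Experiment.
spec:
  algorithm:
    algorithmName: tpe          # TPE is served by ghcr.io/kubeflow/katib/suggestion-hyperopt
    algorithmSettings:
      - name: random_state      # assumed setting name, shown only for illustration
        value: "42"
```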

## Training Container Images

The following table shows the images for training containers that are used in
[Katib Trials](https://www.kubeflow.org/docs/components/katib/reference/architecture/#trial).

<table>
<tbody>
<tr align="center">
<td>
<b>Image Name</b>
</td>
<td>
<b>Description</b>
</td>
<td>
<b>Location</b>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-cpu</code>
</td>
<td>
PyTorch MNIST example that prints metrics to a file or to StdOut, with CPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.cpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/pytorch-mnist-gpu</code>
</td>
<td>
PyTorch MNIST example that prints metrics to a file or to StdOut, with GPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/pytorch-mnist/Dockerfile.gpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/tf-mnist-with-summaries</code>
</td>
<td>
TensorFlow MNIST example that saves metrics in TensorFlow summaries
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/tf-mnist-with-summaries/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/xgboost-lightgbm</code>
</td>
<td>
Distributed LightGBM example for XGBoostJob
</td>
<td>
<a href="https://github.com/kubeflow/xgboost-operator/blob/9c8c97d0125a8156f12b8ef5b93f99e709fb57ea/config/samples/lightgbm-dist/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/kubeflow/mpi-horovod-mnist</code>
</td>
<td>
Distributed Horovod example for MPIJob
</td>
<td>
<a href="https://github.com/kubeflow/mpi-operator/blob/947d396a9caf70d3c94bf587d5e5da32b70f0f53/examples/horovod/Dockerfile.cpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>docker.io/inaccel/jupyter:lab</code>
</td>
<td>
FPGA XGBoost example with parameter tuning
</td>
<td>
<a href="https://github.com/inaccel/jupyter/blob/master/lab/Dockerfile">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-gpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with GPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.gpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/enas-cnn-cifar10-cpu</code>
</td>
<td>
Keras CIFAR-10 CNN example for ENAS with CPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/enas-cnn-cifar10/Dockerfile.cpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-gpu</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with GPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.gpu">Dockerfile</a>
</td>
</tr>
<tr align="center">
<td>
<code>ghcr.io/kubeflow/katib/darts-cnn-cifar10-cpu</code>
</td>
<td>
PyTorch CIFAR-10 CNN example for DARTS with CPU support
</td>
<td>
<a href="https://github.com/kubeflow/katib/blob/master/examples/v1beta1/trial-images/darts-cnn-cifar10/Dockerfile.cpu">Dockerfile</a>
</td>
</tr>
</tbody>
</table>
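
To make the table concrete, here is a minimal, self-contained sketch of an Experiment that runs the `pytorch-mnist-cpu` image as a Kubernetes Job Trial. It is an illustration rather than the canonical example manifest: the Experiment name, objective goal, parameter range, script path, and arguments are assumptions; the maintained manifests live under `examples/v1beta1` in the repository.

```yaml
# Hedged sketch of an Experiment using the pytorch-mnist-cpu training image.
# Names, goals, ranges, and the container entrypoint are illustrative assumptions;
# consult examples/v1beta1 for the maintained manifests.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow
  name: pytorch-mnist-random-search   # hypothetical Experiment name
spec:
  objective:
    type: minimize
    goal: 0.001
    objectiveMetricName: loss         # metric the training container reports
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 2
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr                  # maps to the `lr` hyperparameter above
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest
                command:
                  - "python3"
                  - "/opt/pytorch-mnist/mnist.py"            # assumed entrypoint path inside the image
                  - "--epochs=1"
                  - "--lr=${trialParameters.learningRate}"
            restartPolicy: Never
```

The metrics collector that Katib attaches to each Trial then parses the metric values the container writes to StdOut or to a file, which is why the descriptions above call out where each example reports its metrics.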