39 Commits

Author SHA1 Message Date
一条肥鱼 07d88db222 fix: deprecation of 'go get' for installing modules (kubeflow/kubeflow#7177)
Co-authored-by: esacif <esacif@gmail.com>
2023-07-20 09:35:25 +00:00
dependabot[bot] fe21ffe649 build(deps): bump github.com/prometheus/client_golang from 1.11.0 to 1.11.1 in /components/tensorboard-controller (kubeflow/kubeflow#6956)
Bumps [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) from 1.11.0 to 1.11.1.
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](https://github.com/prometheus/client_golang/compare/v1.11.0...v1.11.1)

---
updated-dependencies:
- dependency-name: github.com/prometheus/client_golang
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-02-16 13:45:06 +00:00
Suraj Kota 144fa6805f Support Pod Defaults in Tensorboard controller (kubeflow/kubeflow#6874)
* support poddefaults in tensorboard controller

* initialize empty map
2023-01-18 15:22:21 +00:00
apoger 6b3fd05ea2 Update KF manifests and gh-action workflows to use the tag=`latest` (kubeflow/kubeflow#6854)
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

review changes

* build images with the latest tag only when a PR
  is merged to master branch

* revert changes  in manifests/workflows for the
  notebook-server images

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
2022-12-20 15:59:18 +00:00
apoger 54ab6a815e Fix workflows for publishing images only when PR is merged (kubeflow/kubeflow#6842)
* Fix docker-publish workflows

* Remove workflow that builds/push all images

* Remove redundant files from manifests
2022-12-15 09:51:21 +00:00
apoger be85f9f1bb tensorboard-controller: Extend tests for using images of each PR (kubeflow/kubeflow#6831)
* Introduce integration test workflow for tensorboard-controller

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* Publish Docker image only when PR is merged

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* Remove kind & manifest gh-action workflows

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* Update tag in manifests to v1.6.0

This change is required as images with v1.5.0 do not
exist in Dockerhub.

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
2022-12-12 14:12:28 +00:00
amitmukati-2604 16eb8e9f7c Adding support for linux/ppc64le in CI for tensorboard-controller multi-arch docker images. (kubeflow/kubeflow#6805) 2022-12-09 09:15:11 +00:00
apoger 10e0e93085 Cherry-pick commits for using DockerHub for all images (kubeflow/kubeflow#6825)
cherry-picking: #6548
* Update all images to use DockerHub
* Update releasing script for dockerhub

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Cherry-picked-by: Apostolos Gerakaris <apoger@arrikto.com>

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Co-authored-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2022-12-08 15:37:10 +00:00
apoger 46f14d4e97 Use K8s 1.25 for the tests (kubeflow/kubeflow#6751)
* kind: Introduce config file for 1.25

* Add a new KinD configuration file for testing with K8s 1.25.3
* Install kind v0.17.0 for testing with K8s 1.25.3

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* gh-actions: Use 1.25 for testing

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* testing: Install Istio 1.16 for testing

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* Test commit for enabling the tests

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>

* notebook-controller: Fix Makefile

Remove the test rule as a prerequisite for running docker-build

Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
2022-11-24 08:30:10 +00:00
Pranav Pandit d83d55a892 updated compatible base images & removed arch dependencies in different components for multiple arch support (kubeflow/kubeflow#6650)
* updated base images in Volume Web component for multiple arch

* updated base images in Tensorboard Web component for multiple arch

* updated base images in Jupyter Web component for multiple arch

* updated admission webhook component for multiple arch

* removed goarch dependency for multiarch building

* removed goarch dependency for multiarch building in admission webhook component

* removed goarch dependency & added powerPC case for multiarch building in access-management component

* removed goarch dependency for multiarch building in tensorboard controller

* removed goarch dependency for multiarch building in notebook controller

* Removing empty computation to resolve future build issues
2022-11-23 13:42:42 +00:00
Oleksandr Shepotinnik a17c966aff tensorboard-controller: Fix tensorboard endless restarts (kubeflow/kubeflow#6722) 2022-11-09 10:54:39 +00:00
apoger 939e1e22ca Introduce a mechanism to build all Kubeflow images (kubeflow/kubeflow#6555)
* build: Update components makefiles for building

We'll create a top-level Makefile under components/ dir
that has the following rules:

* build-all:
  To build all images locally

* push-all:
  * We can use a specific REGISTRY and retag the images
  * Push all the images

This top-level Makefile will run the sub-Makefiles that every
component has for building and pushing the images.

We modified every sub-Makefile as follows:
* We don't use a registry in images by default
* Removed unused rules and vars
* Use the --dirty flag of git describe in TAG

        --dirty[=<mark>]
               Describe the working tree. It means describe HEAD and
               appends <mark> (-dirty by default) if the working tree
               is dirty.

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Create makefiles for notebook servers

The common starting point of building the notebook-server
images are the following 4 *base* images:
- base
- codeserver
- jupyter
- rstudio

All other server images need to build on top of them. We'll
dynamically pass the base images in every Dockerfile by
using an ARG IMG. We can set the value of this ARG during
docker build with the --build-arg CLI argument.

This way we build the base images locally with a tag, and then
pass each image as an ARG via the Makefile to build the rest.

So we modified our building procedure as follows:
1. Build the base image since everything starts from there

2. Pass the base image as an ARG in the Dockerfiles of
jupyter, codeserver, rstudio images and build on top

3. Pass the base images in all other server images and build
on top

For that we will:

1. Create a Makefile for each of the notebook servers, in each folder
   a. Each makefile will be responsible for building the bases and use args for passing them on

2. Use the central Makefile to call each Makefile from above

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* cherry-pick: Notebook server upstream fixes

Relevant upstream PR: https://github.com/kubeflow/kubeflow/pull/6466/files

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Fix Makefiles

- Remove build-gcp and build-gcr rules as we don't use them anywhere in
the project
- Fix code conflicts

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Fix Dockerfiles for notebook-server images

We remove the previous logic of using already built images as bases.
The users must use only the Makefiles to pass the appropriate BASE_IMG
and build the images correctly.

Thus, we have Makefiles everywhere that:

- Can build any base image
- If an image requires another notebook base, then we first build that one using its makefile,
  and then use it as docker ARG for building the next one

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Fix notebook-controller Makefile

Removed a misplaced "|" char that breaks the Makefile

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Update GH action workflows

* Update workflow for notebook-server images:
  - Add a step for building all images by using the
  central-Makefile under components/example-notebook-servers/ dir.
  - Add a step for pushing all images by using the
  central-Makefile under components/example-notebook-servers/ dir.

* Update workflow for all Kubeflow images:
  - Add a step for building & pushing all images by using the
  top-level Makefile under components/ dir.

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Remove completely ECR references from images

Replace everywhere the "public.ecr.aws/j1r0q0g6/notebooks/notebook-servers"
prefix with "kubeflownotebookswg"

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>

* build: Fix GH actions for Kubeflow components

Fix GH actions to use the updated make rules
when building the Kubeflow component images.

Remove the "docker.io" prefix when building with
GH action workflows

Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
2022-07-01 17:54:06 +00:00
Alex Lembiyeuski 8654406f47 Upgrade API version of `Tensorboard` CRD to `v1` (kubeflow/kubeflow#6406)
* Migrate tensorboard-controller to Kubebuilder v3

* Fix paths inside Docker context

* Remove test dependency from docker-build

* Switch to kustomize 3.2.0, fix image tag

* Fix namePrefix

* Rename deployments, remove namespaces

* Add runAsUser

* Make tensorboard image and istio gateway configurable
2022-06-17 09:20:10 +00:00
dependabot[bot] c4a07e5dd3 build(deps): bump github.com/gogo/protobuf from 1.1.1 to 1.3.2 in /components/tensorboard-controller (kubeflow/kubeflow#6424)
Bumps [github.com/gogo/protobuf](https://github.com/gogo/protobuf) from 1.1.1 to 1.3.2.
- [Release notes](https://github.com/gogo/protobuf/releases)
- [Commits](https://github.com/gogo/protobuf/compare/v1.1.1...v1.3.2)

---
updated-dependencies:
- dependency-name: github.com/gogo/protobuf
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-14 07:14:53 +00:00
Kimonas Sotirchos a61650ee88 release: Images for the 1.5.0 tag (kubeflow/kubeflow#6398)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2022-03-09 22:37:11 +00:00
Kimonas Sotirchos 9ba5be1c1c releasing: Create v1.5.0-rc.2 images (kubeflow/kubeflow#6394)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2022-03-04 17:55:59 +00:00
Kimonas Sotirchos fff5155e1e releasing: Update tags for v1.5.0-rc.1 (kubeflow/kubeflow#6343)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2022-02-10 18:57:15 +00:00
Hao Xin fcc4786a49 Fix(manifests): Upgrade rbac.authorization.k8s.io from v1beta1 to v1 (kubeflow/kubeflow#6261) 2022-02-03 16:18:16 +00:00
Kimonas Sotirchos 64903665dc Update images for the 1.5 rc0 release (kubeflow/kubeflow#6319)
* Update the releasing version tag

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>

* Run automated script for updating versions

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2022-01-27 14:16:10 +00:00
juliusvonkohout 483cabb7e2 fix(backend): tensorboard-controller does not work because of missing permissions (kubeflow/kubeflow#6216) 2021-11-23 23:57:47 +00:00
juliusvonkohout f2df5f5b84 fix: tensorboard-controller is killed due to out of memory (kubeflow/kubeflow#6148)
* Update manager.yaml

* Update manager.yaml
2021-10-19 21:07:15 -07:00
Kimonas Sotirchos dacbda949d Bump Golang version in PodDefaults, TensorBoard Controller and KFAM to 1.17 (kubeflow/kubeflow#6180)
* kfam: Upgrade go to 1.17

Update to a more recent docker image that has a newer version of
openssl.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>

* poddefaults: Upgrade go to 1.17

Update to a more recent docker image that has a newer version of
openssl.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>

* tensorboards: Upgrade go to 1.17

Update to a more recent docker image that has a newer version of
openssl.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2021-10-08 06:02:23 -07:00
DavidSpek 4b59c008b9 tensorboard-controller: fix binding issue (kubeflow/kubeflow#5925) 2021-05-25 07:30:09 -07:00
Ilias Katsakioris 92ca8a2f84 tensorboard-controller: Fix scheduling unbound PVCs (kubeflow/kubeflow#5819)
When the TB controller attempts to schedule a RWO PVC it checks its
accessModes in the PVC status. The controller panics if the list is
empty.

This commit adds a check to ensure the list is not empty.

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
2021-04-08 17:27:03 -07:00
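The empty-list check described in commit 92ca8a2f84 above can be sketched as follows. This is a minimal illustration, not the controller's actual code: the function name `isReadWriteOnce` is hypothetical, and a plain string slice stands in for the access modes reported in the PVC status.

```go
package main

import "fmt"

// isReadWriteOnce reports whether the first access mode listed in a PVC's
// status is ReadWriteOnce. accessModes is a stand-in for the status field;
// an unbound PVC may report an empty list, so the length is checked before
// indexing into the slice.
func isReadWriteOnce(accessModes []string) bool {
	if len(accessModes) == 0 {
		return false // unbound PVC: nothing to inspect, and no panic
	}
	return accessModes[0] == "ReadWriteOnce"
}

func main() {
	fmt.Println(isReadWriteOnce(nil))
	fmt.Println(isReadWriteOnce([]string{"ReadWriteOnce"}))
}
```

Without the length check, indexing `accessModes[0]` on an empty slice would panic at runtime, which matches the failure the commit message describes.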
DavidSpek 94390858bc Specify commonLabels for tensorboard-controller (kubeflow/kubeflow#5780) 2021-03-26 03:35:46 -07:00
DavidSpek 4842c53f7a Update manifests to use ECR and fix fieldPath in kustomization files (kubeflow/kubeflow#5765)
* Update manifests to use ECR and latest image tags

* remove duplicate value in central-dashboard kustomization.yaml
2021-03-24 07:35:45 -07:00
Kimonas Sotirchos 0fe8bf5463 Tensorboards web app manifests: Don't use specific namespace in base (kubeflow/kubeflow#5753)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2021-03-23 08:46:44 -07:00
Kimonas Sotirchos ca44b1c4ee Manifests for Tensorboard controller (kubeflow/kubeflow#5730)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2021-03-21 14:28:17 -07:00
Rui Fang 7f9c309586 Tensorboard-Controller: use updateStatus instead of update (kubeflow/kubeflow#5644) 2021-03-10 07:46:24 -08:00
Konstantinos Andriopoulos a97b442e5b Add RWO_PVC_SCHEDULING env var to the Tensorboard Controller deployment (kubeflow/kubeflow#5266)
* Add RWO_PVC_SCHEDULING env var to Tensorboard controller deployment

The value of the 'RWO_PVC_SCHEDULING' env var is set to "false" by
default. The user will be able to change the value of the env var
manually by modifying the 'config/manager/manager.yaml' file.

* Update README.md
2020-08-31 08:12:21 -07:00
Konstantinos Andriopoulos 254c3f7bfc Add roles for Tensorboard controller pod (kubeflow/kubeflow#5262)
* Add Tensorboard controller permissions for managing resources

The pod running the Tensorboard controller didn't have permissions
to manage the deployments, services, and VirtualServices needed
so that the Tensorboard servers would function properly.

In order for the deployed Tensorboard controller to run properly,
permissions to 'get', 'list', 'watch', 'create' and 'update'
are given to the Tensorboard controller pod so that the necessary
deployments, services and VirtualServices are created and managed
as expected. Also, permissions to 'get', 'list', 'watch' PVCs and
pods were added.

* Add namespace of Tensorboard CR to VirtualService prefix

In order to avoid creating 2 virtual services that have the same
prefix in different namespaces, the namespace of the corresponding
Tensorboard CR was added in the prefix of the generated Virtual
Service.

* Fix directory bug in Makefile

* Add README.md
2020-08-30 06:56:20 -07:00
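To illustrate why commit 254c3f7bfc above adds the namespace to the VirtualService prefix, here is a small sketch. The function name `virtualServicePrefix` and the `/tensorboard/<namespace>/<name>/` layout are assumptions made for this example; the commit message only states that the namespace of the Tensorboard CR became part of the prefix.

```go
package main

import "fmt"

// virtualServicePrefix builds a URI prefix that includes the namespace of the
// Tensorboard CR, so two CRs with the same name in different namespaces do
// not end up with the same prefix.
// Assumption: the "/tensorboard/<namespace>/<name>/" layout is illustrative.
func virtualServicePrefix(namespace, name string) string {
	return fmt.Sprintf("/tensorboard/%s/%s/", namespace, name)
}

func main() {
	// Same CR name, different namespaces: the prefixes differ.
	fmt.Println(virtualServicePrefix("team-a", "run1"))
	fmt.Println(virtualServicePrefix("team-b", "run1"))
}
```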
Konstantinos Andriopoulos e32222032c Tensorboard web-app: Add functionality to inform TWA frontend about the status of Tensorboard servers (kubeflow/kubeflow#5259)
* Extend Tensorboard CRD with status.readyReplicas field

The Tensorboard CRD didn't contain any information about the
Tensorboard server being ready or not. So, the status of the
Tensorboard resource is extended so that it contains a
readyReplicas field, similar to the status.readyReplicas of
the deployment of the Tensorboard server.

* Extend Tensorboard controller to update status of Tensorboard CR

The frontend of the Tensorboard web-app will need information
about whether the Tensorboard servers are ready to connect or not.
As a result, the Tensorboard controller now copies the value of the
status.readyReplicas field of the Tensorboard deployment to the
status.readyReplicas of the Tensorboard CR.

Also, a Deployment() function was added for applying and updating
Tensorboard server deployments.

* Update tensorboard.status.phase of TWA backend response

The frontend of the TWA will need information about the status
of the Tensorboard server, so that it can inform the user about
the server being ready to connect or not.

As a result, the backend sets the status.phase field of the response
to "ready", if tensorboard.status.readyReplicas == 1. Otherwise, the
status.phase field of the response is set to "unavailable".

Also, the getPVCName() function was added, which extracts the name
of a given PVC object.

* Add GET route for PVCs

The Tensorboard web-app frontend will be using an autocomplete
drop-bar to show the user the PVCs that live in a specific namespace.
These PVCs could be used as log storages for the Tensorboard server.

So, a PVC GET route was added to the Tensorboard web-app backend.

* Add message to Tensorboard response object in TWA backend

The frontend of the TWA will need to output a response message for
every Tensorboard object. This response message will inform the
user about the current state of the Tensorboard server.

* Use status.STATUS_PHASE for backend response

* Add requirements.txt to TWA backend

* Use status.create_status() for backend response
2020-08-30 05:08:20 -07:00
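The phase rule from commit e32222032c above ("ready" if readyReplicas == 1, otherwise "unavailable") can be written as a one-branch mapping. The sketch below is in Go purely for illustration; the function name `phaseFor` is hypothetical, and the commit's mention of a requirements.txt suggests the TWA backend itself is not written in Go.

```go
package main

import "fmt"

// phaseFor maps the readyReplicas value of a Tensorboard CR to the phase
// string returned to the frontend: "ready" when exactly one replica is
// ready, "unavailable" in every other case.
func phaseFor(readyReplicas int) string {
	if readyReplicas == 1 {
		return "ready"
	}
	return "unavailable"
}

func main() {
	for _, n := range []int{0, 1} {
		fmt.Printf("readyReplicas=%d -> %s\n", n, phaseFor(n))
	}
}
```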
Konstantinos Andriopoulos 1936429ea5 Tensorboard controller: Add scheduling functionality for Tensorboard servers that use RWO PVCs as log storages (kubeflow/kubeflow#5218)
* Add indexers as custom field selectors for list requests to cache

The tensorboard controller must be able to list pods that have
mounted a PVC with a specific ClaimName.

In order for this list request to cache to work properly, custom
field selectors are added. These selectors are used to index the
"pod.spec.volumes.persistentvolumeclaim.claimname" field so that
unneeded pods can be filtered out.

* Set pod's nodeAffinity if log files exist in a PVC

In the case of using a PVC as a logdir for Tensorboard Server, if
the PVC had a ReadWriteOnce access mode and was already mounted by
another running pod X, then the Tensorboard Server pod would not
always be scheduled on the same node as X. As a result, the
Tensorboard Server pod would be blocked since multi-node access
is prohibited on ReadWriteOnce volumes.

In order for the Tensorboard Server pod to run successfully,
nodeAffinity was added to the spec.template.spec.affinity field
of the returned deployment.

As a result, both X and the Tensorboard
Server pod are now scheduled on the same node.

Resolves kubernetes/kubernetes#26567

* Set Tensorboard Server scheduling feature to 'off' by default

In the case that the Tensorboard Server used a RWO PVC (as a log
storage) that was already mounted by another pod, nodeAffinity
was used so that the Tensorboard Server would be scheduled
(if possible) on the same node as that pod.

Now, this added functionality is used only if the
'RWO_PVC_SCHEDULING' environment variable is set to "true"
when running the Tensorboard controller.

This scheduling functionality is disabled by default.
2020-08-26 02:58:03 -07:00
Kimonas Sotirchos db97455152 Add OWNERs file to tensorboard controller (kubeflow/kubeflow#5088)
The tensorboard controller should have a distinct list of reviewers and
approvers.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2020-08-07 06:32:19 -07:00
Konstantinos Andriopoulos 9ae8d1ff40 tensorboard-controller: Mount GCP secret only when accessing Google storage (kubeflow/kubeflow#5069)
* Remove duplicate package import

Package "k8s.io/api/core/v1" was imported twice with names "v1"
and "corev1".

* Mount GCP secret only when accessing Google storage

The Tensorboard controller used to create pods (running the Tensorboard
server) that would always mount user-gcp-sa secret, regardless of the
logs storage being a Google cloud bucket or not. This would lead to pods
never starting properly in the case of using other cloud services (or
PVCs) as log storages, if the user-gcp-sa secret didn't exist on the
cluster.

In order for the Tensorboard server pods to run properly, user-gcp-sa
secret is now mounted only when Google cloud buckets are used as log
storages.

Fixes kubeflow/kubeflow#5065
2020-06-18 06:46:20 -07:00
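The conditional mount described in commit 9ae8d1ff40 above reduces to a check on the logs path. The sketch below assumes that Google Cloud Storage paths are recognised by the `gs://` scheme; the function name `needsGCPSecret` and the example paths are hypothetical and not taken from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// needsGCPSecret reports whether the logs path points at Google Cloud
// Storage, which is the only case where the user-gcp-sa secret is mounted.
// Assumption: such paths are identified by the "gs://" scheme.
func needsGCPSecret(logsPath string) bool {
	return strings.HasPrefix(logsPath, "gs://")
}

func main() {
	// Illustrative paths only.
	for _, p := range []string{"gs://bucket/logs", "s3://bucket/logs", "/mnt/logs"} {
		fmt.Printf("%s -> mount user-gcp-sa: %v\n", p, needsGCPSecret(p))
	}
}
```

With this gate, a pod that reads logs from another cloud service or a PVC no longer depends on a secret that may not exist on the cluster.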
Jeremy Lewi 0895c4d135 Fix docker builds of notebook and tensorboard controller (kubeflow/kubeflow#4664)
* Fix docker builds of notebook and tensorboard controller

* The notebook-controllers and tensorboard-controllers now depend on
  the go package components/common

* We need to rewrite the Dockerfiles so that the context is now

  ${KUBEfLOW_REPO}/common

  * so that components/common can be included in the context and copied
    to the Dockerfile

* Create skaffold configs to make it easier to do remote builds with Kaniko

  * The skaffold configs are currently written assuming the kubeflow-ci cluster
    is used to build the images. This could be generalized in the future.

* Remove the code to build the notebook-controller with GCB; we can just
  use skaffold and kaniko to do efficient remote builds.

* Related to #4582 - Jupyter image doesn't build.

* Fix docker build rule.
2020-01-21 17:54:34 -08:00
Jeremy Lewi d25a14aea2 Fix notebook controller and tensorboard controller docker image build. (kubeflow/kubeflow#4631)
* The jupyter docker image isn't building because it now depends on code
  in components/common

* To make this work we need to configure it as a multi module package
  and modify go.mod to redirect to a local path.

* Ref: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive

* Replaces PR #4583

Related to #4582 - Jupyter image doesn't build.
2020-01-07 16:25:41 -08:00
MrXinWang d4fb94b020 Add arm64 support for controllers (kubeflow/kubeflow#4438)
Change-Id: I9f4b4871a5d02a53230abb836787f665dd8e3998
Signed-off-by: Henry Wang <henry.wang@arm.com>
Jira: ENTOS-1322
2019-10-31 19:53:23 -07:00
Quanjie Lin 1236c5e6d7 initial checkin of tensorboard controller (kubeflow/kubeflow#4312)
* initial checkin of tensorboard controller

* initial checkin of tensorboard controller

* typo

* typo

* fix typo

* support local path

* add status

* conflict

* remove binary
2019-10-29 09:12:44 -07:00