12 Commits

Author SHA1 Message Date
Kimonas Sotirchos ca44b1c4ee Manifests for Tensorboard controller (kubeflow/kubeflow#5730)
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2021-03-21 14:28:17 -07:00
Rui Fang 7f9c309586 Tensorboard-Controller: use updateStatus instead of update (kubeflow/kubeflow#5644)
2021-03-10 07:46:24 -08:00
Konstantinos Andriopoulos a97b442e5b Add RWO_PVC_SCHEDULING env var to the Tensorboard Controller deployment (kubeflow/kubeflow#5266)
* Add RWO_PVC_SCHEDULING env var to Tensorboard controller deployment

The value of the 'RWO_PVC_SCHEDULING' env var is set to "false" by
default. The user will be able to change the value of the env var
manually by modifying the 'config/manager/manager.yaml' file.

* Update README.md
2020-08-31 08:12:21 -07:00
Konstantinos Andriopoulos 254c3f7bfc Add roles for Tensorboard controller pod (kubeflow/kubeflow#5262)
* Add Tensorboard controller permissions for managing resources

The pod running the Tensorboard controller didn't have permissions
to manage the deployments, services, and VirtualServices needed
so that the Tensorboard servers would function properly.

In order for the deployed Tensorboard controller to run properly,
permissions to 'get', 'list', 'watch', 'create' and 'update'
are given to the Tensorboard controller pod so that the necessary
deployments, services and VirtualServices are created and managed
as expected. Also, permissions to 'get', 'list', 'watch' PVCs and
pods were added.

* Add namespace of Tensorboard CR to VirtualService prefix

In order to avoid creating 2 virtual services that have the same
prefix in different namespaces, the namespace of the corresponding
Tensorboard CR was added in the prefix of the generated Virtual
Service.
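
As a rough illustration of why the namespace removes the collision, here is a minimal Python sketch. The function name and the exact "/tensorboard/<namespace>/<name>/" layout are assumptions made for this example; the commit message only states that the namespace was added to the prefix.

```python
def virtual_service_prefix(namespace: str, name: str) -> str:
    """Build a URL prefix for a Tensorboard VirtualService.

    Hypothetical layout: the commit only says the namespace of the
    Tensorboard CR is part of the prefix, not what the prefix looks like.
    """
    return f"/tensorboard/{namespace}/{name}/"


# Two Tensorboard CRs with the same name in different namespaces
# no longer map to the same prefix.
prefix_a = virtual_service_prefix("team-a", "my-tb")
prefix_b = virtual_service_prefix("team-b", "my-tb")
```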

* Fix directory bug in Makefile

* Add README.md
2020-08-30 06:56:20 -07:00
Konstantinos Andriopoulos e32222032c Tensorboard web-app: Add functionality to inform TWA frontend about the status of Tensorboard servers (kubeflow/kubeflow#5259)
* Extend Tensorboard CRD with status.readyReplicas field

The Tensorboard CRD didn't contain any information about the
Tensorboard server being ready or not. So, the status of the
Tensorboard resource is extended so that it contains a
readyReplicas field, similar to the status.readyReplicas of
the deployment of the Tensorboard server.

* Extend Tensorboard controller to update status of Tensorboard CR

The frontend of the Tensorboard web-app will need information
about whether the Tensorboard servers are ready to connect or not.
As a result, the Tensorboard controller now copies the value of the
status.readyReplicas field of the Tensorboard deployment to the
status.readyReplicas of the Tensorboard CR.

Also, a Deployment() function was added for applying and updating
Tensorboard server deployments.

* Update tensorboard.status.phase of TWA backend response

The frontend of the TWA will need information about the status
of the Tensorboard server, so that it can inform the user whether
the server is ready to connect or not.

As a result, the backend sets the status.phase field of the response
to "ready", if tensorboard.status.readyReplicas == 1. Otherwise, the
status.phase field of the response is set to "unavailable".
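
The rule above can be sketched in a few lines of Python. The function name and the dict-shaped Tensorboard object are assumptions for illustration; only the "ready" / "unavailable" values and the readyReplicas == 1 condition come from the commit message.

```python
def status_phase(tensorboard: dict) -> str:
    """Return the phase reported to the frontend for one Tensorboard object.

    `tensorboard` is assumed to be the CR as a plain dict; a missing
    status or readyReplicas field is treated as zero ready replicas.
    """
    status = tensorboard.get("status") or {}
    ready_replicas = status.get("readyReplicas", 0)
    return "ready" if ready_replicas == 1 else "unavailable"
```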

Also, the getPVCName() function was added, which extracts the name
of a given PVC object.

* Add GET route for PVCs

The Tensorboard web-app frontend will use an autocomplete
drop-down to show the user the PVCs that live in a specific namespace.
These PVCs can be used as log storage for the Tensorboard server.

So, a PVC GET route was added to the Tensorboard web-app backend.

* Add message to Tensorboard response object in TWA backend

The frontend of the TWA will need to output a response message for
every Tensorboard object. This response message will inform the
user about the current state of the Tensorboard server.

* Use status.STATUS_PHASE for backend response

* Add requirements.txt to TWA backend

* Use status.create_status() for backend response
2020-08-30 05:08:20 -07:00
Konstantinos Andriopoulos 1936429ea5 Tensorboard controller: Add scheduling functionality for Tensorboard servers that use RWO PVCs as log storages (kubeflow/kubeflow#5218)
* Add indexers as custom field selectors for list requests to cache

The tensorboard controller must be able to list pods that have
mounted a PVC with a specific ClaimName.

For this cache-backed list request to work properly, custom
field selectors are added. These selectors index the
"pod.spec.volumes.persistentvolumeclaim.claimname" field so that
unneeded pods can be filtered out.
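
The idea behind such an index can be sketched in plain Python. This is not the controller's code (the controller is written in Go and relies on its client cache); the helper names and the dict-shaped pods below are assumptions used only to show how indexing by claim name avoids scanning every pod.

```python
from collections import defaultdict


def claim_names(pod: dict) -> list:
    """Collect the claimName of every PVC-backed volume in a pod spec."""
    names = []
    for volume in pod.get("spec", {}).get("volumes", []):
        pvc = volume.get("persistentVolumeClaim")
        if pvc:
            names.append(pvc["claimName"])
    return names


def build_claim_index(pods: list) -> dict:
    """Map (namespace, claimName) to the names of the pods mounting that PVC."""
    index = defaultdict(list)
    for pod in pods:
        meta = pod["metadata"]
        for claim in claim_names(pod):
            index[(meta["namespace"], claim)].append(meta["name"])
    return dict(index)
```

A lookup such as `build_claim_index(pods).get(("ns", "logs-pvc"), [])` then returns only the pods that mount that claim.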

* Set pod's nodeAffinity if log files exist in a PVC

In the case of using a PVC as a logdir for Tensorboard Server, if
the PVC had a ReadWriteOnce access mode and was already mounted by
another running pod X, then the Tensorboard Server pod would not
always be scheduled on the same node as X. As a result, the
Tensorboard Server pod would be blocked since multi-node access
is prohibited on ReadWriteOnce volumes.

In order for the Tensorboard Server pod to run successfully,
nodeAffinity was added to the spec.template.spec.affinity field
of the returned deployment.

As a result, both X and the Tensorboard
Server pod are now scheduled on the same node.
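
A hedged sketch of what such an affinity stanza could look like, written as a Python dict for illustration. The field names are standard Kubernetes API fields, but the choice of a "preferred" rule, the weight, and the use of the kubernetes.io/hostname label are assumptions; the commit message does not spell out the exact rule the controller generates.

```python
def node_affinity_for(node_name: str) -> dict:
    """Build an affinity stanza that steers a pod towards one node.

    Assumption: a soft (preferred) rule keyed on the well-known
    kubernetes.io/hostname node label. The real controller may differ.
    """
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [node_name],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

The result would be placed under spec.template.spec.affinity of the deployment, with `node_name` taken from the node that runs pod X.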

Resolves kubernetes/kubernetes#26567

* Set Tensorboard Server scheduling feature to 'off' by default

In the case that the Tensorboard Server used a RWO PVC (as a log
storage) that was already mounted by another pod, nodeAffinity
was used so that the Tensorboard Server would be scheduled
(if possible) on the same node as that pod.

Now, this added functionality is used only if the
'RWO_PVC_SCHEDULING' environment variable is set to "true"
when running the Tensorboard controller.

This scheduling functionality is disabled by default.
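
The opt-in behaviour can be sketched as follows. The function name is hypothetical and the sketch is in Python rather than the controller's Go; only the variable name, the "true" value, and the disabled-by-default behaviour come from the commit message.

```python
import os


def rwo_pvc_scheduling_enabled(environ=None) -> bool:
    """Return True only when RWO_PVC_SCHEDULING is exactly "true".

    An unset variable falls back to "false", so the feature stays off
    unless it is enabled explicitly.
    """
    if environ is None:
        environ = os.environ
    return environ.get("RWO_PVC_SCHEDULING", "false") == "true"
```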
2020-08-26 02:58:03 -07:00
Kimonas Sotirchos db97455152 Add OWNERs file to tensorboard controller (kubeflow/kubeflow#5088)
The tensorboard controller should have a distinct list of reviewers and
approvers.

Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
2020-08-07 06:32:19 -07:00
Konstantinos Andriopoulos 9ae8d1ff40 tensorboard-controller: Mount GCP secret only when accessing Google storage (kubeflow/kubeflow#5069)
* Remove duplicate package import

Package "k8s.io/api/core/v1" was imported twice with names "v1"
and "corev1".

* Mount GCP secret only when accessing Google storage

The Tensorboard controller used to create pods (running the Tensorboard
server) that would always mount user-gcp-sa secret, regardless of the
logs storage being a Google cloud bucket or not. This would lead to pods
never starting properly in the case of using other cloud services (or
PVCs) as log storages, if the user-gcp-sa secret didn't exist on the
cluster.

In order for the Tensorboard server pods to run properly, user-gcp-sa
secret is now mounted only when Google cloud buckets are used as log
storages.
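
A minimal sketch of the conditional mount, in Python for illustration. The helper names, the "gcp-creds" volume name, and the detection via a "gs://" prefix are assumptions; the commit message only says the user-gcp-sa secret is mounted when a Google cloud bucket is used for logs.

```python
def is_google_cloud_path(logs_path: str) -> bool:
    """Assumed check: Google Cloud Storage locations use the gs:// scheme."""
    return logs_path.startswith("gs://")


def secret_volumes_for(logs_path: str) -> list:
    """Return the secret volumes a Tensorboard server pod should mount."""
    volumes = []
    if is_google_cloud_path(logs_path):
        # "gcp-creds" is a hypothetical volume name for this example.
        volumes.append({"name": "gcp-creds", "secret": {"secretName": "user-gcp-sa"}})
    return volumes
```

With this shape, a PVC-backed or non-Google log path yields no secret volume, so the pod does not depend on user-gcp-sa existing in the cluster.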

Fixes kubeflow/kubeflow#5065
2020-06-18 06:46:20 -07:00
Jeremy Lewi 0895c4d135 Fix docker builds of notebook and tensorboard controller (kubeflow/kubeflow#4664)
* Fix docker builds of notebook and tensorboard controller

* The notebook-controllers and tensorboard-controllers now depend on
  the go package components/common

* We need to rewrite the Dockerfiles so that the context is now

  ${KUBEFLOW_REPO}/common

  * so that components/common can be included in the context and copied
    to the Dockerfile

* Create skaffold configs to make it easier to do remote builds with Kaniko

  * The skaffold configs are currently written assuming the kubeflow-ci cluster
    is used to build the images. This could be generalized in the future.

* Remove the code to build the notebook-controller with GCB; we can just
  use skaffold and kaniko to do efficient remote builds.

* Related to #4582 - Jupyter image doesn't build.

* Fix docker build rule.
2020-01-21 17:54:34 -08:00
Jeremy Lewi d25a14aea2 Fix notebook controller and tensorboard controller docker image build. (kubeflow/kubeflow#4631)
* The jupyter docker image isn't building because it now depends on code
  in components/common

* To make this work we need to configure it as a multi module package
  and modify go.mod to redirect to a local path.

* Ref: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive

* Replaces PR #4583

Related to #4582 - Jupyter image doesn't build.
2020-01-07 16:25:41 -08:00
MrXinWang d4fb94b020 Add arm64 support for controllers (kubeflow/kubeflow#4438)
Change-Id: I9f4b4871a5d02a53230abb836787f665dd8e3998
Signed-off-by: Henry Wang <henry.wang@arm.com>
Jira: ENTOS-1322
2019-10-31 19:53:23 -07:00
Quanjie Lin 1236c5e6d7 initial checkin of tensorboard controller (kubeflow/kubeflow#4312)
* initial checkin of tensorboard controller

* initial checkin of tensorboard controller

* typo

* typo

* fix typo

* support local path

* add status

* conflict

* remove binary
2019-10-29 09:12:44 -07:00